convert html tags in markdown to word equivalents #10840

Triquetra · 2025-05-12T17:57:35Z


I have some markdown (in Obsidian) that I'm trying to convert to Word (docx) format through pandoc. The markdown contains some html tags for underline and italics. At present these tags are removed and ignored. It would be preferable to have these tags represented as their equivalents in the target markup.

AI suggested the following Lua script, which does not do anything:

local function underline_filter (elem)
  if elem.tag == "u" then
    return pandoc.Underline(elem.content)
  end
end

return {
  { Inline = underline_filter },
}

Does anyone have a suggestion for how to handle these html tags within markdown?

Answered by rnwst


rnwst · 2025-05-12T22:52:00Z


It is always a good idea to include a sample Markdown snippet and the exact command you're running!

Pandoc should parse the tags as RawInline elements. Say test.md contains

<u>unarticulated</u> and <i>idiomatic</i> text

Running pandoc test.md -t native then shows the following AST:

[ Para
    [ RawInline (Format "html") "<u>"
    , Str "unarticulated"
    , RawInline (Format "html") "</u>"
    , Space
    , Str "and"
    , Space
    , RawInline (Format "html") "<i>"
    , Str "idiomatic"
    , RawInline (Format "html") "</i>"
    , Space
    , Str "text"
    ]
]
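The AST also shows why the filter from the question has no effect: no inline ever carries the tag "u", because the HTML tags arrive as RawInline elements whose text is the literal tag. A minimal sketch of a throwaway filter that makes this visible from within Lua follows; the helper name `describe` is hypothetical and only serves the illustration.

```lua
--- Hypothetical helper: one-line description of a raw inline element.
local function describe(el)
   return string.format('%s [%s] %s', el.tag, el.format, el.text)
end

-- Called by pandoc for every RawInline in the document.
function RawInline(el)
   -- report the element on stderr so the converted output is not polluted
   io.stderr:write(describe(el), '\n')
   -- returning nil leaves the element unchanged
end
```

Assuming it is saved under a name such as debug-raw.lua, it can be run with pandoc test.md --lua-filter debug-raw.lua -t native; each raw HTML tag is then reported on stderr.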

The following Lua filter should do what you want:

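---@param inlines Inlines
---@return Inlines
function Inlines(inlines)
   -- go back to front to avoid problems with changing indices
   for i = #inlines, 1, -1 do
      if inlines[i].tag == 'RawInline' and inlines[i].text == '<u>' then
         -- look for the matching closing tag after the opening one
         for j = i + 1, #inlines, 1 do
            if inlines[j].tag == 'RawInline' and inlines[j].text == '</u>' then
               -- replace the opening tag with an Underline wrapping the enclosed inlines
               inlines[i] = pandoc.Underline({table.unpack(inlines, i+1, j-1)})
               -- drop the now-wrapped inlines and the closing tag
               for _ = i+1, j, 1 do
                  inlines:remove(i+1)
               end
               -- stop searching for this pair, but keep scanning for earlier ones
               break
            end
         end
      end
   end
   return inlines
end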

Currently, it only handles the <u> elements. I'll leave it as an exercise for you to modify it so it also takes care of the <i> elements 😃
Btw, writing filters becomes a lot easier if you use pandoc-lua-types (shameless plug). I also recommend having a close look at the documentation.

1 reply

rnwst May 25, 2025

A version of this filter which parses both <u> and <i> elements is now available here.
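For readers who want to try the exercise themselves, here is a minimal sketch of one way to generalize the filter to both tags. It is not necessarily the linked version. The `WRAPPERS` table and the `is_raw_html` helper are hypothetical names, mapping <i> to pandoc.Emph is an assumption, and only exact lowercase tags without attributes are recognized.

```lua
-- Assumed mapping from HTML tag name to the pandoc element that wraps its content.
local WRAPPERS = {
   u = function(content) return pandoc.Underline(content) end,
   i = function(content) return pandoc.Emph(content) end,
}

--- Returns true if `el` is a raw HTML inline whose text is exactly `text`.
local function is_raw_html(el, text)
   return el.tag == 'RawInline' and el.format == 'html' and el.text == text
end

---@param inlines Inlines
---@return Inlines
function Inlines(inlines)
   -- go back to front so that indices stay valid and inner tags are wrapped first
   for i = #inlines, 1, -1 do
      local el = inlines[i]
      -- extract the tag name from an opening tag such as "<u>"
      local name = el.tag == 'RawInline' and el.format == 'html'
         and el.text:match('^<(%a+)>$')
      local wrap = name and WRAPPERS[name]
      if wrap then
         for j = i + 1, #inlines do
            if is_raw_html(inlines[j], '</' .. name .. '>') then
               -- collect the inlines between the opening and closing tag
               local content = {}
               for k = i + 1, j - 1 do
                  content[#content + 1] = inlines[k]
               end
               inlines[i] = wrap(content)
               -- drop the wrapped inlines and the closing tag
               for _ = i + 1, j do
                  table.remove(inlines, i + 1)
               end
               break
            end
         end
      end
   end
   return inlines
end
```

Saved under a name such as html-tags.lua (hypothetical), it would be applied with pandoc test.md --lua-filter html-tags.lua -o test.docx. An opening tag without a matching closing tag in the same inline list is left untouched.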
