
Tips on converting Word documents to HTML

If you follow these steps to clean up your source document before using an online HTML converter, you will end up with much cleaner HTML.

This procedure was written to help content editors who need to create an HTML (web page) version of a report that has been provided as a Word document (.doc).

Start by cleaning up the Word doc as much as possible.

Note: this procedure was written for the desktop version of Word and some steps may not apply to the browser version.

1. Remove the automated table of contents

Word documents created with an automatic table of contents will have anchor links on all headings. On the published HTML page, these headings will behave like links (underline on hover, clickable). So we need to remove the anchor links.

Removing anchor links manually is fiddly. It is usually easier to remove the automated table of contents in the Word document, and then start working on the HTML conversion.

2. Delete extra spaces and breaks

You can quickly and easily clean up a few things in Word using Find and Replace, including:

  • section breaks
  • line breaks
  • extra paragraph breaks
  • extra spaces.

Use find and replace to remove extra paragraph breaks

  1. Press Ctrl + H to open the Find and Replace popup.
  2. Click the cursor into the 'Find' field.
  3. Click the More button and then the Special button at the bottom.
  4. Click on 'paragraph mark'. Click again so there are two (which looks like ^p^p).
  5. Click the cursor into the 'Replace' field.
  6. Add a single paragraph mark here (looks like ^p).
  7. Important: make sure there are no other characters in the Find and Replace fields, not even a space. Apart from the paragraph marks, the fields need to be completely empty!
  8. Click Replace all. Repeat until there are no more instances to replace.

Section breaks

Replace section breaks with a paragraph mark (replace ^b with ^p).

Section Break is found in the Special dropdown in the Find and Replace window. Do not confuse it with Section Character (^%), which finds the § symbol rather than a break.

Line breaks

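Find manual line breaks (^l) and replace them with a paragraph mark (^p) or a single space, depending on whether the text that follows should start a new paragraph.

Manual Line Break is found in the Special dropdown in the Find and Replace window.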

Extra spaces

Replace 2 spaces with 1 space. Type a space twice in the Find field and once in the Replace field. Keep clicking Replace all until there are no double spaces left.
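For readers who prefer to see the rules written out, the space and paragraph-break clean-up above can be sketched in Python. This is only an illustration on plain text, not a replacement for Word's Find and Replace. The function name tidy_text is hypothetical, and the sketch assumes paragraphs are separated by newline characters.

```python
import re


def tidy_text(text: str) -> str:
    """Collapse repeated spaces and repeated paragraph breaks in plain text."""
    # A run of two or more spaces becomes a single space.
    text = re.sub(r" {2,}", " ", text)
    # A run of two or more newlines (empty paragraphs) becomes one newline.
    text = re.sub(r"\n{2,}", "\n", text)
    return text


sample = "Heading\n\n\nFirst  paragraph.\n\nSecond   paragraph."
print(repr(tidy_text(sample)))
```

The patterns match a whole run at once, so a single pass is enough here. Word's Find and Replace swaps exactly two characters for one, which is why you need to click Replace all several times.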

3. Remove extraneous bold formatting

If headings are formatted as bold text instead of with a heading style, use Find and Replace to replace bold formatting (press Ctrl + B in the Find field) with the Heading 2 style (via the Format dropdown).

Make sure the Find and Replace fields contain no text before you replace the formatting! (There might be a space in there that you can't see.)

Go from top to bottom of the document twice to make sure you didn't miss any.

We do not use italics for report titles online. If italics are used for this purpose, change to no italics but keep title case. (We do use italics for primary legislation, legal cases and book titles.)

There should be no underlining applied, except for clickable links. Online, an underline implies a clickable hyperlink.

4. Fix or apply heading formatting

On the web, page headings are always Heading 1, so your publication content should start with Heading 2 and cascade from there. (H2 can be followed by another H2 or an H3. You can't skip a level.)
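The heading rule above can be expressed as a small check. The Python below is a minimal sketch, assuming you have already listed the heading levels of your content in reading order as numbers (2 for Heading 2, and so on). The function name find_heading_problems is hypothetical.

```python
def find_heading_problems(levels: list[int]) -> list[int]:
    """Return the 0-based positions of headings that break the rule."""
    problems = []
    previous = 1  # the page title is always Heading 1
    for position, level in enumerate(levels):
        # Content headings must be Heading 2 or deeper, and may only go
        # one level deeper than the heading before them.
        if level < 2 or level > previous + 1:
            problems.append(position)
        previous = level
    return problems


print(find_heading_problems([2, 3, 3, 2, 3, 4]))  # no skipped levels
print(find_heading_problems([2, 4]))              # Heading 2 followed by Heading 4
```

Moving back up by any number of levels (for example from Heading 4 to Heading 2) is allowed. Only a jump to a deeper level that skips one is reported.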

Correct heading formatting is important for accessibility.

Use the inbuilt Word heading styles. These are in the Styles section of the toolbar when the Home tab is active.

You can click on the small arrow in the bottom right of the Styles section to display all the styles in a pane on the side of your screen. If it's not displaying Heading 3 and below, you can change the settings. At the bottom of this pane, click the Options button and open the 'Select styles to show' dropdown. Choose All styles and click OK.

Using find and replace to fix heading levels

If your document uses the wrong heading levels, you can use Find and Replace to fix them quickly.

If your document uses Headings 1 to 3, you need to change them to Headings 2 to 4.

Start with the lowest heading level in the document.

  1. Press Ctrl + H to open the Find and Replace popup.
  2. Click the cursor into the Find field.
  3. Click the More button to reveal more options.
  4. Click the Format dropdown to see options.
  5. Click on Style. A find style popup will appear. Type H to jump to the section on the list where headings are listed.
  6. Click on the lowest heading level used in your document (Heading 3 in this example).
  7. Click the cursor into the 'Replace' field.
  8. Repeat steps 4 and 5 and choose the next level down (Heading 4 in this example).
  9. Click Replace all. Do it a couple of times to make sure you caught them all. (Find and Replace works from your cursor position down to the end of the document, so you may need to start again from the top.)
  10. Repeat the process for each remaining heading level, working upwards (Heading 2 to Heading 3, then Heading 1 to Heading 2).
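To see why it matters that you start with the lowest heading level, here is a short Python sketch that imitates a series of 'Replace all' passes on a list of style names. The function name shift_headings and the sample list are made up for illustration.

```python
def shift_headings(styles: list[str], order: list[int]) -> list[str]:
    """Apply 'replace Heading n with Heading n+1' passes in the given order."""
    result = list(styles)
    for level in order:
        old, new = f"Heading {level}", f"Heading {level + 1}"
        # One pass imitates a single 'Replace all' on one style.
        result = [new if style == old else style for style in result]
    return result


doc = ["Heading 1", "Heading 2", "Heading 3", "Heading 2"]

# Lowest level first: every heading moves down exactly one level.
print(shift_headings(doc, [3, 2, 1]))

# Highest level first: headings that were already moved get moved again.
print(shift_headings(doc, [1, 2, 3]))
```

Working from Heading 3 up to Heading 1 gives Headings 2, 3, 4 and 3. Working from Heading 1 down pushes every heading to Heading 4, because each pass also catches the headings changed by the pass before it.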

5. Convert your cleaned-up Word content into HTML

Now your document is clean! You can use an online HTML conversion tool to convert the Word-formatted content into clean-ish HTML.

Press Ctrl + A to select all the content and Ctrl + C to copy it. Go to the tool and press Ctrl + V to paste it.

Once the content has been converted, select all and copy it so you can paste it into the CMS.

6. Clean up the HTML code

The above process is great for getting a Word doc with lots of formatting into HTML but the resulting code will still probably have:

  • empty span tags: <span> and </span>
  • language span tags: <span lang="__"> and </span>
  • strong tags on the headings: <strong> and </strong>

Remove all of these. If left in, they can mess up the styling of headings and fonts.

Note: we remove language tags from our HTML because language is usually set for the whole page.

Links to websites are OK.

Sometimes there are document links, so also do a search for '<a ' (include the space) to find these. These will look OK in the live page, but the link most likely won't work. If they are necessary, you'll have to manually download these documents and add them as media items in the CMS.
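If the page has many links, a short script can list the ones that probably point to documents. The Python below is a rough sketch using the standard library HTML parser. The list of file extensions is an assumption (the procedure does not define what counts as a document link), and the names LinkCollector and find_document_links are hypothetical.

```python
from html.parser import HTMLParser

# Assumed list of document file types; adjust it to suit your content.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_document_links(html: str) -> list[str]:
    """Return hrefs whose path ends in one of the assumed document extensions."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    found = []
    for href in collector.hrefs:
        # Ignore any query string or fragment when checking the extension.
        path = href.split("?")[0].split("#")[0].lower()
        if path.endswith(DOCUMENT_EXTENSIONS):
            found.append(href)
    return found


page = '<p><a href="https://www.example.com/about">About</a> <a href="files/report.PDF">Report</a></p>'
print(find_document_links(page))
```

The script only finds candidates. You still need to check each link by hand and add the documents as media items in the CMS where they are needed.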

Copy the source code into Notepad and use Find and Replace (Ctrl + H) to clean out these extraneous tags.
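If you are comfortable running a script, the same tag clean-up can be sketched in Python instead of using Find and Replace in Notepad. This is a minimal sketch, not a full HTML parser. It assumes attribute values use double quotes, it treats a span as unwanted only when it has no attributes or only a lang attribute, and the function name clean_html is hypothetical.

```python
import re

# An innermost <span> or <span lang="..."> pair with no other span tag inside it.
SPAN = re.compile(
    r'<span(?:\s+lang="[^"]*")?\s*>((?:(?!</?span\b).)*)</span\s*>',
    re.IGNORECASE | re.DOTALL,
)
# A complete heading element, h1 to h6, with its content captured.
HEADING = re.compile(r"(<h([1-6])\b[^>]*>)(.*?)(</h\2\s*>)", re.IGNORECASE | re.DOTALL)
STRONG_TAG = re.compile(r"</?strong\s*>", re.IGNORECASE)


def clean_html(html: str) -> str:
    """Unwrap bare and lang-only spans, and drop strong tags inside headings."""
    # Repeat until no span is unwrapped, so nested spans are handled too.
    while True:
        html, count = SPAN.subn(r"\1", html)
        if count == 0:
            break
    # Remove <strong> and </strong> only where they sit inside a heading.
    return HEADING.sub(
        lambda m: m.group(1) + STRONG_TAG.sub("", m.group(3)) + m.group(4),
        html,
    )


print(clean_html('<h2><strong><span lang="EN-AU">Title</span></strong></h2>'))
```

Spans that carry other attributes, such as a class, are left alone, and so are strong tags in body text. Always preview the result, because regular expressions can miss unusual markup.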

Check and fix tables HTML

If a table has a caption, it should come straight after the <table> tag, surrounded by code like this: <caption>your text</caption>.

Sometimes there's no <thead> section and every row is in <tbody>. You can change this in the CMS by right-clicking anywhere in the table to open the Table properties popup and setting the header there. Usually, your table header is the first row. This applies the table heading formatting so the table displays well, and it is also important for accessibility.

And sometimes there is a <thead> section but the cells are <td> instead of <th>. Change these cells to <th>.
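Changing header cells by hand is slow in a large table. The Python below is a minimal sketch of that one fix: it turns <td> cells into <th> cells, but only inside a <thead> section. It assumes the table already has a <thead>, and the function name fix_header_cells is hypothetical.

```python
import re

# A complete <thead> section with its content captured.
THEAD = re.compile(r"(<thead\b[^>]*>)(.*?)(</thead\s*>)", re.IGNORECASE | re.DOTALL)
# The start of an opening or closing td tag.
TD_TAG = re.compile(r"<(/?)td\b", re.IGNORECASE)


def fix_header_cells(html: str) -> str:
    """Change td cells to th cells inside every thead section."""

    def convert(match: re.Match) -> str:
        # Rewrite only the tag name, so attributes on the cell are kept.
        inner = TD_TAG.sub(r"<\1th", match.group(2))
        return match.group(1) + inner + match.group(3)

    return THEAD.sub(convert, html)


table = (
    "<table><thead><tr><td>Year</td></tr></thead>"
    "<tbody><tr><td>2022</td></tr></tbody></table>"
)
print(fix_header_cells(table))
```

Cells in <tbody> are not touched, so the data rows stay as <td>.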

7. Paste your HTML into the CMS and preview

Go to your CMS page. Add or open a basic text component.

Click on the Source button.

Paste in your cleaned-up HTML code.

Click the Source button again to toggle back to WYSIWYG view.

Save.

Click on the preview link and carefully compare the previewed page with your source document to check that:

  • headings are correctly applied
  • list formatting is correct
  • other special formatting such as callouts and tables look right
  • hyperlinks are correctly applied and working.

Reviewed 13 November 2022
