Converting a Word Document to HTML and then to PDF with wkhtmltopdf

Recently I had to convert Word documents to PDFs for a system we integrate with. Getting PDFs to render the way you want is surprisingly tricky. The system I was integrating with uses wkhtmltopdf to generate its PDFs from HTML files. This article details the things I learnt as well as the workflow I ended up using to get this right.

Getting Important Information

If you are in a similar situation, where you need to provide HTML files for use on someone else's system, then the following points are important to note:

Get the exact version of wkhtmltopdf used
Get the exact parameters used when the HTML is passed to the CLI tool

This is essential as both these factors can and will drastically affect the way PDFs generate.
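A minimal sketch of how the version check could be automated. It assumes the version banner contains a token of the form x.y.z, and the function names and the 0.12.6 value in the usage comment are placeholders for illustration, not values from the target system.

```shell
#!/usr/bin/env bash

# Pull the first x.y.z-looking token out of a version banner.
extract_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Compare the version found in a banner with the version the target system uses.
check_version() {
  local expected="$1" banner="$2"
  local actual
  actual="$(extract_version "$banner")"
  if [ "$actual" = "$expected" ]; then
    echo "OK: wkhtmltopdf $actual"
  else
    echo "MISMATCH: expected $expected, found ${actual:-unknown}" >&2
    return 1
  fi
}

# Usage (hypothetical version number, binary assumed to be on your PATH):
#   check_version "0.12.6" "$(wkhtmltopdf --version)"
```

Running a check like this before each conversion session makes it harder to accidentally develop against a different build than the one the target system uses.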

General Conversion Flow

The general process I used to convert documents from Word to PDF was:

Ensure the Word document has been finalized; making large changes later can be a headache
Use an online Word-to-HTML converter (the HTML generated by MS Word itself is horrific and unwieldy)
- I copied the contents of my entire document and pasted it into the converter. This gave me back clean markup, without any handling of images or fonts.
Paste into your HTML file
Clean up a bit more
- I had to run a find and replace on th and td elements to remove fixed padding and widths
- I removed locale attributes; for example, some elements carried an attribute with a locale like en_US.
Add styling to your head and apply to any elements you need to target
- This you will need to work out based on what you need to render
- I found it easiest to define a blanket style for different levels of headings, tables and paragraphs.
- After targeting the common elements you can write more specific styles for the remaining elements that do not fall under these general categories.
Check changes as you go (the live reload setup is described in detail later in this article)
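The clean-up step can be partly scripted. The following is only a sketch: the attribute names (width, cellpadding, lang, xml:lang) are assumptions about what a converter might emit, so inspect your own markup first. It strips those attributes from every element, not just th and td, and it does not touch inline style attributes.

```shell
#!/usr/bin/env bash

# Read HTML on stdin and write it to stdout with fixed-size and locale
# attributes removed. The attribute names are assumptions; adjust them
# to whatever your converter actually produces.
clean_html() {
  sed -E \
    -e 's/ (width|cellpadding)="[^"]*"//g' \
    -e 's/ (lang|xml:lang)="[^"]*"//g'
}

# Usage (hypothetical file names):
#   clean_html <converted.html >cleaned.html
```

Regular expressions are a blunt tool for HTML, so it is worth diffing the output against the input before trusting it.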

Workflow

Even though you are working with HTML and you would think the best/easiest would be to dev in a browser, it is, in fact, better to dev against a PDF reader. The reason for this is that:

It is hard to judge alignment in the browser as you cannot see items on a page with its proper dimensions. For example, if you are targeting an A4 page, the widths you see in the browser will not match it.
It is hard to see when you are on a new page
wkhtmltopdf renders with an old, patched Qt WebKit engine rather than a current browser engine. As a result, what works in a modern browser may not work when rendering to PDF.
- One good example of this is flexbox where you generally have to use the CSS properties prefixed with -webkit (e.g. display: flex; must be display: -webkit-box;)
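Because missing prefixes are easy to overlook, a small check can flag them before you render. This is a heuristic sketch, not a CSS parser: it assumes the -webkit-box fallback is written on the line directly above display: flex, and the function name is made up for this example.

```shell
#!/usr/bin/env bash

# Report every "display: flex" that is not directly preceded by a
# "display: -webkit-box" line. Returns 1 if anything was reported.
find_unprefixed_flex() {
  awk '
    /display:[[:space:]]*flex/ && prev !~ /display:[[:space:]]*-webkit-box/ {
      printf "%s:%d: display: flex without a -webkit-box fallback\n", FILENAME, FNR
      found = 1
    }
    { prev = $0 }
    END { exit found ? 1 : 0 }
  ' "$1"
}

# Usage (hypothetical file name):
#   find_unprefixed_flex styles.css
```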

Useful Styles

The following styles are generally useful for basically any type of document you are generating. You will need to adjust your font sizes and colours as needed for your document.

/*START :: from :: https://github.com/delight-im/HTML-Sheets-of-Paper/*/
html,
body {
  /* Reset the document's margin values */
  margin: 0;
  /* Reset the document's padding values */
  padding: 0;
  /* Use the platform's native font as the default */
  /*font-family: "Roboto", -apple-system, "San Francisco", "Segoe UI", "Helvetica Neue", sans-serif;*/
  font-family: Calibri, sans-serif;
  /* Define a reasonable base font size */
  font-size: 12pt;
  /* Keep the page background white. The 3-digit hex form is used because older WebKit builds may not parse 4-digit hex colours */
  background-color: #fff;
}

table {
  /* Avoid page breaks inside */
  page-break-inside: avoid;
}

/*END :: from :: https://github.com/delight-im/HTML-Sheets-of-Paper/*/

.page {
  page-break-after: always;
  position: relative;
  /*The below only work if wkhtmltopdf does not add padding in the cli params*/
  /*margin: 2cm;*/
  /*padding-top: 2cm;*/
  /*padding-bottom: 2cm;*/
  /*size: A4 portrait;*/
}

.page.first {
  padding-top: 2cm;
  margin: 0;
}

.page.no-break {
  page-break-after: auto;
}

a:link {
  color: #8c8c8c;
  text-decoration: underline;
}

html {
  font-family: Calibri, sans-serif;
  font-size: 100%;
  color: #282828;
}

body {
  font-size: 1em;
  font-family: Calibri, sans-serif;
  color: #282828;
}

h1,
h2,
h3 {
  color: #1bdcdc;
  font-weight: bold;
  font-family: 'Arial Rounded MT Bold', Arial, Helvetica, sans-serif;
  margin-top: 7px;
  margin-bottom: 7px;
}

img {
  height: auto;
}

h1 {
  font-size: 15px;
}

h2 {
  font-size: 10px;
}

h3 {
  font-size: 11px;
}

/*The below is useful if you want the same ul bullet style irrespective of bullet level*/
ul li {
  list-style: disc;
  padding: 0;
  margin: 0;
  line-height: 1em;
}

p {
  text-align: justify;
}

/*I cannot work out how to get column to work on wkhtmltopdf*/
.row {
  display: -webkit-box;
  display: flex;
  flex-direction: row;
  flex-wrap: wrap;
  width: 100%;

  -webkit-box-pack: center;
  justify-content: center;
}

/*The below is very useful if you are trying to work out where widths and heights are coming from*/
/*Simply add this as a class to the element you wish to debug*/
.debug {
  border: black 1px solid;
}

.body {
  font-size: 11px;
}
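
To see how the .page and .page.no-break classes behave before applying them to a real document, a throwaway multi-page file can help. The sketch below generates one; the function name and the output file name in the usage comment are invented for this example, and the inline styles simply repeat the two page-break rules from the stylesheet above.

```shell
#!/usr/bin/env bash

# Print a minimal HTML document with the requested number of pages.
# Every page except the last forces a page break after itself.
make_test_page() {
  local pages="$1" i=1
  printf '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n'
  printf '<style>.page { page-break-after: always; } .page.no-break { page-break-after: auto; }</style>\n'
  printf '</head>\n<body>\n'
  while [ "$i" -le "$pages" ]; do
    if [ "$i" -eq "$pages" ]; then
      printf '<div class="page no-break"><h1>Page %d</h1></div>\n' "$i"
    else
      printf '<div class="page"><h1>Page %d</h1></div>\n' "$i"
    fi
    i=$((i + 1))
  done
  printf '</body>\n</html>\n'
}

# Usage (hypothetical output name, binary assumed to be on your PATH):
#   make_test_page 3 | wkhtmltopdf - page-test.pdf
```

If the resulting PDF does not have exactly the number of pages you asked for, the page-break rules are not being applied the way you expect.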

Live Reloading PDFs on Changes

This is probably the most important step as it tightens your feedback loop. I am not sure if all PDF applications can do this but Evince (the Ubuntu PDF viewer) will reload a PDF if it is deleted and re-added. In a nutshell, this is how the reloading flow works:

Run a script that listens for changes on one or more files
On change delete the associated PDF and then re-generate it in the same location

I used node for this as it is easier to pull in the different front end tooling we need.

First I installed the following dependencies:

yarn add -D onchange yarn

Onchange lets us track changes and run a script. Yarn is added so that we can use yarn from scripts instead of npm.

As the commands were sufficiently long and complex, and I needed them in multiple places, I moved the PDF generation to a bash file:

#!/usr/bin/env bash

set -e

# the below is useful as we can install a specific version of wkhtmltopdf in a directory called bin in the root of the project
# we can then use this in our scripts
wkhtmltopdfbin="./bin/wkhtmltopdf"

generate_pdf() {
  if [ $# -lt 2 ]; then
    echo 1>&2 "$0: not enough arguments."
    echo "Usage: generate_pdf ./path/to/pdf/html/ pdf-name.pdf"
    echo "Note: PDFs always placed under <projectRoot>/dist/pdf-name.pdf"
    exit 2
  fi

  path_to_html="$1"
  pdf_name="$2"

  echo "***********************"

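  # delete the old PDF first so that the viewer notices the file being re-added
  rm -f ./dist/"$pdf_name"
  mkdir -p dist
  # wkhtmltopdf reads the HTML from stdin ("-"); "|| true" keeps the watcher alive if it exits non-zero
  "$wkhtmltopdfbin" --disable-smart-shrinking -T 2cm -L 2cm -R 2cm -B 2cm -s A4 - ./dist/"$pdf_name" <"$path_to_html" || true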
  echo "***********************"
}

if declare -f "$1" >/dev/null; then
  # call arguments verbatim
  "$@"
else
  # Show a helpful error
  echo "'$1' is not a known function name" >&2
  exit 1
fi

In the above, I have installed the version of wkhtmltopdf that I need in a folder called bin which is in the root of the project. I also used the same flags that the target system uses when I call wkhtmltopdf.
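
The block at the bottom of the script, which lets a function be chosen by name from the command line, can be looked at in isolation. The sketch below wraps the same declare -f check in a function called dispatch; both dispatch and say_hello are illustrative names that are not part of the script.

```shell
#!/usr/bin/env bash

# A stand-in for a real task such as generate_pdf.
say_hello() {
  echo "hello $1"
}

# Run the function named by the first argument, passing the rest along.
# declare -f succeeds only for shell functions, so arbitrary commands
# such as "ls" or "rm" are rejected.
dispatch() {
  if declare -f "$1" >/dev/null; then
    "$@"
  else
    echo "'$1' is not a known function name" >&2
    return 1
  fi
}

# Usage:
#   dispatch say_hello world
```

The benefit of this pattern is that one bash file can hold several tasks, and each package.json script only needs to name the task it wants.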

For the sake of clarity, assume the bash script is saved as functions.sh under <projectRoot>/scripts/. I then added the following scripts to my package.json file:

{
  //...
  "scripts": {
    "gen-attachment": "bash ./scripts/functions.sh generate_pdf attachments/some-attachment/foo.html foo-test.pdf",
    "pdf-dev-attachment": "onchange attachments/some-attachment/foo.html -- yarn run gen-attachment"
  }
}

The above has 2 scripts:

gen-attachment: This generates a PDF from the file foo.html in the folder <projectRoot>/attachments/some-attachment/ and writes it to <projectRoot>/dist/foo-test.pdf
pdf-dev-attachment: This watches foo.html for changes and runs the previous script whenever the file changes.
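
If Node is not available, the same watch-and-regenerate loop can be approximated in plain bash by polling. This is a different technique from onchange, which reacts to file system events; the sketch below compares a checksum of the file once per second, and the function names are invented for this example.

```shell
#!/usr/bin/env bash

# Print a checksum of the file's content (cksum is part of POSIX).
fingerprint() {
  cksum <"$1"
}

# Poll a file once per second and run the given command whenever its
# content changes. Runs until interrupted with Ctrl+C.
watch_file() {
  local file="$1"
  shift
  local last current
  last="$(fingerprint "$file")"
  while sleep 1; do
    current="$(fingerprint "$file")"
    if [ "$current" != "$last" ]; then
      last="$current"
      # keep watching even if the command fails
      "$@" || true
    fi
  done
}

# Usage:
#   watch_file attachments/some-attachment/foo.html \
#     bash ./scripts/functions.sh generate_pdf attachments/some-attachment/foo.html foo-test.pdf
```

Polling is cruder than event-based watching, but for a single HTML file the one-second delay is rarely noticeable.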

Embedding Images

I am not sure if it is due to the system I integrate with or if it is always a requirement with wkhtmltopdf but I had to embed all images using their base64 encoding. This is of the form:

<img
  style="width: 100px; padding:0; margin:0; float: right;"
  src="data:image/png;base64,<base64EncodingOfTheImage>"
  alt="main logo"
/>

To ease this process a bit I added the following function to the same bash file discussed previously:

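base64_images() {
  # read NUL-delimited paths so that file names containing spaces survive
  find attachments/ -type f \( -name "*.png" -o -name "*.jpg" \) -print0 |
    while IFS= read -r -d '' file; do
      # pick the MIME type from the extension instead of always using image/png
      case "$file" in
        *.jpg) mime="image/jpeg" ;;
        *) mime="image/png" ;;
      esac
      printf 'data:%s;base64,%s' "$mime" "$(base64 "$file" | tr -d '\n')" >"$file".base64.txt
    done
}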

This is then called in the package.json as follows:

{
  //...
  "scripts": {
    "gen-base64-images": "bash ./scripts/functions.sh base64_images"
  }
}

This script:

searches the folder <projectRoot>/attachments/ for all images of type png and jpg
base64 encodes each image and prepends a data URI prefix such as data:image/png;base64, to the result
strips all newlines from the result
writes the result to a file with the same name as the image plus the suffix .base64.txt

There probably is some way to use bundlers like webpack to do this for you. But I have not had a chance to work out how to fit this into my flow.
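
Until then, a small helper can at least save the copy and paste. The sketch below turns one of the generated .base64.txt files into an img tag; the function name and the file name in the usage comment are hypothetical, and the alt text is inserted as-is without HTML escaping.

```shell
#!/usr/bin/env bash

# Print an <img> tag whose src is the data URI stored in the given file.
img_tag() {
  local data_file="$1" alt="$2"
  printf '<img src="%s" alt="%s" />\n' "$(cat "$data_file")" "$alt"
}

# Usage (hypothetical file name):
#   img_tag attachments/logo.png.base64.txt "main logo"
```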

Conclusion

As you can see, converting a Word document to a PDF is not as trivial as it may seem at the outset. But using the steps above with the provided workflow will tighten your feedback loop and help you finalize your PDFs as soon as possible.