Inside a docx file

Let's take a peek inside a docx file.

Created on: 2019-01-22

While converting a .docx file to .md using pandoc, I ran into an odd problem: each time I converted the file, the images in the resulting .md file were getting jumbled. While investigating, I came across a GitHub issue with a fascinating comment about .docx files, which says a .docx is just a fancy zip file with XML mapping.

Let's dive in. Let's take a .docx file and start probing.

First let's download a .docx file:

wget https://calibre-ebook.com/downloads/demos/demo.docx
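
As a quick sanity check of the "fancy zip file" claim, here is a minimal sketch that looks at the first two bytes of a file. Zip archives normally begin with the ASCII letters PK. The helper name is_zip is made up for this post, and a two-byte check is only a heuristic, not a full validation of the archive.

```shell
# is_zip is a hypothetical helper, not a standard command.
# It succeeds when the file starts with "PK", the usual zip signature.
is_zip() {
  [ "$(head -c 2 "$1")" = "PK" ]
}

# Only run the check if the downloaded file is actually present.
if [ -f demo.docx ]; then
  is_zip demo.docx && echo "demo.docx looks like a zip archive"
fi
```

If the check passes, the file can most likely be handled by ordinary zip tools.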

Then we change the file extension to .zip (strictly speaking this is optional, since unzip reads the archive regardless of its extension):

mv demo.docx demo.zip

Unzipping:

unzip demo.zip -d demo

Looking at the file tree:

ls -R demo

Gives us the following output:

demo:
'[Content_Types].xml'   customXml   docProps   _rels   word

demo/customXml:
item1.xml  item2.xml  itemProps1.xml  itemProps2.xml  _rels

demo/customXml/_rels:
item1.xml.rels  item2.xml.rels

demo/docProps:
app.xml  core.xml

demo/_rels:

demo/word:
document.xml  endnotes.xml  fonts  fontTable.xml  footnotes.xml  media  numbering.xml  _rels  settings.xml  styles.xml  theme  webSettings.xml

demo/word/fonts:
font1.odttf  font2.odttf  font3.odttf  font4.odttf  font5.odttf  font6.odttf

demo/word/media:
image1.gif  image2.png  image3.png  image4.png

demo/word/_rels:
document.xml.rels  fontTable.xml.rels  numbering.xml.rels

demo/word/theme:
theme1.xml
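
As a first taste of the "XML mapping" part, here is a rough sketch that pulls the image entries out of demo/word/_rels/document.xml.rels from the tree above. The helper name list_media_rels is made up for this post, and the pattern assumes image targets are written relative to the word directory as media/..., which is the usual layout but not something I have verified for every .docx.

```shell
# list_media_rels is a hypothetical helper name.
# It prints every <Relationship .../> element whose Target points into media/.
list_media_rels() {
  grep -o '<Relationship [^>]*>' "$1" | grep 'Target="media/'
}

# Only run against the extracted tree if it exists.
if [ -f demo/word/_rels/document.xml.rels ]; then
  list_media_rels demo/word/_rels/document.xml.rels \
    || echo "no media relationships found"
fi
```

Each line of output should carry an Id (something like rId followed by a number) next to the Target file name. Those Ids are what document.xml uses to refer to the images, so this mapping is a reasonable place to start when images end up in the wrong spot.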

We will look more closely at each of these files in a future post!