Converting docx to html using mammoth

At work we handle several business cases involving semantic data. For one client we have to parse Word documents and store them as a graph in Virtuoso.

We compared three options: mammoth.js, Apache POI and docx4j. Mammoth.js generates much cleaner HTML than Apache POI and docx4j, because those two try to reproduce the original formatting, for example by turning line breaks into empty paragraphs. In addition, Apache POI only ships a converter for doc, not for docx (for docx you need this extension: https://github.com/opensagres/xdocreport/wiki/XWPFConverterXHTML). Finally, Apache POI and docx4j generated invalid HTML, such as hyperlinks nested inside hyperlinks, so in the end we discarded both tools.
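The nested-hyperlink problem is easy to check for automatically. The snippet below is a minimal sketch, not the check we ran during our comparison: it uses Python's standard `html.parser` to count `<a>` tags that are opened while another `<a>` is still open. The names `NestedLinkFinder` and `count_nested_links` are made up for this illustration.

```python
from html.parser import HTMLParser


class NestedLinkFinder(HTMLParser):
    """Counts <a> elements that are opened while another <a> is still open."""

    def __init__(self):
        super().__init__()
        self.open_links = 0  # current <a> nesting depth
        self.nested = 0      # number of <a> tags found inside another <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.open_links > 0:
                self.nested += 1
            self.open_links += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links > 0:
            self.open_links -= 1


def count_nested_links(html: str) -> int:
    """Return how many hyperlinks in `html` sit inside another hyperlink."""
    finder = NestedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.nested
```

A result greater than zero for a converter's output means it produced hyperlinks in hyperlinks, which HTML does not allow.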

Mammoth.js, which can be used in the browser or with node.js, was created by Michael Williamson, who also implemented the same functionality in Python and Java.

Mammoth.js is also used in a WordPress plugin, which is a great way to turn your documents into posts.

Mammoth.js is licensed under the BSD 2-Clause license and can be installed via npm.
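To give an idea of what the conversion step looks like, here is a minimal sketch using the Python port of Mammoth rather than mammoth.js (it assumes the port has been installed with `pip install mammoth`). The wrapper function `docx_to_html` and its `convert` parameter are hypothetical; the parameter is only there so the function can be tried without the library installed.

```python
from typing import Any, Callable, List, Optional, Tuple


def docx_to_html(docx_file: Any, convert: Optional[Callable] = None) -> Tuple[str, List[Any]]:
    """Convert an open .docx file (binary mode) to an HTML fragment.

    Returns the generated HTML and the list of conversion messages.
    """
    if convert is None:
        # Third-party dependency: the Python port of Mammoth.
        import mammoth
        convert = mammoth.convert_to_html
    result = convert(docx_file)
    # result.value holds the HTML, result.messages any warnings raised
    # while converting (for example, unrecognised styles).
    return result.value, list(result.messages)


# Usage (assumes a file named document.docx exists):
#
#     with open("document.docx", "rb") as docx_file:
#         html, messages = docx_to_html(docx_file)
```

The output is an HTML fragment without `<html>` or `<body>` wrappers, which makes it convenient to post-process.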

Once your document has been converted to HTML by mammoth.js, you can use Cheerio, a jQuery implementation for node.js, to manipulate the DOM and enrich it with RDFa annotations.
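The enrichment step can be sketched as follows. This is a Python stand-in for the Cheerio approach, not our actual implementation: it uses the standard `xml.etree.ElementTree` module, so it assumes the converted fragment is well-formed XML. The subject IRI and the choice of `dcterms:title` for `<h1>` headings are hypothetical examples.

```python
import xml.etree.ElementTree as ET


def add_rdfa(fragment: str, subject_iri: str) -> str:
    """Wrap an HTML fragment in a <div> and annotate it with RDFa attributes.

    Assumes `fragment` is well-formed XML (every tag closed, no HTML-only
    entities such as &nbsp;).
    """
    root = ET.fromstring(f"<div>{fragment}</div>")
    # Declare the Dublin Core Terms prefix and the resource being described.
    root.set("prefix", "dcterms: http://purl.org/dc/terms/")
    root.set("about", subject_iri)
    # Hypothetical mapping: treat every <h1> as the title of the document.
    for heading in root.iter("h1"):
        heading.set("property", "dcterms:title")
    return ET.tostring(root, encoding="unicode", method="html")
```

For example, `add_rdfa("<h1>Title</h1><p>Body</p>", "http://example.org/doc/1")` returns the same markup wrapped in a `<div>` that carries the `prefix` and `about` attributes, with the heading marked as `property="dcterms:title"`.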

From there you will need an RDFa parser, such as EasyRdf (which is used as a core library in Drupal 8), to transform the HTML+RDFa into RDF triples and store them in a triple store such as Virtuoso.
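To show what this last step produces, here is a deliberately tiny extractor. It is not a conforming RDFa parser and not a replacement for a real one: it only understands the `prefix`, `about` and `property` attributes, treats every value as a plain literal, and assumes well-formed XML input. The function names are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parse_prefixes(value):
    """Turn 'dcterms: http://purl.org/dc/terms/' into {'dcterms': 'http://...'}."""
    tokens = value.split()
    return {tokens[i].rstrip(":"): tokens[i + 1] for i in range(0, len(tokens) - 1, 2)}


def extract_triples(xhtml):
    """Collect (subject, predicate, literal) tuples from a small RDFa subset."""
    triples = []

    def walk(element, subject, prefixes):
        # Prefixes and the current subject are inherited by child elements.
        prefixes = {**prefixes, **parse_prefixes(element.get("prefix", ""))}
        subject = element.get("about", subject)
        prop = element.get("property")
        if prop and subject and ":" in prop:
            prefix, local = prop.split(":", 1)
            if prefix in prefixes:
                literal = "".join(element.itertext())
                triples.append((subject, prefixes[prefix] + local, literal))
        for child in element:
            walk(child, subject, prefixes)

    walk(ET.fromstring(xhtml), None, {})
    return triples


def to_ntriples(triples):
    """Serialise the tuples as N-Triples lines with plain string literals."""
    lines = []
    for subject, predicate, literal in triples:
        escaped = (literal.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
        lines.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return "\n".join(lines)
```

The resulting N-Triples text is a format that triple stores can generally load; how it gets loaded depends on the store and is left out here.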

Our implementation can be found here: https://github.com/SEMICeu/e-legislation-pilot
