At work we have several business cases involving semantic data. For one client, we have to parse doc files and store their content as a graph in Virtuoso.
We compared three options: mammoth.js, Apache POI, and docx4j. Mammoth.js generates considerably cleaner HTML than the other two, because Apache POI and docx4j try to reproduce the original formatting, for example by turning line breaks into empty paragraphs. In addition, Apache POI only has a converter for doc, not for docx (for docx you need this extension: https://github.com/opensagres/xdocreport/wiki/XWPFConverterXHTML). Finally, Apache POI and docx4j generated invalid HTML, such as hyperlinks nested inside hyperlinks, so in the end we discarded these two tools.
Mammoth.js is also used in a WordPress plugin, which is a great way to turn your documents into posts.
Mammoth.js is licensed under the BSD 2-Clause license and can be installed via npm.
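To give an idea of what the conversion step looks like, here is a minimal sketch, not the code from our project. It assumes mammoth was installed with `npm install mammoth` and that the input is a .docx file. The function name `docxToHtml` and the optional `converter` parameter are made up for this example.

```javascript
// Minimal sketch: convert a .docx file to an HTML string with mammoth.js.
// `converter` defaults to the mammoth module; the parameter exists only so
// that another object with the same convertToHtml method can be passed in
// (an assumption of this example, not something mammoth requires).
async function docxToHtml(path, converter = require("mammoth")) {
  const result = await converter.convertToHtml({ path });

  // Mammoth reports anything it could not map cleanly in result.messages.
  for (const message of result.messages) {
    console.warn(`${message.type}: ${message.message}`);
  }

  return result.value; // the generated HTML
}

// Usage ("document.docx" is a placeholder path):
// docxToHtml("document.docx").then((html) => console.log(html));
```

It is worth reading the warnings, since they are usually the first hint that a style in the document has no HTML mapping.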
Once mammoth.js has converted your document into HTML, you can use Cheerio, a jQuery implementation for Node.js, to manipulate the DOM and enrich it with RDFa markup.
Our implementation can be found here: https://github.com/SEMICeu/e-legislation-pilot