Converting docx to html using mammoth

At work we handle several business cases involving semantic data. For one client we have to parse Word documents and store their content as a graph in Virtuoso.

We compared three options: mammoth.js, Apache POI and docx4j. Mammoth.js generates much cleaner HTML than Apache POI and docx4j, because those two try to reproduce the original formatting, for example by turning line breaks into empty paragraphs. In addition, Apache POI has a converter only for doc, not for docx (for docx you need a separate extension). Furthermore, Apache POI and docx4j generated invalid HTML, such as hyperlinks nested inside hyperlinks, so in the end we discarded both tools.

Mammoth.js, which runs in the browser or on Node.js, was created by Michael Williamson, who also implemented the same functionality in Python and Java.

Mammoth.js is also available as a WordPress plugin, a great way to turn your documents into posts.

Mammoth.js is released under the BSD 2-Clause license and can be installed via npm.

Once mammoth.js has converted your document to HTML, you can use Cheerio, a jQuery implementation for Node.js, to manipulate the DOM and enrich it with RDFa markup.
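As a rough illustration of these two steps, here is a minimal sketch in Python rather than Node.js, using the Python port of Mammoth mentioned earlier. `mammoth.convert_to_html` is the port's conversion call; the helper names, the `dcterms:title` property and the regex-based tagging (a crude stand-in for Cheerio) are assumptions made only for this example.

```python
import re


def docx_to_html(path):
    """Convert a .docx file to an HTML fragment with the Python port of Mammoth."""
    import mammoth  # third-party (pip install mammoth), imported lazily

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    return result.value  # result.messages holds any conversion warnings


def add_rdfa_title(html, prop="dcterms:title"):
    """Tag the first <h1> with an RDFa property attribute (hypothetical mapping)."""
    return re.sub(r"<h1>", '<h1 property="%s">' % prop, html, count=1)


# Hand-written fragment standing in for Mammoth output.
sample = "<h1>Annual report</h1><p>Some text</p>"
print(add_rdfa_title(sample))
```

In a real pipeline a proper HTML parser would be a safer choice than a regular expression, since headings may carry attributes or nested anchors.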

From there you need an RDFa parser, such as EasyRDF (which is used as a core library in Drupal 8), to transform the HTML+RDFa into RDF triples and store them in a triple store.

Our implementation can be found here:




Download attachments and page content from Confluence

For a project I needed to move all the attachments in Confluence, together with the page content, to a remote repository.

Confluence offers several APIs, and the XML-RPC API can still be used. The API does not let you download all the attachments at once, only page by page, hence this Python script, which:

  1. Creates a folder for each page, which includes
    1. the HTML export of the page
    2. all the attachments of the page
  2. Once all the pages are saved, moves the folders into the same hierarchical structure as the space
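The second step can be sketched as follows. This is not the script from the repository: it only shows how a folder path per page could be derived from the `id`, `title` and `parentId` fields that the XML-RPC `getPages` call returns in its page summaries. The sample data and the `page_paths` helper are assumptions for illustration.

```python
import os


def page_paths(pages):
    """Map each page id to a folder path that mirrors the page tree."""
    by_id = {page["id"]: page for page in pages}
    paths = {}
    for page in pages:
        parts = []
        current = page
        # Walk up through the parents until one is not in the space listing.
        while current is not None:
            parts.append(current["title"])
            current = by_id.get(current["parentId"])
        paths[page["id"]] = os.path.join(*reversed(parts))
    return paths


# Hypothetical page summaries; the top-level page has a parent outside the list.
pages = [
    {"id": "1", "title": "Home", "parentId": "0"},
    {"id": "2", "title": "Guides", "parentId": "1"},
    {"id": "3", "title": "Install", "parentId": "2"},
]

for page_id, path in sorted(page_paths(pages).items()):
    print(page_id, path)
```

Page titles may contain characters that are not valid in folder names, so a real script would need to sanitize them first.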

You can find the code in my GitHub repository:

Main inspiration came from:

Static code analysis in Python with Jenkins

I am checking the code quality of a Python script that sits in the root folder of my code.

To do so I installed the following packages (I am using Python 2.7 on a Windows machine):

pip install pylint
easy_install -U clonedigger
pip install flake8

On Jenkins, make sure you have installed the Warnings plugin and the Violations plugin.

In the Jenkins configuration, under the Compiler Warnings section, add a parser for flake8 with:

Name: flake8
Link name: Flake8 warnings
Trend report name: Flake8 warnings trend
Regular expression: ^(.*):([0-9]*):([0-9]*):(.[CE][0-9]*)(.*)$
Mapping script:
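import hudson.plugins.warnings.parser.Warning
import hudson.plugins.analysis.util.model.Priority

// Assumed mapping, following the groups of the regular expression above:
// 1 = file, 2 = line, 3 = column (unused), 4 = code, 5 = message
String fileName = matcher.group(1)
String lineNumber = matcher.group(2)
String category = matcher.group(4)
String message = matcher.group(5)

return new Warning(fileName, Integer.parseInt(lineNumber), category, "PyFlakes Parser", message, Priority.NORMAL);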

Note the [CE] pattern in the regular expression: it picks up the Complexity (C) and Error (E) codes. If you are not interested in complexity, just replace [CE] with E.
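To check what the expression captures before pasting it into Jenkins, you can try it in Python, where it behaves the same way for this pattern. The sample lines below are made up for illustration; only the regular expression comes from the configuration above.

```python
import re

FLAKE8_LINE = re.compile(r"^(.*):([0-9]*):([0-9]*):(.[CE][0-9]*)(.*)$")

# Made-up flake8 output lines: an error, a complexity warning and a W code.
lines = [
    "script.py:10:1: E302 expected 2 blank lines, found 1",
    "script.py:25:1: C901 'main' is too complex (13)",
    "script.py:30:80: W291 trailing whitespace",
]

for line in lines:
    match = FLAKE8_LINE.match(line)
    if match:
        file_name, line_no, column, code, message = match.groups()
        print(file_name, line_no, code.strip(), message.strip())
    else:
        print("ignored:", line)
```

The W291 line is ignored because W is not in the [CE] character class. The fourth group also keeps the leading space before the code, which is why the sketch strips it.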

Now, in the Jenkins job configuration, add an “Execute Windows batch command” build step (you could do something similar on Linux) with the following content:

rmdir /s /q output
mkdir output
pylint --msg-template="{path}:{line}: [{msg_id}({symbol}), {obj}] {msg}" --reports=y >> output/pylint.log
clonedigger --cpd-output -o output/clonedigger.xml
flake8 --max-complexity 12 --output-file output/flake8-output.txt

The batch creates an output folder where pylint, clonedigger and flake8 write their files (if you want, you can put these commands in a .bat file that takes the output folder and the file to check as parameters).

Add a post-build action “Scan for compiler warnings”, select “scan workspace files” and set the following parameters:

File patterns: output/flake8-output.txt
Parser: flake8

Add a post-build action “Report violations” with the following parameters:

cpd: output/clonedigger.xml
pylint: output/pylint.log

Run the build and you should see the trend graph for Flake8 and the violations for cpd (clonedigger) and pylint!


If Flake8 exits with code 1 and fails your build, you may need to change the file:


At the end of the main() function you should have:

if exit_code > 0:
    raise SystemExit(exit_code > 0)

Which you can replace with:

raise SystemExit(0)

Now Flake8 should exit without problems.
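If you would rather not edit the installed Flake8 sources, a less invasive option is a small wrapper that runs the tool and hides its exit status from Jenkins. The sketch below is an alternative to the patch above, not part of the original setup; `run_and_report` is a hypothetical helper and the demo command is a stand-in for the real flake8 call.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command, print its real exit code and return it to the caller."""
    exit_code = subprocess.call(command)
    print("command exited with", exit_code)
    return exit_code


# Stand-in for the flake8 invocation: a child process that exits with 1.
demo_command = [sys.executable, "-c", "import sys; sys.exit(1)"]
real_code = run_and_report(demo_command)

# A wrapper script would end with sys.exit(0) at this point, so the build
# step passes while the real exit code stays visible in the console log.
```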






Drupal and Open data

Open data is a recurring topic and, thanks to the open source world, it is becoming more and more widespread.

When we want to manage open data we need to consider 2 types of software:

  1. linked data applications;
  2. data catalogs.

The first type relies on triple stores, databases specialized in storing RDF triples. Here we find solutions like:

There are also some REST APIs that aim to connect to triple stores:

The second type is oriented towards the management of open data:

Drupal also has a couple of interesting modules to connect to triple stores:

Another interesting module is RDFx, which lets you manage the RDF mapping of Drupal content types. It works in combination with the RESTWS module, so you can get the RDF representation of a node by simply adding “.rdf” to its URL (for example http://localhost/drupal/node/1.rdf).

If you find more open source applications that can connect to Drupal, let me know :-)


Connect Drupal 7 and Github via Oauth

A project I am working on needs to connect Drupal with GitHub.

GitHub offers an interesting API based on the OAuth protocol, as Google and Twitter do.

The first thing to do is to create an account on GitHub and, on the profile page, register an application with its URL, which can even point to localhost (like http://localhost/drupal). When done, you receive a client ID and a client secret, which Drupal will use to connect as a client.
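To make the role of the client ID concrete, here is a minimal Python sketch of the first step of GitHub's OAuth web flow, which is what the Drupal module performs for you. The authorization endpoint is GitHub's; the client ID, callback path and state value are placeholders, and `build_authorize_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

AUTHORIZE_URL = "https://github.com/login/oauth/authorize"


def build_authorize_url(client_id, redirect_uri, state, scope="user:email"):
    """Build the URL the browser is sent to so the user can approve the app."""
    query = urlencode({
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,  # random value checked on return to prevent CSRF
    })
    return AUTHORIZE_URL + "?" + query


url = build_authorize_url(
    client_id="YOUR_CLIENT_ID",  # placeholder
    redirect_uri="http://localhost/drupal/callback",  # hypothetical callback path
    state="random-string",
)
print(url)
```

The client secret is not part of this URL. It is only used in the second step, when the server exchanges the code returned by GitHub for an access token.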

I set up Drupal on my laptop via XAMPP, which means quickly creating a user “drupal” with a database “drupal” via phpMyAdmin, then extracting Drupal into my htdocs folder (using drupal as the folder name) and following the instructions at http://localhost/drupal.

On Drupal you need to download and set up the OAuth Connector module, which comes with some dependencies, in particular the OAuth module, which can be used for OAuth2.

Once everything is set up you can start configuring it; for this you can follow the steps in this guide for Facebook:

with the data used from here (I just forked the original project):

As you can see, a mapping has to be defined between the GitHub profile attributes and the Drupal profile attributes. Since the GitHub name attribute is optional, I preferred to use the login attribute (you can see some attributes in my profile), so just change it to:

'name' => array(
  'resource' => '',
  'method post' => 0,
  'field' => 'login',
  'querypath' => FALSE,
  'sync_with_field' => 'name',
),
Unfortunately the avatar will not be copied but you can contribute to the module :-)

When you connect to GitHub you might experience two types of problem:

If everything goes fine, you will end up with a GitHub button on the login page, which redirects the user to GitHub and then back to Drupal.

Be careful: when you come back to Drupal you have a session cookie, so the new user is authorized. However, Drupal blocks new users by default (an administrator must approve the account and it must be confirmed by email), so you cannot log in with your account; in addition, my Drupal installation doesn’t send emails. I suggest removing the cookie from your browser, logging in as administrator and unblocking the user (you will then be able to log in with the GitHub account), and then changing the Drupal options so that people can register automatically without waiting for an administrator to unblock them.

Allowing automatic registration could expose Drupal to security issues; I have not tried it on a publicly hosted Drupal site with email confirmation enabled.





Avoid XML Schema restrictions

I am back on XML Schema design and I really do like it!

One of the challenges I am facing now is whether to include the UBL metalanguage, an OASIS standard, in my schemas (see the videos on the OASIS website).

As you can see from this file:

UBL metalanguage has elements which are based on the UN/CEFACT elements, see:

As you can see, UN/CEFACT defines complex types that are extended by adding optional attributes. The UBL elements sometimes restrict those attributes (by making them required) and sometimes simply extend the types, leaving the freedom to choose later what to add.

Independently of the current problem, we might not need all the attributes, or all the elements, in our schema.

So either we add restrictions or we copy the elements that we need into our schema in a simplified form.

Restrictions can be added in two ways:

1) Restrict the allowed values, for example the maximum length of an element, through the so-called facets, see:

2) Restrict the number of elements or attributes; in the case of attributes we need to mark them as prohibited

Conceptually both are restrictions based on another type, and this creates a problem in object-oriented languages like Java (hence the title of this post), which support extension (in the end a restriction is a sort of extension of a base object with some changes).

JAXB is the official standard for binding XML elements to Java objects, and it relies on XJC for the conversion. It adds Java annotations that associate XML elements with Java objects. Furthermore, JAXB can validate through the marshaller once you call its setSchema() method (see for example the post by Blaise Doughan), so at any moment you can validate your object against the schema before sending out your XML message.

For facets, JAXB doesn’t create annotations. There is still an open issue, on which two subgroups are working, but neither solution has been officially approved yet:

Other data binding libraries handle simple restrictions such as enumerations (see JiBX) or numeric types and enumerations (see XMLBeans, now archived).

You also need to consider that facets can change in the future (the maximum length could be extended from 100 to 200 characters), so what is the impact of changing the XML schema? Since JAXB doesn’t generate annotations for facets, you don’t need to change the Java objects, but with other data binding libraries you might.

A restriction on the number of elements or attributes CANNOT be reflected in an object-oriented language, because a subclass cannot restrict the properties it inherits. Therefore I suggest copying simplified elements (keeping only the mandatory parts and removing the optional ones). In this way:

  1. you have fewer dependencies
  2. the developer gets fewer generated methods (just those needed)
  3. you keep a minimum of compatibility, should you want to convert the copied objects into the original ones

Therefore I recommend avoiding restrictions where possible. You can keep them for the sake of validation, but you need to think about their impact on the objects generated by the data binding libraries. The same recommendation appears in the HP XML schema best practices (search for “restrictions for complex types”) and in Microsoft’s XML schema design patterns.




Mailman and OpenDJ

For a project I started adding mailing list functionality to the development infrastructure. As a requirement, the mailing lists should make use of the LDAP service.

In a previous article I explained that I chose OpenDJ to manage LDAP users and groups. To manage the mailing lists I chose Mailman, used together with Postfix. As an alternative I could have chosen Sympa, but since Mailman is already packaged in many distributions and has fewer requirements than Sympa, I preferred Mailman. The drawback is that Mailman doesn’t come with LDAP support, so I found a script, which I adapted and uploaded to my GitHub account.

To manage mailing lists in OpenDJ you have three options:

  1. use the mailGroup object class
  2. use the groupOfUniqueNames object class together with the extensibleObject class
  3. customize your schema

OpenDJ ships the schema definition of mailGroup, which comes from Solaris. It lets you enter the mail address of the group, and the mail addresses of the users through the attribute mgrpRFC822MailMember.

But… most of the time you will already have your users recorded in LDAP (most probably as inetOrgPerson). So instead of inserting the users’ mail addresses directly, you would rather insert the DN of each user; this way even phpLDAPadmin lets you add each user easily, without pasting email addresses.

Therefore I went for groupOfUniqueNames which allows me to define users with their DN and, by adding the extensibleObject class, I have the possibility to add the mail attribute.

In this way, if I have already defined a group, I can reuse it as a mailing list without duplicating information (e.g. the mail address of each user, as mailGroup requires).

After that I installed and configured Mailman and, as mentioned above, I used a Perl script which:

  1. extracts the groups that are groupOfUniqueNames and have a mail attribute set (&(objectClass=groupOfUniqueNames)(mail=*))
  2. extracts all the users (uniqueMember) of the group that have a mail address set and are not disabled (&(mail=*)(!(ds-pwp-account-disabled=*)))
  3. creates a mailing list named after the CN attribute of the group (saved in the variable $listname)
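The selection logic of these three steps can be sketched in Python over in-memory entries. This is not the Perl script and it does not talk to LDAP: the dictionaries stand in for directory entries, the example DNs and addresses are invented, and the helper names are hypothetical. Only the two filters are taken from the list above.

```python
GROUP_FILTER = "(&(objectClass=groupOfUniqueNames)(mail=*))"
MEMBER_FILTER = "(&(mail=*)(!(ds-pwp-account-disabled=*)))"


def is_mail_group(entry):
    """Same condition as GROUP_FILTER, applied to a dict instead of LDAP."""
    return "groupOfUniqueNames" in entry.get("objectClass", []) and bool(entry.get("mail"))


def is_active_member(entry):
    """Same condition as MEMBER_FILTER: mail present, disabled flag absent."""
    return bool(entry.get("mail")) and "ds-pwp-account-disabled" not in entry


def mailing_lists(groups, users):
    """Return {list name: [member mails]}; the list name is the group's cn."""
    lists = {}
    for group in filter(is_mail_group, groups):
        members = [users[dn] for dn in group["uniqueMember"] if dn in users]
        lists[group["cn"]] = [m["mail"] for m in members if is_active_member(m)]
    return lists


# Invented directory content for illustration.
users = {
    "uid=emidio,ou=people,dc=example,dc=com": {"mail": "emidio@example.com"},
    "uid=old,ou=people,dc=example,dc=com": {
        "mail": "old@example.com",
        "ds-pwp-account-disabled": "true",
    },
}
groups = [
    {
        "cn": "jenkins-users-mail",
        "objectClass": ["groupOfUniqueNames", "extensibleObject"],
        "mail": "jenkins-users-mail@example.com",
        "uniqueMember": list(users),
    },
    {"cn": "no-mail-group", "objectClass": ["groupOfUniqueNames"], "uniqueMember": []},
]

print(mailing_lists(groups, users))
```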

If you execute the script with the command:


you will get an output like:

Processing list: jenkins-users-mail
Found member: emidio
List jenkins-users-mail does not exist.
Creating new list jenkins-users-mail.

Syncing jenkins-users-mail...
Added :

Enjoy creating mailing lists!