Converting docx to html using mammoth

At work we have different business cases involving semantic data. For a client we have to parse doc files and store them as a graph in Virtuoso.

We did an analysis of the 3 options (mammoth.js, Apache POI, docx4j). The result is that mammoth.js generates quite clean HTML compared to Apache POI and docx4j, because the latter two try to reproduce the same formatting by, for example, transforming line breaks into empty paragraphs. Further, Apache POI doesn’t have a converter for docx but only for doc (for docx you need to use this extension: ). Furthermore, Apache POI and docx4j generated invalid HTML, such as hyperlinks nested inside hyperlinks, so in the end we discarded these 2 tools.

Mammoth.js, which can be used in the browser or via Node.js, was created by Michael Williamson, who implemented the same functionality in Python and Java as well.

An implementation of mammoth.js can be found in a WordPress plugin, a great way to transform your documents into posts.

Mammoth.js is licensed under the BSD 2-Clause license and can be downloaded via npm.

Once your document is converted into HTML by mammoth.js, you can use Cheerio, a jQuery implementation for Node.js, to manipulate the DOM and enrich it with RDFa meta tags.
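Our actual enrichment code uses Cheerio on Node.js; purely as a sketch of the idea, the same step can be done in Python with the standard library (the HTML snippet and the schema.org terms below are made-up examples, not our real mapping):

```python
from xml.etree import ElementTree as ET

# Hypothetical HTML as produced by mammoth (clean enough to parse as XML).
html = "<html><body><h1>Contract 42</h1><p>Some clause text.</p></body></html>"

root = ET.fromstring(html)
body = root.find("body")

# Enrich the DOM with RDFa attributes (schema.org chosen as an example vocabulary).
body.set("vocab", "http://schema.org/")
body.set("typeof", "CreativeWork")
body.find("h1").set("property", "name")

enriched = ET.tostring(root, encoding="unicode")
print(enriched)
```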

From there you will need an RDFa parser, such as EasyRDF (which is used as a core library in Drupal 8), to transform the HTML+RDFa into RDF triples and store them in a triple store.

Our implementation can be found here:



Download attachments and page content from Confluence

For a project there is a need to move all the attachments in Confluence, including the page content, to a remote repository.

Confluence offers different APIs, and the XML-RPC API can still be used. The API does not allow downloading all the attachments at once but only page by page, hence the need for this Python script, which:

  1. Creates a folder for each page which includes
    1. the html export of the page
    2. all the attachments of the page
  2. Once all the pages are saved, it moves all the folders into the same hierarchical structure as the space
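The steps above can be sketched like this (the `export_space` helper is hypothetical and assumes the XML-RPC methods `getPages`, `renderContent`, `getAttachments` and `getAttachmentData`; check the method names and parameters against your Confluence version):

```python
import os
import xmlrpc.client

def export_space(api, token, space_key, out_dir):
    """Create one folder per page, containing the page HTML and its attachments."""
    for page in api.getPages(token, space_key):
        page_dir = os.path.join(out_dir, page["title"])
        os.makedirs(page_dir, exist_ok=True)
        # the html export of the page
        html = api.renderContent(token, space_key, page["id"], "")
        with open(os.path.join(page_dir, "page.html"), "w", encoding="utf-8") as f:
            f.write(html)
        # all the attachments of the page
        for att in api.getAttachments(token, page["id"]):
            data = api.getAttachmentData(token, page["id"], att["fileName"], "0")
            raw = data.data if isinstance(data, xmlrpc.client.Binary) else data
            with open(os.path.join(page_dir, att["fileName"]), "wb") as f:
                f.write(raw)

# Connecting for real (endpoint URL and credentials are placeholders):
# api = xmlrpc.client.ServerProxy("http://wiki.example.com/rpc/xmlrpc").confluence2
# token = api.login("user", "password")
# export_space(api, token, "SPACEKEY", "output")
```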

You can find the code in my GitHub repository:

Main inspiration came from:

Static code analysis in Python with Jenkins

I am checking the code quality of a Python script located in the root folder of my code.

To do so I installed the following packages (I am using Python 2.7 on a Windows machine):

pip install pylint
easy_install -U clonedigger
pip install flake8

On Jenkins make sure you have set up the Warnings plugin and the Violations plugin.

In the Jenkins configuration, under the Compiler Warnings section, we add a parser for flake8 with:

Name: flake8
Link name: Flake8 warnings
Trend report name: Flake8 warnings trend
Regular expression: ^(.*):([0-9]*):([0-9]*):(.[CE][0-9]*)(.*)$
Mapping script:
import hudson.plugins.warnings.parser.Warning
import hudson.plugins.analysis.util.model.Priority

String fileName = matcher.group(1)
String lineNumber = matcher.group(2)
String category = matcher.group(4)
String message = matcher.group(5)

return new Warning(fileName, Integer.parseInt(lineNumber), category, "PyFlakes Parser", message, Priority.NORMAL);

Note the [CE] pattern in the regular expression, used to match Complexity and Error problem ids. If you are not interested in complexity, just replace [CE] with E.
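To check that the expression captures what the mapping script expects, you can try it against a sample line (the file name and message below are made up):

```python
import re

# The same regular expression configured in the Jenkins parser.
FLAKE8_LINE = re.compile(r'^(.*):([0-9]*):([0-9]*):(.[CE][0-9]*)(.*)$')

# A hypothetical flake8 output line.
line = "myscript.py:10:80: E501 line too long (88 > 79 characters)"

m = FLAKE8_LINE.match(line)
file_name, line_number = m.group(1), m.group(2)   # "myscript.py", "10"
category, message = m.group(4), m.group(5)        # " E501" (the dot eats the space), the rest
```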

Now in the Jenkins job configuration we add a build step “Execute Windows batch command” (you could do it similarly on Linux) with the following content:

rmdir /s /q output
mkdir output
pylint --msg-template="{path}:{line}: [{msg_id}({symbol}), {obj}] {msg}" --reports=y >> output/pylint.log
clonedigger --cpd-output -o output/clonedigger.xml
flake8 --max-complexity 12 --output-file output/flake8-output.txt

The batch creates an output folder where pylint, clonedigger and flake8 generate their files (if you want, you could put the script in a .bat file taking the output folder and the file as parameters).

Add a post build action “Scan for compiler warnings”, select “Scan workspace files” and set the following parameters:

File patterns: output/flake8-output.txt
Parser: flake8

Add a post build action “Report violations” with the following parameters:

cpd: output/clonedigger.xml
pylint: output/pylint.log

Run the build and you should see the trend graph for Flake8 and the violations for cpd (clonedigger) and pylint!


If flake8 exits with exit code 1, failing your build, you may want to change the file:


At the end of the def main(): function you should have:

if exit_code > 0:
    raise SystemExit(exit_code > 0)

Which you can replace with:

raise SystemExit(0)

Now Flake8 should exit without problem.






Drupal and Open data

The “open data” topic is always in the air and, thanks to the open source world, it is becoming more and more widespread.

When we want to manage open data we need to consider 2 types of software:

  1. linked data applications;
  2. data catalogs.

The first type is oriented towards the use of triple store databases, specialized in storing RDF triples; here we find solutions like:

We find as well some REST API which aim to connect to triple stores:

The second type is oriented towards the management of open data:

Drupal has also a couple of interesting modules to connect to triple stores:

Another interesting module is RDFx, which allows you to manage RDF mappings with Drupal content types; it works in combination with the RESTWS module, so you can get the RDF extraction of a content type by simply adding “.rdf” to your node path (for example http://localhost/drupal/node/1.rdf).

Well, if you find more open source applications which might connect to Drupal, let me know :-)


Connect Drupal 7 and GitHub via OAuth

In a project I am working on there is a need to connect Drupal with GitHub.

GitHub offers an interesting API which is based on the OAuth protocol, like Google and Twitter.

The first thing to do is to create an account on GitHub and, on the profile page, add an application URL, which can even be set up on localhost (like http://localhost/drupal). When done, you will receive a client id and a client secret, which will be used by Drupal to connect as a client.

I set up Drupal on my laptop via XAMPP, which means quickly creating a user “drupal” with a database “drupal” via phpMyAdmin, then extracting Drupal into my htdocs folder (using drupal as the folder name) and following the instructions at http://localhost/drupal.

On Drupal you need to download and set up the OAuth Connector module, which comes with some dependencies, in particular the OAuth module, which can be used for OAuth2.

Once everything is set up you can start to configure it; for this you can follow the steps in this guide for Facebook:

with the data used from here (I just forked the original project):

As you can see, there is a mapping to be done between the GitHub profile attributes and the Drupal profile attributes. Since the GitHub name attribute is optional, I preferred to use the login attribute (you can see some attributes from my profile), so just change it in:

'name' => array(
  'resource' => '',
  'method post' => 0,
  'field' => 'login',
  'querypath' => FALSE,
  'sync_with_field' => 'name',
),

Unfortunately the avatar will not be copied, but you can contribute to the module :-)

When you connect to GitHub you might experience 2 types of problems:

If everything goes fine, at the end you will have a GitHub button at the login which redirects the user to GitHub and back to Drupal.

Be careful: when you come back to Drupal you have a session cookie, so the new user is authorized, but you cannot log in with your account, since Drupal by default blocks the user (the administrator should authorize it and the account should be confirmed by email; also, my Drupal doesn’t send emails). So I suggest removing the cookie from your browser, logging in as the administrator and unblocking the user (you will then be able to log in with the GitHub account), then changing the options in Drupal to allow people to register automatically without waiting for the administrator to unblock them.

With these last changes you would expose Drupal to security issues, but I haven’t tried it on a Drupal instance hosted on a public website with email confirmation enabled.





Mailman and OpenDJ

For a project I started to add mailing list functionality to the development infrastructure; as a requirement, the mailing list should make use of the LDAP service.

In a previous article I explained that I chose OpenDJ to manage LDAP users and groups, and for the management of mailing lists I chose Mailman, to be used with Postfix. As an alternative I could have chosen Sympa, but since Mailman is already packaged in many distributions and doesn’t have as many requirements as Sympa, I preferred to go with Mailman. The drawback is that Mailman doesn’t come with LDAP support, so I found a script which I adapted and uploaded to my GitHub account.

To manage mailing lists in OpenDJ you could opt for one of 3 ways:

  1. use the mailGroup object class
  2. use the groupOfUniqueNames object class together with the extensibleObject class
  3. customize your schema

OpenDJ has the schema definition of mailGroup, coming from the Solaris system, which allows you to enter the mail of the group and the mail of the users by using the mgrpRFC822MailMember attribute.

But… most of the time you will already have your users recorded in LDAP (most probably using inetOrgPerson); therefore, instead of directly inserting users’ mail addresses, you would like to insert the DN of each user. In this way even PHPLdapAdmin allows you to easily add each user without pasting email addresses.

Therefore I went for groupOfUniqueNames, which allows me to define users by their DN and, by adding the extensibleObject class, gives me the possibility to add the mail attribute.

In this way, if I have already defined a group, I can use it for a mailing list as well without replicating information (e.g. the mail address of each user, as with mailGroup).
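As an illustration, a group entry following this approach could look like the one below (the base DN, the group and the member are made-up examples):

```ldif
dn: cn=jenkins-users-mail,ou=groups,dc=my,dc=website,dc=com
objectClass: top
objectClass: groupOfUniqueNames
objectClass: extensibleObject
cn: jenkins-users-mail
mail: jenkins-users-mail@my.website.com
uniqueMember: cn=emidio,ou=users,dc=my,dc=website,dc=com
```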

After that I installed and configured Mailman and, as described above, I used a Perl script which:

  1. extracts the groups that are groupOfUniqueNames and have a mail attribute set: (&(objectClass=groupOfUniqueNames)(mail=*))
  2. extracts all the uniqueMember users belonging to the group which have a mail address set and are not disabled: (&(mail=*)(!(ds-pwp-account-disabled=*)))
  3. creates a mailing list which has as its name the CN attribute of the group (saved in the variable $listname)
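The script itself is Perl; purely as an illustration of the logic above (the data shapes are my assumptions, not the script’s actual API), the group-to-list mapping can be sketched in Python:

```python
# LDAP filters used by the script (copied from the steps above).
GROUP_FILTER = "(&(objectClass=groupOfUniqueNames)(mail=*))"
USER_FILTER = "(&(mail=*)(!(ds-pwp-account-disabled=*)))"

def lists_to_sync(groups, users_by_dn):
    """Map each group CN to the mail addresses of its enabled members.

    groups      -- entries matching GROUP_FILTER, as dicts with 'cn' and
                   'uniqueMember' (a list of member DNs)
    users_by_dn -- DN -> user dict for entries matching USER_FILTER
    """
    result = {}
    for group in groups:
        listname = group["cn"]  # the mailing list takes the group CN as its name
        result[listname] = [users_by_dn[dn]["mail"]
                            for dn in group["uniqueMember"]
                            if dn in users_by_dn]  # skip disabled/mail-less users
    return result
```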

If you execute the script with the command:


you will get an output like:

Processing list: jenkins-users-mail
Found member: emidio
List jenkins-users-mail does not exist.
Creating new list jenkins-users-mail.

Syncing jenkins-users-mail...
Added :

Enjoy creating mailing lists!

Export Crowd users and groups to LDAP

I need to migrate users and groups created in Atlassian Crowd to OpenDJ, an open source LDAP server written in Java and forked from OpenDS; in this way Crowd will act just as SSO and will rely on OpenDJ to store users.

Crowd can rely on several LDAP servers, but it doesn’t have tools to export users and groups to LDAP, so I created a couple of queries (my database is PostgreSQL) to export them to LDIF files:

COPY(select 'dn: cn=' || user_name || ',ou=users,dc=my,dc=website,dc=com' || chr(10) ||
'objectClass: inetOrgPerson' || chr(10) ||
'objectClass: organizationalPerson' || chr(10) ||
'objectClass: person' || chr(10) ||
'objectClass: top' || chr(10) ||
'mail: ' || lower_email_address || chr(10) ||
'uid: ' || user_name || chr(10) ||
'cn: ' || user_name || chr(10) ||
'givenName: ' || first_name || chr(10) ||
'sn: ' || last_name || chr(10) ||
'displayName: ' || display_name || chr(10) ||
'userPassword: ' || credential || chr(10) || chr(10)
from cwd_user) TO '/tmp/users.ldif';
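For a hypothetical user “jdoe”, one row produced by this query would look like:

```ldif
dn: cn=jdoe,ou=users,dc=my,dc=website,dc=com
objectClass: inetOrgPerson
objectClass: organizationalPerson
objectClass: person
objectClass: top
mail: jdoe@example.com
uid: jdoe
cn: jdoe
givenName: John
sn: Doe
displayName: John Doe
userPassword: {PKCS5S2}...
```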

Of course you can customize the query to adapt it to your needs. Keep in mind that the userPassword is most probably stored in the PKCS5S2 format (you can see that the extracted passwords start with the {PKCS5S2} prefix; see Crowd hashes), which is the default for Crowd, also called Atlassian Security. Currently only ApacheDS (2.0.0-M16) supports the same format, while for OpenDJ there is a change to be made to the pbkdf2 storage scheme. It is up to you whether you prefer to reset the passwords or try to keep them during the migration.

As a small aside, I have asked the Crowd people about their connector for ApacheDS, since it supports ApacheDS 1.5.x (no longer maintained by the ApacheDS developers) and they are not sure about spending time updating the connector for version 2.0; see my question on Atlassian Answers. An interesting thing to know is that WSO2 Identity Server also bundles ApacheDS 1.5.7 (you can see it in the source code), but you can connect it to OpenDJ.

In case your Crowd was set up with SSHA1, you can also use OpenLDAP; see the FAQ.

For the groups the query would be something like:

COPY(select 'dn: cn=' || parent_name || 'BASE_GROUPS' || chr(10) ||
'cn: ' || parent_name || chr(10) ||
'objectClass: groupOfUniqueNames' || chr(10) ||
'objectClass: top' || chr(10) ||
'uniquemember: ' || array_to_string(array_agg(child_name),',')
from cwd_membership group by parent_name) TO '/tmp/groups.ldif';

As you can see, I used the array_to_string and array_agg functions built into PostgreSQL to create the list of users. These users will not have their base (,ou=users,dc=my,dc=website,dc=com), so, after executing the query, you can apply a regular expression (I used Notepad++) to replace the regex “([,])” with “,ou=users,dc=my,dc=website,dc=com\nuniquemember: cn”, which completes each “,” inserted by the query with the base. At the end, I replaced BASE_GROUPS with “,ou=groups,dc=my,dc=website,dc=com”.

You can import the LDIF files with PHPLdapAdmin or directly into OpenDJ with the import-ldif command; keep in mind that you need to allow pre-encoded passwords before importing.

Enjoy the migration!