Selenium: testing PDF with Calibre

During my test activities I have arrived at the point that I need to test generated PDFs containing the same text that I have in a html page.

So I searched different PDF utilities that can be used through a command line to extract text out of a PDF, among these I found: pdftotext (a command line tool coming with xpdf), Calibre (an ebook manager/converter), Tika (used by Solr to index pdf content), pdfbox (a java application library that extract pdf).

Giving a better look Calibre is using poppler as library and pdftotext is using poppler as well while Tika is based on pdfbox library to extract text.

At the end I chose Calibre for different reasons:

  1. Calibre allows to extract hyperlinks from the pdf (main reason) when using a wiki syntax like textile (I didn’t find this possibility in pdftotext, tika or pdfbox) on the other side the generated text is with wiki syntax
  2. The development life cycle of Calibre is quite fast (compared with xpdf)
  3. Calibre points to release the application for Windows, Mac and Linux (with xpdf you can have different versions but the one for Mac is released on 2007)

Note that I don’t care the possibility to extract text from images, I know that with pdfimages command it is possible to extract them or other formats that tika can support.

I have to say that in terms of performance on the pdf I tested pdftotext is the fastest (0.01 sec), followed by pdfbox (1.7 sec), Calibre (2.3 sec) and then Tika (2.8 sec) on page , so if you don’t care about hyperlinks but just PDF text I would suggest to go to pdftotext.

After choosing Calibre I created a simple php application that:

  1. receives as parameter the URL of a PDF
  2. calls Calibre with different options on the PDF downloaded
  3. removes  some text (like empty lines, etc.)
  4. gives back an xml file with all the text generated.

I created then a selenium command that calls the php file (through an xmlhttp request) and search for a particular text (specified as input together with the url of the pdf file).

Php application (convert.php):

<?php
$content = file_get_contents($_GET['url']);
$filefrom = 'extract.pdf';
$fileto = preg_replace("/\.pdf$/","",$filefrom).".txt";
file_put_contents($filefrom,$content);
//$t_start = microtime(true);
//system("java -jar tika.jar -t ".escapeshellcmd($filefrom)." > ".escapeshellcmd($fileto),$ret);
system("ebook-convert ".escapeshellcmd($filefrom)." ".escapeshellcmd($fileto)."  >nul 2>&1",$ret);
//system("pdftotext -nopgbrk -raw ".escapeshellcmd($filefrom)." ".escapeshellcmd($fileto),$ret);
//system("java -jar pdfbox.jar ExtractText  ".escapeshellcmd($filefrom)." ".escapeshellcmd($fileto),$ret);
//$t_end = microtime(true);
if($ret==0){
  $value=file_get_contents($fileto);
  $value_empty_line=preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $value);
  $text=preg_replace("/(\[.*\]\n)/","",$value_empty_line);
  unlink($fileto);
  header('Content-type: text/xml');
  echo "<result>".htmlspecialchars($text)."</result>";
}
else{
  header('Content-type: text/xml');
  echo "<result>Convertion failed</result>";
}
?>

Selenium command (assertSearhInPDF):

Selenium.prototype.assertSearchInPDF = function(uri,names){
  var baseurl = globalStoredVars['host'];
  var params = [{"param" : "url","value" : storedVars[uri],}];
  var lista = "";
  for(var i=0; i<params.length; i++){
    lista +="&" + params[i].param + "=" + encodeURIComponent(params[i].value);
  }
  var indirizzo = baseurl+"/test/convert.php?"+lista;
  LOG.info( 'indirizzo = ' + indirizzo);
  var responseXML = chiamaWebService(indirizzo);
  LOG.info( 'response = ' + responseXML);
  var text = responseXML.getElementsByTagName('result')[0].firstChild.nodeValue;
  text = text.replace(/(\n)/g, " ");
  var array = names.split('|');
  var result=0;
  var length=array.length;
  LOG.info( 'text = ' + text);
  for (var i = 0; i < length; i++){
    if(text.indexOf(array[i]) !==-1){
      LOG.info( 'Found = ' + array[i]);
      result=result+1;
      text=text.substring(text.indexOf(array[i]));
    }
    else{
      LOG.info( 'Element ' + array[i]+' not found');
      break;
    }
  }
  if(result!=length)
    Assert.fail("Not all the elements have been found");
};

In particular the selenium command can accept a list of elements (separated by “|”) to searched in the respective order, if they are not found the command fails.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s