During my testing I reached the point where I needed to verify that generated PDFs contain the same text as an HTML page.
So I looked for PDF utilities that can extract text from a PDF via the command line. Among these I found: pdftotext (a command line tool that comes with xpdf), Calibre (an ebook manager/converter), Tika (used by Solr to index PDF content) and pdfbox (a Java library that extracts text from PDFs).
Looking more closely, Calibre uses poppler as its library and so does pdftotext, while Tika relies on the pdfbox library to extract text.
In the end I chose Calibre, for several reasons:
- Calibre can extract hyperlinks from the PDF (the main reason) when the output uses a wiki syntax such as textile; I did not find this possibility in pdftotext, Tika or pdfbox. On the other hand, the generated text then contains wiki syntax
- Calibre's development cycle is quite fast (compared with xpdf)
- Calibre aims to release the application for Windows, Mac and Linux (xpdf has builds for different platforms, but the Mac one was released in 2007)
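To illustrate why the wiki syntax is useful for link checks, here is a minimal JavaScript sketch that pulls hyperlinks out of text in Textile's `"link text":url` form. The function name `extractTextileLinks` is hypothetical, and the sketch assumes the extracted text uses that standard Textile link form; the exact markup Calibre emits should be checked against real output.

```javascript
// Extract links written in Textile's "link text":url form.
// Assumption: the URL runs until the next whitespace character, so
// punctuation directly after a URL would be captured as part of it.
function extractTextileLinks(text) {
  const pattern = /"([^"\n]+)":(\S+)/g;
  const links = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    links.push({ text: match[1], url: match[2] });
  }
  return links;
}
```

A test could then assert that a given URL appears in the returned list instead of searching the raw text for it.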
Note that I don’t care about extracting text from images. I know that images can be extracted with the pdfimages command, and that Tika supports other formats.
In terms of performance on the PDF I tested, pdftotext is the fastest (0.01 sec), followed by pdfbox (1.7 sec), Calibre (2.3 sec) and then Tika (2.8 sec). So if you don’t care about hyperlinks and only need the PDF text, I would suggest pdftotext.
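Timings like these can be reproduced with a small harness. The following is a rough Node.js sketch, not the method used for the numbers above: `timeCommand` is a hypothetical helper, and the file names in the usage comments are placeholders.

```javascript
const { spawnSync } = require("child_process");

// Run a command once and return its exit status and wall-clock time in seconds.
function timeCommand(command, args) {
  const start = process.hrtime.bigint();
  const result = spawnSync(command, args, { stdio: "ignore" });
  const end = process.hrtime.bigint();
  if (result.error) {
    throw result.error; // e.g. the tool is not installed or not on the PATH
  }
  return { status: result.status, seconds: Number(end - start) / 1e9 };
}

// Hypothetical usage, assuming the tools are installed and sample.pdf exists:
// console.log(timeCommand("pdftotext", ["-nopgbrk", "-raw", "sample.pdf", "sample.txt"]));
// console.log(timeCommand("ebook-convert", ["sample.pdf", "sample.txt"]));
```

A single run includes process start-up (noticeable for the Java based tools), so averaging several runs gives a fairer comparison.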
After choosing Calibre I created a simple PHP application that:
- receives the URL of a PDF as a parameter
- calls Calibre with different options on the downloaded PDF
- removes some text (such as empty lines)
- returns an XML file with all the generated text.
I then created a Selenium command that calls the PHP file (through an XMLHttpRequest) and searches for particular text (specified as input together with the URL of the PDF file).
PHP application (convert.php):
<?php
// Download the PDF passed in the "url" parameter and store it locally.
$content = file_get_contents($_GET['url']);
$filefrom = 'extract.pdf';
$fileto = preg_replace("/\.pdf$/", "", $filefrom) . ".txt";
file_put_contents($filefrom, $content);

// The other extractors tried during the comparison are left commented out.
// escapeshellarg() quotes each file name as a single shell argument.
// ">nul 2>&1" discards the output on Windows; use ">/dev/null 2>&1" on Linux or Mac.
//$t_start = microtime(true);
//system("java -jar tika.jar -t " . escapeshellarg($filefrom) . " > " . escapeshellarg($fileto), $ret);
system("ebook-convert " . escapeshellarg($filefrom) . " " . escapeshellarg($fileto) . " >nul 2>&1", $ret);
//system("pdftotext -nopgbrk -raw " . escapeshellarg($filefrom) . " " . escapeshellarg($fileto), $ret);
//system("java -jar pdfbox.jar ExtractText " . escapeshellarg($filefrom) . " " . escapeshellarg($fileto), $ret);
//$t_end = microtime(true);

header('Content-type: text/xml');
if ($ret == 0) {
    $value = file_get_contents($fileto);
    // Collapse blank or whitespace-only lines into a single line break.
    $value_empty_line = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $value);
    // Drop bracketed markup that ends a line, together with its line break.
    $text = preg_replace("/(\[.*\]\n)/", "", $value_empty_line);
    unlink($fileto);
    echo "<result>" . htmlspecialchars($text) . "</result>";
} else {
    echo "<result>Conversion failed</result>";
}
?>
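To make the effect of the two cleanup regular expressions easier to see, here is a JavaScript port of that step as a sketch. The function names `cleanExtractedText` and `wrapResult` are hypothetical, and the escaping only approximates PHP's `htmlspecialchars` (it covers `&`, `<`, `>` and `"`).

```javascript
// Mirror of the two preg_replace calls in convert.php.
function cleanExtractedText(value) {
  // Collapse blank or whitespace-only lines into a single line break.
  const withoutEmptyLines = value.replace(/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/g, "\n");
  // Remove any run from "[" to a "]" that ends a line, including the line break.
  return withoutEmptyLines.replace(/\[.*\]\n/g, "");
}

// Escape XML special characters and wrap the text in a <result> element.
function wrapResult(text) {
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
  return "<result>" + escaped + "</result>";
}
```

For example, `cleanExtractedText("Title\n\n\n[image]\nFirst line\n   \nSecond line\n")` yields `"Title\nFirst line\nSecond line\n"`: the blank lines are collapsed and the bracketed line is dropped.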
Selenium command (assertSearchInPDF):
Selenium.prototype.assertSearchInPDF = function(uri, names) {
    var baseurl = globalStoredVars['host'];
    // Query parameters for convert.php; storedVars[uri] holds the URL of the PDF.
    var params = [{"param": "url", "value": storedVars[uri]}];
    var lista = [];
    for (var i = 0; i < params.length; i++) {
        lista.push(params[i].param + "=" + encodeURIComponent(params[i].value));
    }
    var indirizzo = baseurl + "/test/convert.php?" + lista.join("&");
    LOG.info('indirizzo = ' + indirizzo);
    // chiamaWebService (not shown here) performs the XMLHttpRequest and returns the XML response.
    var responseXML = chiamaWebService(indirizzo);
    LOG.info('response = ' + responseXML);
    // textContent also covers a long response that is split across several text nodes.
    var text = responseXML.getElementsByTagName('result')[0].textContent;
    text = text.replace(/(\n)/g, " ");
    var array = names.split('|');
    var result = 0;
    var length = array.length;
    LOG.info('text = ' + text);
    for (var j = 0; j < length; j++) {
        var index = text.indexOf(array[j]);
        if (index !== -1) {
            LOG.info('Found = ' + array[j]);
            result = result + 1;
            // Continue from this match so that later elements must appear at or after it.
            text = text.substring(index);
        } else {
            LOG.info('Element ' + array[j] + ' not found');
            break;
        }
    }
    if (result != length) {
        Assert.fail("Not all the elements have been found");
    }
};
In particular, the Selenium command accepts a list of elements (separated by “|”) to be searched for in the given order; if any of them is not found, the command fails.
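The ordered search can also be tried outside Selenium. The following is a standalone sketch with a hypothetical function name, `findInOrder`; it is a variant that resumes searching after the end of each match, so an element listed twice must also occur twice in the text.

```javascript
// Search for each "|"-separated element in order.
// Returns -1 if all elements were found, otherwise the index of the
// first element that could not be found after the previous match.
function findInOrder(text, names) {
  const items = names.split("|");
  let position = 0;
  for (let i = 0; i < items.length; i++) {
    const found = text.indexOf(items[i], position);
    if (found === -1) {
      return i;
    }
    position = found + items[i].length; // resume after this match
  }
  return -1;
}
```

With the text `"Invoice 42 Total 100 EUR"`, the list `"Invoice|Total|EUR"` succeeds, while `"Total|Invoice"` fails on its second element because "Invoice" only appears before "Total".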
