AI cookbook competition - transcription for the openplaques project

Ian Ozsvald over at aicookbook has been doing some work using optical character recognition (OCR) to transcribe plaques for the openplaques group. His write-ups have been interesting so when he posted a challenge to the community to improve on his demo code I decided to give it a try.

The demo code was very much a proof of principle and its score of 709.3 was easy to beat. I managed to quickly get the score down to 44 and with a little more work reached 33.4. The score is a Levenshtein distance metric so the lower the better. I was hoping to get below 30 but in the end just didn't have time. I suspect it wouldn't take a lot of work to improve on my score. Here's what I've done so far . . .

Configure the system

All the work I've done was on an Ubuntu 10.04 installation and the instructions which follow will only deal with this environment. Beyond the base install I use three different packages:

Python Imaging Library
Used for pre-processing the images before submitting to tesseract
Tesseract
The OCR software used
Enchant spellchecker
Used for cleaning up the transcribed text

Their installation is straightforward using apt-get

$ sudo apt-get install python-imaging python-enchant tesseract-ocr tesseract-ocr-eng

Fetch images

The demo code written by Ian (available here) includes a script to fetch the images from flickr. It's as simple as running the following

$ python easy_blue_plaques.csv

Once the images are downloaded I suggest you go ahead and run the demo transcribing script. Again it's nice and simple

$ python easy_blue_plaques.csv

Then you can calculate the score using

$ python results.csv

Improving transcription

Ian had posted a number of good suggestions on the wiki for how to improve the transcription quality. I used four approaches:

Image preprocessing
Cropping the image and converting to black and white takes the score from 782 (the demo code produced a higher score on my system than it did for Ian) to 44.6
Restricting the characters tesseract will return
Restricting the character set used by tesseract to alphanumeric characters and a limited selection of punctuation characters further lowered the score from 44.6 to 35.7
Spell checking
Running the results from tesseract through a spell checker and filtering out some common errors brought the score down to 33.4

I'll post the entire script at the bottom of this post but want to highlight a few of the key elements first.

The first stage, cropping the image to the plaque, is handled by the function crop_to_plaque, which expects a Python Imaging Library image object. The function first reduces the size of the image to speed up processing and then looks for blue pixels. A blue pixel is assumed to be any pixel where the value of the blue channel is more than 20% higher than both the red and green channels. The number of blue pixels in each row and column of the image is counted, and the image is then cropped to the columns where the count exceeds 15% of the image height and the rows where it exceeds 15% of the image width. This cutoff is based solely on experimentation and seemed to give good results for this selection of plaques.
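As a rough sketch of the counting and cropping logic described above, the following works on a plain grid of (R, G, B) tuples rather than a PIL image and skips the resize and blur steps. The names is_blue and plaque_bounds are made up for this illustration and are not taken from the full script.

```python
def is_blue(pixel, margin=1.2):
    """True when blue exceeds both red and green by the given factor."""
    r, g, b = pixel
    return b > r * margin and b > g * margin


def plaque_bounds(pixels, cutoff=0.15):
    """Return (left, top, right, bottom) for a grid given as rows of RGB tuples.

    A column is kept when its blue-pixel count exceeds cutoff * height,
    a row when its count exceeds cutoff * width. Returns None when no
    row or column qualifies.
    """
    height, width = len(pixels), len(pixels[0])
    col_counts = [0] * width
    row_counts = [0] * height
    for y, row in enumerate(pixels):
        for x, pixel in enumerate(row):
            if is_blue(pixel):
                col_counts[x] += 1
                row_counts[y] += 1
    cols = [x for x, n in enumerate(col_counts) if n > cutoff * height]
    rows = [y for y, n in enumerate(row_counts) if n > cutoff * width]
    if not cols or not rows:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])
```

Because a row or column needs a minimum share of blue pixels, an isolated blue speck outside the plaque does not widen the crop box.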

The next stage of converting the image to black and white is handled by the function convert_to_bandl which again expects a python image library image object. The function converts any blue pixels to white and all other pixels to black. Ian has pointed out that this approach might be overly stringent and I might get better results using some grey as well. The result of running these two functions on three of the plaques is shown below.


The next step was limiting the character set used by tesseract. The easiest way to do this is to create a file in /usr/share/tesseract-ocr/tessdata/configs/ which I called goodchars with the following content.


That selection of characters seems to include all the characters present in the plaques. To use this limited character set the call to tesseract needs to be altered to

cmd = 'tesseract %s %s -l eng nobatch goodchars' % (filename_tif, filename_base)

Finally I perform a number of small clean-up tasks. First I fix the year ranges, which frequently had extra spaces inserted; occasionally a 1 appeared as i or l and a 3 appeared as a closing parenthesis. These were fixed by a couple of regular expressions, one of which uses a callback function (clean_years). Then I separate the transcription into individual words and fix several further issues, including lone characters and duplicated characters, before checking the spelling of any word of more than two characters.
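The year clean-up can be tried in isolation with a short sketch like the one below. The two regular expressions are the ones used in the full script; the wrapper fix_years and the lookup table inside clean_years are written for this illustration, following the substitutions described above (i and l become 1, a closing parenthesis becomes 3).

```python
import re


def clean_years(match):
    """Callback: swap characters commonly misread inside a four-character year."""
    fixes = {'l': '1', 'i': '1', ')': '3'}
    return ''.join(fixes.get(ch, ch) for ch in match.group(1))


def fix_years(text):
    """Close gaps in year ranges, then repair misread digits."""
    # "1850 - 1900" becomes "1850-1900"
    text = re.sub(r"(\d+)\s*-\s*(\d+)", r"\1-\2", text)
    # any run of four characters drawn from digits, i, l and ) is treated as a year
    return re.sub(r"([0-9il\)]{4})", clean_years, text)
```

For example, fix_years("i9l2 - 19)5") gives "1912-1935". Note that the second pattern does not require a digit to be present, so it can also fire inside ordinary words containing a run such as "illi"; this may be one source of newly introduced errors.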

Where next?

There is still lots of 'low hanging fruit' on this problem. At the moment the curved text at the top of the plaque and the small symbol at the bottom are handled badly, and I think the bad characters at the beginning and end of the transcriptions could easily be stripped out. The spelling corrections I make reduce the error overall but they also introduce some new errors. I suspect that by being more selective about where spelling checks are made some of these introduced errors could be removed.

The entire script

import os
import sys
import csv
import urllib
from PIL import Image
from PIL import ImageFilter
import enchant
import re

# This recognition system depends on tesseract-ocr
# version 2.04, it must be installed and compiled already

# run it with 'cmdline> python easy_blue_plaques.csv'
# and it'll:
# 1) send images to tesseract
# 2) read in the transcribed text file
# 3) convert the text to lowercase
# 4) use a Levenshtein error metric to compare the recognised text with the
# human supplied transcription (in the plaques list below)
# 5) write error to file

# For more details see:

def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))#, delimiter=',')
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash+1:]
        filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
        filename = filename_base + '.tif'        
        root_url = image_url[:last_slash+1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques

def levenshtein(a,b):
    """Calculates the Levenshtein distance between a and b
       Taken from:"""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

def transcribe_simple(filename):
    """Convert image to TIF, send to tesseract, read the file back, clean and
    return the transcription"""
    # read in original image, save as .tif for tesseract
    im = Image.open(filename)
    filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
    #Enhance contrast
    #contraster = ImageEnhance.Contrast(im)
    #im = contraster.enhance(3.0)
    im = crop_to_plaque(im)
    im = convert_to_bandl(im)
    filename_tif = 'processed' + filename_base + '.tif'
    im.save(filename_tif, 'TIFF')

    # call tesseract, read the resulting .txt file back in
    cmd = 'tesseract %s %s -l eng nobatch goodchars' % (filename_tif, filename_base)
    print "Executing:", cmd
    os.system(cmd)
    input_filename = filename_base + '.txt'
    input_file = open(input_filename)
    lines = input_file.readlines()
    input_file.close()
    line = " ".join([x.strip() for x in lines])
    # delete the output from tesseract
    os.remove(input_filename)

    # convert line to lowercase
    transcription = line.lower()
    #Remove gaps in year ranges
    transcription = re.sub(r"(\d+)\s*-\s*(\d+)", r"\1-\2", transcription)
    transcription = re.sub(r"([0-9il\)]{4})", clean_years, transcription)
    #Separate words
    d = enchant.Dict("en_GB")
    newtokens = []
    print 'Prior to post-processing: ', transcription
    tokens = transcription.split(" ")
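    for token in tokens:
        if (token == 'i') or (token == 'l') or (token == '-'):
            #Lone characters are dropped
            pass
        elif token == '""':
            #Duplicated character (assumed: collapse to a single one)
            newtokens.append('"')
        elif token == '--':
            #Duplicated character (assumed: collapse to a single one)
            newtokens.append('-')
        elif len(token) > 2:
            if d.check(token):
                #Token is a valid word
                newtokens.append(token)
            else:
                #Token is not a valid word
                suggestions = d.suggest(token)
                if len(suggestions) > 0:
                    #If the spell check has suggestions take the first one
                    newtokens.append(suggestions[0])
                else:
                    #No suggestions so keep the token unchanged
                    newtokens.append(token)
        else:
            #Short tokens are not spell checked
            newtokens.append(token)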
    transcription = ' '.join(newtokens)

    return transcription
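
def clean_years (m):
    """Callback for re.sub: fix characters commonly misread in a year"""
    digits = m.group(1)
    year = []
    for digit in digits:
        if digit == 'l':
            year.append('1')
        elif digit == 'i':
            year.append('1')
        elif digit == ')':
            year.append('3')
        else:
            year.append(digit)
    return ''.join(year)
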
def crop_to_plaque (srcim):
    scale = 0.25
    wkim = srcim.resize((int(srcim.size[0] * scale), int(srcim.size[1] * scale)))
    wkim = wkim.filter(ImageFilter.BLUR)
    width = wkim.size[0]
    height = wkim.size[1]
    #result = wkim.copy();
    highlight_color = (255, 128, 128)
    R,G,B = 0,1,2
    lrrange = {}
    for x in range(width):
        lrrange[x] = 0
    tbrange = {}
    for y in range(height):
        tbrange[y] = 0
    for x in range(width):    
        for y in range(height):
            point = (x,y)
            pixel = wkim.getpixel(point)
            if (pixel[B] > pixel[R] * 1.2) and (pixel[B] > pixel[G] * 1.2):
                lrrange[x] += 1
                tbrange[y] += 1
                #result.putpixel(point, highlight_color)
    left = 0
    right = 0 
    cutoff = 0.15      
    for x in range(width):
        if (lrrange[x] > cutoff * height) and (left == 0):
            left = x
        if lrrange[x] > cutoff * height:
            right = x

    top = 0
    bottom = 0
    for y in range(height):
        if (tbrange[y] > cutoff * width) and (top == 0):
            top = y
        if tbrange[y] > cutoff * width:
            bottom = y
    left = int(left / scale)
    right = int(right / scale)
    top = int(top / scale)
    bottom = int(bottom / scale)
    box = (left, top, right, bottom)
    region = srcim.crop(box)
    return region
def convert_to_bandl (im):
    width = im.size[0]
    height = im.size[1]
    white = (255, 255, 255)
    black = (0, 0, 0)
    R,G,B = 0,1,2
    for x in range(width):
        for y in range(height):
            point = (x,y)
            pixel = im.getpixel(point)
            if (pixel[B] > pixel[R] * 1.2) and (pixel[B] > pixel[G] * 1.2):
                im.putpixel(point, white)
            else:
                im.putpixel(point, black)
    return im

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python %s plaques.csv (e.g. easy_blue_plaques.csv)" % sys.argv[0]
    else:
        plaques = load_csv(sys.argv[1])

        results = open('results.csv', 'w')

        for root_url, filename, text in plaques:
            print "----"
            print "Working on:", filename
            transcription = transcribe_simple(filename)
            print "Transcription: ", transcription
            print "Text: ", text
            error = levenshtein(text, transcription)
            assert isinstance(error, int)
            print "Error metric:", error
            results.write('%s,%d\n' % (filename, error))

Predicting HIV Progression

About a month ago I came across Kaggle which provides a platform for prediction competitions. It's an interesting concept. Accurate predictions are very useful but designing systems to make such predictions is challenging. By engaging the public it's hoped that talent not normally available to the competition organiser will have a try at the problem and come up with a model which is superior to previous efforts.

Prediction is not exactly my area of expertise but I wanted to have a crack at one of the competitions currently running: predicting response to treatment in HIV patients. I haven't yet started developing a model but wanted to release the python framework I've put together to test ideas. It can be downloaded here.

I've included a number of demonstration prediction methods: randomly guessing, assuming all will respond, or assuming none will respond. I suggest you start with one of these methods and then improve on it with your own attempt. The random method was my first submission which, at the time of writing, puts me in 30th position out of 33 teams. Improving on that shouldn't be difficult.

The usage of the framework isn't difficult.

>>> import bootstrap
>>> boot = bootstrap.Bootstrap("method_rand")
Mean score:  0.501801084135
Standard deviation:  0.0241816159815
Maximum:  0.544554455446
Minimum:  0.442386831276

During development you can use the Bootstrap class to get an idea of how well your method works, as demonstrated above. All the training data is split randomly into training and testing sets, and the method is then trained on the training set and assessed on the test set. This process is repeated (50 times by default) and the scores are returned. The score will differ from the one you get when you submit, but it should give you an indication of how well you're doing.
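The repeated random splitting can be sketched in a few lines. This is a stand-alone illustration of the idea, not the code from the framework: the function names, the 30% test fraction and the use of the fraction of correct predictions as the score are all assumptions made for the example.

```python
import random


def repeated_split_scores(records, train_method, repeats=50,
                          test_fraction=0.3, seed=None):
    """Score a method on repeated random train/test splits.

    records is a list of (features, label) pairs and train_method takes
    the training records and returns a function mapping features to a label.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(records)
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = train_method(train)
        hits = sum(1 for features, label in test if predict(features) == label)
        scores.append(hits / float(len(test)))
    return scores


def method_all_respond(train):
    """Baseline that ignores the training data and always predicts a response (1)."""
    return lambda features: 1


def summarise(scores):
    """Mean, sample standard deviation, maximum and minimum of the scores."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return {"mean": mean, "sd": variance ** 0.5,
            "max": max(scores), "min": min(scores)}
```

Calling summarise(repeated_split_scores(records, method_all_respond)) yields the same four figures as the output shown above. The spread between minimum and maximum gives a feel for how much a single split can mislead.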

>>> import submission
>>> sub = submission.Submission("method_rand")

When you are satisfied with your method you can create the file needed for submission using the above code. In this case we are sticking with the random method. The submission file is submission1.csv. Hopefully this code is useful to you and you'll submit a prediction method yourself.


Adding bcrypt-style password hashing to Zend_Auth with phpass

I've been using the Zend Framework to good effect on and off for a few months now and have found it very useful in rapidly bringing projects to completion. Many people feel Zend Framework is more a library than a framework and with good reason. There are few things it prevents you from doing but it's not ready to go 'out of the box' in the way some other frameworks are. One example is in the way passwords are stored in the database. The default is simply to store them in plain text.

The manual does cover hashing the password but even this isn't really ideal. There seems to be some consensus forming that the correct way to handle passwords is using bcrypt, a Blowfish-based hashing scheme. The most widely known demonstration of this within the PHP community is the phpass hashing framework. It has already been integrated into WordPress and phpBB, so I was in good company integrating it into my own projects.

The first step is making the phpass code available in your ZF project. The changes I made were minor, renaming the class and switching the PasswordHash function to __construct. I would encourage you to fetch the latest code from the phpass project page. The snippet of code I changed is below.

class Acai_Hash {
	var $itoa64;
	var $iteration_count_log2;
	var $portable_hashes;
	var $random_state;

	function __construct($iteration_count_log2, $portable_hashes)

The next step was altering the database table adapter in Zend_Auth. The code for this is below.


class Acai_Auth_Adapter_DbTable extends Zend_Auth_Adapter_DbTable
{
    public function __construct ($zendDb = null, $tableName = null, $identityColumn = null, $credentialColumn = null)
    {
        //Get the default db adapter
        //From where?  It is not stored in the registry.
        if ($zendDb == null) {
            $zendDb = Zend_Registry::get('DbAdapter');
        }

        //Set default values
        $tableName = $tableName ? $tableName : 'accounts';
        $identityColumn = $identityColumn ? $identityColumn : 'email';
        $credentialColumn = $credentialColumn ? $credentialColumn : 'password';

        parent::__construct($zendDb, $tableName, $identityColumn, $credentialColumn);
    }

    protected function _authenticateCreateSelect()
    {
        // get select
        $dbSelect = clone $this->getDbSelect();
        //Select on the identity only; the credential is checked in PHP
        $dbSelect->from($this->_tableName)
                 ->where(
            $this->_zendDb->quoteIdentifier($this->_identityColumn, true)
            . ' = ?', $this->_identity);

        return $dbSelect;
    }

    protected function _authenticateValidateResult($resultIdentity)
    {
        //Check that hash value is correct
        $hash = new Acai_Hash(8, false);
        $check = $hash->CheckPassword($this->_credential,
                        $resultIdentity[$this->_credentialColumn]);

        if (!$check) {
            $this->_authenticateResultInfo['code'] =
                            Zend_Auth_Result::FAILURE_CREDENTIAL_INVALID;
            $this->_authenticateResultInfo['messages'][] =
                            'Supplied credential is invalid.';
            return $this->_authenticateCreateAuthResult();
        }

        $this->_resultRow = $resultIdentity;

        $this->_authenticateResultInfo['code'] =
                            Zend_Auth_Result::SUCCESS;
        $this->_authenticateResultInfo['messages'][] =
                            'Authentication successful.';
        return $this->_authenticateCreateAuthResult();
    }

    public function getResultRowObject ($returnColumns = null, $omitColumns = null)
    {
        if ($returnColumns || $omitColumns) {
            return parent::getResultRowObject($returnColumns, $omitColumns);
        } else {
            //Never return the password hash by default
            $omitColumns = array('password');
            return parent::getResultRowObject($returnColumns, $omitColumns);
        }
    }
}


Usage is just as for the standard Zend adapter.

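$auth = Zend_Auth::getInstance();
$authAdapter = new Acai_Auth_Adapter_DbTable;
//$email and $password are assumed to hold the submitted login details
$authAdapter->setIdentity($email)
            ->setCredential($password);
$result = $auth->authenticate($authAdapter);

if (!$result->isValid()) {
    //Bad credentials
} else {
    //Good credentials
}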

When you're initially registering a new user a hash can be generated simply with the following code.

//Generate hash for password
$hash = new Acai_Hash(8, false);
$passwordHash = $hash->HashPassword($password);

Hopefully you can integrate the above code into your own projects. If you have any questions post them in the comments below and I'll try to answer them.

Email templates using Zend_Mail and Zend_View

In a recent project I was working on which was based on Zend Framework (ZF) I wanted to send out some fairly complex emails following user registrations and for various alerts. ZF has a decent class for sending emails which just left forming the text I wanted to send in each email. For each type of email much of the content would remain the same with just a few things changing like name and date. I could have mixed up these elements and strung them together in a variable. This would probably have become very messy very quickly in the same way that mixing html and logic together gets messy quickly.

The solution to the html/logic issue is to use templates and unsurprisingly the same approach also works well for email. There is no shortage of templating systems for PHP and I suspect most would work perfectly adequately for this task. As I was already using ZF though I decided to go with Zend_View.

The class which follows wraps Zend_Mail and Zend_View together. It's possible to quickly and simply assign variables to be used in the template and then when the time comes to send the email additional default variables can be included from a config file. The config file also includes the location where the templates are stored and the email address from which the email should be sent.

/**
 * A template based email system
 *
 * Supports the sending of multipart txt/html emails based on templates
 *
 * @author Jonathan Street
 */
class Acai_Mail
{
    /**
     * Variable registry for template values
     */
    protected $templateVariables = array();

    /**
     * Template name
     */
    protected $templateName;

    /**
     * Zend_Mail instance
     */
    protected $zendMail;

    /**
     * Email recipient
     */
    protected $recipient;

    /**
     * __construct
     *
     * Set default options
     */
    public function __construct ()
    {
        $this->zendMail = new Zend_Mail();
    }

    /**
     * Set variables for use in the templates
     *
     * Magic function stores the value put in any variable in this class for
     * use later when creating the template
     *
     * @param string $name  The name of the variable to be stored
     * @param mixed  $value The value of the variable
     */
    public function __set ($name, $value)
    {
        $this->templateVariables[$name] = $value;
    }

    /**
     * Set the template file to use
     *
     * @param string $filename Template filename
     */
    public function setTemplate ($filename)
    {
        $this->templateName = $filename;
    }

    /**
     * Set the recipient address for the email message
     *
     * @param string $email Email address
     */
    public function setRecipient ($email)
    {
        $this->recipient = $email;
    }

    /**
     * Send the constructed email
     *
     * @todo Add from name
     */
    public function send ()
    {
        /*
         * Get data from config
         * - From address
         * - Directory for template files
         */
        $config = Zend_Registry::get('Config');
        $templateDir = $config->email->template->dir;
        $fromAddr = $config->email->from;
        $templateVars = $config->email->vars->toArray();

        foreach ($templateVars as $key => $value) {
            //If a variable is present in config which has not been set
            //add it to the list
            if (!array_key_exists($key, $this->templateVariables)) {
                $this->{$key} = $value;
            }
        }

        //Build template
        //Check that template file exists before using
        $viewConfig = array('basePath' => $templateDir);
        $subjectView = new Zend_View($viewConfig);
        foreach ($this->templateVariables as $key => $value) {
            $subjectView->{$key} = $value;
        }
        try {
            $subject = $subjectView->render($this->templateName . '.subj');
        } catch (Zend_View_Exception $e) {
            $subject = false;
        }
        $textView = new Zend_View($viewConfig);
        foreach ($this->templateVariables as $key => $value) {
            $textView->{$key} = $value;
        }
        try {
            $text = $textView->render($this->templateName . '.txt');
        } catch (Zend_View_Exception $e) {
            $text = false;
        }
        $htmlView = new Zend_View($viewConfig);
        foreach ($this->templateVariables as $key => $value) {
            $htmlView->{$key} = $value;
        }
        try {
            $html = $htmlView->render($this->templateName . '.html');
        } catch (Zend_View_Exception $e) {
            $html = false;
        }

        //Pass variables to Zend_Mail
        //The Zend_Mail calls below are an assumed reconstruction using the
        //standard Zend_Mail API
        $mail = new Zend_Mail();
        $mail->setFrom($fromAddr);
        $mail->addTo($this->recipient);
        if ($subject !== false) {
            $mail->setSubject(trim($subject));
        }
        if ($text !== false) {
            $mail->setBodyText($text);
        }
        if ($html !== false) {
            $mail->setBodyHtml($html);
        }

        //Send email
        //$config = Zend_Registry::get('configuration');
        $transport = $config->email->transport;
        if ($transport == 'Dev') {
            $tr = new Acai_Mail_Transport_Dev;
            $mail->send($tr);
        } else {
            $mail->send();
        }
    }
}

You may want to make some changes to how the class fetches its default values depending on your setup.

Using the class is very simple. Here is the code I use to send a confirmation email to a new user.

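$emailObj = new Acai_Mail;
//$userEmail and the template name 'activation' are placeholders
$emailObj->setRecipient($userEmail);
$emailObj->setTemplate('activation');
$emailObj->activationLink = $actiUri;
$emailObj->send();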

Hopefully you find the above code of some use and can integrate it into your own projects. If you have any questions or suggestions please do post them in the comments below.

Presentation at BarcampNortheast3 - Improvised search for a private phpBB forum using PHP, MySQL and sphinx

Three weekends ago I went down to Newcastle to attend BarcampNortheast3. For anyone who doesn't know what a barcamp is, Wikipedia provides a decent explanation:

BarCamp is an international network of user generated conferences (or unconferences) - open, participatory workshop-events, whose content is provided by participants. The first BarCamps focused on early-stage web applications, and related open source technologies, social protocols, and open data formats.

The idea is that those attending also present. I decided to talk about the work I did setting up a dedicated search engine for a private phpBB powered forum as it is something not many people probably have ever needed to do. Below is the short powerpoint presentation I put together. Excessive use of powerpoint is discouraged so there isn't much in it. Further details are below.


phpBB does have its own search functionality so it's reasonable to ask why anything else is needed. Unfortunately the users of this forum had found that the site was slowing down. The problem was believed to be the search functionality, so it was set to make only the past year and future content searchable. This meant there were several years' worth of content which didn't show up in the search engine. I had arrived in this community relatively recently and so had missed a lot of good content. Initially I asked for access to the database, which would have made making the site fully searchable straightforward. Unfortunately the owner of the site didn't have the technical know-how to feel safe granting me, a relative newcomer, access to the database.

3 steps

To get around this I decided to do what Google does: fetch the site one HTML page at a time and get the content that way. The project could be broken down into 3 steps: create a mirror of the site locally, insert the content extracted from the local mirror into a database, and finally set up Sphinx to index the content.

The Problems

As the site is password protected creating the local mirror required some additional work to handle login sessions. The html of the pages throws up several errors running it through the w3c validator which created problems for extracting content. Finally all the important information in the URL is in the query string.


Wget is a really nice tool for downloading information from the internet. It was relatively easy to cajole it into handling logging in. Unfortunately I soon realized it was downloading the same content again and again. This wasn't a problem with wget but reflected the redundant linking structure of phpBB. A topic might have twenty posts and each post had a unique URL which pulled in all the content for the entire topic. Wget does allow you to filter the URLs you download but it doesn't filter on the query string, which is what I needed.


In the end I created a custom script to handle crawling the site with Zend_HTTP handling the actual HTTP requests.
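The query-string filtering that wget couldn't do can be sketched as follows. This is an illustration in Python rather than the PHP crawler itself, and the parameter names kept here (f, t and start, with everything else such as a post or session id ignored) are assumptions about the forum's URLs, as is the example hostname.

```python
try:
    from urllib.parse import urlsplit, parse_qs  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs      # Python 2

# Assumed: these parameters identify a distinct page of content.
KEEP_PARAMS = ("f", "t", "start")


def crawl_key(url):
    """Reduce a URL to its path plus only the query parameters that matter."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    kept = tuple((name, query[name][0]) for name in KEEP_PARAMS if name in query)
    return (parts.path, kept)


def should_fetch(url, seen):
    """Return True the first time a page is seen, False for redundant URLs."""
    key = crawl_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this in place, twenty per-post links into the same topic page collapse to a single request, while a link to a different page of the topic (a different start value) is still fetched.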

Scraping HTML

Running each downloaded page through the PHP Tidy extension and then feeding the resulting text to SimpleXML worked in most, though not all, cases. Since the barcamp I have re-implemented this section of the project using Python and BeautifulSoup, which was able to handle all the downloaded pages.

Releasing Sphinx

I considered three options for the search engine: two Lucene-based projects, Zend_Search and Solr, and then Sphinx. I had heard that Zend_Search would be rather slow at indexing and I felt that Solr, although the most powerful option, would be overly complex for my needs. I therefore decided to give Sphinx a try.

Sphinx is set up to be able to pull content from a MySQL database so when I had a complete copy of the forum locally I extracted all the posts to a MySQL database table ready to be fed to Sphinx. First though I had to get Sphinx talking to MySQL. To fetch content directly from a MySQL database Sphinx requires the mysql-devel package to be installed. A search using apt-get couldn't find the package. Fortunately a quick google search turned up this page which suggested a fix.

After that the only other problem was the paths in the manual didn't match up with the path to the sphinx executables on my system. The api and the example configuration files could be easily adapted for my needs and I quickly had a working search engine. The rest of the slides are self-explanatory so I'll stop here. If you have any questions post them in the comments below and I'll do my best to answer them.