Seizure detection challenge on Kaggle

After a hiatus of a couple of years I have rejoined the competitors on Kaggle. The UPenn and Mayo Clinic Seizure Detection Challenge had 8 days left to run when I decided to participate. For the time I had available I'm quite pleased with my final score: I finished in 27th place with 0.93558. The metric used was the area under the ROC curve, where 1.0 is a perfect score and 0.5 is no better than random.
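For anyone unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting as half. The sketch below implements that definition directly in plain Python. The function name `roc_auc` and the toy labels and scores are purely illustrative and are not taken from the competition code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by comparing every positive/negative pair.

    labels: sequence of 0/1 values, scores: predicted scores.
    O(n_pos * n_neg), which is fine for illustration but slow on large data.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
uninformative = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(perfect, uninformative, mixed)
```

A classifier that gives every example the same score lands on 0.5, which is why that value is the "no better than random" baseline.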

The code is now on GitHub.

Prompted by a post from Zac Stewart I decided to give pipelines in scikit-learn a try. The data from the challenge consisted of electroencephalogram recordings from several patients and dogs. These subjects had different numbers of channels in their recordings, so manually implementing the feature extraction would have been very slow and repetitive. Using pipelines made the process incredibly easy and allowed me to make changes quickly.

The features I used were incredibly simple. All the code is in transformers.py: I used the variance, the median, and the FFT, which I pooled into 6 bins. I ran out of time before attempting any hyperparameter optimization.
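As a rough illustration of features of this kind, the sketch below computes the variance, the median and a 6-bin pooled FFT for each channel using NumPy only. It is not the contents of transformers.py: the function name `extract_features`, the choice of mean pooling over equal-width bins, dropping the DC term, and the random data standing in for EEG segments are all assumptions made for the example.

```python
import numpy as np

def extract_features(segment, n_bins=6):
    """segment: array of shape (n_channels, n_samples).

    Returns a 1-D vector holding, per channel, the variance, the median
    and the FFT magnitude spectrum averaged into n_bins contiguous bins.
    """
    segment = np.asarray(segment, dtype=float)
    variance = segment.var(axis=1)
    median = np.median(segment, axis=1)
    # Magnitude of the real FFT along the time axis, dropping the DC term
    spectrum = np.abs(np.fft.rfft(segment, axis=1))[:, 1:]
    # Pool the spectrum into n_bins contiguous groups of frequencies
    pooled = np.stack(
        [chunk.mean(axis=1) for chunk in np.array_split(spectrum, n_bins, axis=1)],
        axis=1,
    )
    return np.concatenate([variance, median, pooled.ravel()])

# Random data standing in for two subjects with different channel counts
rng = np.random.default_rng(0)
features_16 = extract_features(rng.normal(size=(16, 400)))
features_24 = extract_features(rng.normal(size=(24, 400)))
print(features_16.shape, features_24.shape)
```

Because the function only depends on the shape of its input, the same code handles subjects with different numbers of channels, which is the property that makes this kind of step easy to drop into a pipeline.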

Next time, I'll be looking for a competition with longer to run.



Introduction to Scientific Computing at DC Python

This Saturday the DC Python group ran a coding meetup. As part of the event I ran an introduction to scientific computing for about 7 people.

After a quick introduction to numpy, matplotlib, pandas and scikit-learn we decided to pick a dataset and apply some machine learning. The dataset we decided to use was from a Kaggle competition looking at the Titanic disaster. This competition had been posted to help the community get started with machine learning so it seemed perfect.



Strategies for data collection

19 Aug. 2012

I am currently working on a fairly complex data collection task. This is the third in the past year and by now I'm reasonably comfortable handling the mechanics, especially when I can utilise tools like Scrapy, lxml and a reasonable ORM for database access. Deciding exactly what to store seems like an easy question and yet it is this question which seems to be causing me the most trouble.

The difficulty is that deciding what to store means balancing several competing interests.

Store everything

Storing everything is the easiest to implement and lets you delay the decision about which data points you are interested in. The disadvantages of storing everything are that it can place significant demands on storage capacity and that it risks silent failure.

Store just what you need

Storing just the data you are interested in minimises storage requirements and makes it easier to detect failures. However, if the information you want is moved (more common with HTML scraping than with APIs), or you realise you have not been collecting everything you want, there is no way to go back and alter what you extract or how you extract it.

Failure detection

Failure detection is easier when storing just what you need because your expectations are more detailed. If you expect to find an integer at a certain node in the DOM and either fail to find the node or find content that is not an integer, you can be relatively certain that there is an error. If you are storing the entire document, a request to complete a CAPTCHA or a notice that you have exceeded a rate limit may be indistinguishable from the data you are hoping to collect.
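The integer-at-a-node check described above can be sketched in a few lines. This example uses the standard library's ElementTree on well-formed snippets rather than lxml, and the element id `count`, the `ExtractionError` class and both sample pages are hypothetical, invented only to show the idea.

```python
import xml.etree.ElementTree as ET

class ExtractionError(Exception):
    """Raised when the page does not match what the scraper expects."""

def extract_count(document, path=".//span[@id='count']"):
    """Return the integer found at `path`, or raise ExtractionError."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as exc:
        raise ExtractionError("document could not be parsed: %s" % exc)
    node = root.find(path)
    if node is None:
        raise ExtractionError("expected node %r not found" % path)
    text = (node.text or "").strip()
    try:
        return int(text)
    except ValueError:
        raise ExtractionError("expected an integer, got %r" % text)

good_page = "<html><body><span id='count'>42</span></body></html>"
captcha_page = "<html><body><p>Please complete the CAPTCHA</p></body></html>"

count = extract_count(good_page)
print(count)

captcha_detected = False
try:
    extract_count(captcha_page)
except ExtractionError as exc:
    captcha_detected = True
    print("failure detected:", exc)
```

Had both pages simply been written to disk, nothing would have flagged the second one as a problem until the data was eventually parsed.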

So far I've taken an approach somewhere between these two extremes although I doubt I am close to the optimal solution. For the current project I need to parse much of the data I am interested in so that I can collect the remainder. It feels natural in this situation to favour storing only what I intend to use even though this decision has slowed down development.

Have you been in a similar situation and faced these same choices? Which approach did you take?



Django and Scrapy

08 Feb. 2012

I'm currently working on a project which centres around pulling in data from an external website, "mashing" it up with some additional content, and then displaying it on a website.

The website is going to be interactive and reasonably complex so I decided to use Django. There isn't a web service for the external data, so I'm stuck parsing HTML (and Excel spreadsheets, but that's a separate story). Scrapy seemed ideal for this and, although I wish I had used some approach other than XPath, it largely has been.

Having set up my database models in Django and built my spider in Scrapy, the next step was putting the data from the spider into the database. There are plenty of posts detailing how to use the Django ORM from outside a Django project, even some specific to Scrapy, but they didn't seem to be working for me.

The issue was the way I handled development and production environment settings.



Numpy talk at Python Northwest

05 Jan. 2012

Back in December I gave a talk introducing Numpy to the PyNorthwest group. The slides are available as a pdf.

Although I frequently use Numpy I'm far from an expert and the content of my talk reflected this. I started with a general introduction to the array object and then expanded the scope of the talk to highlight some of the projects that use Numpy. I gave an example of using MDP and matplotlib.

The talk was followed by some excellent discussion. We went through some of the code on slide 6 in a lot of detail.

The PyNorthwest group meets at Madlab in Manchester city centre on the third Thursday of each month. If you're in the area check it out. The January event is on the 19th, starting at 7pm.



Full text visualisation

At BarcampNortheast4 last weekend and at the Python Northwest meetup on Thursday I gave a presentation on the work I've been doing generating full text visualisations of PDF document libraries.

This was the third BarcampNortheast event I have attended. Each has been slightly different but they have all been a weekend well spent. This year felt a little smaller than previous years but that may have partly been because we were in a bigger space.

I have been attending the Python Edinburgh meetups for a while and they have always been interesting. The Northwest meetup this Thursday was the first since I moved back to the Northwest. The format, alternating talks and coding sessions, is different to Edinburgh's, which has regular pub meetups with irregular talks, coding sessions and mini-conferences. It was an interesting crowd and the other talks, on Apache Thrift and teaching programming to GCSE students (15-16 year olds), gave a really good variety of subjects to discuss later.



Images and Vision in Python: Slides from talk at Python Edinburgh Mini-Conf 2011

28 May 2011

Last weekend the Python Edinburgh users group hosted a mini-conference. Saturday morning kicked off with a series of talks, followed by sessions introducing Django and then focusing on contributing to it, prior to sprints which really got going on the Sunday.

The slides for my talk on "Images and Vision in Python" are now available in pdf format here.

The slide deck I used is relatively lightweight with my focus being on demonstrating using the different packages available. The code I went through is below.

from PIL import Image

#Open an image and show it
pil1 = Image.open('filename')
pil1.show()

#Get its size
pil1.size
#Resize
pil1s = pil1.resize((100,100))
#or - thumbnail
pil1.thumbnail((100,100), Image.ANTIALIAS)

#New image
bg = Image.new('RGB', (500,500), '#ffffff')

#Two ways of accessing the pixels
#getpixel/putpixel and load
#load is faster
pix = bg.load()

for a in range(100, 200):
	for b in range(100,110):
		pix[a,b] = (0,0,255)
bg.show()

#Drawing shapes is slightly more involved
from PIL import ImageDraw
draw = ImageDraw.Draw(bg)
draw.ellipse((300,300,320,320), fill='#ff0000')
bg.show()

from PIL import ImageFont
font = ImageFont.truetype("/usr/share/fonts/truetype/freefont/FreeSerif.ttf", 72)
draw.text((10,10), "Hello", font=font, fill='#00ff00')
bg.show()


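#Demos for vision
#np, pylab and imshow were provided by the interactive pylab session;
#they are imported explicitly here so the code also runs as a script
import numpy as np
import pylab
from pylab import imshow
from scipy import ndimage
import mahotas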
#Create a sample image
v1 = np.zeros((10,10), bool)
v1[1:4,1:4] = True
v1[4:7,2:6] = True
imshow(v1, interpolation="Nearest")
imshow(mahotas.dilate(v1), interpolation="Nearest")
imshow(mahotas.erode(v1), interpolation="Nearest")
imshow(mahotas.thin(v1), interpolation="Nearest")

#Opening, closing and top-hat as combinations of dilate and erode

#Labeling
#Latest version of mahotas has a label func
v1[8:,8:] = True
imshow(v1)
labeled, nr_obj = ndimage.label(v1)
nr_obj
imshow(labeled, interpolation="Nearest")
pylab.jet()

#Thresholding
#Convert a grayscale image to a binary image
v2 = mahotas.imread("/home/jonathan/openplaques/blueness_images/1.jpg")
T = mahotas.otsu(v2)
imshow(v2)
imshow(v2 > T)

#Distance Transforms
dist = mahotas.distance(v2 > T)
imshow(dist)
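The listing above mentions opening, closing and top-hat as combinations of dilate and erode without showing them. The sketch below spells the combinations out using NumPy only, so the mechanics are visible. The `dilate` and `erode` helpers are hypothetical stand-ins written for this example (a 3x3 cross, with pixels outside the image treated as False) and may not match the structuring element or border handling that mahotas uses.

```python
import numpy as np

def dilate(img):
    """Binary dilation with a 3x3 cross; pixels outside the image count as False."""
    p = np.pad(img, 1, mode="constant", constant_values=False)
    return (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
            | p[1:-1, :-2] | p[1:-1, 2:])

def erode(img):
    """Binary erosion with the same 3x3 cross and the same border rule."""
    p = np.pad(img, 1, mode="constant", constant_values=False)
    return (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
            & p[1:-1, :-2] & p[1:-1, 2:])

#Same sample image as in the talk
v1 = np.zeros((10, 10), bool)
v1[1:4, 1:4] = True
v1[4:7, 2:6] = True

opened = dilate(erode(v1))   #opening: erode then dilate, drops parts the cross cannot fit inside
closed = erode(dilate(v1))   #closing: dilate then erode, fills gaps the cross cannot fit inside
top_hat = v1 & ~opened       #top-hat: whatever the opening removed
print(opened.sum(), closed.sum(), top_hat.sum())
```

The opened image is always contained in the original, and for an object that does not touch the image border the closed image always contains the original, so the three results can be compared directly with imshow.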



Quick tips for data analysis in python MDP and matplotlib

19 Dec. 2010

I've been using MDP and matplotlib a lot recently and although overall I've been very pleased with the documentation for both projects I have run into a few problems for which the solutions were not immediately obvious. This post gives the solution for each in the expectation it will certainly be useful to me in the future and the hope that it may also be useful to others.

Principal Component Analysis with MDP

Data Layout

The tutorial for the Modular Toolkit for Data Processing (MDP) starts with a quick example of using the toolkit for PCA and yet I still ran into a couple of problems. The first issue I had was how the pca function expects to receive data. I suspect this is simply due to unfamiliarity with the field and the language used within the field. For future reference, each row is an observation and each column is a variable, so the data is expected to be in the following format.

                          Gene 1  Gene 2  Gene 3  Gene 4
Experimental Condition 1    .       .       .       .
Experimental Condition 2    .       .       .       .

Variance Accounted For in PC1, 2, etc

The quick start tutorial mentioned above was very helpful in getting a result out quickly, but I couldn't find a way to get a value for how much of the variance present in the data was accounted for by the principal components. To get that, as far as I've been able to determine, you need to interact with the PCANode directly rather than using the convenience function. The code is still relatively straightforward.

import mdp
import numpy as np
import matplotlib.pyplot as plt

#Create sample data
var1 = np.random.normal(loc=0., scale=0.5, size=(10,5))
var2 = np.random.normal(loc=4., scale=1., size=(10,5))
var = np.concatenate((var1,var2), axis=0)

#Create the PCA node and train it
pcan = mdp.nodes.PCANode(output_dim=3)
pcar = pcan.execute(var)

#Graph the results
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(pcar[:10,0], pcar[:10,1], 'bo')
ax.plot(pcar[10:,0], pcar[10:,1], 'ro')

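#Show variance accounted for
#pcan.d holds the eigenvalues (the variance along each component),
#not percentages of the total variance
ax.set_xlabel('PC1 (variance %.3f)' % (pcan.d[0]))
ax.set_ylabel('PC2 (variance %.3f)' % (pcan.d[1]))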

plt.show()

Running this code produces an image similar to the one below.

PCA graph
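To express the variance accounted for as a fraction of the total, the eigenvalues need to be divided by their sum over all components. The sketch below shows that calculation with NumPy alone on the same kind of sample data. It does not use MDP, and the variable names are illustrative. With `output_dim=3` a PCA node only keeps three eigenvalues, so the total has to come from the full covariance matrix, as it does here.

```python
import numpy as np

#Sample data laid out as above: rows are observations, columns are variables
rng = np.random.default_rng(0)
var1 = rng.normal(loc=0.0, scale=0.5, size=(10, 5))
var2 = rng.normal(loc=4.0, scale=1.0, size=(10, 5))
data = np.concatenate((var1, var2), axis=0)

#Eigenvalues of the covariance matrix are the variances along each PC
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  #eigvalsh sorts ascending, so reverse

#Fraction of the total variance accounted for by each component
explained = eigenvalues / eigenvalues.sum()
for i, frac in enumerate(explained[:3], start=1):
    print("PC%d: %.1f%% of the variance" % (i, 100 * frac))
```

Because the two groups of points are separated along every variable, most of the variance falls on the first component.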

Growing neural gas with MDP

The growing neural gas implementation was another sample application highlighted in the tutorial for MDP. It held my interest for a while as a technique which could potentially be applied to the transcription of plaques for the openplaques project. It wasn't immediately obvious how to get the position of a node from a connected nodes object. As the tutorial left the details of visualisation up to the user I'll present the solution to getting the node location in the form of the necessary code to visualise the node training. The end result will look something like the following.

Matplotlib

I've been using Matplotlib to plot data exclusively for a while now. The defaults produce reasonable quality graphs and any differences in opinion can be quickly fixed either by altering options in matplotlib or, as the graphs can be saved in svg format, in a vector image manipulation program such as Inkscape. Although most options can be changed in matplotlib it can sometimes be difficult to find the correct option. Most of the time the naming of variables is, to my mind, logical but sometimes I just can't find the right way to describe what I want to do.

Hiding axes

I wanted to have a grid of 6 graphs but didn't want to display the axes on all the graphs as I felt this looked cluttered.

Fixing the axis range

If I was going to display the axes on only some of the graphs then the values for the axes needed to be the same on all of them.

import numpy as np
import matplotlib.pyplot as plt

#Generate sample data
var = np.random.random_sample((40,2))

fig = plt.figure()
for i in range(4):
    ax = fig.add_subplot(2, 2, i + 1)
    start = i * 10
    ax.plot(var[start:start+10,0], var[start:start+10,1], 'bo')
    
    #Hide the x axis labels on the top row of charts
    if i in [0,1]:
        plt.setp(ax.get_xticklabels(), visible=False)
        
    #Hide the y axis labels on the right column of charts
    if i in [1,3]:
        plt.setp(ax.get_yticklabels(), visible=False)
    
    #Set the axis range
    ax.axis([0,1,0,1])
plt.show()

Running this code should produce an image similar to the one below.

Selectively displaying axes

Removing second point in plot legend

The legend assumes that values are connected, so two points and the connecting line are shown by default. If the points on the graph aren't connected this looks strange. Removing the duplicate symbol is straightforward.

import numpy as np
import matplotlib.pyplot as plt

#Generate sample data
var = np.random.random_sample((10,2))

#Plot data with labels
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(var[0:5,0], var[0:5,1], 'bo', label="First half")
ax.plot(var[5:10,0], var[5:10,1], 'r^', label="Second half")
ax.legend(numpoints=1)
plt.show()

Display one point in plot legend



AI Cookbook Competition - Month Three

A little over two months ago I wrote about the first round of the AI cookbook competition. Since then there have been two further rounds and a considerable amount of further progress. For the latest round I was able to get the error score down to 10.867 using an additional image pre-processing step and then a variety of text clean-up improvements.

Image Pre-processing

Ian, who writes the AI Cookbook, had the theory that the curved text present at the top of many of the plaques in the test set was causing tesseract, our OCR software of choice, significant problems in transcribing the main text. If we could automatically recognise the curved text and block it out the transcription should be significantly improved. In the diagram below the text we want to be transcribed is in green and the text we don't want is in red.

I couldn't think of a good method to actually recognise the curved text at the top so decided to use a 'dumb' approach. The curved text is in the same place on all the plaques so I built a system to apply the same mask to all the images. To do this I went back to what I could still remember from high school math lessons. To the probable delight of my old math teachers I quickly had some working code. The code I wrote cycles through all the pixels in the image and converts them to a distance and angle relative to the centre of the image. This process is hopefully easier to visualise in the image below. The distance is simple enough to calculate as we're dealing with a right-angle triangle; we simply square the x and y values, add them together and take the square root. The angle is a little trickier. The y-value represents the opposite length of the triangle and the x-value represents the adjacent length so from the mnemonic SOH CAH TOA we know the angle will be tan⁻¹(O/A). Knowing that, we can then apply our rules for distance and angle.
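The distance and angle calculation can be sketched as follows. This is not the competition code: the function names, the ring radii and the arc limits in `is_masked` are hypothetical values chosen only to illustrate the idea. `math.atan2` is used in place of a bare inverse tangent of O/A because it resolves the quadrant and copes with x = 0.

```python
import math

def to_polar(x, y, width, height):
    """Distance and angle (degrees) of pixel (x, y) relative to the image centre."""
    dx = x - width / 2.0
    dy = height / 2.0 - y          # image rows grow downwards, so flip y
    distance = math.hypot(dx, dy)  # sqrt(dx**2 + dy**2)
    angle = math.degrees(math.atan2(dy, dx))
    return distance, angle

def is_masked(x, y, width, height, inner=0.7, outer=1.0, arc=(30.0, 150.0)):
    """Hypothetical rule: mask a ring segment across the top of the plaque."""
    radius = min(width, height) / 2.0
    distance, angle = to_polar(x, y, width, height)
    in_ring = inner * radius <= distance <= outer * radius
    in_arc = arc[0] <= angle <= arc[1]
    return in_ring and in_arc

print(to_polar(150, 100, 200, 200))
print(is_masked(100, 20, 200, 200), is_masked(100, 180, 200, 200))
```

In this sketch a pixel near the top edge of a 200 by 200 image falls inside the masked arc while the pixel mirrored below the centre does not, which is the behaviour needed to blank the curved text while leaving the main text alone.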

Text Clean-up

The text clean-up was lots of little steps. Briefly, I've:

  1. Made various improvements to the regexes for cleaning up the years
  2. Converted any instances of 'vv' (two v's) to 'w' (one w)
  3. Switched 0 (zero) to o (letter o) in words
  4. Removed any one/two character tokens from the end of the string
  5. Improved the selection of suggestions from the spell checker
  6. Broken up long words to see if a valid word can be found in the two halves
  7. Changed "s to 's
  8. Improved correction for endings where the ending is lived|worked|died here and the spelling checker returns bad results
  9. Removed any words containing three or more of lowercase, uppercase, digits and punctuation.
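A few of the simpler steps in the list above (2, 3, 4 and 7) can be sketched with the standard library alone. The function name `clean_text`, the exact patterns and the sample string are assumptions made for illustration; the real competition code may treat the edge cases differently, for example in deciding what counts as a zero "in a word".

```python
import re

def clean_text(text):
    """Apply a few of the clean-up steps to a lowercase transcription."""
    # Step 2: two v's are usually a misread 'w'
    text = text.replace("vv", "w")
    # Step 3: a zero sandwiched between letters is almost certainly an 'o'
    text = re.sub(r"(?<=[a-z])0(?=[a-z])", "o", text)
    # Step 7: a double quote followed by s is usually a possessive
    text = re.sub(r'"s\b', "'s", text)
    # Step 4: drop one or two character tokens from the end of the string
    tokens = text.split()
    while tokens and len(tokens[-1]) <= 2:
        tokens.pop()
    return " ".join(tokens)

cleaned = clean_text('vvilliam l0ved the king"s r0ad here x, ;')
print(cleaned)
```

Blanket replacements like step 2 will occasionally damage a genuine word, which is one reason the overall error score is worth re-checking after each new rule.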

The regex for that last item is something of a monstrosity and as I'm far from an expert it wouldn't surprise me if it doesn't entirely do what I think it does. I've used whitespace to make it slightly easier to follow. Each line represents a sub-expression; if any sub-expression matches the string then the expression as a whole is considered to match. Each line matches a different combination of three from digits, lowercase, uppercase and punctuation. The .+ at the end means we match one or more of any character. The expressions in brackets starting with a question mark are lookahead assertions. The .+ still matches any character but the lookahead assertions state that at least one of the characters matched must be, for instance, a digit. It doesn't matter in what order the characters are present as long as they are present. If you suspect there is a flaw in the pattern or know some way to simplify it then I would really appreciate a quick note in the comments field below.

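import re

#The hyphen is escaped so that it is a literal character; left unescaped
#between the double quote and the comma it defines a range that also matches # $ % & ( ) * +
re.compile(r""" #matching a combination of digits, lowercase, uppercase and punctuation
            ((?=.*\d)(?=.*[a-z])(?=.*['",.\-]).+| #d,l,p
            (?=.*[A-Z])(?=.*[a-z])(?=.*['",.\-]).+| #u,l,p
            (?=.*\d)(?=.*[A-Z])(?=.*[a-z]).+| #d,u,l
            (?=.*\d)(?=.*[A-Z])(?=.*['",.\-]).+ #u,p,d
            )""", re.VERBOSE)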
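One possible simplification, offered as a sketch rather than a drop-in replacement, is to avoid lookaheads altogether and count how many character classes appear in a token. Since the four sub-expressions cover every combination of three classes out of four, "at least three classes present" should describe the same set of tokens. The names `CLASSES` and `is_suspicious` and the sample tokens are hypothetical.

```python
import re

CLASSES = [
    re.compile(r"\d"),            # digits
    re.compile(r"[a-z]"),         # lowercase
    re.compile(r"[A-Z]"),         # uppercase
    re.compile(r"""['",.\-]"""),  # punctuation, with the hyphen escaped so it is literal
]

def is_suspicious(token, threshold=3):
    """True when the token mixes at least `threshold` character classes."""
    present = sum(1 for pattern in CLASSES if pattern.search(token))
    return present >= threshold

results = {token: is_suspicious(token)
           for token in ["London", "1850-1920", "aB3", "l8.Qx", "born,"]}
print(results)
```

Counting classes also makes the threshold a parameter, so it is easy to experiment with rejecting only tokens that mix all four classes.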

That's all for now. I believe Ian is planning to run the competition for a further month and there are still considerable improvements to be made so it would be great to see more people taking part.



AI cookbook competition - transcription for the openplaques project

Ian Ozsvald over at aicookbook has been doing some work using optical character recognition (OCR) to transcribe plaques for the openplaques group. His write-ups have been interesting so when he posted a challenge to the community to improve on his demo code I decided to give it a try.

The demo code was very much a proof of principle and its score of 709.3 was easy to beat. I managed to quickly get the score down to 44 and with a little more work reached 33.4. The score is a Levenshtein distance metric so the lower the better. I was hoping to get below 30 but in the end just didn't have time. I suspect it wouldn't take a lot of work to improve on my score. Here's what I've done so far . . .

Configure the system

All the work I've done was on an Ubuntu 10.04 installation and the instructions which follow will only deal with this environment. Beyond the base install I use three different packages:

Python Imaging Library
    Used for pre-processing the images before submitting them to tesseract
Tesseract
    The OCR software used
Enchant spellchecker
    Used for cleaning up the transcribed text

Their installation is straightforward using apt-get

$ sudo apt-get install python-imaging python-enchant tesseract-ocr tesseract-ocr-eng

Fetch images

The demo code written by Ian (available here) includes a script to fetch the images from flickr. It's as simple as running the following

$ python get_plaques.py easy_blue_plaques.csv

Once the images are downloaded I suggest you go ahead and run the demo transcribing script. Again it's nice and simple

$ python plaque_transcribe_demo.py easy_blue_plaques.csv

Then you can calculate the score using

$ python summarise_results.py results.csv

Improving transcription

Ian had posted a number of good suggestions on the wiki for how to improve the transcription quality. I used three approaches:

Image preprocessing
    Cropping the image and converting it to black and white takes the score from 782 (the demo code produced a higher score on my system than it did for Ian) to 44.6
Restricting the characters tesseract will return
    Restricting the character set used by tesseract to alphanumeric characters and a limited selection of punctuation characters further lowered the score from 44.6 to 35.7
Spell checking
    Running the results from tesseract through a spell checker and filtering out some common errors brought the score down to 33.4

I'll post the entire script at the bottom of this post but want to highlight a few of the key elements first.

The first stage, cropping the image to the plaque, is handled by the function crop_to_plaque which expects a Python Imaging Library image object. The function reduces the size of the image to speed up processing before looking for blue pixels. A blue pixel is assumed to be any pixel where the value of the blue channel is more than 20% higher than both the red and green channels. The number of blue pixels in each row and column of the image is counted and then the image is cropped down to the columns where the number of blue pixels is greater than 15% of the height of the image and the rows where it is greater than 15% of the width. This value is based solely on experimentation and seemed to give good results for this selection of plaques.

The next stage, converting the image to black and white, is handled by the function convert_to_bandl which again expects a Python Imaging Library image object. The function converts any blue pixels to white and all other pixels to black. Ian has pointed out that this approach might be overly stringent and I might get better results using some grey as well. The result of running these two functions on three of the plaques is shown below.

combined_image_web.jpg

The next step was limiting the character set used by tesseract. The easiest way to do this is to create a file in /usr/share/tesseract-ocr/tessdata/configs/ which I called goodchars with the following content.

tessedit_char_whitelist 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.,()-"

That selection of characters seems to include all the characters present in the plaques. To use this limited character set the call to tesseract needs to be altered to

cmd = 'tesseract %s %s -l eng nobatch goodchars' % (filename_tif, filename_base)

Finally I perform a bunch of small clean-up tasks. Firstly I fix the year ranges, which frequently had extra spaces inserted; occasionally a 1 appeared as i or l and a 3 appeared as a parenthesis. These were fixed by a couple of regular expressions including one callback function (clean_years). Then I separate the transcription out into individual words and fix a number of further issues, including lone characters and duplicated characters, before checking the spelling on any words of more than two characters.
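To make the effect of the two year-related regular expressions concrete, here is a small self-contained demonstration. `clean_years` is a compact equivalent of the callback in the full script, while `fix_years`, the sample strings and the `WORD_SAFE` pattern are additions for illustration only. The third call shows a side effect worth knowing about: the four-character pattern is unanchored, so it also fires inside ordinary words.

```python
import re

def clean_years(m):
    """Map characters tesseract confuses with digits back to digits."""
    fixes = {"l": "1", "i": "1", ")": "3"}
    return "".join(fixes.get(ch, ch) for ch in m.group(1))

# Hypothetical tightening: skip runs that are embedded in a longer word
WORD_SAFE = r"(?<![a-z])([0-9il\)]{4})(?![a-z])"

def fix_years(text, pattern=r"([0-9il\)]{4})"):
    # Close up gaps around the hyphen in a year range first
    text = re.sub(r"(\d+)\s*-\s*(\d+)", r"\1-\2", text)
    return re.sub(pattern, clean_years, text)

year_range = fix_years("lived here 1850 - 1920")
misread = fix_years("born l8)5")
damaged = fix_years("william")
safer = fix_years("william l8)5", pattern=WORD_SAFE)
print(year_range, misread, damaged, safer, sep=" | ")
```

In the full script the spell checker runs afterwards and may repair some of the damaged words, but restricting the pattern to standalone runs avoids introducing the errors in the first place.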

Where next?

There are still lots of 'low hanging fruit' on this problem. At the moment the curved text at the top of the plaque and the small symbol at the bottom of the plaques are handled badly, and I think the bad characters at the beginning and end of the transcriptions could be easily stripped out. The spelling corrections I make do reduce the error overall but they introduce some new errors. I suspect that by being more selective about where spelling checks are made some of these introduced errors could be removed.

The entire script

import os
import sys
import csv
from PIL import Image # http://www.pythonware.com/products/pil/
from PIL import ImageFilter
import enchant
import re

# This recognition system depends on:
# http://code.google.com/p/tesseract-ocr/
# version 2.04, it must be installed and compiled already

# plaque_transcribe_test5.py
# run it with 'cmdline> python plaque_transcribe_test5.py easy_blue_plaques.csv'
# and it'll:
# 1) send images to tesseract
# 2) read in the transcribed text file
# 3) convert the text to lowercase
# 4) use a Levenshtein error metric to compare the recognised text with the
# human supplied transcription (in the plaques list below)
# 5) write error to file

# For more details see:
# http://aicookbook.com/wiki/Automatic_plaque_transcription

def load_csv(filename):
    """build plaques structure from CSV file"""
    plaques = []
    plqs = csv.reader(open(filename, 'rb'))#, delimiter=',')
    for row in plqs:
        image_url = row[1]
        text = row[2]
        # ignore id (0) and plaque url (3) for now
        last_slash = image_url.rfind('/')
        filename = image_url[last_slash+1:]
        filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
        filename = filename_base + '.tif'        
        root_url = image_url[:last_slash+1]
        plaque = [root_url, filename, text]
        plaques.append(plaque)
    return plaques

def levenshtein(a,b):
    """Calculates the Levenshtein distance between a and b
       Taken from: http://hetland.org/coding/python/levenshtein.py"""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
        
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
            
    return current[n]

def transcribe_simple(filename):
    """Convert image to TIF, send to tesseract, read the file back, clean and
    return"""
    # read in original image, save as .tif for tesseract
    im = Image.open(filename)
    filename_base = os.path.splitext(filename)[0] # turn 'abc.jpg' into 'abc'
    
    #Enhance contrast
    #contraster = ImageEnhance.Contrast(im)
    #im = contraster.enhance(3.0)
    im = crop_to_plaque(im)
    im = convert_to_bandl(im)
    
    filename_tif = 'processed' + filename_base + '.tif'
    im.save(filename_tif, 'TIFF')

    # call tesseract, read the resulting .txt file back in
    cmd = 'tesseract %s %s -l eng nobatch goodchars' % (filename_tif, filename_base)
    print "Executing:", cmd
    os.system(cmd)
    input_filename = filename_base + '.txt'
    input_file = open(input_filename)
    lines = input_file.readlines()
    line = " ".join([x.strip() for x in lines])
    input_file.close()
    # delete the output from tesseract
    os.remove(input_filename)

    # convert line to lowercase
    transcription = line.lower()
    
    #Remove gaps in year ranges
    transcription = re.sub(r"(\d+)\s*-\s*(\d+)", r"\1-\2", transcription)
    transcription = re.sub(r"([0-9il\)]{4})", clean_years, transcription)
    
    #Separate words
    d = enchant.Dict("en_GB")
    newtokens = []
    print 'Prior to post-processing: ', transcription
    tokens = transcription.split(" ")
    for token in tokens:
        if (token == 'i') or (token == 'l') or (token == '-'):
            pass
        elif token == '""':
            newtokens.append('"')
        elif token == '--':
            newtokens.append('-')
        elif len(token) > 2:
            if d.check(token):
                #Token is a valid word
                newtokens.append(token)
            else:
                #Token is not a valid word
                suggestions = d.suggest(token)
                if len(suggestions) > 0:
                    #If the spell check has suggestions take the first one
                    newtokens.append(suggestions[0])
                else:
                    newtokens.append(token)
        else:
            newtokens.append(token)
            
    transcription = ' '.join(newtokens)

    return transcription
    
def clean_years (m):
    digits = m.group(1)
    year = []
    for digit in digits:
        if digit == 'l':
            year.append('1')
        elif digit == 'i':
            year.append('1')
        elif digit == ')':
            year.append('3')
        else:
            year.append(digit)
    return ''.join(year)
    
def crop_to_plaque (srcim):
    
    scale = 0.25
    wkim = srcim.resize((int(srcim.size[0] * scale), int(srcim.size[1] * scale)))
    wkim = wkim.filter(ImageFilter.BLUR)
    #wkim.show()
    
    width = wkim.size[0]
    height = wkim.size[1]
    
    #result = wkim.copy();
    highlight_color = (255, 128, 128)
    R,G,B = 0,1,2
    lrrange = {}
    for x in range(width):
        lrrange[x] = 0
    tbrange = {}
    for y in range(height):
        tbrange[y] = 0
    
    for x in range(width):    
        for y in range(height):
            point = (x,y)
            pixel = wkim.getpixel(point)
            if (pixel[B] > pixel[R] * 1.2) and (pixel[B] > pixel[G] * 1.2):
                lrrange[x] += 1
                tbrange[y] += 1
                #result.putpixel(point, highlight_color)
    
        
    #result.show();
    
    left = 0
    right = 0 
    cutoff = 0.15      
    for x in range(width):
        if (lrrange[x] > cutoff * height) and (left == 0):
            left = x
        if lrrange[x] > cutoff * height:
            right = x

    top = 0
    bottom = 0
    for y in range(height):
        if (tbrange[y] > cutoff * width) and (top == 0):
            top = y
        if tbrange[y] > cutoff * width:
            bottom = y
    
    left = int(left / scale)
    right = int(right / scale)
    top = int(top / scale)
    bottom = int(bottom / scale)
    
    box = (left, top, right, bottom)
    region = srcim.crop(box)
    #region.show()
    
    return region
    
def convert_to_bandl (im):
    width = im.size[0]
    height = im.size[1]
    
    white = (255, 255, 255)
    black = (0, 0, 0)
    R,G,B = 0,1,2
    
    for x in range(width):
        for y in range(height):
            point = (x,y)
            pixel = im.getpixel(point)
            if (pixel[B] > pixel[R] * 1.2) and (pixel[B] > pixel[G] * 1.2):
                im.putpixel(point, white)
            else:
                im.putpixel(point, black)
    #im.show()
    return im
    

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc != 2:
        print "Usage: python plaque_transcribe_demo.py plaques.csv (e.g. \
easy_blue_plaques.csv)"
    else:
        plaques = load_csv(sys.argv[1])

        results = open('results.csv', 'w')

        for root_url, filename, text in plaques:
            print "----"
            print "Working on:", filename
            transcription = transcribe_simple(filename)
            print "Transcription: ", transcription
            print "Text: ", text
            error = levenshtein(text, transcription)
            assert isinstance(error, int)
            print "Error metric:", error
            results.write('%s,%d\n' % (filename, error))
            results.flush()
        results.close()


