Lightning talk slides on deep learning with keras

21 Aug.

At the DCPython Office Hours event this month I gave a lightning talk on convolutional neural networks implemented with the keras library. The notebook is now up on github.

Deep neural networks are typically too slow to train on CPUs, so GPUs are used instead. The example in the notebook uses a relatively small network, so it should be runnable on any hardware.
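For a sense of scale, a small convolutional network in Keras might look like the sketch below. This is not the notebook's exact model; the layer sizes and input shape are placeholders.

# A minimal CNN sketch in Keras - small enough to train on a CPU
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(8, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
#model.fit(x_train, y_train, epochs=5, batch_size=32)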


Lightning talk slides on web server log analysis with pandas

21 Aug.

At the DCPython Office Hours event in May I gave a lightning talk on using pandas to analyse nginx access logs. The notebook is now up on github.
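As a rough illustration of the approach (not the notebook's exact code; the log path and regex are assumptions based on the standard combined log format):

# Pull nginx access logs into a pandas DataFrame for analysis
import re
import pandas as pd

# Combined log format: ip, ident, user, [time], "request", status, bytes, ...
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d+) (?P<size>\d+)')

rows = []
with open('access.log') as f:   # hypothetical log path
    for line in f:
        m = pattern.match(line)
        if m:
            rows.append(m.groupdict())

df = pd.DataFrame(rows)
df['time'] = pd.to_datetime(df['time'], format='%d/%b/%Y:%H:%M:%S %z')
df['status'] = df['status'].astype(int)
print(df.groupby('status').size())   # requests per status code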


Seizure detection challenge on kaggle

Following a hiatus of a couple of years I have rejoined the competitors on kaggle. The UPenn and Mayo Clinic Seizure Detection Challenge had 8 days left to run when I decided to participate. For the time I had available I'm quite pleased with my final score: I finished in 27th place with 0.93558. The metric used was area under the ROC curve, where 1.0 is perfect and 0.5 is no better than random.

The code is now on github.

Prompted by a post from Zac Stewart I decided to give pipelines in scikit-learn a try. The data from the challenge consisted of electroencephalogram recordings from several patients and dogs. These subjects had different numbers of channels in their recordings, so manually implementing the feature extraction would have been very slow and repetitive. Using pipelines made the process incredibly easy and allowed me to make changes quickly.

The features I used were incredibly simple: variance, median, and the FFT pooled into 6 bins. All the code is in the repository linked above. No hyperparameter optimization was attempted before I ran out of time.
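A minimal sketch of the idea is below. It is not the competition code; the transformer, classifier choice, and array shapes are assumptions, but it shows how a pipeline keeps simple feature extraction reusable across subjects with different channel counts (one pipeline fitted per subject).

# Simple per-clip features fed through a scikit-learn pipeline
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

class SimpleFeatures(BaseEstimator, TransformerMixin):
    """Variance, median and a binned FFT magnitude for each channel."""
    def __init__(self, n_bins=6):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X: array of shape (n_clips, n_channels, n_samples) for one subject
        feats = []
        for clip in X:
            fft_mag = np.abs(np.fft.rfft(clip, axis=-1))
            # pool the FFT magnitudes into a fixed number of bins per channel
            bins = [b.mean(axis=-1)
                    for b in np.array_split(fft_mag, self.n_bins, axis=-1)]
            feats.append(np.concatenate([clip.var(axis=-1),
                                         np.median(clip, axis=-1),
                                         np.concatenate(bins)]))
        return np.array(feats)

pipeline = Pipeline([
    ('features', SimpleFeatures(n_bins=6)),
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])
#pipeline.fit(train_clips, train_labels)
#pipeline.predict_proba(test_clips)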

Next time, I'll be looking for a competition with longer left to run.


Introduction to Scientific Computing at DC Python

This Saturday the DC Python group ran a coding meetup. As part of the event I ran an introduction to scientific computing for about 7 people.

After a quick introduction to numpy, matplotlib, pandas and scikit-learn we decided to pick a dataset and apply some machine learning. The dataset we decided to use was from a Kaggle competition looking at the Titanic disaster. This competition had been posted to help the community get started with machine learning so it seemed perfect.

Continue reading . . .

Strategies for data collection

19 Aug.

I am currently working on a fairly complex data collection task. It is the third in the past year, and by now I'm reasonably comfortable handling the mechanics, especially when I can utilise tools like Scrapy, lxml and a reasonable ORM for database access. Deciding exactly what to store sounds like an easy question, and yet it is the one causing me the most trouble.

The difficulty arises because deciding what to store means balancing multiple competing interests.

Store everything

Storing everything is the easiest to implement and lets you delay the decision about which data points you are interested in. The disadvantages are that it can place significant demands on storage capacity and that it risks silent failure.

Store just what you need

Storing just the data you are interested in minimises storage requirements and makes it easier to detect failures. However, if the information you want is moved (more common for html scraping than for APIs), or you realise you have not been collecting everything you want, there is no way to go back and alter what you extract or how you extract it.

Failure detection

Failure detection is easier when storing just what you need because your expectations are more detailed. If you expect to find an integer at a certain node in the DOM and either fail to find the node or find content that is not an integer, you can be relatively certain that there is an error. If you are storing the entire document, a request to complete a CAPTCHA or a notice that you have exceeded a rate limit may be indistinguishable from the data you are hoping to collect.
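As a rough illustration of that kind of check (the XPath and field are hypothetical, not from the actual project):

# Fail loudly if the expected integer is missing or malformed, rather than
# silently storing a CAPTCHA or rate-limit page.
from lxml import html

def extract_count(page_source):
    tree = html.fromstring(page_source)
    nodes = tree.xpath('//span[@id="result-count"]/text()')  # hypothetical node
    if not nodes:
        raise ValueError("Expected node not found - page layout may have changed")
    try:
        return int(nodes[0].strip())
    except ValueError:
        raise ValueError("Node content is not an integer: %r" % nodes[0])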

So far I've taken an approach somewhere between these two extremes, although I doubt I am close to the optimal solution. For the current project I need to parse much of the data I am interested in so that I can collect the remainder. It feels natural in this situation to favour storing only what I intend to use, even though this decision has slowed down development.

Have you been in a similar situation and faced these same choices? Which approach did you take?


Disk space used by databases and tables in MySQL and PostgreSQL

07 July

I don't use raw SQL very often, so when I do I usually end up checking the manual for the correct syntax. One thing I've wanted to do a couple of times recently, and always struggled to find the correct statement for, is checking the amount of disk space used by a table or database.

For ease of future reference, here they are.
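They will be along these lines; the sketch below is from memory rather than the post, run here through a DB-API connection, and the database and table names are placeholders.

# Disk usage via PostgreSQL's built-in size functions (psycopg2)
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

cur.execute("SELECT pg_size_pretty(pg_database_size('mydb'))")
print(cur.fetchone()[0])   # size of the whole database
cur.execute("SELECT pg_size_pretty(pg_total_relation_size('mytable'))")
print(cur.fetchone()[0])   # table size including indexes

# MySQL equivalent (run against a MySQL connection):
# SELECT table_schema,
#        ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
# FROM information_schema.tables
# GROUP BY table_schema;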

Continue reading . . .

Django and Scrapy

08 Feb.

I'm currently working on a project which centres around pulling in data from an external website, "mashing" it up with some additional content, and then displaying it on a website.

The website is going to be interactive and reasonably complex, so I decided to use django. There isn't a webservice for the external data, so I'm stuck parsing html (and excel spreadsheets, but that's a separate story). Scrapy seemed ideal for this and, although I wish I had used some approach other than xpath, it largely has been.

Having set up my database models in django and built my spider in scrapy, the next step was putting the data from the spider into the database. There are plenty of posts detailing how to use the django ORM from outside a django project, even some specific to scrapy, but they didn't seem to work for me.

The issue was the way I handled development and production environment settings.
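For reference, the pattern those posts describe looks roughly like this. It is only a sketch; the project, app, and model names are hypothetical, django.setup() applies to recent Django versions, and the settings module is exactly the piece that tripped me up.

# Using the django ORM from a scrapy item pipeline
import os
import django

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
django.setup()  # required on recent Django before importing models

from myapp.models import MyModel  # hypothetical app and model

class DjangoWriterPipeline(object):
    def process_item(self, item, spider):
        MyModel.objects.get_or_create(name=item['name'],
                                      defaults={'value': item['value']})
        return item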

Continue reading . . .

Numpy talk at Python Northwest

05 Jan.

Back in December I gave a talk introducing Numpy to the PyNorthwest group. The slides are available as a pdf.

Although I frequently use Numpy I'm far from an expert and the content of my talk reflected this. I started with a general introduction to the array object and then expanded the scope of the talk to highlight some of the projects that use Numpy. I gave an example of using MDP and matplotlib.

The talk was followed by some excellent discussion. We went through some of the code on slide 6 in a lot of detail.

The PyNorthwest group meets at Madlab in Manchester city centre on the third Thursday of each month. If you're in the area check it out. The January event is on the 19th, starting at 7pm.


Full text visualisation

At BarcampNortheast4 last weekend and at the Python Northwest meetup on Thursday I gave a presentation on the work I've been doing generating full text visualisations of PDF document libraries.

This was the third BarcampNortheast event I have attended. Each has been slightly different but they have all been a weekend well spent. This year felt a little smaller than previous years but that may have partly been because we were in a bigger space.

I have been attending the Python Edinburgh meetups for a while. They have always been interesting, and the Northwest meetup this Thursday was the first since I moved back to the Northwest. The format, alternating talks and coding sessions, differs from Edinburgh's regular pub meetups with irregular talks, coding sessions and miniconferences. It was an interesting crowd, and the other talks, on Apache Thrift and teaching programming to GCSE students (15-16 year olds), gave a really good variety of subjects to discuss later.

Continue reading . . .

Images and Vision in Python: Slides from talk at Python Edinburgh Mini-Conf 2011

28 May

Last weekend the Python Edinburgh users group hosted a mini-conference. Saturday morning kicked off with a series of talks, followed by sessions introducing django and then focusing on contributing to it, before sprints which really got going on the Sunday.

The slides for my talk, "Images and Vision in Python", are now available in pdf format here.

The slide deck I used is relatively lightweight, with my focus on demonstrating the different packages available. The code I went through is below.

from PIL import Image

#Open an image and show it
pil1 = Image.open('filename')
pil1.show()

#Get its size
print(pil1.size)

#Resize to fixed dimensions
pil1s = pil1.resize((100,100))
#or - thumbnail (modifies the image in place, preserving aspect ratio)
pil1.thumbnail((100,100), Image.ANTIALIAS)

#New image
bg = Image.new('RGB', (500,500), '#ffffff')

#Two ways of accessing the pixels
#getpixel/putpixel and load
#load is faster
pix = bg.load()

for a in range(100, 200):
	for b in range(100,110):
		pix[a,b] = (0,0,255)

#Drawing shapes is slightly more involved
from PIL import ImageDraw
draw = ImageDraw.Draw(bg)
draw.ellipse((300,300,320,320), fill='#ff0000')

from PIL import ImageFont
font = ImageFont.truetype("/usr/share/fonts/truetype/freefont/FreeSerif.ttf", 72)
draw.text((10,10), "Hello", font=font, fill='#00ff00')

#Demos for vision
import numpy as np
from matplotlib.pyplot import imshow
from scipy import ndimage
import mahotas

#Create a sample image
v1 = np.zeros((10,10), bool)
v1[1:4,1:4] = True
v1[4:7,2:6] = True
imshow(v1, interpolation="nearest")
imshow(mahotas.dilate(v1), interpolation="nearest")
imshow(mahotas.erode(v1), interpolation="nearest")
imshow(mahotas.thin(v1), interpolation="nearest")

#Opening, closing and top-hat as combinations of dilate and erode

#Latest version of mahotas has a label func
v1[8:,8:] = True
labeled, nr_obj = ndimage.label(v1)
imshow(labeled, interpolation="nearest")

#Convert a grayscale image to a binary image with Otsu thresholding
#(otsu expects an integer grayscale image, so flatten RGB input first)
v2 = mahotas.imread("/home/jonathan/openplaques/blueness_images/1.jpg")
if v2.ndim == 3:
	v2 = v2.mean(axis=2).astype(np.uint8)
T = mahotas.otsu(v2)
imshow(v2 > T)

#Distance Transforms
dist = mahotas.distance(v2 > T)

