Transportation Techies: Capital Bikeshare TSP

The theme for this month's Transportation Techies event was Capital Bikeshare, the bike sharing service in Washington DC. Data is published for every trip and every station, which makes a wide range of analyses possible. This was the seventh event on the theme.

I had not worked with geographical or transportation data before, so I learned a lot. I treated the stations as the cities in a traveling salesperson problem and calculated the shortest route that visits every station.
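To make the idea concrete, here is a minimal sketch of the approach using a nearest-neighbour heuristic over straight-line distances. The station coordinates are illustrative, and the real project used cycling distances from routing software rather than great-circle distances:

```python
# Sketch: nearest-neighbour tour over bikeshare stations.
# Coordinates are illustrative; the real project used cycling
# distances from a routing engine, not straight lines.
import math

def haversine(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_neighbour_tour(stations):
    """Greedy tour: repeatedly visit the closest unvisited station."""
    unvisited = list(stations)
    tour = [unvisited.pop(0)]
    while unvisited:
        nearest = min(unvisited,
                      key=lambda s: haversine(tour[-1][1:], s[1:]))
        unvisited.remove(nearest)
        tour.append(nearest)
    return tour

stations = [
    ("Lincoln Memorial", 38.8893, -77.0502),
    ("Union Station", 38.8977, -77.0065),
    ("Dupont Circle", 38.9097, -77.0434),
]
for name, _, _ in nearest_neighbour_tour(stations):
    print(name)
```

A greedy heuristic like this gives a reasonable tour quickly but not an optimal one; dedicated TSP solvers do better on hundreds of stations.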

I was able to do all of this with open data and open source software, including customizing the distance calculations for cycling.

The slides I presented include links to all the data and software used. The code I wrote is available on github, including a Dockerfile for running the routing software with data for the Washington DC region.

Lightning talk slides on deep learning with keras

At the DCPython Office Hours event this month I gave a lightning talk on convolutional neural networks implemented with the keras library. The notebook is now up on github.

Deep neural networks are typically too slow to train on CPUs, so GPUs are used instead. The example in the notebook uses a relatively small network, so it should be runnable on any hardware.
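As a rough sketch of what such a notebook contains (the dataset and layer sizes here are illustrative, not necessarily those from the talk), a small convolutional network in keras looks something like this:

```python
# Sketch of a small CNN in keras, trained on MNIST digits.
# Layer sizes are illustrative, not those from the talk.
from tensorflow import keras
from tensorflow.keras import layers

# 28x28 grayscale images, labels 0-9.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0  # add channel axis

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128,
          validation_split=0.1)
```

A network this size trains in a few minutes on a CPU, which is what makes it suitable for a lightning-talk demo.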

Seizure detection challenge on kaggle

Following a hiatus of a couple of years, I have rejoined the competitors on Kaggle. The UPenn and Mayo Clinic Seizure Detection Challenge had 8 days left to run when I decided to participate, so given the time available I'm quite pleased with my final score: I finished in 27th place with 0.93558. The metric was area under the ROC curve, where 1.0 is perfect and 0.5 is no better than random.
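For reference, this is how that metric is computed with scikit-learn; the labels and scores below are a made-up illustration:

```python
# Area under the ROC curve, the competition metric.
# Labels and scores are a made-up illustration.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # 1 = seizure segment
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # classifier probabilities
print(roc_auc_score(y_true, y_score))       # ~0.89; 1.0 perfect, 0.5 random
```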

The code is now on github.

Prompted by a post from Zac Stewart, I decided to give pipelines in scikit-learn a try. The challenge data consisted of electroencephalogram recordings from several patients and dogs, and the subjects had different numbers of channels in their recordings, so implementing the feature extraction manually would have been slow and repetitive. Pipelines made the process very easy and let me make changes quickly.
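A minimal sketch of the pattern, with a hypothetical variance transformer standing in for the real feature extraction (this is not the code from the repository):

```python
# Sketch of the pipeline pattern; ChannelVariance is a hypothetical
# stand-in for the real feature extraction.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

class ChannelVariance(BaseEstimator, TransformerMixin):
    """Reduce (n_segments, n_channels, n_samples) recordings to one
    variance feature per channel, whatever n_channels happens to be."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.var(X, axis=2)

pipeline = Pipeline([
    ("variance", ChannelVariance()),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Toy data: 20 segments x 4 channels x 100 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4, 100))
y = rng.integers(0, 2, size=20)
pipeline.fit(X, y)
print(pipeline.predict_proba(X)[:3, 1])
```

Because the transformer adapts to however many channels the input has, the same pipeline can be fitted per subject, which is what removes the repetitive per-subject code.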

The features I used were very simple; all the code is in transformers.py. I used the variance, the median, and the FFT, which I pooled into 6 bins. I ran out of time before attempting any hyperparameter optimization.
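My reading of those features, sketched as a standalone function rather than the actual transformers.py (the pooling details here are assumptions):

```python
# Sketch of the features described: variance, median, and FFT
# magnitudes pooled into 6 bins per channel. My reading of the
# post, not the actual transformers.py.
import numpy as np

def segment_features(segment, n_bins=6):
    """segment: (n_channels, n_samples) array -> 1-D feature vector."""
    var = segment.var(axis=1)
    med = np.median(segment, axis=1)
    mag = np.abs(np.fft.rfft(segment, axis=1))
    # Pool FFT magnitudes into n_bins roughly equal frequency bands.
    bands = np.array_split(mag, n_bins, axis=1)
    pooled = np.column_stack([b.mean(axis=1) for b in bands])
    return np.concatenate([var, med, pooled.ravel()])

segment = np.random.randn(4, 400)        # 4 channels, 400 samples
print(segment_features(segment).shape)   # (4 + 4 + 4*6,) = (32,)
```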

Next time, I'll look for a competition with more time left to run.

Introduction to Scientific Computing at DC Python

This Saturday the DC Python group ran a coding meetup, and as part of the event I led an introduction to scientific computing for about 7 people.

After a quick introduction to numpy, matplotlib, pandas and scikit-learn, we picked a dataset and applied some machine learning. We chose the data from the Kaggle competition on the Titanic disaster, which had been posted to help the community get started with machine learning, so it seemed perfect.
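A first model of the kind we worked towards might look like the sketch below; the column names follow Kaggle's train.csv, but the exact model from the meetup may have differed:

```python
# A first pass at the Titanic data (illustrative, not the exact
# model from the meetup). Column names follow Kaggle's train.csv.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")
df["Sex"] = (df["Sex"] == "female").astype(int)   # encode as 0/1
df["Age"] = df["Age"].fillna(df["Age"].median())  # fill missing ages

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X, y = df[features], df["Survived"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```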
