Following a hiatus of a couple of years I have rejoined the competitors on Kaggle. The UPenn and Mayo Clinic Seizure Detection Challenge had 8 days to run when I decided to participate. Given the time I had available, I'm quite pleased with my final score: I finished in 27th place with 0.93558. The metric used was area under the ROC curve, where 1.0 is perfect and 0.5 is no better than random.
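To make the metric concrete: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. Below is a minimal pure-Python sketch of that definition. The function name roc_auc is my own, and this is not taken from the competition's scoring code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth values (both classes must be present).
    scores: iterable of predicted scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))
```

A classifier that ranks every positive above every negative scores 1.0, and one that gives every example the same score gets 0.5. The pairwise loop is quadratic, so it only suits small examples.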
Prompted by a post from Zac Stewart I decided to give pipelines in scikit-learn a try. The data from the challenge consisted of electroencephalogram recordings from several patients and dogs. These subjects had different numbers of channels in their recordings, so manually implementing the feature extraction would have been very slow and repetitive. Using pipelines made the process incredibly easy and allowed me to make changes quickly.
The features I used were very simple: variance, median, and the FFT, which I pooled into 6 bins. All the code is in transformers.py. I ran out of time before attempting any hyperparameter optimization.
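For readers who haven't used pipelines, here is a rough sketch of how features like these can be combined in scikit-learn. It is not the code from transformers.py. The function names, the array layout (clips x channels x samples), the averaging used to pool the FFT and the LogisticRegression classifier are all my assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

N_BINS = 6  # number of pooled FFT bins, as described in the post


def channel_variance(X):
    # X has shape (n_clips, n_channels, n_samples); one variance per channel.
    return X.var(axis=2)


def channel_median(X):
    # One median per channel.
    return np.median(X, axis=2)


def pooled_fft(X):
    # Magnitude spectrum per channel, averaged into N_BINS contiguous bins
    # (mean pooling is an assumption; the post only says "pooled").
    spectrum = np.abs(np.fft.rfft(X, axis=2))
    bins = np.array_split(spectrum, N_BINS, axis=2)
    pooled = np.stack([b.mean(axis=2) for b in bins], axis=2)
    return pooled.reshape(X.shape[0], -1)


def make_features():
    # The union concatenates the three feature sets side by side.
    return make_union(
        FunctionTransformer(channel_variance),
        FunctionTransformer(channel_median),
        FunctionTransformer(pooled_fft),
    )


def make_model():
    # Placeholder classifier; swap in whatever model you prefer.
    return make_pipeline(make_features(), LogisticRegression(max_iter=1000))
```

Because every function works along the last axis, the same pipeline handles a subject with 4 channels or 16 without any changes, which is what made the different channel counts painless.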
Next time, I'll be looking for a competition with longer to run.
This Saturday the DC Python group ran a coding meetup. As part of the event I ran an introduction to scientific computing for about 7 people.
After a quick introduction to numpy, matplotlib, pandas and scikit-learn we decided to pick a dataset and apply some machine learning. The dataset we decided to use was from a Kaggle competition looking at the Titanic disaster. This competition had been posted to help the community get started with machine learning so it seemed perfect.
About a month ago I came across Kaggle which provides a platform for prediction competitions. It's an interesting concept. Accurate predictions are very useful but designing systems to make such predictions is challenging. By engaging the public it's hoped that talent not normally available to the competition organiser will have a try at the problem and come up with a model which is superior to previous efforts.
Prediction is not exactly my area of expertise but I wanted to have a crack at one of the competitions currently running: predicting response to treatment in HIV patients. I haven't yet started developing a model but wanted to release the Python framework I've put together to test ideas. It can be downloaded here.
I've included a number of demonstration prediction methods: guessing randomly, assuming all patients will respond, or assuming none will respond. I suggest you start with one of these methods and then improve on it with your own attempt. The random method was my first submission which, at the time of writing, puts me in 30th position out of 33 teams. Improving on that shouldn't be difficult.
The usage of the framework isn't difficult.
>>> import bootstrap
>>> boot = bootstrap.Bootstrap("method_rand")
>>> boot.run(50)
Mean score: 0.501801084135
Standard deviation: 0.0241816159815
Maximum: 0.544554455446
Minimum: 0.442386831276
>>>
During development you can use the bootstrap class to get an idea of how well your method works, as demonstrated above. All the training data is split randomly into training and testing sets; the method is then trained on the training set and assessed on the test set. This process is repeated (50 times by default) and the scores are returned. The score will differ from the one you get when you submit, but it should give you an indication of how well you're doing.
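If you're curious what that repeated split-and-score loop looks like, here is a stripped-down sketch using only the standard library. It is not the code from the framework. The function signatures, the 30% test fraction and the use of plain accuracy as the score are assumptions on my part, and the three baseline methods simply mirror the demonstration methods described earlier.

```python
import random
import statistics


def method_rand(train, test_features, rng):
    # Baseline: ignore the training data and guess each response at random.
    return [rng.choice([0, 1]) for _ in test_features]


def method_all(train, test_features, rng):
    # Baseline: assume every patient responds.
    return [1] * len(test_features)


def method_none(train, test_features, rng):
    # Baseline: assume no patient responds.
    return [0] * len(test_features)


def accuracy(predicted, actual):
    # Fraction of predictions that match the true labels.
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def bootstrap(records, method, n_runs=50, test_fraction=0.3, seed=0):
    """Score `method` on repeated random train/test splits.

    records is a list of (features, label) pairs with labels 0 or 1.
    Returns one score per run.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = records[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        predicted = method(train, [features for features, _ in test], rng)
        scores.append(accuracy(predicted, [label for _, label in test]))
    return scores


def summarize(scores):
    # Same four figures that the framework prints after a run.
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "max": max(scores),
        "min": min(scores),
    }
```

Calling summarize(bootstrap(records, method_rand)) on your own list of records gives the mean, standard deviation, maximum and minimum over the runs. A real method would actually use the train argument, which these baselines ignore.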
>>> import submission
>>> sub = submission.Submission("method_rand")
>>> sub.run("submission1.csv")
>>>
When you are satisfied with your method you can create the file needed for submission using the above code. In this case we are sticking with the random method. The submission file is submission1.csv. Hopefully this code is useful to you and you'll submit a prediction method yourself.