Introduction to Scientific Computing at DC Python
This Saturday the DC Python group ran a coding meetup. As part of the event I ran an introduction to scientific computing for about 7 people.
After a quick introduction to numpy, matplotlib, pandas and scikit-learn we decided to pick a dataset and apply some machine learning. The dataset we decided to use was from a Kaggle competition looking at the Titanic disaster. This competition had been posted to help the community get started with machine learning so it seemed perfect.
During the event
We spent most of our time getting set up and exploring the data. It was only towards the end of the day we started looking at applying machine learning.
I decided to give pandas a try importing the data from the csv file. I had not used it much previously but it worked really well.
I have placed the code in a github repository. By the end of the day we had reached this point.
This should display the following.
[0.57042253521126762, 0.54929577464788737, 0.66901408450704225, 0.70422535211267601, 0.69178082191780821] mean: 0.636947713679
[0.528169014084507, 0.49295774647887325, 0.57042253521126762, 0.55633802816901412, 0.63698630136986301] mean: 0.556974725063
The model we have created is correct 63% of the time and the random model achieves 55%. When you run this you may find that you get slightly different values due to the random component. We are doing better than would be expected by chance but hopefully further improvements can be made.
Further Improvements
Currently the parameters for the model are set at reasonable guesses but there is no optimization. Using GridSearch we can try different parameters and then choose the parameters that perform best. This raises the accuracy score to 69%.
We are not yet considering the sex of the passengers. If we fix this, by converting the text fields to 0/1 integers we improve the accuracy to 78%.
I won't develop this any further for the moment. If you want to continue expanding this an approach similar to that used for the sex can be used for the cabin and port of embarkation. Scikit-learn also makes it very easy to switch from one machine learning algorithm to another which you may want to consider.
Very cool! Just a note that, rather than grid search, which has many well-known limitations and inefficiencies, I highly recommend checking out hyperopt (http://jaberg.github.com/hyperopt/) which takes a more reasoned approach to hyperparameter optimization. It's been really helpful in my research!