Following a hiatus of a couple of years I have rejoined the competitors on Kaggle. The UPenn and Mayo Clinic Seizure Detection Challenge had 8 days left to run when I decided to participate. Given the time I had available I'm quite pleased with my final score. I finished in 27th place with a score of 0.93558. The metric used was area under the ROC curve, where 1.0 is perfect and 0.5 is no better than random.
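For anyone unfamiliar with the metric, here is a small sketch of how the area under the ROC curve can be computed from first principles. It uses the pairwise interpretation (the chance that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counting as half). The function name and toy data are my own for illustration, and this is not the competition's scoring code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels
    scores: iterable of predicted scores, higher meaning "more likely 1"
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for small illustrations like this one.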
Prompted by a post from Zac Stewart I decided to give pipelines in scikit-learn a try. The data from the challenge consisted of electroencephalogram recordings from several patients and dogs. These subjects had different numbers of channels in their recordings, so manually implementing the feature extraction would have been very slow and repetitive. Using pipelines made the process incredibly easy and allowed me to make changes quickly.
The features I used were incredibly simple. All the code is in transformers.py - I used variance, median, and the FFT which I pooled into 6 bins. No optimization of hyperparameters was attempted before I ran out of time.
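As a rough illustration (this is not the code from transformers.py), the sketch below shows how per-channel variance, median and an FFT pooled into 6 bins could be combined with a feature union inside a scikit-learn pipeline. The class names, the input shape of (segments, channels, samples), the scaler and the logistic regression classifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler


class Variance(BaseEstimator, TransformerMixin):
    """Variance of each channel; X has shape (segments, channels, samples)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.var(X, axis=2)


class Median(BaseEstimator, TransformerMixin):
    """Median of each channel."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.median(X, axis=2)


class PooledFFT(BaseEstimator, TransformerMixin):
    """FFT magnitude of each channel, averaged into n_bins frequency bins."""

    def __init__(self, n_bins=6):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        spectrum = np.abs(np.fft.rfft(X, axis=2))
        bins = np.array_split(spectrum, self.n_bins, axis=2)
        # One column per (bin, channel) pair.
        return np.hstack([b.mean(axis=2) for b in bins])


def build_features():
    # 1 + 1 + 6 = 8 features per channel, whatever the channel count.
    return make_union(Variance(), Median(), PooledFFT(n_bins=6))


def build_model():
    # The scaler and classifier are placeholders, not the competition model.
    return make_pipeline(build_features(), StandardScaler(), LogisticRegression())
```

Because every transformer works along the sample axis, the same pipeline object can be fitted to a subject with 4 channels or one with 16 without any changes, which is what makes this approach less repetitive than extracting the features by hand.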
Next time, I'll be looking for a competition with longer to run.
Over the past several months I have been working on a method for measuring fibrosis. I published an article based on this work in Physiological Reports. The journal has started a podcast series and this article featured in the second episode, in which I discussed the work with Physiological Reports editor Tom Kleyman. I have embedded the full podcast below and the article is available on the journal website.
Fibrosis is an important step in healing an injury. The scar that might form after a cut is an example of normal physiological fibrosis. Unfortunately fibrosis is not always benign. Pathological fibrosis is the deposition of excessive fibrous tissue. This interferes with healing and the function of the organ. Fibrosis is a dominant feature in the histological damage seen in many diseases. Examples include idiopathic pulmonary fibrosis, liver cirrhosis, and Crohn's disease. My interest is in chronic kidney disease.
The advanced stages of kidney disease require treatment by dialysis or kidney transplantation. Both of these options have many negative consequences. Treatments to slow the development of fibrosis would help many patients.
Accurate measurements of fibrosis are vital in treatment development. The sirius red method in this article is more reproducible and precise. I hope it will contribute to getting better treatment options to the patients that need them.
This Saturday the DC Python group ran a coding meetup. As part of the event I ran an introduction to scientific computing for about 7 people.
After a quick introduction to numpy, matplotlib, pandas and scikit-learn we decided to pick a dataset and apply some machine learning. The dataset we decided to use was from a Kaggle competition looking at the Titanic disaster. This competition had been posted to help the community get started with machine learning so it seemed perfect.
I am currently working on a fairly complex data collection task. This is the third in the past year and by now I'm reasonably comfortable handling the mechanics, especially when I can utilise tools like Scrapy, lxml and a reasonable ORM for database access. Deciding exactly what to store seems like an easy question and yet it is this question which seems to be causing me the most trouble.
The difficulty exists because, in deciding what to store, multiple competing interests need to be balanced.
Storing everything is the easiest approach to implement and lets you delay the decision about which data points you are interested in. The disadvantage of storing everything is that it can place significant demands on storage capacity and risks silent failure.
Store just what you need
Storing just the data you are interested in minimises storage requirements and makes it easier to detect failures. However, if the information you want is moved (more common with HTML scraping than with APIs), or you realise you have not been collecting everything you want, there is no way to go back and alter what you extract or how you extract it.
Failure detection is easier when storing just what you need because your expectations are more detailed. If you expect to find an integer at a certain node in the DOM and either fail to find the node or find that its content is not an integer, you can be relatively certain that there is an error. If you are storing the entire document, a request to complete a CAPTCHA or a notice that you have exceeded a rate limit may be indistinguishable from the data you are hoping to collect.
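To make the integer example concrete, here is a minimal sketch of that kind of check. It uses the standard library's html.parser rather than lxml so that it runs without extra dependencies. The names extract_int and ExtractionError, the element id and the sample pages are all hypothetical and not taken from my scraper.

```python
from html.parser import HTMLParser


class ExtractionError(ValueError):
    """Raised when the page does not contain the value we expected."""


class _TextById(HTMLParser):
    """Collect the text inside the first element with a given id.

    Simplification: assumes the target element contains no unclosed
    void tags such as a bare <br>, which would confuse the depth count.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_int(html, element_id):
    """Return the integer held in the element with this id, or fail loudly."""
    parser = _TextById(element_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ExtractionError(f"no element with id={element_id!r}")
    text = "".join(parser.chunks).strip()
    try:
        return int(text)
    except ValueError:
        raise ExtractionError(f"expected an integer, got {text!r}") from None


good_page = '<html><body><span id="count"> 42 </span></body></html>'
captcha_page = "<html><body><p>Please complete the CAPTCHA</p></body></html>"

count = extract_int(good_page, "count")
```

Calling extract_int(captcha_page, "count") raises ExtractionError straight away, whereas a scraper that saved the whole document would have stored the CAPTCHA page without complaint.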
So far I've taken an approach somewhere between these two extremes, although I doubt I am close to the optimal solution. For the current project I need to parse much of the data I am interested in so that I can collect the remainder. In this situation it feels natural to favour storing only what I intend to use, even though this decision has slowed down development.
Have you been in a similar situation and faced these same choices? Which approach did you take?
I don't use raw SQL very often, so when I do I usually end up checking the manual for the correct syntax. One query I've wanted to run a couple of times recently, and have always struggled to find the correct statement for, is checking the amount of disk space used by a table or database.
For ease of future reference, here they are.