Presentation at BarcampNortheast3 - Improvised search for a private phpBB forum using PHP, MySQL and sphinx

Three weekends ago I went down to Newcastle to attend BarcampNortheast3. For anyone who doesn't know what a barcamp is Wikipedia provides a decent explanation

BarCamp is an international network of user generated conferences (or unconferences) - open, participatory workshop-events, whose content is provided by participants. The first BarCamps focused on early-stage web applications, and related open source technologies, social protocols, and open data formats.

The idea is that those attending also present. I decided to talk about the work I did setting up a dedicated search engine for a private phpBB powered forum as it is something not many people probably have ever needed to do. Below is the short powerpoint presentation I put together. Excessive use of powerpoint is discouraged so there isn't much in it. Further details are below.

Background

phpBB does have its own search functionality so it's reasonable to ask why is anything else needed. Unfortunately the users of this forum had found that the site was slowing down. The problem was believed to be the search functionality and so it was set to only make the past year and future content searchable. This meant there was several years worth of content which didn't show up in the search engine. I had arrived to this community relatively recently and so had missed a lot of good content. Initially I had asked to have access to the database which would have made making the site fully searchable straightforward. Unfortunately the owner of the site didn't have the technical know how to feel safe granting me, a relative newcomer, access to the database.

3 steps

To get around this I decided to do what google does and fetch the site one html page at a time and get the content that way. The project could be broken down into 3 steps; create a mirror of the site locally, insert the content extracted from the local mirror into a database, and finally setup sphinx to index the content.

The Problems

As the site is password protected creating the local mirror required some additional work to handle login sessions. The html of the pages throws up several errors running it through the w3c validator which created problems for extracting content. Finally all the important information in the URL is in the query string.

Wget

Wget is a really nice tool for downloading information from the internet. It was relatively easy to cajole it into handling logging it. Unfortunately I soon realized it was repeatedly downloading the same content again and again. This was a problem with wget but reflected the redundant linking structure of phpBB. A topipc might have twenty posts and each post had a unique URL which pulled in all the content for the entire topic. Wget does allow you to filter the URLs you download but it doesn't filter on the query string which is what I needed.

Zend_HTTP

In the end I created a custom script to handle crawling the site with Zend_HTTP handling the actual HTTP requests.

Scraping HTML

Running each downloaded page through the PHP Tidy extension and then feeding the resulting text to SimpleXML worked in most, though not all, cases. Since the barcamp conference I have since re-implemented this section of the project using Python and BeautifulSoup which was able to handle all the downloaded pages.

Releasing Sphinx

I considered three options for the search engine; two Lucene based projects, Zend_Search and Solr, and then sphinx. I had heard that Zend_Search would be rather slow at indexing and I felt that Solr, although the most powerful option, would be overly complex for my needs. I therefore decided to give Sphinx a try.

Sphinx is set up to be able to pull content from a MySQL database so when I had a complete copy of the forum locally I extracted all the posts to a MySQL database table ready to be fed to Sphinx. First though I had to get Sphinx talking to MySQL. To fetch content directly from a MySQL database Sphinx requires the mysql-devel package to be installed. A search using apt-get couldn't find the package. Fortunately a quick google search turned up this page which suggested a fix.

After that the only other problem was the paths in the manual didn't match up with the path to the sphinx executables on my system. The api and the example configuration files could be easily adapted for my needs and I quickly had a working search engine. The rest of the slides are self-explanatory so I'll stop here. If you have any questions post them in the comments below and I'll do my best to answer them.