Django and Scrapy

08 Feb. 2012

I'm currently working on a project which centres around pulling in data from an external website, "mashing" it up with some additional content, and then displaying it on a website.

The website is going to be interactive and reasonably complex, so I decided to use Django. There is no web service for the external data, so I'm stuck parsing HTML (and Excel spreadsheets, but that's a separate story). Scrapy seemed ideal for this and largely has been, although I wish I had used something other than XPath for the extraction.

Having set up my database models in Django and built my spider in Scrapy, the next step was getting the data from the spider into the database. There are plenty of posts detailing how to use the Django ORM from outside a Django project, some even specific to Scrapy, but they didn't seem to work for me.

The issue was the way I handled development and production environment settings.

The root directory of my Django project contains a settings.py file, a dev_settings.py file and a prod_settings.py file. The settings.py file doesn't actually define any settings itself. It always imports dev_settings first, and then, if the hostname does not match the development machine, it also imports prod_settings, which overrides much of dev_settings.

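# Development settings are always loaded first.
from dev_settings import *

import socket
# On any other host, the production settings override them.
if socket.gethostname() != 'dev_env_hostname':
    from prod_settings import *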

The solution posted on Stack Overflow (reproduced at the end of this post) imports settings.py correctly, but settings.py then fails to find dev_settings or prod_settings and raises an exception, because the Django project's root directory is not on sys.path when the module is loaded from Scrapy. The fix was to add that directory to sys.path, which was relatively simple to achieve.

import os, sys
directory, filename = os.path.split(os.path.realpath(__file__))
sys.path.append(directory)

With those three lines at the top of my settings.py file, I was able to keep using separate files for development and production while also using the Django ORM inside my Scrapy project.
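
As a rough sketch, the same idea can be wrapped in a small helper that avoids adding the directory twice if settings.py happens to be loaded more than once. The function name add_project_to_path is my own invention for illustration and is not part of Django or Scrapy.

```python
import os
import sys


def add_project_to_path(settings_file):
    """Append the directory containing settings_file to sys.path, once."""
    # Resolve symlinks so the same directory is always spelled the same way.
    directory = os.path.dirname(os.path.realpath(settings_file))
    # Skip the append if the directory is already importable.
    if directory not in sys.path:
        sys.path.append(directory)
    return directory


# In settings.py this would be called as: add_project_to_path(__file__)
```

Whether the duplicate check matters depends on how often the settings module gets loaded; the plain three-line version does the same job for a single load.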

Beyond these changes, the solution posted by bababa on Stack Overflow needed no modification. For completeness, the code added to the Scrapy settings file is below.

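def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    # Find and load the Django project's settings.py from the given directory.
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    # Configure Django to use the loaded settings module.
    setup_environ(project)

# Run once when Scrapy loads its settings.
setup_django_env('/path/to/django/project/')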
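
Once setup_django_env has run, the Django models can be imported from anywhere in the Scrapy project. One plausible place to write the scraped data is an item pipeline. The sketch below is not the code from my project: the class name DjangoWriterPipeline, the app myapp and the model Listing are all hypothetical, and it assumes the item's field names match the model's field names.

```python
class DjangoWriterPipeline(object):
    """Scrapy item pipeline that stores each scraped item through a Django model.

    `model` is any Django model class. It can be passed in directly (handy for
    tests) or left as None to be imported lazily once Django is configured.
    """

    def __init__(self, model=None):
        self.model = model

    def _get_model(self):
        if self.model is None:
            # Hypothetical app and model names; imported late so that
            # setup_django_env() has already run in the Scrapy settings.
            from myapp.models import Listing
            self.model = Listing
        return self.model

    def process_item(self, item, spider):
        # Assumes item field names match the model's field names.
        self._get_model().objects.create(**dict(item))
        # Return the item so any later pipelines still receive it.
        return item
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in the Scrapy settings file. Using create inserts a new row for every item, so a spider that is run repeatedly would probably need some form of duplicate check instead.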

Comments

witek on Feb. 8, 2012, 9:01 p.m.

You may try peewee - the one-file ORM with syntax like Django.