August | 2010 | MuckRock Reporter's Notebook

This is a technical post about how MuckRock recently integrated DocumentCloud’s services into our site using their API. If you are building a Django-based site and want to integrate DocumentCloud into it, read on!

First, you will need a DocumentCloud account. They are still in a private beta, so you will have to request an account for now. Assuming you have an account, you probably want to start by looking at the help for their API. We need to be able to make HTTP GET and POST requests – including the ability to send a file in multipart/form-data format. This isn’t straightforward to do using the Python standard libraries, so I found this MultipartPostHandler library to help out.
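
To give a sense of what a multipart/form-data request involves, here is a minimal sketch of building such a body by hand. It is not the MultipartPostHandler code; it is written for the Python 3 standard library, and the helper name encode_multipart and its arguments are made up for illustration.

```python
import uuid


def encode_multipart(fields, files):
    """Build a multipart/form-data body (hypothetical helper).

    fields: dict mapping field name -> str value
    files:  dict mapping field name -> (filename, bytes content)
    Returns a (content_type_header, body_bytes) tuple.
    """
    # a random boundary that is very unlikely to appear in the payload
    boundary = uuid.uuid4().hex
    delimiter = ('--' + boundary).encode('ascii')
    lines = []
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(('Content-Disposition: form-data; name="%s"'
                      % name).encode('utf-8'))
        lines.append(b'')  # blank line separates part headers from content
        lines.append(value.encode('utf-8'))
    for name, (filename, content) in files.items():
        lines.append(delimiter)
        lines.append(('Content-Disposition: form-data; name="%s"; '
                      'filename="%s"' % (name, filename)).encode('utf-8'))
        lines.append(b'Content-Type: application/octet-stream')
        lines.append(b'')
        lines.append(content)
    # the closing delimiter carries two extra trailing dashes
    lines.append(delimiter + b'--')
    lines.append(b'')
    body = b'\r\n'.join(lines)
    content_type = 'multipart/form-data; boundary=%s' % boundary
    return content_type, body
```

The returned content type would go in the request’s Content-Type header and the bytes in its body. A library like MultipartPostHandler takes care of this bookkeeping for you, which is why I used one.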

The way I wanted this to work for our site is to have a model that represents a DocumentCloud document, and to let staff edit these documents through our admin site without having to log in to the DocumentCloud site directly. Since the HTTP requests may take a while or error out, it is a good idea to run them asynchronously so users do not have to wait for them to finish. To do this I use celery. Setting up celery is beyond the scope of this post, but the documentation on their site is good and you should be able to follow that. Now for some code!

The model is set up to mirror the options available for a DocumentCloud document. It also has a ForeignKey to a FOIA Request model, which represents the original request on our website, and a doc_id, which stores the unique ID DocumentCloud assigns to the document:

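# models.py
from django.db import models


class FOIADocument(models.Model):
    """A DocumentCloud document attached to a FOIA request"""

    # access levels a DocumentCloud document can have
    # (named separately so it is not shadowed by the field below)
    ACCESS_CHOICES = (('public', 'Public'),
                      ('private', 'Private'),
                      ('organization', 'Organization'))

    # FOIARequest is assumed to be defined earlier in this module
    foia = models.ForeignKey(FOIARequest, related_name='documents')
    document = models.FileField(upload_to='foia_documents')
    title = models.CharField(max_length=70)
    source = models.CharField(max_length=70)
    description = models.TextField()
    access = models.CharField(max_length=12, choices=ACCESS_CHOICES)
    # filled in by the upload task once DocumentCloud assigns an ID
    doc_id = models.SlugField(max_length=80, editable=False)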

Now for the admin model to integrate it into the Django admin interface:

# admin.py
from django.contrib import admin

# assumes tasks.py lives in the foia app, next to models.py
from foia.tasks import upload_document_cloud


class FOIADocumentAdmin(admin.ModelAdmin):
    """FOIA document admin options"""
    readonly_fields = ['doc_id']
    list_display = ('title', 'foia', 'doc_id', 'description')

    def save_model(self, request, obj, form, change):
        """Save the document, then queue the DocumentCloud upload"""
        obj.save()
        # wait 3 seconds to give database a chance to sync
        upload_document_cloud.apply_async(args=[obj.pk, change],
                                          countdown=3)

We make doc_id a read-only attribute, as it will be set automatically. On a save, we first save the object to the database. Then we make an asynchronous call to upload_document_cloud, a celery task I have set up to do all the interesting stuff with the API. There is a delay of 3 seconds to make sure the database has a chance to save the object, as in development I would sometimes get errors that the object was not in the database. The call passes the primary key for the object and whether or not this is a change (as opposed to a new object).

Let’s take a look at the task now.

First we import the libraries we will be using, our DocumentCloud username and password (I put these in my settings.py), and the model from above.

# tasks.py
from celery.decorators import task
from django.core import management
from settings import DOCUMNETCLOUD_USERNAME, DOCUMENTCLOUD_PASSWORD

import base64
import json
import urllib2
from vendor import MultipartPostHandler

from foia.models import FOIADocument

The task itself starts by loading the recently saved model back from the database. We waited 3 seconds to make sure it was synced, but just to be sure we will retry if it still doesn’t exist. The default retry is 30 seconds later. We also return without doing anything if the model already has a doc_id but is not being changed. This should never happen, but it is always a good idea to code defensively.

@task(ignore_result=True)
def upload_document_cloud(doc_pk, change, **kwargs):
    """Upload a document to Document Cloud"""

    try:
        doc = FOIADocument.objects.get(pk=doc_pk)
    except FOIADocument.DoesNotExist as exc:
        # give database time to sync
        upload_document_cloud.retry(args=[doc_pk, change],
                                    kwargs=kwargs, exc=exc)

    if doc.doc_id and not change:
        # not change means we are uploading a new one -
        # it should not have an id yet
        return
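
To make the retry behavior concrete, here is a rough stand-in in plain Python for "try again a limited number of times, then give up". This is not how celery does it: celery puts the task back on the queue for a later run, while this sketch just sleeps in the same process. The name call_with_retries and its parameters are invented for the example, and it is written for Python 3.

```python
import time


def call_with_retries(func, max_retries=3, delay=0.0, retry_on=(Exception,)):
    """Call func(), retrying up to max_retries times on failure.

    Hypothetical helper: an in-process approximation of a task
    queue's retry policy, not celery's actual mechanism.
    """
    retries = 0
    while True:
        try:
            return func()
        except retry_on:
            if retries >= max_retries:
                # out of retries: let the last error propagate
                raise
            retries += 1
            time.sleep(delay)
```

With max_retries=3 the function is called at most four times: the first attempt plus three retries.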

Now we set up the parameters for our API call. They need to be coerced from unicode to regular strings due to the way they are encoded in the MultipartPostHandler; failing to do so caused encoding errors. If we are changing an existing document, we use the update API by making a PUT (we fake it using _method) to the document’s doc_id. If this is a new document, we upload the file using a POST to upload.

    # coerced from unicode to regular strings
    # in order to avoid encoding errors
    params = {
        'title': str(doc.title),
        'source': str(doc.source),
        'description': str(doc.description),
        'access': str(doc.access),
        'related_article': str('http://www.muckrock.com' +
            doc.foia.get_absolute_url()),
        }
    if change:
        params['_method'] = str('put')
        url = '/documents/%s.json' % doc.doc_id
    else:
        params['file'] = open(str(doc.document.path), 'rb')
        url = '/upload.json'
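
If you want to unit test this branching without touching Django or the network, one option is to pull it out into a small pure function. The sketch below is a suggestion rather than the code we run; build_api_call and its arguments are hypothetical names, and it is written for Python 3.

```python
def build_api_call(metadata, change, doc_id=None, file_obj=None):
    """Return (url_path, params) for an update or a first-time upload.

    Hypothetical helper mirroring the logic above:
    - change=True  -> fake a PUT to the existing document's URL
    - change=False -> POST the file to the upload endpoint
    """
    params = dict(metadata)  # copy so the caller's dict is not mutated
    if change:
        params['_method'] = 'put'
        return '/documents/%s.json' % doc_id, params
    params['file'] = file_obj
    return '/upload.json', params
```

The task would then only need to open the file, call the helper, and send the request.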

We perform the request here. urllib2 does not allow you to use the http://username:password@example.com syntax for basic authentication, so we add the header manually. Also notice that we are using https so that snoopers cannot find our account’s password. If this is a first-time upload, we parse the JSON returned by the call, assign its id attribute to the document, and save it. We also catch any errors that may happen, such as timing out due to a bad network connection, and retry the request. It will retry up to 3 times by default before giving up.

    opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
    request = urllib2.Request('https://www.documentcloud.org/api/%s' % url, params)
    # Standard HTTP basic auth encoding; unlike encodestring, b64encode
    # adds no newlines, so there is nothing to strip off the end
    auth = base64.b64encode('%s:%s' % (DOCUMNETCLOUD_USERNAME, DOCUMENTCLOUD_PASSWORD))
    request.add_header('Authorization', 'Basic %s' % auth)

    try:
        ret = opener.open(request).read()
        if not change:
            info = json.loads(ret)
            doc.doc_id = info['id']
            doc.save()
    except urllib2.URLError as exc:
        upload_document_cloud.retry(args=[doc.pk, change], kwargs=kwargs, exc=exc)
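
For reference, the basic authentication header is just the username and password joined by a colon and base64 encoded. Here is a small standalone sketch of that step, written for Python 3 (where base64 works on bytes rather than strings); the function name basic_auth_header is made up for the example.

```python
import base64


def basic_auth_header(username, password):
    """Return the value for an HTTP Basic 'Authorization' header."""
    credentials = ('%s:%s' % (username, password)).encode('utf-8')
    # b64encode returns bytes with no trailing newline
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


# The well-known example credentials from the HTTP authentication RFC:
print(basic_auth_header('Aladdin', 'open sesame'))
# prints: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the password. That is why the request goes over https.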

And that is it. If you have any questions or need help adapting any code for your particular use, feel free to contact me at mitch@muckrock.com
