Using the DocumentCloud API

This is a technical post about how MuckRock recently integrated DocumentCloud’s services into our site using their API.  If you are building a Django-based site and want to integrate DocumentCloud into it, read on!

First, you will need a DocumentCloud account.  They are still in a private beta, so you will have to request an account for now.  Assuming you have an account, you probably want to start by looking at the help for their API.  We need to be able to make HTTP GET and POST requests – including the ability to send a file in multipart/form-data format.  This isn’t straightforward to do using the Python standard libraries, so I found this MultipartPostHandler library to help out.
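
To give a sense of what such a library handles for you, here is a minimal sketch of a multipart/form-data encoder. It is not the MultipartPostHandler implementation and not code from our site. It is written for Python 3 (the code in the rest of this post is Python 2), and the helper name encode_multipart and the sample values are made up for illustration.

```python
import uuid


def encode_multipart(fields, files):
    """Build a multipart/form-data body (illustrative helper, not a library API).

    fields: dict of name -> text value
    files:  dict of name -> (filename, bytes content)
    Returns (content_type_header, body_bytes).
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(
            ('Content-Disposition: form-data; name="%s"' % name).encode("utf-8")
        )
        lines.append(b"")  # blank line separates part headers from content
        lines.append(value.encode("utf-8"))
    for name, (filename, content) in files.items():
        lines.append(delimiter)
        lines.append(
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (name, filename)).encode("utf-8")
        )
        lines.append(b"Content-Type: application/octet-stream")
        lines.append(b"")
        lines.append(content)
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    body = b"\r\n".join(lines)
    content_type = "multipart/form-data; boundary=%s" % boundary
    return content_type, body


# Sample values are placeholders, not real DocumentCloud data.
content_type, body = encode_multipart(
    {"title": "Sample document"},
    {"file": ("sample.pdf", b"%PDF-1.4 placeholder bytes")},
)
```

The returned content type goes in the request's Content-Type header and the bytes become the request body. A real library also has to deal with quoting and MIME type detection, which this sketch skips.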

Now the way I wanted this to work for our site is to have a model that represents a DocumentCloud document and allow staff to edit these documents directly through our admin site without having to log in to the DocumentCloud site.  Since the HTTP requests may take a while or error out, it is a good idea to run them asynchronously so users do not have to wait for them to finish.  To do this I use celery.  Setting up celery is beyond the scope of this post, but the documentation on their site is good and you should be able to follow that.  Now for some code!

The model mirrors the options available for a DocumentCloud document, plus a ForeignKey to a FOIARequest model, which represents the original request on our website, and a doc_id, which stores the unique ID DocumentCloud assigns to the document:

# models.py
class FOIADocument(models.Model):
    """A DocumentCloud document attached to a FOIA request"""

    # named separately so the choices tuple is not shadowed by the field
    ACCESS_CHOICES = (('public', 'Public'),
                      ('private', 'Private'),
                      ('organization', 'Organization'))

    foia = models.ForeignKey(FOIARequest, related_name='documents')
    document = models.FileField(upload_to='foia_documents')
    title = models.CharField(max_length=70)
    source = models.CharField(max_length=70)
    description = models.TextField()
    access = models.CharField(max_length=12, choices=ACCESS_CHOICES)
    doc_id = models.SlugField(max_length=80, editable=False)

Now for the admin model to integrate it into the Django admin interface:

# admin.py
class FOIADocumentAdmin(admin.ModelAdmin):
    """FOIA document admin options"""
    readonly_fields = ['doc_id']
    list_display = ('title', 'foia', 'doc_id', 'description')

    def save_model(self, request, obj, form, change):
        """Save the document, then queue the upload to DocumentCloud"""
        obj.save()
        # wait 3 seconds to give database a chance to sync
        upload_document_cloud.apply_async(args=[obj.pk, change],
                                          countdown=3)

We make doc_id a read-only attribute, as this will be set automatically.  On a save, we first save the object to store it in the database.  Then we make an asynchronous call to upload_document_cloud, which is a celery task I have set up to do all the interesting stuff with the API.  There is a delay of 3 seconds to make sure the database has a chance to save the object, as in development I would sometimes get errors that the object was not in the database.  The call passes the primary key for the object, and whether or not this is a change (as opposed to a new object).

Let’s take a look at the task now.

First we import the libraries we will be using, our DocumentCloud username and password (I put these in my settings.py), and the model from above.

# tasks.py
from celery.decorators import task
from django.core import management
from settings import DOCUMNETCLOUD_USERNAME, DOCUMENTCLOUD_PASSWORD

import base64
import json
import urllib2
from vendor import MultipartPostHandler

from foia.models import FOIADocument

The task itself starts by loading the recently saved model back from the database. We waited 3 seconds to make sure it was synced, but just to be sure we will retry if it still doesn’t exist. The default retry is 30 seconds later. We also return without doing anything if the model already has a doc_id but is not being changed. This should never happen, but it is always a good idea to code defensively.

@task(ignore_result=True)
def upload_document_cloud(doc_pk, change, **kwargs):
    """Upload a document to Document Cloud"""

    try:
        doc = FOIADocument.objects.get(pk=doc_pk)
    except FOIADocument.DoesNotExist, exc:
        # give database time to sync
        upload_document_cloud.retry(args=[doc_pk, change],
                                    kwargs=kwargs, exc=exc)

    if doc.doc_id and not change:
        # not change means we are uploading a new one -
        # it should not have an id yet
        return
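
If the retry behavior is unfamiliar, the control flow is roughly what the following sketch shows. This is only an illustration in plain Python 3 and is not how celery implements it: celery reschedules the task on a worker instead of sleeping in-process. The names call_with_retries and flaky_lookup are hypothetical.

```python
import time


def call_with_retries(func, max_retries=3, delay=0.0, retry_on=(Exception,)):
    """Call func(); on a listed exception wait `delay` seconds and retry.

    Makes at most max_retries retries after the first attempt, then
    re-raises the last exception.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay)


attempts = []


def flaky_lookup():
    """Hypothetical stand-in for a lookup that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError("row not visible yet")
    return "document"


result = call_with_retries(flaky_lookup, max_retries=3, retry_on=(LookupError,))
```

Here the lookup succeeds on its third attempt, so the caller never sees the two earlier failures. Had it kept failing, the last exception would have propagated once the retries ran out.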

Now we set up the parameters for our API call. They need to be coerced from unicode to regular strings because of the way MultipartPostHandler encodes them; failing to do so caused encoding errors. If we are changing an existing document, we use the update API by making a PUT (we fake it using _method) to the document’s doc_id URL. If this is a new document, we upload the file using a POST to upload.

    # coerced from unicode to regular strings
    # in order to avoid encoding errors
    params = {
        'title': str(doc.title),
        'source': str(doc.source),
        'description': str(doc.description),
        'access': str(doc.access),
        'related_article': str('http://www.muckrock.com' +
            doc.foia.get_absolute_url()),
        }
    if change:
        params['_method'] = str('put')
        url = '/documents/%s.json' % doc.doc_id
    else:
        params['file'] = open(str(doc.document.path), 'rb')
        url = '/upload.json'

We perform the request here. urllib2 does not allow you to use the http://username:password@example.com syntax for basic authentication, so we add the header manually. Also notice that we are using https so that snoopers cannot find our account’s password. If this is a first-time upload, once the call returns we parse the returned JSON, assign its id attribute to the document, and save it. We also catch any errors that may happen, such as timing out due to a bad network connection, and retry the request. It will retry up to 3 times by default before giving up.

    opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
    request = urllib2.Request('https://www.documentcloud.org/api/%s' % url, params)
    # This is just standard username/password encoding
    auth = base64.encodestring('%s:%s' % (DOCUMNETCLOUD_USERNAME, DOCUMENTCLOUD_PASSWORD))[:-1]
    request.add_header('Authorization', 'Basic %s' % auth)

    try:
        ret = opener.open(request).read()
        if not change:
            info = json.loads(ret)
            doc.doc_id = info['id']
            doc.save()
    except urllib2.URLError, exc:
        upload_document_cloud.retry(args=[doc.pk, change], kwargs=kwargs, exc=exc)
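
For anyone adapting the authentication step to Python 3, here is a hedged sketch of the same header built with the standard library. base64.b64encode does not append a trailing newline, so the [:-1] slice used above is not needed. The helper name basic_auth_header and the credentials are placeholders, and the request is only constructed here, never sent.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value for an HTTP Basic Authorization header."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials; the request object is built but not opened.
request = urllib.request.Request("https://www.documentcloud.org/api/upload.json")
request.add_header("Authorization",
                   basic_auth_header("user@example.com", "secret"))
```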

And that is it. If you have any questions or need help adapting any code for your particular use, feel free to contact me at mitch@muckrock.com

MuckRock, Now with More DocumentCloud Goodness!

The MuckRock team has been playing with DocumentCloud integration, and we’ve found a happy medium in terms of balancing the “request” view and “document” view. Now, whenever you have a request that has a group of documents attached to it, you can just click the name of the document you’d like to view, and you’ll be taken to the beautiful DocumentCloud/New York Times viewer.

The viewer lets you easily scroll around and read through a document, download the original PDF, search through text and much, much more, and we’d love to hear your thoughts. A big thanks to the DocumentCloud team for their incredible work, and for letting us take part in their beta!

A peek at MuckRock’s new wizard

The mission of MuckRock is to make it as simple as possible to file your freedom of information requests, and we’ve been working hard to make it easier than ever to navigate creating, tracking and sharing your requests. We have big announcements coming up soon on the “sharing” front, but right now, we wanted to give you a better look at the “creating” portion: We’ve broken the wizard down into tabbed categories with better descriptions helping you figure out exactly what kind of request you need to make to get the information you want.

You can also see a progress bar up top that shows you how far along you are in getting that request finished. You’ll find a similar progress bar on all request pages now (see here for a sample). That gives you a quick, visual way to see how far along your request is: whether it’s waiting for a response, hasn’t been sent out yet, or is still just a draft in your inbox. There’s also a new color-coding system for tracking these responses. If a response is coded green, everything’s smooth sailing; if it’s coded red, the request has been denied, the government has failed to respond within the statutory limits, or something else has gone wrong; and if it’s coded yellow, a user response is needed, such as finishing up a draft or OK’ing some proposed compromise that the government agency is requesting. You can even quickly see the status of all current requests with this system on the FOI request homepage, which is right now almost entirely green.