In my last post, I talked about how to embrace the qualities of "good enough" Postgres full-text search. In this post, I want to talk about some of that approach's weaknesses and some ways to improve on them.
There are two primary problems:
First, we noticed early on that once the data got any larger than absolutely tiny, the database refresh would block the main web request and slow it to a crawl. This is to be expected, but it makes for a very poor experience whenever people update or add new records.

Second, whenever we want to add new tables to our materialized search view, we not only have to rewrite the view's CREATE query, we also have to completely drop and re-create our triggers (because they only execute on certain columns).
Fortunately, there are some tools that allow us to solve this problem: SQLAlchemy ORM events and Celery. We’ll use the ORM events to replace the database triggers, and Celery to mitigate slowness by punting the actual materialized view rebuild over into a separate worker process.
A chrome extension that replaces "easy to use" with "full of magic" in software documentation.
— Ben Smithgall (@bsmithgall) August 17, 2015
If you are using the app factory and blueprint patterns (which I highly recommend), plugging in Celery is a non-trivial exercise, despite what the documentation might say. One of the biggest difficulties is dealing with the fact that you might not have an application context when you create your Celery workers. Another tripping point is dealing with circular imports. We’ll walk through the approach used in the Pittsburgh Purchasing Suite.
We're going to be using redis as our broker in this instance for a few reasons: it's free on Heroku, and it can double as a lightweight cache that we'll use later. You can install Celery and its Python dependencies in one go with the command below (note: you'll need to have redis installed; if you are on OS X, you can install it using Homebrew):
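Something like this should do it (assuming pip and Homebrew; the `celery[redis]` extra pulls in Celery together with the redis client library):

```bash
brew install redis            # OS X; use your platform's package manager elsewhere
pip install "celery[redis]"
```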
We should now have Celery installed, along with the dependencies for redis. The next step is to instantiate Celery. A good way to do this is to have a make_celery factory method that lives in the same place as your app creation factory. Here's what a small application that uses this pattern might look like:
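(A minimal sketch of that layout; the module name app.py, the CONFIG environment variable, and the CELERY_BROKER_URL setting are all assumptions.)

```python
# app.py -- shared factories for the Flask app and the Celery instance
import os

from celery import Celery
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from werkzeug.utils import import_string

db = SQLAlchemy()


def get_config():
    # Read a dotted path like 'settings.Config' from the environment
    # and import the object it names.
    return import_string(os.environ.get('CONFIG', 'settings.Config'))


def create_app():
    app = Flask(__name__)
    app.config.from_object(get_config())
    db.init_app(app)
    return app


def make_celery():
    config = get_config()
    # CELERY_BROKER_URL is assumed to be defined on the config object
    celery = Celery(__name__, broker=config.CELERY_BROKER_URL)
    celery.config_from_object(config)
    return celery


celery = make_celery()
```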
Let's walk through step by step and see what is going on here. In both our app factory and our Celery factory, we use the werkzeug import_string utility to get a configuration string from our environment; an example might be settings.Config. This will look in a settings.py file for an object named Config and import it. Then we instantiate either a Celery object or a Flask object, depending on which factory method we are calling.
Now that we have a celery object and an app factory, we are ready to get started. We'll need to create an entry point that our Celery worker can use and that has access to the Flask application context:
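(A sketch; the module name celery_worker.py and the import path are assumptions based on the layout above.)

```python
# celery_worker.py -- the entry point the Celery worker boots from
from app import celery, create_app  # noqa: F401 (celery must be importable here)

app = create_app()
# Push an application context so tasks can use the database, the
# config, templated emails, and anything else that needs the app.
app.app_context().push()
```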
What does this do? It creates a new app from the app factory and pushes that app's context, allowing a Celery worker to boot with full access to the application's context. This allows us to do things that we wouldn't be able to do otherwise, like connecting to the database or sending templated emails. At this point, we should be able to start all of the pieces that we need to actually start running Celery tasks:
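(Exact commands will vary with your module names; these match the sketch above, and each runs in its own terminal.)

```bash
redis-server                             # the broker (and, later, our cache)
celery worker -A celery_worker -l info   # the worker, booted from our entry point
python manage.py runserver               # or however you normally run your Flask app
```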
You should now have everything running. Note: in development, we can use the CELERY_ALWAYS_EAGER flag to have Celery run tasks immediately, which avoids running a broker and a separate celery worker process. For now, though, we'll want to keep all three processes running to make sure that everything works as expected.
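In a development config, that could be as simple as the following (assuming the settings.Config pattern from earlier):

```python
class DevConfig(Config):
    # Run tasks inline in the calling process: no broker, no worker
    CELERY_ALWAYS_EAGER = True
```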
Now that our yak has been fully shaved, we can get started with the original purpose, writing our database refresh task! To do this, we’ll want to create a mixin for our SQLAlchemy events, and a task to perform the work itself. Let’s start with the task, as that’s what we will import to use in our database mixin.
Here’s the code for our rebuilding task:
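(A sketch: the view name search_view and the import path are assumptions, but the shape is the point.)

```python
# tasks.py
from app import celery, db


@celery.task
def rebuild_search_view():
    # CONCURRENTLY lets reads continue during the rebuild; it requires
    # a unique index on the materialized view.
    db.session.execute('REFRESH MATERIALIZED VIEW CONCURRENTLY search_view')
    db.session.commit()
    # Dispose of the engine's connection pool so no connections are
    # left hanging open after the task finishes.
    db.engine.dispose()
```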
This is fairly straightforward: we refresh the materialized view, commit, and then dispose of the connection.
Disposing of the connection is an important step here; if we fail to do it, it’s possible to leave a bunch of uncommitted connections open, which will quickly overwhelm the database and prevent it from processing other requests, including web requests. Now that we have our task, let’s write our database mixin:
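(Again a sketch: the changed-columns check follows the attribute-history pattern from the SQLAlchemy docs, and the import paths are assumptions.)

```python
# models.py
from sqlalchemy import event, inspect

from tasks import rebuild_search_view


def refresh_search_view(mapper, connection, target):
    # Only fire the (expensive) rebuild if a mapped column on this
    # object actually changed -- not merely a related object.
    state = inspect(target)
    changed = any(
        state.attrs[attr.key].history.has_changes()
        for attr in state.mapper.column_attrs
    )
    if changed:
        rebuild_search_view.delay()  # hand off to a Celery worker


class RefreshSearchViewMixin(object):
    @classmethod
    def event_handler(cls, mapper, connection, target):
        refresh_search_view(mapper, connection, target)

    @classmethod
    def __declare_last__(cls):
        # Runs once the mapper is fully configured -- a safe place
        # to attach the listeners.
        for event_name in ('after_insert', 'after_update', 'after_delete'):
            event.listen(cls, event_name, cls.event_handler)
```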
This is a bit more complicated, so let’s go through it, starting with the mixin itself.
For testing purposes, we break the actual refresh logic out of the class. In the refresh_search_view function, we use some logic from the SQLAlchemy docs to determine whether columns on the object have been modified. This is important because when we use this mixin, the event will run on modification of the original object and of any of the object's relationships, so we want to be a bit conservative about when we actually fire the rebuild. If the object in question has been changed, we fire our rebuild_search_view task, calling .delay() to tell Celery to run it asynchronously.
__declare_last__
The __declare_last__ directive is a special hook that runs after the mapper is otherwise configured, which makes it a safe place to attach event listeners. There are other ways to attach listeners, but I think this is perhaps the best way to do it with the declarative base setup. From there, the logic of the hook is simple: listen for the three 'after' events and fire the class's event_handler classmethod; in this case, that's the handler we described above.
One big area of improvement that we can take advantage of is the caching that we get by using redis as our Celery broker. We can use a cache lock and tell celery to only run the rebuild if we aren’t already doing it:
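(A sketch built on redis-py's lock support; the key name and timeout are arbitrary choices.)

```python
# tasks.py, revisited
import redis

from app import celery, db

redis_client = redis.StrictRedis()  # the same redis that brokers Celery


@celery.task
def rebuild_search_view():
    lock = redis_client.lock('rebuild-search-view', timeout=60)
    # blocking=False: if another worker holds the lock, a rebuild is
    # already in flight and we can skip this one entirely.
    if not lock.acquire(blocking=False):
        return
    try:
        db.session.execute('REFRESH MATERIALIZED VIEW CONCURRENTLY search_view')
        db.session.commit()
    finally:
        lock.release()
        db.engine.dispose()
```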
Testing is a bit difficult, but we can use mocks to good effect here:
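(A sketch, assuming a pytest-style test with an app context and a test database already set up; the model and its columns are made up.)

```python
# test_mixin.py
from app import db
from models import RefreshSearchViewMixin


class FakeModel(RefreshSearchViewMixin, db.Model):
    __tablename__ = 'fake_model'
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(255))

    called = False  # records whether the event handler fired

    @classmethod
    def event_handler(cls, mapper, connection, target):
        cls.called = True


def test_event_fires_on_insert():
    db.session.add(FakeModel(name='hello'))
    db.session.commit()
    assert FakeModel.called
```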
What we are doing here is creating a fake SQLAlchemy declarative base model as if it were a model in our normal app. We override the event_handler classmethod, having it set a called class property. Because the RefreshSearchViewMixin is applied to our class, the called property should be set to true whenever we create, update, or delete an instance of the class.
This implementation has been a big improvement for us: it has increased speed, allowed us to delegate rebuilds, and made updating our data models simpler. It does involve some additional dependencies in production, but these can be mocked out in development. If you have comments or suggestions, let me know on Twitter.