There are two primary problems:
One of the problems we noticed early on was that once the data grew beyond a trivial size, the database refresh would block the main web request and make things noticeably slow. This is to be expected, but it creates a poor experience whenever people update or add records.
Additionally, whenever we want to add new tables to our materialized search view, we not only have to rewrite the materialized view's CREATE query, we also have to drop and re-add our triggers (because they only execute on certain columns).
Fortunately, there are some tools that allow us to solve this problem: SQLAlchemy ORM events and Celery. We’ll use the ORM events to replace the database triggers, and Celery to mitigate slowness by punting the actual materialized view rebuild over into a separate worker process.
"A chrome extension that replaces 'easy to use' with 'full of magic' in software documentation." — Ben Smithgall (@bsmithgall), August 17, 2015
If you are using the app factory and blueprint patterns (which I highly recommend), plugging in Celery is a non-trivial exercise, despite what the documentation might say. One of the biggest difficulties is dealing with the fact that you might not have an application context when you create your Celery workers. Another tripping point is dealing with circular imports. We’ll walk through the approach used in the Pittsburgh Purchasing Suite.
We’re going to be using redis as our broker in this instance for a few reasons: it is free on Heroku, and it can double as a lightweight cache that we will use later. You can install Celery and its Python dependencies in one go with this command (NOTE: you’ll need to have redis installed. If you are on OSX, you can install it using homebrew):
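A minimal version of those commands, assuming homebrew on OSX and a standard pip setup, might look like:

```shell
# Install the redis server itself (OSX/homebrew shown here)
brew install redis

# The celery[redis] bundle installs Celery plus its redis dependencies
pip install "celery[redis]"
```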
We should now have Celery installed, along with the dependencies for redis. The next step is to instantiate a Celery object. A good way to do this is to have a make_celery factory method that lives in the same place as your app creation factory. Here’s what a small application that uses this pattern might look like:
Let’s walk through step-by-step and see what is going on here. In both our app factory and our Celery factory, we use the werkzeug import_string utility to get a configuration string from our environment. An example of what this might look like would be settings.Config. This will look in a settings.py file for an object named Config and import it. Then we instantiate either a Celery object or a Flask object, depending on which factory method we are calling.
Now that we have a celery object and an app factory, we are ready to get started. We’ll need to create an entry point that our Celery worker can use and that has access to the Flask application context:
What does this do? It creates a new app from the app factory and pushes that app’s context, allowing a Celery worker to boot with full access to the application’s context. This allows us to do things that we wouldn’t be able to do otherwise, like connecting to the database or sending templated emails. At this point, we should be able to start all of the pieces that we need to actually start running Celery tasks:
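One way to start the three pieces, each in its own terminal; module and command names here are illustrative and assume the celery_worker.py entry point described above:

```shell
# Terminal 1: the broker
redis-server

# Terminal 2: the Celery worker, pointed at the entry point module
celery -A celery_worker.celery worker --loglevel=info

# Terminal 3: the Flask application itself
python manage.py runserver
```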
You should now have everything running. Note: in development, we can use the CELERY_ALWAYS_EAGER flag to have Celery run tasks immediately and avoid running a broker and a separate Celery worker process. For now, though, we’ll want to keep all three running to make sure that everything works as expected.
Now that our yak has been fully shaved, we can get back to the original purpose: writing our database refresh task! To do this, we’ll want to create a mixin for our SQLAlchemy events, and a task to perform the work itself. Let’s start with the task, as that’s what we will import to use in our database mixin.
Here’s the code for our rebuilding task:
This is fairly straightforward.
Disposing of the connection is an important step here; if we fail to do it, it’s possible to leave a bunch of uncommitted connections open, which will quickly overwhelm the database and prevent it from processing other requests, including web requests. Now that we have our task, let’s write our database mixin:
This is a bit more complicated, so let’s go through it, starting with the mixin itself.
For testing purposes, we break out the actual refresh logic from the class. In the refresh_search_view function, we use some logic from the SQLAlchemy docs to determine whether columns on the object have been modified. This is important because when we use this mixin, the event will run on modification of the original object and of any of the object’s relationships. Therefore, we want to be conservative about when we actually fire the rebuild. If the object in question has changed, we fire our rebuild_search_view task, calling .delay() to tell Celery to run it asynchronously.
The __declare_last__ directive is special: it lets us attach event listeners after the mapper is otherwise fully configured. There are other ways to attach event listeners, but I think this is perhaps the best way to do it with the declarative base setup. From there, the logic of the hook says to listen for the three ‘after’ events and to fire the class’s event_handler classmethod; in this case, that’s the handler we described above.
One big area of improvement that we can take advantage of is the caching that we get by using redis as our Celery broker. We can use a cache lock and tell celery to only run the rebuild if we aren’t already doing it:
Testing is a bit difficult, but we can use mocks to good effect here:
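A sketch of such a test, assuming the mixin described above; a minimal copy of the mixin is inlined here so the example runs standalone, and the model and column names are invented for illustration:

```python
from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session, declarative_base

# Minimal copy of the mixin so this sketch is self-contained; in the
# real test suite you would import RefreshSearchViewMixin instead
class RefreshSearchViewMixin(object):
    @classmethod
    def event_handler(cls, mapper, connection, target):
        pass

    @classmethod
    def __declare_last__(cls):
        for name in ('after_insert', 'after_update', 'after_delete'):
            event.listen(cls, name, cls.event_handler)


Base = declarative_base()


class FakeModel(RefreshSearchViewMixin, Base):
    __tablename__ = 'fake_model'
    id = Column(Integer, primary_key=True)
    name = Column(String)

    called = False

    @classmethod
    def event_handler(cls, mapper, connection, target):
        # Overridden so the test records the call instead of
        # queueing a real Celery task
        cls.called = True


engine = create_engine('sqlite://')  # throwaway in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(FakeModel(name='test'))
    session.commit()

assert FakeModel.called  # the after_insert event fired the handler
```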
What we are doing here is creating a fake SQLAlchemy declarative base model as if it were a model in our normal app. We override the event_handler classmethod, having it set a called class property. Because the RefreshSearchViewMixin is applied to our class, when we create, update, or delete an instance of the class, the called property should be set to true.
This implementation has been a big improvement for us: it increased speed, lets us delegate rebuilds to background workers, and makes updating the data models simpler. It does involve some additional dependencies in production, but these can be mocked out in development. If you have comments or suggestions, let me know on Twitter.