Multi-Table Full Text Search with Postgres, Flask, and SQLAlchemy
June 23, 2015
One of the projects I’ve been working on during my Code for America fellowship is a search-based application over contracts in the City of Pittsburgh. A common problem among people in the City is that they need to look up information about contracts when they go to buy something. They might need to get the right identifier for the City Controller to pay the invoice, or figure out which specific height of fence post they can buy. Right now, they are often leafing through large stacks of printed documents, spreadsheets scattered across shared drives, and several websites. The goal of this project is to unify that data and make it searchable.
Over time, we have been collecting more and more information about these contracts that we want to expose via search. This information is stored in several different database tables. For a while, naïve substring-based search was fine for our purposes, but its obvious weaknesses became a blocker for our users. After reading Postgres full-text search is Good Enough! and Implementing Multi-Table Full Text Search with Postgres in Rails, I figured I would try implementing Postgres full-text search. I ended up using a very similar approach to the one outlined in the Rails post.
Why didn’t I use [Elasticsearch/Solr/some other perfect solution]? There are two reasons: first, the app (and its accompanying database) is (very) small, so going all the way to Elasticsearch or an equivalent seemed like overkill. Second, the use case is perfect for Postgres full-text search: we are using Postgres already, and I just needed to implement some additional search functionality without the added requirement of an indexing step. Overall, I used a process similar to the one outlined in the Rails post linked above – a database migration sets up the majority of the functionality, a SQLAlchemy model is created to give access to the resulting view, and everything ends up working out fine.
Writing the migration
The first piece of the puzzle is writing the migration (note: I’m using alembic to manage migrations). I needed the migration to accomplish the following things:
Materialize a search_view that gathers the searchable columns from the underlying tables into TSVector documents
Set up GIN indices on all of my TSVector columns
Create view refresh triggers on all of the appropriate tables
I’m going to include example snippets from the project. You can find the whole thing here.
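A condensed sketch of what that migration might look like is below. The contract and line_item table names and the column names are illustrative stand-ins rather than the project’s actual schema, and the revision identifiers are omitted:

```python
"""Hypothetical Alembic migration -- table and column names are illustrative."""
from alembic import op

# revision identifiers omitted for brevity


def upgrade():
    # Materialize one row per contract/line item pair, with one tsvector
    # document per searchable column.
    op.execute("""
        CREATE MATERIALIZED VIEW search_view AS
        SELECT
            row_number() OVER () AS id,
            contract.id AS contract_id,
            line_item.id AS line_item_id,
            to_tsvector('english', coalesce(contract.description, '')) AS tsv_contract_description,
            to_tsvector('english', coalesce(line_item.description, '')) AS tsv_line_item_description
        FROM contract
        LEFT JOIN line_item ON line_item.contract_id = contract.id
    """)

    # REFRESH ... CONCURRENTLY requires a unique index on the view.
    op.execute("CREATE UNIQUE INDEX search_view_id_idx ON search_view (id)")

    # GIN indices on each tsvector column keep full-text queries fast.
    op.execute("CREATE INDEX tsv_contract_description_idx ON search_view USING gin(tsv_contract_description)")
    op.execute("CREATE INDEX tsv_line_item_description_idx ON search_view USING gin(tsv_line_item_description)")

    # Refresh the view whenever one of the underlying tables changes.
    op.execute("""
        CREATE OR REPLACE FUNCTION refresh_search_view() RETURNS trigger AS $$
        BEGIN
            REFRESH MATERIALIZED VIEW CONCURRENTLY search_view;
            RETURN NULL;
        END;
        $$ LANGUAGE plpgsql
    """)

    for table in ('contract', 'line_item'):
        op.execute("""
            CREATE TRIGGER {table}_refresh_search_view
            AFTER INSERT OR UPDATE OR DELETE ON {table}
            FOR EACH STATEMENT EXECUTE PROCEDURE refresh_search_view()
        """.format(table=table))


def downgrade():
    for table in ('contract', 'line_item'):
        op.execute("DROP TRIGGER IF EXISTS {table}_refresh_search_view ON {table}".format(table=table))
    op.execute("DROP FUNCTION IF EXISTS refresh_search_view()")
    op.execute("DROP MATERIALIZED VIEW IF EXISTS search_view")
```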
This neatly accomplishes all of the tasks needed – it materializes a search_view, which will concurrently refresh whenever any of the underlying columns to be searched changes.
Set up the SQLAlchemy model
In my models, I set up a SearchView model that matches the schema of the materialized view created in the migration.
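A rough sketch of that model, reusing the illustrative column names from the migration above (db here is the Flask-SQLAlchemy instance; none of these names are necessarily the project’s own):

```python
from sqlalchemy.dialects.postgresql import TSVECTOR


class SearchView(db.Model):
    # Maps onto the search_view materialized view created in the migration;
    # SQLAlchemy happily treats it like an ordinary table.
    __tablename__ = 'search_view'

    id = db.Column(db.Integer, primary_key=True)
    contract_id = db.Column(db.Integer)
    line_item_id = db.Column(db.Integer)
    tsv_contract_description = db.Column(TSVECTOR)
    tsv_line_item_description = db.Column(TSVECTOR)
```

Once this is set up and the migration is run, we can access the data as we normally might. In a Python shell:

```python
>>> SearchView.query.count()              # one row per row in the view
>>> SearchView.query.first().contract_id  # columns behave like any other model's
```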
There is one remaining problem to be ironed out before we can move on, though: if we attempt to autogenerate any migrations from this point forward, alembic will try to create a search_view table (it can see the SearchView model but not the materialized view). We need to get alembic to ignore this. I found this gist, which describes a good solution to the problem.
In our alembic.ini:
# set tables to ignore
[alembic:exclude]
tables = search_view
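On its own, that alembic.ini section does nothing; the other half of the gist’s approach is an include_object hook in migrations/env.py that reads the setting and tells autogenerate to skip those tables. Roughly (sketched from memory of the gist, not copied from it):

```python
# migrations/env.py (excerpt)
from alembic import context

config = context.config


def get_excluded_tables():
    # Pull the table names out of the [alembic:exclude] section of alembic.ini.
    section = config.get_section('alembic:exclude') or {}
    return {name.strip() for name in section.get('tables', '').split(',') if name.strip()}

EXCLUDED_TABLES = get_excluded_tables()


def include_object(object, name, type_, reflected, compare_to):
    # Returning False tells autogenerate to pretend the object doesn't exist.
    if type_ == 'table' and name in EXCLUDED_TABLES:
        return False
    return True

# ...then pass include_object=include_object to context.configure() in
# run_migrations_online() / run_migrations_offline().
```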
This handles the migration side of things. Now we need to actually implement our search.
Setting up the search
The most recent versions of SQLAlchemy support full-text search operations, including match(), so getting set up with basic full-text search is fairly straightforward.
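For example, a basic query against one of the view’s tsvector columns might look like this (column and search-term names are the illustrative ones from above):

```python
# .match() compiles to "tsv_contract_description @@ to_tsquery('fence')" on Postgres.
results = SearchView.query.filter(
    SearchView.tsv_contract_description.match('fence')
).all()
```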
Going past this gets a bit more challenging. For example, the ts_rank function is not supported out of the box, so we have to write that ourselves:
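One way to do that (a sketch, not necessarily the post’s exact code) is to register ts_rank as a SQLAlchemy GenericFunction, so it carries a numeric return type and can be used in an order_by:

```python
from sqlalchemy.sql.functions import GenericFunction
from sqlalchemy.types import Numeric


class TSRank(GenericFunction):
    # Registers Postgres' ts_rank(tsvector, tsquery) under a 'full_text'
    # package, so it can be called as db.func.full_text.ts_rank(...) and
    # ordered on as a numeric value.
    package = 'full_text'
    name = 'ts_rank'
    type = Numeric()
```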
We also want to concatenate all of our search documents together, with weights, to get good ranking. This caused some problems: db.func.concat() returns a text/string type, coercing the TSVector columns to text, and that type coercion produces incorrect rankings. The Column-level concat() operator (which renders as Postgres’ ||), on the other hand, keeps the TSVector type intact, which is what Postgres needs to properly concatenate TSVector columns. Additionally, we have to coalesce the columns to catch NULL values. Failing to do so can really screw up the results:
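Putting those pieces together, here is a sketch of the weighted, coalesced document and a ranked query, again using the illustrative column names and the TSRank function registered above:

```python
def search_document():
    # Concatenate the tsvector columns at the Column level (|| in SQL),
    # weighting contract descriptions above line-item text and coalescing
    # NULLs so an empty column can't wipe out the whole document.
    return db.func.setweight(
        db.func.coalesce(SearchView.tsv_contract_description, ''), 'A'
    ).concat(
        db.func.setweight(
            db.func.coalesce(SearchView.tsv_line_item_description, ''), 'B'
        )
    )

search_query = db.func.to_tsquery('fence')

results = SearchView.query.filter(
    search_document().op('@@')(search_query)
).order_by(
    db.desc(db.func.full_text.ts_rank(search_document(), search_query))
).all()
```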
Now we are getting somewhere! The last thing I wanted to do was allow users to filter their search results. For example, users might only want to find contract line items that match their search term.
In order to do this, we can write a function which returns a list of columns we want to filter by, and pass that into our query:
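A sketch of what that could look like, building on the snippets above (the category names, request parameter, and columns are all illustrative):

```python
from flask import request


def get_filter_clauses(categories):
    # Translate the user's requested result types (e.g. only line items)
    # into SQLAlchemy filter clauses.
    clauses = []
    if 'line_item' in categories:
        clauses.append(SearchView.line_item_id.isnot(None))
    if 'contract' in categories:
        clauses.append(SearchView.contract_id.isnot(None))
    return clauses


query = SearchView.query.filter(
    search_document().op('@@')(db.func.to_tsquery('fence'))
)

filter_clauses = get_filter_clauses(request.args.getlist('filter'))
if filter_clauses:
    query = query.filter(db.or_(*filter_clauses))

results = query.all()
```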
When plugged into a fairly light interface with some added features, it looks something like this:
Search! Filter! Sort!
Testing
One side effect of this approach is that when writing tests, you can no longer use SQLAlchemy’s create_all() method, because it will incorrectly create search_view as a table instead of materializing the view. My solution is to run the database migrations during test setup. One thing to note, however, is that running the migrations takes longer than create_all(), so only do it in cases where you are actually testing the full-text search. Another caveat: in the tearDown method, I had no success trying to drop only the materialized view. Instead, I had to drop and recreate the whole schema to make sure the database was clean between tests. If you are using flask-migrate, you can do the following:
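Something along these lines worked for me; the create_app factory and the myapp import path are placeholders for whatever your application actually exposes:

```python
import unittest

from flask_migrate import upgrade

from myapp import create_app, db   # hypothetical app factory and db instance


class SearchTestCase(unittest.TestCase):
    def setUp(self):
        self.app = create_app('testing')   # config pointing at a test database
        self.app_context = self.app.app_context()
        self.app_context.push()
        # Run the Alembic migrations instead of db.create_all() so that
        # search_view is created as a real materialized view.
        upgrade()

    def tearDown(self):
        db.session.remove()
        # Dropping only the materialized view didn't work for me, so drop
        # and recreate the whole schema between tests.
        db.engine.execute('DROP SCHEMA public CASCADE; CREATE SCHEMA public;')
        self.app_context.pop()
```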
Some notes
As noted in the piece about a similar implementation in Rails, the approach is pretty fragile if you decide you need to add a new table later. You will have to do some pretty serious mucking around in the migrations should that become necessary.
That said, this implementation has worked out pretty well for this project, though I’m not sure it’s the best way to go about it. If you have comments or suggestions, let me know on Twitter.