One of the projects I’ve been working on during my Code for America fellowship is a search-based application over contracts in the City of Pittsburgh. A common problem among people in the City is that they need to look up information about contracts when they go to buy something. They might need to get the right identifier for the City Controller to pay the invoice, or figure out which specific height of fence post they can buy. Right now, they are often leafing through large stacks of printed documents, spreadsheets scattered across shared drives, and several websites. The goal of this project is to unify that data and make it searchable.
Over time, we have been collecting more and more information about these contracts that we want to expose via search. This information is stored in several different database tables. For a while, using naïve substring-based search was fine for our purposes, but its obvious weaknesses have been a blocker for our users. After reading Postgres full-text search is Good Enough!, and Implementing Multi-Table Full Text Search with Postgres in Rails, I figured that I would try implementing Postgres full-text search. I ended up using a very similar approach to the one outlined in the Rails post.
Why didn’t I use [ElasticSearch/Solr/some other perfect solution]? There are two reasons: first, the app (and its accompanying database) is (very) small, so going all the way to Elasticsearch or an equivalent seemed like overkill. Second, the use case is perfect for Postgres full-text search: we are using Postgres already, and I just needed to implement some additional search functionality without the added requirement of an indexing step. Overall, I used a process similar to the one outlined in the Rails post linked above – a database migration sets up the majority of the functionality, a SQLAlchemy model is created to give access to the view, and everything ends up working out fine.
The first piece of the puzzle is writing the migration (note: I’m using alembic to manage migrations). I needed the migration to accomplish the following things:
- create a materialized view that pulls the searchable columns together
- create indexes on that view: a unique index on the id (required for concurrent refreshes) and GIN indexes on the tsvector columns
- create triggers on the underlying tables that refresh the view whenever one of the searched columns changes
I’m going to include example snippets from the project; you can find the whole thing here. The full migration code is here:
from alembic import op
import sqlalchemy as sa
trigger_tuples = [
    ('contract', 'description'),
    ('company', 'company_name'),
    ('contract_property', 'value'),
    ('line_item', 'description'),
]

index_set = [
    'tsv_contract_description',
    'tsv_company_name',
    'tsv_detail_value',
    'tsv_line_item_description',
]
def upgrade():
    conn = op.get_bind()
    # create the materialized view
    conn.execute(sa.sql.text('''
        CREATE MATERIALIZED VIEW search_view AS (
            SELECT contract.id as contract_id,
            [...]
        )
    '''))
    # create unique index on ids
    op.create_index(op.f('ix_search_view_id'), 'search_view', ['id'], unique=True)
    # create remaining indices on the tsv columns
    for index in index_set:
        op.create_index(
            op.f('ix_tsv_{}'.format(index)), 'search_view', [index], postgresql_using='gin'
        )
    # for the triggers, we need a function that refreshes the materialized view
    conn.execute(sa.sql.text('''
        CREATE OR REPLACE FUNCTION trig_refresh_search_view() RETURNS trigger AS
        $$
        BEGIN
            REFRESH MATERIALIZED VIEW CONCURRENTLY search_view;
            RETURN NULL;
        END;
        $$
        LANGUAGE plpgsql;
    '''))
    for table, column in trigger_tuples:
        conn.execute(sa.sql.text('''
            DROP TRIGGER IF EXISTS tsv_{table}_{column}_trigger ON {table}
        '''.format(table=table, column=column)))
        conn.execute(sa.sql.text('''
            CREATE TRIGGER tsv_{table}_{column}_trigger AFTER TRUNCATE OR INSERT OR DELETE OR UPDATE OF {column}
            ON {table} FOR EACH STATEMENT
            EXECUTE PROCEDURE trig_refresh_search_view()
        '''.format(table=table, column=column)))
This neatly accomplishes all of the tasks needed – it materializes a search_view, which will concurrently refresh whenever any of the underlying columns to be searched changes.
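For completeness, the downgrade path needs to undo that work in reverse: drop the triggers, then the trigger function, then the view itself. This is a sketch rather than the project's exact migration, but it shows the shape:

def downgrade():
    conn = op.get_bind()
    # drop the per-table triggers first, since they depend on the function
    for table, column in trigger_tuples:
        conn.execute(sa.sql.text('''
            DROP TRIGGER IF EXISTS tsv_{table}_{column}_trigger ON {table}
        '''.format(table=table, column=column)))
    # then the trigger function itself
    conn.execute(sa.sql.text('DROP FUNCTION IF EXISTS trig_refresh_search_view()'))
    # dropping the materialized view also drops the indexes created on it
    conn.execute(sa.sql.text('DROP MATERIALIZED VIEW IF EXISTS search_view'))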
In my models, I set up a SearchView model that matches the schema of the materialized view created in the migration.
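A minimal sketch of what that model might look like, assuming flask-sqlalchemy and the column names used in the view above (the exact columns depend on how you define the view):

from sqlalchemy.dialects.postgresql import TSVECTOR

class SearchView(db.Model):
    # read-only model mapped to the search_view materialized view
    __tablename__ = 'search_view'

    id = db.Column(db.Text, primary_key=True)
    contract_id = db.Column(db.Integer)
    contract_description = db.Column(db.Text)
    tsv_contract_description = db.Column(TSVECTOR)
    company_name = db.Column(db.Text)
    tsv_company_name = db.Column(TSVECTOR)
    detail_value = db.Column(db.Text)
    tsv_detail_value = db.Column(TSVECTOR)
    line_item_description = db.Column(db.Text)
    tsv_line_item_description = db.Column(TSVECTOR)

Once this is set up and the migration is run, we can access the data as we normally might. In a python shell: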
>>> search_result = SearchView.query.first()
>>> search_result
# <purchasing.data.models.SearchView object at 0x1070d2710>
>>> type(search_result)
# <class 'purchasing.data.models.SearchView'>
# the city of pittsburgh has two contracts related to scuba gear,
# one for actual gear, one for maintenance.
>>> print db.session.query(SearchView.contract_description, SearchView.company_name).filter(SearchView.contract_description.match('scuba')).all()
# [(u'SERVICE OF SCUBA EQUIPMENT', u'SPLASH WATER SPORTS'), (u'SCUBA SUPPLIES', u'DIVE RESCUE INTERNATIONAL'), (u'SCUBA SUPPLIES', u'SPLASH WATER SPORTS')]
There is one remaining problem to be ironed out before we can move on, though: if we attempt to do any migrations from this point forward, alembic will generate a SearchView table. We need to get alembic to ignore this. I found this gist, which describes a good solution to this problem.
In our alembic.ini:
# set tables to ignore
[alembic:exclude]
tables = search_view
In our env.py (see here for full changes):
def exclude_tables_from_config(config_):
    tables_ = config_.get('tables', None)
    if tables_ is not None:
        tables = tables_.split(',')
    return tables

exclude_tables = exclude_tables_from_config(config.get_section('alembic:exclude'))

def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name in exclude_tables:
        return False
    else:
        return True
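For the exclusion to actually take effect, include_object has to be passed into alembic's context configuration in env.py. The variable names below follow alembic's default env.py template, so your file may look slightly different:

# in env.py, inside run_migrations_online()
context.configure(
    connection=connection,
    target_metadata=target_metadata,
    include_object=include_object
)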
This handles the migration side of things. Now we need to actually implement our search.
The most recent versions of sqlalchemy support full-text search operations, including match, so getting set up with basic full-text search is fairly straightforward.
# db is an instance of <class 'flask_sqlalchemy.SQLAlchemy'>
# search_term is a string that will get converted to a `tsquery` by sqlalchemy
def search(search_term):
    return db.session.query(
        db.distinct(SearchView.contract_id)
    ).filter(db.or_(
        SearchView.tsv_company_name.match(search_term, postgresql_regconfig='english'),
        SearchView.tsv_line_item_description.match(search_term, postgresql_regconfig='english'),
        SearchView.tsv_contract_description.match(search_term, postgresql_regconfig='english'),
        SearchView.tsv_detail_value.match(search_term, postgresql_regconfig='english')
    )).all()
print search('scuba')
# [(548,), (541,)]
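If you're curious what match() is actually doing here, you can compile the query against the Postgres dialect; with postgresql_regconfig set, it should render as an @@ comparison against to_tsquery (the output below is approximate, and the bind parameter name will vary):

from sqlalchemy.dialects import postgresql

query = db.session.query(db.distinct(SearchView.contract_id)).filter(
    SearchView.tsv_company_name.match('scuba', postgresql_regconfig='english')
)
print query.statement.compile(dialect=postgresql.dialect())
# SELECT DISTINCT search_view.contract_id
# FROM search_view
# WHERE search_view.tsv_company_name @@ to_tsquery('english', %(tsv_company_name_1)s)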
Going past this gets a bit more challenging. For example, the ts_rank function is not supported out of the box, so we have to write that ourselves:
from sqlalchemy.sql.functions import GenericFunction

class TSRank(GenericFunction):
    package = 'full_text'
    name = 'ts_rank'

# TSRank is now available as db.func.full_text.ts_rank()
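Even though we access it through the full_text package, the registered function still renders as a plain ts_rank call in the generated SQL. A quick check (the bind parameter name here is illustrative):

print db.func.full_text.ts_rank(
    SearchView.tsv_contract_description,
    db.func.to_tsquery('scuba')
)
# ts_rank(search_view.tsv_contract_description, to_tsquery(:to_tsquery_1))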
We also want to concatenate all of our search documents together with weights so that we can get a good ranking. This caused some problems: the db.func.concat() function returns a text/string type instead of the required TSVector type, and that type coercion causes incorrect ranking. However, concat() also exists as a Column-level method, which is what is needed for Postgres to properly concatenate TSVector columns. Additionally, we have to coalesce the columns to catch NULL values; failing to do so can really screw up the results.
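To see why the coalesce matters, you can ask Postgres directly what happens when one of the concatenated tsvectors is NULL (this is just a demonstration against the existing session, not part of the app):

print db.session.execute(
    "SELECT to_tsvector('scuba gear') || NULL"
).scalar()
# None -- concatenating a NULL swallows the whole document,
# so a row with a NULL column would end up with a NULL rank

With the rank function registered and the coalesce in place, the full ranked search looks like this: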
def search(search_term):
    return db.session.query(
        db.distinct(SearchView.contract_id),
        SearchView.contract_description,
        SearchView.company_name,
        db.func.max(db.func.full_text.ts_rank(
            db.func.setweight(db.func.coalesce(SearchView.tsv_company_name, ''), 'A').concat(
                db.func.setweight(db.func.coalesce(SearchView.tsv_contract_description, ''), 'D')
            ).concat(
                db.func.setweight(db.func.coalesce(SearchView.tsv_detail_value, ''), 'D')
            ).concat(
                db.func.setweight(db.func.coalesce(SearchView.tsv_line_item_description, ''), 'B')
            ), db.func.to_tsquery(search_term, postgresql_regconfig='english')
        )).label('rank')
    ).filter(db.or_(
        SearchView.tsv_company_name.match(search_term, postgresql_regconfig='english'),
        SearchView.tsv_line_item_description.match(search_term, postgresql_regconfig='english'),
        SearchView.tsv_contract_description.match(search_term, postgresql_regconfig='english'),
        SearchView.tsv_detail_value.match(search_term, postgresql_regconfig='english')
    )).group_by(
        SearchView.contract_id, SearchView.contract_description, SearchView.company_name
    ).order_by(db.text('rank DESC')).all()
search('scuba')
# [(541, u'SCUBA SUPPLIES', u'DIVE RESCUE INTERNATIONAL', 0.0607927), (541, u'SCUBA SUPPLIES', u'SPLASH WATER SPORTS', 0.0607927), (548, u'SERVICE OF SCUBA EQUIPMENT', u'SPLASH WATER SPORTS', 0.0607927)]
fences = search('fence')
# the city has a bunch of fencing contracts. some of them have
# 'fence' in the name of the awarded company, some in other places
print len(fences)
# 12
print fences[0]
# (280, u'FENCING INSTALLATION, REPAIRS, ETC. (CD)', u'ALLEGHENY FENCE CONSTRUCTION', 0.665342)
# note that with the company name containing the word fence, we get a higher rank
print fences[-1]
# (421, u'GRASS SEED AND LANDSCAPING SUPPLIES II', u'KEYSTONE TURF PRODUCTS', 0.243171)
# here, the word 'fence' is part of a line item associated with the contract, so the
# rank is lower.
Now we are getting somewhere! The last thing I wanted to do was to allow users to filter their search results. For example, users might only want to find contract line items that match their search term.
In order to do this, we can write a function that builds a list of filter clauses from the columns we want to search, and pass that into our query:
def build_filter(req_args, fields, search_term, filter_form, _all):
    '''Build the where clause filter
    '''
    clauses = []
    for arg_name, _, filter_column in fields:
        if _all or req_args.get(arg_name) == 'y':
            if not _all:
                filter_form[arg_name].checked = True
            clauses.append(filter_column.match(search_term, postgresql_regconfig='english'))
    return clauses

# filter form is a small form that presents a series of
# checkboxes on the web page for users to filter
filter_form = FilterForm()

fields = [
    ('company_name', 'Company Name', SearchView.tsv_company_name),
    ('line_item', 'Line Item', SearchView.tsv_line_item_description),
    ('contract_description', 'Contract Description', SearchView.tsv_contract_description),
    ('contract_detail', 'Contract Detail', SearchView.tsv_detail_value),
]
def search(search_term, filter_where):
    return db.session.query(
        db.distinct(SearchView.contract_id),
        SearchView.contract_description,
        SearchView.company_name,
        db.func.max(db.func.full_text.ts_rank(
            db.func.setweight(db.func.coalesce(SearchView.tsv_company_name, ''), 'A').concat(
                db.func.setweight(db.func.coalesce(SearchView.tsv_contract_description, ''), 'D')
            ).concat(
                db.func.setweight(db.func.coalesce(SearchView.tsv_detail_value, ''), 'D')
            ).concat(
                db.func.setweight(db.func.coalesce(SearchView.tsv_line_item_description, ''), 'B')
            ), db.func.to_tsquery(search_term, postgresql_regconfig='english')
        )).label('rank')
    ).filter(db.or_(*filter_where)).group_by(
        SearchView.contract_id, SearchView.contract_description, SearchView.company_name
    ).order_by(db.text('rank DESC')).all()
# this time, let's search for tree contracts, because
# they are scattered across a bunch of categories
search_term = 'tree'
filter_none = build_filter(
    {}, fields, search_term, filter_form, True
)
trees = search(search_term, filter_none)
print len(trees)
# 16
print trees[0]
# (308, u'PRUNING AND REMOVAL OF TREES IN CD AREAS ONLY', u'BENTLEY TREE CARE', 0.650144)
# now, let's filter by line items
filter_line_items = build_filter(
    {'line_item': 'y'}, fields, search_term, filter_form, False
)
trees = search(search_term, filter_line_items)
print len(trees)
# 5
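To tie it all together in the app, the filter builder and search function get called from a view function. Here's a rough sketch rather than the project's actual route; the blueprint, query parameter, and template names are placeholders:

from flask import request, render_template

@blueprint.route('/search')
def search_contracts():
    # 'blueprint' stands in for the app's actual Flask blueprint
    filter_form = FilterForm()
    search_term = request.args.get('q', '')
    # if no filter boxes are checked, search across every field
    _all = not any(request.args.get(name) == 'y' for name, _, _ in fields)
    filter_where = build_filter(request.args, fields, search_term, filter_form, _all)
    results = search(search_term, filter_where)
    return render_template('search/results.html', results=results, search_form=filter_form)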
When plugged into a fairly light interface with some added features, it looks something like this:
Search! Filter! Sort!
One side effect of this is that when writing tests, you can no longer use SQLAlchemy’s create_all() method, because it will incorrectly create search_view as a table instead of materializing a view. My solution is to run the database migrations during test setup. One thing to note, however, is that running the migrations takes longer than create_all(), so only do that in cases where you are actually testing the full-text search. One caveat: in the tearDown method, I had no success trying to drop only the materialized view. Instead, I had to drop and recreate the whole schema to make sure the database was clean between tests. If you are using flask-migrate, you can do the following:
class TestMySearch(BaseTestCase):
    def setUp(self):
        # in this case, we don't want to run the super()
        # because that will incorrectly do a create_all!
        from flask_migrate import upgrade
        upgrade()

    def tearDown(self):
        db.session.execute('''DROP SCHEMA IF EXISTS public CASCADE;''')
        db.session.execute('''CREATE SCHEMA public;''')
        db.session.commit()
        db.session.remove()
        db.drop_all()
        # make sure we dispose of our session completely
        db.get_engine(self.app).dispose()
As noted in the piece about a similar implementation in Rails, the approach is pretty fragile if you decide you need to add a new table later. You will have to do some pretty serious mucking around in the migrations should that become necessary.
This implementation has worked out pretty well for this project, but I’m not sure if it’s the best way to go about it. If you have comments or suggestions, let me know on twitter.