Developing for Similarity¶
Setting up the environment¶
NOTE: If there is no submitted data in the lowlevel table, similarity cannot be initialized.
Once your database and server are built, there are a few steps required to initialize the similarity engine. These steps can be run completed altogether using the following cli command:
./develop.sh manage similarity init
Environment subcommands¶
The init
similarity command runs each of these sub-commands in order. They can be run
individually to perform a specific step.
Compute required statistics over the lowlevel table:
./develop.sh manage similarity compute-stats
Extract data values for all recordings in the lowlevel table:
./develop.sh manage similarity add-metrics
This command can be run again at any time to extract values for recordings that have been added since the last time that this command was run.
Finally, build the annoy indices:
./develop.sh manage similarity add-indices
This command will rebuild an index from scratch, reading from the metrics that were extracted from
the add-metrics
command. This should be run each time after add-metrics
to rebuild the
similarity indexes.
It is possible to alter the index parameters using additional arguments.
Similarity implementation details¶
Metrics list¶
We store a list of metrics in the similarity.similarity_metrics database table.
A definition of these each of these metrics is stored in the similarity.metrics
python module.
These definitions describe which database field to load the metric from, and any preprocessing
that has to be performed before adding it to the index.
Statistics¶
Some features (Weighted MFCCs, Weighted GFCCs) require a mean and standard deviation value to
perform normalisation.
We can’t compute the mean of all items in the database as this would take too long, so
instead we take a random sample (by default similarity.manage.NORMALIZATION_SAMPLE_SIZE
, 10000) of items
and compute the mean and SD on those items. In our experiments this gave a good tradeoff of accuracy vs
computation speed.
Speed of database access and indexing¶
Reading all items directly from the database (lowlevel_json table) is too slow to do every time
the annoy index needs to be updated. Because of this, we split the import process into two steps.
The add-metrics
subcommand extracts data from the lowlevel table into a new summary table
(similarity.similarity) that includes only the specific numerical feature necessary for the similarity index.
The add-indices
subcommand reads from the similarity.similarity table and builds an Annoy index. We build
this index from scratch each time the command is run due to technical requirements in Annoy, and the fact that
this operation doesn’t take too long to complete.
Adding a new feature¶
(todo: fill in with more detail)
- Add new metric to
admin/sql/populate_metrics_table.sql
- Add new metric class to
similarity.metrics
- Add this class to
similarity.metrics.BASE_METRICS
- Extract new data if necessary in
db.similarity.get_batch_data
- alter similarity.similarity table to include new column for this feature
- todo: should we be able to fill in just this column, or do we recreate similarity.similarity?
Index parameters¶
Indices take a number of parameters when being built. We are currently evaluating them ourselves and have developed a set of base indices, but you may want to experiment with different parameters.
Queries to an index become more precise when the index is built with a higher number of trees, but they will also take longer to build. Indices can also vary in the method used to calculate distance between recordings. Limitations of parameter selection can be found in the Annoy documentation and our codebase