We'll be following this notebook throughout the article.
Netflix’s recommendation algorithm does a pretty good job of suggesting relevant content, given the sheer volume of options (~16k movies and TV programs in 2023) and how quickly it has to propose shows to users. How does Netflix do it? In short, semantic search.
Semantic search comprehends the meaning and context (both attributes and consumption patterns) behind user queries and movie / TV show descriptions, and can therefore provide better personalization in its queries and recommendations than traditional keyword-based approaches. But semantic search poses certain challenges - foremost among them: 1) ensuring accurate search results, 2) interpretability, and 3) scalability - challenges any successful content recommendation strategy will have to address. Using Superlinked’s library, you can overcome these difficulties.
In this article, we’ll show you how to use the Superlinked library to set up your own semantic search and generate a list of relevant movies based on your preferences.
Semantic search adds a lot of value to vector search but poses three significant vector embedding challenges for developers: ensuring accurate search results, interpretability, and scalability.
The Superlinked library enables you to address these challenges. Below, we’ll build a content recommender (specifically for movies), starting with information we have about a given movie, embed this information as a multimodal vector, build out a searchable vector index for all our movies, and then use query weights to tweak our results and arrive at good movie recommendations. Let’s get into it.
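Below, you’ll perform semantic search on the Netflix movie dataset using the following elements of the Superlinked library: the TextSimilarity space (for the description, title, and genre text), the Recency space (for release dates), and query-time weights (to adjust how much each space counts when you search).
Successfully recommending movies is difficult mostly because there are so many options (>9000 titles in 2023), and users want recommendations on demand, immediately. Let's take a data-driven approach to find something we want to watch. In our dataset of movies, we know the description, title, genres, and release date of each movie.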
We can embed these inputs, and put together a vector index on top of our embeddings, creating a space we can search semantically.
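Once we have our indexed vector space, we will: browse movies that match an idea (a heartfelt romantic comedy), reweight the spaces to tweak the results, search description, title, and genre with a separate query text for each, and finally search for movies similar to one we already like.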
Your first step is to install the library and import the requisite classes.
(Note: Below, change alt.renderers.enable("mimetype") to alt.renderers.enable("colab") if you’re running this in Google Colab. Keep "mimetype" if you’re executing in GitHub.)
%pip install superlinked==5.3.0

from datetime import timedelta, datetime

import altair as alt
import os
import pandas as pd

from superlinked.evaluation.charts.recency_plotter import RecencyPlotter
from superlinked.framework.common.dag.context import CONTEXT_COMMON, CONTEXT_COMMON_NOW
from superlinked.framework.common.dag.period_time import PeriodTime
from superlinked.framework.common.schema.schema import schema
from superlinked.framework.common.schema.schema_object import String, Timestamp
from superlinked.framework.common.schema.id_schema_object import IdField
from superlinked.framework.common.parser.dataframe_parser import DataFrameParser
from superlinked.framework.dsl.executor.in_memory.in_memory_executor import (
    InMemoryExecutor,
    InMemoryApp,
)
from superlinked.framework.dsl.index.index import Index
from superlinked.framework.dsl.query.param import Param
from superlinked.framework.dsl.query.query import Query
from superlinked.framework.dsl.query.result import Result
from superlinked.framework.dsl.source.in_memory_source import InMemorySource
from superlinked.framework.dsl.space.text_similarity_space import TextSimilaritySpace
from superlinked.framework.dsl.space.recency_space import RecencySpace

alt.renderers.enable("mimetype")  # NOTE: to render altair plots in colab, change 'mimetype' to 'colab'
alt.data_transformers.disable_max_rows()
pd.set_option("display.max_colwidth", 190)
We also need to prep the dataset - define time constants, set the url location of the data, create a data store dictionary, read the csv into a pandas DataFrame, clean the dataframe and data so it can be searched properly, and do a quick verification and overview. (See cells 3 and 4 for details.)
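The notebook does this prep with pandas. As a rough, dependency-free sketch of the one transformation the later code relies on (turning a release year into the release_timestamp the schema expects, and back again), something like the following would work. The constant values, the second sample row, and the helper names year_to_timestamp, timestamp_to_year, and clean_rows are assumptions made for illustration, not the notebook's actual code.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration; the notebook defines its own constants.
YEAR_IN_DAYS = 365
TOP_N = 10

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_to_timestamp(year: int) -> int:
    """Represent a release year as the Unix timestamp of 1 January (UTC)."""
    return int((datetime(year, 1, 1, tzinfo=timezone.utc) - EPOCH).total_seconds())


def timestamp_to_year(timestamp: int) -> int:
    """Invert year_to_timestamp, using the same UTC convention."""
    return (EPOCH + timedelta(seconds=timestamp)).year


def clean_rows(rows: list[dict]) -> list[dict]:
    """Drop rows lacking a description or release year; add release_timestamp."""
    cleaned = []
    for row in rows:
        if not row.get("description") or row.get("release_year") is None:
            continue
        cleaned.append(
            {**row, "release_timestamp": year_to_timestamp(row["release_year"])}
        )
    return cleaned


sample_rows = [
    {
        "id": "tm16479",
        "title": "White Christmas",
        "description": "Singer-dancers come together for a stage show "
        "to draw guests to a struggling Vermont inn.",
        "genres": "romance comedy",
        "release_year": 1954,
    },
    # Hypothetical row with a missing description, to show the filtering.
    {"id": "x1", "title": "Untitled", "description": "", "genres": "drama",
     "release_year": 2020},
]

movies = clean_rows(sample_rows)
print(movies[0]["title"], timestamp_to_year(movies[0]["release_timestamp"]))
```

Using one time zone (UTC here) in both directions matters: converting with UTC and reading back in local time can shift a 1 January timestamp into the previous year.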
Now that the dataset is prepared, you can optimize your retrieval using the Superlinked library.
Superlinked’s library contains a set of core building blocks that we use to construct an index and manage retrieval. You can read about these building blocks in more detail here.
First, you need to define your Schema to tell the system about your data.
# accommodate our inputs in a typed schema
@schema
class MovieSchema:
    description: String
    title: String
    release_timestamp: Timestamp
    genres: String
    id: IdField


movie = MovieSchema()
Next, you use Spaces to say how you want to treat each part of the data when embedding. Which Spaces are used depends on your datatype. Each Space is optimized to embed the data so as to return the highest possible quality of retrieval results.
In Space definitions, we describe how the inputs should be embedded in order to reflect the semantic relationships in our data.
# textual fields are embedded using a sentence-transformers model
description_space = TextSimilaritySpace(
    text=movie.description, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
)
title_space = TextSimilaritySpace(
    text=movie.title, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
)
genre_space = TextSimilaritySpace(
    text=movie.genres, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
)

# release dates are encoded using our recency space
# period times aim to reflect notable breaks in our scores
recency_space = RecencySpace(
    timestamp=movie.release_timestamp,
    period_time_list=[
        PeriodTime(timedelta(days=4 * YEAR_IN_DAYS)),
        PeriodTime(timedelta(days=10 * YEAR_IN_DAYS)),
        PeriodTime(timedelta(days=40 * YEAR_IN_DAYS)),
    ],
    negative_filter=-0.25,
)

movie_index = Index(spaces=[description_space, title_space, genre_space, recency_space])
Once you’ve set up your spaces and created your index, you use the source and executor parts of the library to set up your queries. See cells 10-13 in the notebook.
Now that the queries are prepared, let’s move on to running queries and optimizing retrieval by adjusting weights.
The recency space lets you alter the results of your query by preferentially pulling in older or newer releases from your dataset. We use 4, 10, and 40 years as our period times so that we can give years with more titles more focus (see cell 5).
Notice the breaks in the score at 4, 10, and 40 years. Titles older than 40 years get a negative_filter score.
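To build some intuition for why several period times produce breaks in the score, here is a toy model. It is explicitly not Superlinked's actual recency formula: it assumes each period contributes a linear decay from 1 to 0 over its length, averages the contributions, and returns negative_filter for anything older than the longest period. The function name toy_recency_score and the value of YEAR_IN_DAYS are assumptions made for illustration.

```python
YEAR_IN_DAYS = 365  # assumed value; the notebook defines its own constant


def toy_recency_score(
    age_days: float, period_days: list[float], negative_filter: float
) -> float:
    """Toy recency score (NOT the library's formula).

    Each period contributes a linear decay that reaches 0 at the end of
    that period; the score is the average of the contributions. Items
    older than the longest period get `negative_filter`.
    """
    age_days = max(age_days, 0.0)  # treat future releases as brand new
    if age_days > max(period_days):
        return negative_filter
    decays = [max(0.0, 1.0 - age_days / period) for period in period_days]
    return sum(decays) / len(decays)


periods = [4 * YEAR_IN_DAYS, 10 * YEAR_IN_DAYS, 40 * YEAR_IN_DAYS]

for age_years in (0, 2, 4, 10, 40, 41):
    score = toy_recency_score(age_years * YEAR_IN_DAYS, periods, negative_filter=-0.25)
    print(f"{age_years:>2} years old -> {score:.3f}")
```

In this toy version the slope of the score changes each time one of the periods runs out, which is one way to picture the breaks at 4, 10, and 40 years.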
Let's define a quick util function to present our results in the notebook.
def present_result(
    result: Result,
    cols_to_keep: list[str] = ["description", "title", "genres", "release_year", "id"],
) -> pd.DataFrame:
    # parse result to dataframe
    df: pd.DataFrame = result.to_pandas()
    # transform timestamp back to release year
    df["release_year"] = [
        datetime.fromtimestamp(timestamp).year for timestamp in df["release_timestamp"]
    ]
    return df[cols_to_keep]
The Superlinked library lets you perform various kinds of queries; here we define two. Both query types (simple and advanced) let me weight individual spaces (description, title, genre, and of course recency) according to my preferences. The difference between them is that with the simple query, I set one query text and then surface similar results in the description, title, and genre spaces. With the advanced query, I have more fine-grained control: if I want, I can enter a different query text in each of the description, title, and genre spaces. Here's the query code:
query_text_param = Param("query_text")

simple_query = (
    Query(
        movie_index,
        weights={
            description_space: Param("description_weight"),
            title_space: Param("title_weight"),
            genre_space: Param("genre_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(movie)
    .similar(description_space.text, query_text_param)
    .similar(title_space.text, query_text_param)
    .similar(genre_space.text, query_text_param)
    .limit(Param("limit"))
)

advanced_query = (
    Query(
        movie_index,
        weights={
            description_space: Param("description_weight"),
            title_space: Param("title_weight"),
            genre_space: Param("genre_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(movie)
    .similar(description_space.text, Param("description_query_text"))
    .similar(title_space.text, Param("title_query_text"))
    .similar(genre_space.text, Param("genre_query_text"))
    .limit(Param("limit"))
)
In simple queries, I set my query text and apply different weights depending on their importance to me.
result: Result = app.query(
    simple_query,
    query_text="Heartfelt romantic comedy",
    description_weight=1,
    title_weight=1,
    genre_weight=1,
    recency_weight=0,
    limit=TOP_N,
)

present_result(result)
Our results contain some titles I’ve already seen. I can deal with this by upweighting recency to bias my results towards recent titles. Weights are normalized to have unit sum (i.e., all weights are adjusted so they always sum up to a total of 1), so you don't have to worry about how you set them.
result: Result = app.query(
    simple_query,
    query_text="Heartfelt romantic comedy",
    description_weight=1,
    title_weight=1,
    genre_weight=1,
    recency_weight=3,
    limit=TOP_N,
)

present_result(result)
My results (above) are now all post-2021.
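To make the unit-sum statement concrete, the arithmetic can be sketched in a few lines of plain Python. This only illustrates the normalization described above for the two queries we have run so far; normalize_weights is a hypothetical helper, not part of the Superlinked API, and the library's internal handling may differ in detail. The sketch assumes non-negative weights.

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so that they sum to 1 (assumes non-negative weights)."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: value / total for name, value in weights.items()}


# Weights from the first simple query (recency ignored).
first_query = normalize_weights(
    {"description": 1, "title": 1, "genre": 1, "recency": 0}
)

# Weights from the second simple query (recency upweighted).
second_query = normalize_weights(
    {"description": 1, "title": 1, "genre": 1, "recency": 3}
)

for name in first_query:
    print(f"{name:<12} {first_query[name]:.3f} -> {second_query[name]:.3f}")
```

Under this assumption, setting recency_weight=3 against three weights of 1 gives recency half of the total influence, which is why only the relative sizes of the weights matter.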
Using the simple query, I can weight any specific space (description, title, genre, or recency) to make it count more when returning results. Let’s experiment with this. Below, we’ll give more weight to genre, and downweight title - my query text is basically just a genre with some additional context. I keep my recency as is because I’d still like my results to be biased towards recent movies.
result = app.query(
    simple_query,
    query_text="Heartfelt romantic comedy",
    description_weight=1,
    title_weight=0.1,
    genre_weight=2,
    recency_weight=1,
    limit=TOP_N,
)

present_result(result)
This query pushes the release year back a little to give me more genre-weighted results (below).
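Why does reweighting reorder the results? A simplified way to think about it, assuming the final score is a weighted sum of per-space similarity scores, is sketched below. The two candidate movies, their per-space scores, and the helpers combined_score and rank are all made up for illustration; this is not how the library computes scores internally.

```python
def combined_score(space_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-space scores (a simplifying assumption)."""
    total = sum(weights.values())
    return sum(weights[space] * space_scores[space] for space in weights) / total


def rank(candidates: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Return candidate titles sorted from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda title: combined_score(candidates[title], weights),
        reverse=True,
    )


# Hypothetical per-space similarity scores for two made-up movies.
candidates = {
    "Movie A": {"description": 0.9, "title": 0.8, "genre": 0.2, "recency": 0.1},
    "Movie B": {"description": 0.5, "title": 0.1, "genre": 0.9, "recency": 0.6},
}

equal_weights = {"description": 1, "title": 1, "genre": 1, "recency": 0}
genre_heavy = {"description": 1, "title": 0.1, "genre": 2, "recency": 1}

print(rank(candidates, equal_weights))
print(rank(candidates, genre_heavy))
```

With equal text weights and no recency, the movie that matches on description and title comes first; upweighting genre and recency while downweighting title lets the strong genre match overtake it.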
The advanced query gives me even more fine-grained control. I retain control over recency, but can also specify search text for description, title, and genre, and assign each a specific weight according to my preferences, as shown below (and in cells 19-21):
result = app.query(
    advanced_query,
    description_query_text="Heartfelt lovely romantic comedy for a cold autumn evening.",
    title_query_text="love",
    genre_query_text="drama comedy romantic",
    description_weight=0.2,
    title_weight=3,
    genre_weight=1,
    recency_weight=5,
    limit=TOP_N,
)

present_result(result)
Say in my last movie results I found a movie I’ve already seen and would like to see something similar. Let’s assume I like White Christmas, a 1954 romantic comedy (id = tm16479) about singer-dancers coming together for a stage show to draw guests to a struggling Vermont inn. By adding an extra with_vector clause (with a movie_id parameter) to advanced_query, with_movie_query lets me search using this movie (or any movie I like), and gives me all the fine-grained control of separate subsearch query text and weighting.
First, we add our movie_id parameter:
with_movie_query = advanced_query.with_vector(movie, Param("movie_id"))
And then I can set my other subsearch queries either to empty or whatever’s most relevant, along with any weightings that make sense. Let’s say my first query returns results that reflect the stage performance / band aspect of White Christmas (see cell 24), but I want to watch a movie that’s more family-oriented. I can enter a description_query_text to skew my results in the desired direction.
result = app.query(
    with_movie_query,
    description_query_text="family",
    title_query_text="",
    genre_query_text="",
    description_weight=1,
    title_weight=0,
    genre_weight=0,
    recency_weight=0,
    description_query_weight=1,
    movie_id="tm16479",
    limit=TOP_N,
)

present_result(result)
But now that I see my results, I realize I’m actually more in the mood for something light-hearted and funny. Let’s adjust my query accordingly:
# assign to `result` (lowercase) so we don't overwrite the imported Result class
result = app.query(
    with_movie_query,
    description_query_text="",
    title_query_text="",
    genre_query_text="comedy",
    description_weight=1,
    title_weight=0,
    genre_weight=2,
    recency_weight=0,
    description_query_weight=1,
    movie_id="tm16479",
    limit=TOP_N,
)

present_result(result)
Okay, those results are better. I’ll pick one of these. Put the popcorn on!
Superlinked makes it easy to test, iterate, and improve your retrieval quality. Above, we’ve walked you through how to use the Superlinked library to do semantic search on a vector space, the way Netflix does, and return accurate, relevant movie results. We’ve also seen how to fine-tune our results, tweaking weights and search terms until we get to just the right outcome.
Now, try out the notebook yourself, and see what you can achieve.