Improve your vector search

Most Recent

Table of Contents

Semantic search - challenges

Creating a fast and reliable search experience with Superlinked

Conclusion

Was this helpful?

Publication Date: June 3, 2024|

Finding better movies using semantic search, built with Superlinked

Takeaways

Notebook Code

Superlinked's TextSimilarity space with paraphrase-MiniLM-L3-v2 model effectively encodes movie metadata.

Query weights can balance between description, title, genre and recency preferences.

Advanced queries allow different search terms for each metadata field.

Movie recommendations improve by using both vector similarity and weighted attributes.

Recency scores help surface newer content when desired.

The system handles multi-modal data including text descriptions and release dates.

We'll be following this notebook throughout the article.

Netflix’s recommendation algorithm does a pretty good job of suggesting relevant content - given the sheer volume of options (~16k movies and TV programs in 2023) and how quickly it has to propose shows to users. How does Netflix do it? In a word, semantic search.

Semantic search comprehends the meaning and context (both attributes and consumption patterns) behind user queries and movie / TV show descriptions, and can therefore provide better personalization in its queries and recommendations than traditional keyword-based approaches. But semantic search poses certain challenges - foremost among them: 1) ensuring accurate search results, 2) interpretability, and 3) scalability - challenges any successful content recommendation strategy will have to address. Using Superlinked’s library, you can overcome these difficulties.

In this article, we’ll show you how to use the Superlinked library to set up your own semantic search and generate a list of relevant movies based on your preferences.

Semantic search - challenges

Semantic search conveys a lot of value in vector search but poses three significant vector embedding challenges for developers:

Quality and relevance: Ensuring that your embeddings accurately capture the semantic meaning of your data requires careful selection of embedding techniques, training data, and hyperparameters. Poor-quality embeddings can lead to inaccurate search results and irrelevant recommendations.
Interpretability: High-dimensional vector spaces are too complicated to be easily understood. To gain insights into the relationships and similarities encoded within them, data scientists have to develop methods to visualize and analyze them.
Scalability: Managing and processing high-dimensional embeddings, especially in large datasets, can strain computational resources and increase latency. Efficient methods for indexing, retrieval, and similarity computation are essential to ensure scalability and real-time performance in production environments.

The Superlinked library enables you to address these challenges. Below, we’ll build a content recommender (specifically for movies), starting with information we have about a given movie, embed this information as a multimodal vector, build out a searchable vector index for all our movies, and then use query weights to tweak our results and arrive at good movie recommendations. Let’s get into it.

Creating a fast and reliable search experience with Superlinked

Below, you’ll perform semantic search on the Netflix movie dataset using the following elements of the Superlinked library:

Recency space - to understand the freshness (currency and relevancy) of your data, identifying newer movies
TextSimilarity space - to interpret the various pieces of metadata you have about the movie, such as description, title, and genre
Query time weights - letting you choose what’s most important in your data when you run the query, thereby optimizing without needing to re-embed the whole dataset, do postprocessing, or employ a custom reranking model (i.e., reducing latency)

The Netflix dataset, and what we’ll do with it

Successfully recommending movies is difficult mostly because there are so many options (>9000 titles in 2023), and users want recommendations on demand, immediately. Let's take a data-driven approach to find something we want to watch. In our dataset of movies, we know the:

description
genre
title
release_year

We can embed these inputs, and put together a vector index on top of our embeddings, creating a space we can search semantically.

Once we have our indexed vector space, we will:

first, browse the movies, filtered by an idea (heartfelt romantic comedy)
next, tweak the results, giving more importance to matches in certain input fields (i.e., weighting)
then, search in description, genre, and title with different search terms for each
and, after finding a movie that’s a close but not exact match, also search around using that movie as a reference

Installation and dataset preparation

Your first step is to install the library and import the requisite classes.

(Note: Below, change alt.renderers.enable(“mimetype”) to alt.renderers.enable('colab') if you’re running this in google colab. Keep “mimetype” if you’re executing in github.)

%pip install superlinked==5.3.0

from datetime import timedelta, datetime

import altair as alt
import os
import pandas as pd

from superlinked.evaluation.charts.recency_plotter import RecencyPlotter
from superlinked.framework.common.dag.context import CONTEXT_COMMON, CONTEXT_COMMON_NOW
from superlinked.framework.common.dag.period_time import PeriodTime
from superlinked.framework.common.schema.schema import schema
from superlinked.framework.common.schema.schema_object import String, Timestamp
from superlinked.framework.common.schema.id_schema_object import IdField
from superlinked.framework.common.parser.dataframe_parser import DataFrameParser
from superlinked.framework.dsl.executor.in_memory.in_memory_executor import (
   InMemoryExecutor,
   InMemoryApp,
)
from superlinked.framework.dsl.index.index import Index
from superlinked.framework.dsl.query.param import Param
from superlinked.framework.dsl.query.query import Query
from superlinked.framework.dsl.query.result import Result
from superlinked.framework.dsl.source.in_memory_source import InMemorySource
from superlinked.framework.dsl.space.text_similarity_space import TextSimilaritySpace
from superlinked.framework.dsl.space.recency_space import RecencySpace

alt.renderers.enable("mimetype") # NOTE: to render altair plots in colab, change 'mimetype' to 'colab'
alt.data_transformers.disable_max_rows()
pd.set_option("display.max_colwidth", 190)

We also need to prep the dataset - define time constants, set the url location of the data, create a data store dictionary, read the csv into a pandas DataFrame, clean the dataframe and data so it can be searched properly, and do a quick verification and overview. (See cells 3 and 4 for details.)

Now that the dataset is prepared, you can optimize your retrieval using the Superlinked library.

Building out the index for vector search

Superlinked’s library contains a set of core building blocks that we use to construct an index and manage retrieval. You can read about these building blocks in more detail here.

First, you need to define your Schema to tell the system about your data.

# accommodate our inputs in a typed schema
@schema
class MovieSchema:
   description: String
   title: String
   release_timestamp: Timestamp
   genres: String
   id: IdField

movie = MovieSchema()

Next, you use Spaces to say how you want to treat each part of the data when embedding. Which Spaces are used depends on your datatype. Each Space is optimized to embed the data so as to return the highest possible quality of retrieval results.

In Space definitions, we describe how the inputs should be embedded in order to reflect the semantic relationships in our data.

# textual fields are embedded using a sentence-transformers model
description_space = TextSimilaritySpace(
   text=movie.description, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
)
title_space = TextSimilaritySpace(
   text=movie.title, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
)
genre_space = TextSimilaritySpace(
   text=movie.genres, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
)
# release date are encoded using our recency space
# periodtimes aim to reflect notable breaks in our scores
recency_space = RecencySpace(
   timestamp=movie.release_timestamp,
   period_time_list=[
       PeriodTime(timedelta(days=4 * YEAR_IN_DAYS)),
       PeriodTime(timedelta(days=10 * YEAR_IN_DAYS)),
       PeriodTime(timedelta(days=40 * YEAR_IN_DAYS)),
   ],
   negative_filter=-0.25,
)
movie_index = Index(spaces=[description_space, title_space, genre_space, recency_space])

Once you’ve set up your spaces and created your index, you use the source and executor parts of the library to set up your queries. See cells 10-13 in the notebook.

Now that the queries are prepared, let’s move on to running queries and optimizing retrieval by adjusting weights.

Understanding recency, and how to use it in Superlinked

The recency space lets you alter the results of your query by preferentially pulling in older or newer releases from your dataset. We use 4, 10, and 40 years as our period times so that we can give years with more titles more focus - see cell 5).

Notice the breaks in the score at 4, 10, and 40 years. Titles older than 40 years get a negative_filter score.

Recency scores by period

Reviewing and optimizing search results using different query time weights

Let's define a quick util function to present our results in the notebook.

def present_result(
   result: Result,
   cols_to_keep: list[str] = ["description", "title", "genres", "release_year", "id"],
) -> pd.DataFrame:
   # parse result to dataframe
   df: pd.DataFrame = result.to_pandas()
   # transform timestamp back to release year
   df["release_year"] = [
       datetime.fromtimestamp(timestamp).year for timestamp in df["release_timestamp"]
   ]
   return df[cols_to_keep]

Simple and advanced queries

The Superlinked library lets you perform various kinds of queries; here we define two. Both of our query types of query (simple and advanced) let me weight individual spaces (description, title, genre, and of course recency) according to my preferences. The difference between them is that with simple query, I set one query text and then surface similar results in the description, title, and genre spaces. With advanced query, I have more fine-grained control. If I want, I can enter different query texts in each of the description, title, and genre spaces. Here's the query code:

query_text_param = Param("query_text")

simple_query = (
    Query(
        movie_index,
        weights={
            description_space: Param("description_weight"),
            title_space: Param("title_weight"),
            genre_space: Param("genre_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(movie)
    .similar(description_space.text, query_text_param)
    .similar(title_space.text, query_text_param)
    .similar(genre_space.text, query_text_param)
    .limit(Param("limit"))
)

advanced_query = (
    Query(
        movie_index,
        weights={
            description_space: Param("description_weight"),
            title_space: Param("title_weight"),
            genre_space: Param("genre_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(movie)
    .similar(description_space.text, Param("description_query_text"))
    .similar(title_space.text, Param("title_query_text"))
    .similar(genre_space.text, Param("genre_query_text"))
    .limit(Param("limit"))
)

Simple query

In simple queries, I set my query text and apply different weights depending on their importance to me.

result: Result = app.query(
   simple_query,
   query_text="Heartfelt romantic comedy",
   description_weight=1,
   title_weight=1,
   genre_weight=1,
   recency_weight=0,
   limit=TOP_N,
)

present_result(result)

Simple Query results 1

Our results contain some titles I’ve already seen. I can deal with this by upweighting recency to bias my results towards recent titles. Weights are normalized to have unit sum (i.e., all weights are adjusted so they always sum up to a total of 1), so you don't have to worry about how you set them.

result: Result = app.query(
   simple_query,
   query_text="Heartfelt romantic comedy",
   description_weight=1,
   title_weight=1,
   genre_weight=1,
   recency_weight=3,
   limit=TOP_N,
)

present_result(result)

Simple Query results 1

My results (above) are now all post-2021.

Using the simple query, I can weight any specific space (description, title, genre, or recency) to make it count more when returning results. Let’s experiment with this. Below, we’ll give more weight to genre, and downweight title - my query text is basically just a genre with some additional context. I keep my recency as is because I’d still like my results to be biased towards recent movies.

result = app.query(
   simple_query,
   query_text="Heartfelt romantic comedy",
   description_weight=1,
   title_weight=0.1,
   genre_weight=2,
   recency_weight=1,
   limit=TOP_N,
)

present_result(result)

This query pushes the release year back a little to give me more genre-weighted results (below).

Simple Query results 3

Advanced query

The advanced query gives me even more fine-grained control. I retain control over recency, but can also specify search text for description, title, and genre, and assign each a specific weight according to my preferences, per below (and cells 19-21),

result = app.query(
    advanced_query,
    description_query_text="Heartfelt lovely romantic comedy for a cold autumn evening.",
    title_query_text="love",
    genre_query_text="drama comedy romantic",
    description_weight=0.2,
    title_weight=3,
    genre_weight=1,
    recency_weight=5,
    limit=TOP_N,
)

present_result(result)

Search using a specific movie

Say in my last movie results I found a movie I’ve already seen and would like to see something similar. Let’s assume I like White Christmas, a 1954 romantic comedy (id = tm16479) about singer-dancers coming together for a stage show to draw guests to a struggling Vermont inn. By adding an extra with_vector clause (with a movie_id parameter) to advanced_query, with_movie_query lets me search using this movie (or any movie I like), and gives me all the fine-grained control of separate subsearch query text and weighting.

First, we add our movie_id parameter:

with_movie_query = advanced_query.with_vector(movie, Param("movie_id"))

And then I can set my other subsearch queries either to empty or whatever’s most relevant, along with any weightings that make sense. Let’s say my first query returns results that reflect the stage performance / band aspect of White Christmas (see cell 24), but I want to watch a movie that’s more family-oriented. I can enter a description_query_text to skew my results in the desired direction.

result = app.query(
    with_movie_query,
    description_query_text="family",
    title_query_text="",
    genre_query_text="",
    description_weight=1,
    title_weight=0,
    genre_weight=0,
    recency_weight=0,
    description_query_weight=1,
    movie_id="tm16479",
    limit=TOP_N,
)

present_result(result)

Advanced Query results 1

But now that I see my results, I realize I’m actually more in the mood for something light-hearted and funny. Let’s adjust my query accordingly:

Result = app.query(
    with_movie_query,
    description_query_text="",
    title_query_text="",
    genre_query_text="comedy",
    description_weight=1,
    title_weight=0,
    genre_weight=2,
    recency_weight=0,
    description_query_weight=1,
    movie_id="tm16479",
    limit=TOP_N,
)

present_result(result)

Advanced Query results 2

Okay, those results are better. I’ll pick one of these. Put the popcorn on!

Conclusion

Superlinked makes it easy to test, iterate, and improve your retrieval quality. Above, we’ve walked you through how to use the Superlinked library to do semantic search on a vector space, the way Netflix does, and return accurate, relevant movie results. We’ve also seen how to fine-tune our results, tweaking weights and search terms until we get to just the right outcome.

Now, try out the notebook yourself, and see what you can achieve.

Stay updated with VectorHub