Multi-attribute search with vector embeddings

#Vector Search
#Superlinked
Publication Date: September 1, 2024

Vector search represents a revolution in information retrieval. Vector embedding - by taking account of context and semantic meaning - empowers vector search to return more relevant and accurate results, handle not just structured but also unstructured data and multiple languages, and scale. But to generate high quality responses in real-world applications, we often need to assign different weights to specific attributes of our data objects.

There are two common approaches to multi-attribute vector search. Both start by separately embedding each attribute of a data object. The main difference between these two approaches is in how our embeddings are stored and searched.

  1. the naive approach - store each attribute vector in separate vector stores (one per attribute), perform a separate search for each attribute, combine search results, and post-process (e.g., weight) as required.
  2. the Superlinked approach - concatenate and store all attribute vectors in the same vector store (using Superlinked's built-in functionality), which allows us to search just once, with attendant efficiency gains. Superlinked's spaces also let us weight each attribute at query time to surface more relevant results, with no post-processing.

Two approaches to multi-attribute vector search
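
To see why one search over concatenated vectors can stand in for several per-attribute searches, note that when every attribute vector is unit-length, the dot product of two concatenated vectors equals the sum of the per-attribute cosine similarities. Scaling each query block by a weight turns that sum into a weighted sum. The sketch below checks this identity on toy 2-D vectors in plain Python. The vectors, weights, and helper names are illustrative assumptions, not Superlinked's internals.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def concat(blocks, weights=None):
    """Concatenate per-attribute blocks, optionally scaling each block by a weight."""
    weights = weights or [1.0] * len(blocks)
    return [w * x for w, block in zip(weights, blocks) for x in block]


# Toy unit vectors standing in for look / habitat / behavior embeddings (made up).
item = [normalize(v) for v in ([1.0, 2.0], [3.0, 1.0], [0.5, 0.5])]
query = [normalize(v) for v in ([2.0, 1.0], [1.0, 1.0], [0.0, 1.0])]
weights = [1.0, 0.5, 2.0]  # hypothetical query-time weights

# One search: a single dot product over the concatenated vectors.
single_search_score = dot(concat(query, weights), concat(item))

# Several searches: one cosine similarity per attribute, combined afterwards.
per_attribute = [dot(q, i) for q, i in zip(query, item)]
combined_score = sum(w * s for w, s in zip(weights, per_attribute))

print(math.isclose(single_search_score, combined_score))
```

Because the weights only touch the query vector, the stored vectors never need to be re-indexed when the weighting changes.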

Below, we'll use these two approaches to implement a multi-attribute vector search tool - a Dungeons and Dragons monster finder! Our simple implementations, especially the second, will illustrate how to create more powerful and flexible search systems, ones that can handle complex, multi-faceted queries with ease, whatever your use case.

If you're new to vector similarity search, don't worry! We've got you covered - check out our building blocks articles.

Okay, let's go monster hunting!

A Dungeons & Dragons use case

It's game night, your friends are perched around the games table, waiting to see what Dungeons & Dragons (D&D) character they'll become and quest they'll embark on. Tonight, you're Dungeon Master (storyteller and guide), crafter of thrilling encounters to challenge and enthrall your players. Your trusty D&D Monster Manual contains thousands of creatures. Finding the perfect monster for each situation among the myriad options can be overwhelming. The ideal foe needs to match the setting, difficulty, and narrative of the moment.

What if we could create a tool that instantly finds the monster most suited to each scenario? A tool that considers multiple factors simultaneously, ensuring each encounter is as immersive and exciting as possible?

Let's embark on a quest of our own: build the ultimate monster-finding system, using the power of multi-attribute vector search!

Dataset

First, we'll generate a small synthetic dataset of monsters, by prompting a Large Language Model (LLM):

    Generate two JSON lists: 'monsters' and 'queries'.

    1. 'monsters' list: Create 20 unique monsters with the following properties:
       - name: A distinctive name
       - look: Brief description of appearance (2-3 sentences)
       - habitat: Where the monster lives (2-3 sentences)
       - behavior: How the monster acts (2-3 sentences)
       Ensure some monsters share similar features while remaining distinct.

    2. 'queries' list: Create 5 queries to search for monsters:
       - Each query should be in the format: {look: "...", habitat: "...", behavior: "..."}
       - Use simple, brief descriptions (1-3 words per field)
       - Make queries somewhat general to match multiple monsters

    Output format:
    {
      "monsters": [
        {"name": "...", "look": "...", "habitat": "...", "behavior": "..."},
        ...
      ],
      "queries": [
        {"look": "...", "habitat": "...", "behavior": "..."},
        ...
      ]
    }

Let's take a look at a sample of the dataset our LLM generated. Note: LLM generation is non-deterministic, so your results may differ.

Here are our first five monsters:

|   | name | look | habitat | behavior |
|---|------|------|---------|----------|
| 0 | Luminoth | Moth-like creature with glowing wings and antenna | Dense forests and jungles with bioluminescent flora | Emits soothing light patterns to communicate and attract prey |
| 1 | Aqua Wraith | Translucent humanoid figure made of flowing water | Rivers, lakes, and coastal areas | Shapeshifts to blend with water bodies and controls currents |
| 2 | Stoneheart Golem | Massive humanoid composed of interlocking rock formations | Rocky mountains and ancient ruins | Hibernates for centuries, awakens to protect its territory |
| 3 | Whispering Shade | Shadowy, amorphous being with glowing eyes | Dark forests and abandoned buildings | Feeds on fear and whispers unsettling truths |
| 4 | Zephyr Dancer | Graceful avian creature with iridescent feathers | High mountain peaks and wind-swept plains | Creates mesmerizing aerial displays to attract mates |

...and our generated queries:

|   | look | habitat | behavior |
|---|------|---------|----------|
| 0 | glowing | dark places | light manipulation |
| 1 | elemental | extreme environments | environmental control |
| 2 | shapeshifting | varied landscapes | illusion creation |
| 3 | crystalline | mineral-rich areas | energy absorption |
| 4 | ethereal | atmospheric | mind influence |

See original dataset and query examples here.
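
Before embedding anything, it can help to check that the LLM's reply really has the shape the prompt asked for. The following is a minimal sketch, assuming the reply has been saved as a JSON string. `parse_dataset` and the `REQUIRED_*` constants are hypothetical helper names, and the sample reply reuses one monster and one query from the tables above.

```python
import json

REQUIRED_MONSTER_KEYS = {"name", "look", "habitat", "behavior"}
REQUIRED_QUERY_KEYS = {"look", "habitat", "behavior"}


def parse_dataset(raw: str):
    """Parse the LLM's JSON reply and check that every record has the expected fields."""
    payload = json.loads(raw)
    monsters, queries = payload["monsters"], payload["queries"]
    for record in monsters:
        missing = REQUIRED_MONSTER_KEYS - set(record)
        if missing:
            raise ValueError(f"monster {record.get('name')!r} is missing {sorted(missing)}")
    for record in queries:
        missing = REQUIRED_QUERY_KEYS - set(record)
        if missing:
            raise ValueError(f"query {record!r} is missing {sorted(missing)}")
    return monsters, queries


# A one-monster, one-query reply in the format requested by the prompt.
raw_reply = """
{
  "monsters": [
    {"name": "Luminoth",
     "look": "Moth-like creature with glowing wings and antenna",
     "habitat": "Dense forests and jungles with bioluminescent flora",
     "behavior": "Emits soothing light patterns to communicate and attract prey"}
  ],
  "queries": [
    {"look": "glowing", "habitat": "dark places", "behavior": "light manipulation"}
  ]
}
"""

monsters, queries = parse_dataset(raw_reply)
print(monsters[0]["name"], "|", queries[0]["look"])
```

The validated `monsters` list is a list of flat dictionaries, so it could be handed to a DataFrame constructor to build the kind of table the retrieval code works on.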

Retrieval

Let's set up parameters we'll use in both of our approaches - naive and Superlinked - below.

We generate our vector embeddings with sentence-transformers/all-mpnet-base-v2. For simplicity's sake, we'll limit our output to the top 3 matches. (For complete code, including necessary imports and helper functions, see the notebook.)

    LIMIT = 3
    MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"

Now, let's get our multi-attribute monster search under way! First, we'll try the naive approach.

Naive approach

In our naive approach, we embed attributes independently and store them in different indices. At query time, we run multiple kNN-searches on all the indices, and then combine all our partial results into one.

We start by defining a class NaiveRetriever to perform similarity-based search on our dataset, using our all-mpnet-base-v2-generated embeddings.

    import pandas as pd
    from sentence_transformers import SentenceTransformer
    from sklearn.neighbors import NearestNeighbors


    class NaiveRetriever:
        def __init__(self, data: pd.DataFrame):
            self.model = SentenceTransformer(MODEL_NAME)
            self.data = data.copy()
            self.ids = self.data.index.to_list()
            self.knns = {}
            # Fit one cosine kNN index per attribute column.
            for key in self.data:
                embeddings = self.model.encode(self.data[key].values)
                knn = NearestNeighbors(metric="cosine").fit(embeddings)
                self.knns[key] = knn

        def search_key(self, key: str, value: str, limit: int = LIMIT) -> pd.DataFrame:
            # Search a single attribute index for the nearest neighbors of `value`.
            embedding = self.model.encode(value)
            knn = self.knns[key]
            distances, indices = knn.kneighbors(
                [embedding], n_neighbors=limit, return_distance=True
            )
            ids = [self.ids[i] for i in indices[0]]

            # by definition:
            # cosine distance = 1 - cosine similarity
            similarities = (1 - distances).flatten()

            result = pd.DataFrame(
                {"id": ids, f"score_{key}": similarities, key: self.data[key][ids]}
            )
            result.set_index("id", inplace=True)
            return result

        def search(self, query: dict, limit: int = LIMIT) -> pd.DataFrame:
            # Run one search per attribute, then merge the partial results by id.
            results = []
            for key, value in query.items():
                if key not in self.knns:
                    continue
                result_key = self.search_key(key, value, limit=limit)
                result_key.drop(columns=[key], inplace=True)
                results.append(result_key)

            merged_results = pd.concat(results, axis=1)
            # skipna=False: the mean is NaN unless a monster was found for every attribute.
            merged_results["score"] = merged_results.mean(axis=1, skipna=False)
            merged_results.sort_values("score", ascending=False, inplace=True)
            return merged_results


    naive_retriever = NaiveRetriever(df.set_index("name"))

Let's use the first query from our generated list above, and search for monsters using our naive_retriever:

    query = {
        'look': 'glowing',
        'habitat': 'dark places',
        'behavior': 'light manipulation'
    }

    naive_retriever.search(query)

Our naive_retriever returns the following search results for each attribute:

Look: glowing

| id | score_look | look |
|----|------------|------|
| Whispering Shade | 0.503578 | Shadowy, amorphous being with glowing eyes |
| Sandstorm Djinn | 0.407344 | Swirling vortex of sand with glowing symbols |
| Luminoth | 0.378619 | Moth-like creature with glowing wings and antenna |

Awesome! Our returned monster results are relevant - they all have some "glowing" characteristic.

Let's see what the naive approach returns when we search the other two attributes.

Habitat: dark places

| id | score_habitat | habitat |
|----|---------------|---------|
| Whispering Shade | 0.609567 | Dark forests and abandoned buildings |
| Fungal Network | 0.438856 | Underground caverns and damp forests |
| Thornvine Elemental | 0.423421 | Overgrown ruins and dense jungles |

Behavior: light manipulation

| id | score_behavior | behavior |
|----|----------------|----------|
| Living Graffiti | 0.385741 | Shapeshifts to blend with surroundings and absorbs pigments |
| Crystalwing Drake | 0.385211 | Hoards precious gems and can refract light into powerful beams |
| Luminoth | 0.345566 | Emits soothing light patterns to communicate and attract prey |

All the retrieved monsters do possess the wanted attributes. At first glance, the naive search results may seem promising. But we need to find monsters that possess all three attributes simultaneously. Let's merge our results to see how well our monsters do at achieving this goal:

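| id | score_look | score_habitat | score_behavior |
|----|------------|---------------|----------------|
| Whispering Shade | 0.503578 | 0.609567 | |
| Sandstorm Djinn | 0.407344 | | |
| Luminoth | 0.378619 | | 0.345566 |
| Fungal Network | | 0.438856 | |
| Thornvine Elemental | | 0.423421 | |
| Living Graffiti | | | 0.385741 |
| Crystalwing Drake | | | 0.385211 |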

And here, the limits of the naive approach become obvious. Let's evaluate:

  1. Relevance by attribute:

    • "Look": Three monsters were retrieved (Whispering Shade, Sandstorm Djinn, and Luminoth).
    • "Habitat": Only one monster from the "Look" results was relevant (Whispering Shade).
    • "Behavior": Only one monster from the "Look" results was relevant (Luminoth), but it's different from the one relevant for "Habitat".
  2. Overall relevance:

    • No single monster was retrieved for all three attributes simultaneously.
    • The results are fragmented: different monsters are relevant for different attributes.
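
The fragmentation described above can be checked directly from the per-attribute scores reported earlier. This small sketch uses plain dictionaries in place of the DataFrames. The scores are copied from the three result tables, and the variable names are illustrative.

```python
# Top-3 matches per attribute, copied from the per-attribute result tables.
top3 = {
    "look": {"Whispering Shade": 0.503578, "Sandstorm Djinn": 0.407344, "Luminoth": 0.378619},
    "habitat": {"Whispering Shade": 0.609567, "Fungal Network": 0.438856, "Thornvine Elemental": 0.423421},
    "behavior": {"Living Graffiti": 0.385741, "Crystalwing Drake": 0.385211, "Luminoth": 0.345566},
}

# Record which attribute searches returned each monster.
hits = {}
for attribute, scores in top3.items():
    for name in scores:
        hits.setdefault(name, []).append(attribute)

complete = [name for name, attrs in hits.items() if len(attrs) == len(top3)]
partial = {name: attrs for name, attrs in hits.items() if 1 < len(attrs) < len(top3)}

print(len(hits))  # 7 distinct monsters retrieved
print(complete)   # [] - none was retrieved for all three attributes
print(partial)    # Whispering Shade (look, habitat) and Luminoth (look, behavior)
```

Since `NaiveRetriever.search` averages with `skipna=False`, every one of these seven rows ends up with a NaN combined score, so the merged table cannot be ranked meaningfully.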

In short, the naive search approach fails to find monsters that satisfy all criteria at once. Maybe we can fix this issue by proactively retrieving more monsters for each attribute? Let's try it with 6 monsters per attribute, instead of 3. Let's take a look at what this approach generates:

idscore_lookscore_habitatscore_behavior
Whispering Shade0.5035780.609567
Sandstorm Djinn0.4073440.365061
Luminoth0.3786190.345566
Nebula Jellyfish0.366270.259969
Dreamweaver Octopus0.315679
Quantum Firefly0.288578
Fungal Network0.438856
Thornvine Elemental0.423421
Mist Phantom0.3668160.236649
Stoneheart Golem0.342287
Living Graffiti0.385741
Crystalwing Drake0.385211
Aqua Wraith0.283581

We've now retrieved 13 monsters (more than half of our tiny dataset!), and still have the same issue: not one of these monsters was retrieved for all three attributes.

Increasing the number of retrieved monsters (beyond 6) might solve our problem, but it creates additional issues:

  1. In production, retrieving more results (multiple kNN searches) lengthens search time noticeably.
  2. For each new attribute we introduce, our chances of finding a "complete" monster - with all the attributes in our query - drop exponentially. To prevent this, we have to retrieve many more nearest neighbors (monsters), making the total number of retrieved monsters grow exponentially.
  3. We still have no guarantee we'll retrieve monsters that possess all our desired attributes.
  4. If we do manage to retrieve monsters that satisfy all criteria at once, we'll have to expend additional overhead reconciling results.
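
A back-of-the-envelope model makes point 2 concrete. It rests on a loud simplifying assumption: each attribute's top-k list is treated as an independent, uniformly random sample of the dataset. Real attributes are correlated, so the numbers only illustrate the trend. Under that assumption, a given monster appears in all m lists with probability (k/N)^m. Both function names below are hypothetical.

```python
def expected_complete(n_items: int, k: int, n_attributes: int) -> float:
    """Expected number of items that appear in every one of n_attributes top-k lists,
    ASSUMING each list is an independent uniform random sample of k items."""
    return n_items * (k / n_items) ** n_attributes


def k_needed(n_items: int, n_attributes: int, target: float = 1.0) -> int:
    """Smallest k for which the expected number of complete items reaches target."""
    for k in range(1, n_items + 1):
        if expected_complete(n_items, k, n_attributes) >= target:
            return k
    return n_items


# 20 monsters, as in our dataset; top-3 retrieval per attribute.
for m in (2, 3, 4, 5):
    print(m, round(expected_complete(20, 3, m), 4), k_needed(20, m))
```

With 20 monsters and three attributes, this toy model expects well under one complete monster at k = 3. It needs k = 8 per attribute, or up to 24 retrieved rows, before it expects a single complete match.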

In sum, the naive approach is too uncertain and inefficient for viable multi-attribute search, especially in production.

Superlinked approach

Let's implement our second approach to see if it does any better than the naive one.

First, we define the schema, spaces, index, and query:

    @schema
    class Monster:
        id: IdField
        look: String
        habitat: String
        behavior: String


    monster = Monster()

    look_space = TextSimilaritySpace(text=monster.look, model=MODEL_NAME)
    habitat_space = TextSimilaritySpace(text=monster.habitat, model=MODEL_NAME)
    behavior_space = TextSimilaritySpace(text=monster.behavior, model=MODEL_NAME)

    monster_index = Index([look_space, habitat_space, behavior_space])

    monster_query = (
        Query(
            monster_index,
            weights={
                look_space: Param("look_weight"),
                habitat_space: Param("habitat_weight"),
                behavior_space: Param("behavior_weight"),
            },
        )
        .find(monster)
        .similar(look_space.text, Param("look"))
        .similar(habitat_space.text, Param("habitat"))
        .similar(behavior_space.text, Param("behavior"))
        .limit(LIMIT)
    )

    default_weights = {
        "look_weight": 1.0,
        "habitat_weight": 1.0,
        "behavior_weight": 1.0
    }

Now, we start the executor and upload the data:

    monster_parser = DataFrameParser(monster, mapping={monster.id: "name"})
    source: InMemorySource = InMemorySource(monster, parser=monster_parser)
    executor = InMemoryExecutor(sources=[source], indices=[monster_index])
    app = executor.run()
    source.put([df])

Let's run the same query we ran in our naive approach implementation above:

    query = {
        'look': 'glowing',
        'habitat': 'dark places',
        'behavior': 'light manipulation'
    }

    app.query(
        monster_query,
        limit=LIMIT,
        **query,
        **default_weights
    )

| id | score | look | habitat | behavior |
|----|-------|------|---------|----------|
| Whispering Shade | 0.376738 | Shadowy, amorphous being with glowing eyes | Dark forests and abandoned buildings | Feeds on fear and whispers unsettling truths |
| Luminoth | 0.340084 | Moth-like creature with glowing wings and antenna | Dense forests and jungles with bioluminescent flora | Emits soothing light patterns to communicate and attract prey |
| Living Graffiti | 0.330587 | Two-dimensional, colorful creature that inhabits flat surfaces | Urban areas, particularly walls and billboards | Shapeshifts to blend with surroundings and absorbs pigments |

Et voilà! This time, each of our top returned monsters ranks highly in a score that represents a kind of "mean" of all three of the characteristics we want our monster to have. Our second and third results possess all three characteristics, and our top result (Whispering Shade), though its behavior is less related to light manipulation, is very relevant in look (glowing) and habitat (dark places), giving it the highest score overall. What an improvement!

Can we replicate our results? Let's try another query and find out.

    query = {
        'look': 'shapeshifting',
        'habitat': 'varied landscapes',
        'behavior': 'illusion creation'
    }

| id | score | look | habitat | behavior |
|----|-------|------|---------|----------|
| Mist Phantom | 0.489574 | Ethereal, fog-like humanoid with shifting features | Swamps, moors, and foggy coastlines | Lures travelers astray with illusions and whispers |
| Zephyr Dancer | 0.342075 | Graceful avian creature with iridescent feathers | High mountain peaks and wind-swept plains | Creates mesmerizing aerial displays to attract mates |
| Whispering Shade | 0.337434 | Shadowy, amorphous being with glowing eyes | Dark forests and abandoned buildings | Feeds on fear and whispers unsettling truths |

Great! Our outcomes are excellent again.

What if we want to find monsters that are similar to a specific monster from our dataset? Let's try it with a monster we haven't seen yet - "Harmonic Coral". We could extract attributes for this monster and create query parameters manually. But Superlinked has a with_vector method we can use on the query object. Because each monster's id is its name, we can express our request as simply as:

    app.query(
        monster_query.with_vector(monster, "Harmonic Coral"),
        **default_weights,
        limit=LIMIT
    )

| id | score | look | habitat | behavior |
|----|-------|------|---------|----------|
| Harmonic Coral | 1 | Branching, musical instrument-like structure with vibrating tendrils | Shallow seas and tidal pools | Creates complex melodies to communicate and influence emotions |
| Dreamweaver Octopus | 0.402288 | Cephalopod with tentacles that shimmer like auroras | Deep ocean trenches and underwater caves | Influences the dreams of nearby creatures |
| Aqua Wraith | 0.330869 | Translucent humanoid figure made of flowing water | Rivers, lakes, and coastal areas | Shapeshifts to blend with water bodies and controls currents |

The top result is the most relevant one, Harmonic Coral itself, as expected. The other two monsters our search retrieves are "Dreamweaver Octopus" and "Aqua Wraith". Both share important thematic (attribute) elements with Harmonic Coral:

  1. Aquatic habitats (habitat)
  2. Ability to influence or manipulate their environment (behavior)
  3. Dynamic or fluid visual characteristics (look)

Attribute weighting

Suppose, now, that we want to give more importance to the "look" attribute. The Superlinked framework lets us easily adjust weights at query time. For easy comparison, we'll search for monsters similar to Harmonic Coral, but with our weights adjusted to favor "look".

    weights = {
        "look_weight": 1.0,
        "habitat_weight": 0,
        "behavior_weight": 0
    }

    app.query(
        monster_query.with_vector(monster, "Harmonic Coral"),
        limit=LIMIT,
        **weights
    )

| id | score | look | habitat | behavior |
|----|-------|------|---------|----------|
| Harmonic Coral | 0.57735 | Branching, musical instrument-like structure with vibrating tendrils | Shallow seas and tidal pools | Creates complex melodies to communicate and influence emotions |
| Thornvine Elemental | 0.252593 | Plant-like creature with a body of twisted vines and thorns | Overgrown ruins and dense jungles | Rapidly grows and controls surrounding plant life |
| Plasma Serpent | 0.243241 | Snake-like creature made of crackling energy | Electrical storms and power plants | Feeds on electrical currents and can short-circuit technology |
Our results all (appropriately) have similar appearances - "Branching with vibrating tendrils", "Plant-like creature with a body of twisted vines and thorns", "Snake-like".

Now, let's do another search, ignoring appearance, and looking instead for monsters that are similar in terms of "habitat" and "behavior" simultaneously:

    weights = {
        "look_weight": 0,
        "habitat_weight": 1.0,
        "behavior_weight": 1.0
    }

| id | score | look | habitat | behavior |
|----|-------|------|---------|----------|
| Harmonic Coral | 0.816497 | Branching, musical instrument-like structure with vibrating tendrils | Shallow seas and tidal pools | Creates complex melodies to communicate and influence emotions |
| Dreamweaver Octopus | 0.357656 | Cephalopod with tentacles that shimmer like auroras | Deep ocean trenches and underwater caves | Influences the dreams of nearby creatures |
| Mist Phantom | 0.288106 | Ethereal, fog-like humanoid with shifting features | Swamps, moors, and foggy coastlines | Lures travelers astray with illusions and whispers |

Again, the Superlinked approach produces great results. All three monsters live in watery habitats and possess mind-controlling abilities.

Finally, let's try another search, weighting all three attributes differently - to find monsters that in comparison to Harmonic Coral look somewhat similar, live in very different habitats, and possess very similar behavior:

    weights = {
        "look_weight": 0.5,
        "habitat_weight": -1.0,
        "behavior_weight": 1.0
    }

| id | score | look | habitat | behavior |
|----|-------|------|---------|----------|
| Harmonic Coral | 0.19245 | Branching, musical instrument-like structure with vibrating tendrils | Shallow seas and tidal pools | Creates complex melodies to communicate and influence emotions |
| Luminoth | 0.149196 | Moth-like creature with glowing wings and antenna | Dense forests and jungles with bioluminescent flora | Emits soothing light patterns to communicate and attract prey |
| Zephyr Dancer | 0.136456 | Graceful avian creature with iridescent feathers | High mountain peaks and wind-swept plains | Creates mesmerizing aerial displays to attract mates |

Great results again! Our two other retrieved monsters - Luminoth and Zephyr Dancer - have behavior similar to Harmonic Coral, and live in habitats different from Harmonic Coral's. They also look very different from Harmonic Coral. (Harmonic's tendrils and Luminoth's antennae are similar features, but we only down-weighted look_weight to 0.5, and the resemblance between the two monsters ends there.)
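
A side note on the scores themselves: Harmonic Coral's score against itself changes with the weights (1, 0.57735, 0.816497, and 0.19245 in the four searches above). Those values are consistent with a simple normalization model, sketched below. This is an inference from the reported numbers, not a documented description of Superlinked's internals. It assumes each stored vector is a concatenation of n unit-length attribute vectors scaled by 1/sqrt(n), and that the query vector's blocks are scaled by the weights and divided by the L2 norm of the weight vector.

```python
import math


def self_match_score(weights):
    """Score of an item against itself under the ASSUMED normalization:
    stored vector = concat(u_1..u_n) / sqrt(n)            (each u_i unit-length)
    query vector  = concat(w_1*u_1..w_n*u_n) / ||w||_2
    dot product   = sum(w) / (sqrt(n) * ||w||_2)
    """
    n = len(weights)
    weight_norm = math.sqrt(sum(w * w for w in weights))
    return sum(weights) / (math.sqrt(n) * weight_norm)


# (look_weight, habitat_weight, behavior_weight) -> Harmonic Coral score reported above
reported = {
    (1.0, 1.0, 1.0): 1.0,
    (1.0, 0.0, 0.0): 0.57735,
    (0.0, 1.0, 1.0): 0.816497,
    (0.5, -1.0, 1.0): 0.19245,
}

for weights, score in reported.items():
    print(weights, round(self_match_score(weights), 6), score)
```

One practical consequence, if the model holds: scores from searches with different weightings sit on different scales, so they are best compared within a single result list rather than across queries.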

Conclusion

Multi-attribute vector search is a significant advance in information retrieval, offering more accuracy, contextual understanding, and flexibility than basic semantic similarity search. Still, our naive approach (above) - storing and searching attribute vectors separately, then combining results - is limited in ability, subtlety, and efficiency when we need to retrieve objects with multiple simultaneous attributes. (Moreover, multiple kNN searches take more time than a single search with concatenated vectors.)

To handle scenarios like this, it's better to store all your attribute vectors in the same vector store and perform a single search, weighting your attributes at query time. The Superlinked approach is more accurate, efficient, and scalable than the naive approach for any application that requires fast, reliable, nuanced, multi-attribute vector retrieval - whether your use case is tackling real world data challenges in your e-commerce or recommendation system... or something entirely different, like battling monsters.
