Building high-performing Generative AI (GenAI) applications has become a top priority for enterprise tech teams globally. The promise of operational efficiencies and enhanced customer experiences is undeniably appealing. However, a significant challenge stands in the way: enterprise data is complex, and data entities have both structured and unstructured attributes. GenAI models are powerful at understanding and working with unstructured data, but they are notoriously poor at handling structured data predictably.
Structured data is the backbone of most internal and customer-facing enterprise systems. Consider a Q&A chatbot for a financial analyst: we would expect it to understand that when the analyst refers to “recent” or “at risk” reports, they mean reports from a particular range of dates. Or consider a recommender system for an e-commerce store: we would expect it to recommend, in real time, products that are similarly priced or similarly popular to the one the customer is currently viewing. Unfortunately, traditional GenAI models struggle with numbers, timestamps and categorical data, let alone more complex data structures like time series or graphs. Their strength lies in processing certain types of unstructured data like text and images, but they falter with structured data, treating important attributes such as product prices, document dates, store locations, and content categories as if they were merely text. This mismatch often leads tech teams to discover that their applications deliver subpar results.
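To see why treating a number as text loses information, consider a toy sketch. It uses a simple token-overlap (Jaccard) score as a stand-in for text similarity. Real embedding models are far more sophisticated, so this only illustrates the blind spot; the function name and the sample strings are made up for the example.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


current = "price 99"
near = "price 100"   # numerically close to 99
far = "price 999"    # numerically far from 99

# As text, both candidates share exactly one token ("price") with the current item.
text_near = token_overlap(current, near)
text_far = token_overlap(current, far)

# As numbers, the difference between the two candidates is obvious.
numeric_near = abs(99 - 100)
numeric_far = abs(99 - 999)

print(text_near, text_far)
print(numeric_near, numeric_far)
```

The text score cannot tell the near price from the far one, while the numeric distance separates them immediately.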
In response, many teams attempt to develop and train custom re-ranking models tailored to their specific needs. However, this is a daunting task, which requires significant expertise, time, and resources—luxuries that many enterprises cannot afford. Consequently, most GenAI-powered solutions remain stuck in the proof of concept phase, unable to realize their full potential.
The combination of MongoDB Atlas and Superlinked aims to overcome these challenges and revolutionize the way enterprises build and deploy GenAI applications. Here’s how:
Superlinked’s vector compute framework is a game-changer for data science teams. It enables the creation of custom vector embeddings that seamlessly integrate structured and unstructured data into the same vector space. This unique approach allows enterprises to use vector search to deliver results that take into account both data types, effectively tailoring GenAI models to their specific use cases. The result combines the high-quality performance of a custom model with the convenience of pre-trained GenAI models, offering a significant boost in time-to-market and explainability of the results.
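The idea of placing structured and unstructured attributes in one vector space can be sketched in plain Python. This is not Superlinked's actual implementation: the quarter-circle number encoding, the 3-dimensional "text embeddings", and every function name below are assumptions chosen to keep the example small.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def encode_number(value, min_value, max_value):
    """Place a number on a quarter circle: min -> [0, 1], max -> [1, 0]."""
    t = (value - min_value) / (max_value - min_value)
    angle = t * math.pi / 2
    return [math.sin(angle), math.cos(angle)]


def combine(text_vec, number_vec, text_weight=1.0, number_weight=1.0):
    """Concatenate the weighted parts into a single vector."""
    return ([text_weight * x for x in normalize(text_vec)]
            + [number_weight * x for x in number_vec])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical text embeddings and star ratings for two reviews.
reviews = {
    "A": ([0.9, 0.1, 0.0], 2),  # closest text match, low rating
    "B": ([0.8, 0.2, 0.0], 5),  # slightly weaker text match, top rating
}
query_text_vec = [1.0, 0.0, 0.0]


def rank(text_weight, rating_weight):
    # The query asks for similar text and the highest possible rating (5).
    query = combine(query_text_vec, encode_number(5, 1, 5), text_weight, rating_weight)
    scores = {
        name: dot(query, combine(vec, encode_number(rating, 1, 5)))
        for name, (vec, rating) in reviews.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank(1.0, 0.0))  # text only -> ['A', 'B']
print(rank(1.0, 1.0))  # text and rating -> ['B', 'A']
```

Because the weights are applied to the query vector, the same stored vectors can serve both rankings without re-embedding the data.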
For enterprises already using MongoDB to store their structured data, Atlas is a natural choice. MongoDB Atlas offers an easy and reliable way to implement vector search at scale without adding complexity to their existing data stack. By using Atlas with custom embeddings, generated by Superlinked, businesses are able to harness the full power of their complex data, delivering high-quality applications and achieving the promise of GenAI.
In summary, the combination of Superlinked and MongoDB Atlas offers a clear path to building and deploying high-quality GenAI-powered applications. By addressing the inherent challenges of complex data, this partnership ensures that enterprises can move beyond the POC and MVP stages, delivering real value to their operations and customers.
MongoDB’s partnership with Superlinked aims to make it easier for customers to create and maintain entity-level and sub-entity-level vector embeddings for enterprise retrieval augmented generation and other use cases, including analytics or more standard semantic search and recommendation systems.
‍
Below you’ll find a step-by-step guide for building your first simple application with Superlinked, using Atlas as the vector store and search solution. This Semantic Search application allows users to perform a free text search within a database of product reviews and demonstrates how combining the unstructured text of the review with the star ratings of the product embedded as a numeric value in the same vector space delivers higher-quality and more relevant results.
You can find a complete example here, and as always, refer to the official README for the latest details.
%pip install superlinked
# we are going to create 2 representations of the data
## 1. separate text and rating for multimodal superlinked embeddings
## 2. full_review_as_text for LLM embedding of stringified review and rating
@schema
class Review:
    id: IdField
    review_text: String
    rating: Integer
    full_review_as_text: String

# instantiate the schema so its fields can be referenced below
review = Review()
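The `full_review_as_text` field holds the review and its rating folded into one string. A minimal sketch of how such a field could be built is shown below; the exact sentence template, the helper name, and the two sample rows are assumptions for illustration, not the format used in the original dataset.

```python
def stringify_review(review_text: str, rating: int) -> str:
    """Fold the rating into the text so a text-only model can embed both."""
    return f"{review_text} Rating: {rating} out of 5 stars."


# Illustrative rows using the field names from the Review schema.
rows = [
    {"id": 1, "review_text": "Sturdy and well made.", "rating": 5},
    {"id": 2, "review_text": "Broke after a week.", "rating": 1},
]

for row in rows:
    row["full_review_as_text"] = stringify_review(row["review_text"], row["rating"])

print(rows[0]["full_review_as_text"])
```

The same helper could be applied row by row to a DataFrame before ingestion if the dataset does not already contain this column.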
‍
# Embed review data separately
review_text_space = TextSimilaritySpace(
    text=review.review_text, model="all-MiniLM-L6-v2")
rating_maximizer_space = NumberSpace(
    review.rating, min_value=1, max_value=5, mode=Mode.MAXIMUM)
## Embed the full review as text
full_review_as_text_space = TextSimilaritySpace(
    text=review.full_review_as_text, model="all-MiniLM-L6-v2")
# Combine spaces as vector parts to an index.
## Create one for the stringified review
naive_index = Index([full_review_as_text_space])
## and one for the structured multimodal embeddings
advanced_index = Index([review_text_space, rating_maximizer_space])
‍
openai_config = OpenAIClientConfig(api_key=userdata.get("openai_api_key"), model="gpt-4o")
# Define your query using dynamic parameters for query text and weights.
## first a query on the naive index - using natural language
naive_query = (
    Query(
        naive_index,
        weights={
            full_review_as_text_space: Param('full_review_as_text_weight')
        },
    )
    .find(review)
    .similar(full_review_as_text_space.text, Param("query_text"))
    .limit(Param('limit'))
    .with_natural_query(Param("natural_query"), openai_config)
)
## and another on the advanced multimodal index - also using natural language
superlinked_query = (
    Query(
        advanced_index,
        weights={
            review_text_space: Param('review_text_weight'),
            rating_maximizer_space: Param('rating_maximizer_weight'),
        },
    )
    .find(review)
    .similar(review_text_space.text, Param("query_text"))
    .limit(Param('limit'))
    .with_natural_query(Param("natural_query"), openai_config)
)
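Note: Superlinked supports two ways of setting weights for query parts: you can pass them explicitly as query parameters (for example, review_text_weight and rating_maximizer_weight), or you can supply a natural_query and let the LLM configured in with_natural_query set them for you.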
‍
# Run the app
source: InMemorySource = InMemorySource(review, parser=DataFrameParser(schema=review))
executor = InMemoryExecutor(sources=[source], indices=[naive_index, advanced_index])
app = executor.run()
# Download dataset
data = pd.read_json("https://storage.googleapis.com/superlinked-preview-test-data/amazon_dataset_1000.jsonl",lines=True)
# Ingest data to the framework.
source.put([data])
# query that is based on the LLM embedded reviews
naive_positive_results = app.query(
    naive_query,
    natural_query='High rated quality products',
    limit=10)
naive_positive_results.to_pandas()
# query that is solely based on text ( = zero weight to star ratings)
results = app.query(superlinked_query, review_text_weight=1, rating_maximizer_weight=0, query_text='High quality products', limit=10)
results.to_pandas().head(10)
# query based on multimodal Superlinked embeddings
superlinked_positive_results = app.query(
    superlinked_query,
    natural_query='High rated quality products',
    limit=10)
superlinked_positive_results.to_pandas()
‍
# Clone the repository
git clone https://github.com/superlinked/superlinked
cd <repo-directory>/server
./tools/init-venv.sh
cd runner
source "$(poetry env info --path)/bin/activate"
cd ..
# Make sure you have your docker engine running and activate the virtual environment
./tools/deploy.py up
‍
from superlinked.framework.dsl.storage.mongo_vector_database import MongoDBVectorDatabase
vector_database = MongoDBVectorDatabase(
    "<USER>:<PASSWORD>@<HOST_URL>",
    "<DATABASE_NAME>",
    "<CLUSTER_NAME>",
    "<PROJECT_ID>",
    "<API_PUBLIC_KEY>",
    "<API_PRIVATE_KEY>",
)
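Rather than hard-coding credentials in app.py, the six settings can be read from environment variables. The sketch below is one possible approach: the environment variable names and the helper are hypothetical, and the percent-encoding step assumes the first argument ends up inside a standard MongoDB connection URI, where special characters in the user name and password must be escaped.

```python
import os
from urllib.parse import quote_plus


def load_atlas_settings(env=os.environ):
    """Return the six settings in the positional order used above.

    The variable names are hypothetical; pick whatever fits your deployment.
    """
    user = quote_plus(env["ATLAS_USER"])
    password = quote_plus(env["ATLAS_PASSWORD"])
    return (
        f"{user}:{password}@{env['ATLAS_HOST_URL']}",
        env["ATLAS_DATABASE_NAME"],
        env["ATLAS_CLUSTER_NAME"],
        env["ATLAS_PROJECT_ID"],
        env["ATLAS_API_PUBLIC_KEY"],
        env["ATLAS_API_PRIVATE_KEY"],
    )


# Example with placeholder values instead of a real environment.
example_env = {
    "ATLAS_USER": "app_user",
    "ATLAS_PASSWORD": "p@ss",
    "ATLAS_HOST_URL": "cluster0.example.mongodb.net",
    "ATLAS_DATABASE_NAME": "reviews",
    "ATLAS_CLUSTER_NAME": "Cluster0",
    "ATLAS_PROJECT_ID": "project-id",
    "ATLAS_API_PUBLIC_KEY": "public-key",
    "ATLAS_API_PRIVATE_KEY": "private-key",
}
settings = load_atlas_settings(example_env)
print(settings[0])
```

With real variables exported, the result could be unpacked into the constructor as `MongoDBVectorDatabase(*load_atlas_settings())`.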
‍
# Copy your configuration to app.py
# ...
# Create a data source to bulk load your production data.
config = DataLoaderConfig(
    "https://storage.googleapis.com/superlinked-sample-datasets/amazon_dataset_ext_1000.jsonl",
    DataFormat.JSON,
    pandas_read_kwargs={"lines": True, "chunksize": 100},
)
source = DataLoaderSource(review, config)
executor = RestExecutor(
    # Add your data source
    sources=[source],
    # Add the indices that contain your configuration
    indices=[naive_index, advanced_index],
    # Create a REST endpoint for each query.
    queries=[
        RestQuery(RestDescriptor("naive_query"), naive_query),
        RestQuery(RestDescriptor("superlinked_query"), superlinked_query),
    ],
    # Connect to MongoDB Atlas using the vector_database configured above
    vector_database=vector_database,
)
SuperlinkedRegistry.register(executor)
‍
# Trigger the data load.
curl -X POST 'http://localhost:8080/data-loader/review/run'
# Check the status of the loader.
curl -X GET 'http://localhost:8080/data-loader/review/status'
# Send your first query
curl -X POST \
'http://localhost:8080/api/v1/search/superlinked_query' \
--header 'Accept: */*' \
--header 'Content-Type: application/json' \
--data-raw '{
"natural_query": "High rated quality products",
"limit": 10
}'
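The same search request can be sent from Python using only the standard library. The sketch below mirrors the curl call above; the helper names are made up for the example, and it assumes the server replies with a JSON body.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"


def build_search_request(query_name, natural_query, limit=10, base_url=BASE_URL):
    """Build the POST request that the curl command above sends."""
    payload = {"natural_query": natural_query, "limit": limit}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/search/{query_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


def run_search(request, timeout=30):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("superlinked_query", "High rated quality products")
# Uncomment once the server is running locally:
# results = run_search(request)
```

Swapping the first argument for "naive_query" targets the other endpoint registered in app.py, which makes it easy to compare the two result sets side by side.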
‍
Congratulations! You learned how to build your first GenAI-powered application that combines numeric and unstructured data in the same embedding space to deliver high-quality results. Now you are ready to explore additional notebooks here.
‍
We are excited to see the amazing applications that you will build with Superlinked and Atlas - don’t hesitate to share your work with us.
‍
Our winning partnership is designed to empower tech teams, helping them overcome the barriers to effective GenAI implementation and achieve their goals. With Atlas and Superlinked, the future of GenAI in the enterprise is not just promising—it’s here.
‍