5 AI Agents You Can Build in 5 Minutes

You can build a working AI agent in about five minutes, because most of what an agent does is inference, not glue code. An agent perceives something (a document, an image, an order), retrieves the right context, makes a decision, and acts. Each of those steps is handled by its own model, most of them small and specialized: an encoder for retrieval, a reranker for scoring, an extractor for turning messy input into structured fields, and an open LLM to run the loop. The slow part has always been standing up a separate server for every one of those models.

The Superlinked Inference Engine (SIE) collapses that into one step. It serves 85+ open-source models behind a single API and a single Docker container, exposed through three primitives (encode, score, and extract) plus generation for the agent loop itself. One SIEClient, one endpoint, every model your agent needs. That is why these five projects go from clone to running so quickly. Realistically, the slowest step is the first model download, not the wiring.

Every agent below is a self-contained example from the SIE repo. Clone it, run it, and read the pipeline. Here are five of the most interesting ones to start with.

1. A shopping search agent (the literal five-minute build)

TL;DR: Out-search Amazon on a laptop in three function calls.

The fastest place to start is a full Amazon-style product search engine that runs on a laptop in about five minutes. It uses all three SIE primitives through three SDK calls: extract pulls structured attributes out of raw product text, encode turns descriptions into embeddings for semantic retrieval, and score reranks the candidates so the best matches land on top.

This is the pattern every other agent on this list reuses, so it is worth building first. Once you have seen extract-then-encode-then-score working end to end against real product data, the rest is variation on a theme.
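To make the extract-then-encode-then-score shape concrete, here is a minimal sketch in plain Python. It does not use the SIE SDK: `extract`, `encode`, and `cosine` below are toy stand-ins (keyword matching and bag-of-words vectors) invented for illustration, and the catalog is made up. Only the three-stage flow mirrors the pattern described above.

```python
import math
from collections import Counter

# Made-up catalog for illustration only.
CATALOG = [
    "red running shoes",
    "blue running shoes",
    "red leather wallet",
    "black office chair",
]

def extract(text):
    """Toy stand-in for extract: pull a 'color' attribute out of raw text."""
    colors = {"red", "blue", "black"}
    found = [word for word in text.lower().split() if word in colors]
    return {"color": found[0] if found else None}

def encode(text):
    """Toy stand-in for encode: word counts instead of a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, catalog, shortlist=3, top_k=2):
    query_attrs = extract(query)
    query_vec = encode(query)

    # Stage 1 (retrieval): cheap vector similarity over the whole catalog.
    candidates = sorted(
        catalog, key=lambda p: cosine(query_vec, encode(p)), reverse=True
    )[:shortlist]

    # Stage 2 (rerank): a stricter score over the shortlist only, here
    # similarity plus a bonus when the extracted color matches the query.
    def rerank_score(product):
        same_color = (
            query_attrs["color"] is not None
            and extract(product)["color"] == query_attrs["color"]
        )
        return cosine(query_vec, encode(product)) + (1.0 if same_color else 0.0)

    return sorted(candidates, key=rerank_score, reverse=True)[:top_k]

results = search("red shoes for running", CATALOG)
print(results)
```

In this toy run the rerank stage moves the red wallet above the blue shoes because of the attribute match. That is the point of the second stage: retrieval finds plausible candidates, and the scorer decides their final order.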

Build it: Self-hosted product search in 5 minutes.

2. A fraud-screening agent that gates real money

TL;DR: A bouncer for your checkout button that catches fraud before the money moves.

This agent takes an action rather than just answering a question. It sits inside a Stripe Link checkout and scores every incoming order against a corpus of past fraud patterns before the Stripe PaymentIntent is ever created. The risk band it produces annotates the payment button in the same UI, in the same request.

Under the hood it chains all three primitives: extract reads the order, encode represents it, and score compares it against known fraud signatures. Because the decision happens before money moves, it is a clean example of perception feeding a real-time action, which is the part of “agent” that most demos skip.
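As a rough illustration of how similarity scores could become a risk band that gates the payment step, consider the sketch below. The thresholds, the band names, and the rule "block only on high" are all assumptions for illustration. The example repo may band and gate differently, and no Stripe call is made here.

```python
def risk_band(similarities, medium=0.5, high=0.8):
    """Map an order's similarity scores against known fraud patterns to a band.

    `medium` and `high` are hypothetical thresholds, not values from the example.
    """
    worst = max(similarities, default=0.0)  # the closest fraud pattern decides
    if worst >= high:
        return "high"
    if worst >= medium:
        return "medium"
    return "low"

def should_create_payment(band):
    """Assumed gate: only proceed to payment creation when the band is not high."""
    return band != "high"

band = risk_band([0.12, 0.31, 0.86])
print(band, should_create_payment(band))
```

The design point is ordering: the band is computed before any payment object exists, so a "high" result can stop the flow instead of flagging it after the fact.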

Build it: A Stripe Link checkout with an SIE fraud-risk gate.

3. A compliance research agent that keeps data in your cloud

TL;DR: Answer regulatory questions over private docs without the data ever leaving the building.

If your agent has to reason over regulated or private documents, this one is the template. It is a compliance RAG pipeline built on a domain-tuned LoRA encoder and a custom cross-encoder that reranks and prunes context in a single forward pass, all served from one SIE cluster.

Two things make it stand out for enterprise use. First, the LoRA adapter is hot-loaded at request time, so domain specialization does not mean a separate deployment. Second, nothing leaves your infrastructure, because SIE runs on your own GPUs rather than a third-party API. For legal, finance, and healthcare teams, that combination is usually the whole reason to self-host.

Build it: Private fine-tuned compliance RAG.

4. A multimodal sommelier agent

TL;DR: Point your phone at a bottle and impress your in-laws with AI-powered wine pairings.

Here is the fun one. This agent pairs preference-based wine retrieval with OCR-based label detection, so a user can point a camera at a bottle and get recommendations back. It wires extract (reading the label), encode (representing taste preferences and descriptions), and score (ranking pairings) into one user-facing flow.

It is a compact demonstration of multimodal perception driving a recommendation: image in, structured understanding out, ranked suggestions back. Swap wine for products, parts, or documents and the same shape holds for any “snap it and find the match” agent.

Build it: Build a multimodal wine recommender with OCR.

5. A model-scout agent for ML teams

TL;DR: Find your next embedding model by describing it, not doom-scrolling a leaderboard.

The last one is an agent for builders. You describe your task in plain language and it searches across roughly 14,000 Hugging Face embedding models, ranking them by task-specific MTEB scores. Instead of scrolling a leaderboard, you ask “best model for multilingual legal retrieval” and get a reasoned shortlist.

It is also a neat illustration of SIE turning a large catalog into a searchable product experience: the same app starts in a lightweight local demo mode with no keys required, then moves to live semantic search the moment you point it at a running SIE endpoint.

Build it: Find SOTA embedding models by MTEB task.

Where to go next

Each example is licensed under Apache 2.0 and self-contained, with the models, pipeline, and evaluation documented in the README. Start SIE with one Docker command, install the SDK, and run any of them:

docker run -p 8080:8080 ghcr.io/superlinked/sie-server:latest-cpu-default

Then point your code at it with one client:

from sie_sdk import SIEClient
# Connect to the SIE server on the port published by the Docker command
client = SIEClient("http://localhost:8080")

Browse the full examples gallery or the model catalog to see what else fits your agent.

Frequently asked questions: agent inference cost and efficiency

Where do agent inference costs actually come from?

Agent costs come from multiplication, not from a single expensive call. One agent task fans out into many inference calls: embedding the query, retrieving and reranking candidates, extracting fields, then several LLM steps in the loop. With per-token pricing, every one of those steps bills on every task, so cost scales with the work the agent does rather than with the number of users. The non-generative steps (embeddings, reranking, extraction) usually run far more often than text generation, which is where spend quietly accumulates.
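The multiplication is easy to see with a back-of-the-envelope calculation. Every number below (calls per task, tokens per call, prices) is a made-up placeholder, and the sketch assumes each step is billed per token. Substitute your own measurements.

```python
# (step, calls per task, tokens per call, USD per million tokens)
# All values are hypothetical placeholders.
STEPS = [
    ("embed",   1,   50, 0.02),
    ("rerank", 20,  300, 0.05),
    ("extract", 5,  400, 0.10),
    ("llm",     4, 1500, 1.00),
]

def cost_per_task(steps):
    """Sum calls x tokens x price over every step of one agent task."""
    return sum(calls * tokens * usd / 1_000_000 for _, calls, tokens, usd in steps)

def calls_by_kind(steps, generative=("llm",)):
    """Count non-generative versus generative calls in one task."""
    non_gen = sum(calls for name, calls, _, _ in steps if name not in generative)
    gen = sum(calls for name, calls, _, _ in steps if name in generative)
    return non_gen, gen

per_task = cost_per_task(STEPS)
non_gen_calls, gen_calls = calls_by_kind(STEPS)
print(f"cost per task: ${per_task:.6f}")
print(f"calls per task: {non_gen_calls} non-generative, {gen_calls} generative")
print(f"cost per 1M tasks: ${per_task * 1_000_000:,.0f}")
```

With these placeholder numbers a single task makes 30 billable calls, 26 of them non-generative, so the bill grows with task volume and pipeline depth rather than with user count.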

Is it cheaper to self-host inference than pay per-token APIs?

It depends on volume and utilization. Self-hosting tends to win once request volume is steady and high enough to keep a GPU busy, because you pay for GPU time instead of per token. Per-token APIs stay attractive at low or spiky volume, where a dedicated GPU would sit idle. The break-even is a function of your traffic and how well you keep the hardware utilized, so the right move is to measure your highest-frequency steps first.
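A simple way to reason about the break-even is to compare a flat GPU hourly rate with what the same traffic would cost per token. The prices below are hypothetical, and the sketch assumes the GPU can actually sustain the throughput in question, which you would need to verify for your models.

```python
def break_even_tokens_per_hour(gpu_usd_per_hour, api_usd_per_million_tokens):
    """Hourly token volume at which a dedicated GPU costs the same as the API."""
    return gpu_usd_per_hour / api_usd_per_million_tokens * 1_000_000

def cheaper_option(tokens_per_hour, gpu_usd_per_hour, api_usd_per_million_tokens):
    """Compare one hour of API spend with one hour of GPU rental."""
    api_usd_per_hour = tokens_per_hour / 1_000_000 * api_usd_per_million_tokens
    return "self-host" if api_usd_per_hour > gpu_usd_per_hour else "api"

# Hypothetical prices: $0.80 per GPU hour, $0.10 per million tokens.
GPU_USD_PER_HOUR = 0.80
API_USD_PER_MTOK = 0.10

print(break_even_tokens_per_hour(GPU_USD_PER_HOUR, API_USD_PER_MTOK))
print(cheaper_option(10_000_000, GPU_USD_PER_HOUR, API_USD_PER_MTOK))
print(cheaper_option(1_000_000, GPU_USD_PER_HOUR, API_USD_PER_MTOK))
```

Spiky traffic changes the picture because the GPU bills for idle hours too. Running the comparison on average hourly volume over a full day is a fairer test than running it on peak volume.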

Which parts of an agent should you move off paid APIs first?

Move the high-frequency, non-generative steps first: embeddings, reranking, extraction, and classification. They run constantly across retrieval and guardrails, they use small models, and small open-source models are competitive with proprietary ones on these tasks. Generation can stay on a hosted LLM until the volume justifies bringing it in-house too. Shifting the encode, score, and extract workload off per-token pricing is usually the fastest cost win.

Do you need a separate GPU for every model you self-host?

No, and assuming you do is why naive self-hosting gets expensive. Single-model servers give each model a dedicated GPU pool, which wastes capacity for small-model workloads. SIE instead packs multiple models onto each GPU, loading them on demand and evicting the least-recently-used model when memory fills. An L4 (24 GB) can keep roughly 2 to 3 standard models hot at once, while all 85+ models stay available at query time.
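The load-on-demand, evict-least-recently-used policy can be sketched in a few lines. This is a generic LRU cache over a memory budget, not SIE's actual implementation. The class name, the model names, and the 8 GB model sizes are hypothetical.

```python
from collections import OrderedDict

class ModelCache:
    """Toy LRU cache that tracks which models fit in a fixed memory budget."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        # model name -> size in GB, ordered from least to most recently used
        self.loaded = OrderedDict()

    def get(self, name, size_gb):
        """Return 'hit' if the model is already hot, else load it and return 'loaded'."""
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return "hit"
        if size_gb > self.capacity_gb:
            raise ValueError(f"{name} ({size_gb} GB) cannot fit on this GPU")
        # Evict least-recently-used models until the new one fits.
        while sum(self.loaded.values()) + size_gb > self.capacity_gb:
            self.loaded.popitem(last=False)
        self.loaded[name] = size_gb
        return "loaded"

cache = ModelCache(capacity_gb=24)
for model in ["encoder-a", "reranker-b", "extractor-c"]:
    cache.get(model, size_gb=8)       # three hypothetical 8 GB models fill 24 GB
cache.get("encoder-a", size_gb=8)     # hit: encoder-a becomes most recent
cache.get("encoder-d", size_gb=8)     # evicts reranker-b, the least recently used
print(list(cache.loaded))
```

The evicted model is not gone, only cold: the next request for it pays the load time again, which is why frequently used models tend to stay resident under this policy.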

How do you improve GPU utilization for small-model inference?

Utilization rises when one cluster serves many models and many workloads instead of one service per model. On-demand loading, LRU eviction, and a load-balancing gateway keep GPUs busy across requests, and KEDA autoscaling with scale-to-zero means idle workloads stop costing money. Sharing a single cluster across teams raises average utilization further, because pipeline jobs and real-time traffic fill the gaps between each other.

How do you cut costs without adding latency?

Keep frequently used models warm and batch requests so the GPU does more useful work per call. After the first load, warm models answer in milliseconds, and a load-balancing gateway spreads traffic so no single replica becomes a bottleneck. The savings come from higher utilization, not from accepting slower responses, so latency and cost improve together rather than trading off.
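Batching is the simpler half of that advice to illustrate. The sketch below groups inputs so the model is called once per batch instead of once per item. `encode_batch` is a hypothetical stand-in for a real model call, and the batch size of 4 is arbitrary.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def encode_batch(texts, call_log):
    """Hypothetical stand-in for one model forward pass over a batch of texts."""
    call_log.append(len(texts))              # record one call and its batch size
    return [float(len(text)) for text in texts]

def encode_all(texts, batch_size, call_log):
    """Encode every text, issuing one model call per batch."""
    vectors = []
    for batch in batches(texts, batch_size):
        vectors.extend(encode_batch(batch, call_log))
    return vectors

texts = [f"document {i}" for i in range(10)]
calls = []
vectors = encode_all(texts, batch_size=4, call_log=calls)
print(len(vectors), "vectors from", len(calls), "model calls:", calls)
```

Ten inputs become three calls here instead of ten. On a real GPU each call carries fixed overhead, so fewer and fuller calls usually mean more useful work per unit of GPU time.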

Can you migrate off a hosted embedding API without rewriting your agent?

Usually yes, if the inference layer exposes a compatible API. SIE provides an OpenAI-compatible /v1/embeddings endpoint for drop-in migration and integrates with LangChain, LlamaIndex, Haystack, DSPy, and CrewAI. In most cases, swapping the serving backend is a base-URL and model-id change rather than an agent rewrite, which keeps the cost of testing self-hosted economics low.
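To show what a base-URL and model-id change looks like, the sketch below builds the same OpenAI-style embeddings request for two backends using only the standard library. It assumes the SIE server from the Docker command is reachable at http://localhost:8080 and serves the endpoint at that root. The model id `my-open-embedding-model` is a placeholder, not a real catalog entry. The request is constructed but not sent.

```python
import json
import urllib.request

def build_embeddings_request(base_url, model, texts, api_key=None):
    """Build a POST request in the OpenAI embeddings format: {"model", "input"}."""
    payload = {"model": model, "input": texts}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

texts = ["wireless noise-cancelling headphones"]

# Hosted API: needs a key.
hosted = build_embeddings_request(
    "https://api.openai.com", "text-embedding-3-small", texts, api_key="sk-..."
)

# Self-hosted: same payload shape, different base URL and model id.
# The model id is a hypothetical placeholder.
local = build_embeddings_request(
    "http://localhost:8080", "my-open-embedding-model", texts
)

print(hosted.full_url)
print(local.full_url)
```

To send either request, pass it to `urllib.request.urlopen`. In the OpenAI response format the vectors come back under `data[i].embedding`. Keep in mind that embeddings from different models are not interchangeable, so switching models also means re-embedding any stored vectors.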

Can you mix open source and proprietary models to control spend?

Yes, and a hybrid setup is often the cheapest path. Offload the high-volume embedding, reranking, and extraction work to small open-source models on your own GPUs, and keep a proprietary LLM only for the steps where it clearly earns its cost. A unified gateway lets the agent call open and hosted models through one client, so you can shift the boundary as your volume and budget change.
