---
title: Transformers
description: A comprehensive guide to Transformer neural networks, exploring their groundbreaking architecture that revolutionized natural language processing. Learn about self-attention mechanisms, encoder-decoder structures, and how Transformers overcome traditional RNN limitations. Discover their applications in language modeling, machine translation, and emerging limitations in computational complexity and interpretability.
canonical_url: https://superlinked.com/glossary/transformers
last_updated: 2026-06-02
---

# What is a Transformer?

A transformer is a neural network architecture built around self-attention, introduced in the 2017 paper "Attention Is All You Need." It processes all tokens in a sequence in parallel, unlike RNNs, which process them one step at a time, and this makes training on long sequences far more efficient on modern hardware. Transformers underpin nearly every modern large language model, embedding model, and reranker used in search and AI systems.

---

## Why do transformers matter for inference?

Every embedding model and reranker hosted on SIE is a transformer. Understanding the architecture helps you reason about:

- Why larger context windows can improve retrieval quality on long documents
- How encoder-only models (BERT-style) differ from decoder-only LLMs (GPT-style)
- What fine-tuning and LoRA adaptation actually change in the model
- Trade-offs between model size, latency, and accuracy

---

## How does a transformer work?

The transformer processes input tokens through a stack of identical layers, each containing two sub-components:

### 1. Multi-head self-attention
Self-attention allows every token to attend to every other token in the sequence simultaneously — computing a weighted average of all token representations based on relevance:

```
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
```

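Where Q (query), K (key), and V (value) are linear projections of the input. The raw attention score between two tokens is the dot product of one token's query with the other token's key, scaled by 1/√dₖ, where dₖ is the dimension of the key vectors (not the sequence length). Softmax then turns each row of scores into weights that sum to 1.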
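
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration for a single sequence and a single head, not a production implementation; the sizes and the random weights `W_q`, `W_k`, `W_v` are placeholders chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_tokens, n_tokens) query-key scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the value vectors

# Illustrative sizes and random projections (assumptions, not real model weights).
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row `i` of `weights` says how much token `i` draws from every token in the sequence, and `output[i]` is the corresponding mix of value vectors.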

**Multi-head attention** runs this process in parallel across H attention heads, each learning to attend to different aspects of the input (syntax, semantics, co-reference, etc.).
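
One common way to implement multiple heads is to project once, then split the model dimension into H equal slices and run attention on each slice independently. The sketch below assumes `d_model` is divisible by the number of heads and uses random weights purely for illustration; the output projection `W_o` that mixes the heads back together follows the original paper's design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n_tokens, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, n_tokens, d_head)
    # Concatenate the heads along the feature axis, then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

# Illustrative sizes and random weights (assumptions for the example).
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, head_weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each head gets its own attention pattern in `head_weights`, which is what lets different heads specialise.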

### 2. Feed-forward network
After attention, each token position passes through a small two-layer MLP independently:

```
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
```

This adds non-linearity and capacity beyond what attention alone can represent.

Both sub-layers use **residual connections** (add input to output) and **layer normalisation** to stabilise training.
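
As a rough sketch, the feed-forward sub-layer with its residual connection and layer normalisation might look like the following. It assumes the post-norm ordering of the original paper, `LayerNorm(x + Sublayer(x))`, and omits the learned gain and bias of layer normalisation for brevity; many later models normalise before the sub-layer instead.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    # (The learned scale and shift parameters are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, xW1 + b1)W2 + b2, applied to every position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    # Residual connection followed by layer normalisation (post-norm).
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))

# Illustrative sizes and random weights (assumptions for the example).
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 16, 64
x = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = ffn_sublayer(x, W1, b1, W2, b2)
```

Because nothing in this sub-layer looks across positions, changing one token's input leaves every other token's output unchanged; only attention moves information between tokens.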

---

## Encoder-only vs decoder-only vs encoder-decoder

| Architecture | Models | Best for |
|---|---|---|
| Encoder-only | BERT, RoBERTa, BGE, E5 | Embedding, classification, reranking |
| Decoder-only | GPT, LLaMA, Mistral | Text generation |
| Encoder-decoder | T5, BART, mT5 | Translation, summarisation |

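**For semantic search and the retrieval stage of RAG**, encoder-only models are usually the right choice for embedding and reranking: they produce rich bidirectional representations of the full input. The generation stage of RAG still relies on a decoder-only LLM. SIE hosts encoder-only embedding and reranking models.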
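
Mechanically, much of the difference between encoder-only and decoder-only models comes down to the attention mask (alongside the training objective). The sketch below is illustrative only: it uses all-zero raw scores so that the mask alone determines the resulting weights.

```python
import numpy as np

def attention_weights(scores, causal=False):
    n = scores.shape[0]
    if causal:
        # Decoder-style mask: token i may not attend to any later token j > i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal raw scores, so only the mask shapes the result

encoder_weights = attention_weights(scores)               # bidirectional: every entry is 0.25
decoder_weights = attention_weights(scores, causal=True)  # row i spreads weight over tokens 0..i
```

With the bidirectional mask every token sees the whole input, which suits embedding and reranking. With the causal mask each token sees only what came before it, which is what next-token generation requires.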

---

## What is self-attention and why is it powerful?

Self-attention solves the core limitation of RNNs: the inability to efficiently capture long-range dependencies. In an RNN, information from early tokens must pass through every intermediate step to reach later ones. In a transformer, every token attends directly to every other token — the distance between tokens doesn't matter.

This means:
- "The bank by the river" and "the bank processed the loan" — the word "bank" gets different representations based on context
- A legal clause 500 tokens earlier can directly influence the representation of a term at the end of the document

---

## Positional encoding

Transformers process all tokens in parallel and have no inherent sense of order. **Positional encodings** add position information to each token's embedding:

- **Sinusoidal (original)** — fixed mathematical functions of position
- **Learned positional embeddings** — trainable position vectors (BERT)
- **Rotary Position Embedding (RoPE)** — encodes relative positions; used in modern embedding models and LLMs
- **ALiBi** — adds a linear bias to attention scores based on distance; enables length generalisation

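RoPE is common in recent long-context encoders and in most decoder-only LLMs. Not every long-context model relies on it, though: BGE-M3 is built on XLM-RoBERTa and handles 8,192-token inputs using learned position embeddings extended to that length.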
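
To make the first option in the list concrete, here is a small sketch of the sinusoidal scheme from the original paper, where even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2i / d_model)`. The function name is made up for this example, and it assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    positions = np.arange(n_positions)[:, None]        # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # the "2i" indices: 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positions(n_positions=6, d_model=8)
# In the original design, pe[pos] is simply added to the embedding of the token at position pos.
```

Low dimensions oscillate quickly and high dimensions slowly, so each position receives a distinct pattern without any trained parameters.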

---

## Transformer scaling and embedding models

Transformer quality scales predictably with model size, data, and compute. For embedding models:

| Model size (parameters) | Example | Latency | Quality |
|---|---|---|---|
| Small (~30M) | bge-small-en | Very fast | Good |
| Base (~110M) | bge-base-en | Fast | Better |
| Large (~335M) | bge-large-en | Medium | High |
| XL (~570M) | BGE-M3 | Slower | Strong on multilingual and long inputs |

SIE's GPU batching and cluster deployment make serving larger, higher-quality models at production scale practical.
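
To see roughly where parameter counts like those in the table come from, the sketch below estimates the size of a BERT-base-like encoder (12 layers, hidden size 768, vocabulary of 30,522 tokens). It is a back-of-the-envelope estimate that counts only the large weight matrices and ignores biases, layer-norm parameters, and position embeddings; the function name is made up for this example.

```python
def estimate_encoder_params(n_layers, d_model, vocab_size, d_ff=None):
    d_ff = d_ff or 4 * d_model             # the usual 4x feed-forward expansion
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff      # W1 and W2
    embeddings = vocab_size * d_model      # token embedding table
    return n_layers * (attention + feed_forward) + embeddings

bert_base = estimate_encoder_params(n_layers=12, d_model=768, vocab_size=30522)
print(f"{bert_base / 1e6:.0f}M")  # prints "108M"
```

Per-layer cost grows with the square of the hidden size, which is why moving from a base to a large model multiplies parameters (and latency) much faster than the layer count alone would suggest.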

---

## Frequently asked questions

**What is the difference between a transformer and an LLM?**
In current usage, a large language model (LLM) is typically a very large decoder-only transformer trained on massive text corpora for next-token prediction. The transformer is the architecture; "LLM" describes a particular scale and training paradigm.

**Why are transformer embedding models better than older approaches?**
Transformers produce contextual embeddings — the representation of each word depends on the entire surrounding context. Older methods (Word2Vec, GloVe) produce static embeddings where each word always has the same vector regardless of context.

**How does LoRA work with transformers?**
LoRA (Low-Rank Adaptation) keeps the base weights frozen and adds a pair of small low-rank trainable matrices alongside selected weight matrices, most often the attention projections (Q, K, V). Only the LoRA matrices are updated during fine-tuning, which often reduces trainable parameters by 100–1000x or more, depending on the rank and on which layers are adapted. SIE supports hot-loading LoRA adapters without server restart.
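
The idea can be sketched for a single projection matrix as follows. This is an illustrative NumPy example, not how SIE or any particular library implements adapters; the sizes, rank, and `alpha` scaling value are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8     # illustrative values

W = rng.normal(size=(d_in, d_out))          # frozen base weight (e.g. a Q projection)
A = rng.normal(size=(d_in, rank)) * 0.01    # trainable low-rank factor
B = np.zeros((rank, d_out))                 # trainable, zero-initialised so the update starts at 0

def lora_forward(x, W, A, B, alpha, rank):
    # Base projection plus the scaled low-rank update; W itself is never modified.
    return x @ W + (alpha / rank) * (x @ A @ B)

x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, A, B, alpha, rank)

trainable = A.size + B.size   # 2 * 64 * 4 = 512
frozen = W.size               # 64 * 64 = 4096
```

Even at this toy scale the adapter trains 8x fewer parameters than the full matrix, and the gap widens quickly as the hidden size grows while the rank stays small. Because the update is just `A @ B`, an adapter can also be merged into `W` or swapped out without touching the base weights.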

---

## Related resources

- [What is self-hosted inference?](/glossary/what-is-self-hosted-inference)
- [What is a LoRA adapter?](/glossary/what-is-a-lora-adapter)
- [What is semantic search?](/glossary/what-is-semantic-search)
- [Browse transformer-based models on SIE](/models)
- [What is a reranker?](/glossary/what-is-a-reranker)
