How to extract entities and structured data with SIE
SIE’s extract primitive pulls structured information from unstructured content. It handles named entity recognition (NER), relation extraction, text classification, and vision tasks including captioning and OCR. Models run on your own infrastructure with zero per-call API costs.
Python:

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    text = Item(text="Apple CEO Tim Cook announced the iPhone 16 in Cupertino.")
    result = client.extract(
        "urchade/gliner_multi-v2.1",
        text,
        labels=["person", "organization", "product", "location"]
    )
    for entity in result["entities"]:
        print(f"{entity['label']}: {entity['text']} (score: {entity['score']:.2f})")
    # organization: Apple (score: 0.95)
    # person: Tim Cook (score: 0.93)
    # product: iPhone 16 (score: 0.89)
    # location: Cupertino (score: 0.87)

TypeScript:

    import { SIEClient } from "@superlinked/sie-sdk";

    const client = new SIEClient("http://localhost:8080");

    const result = await client.extract(
      "urchade/gliner_multi-v2.1",
      { text: "Apple CEO Tim Cook announced the iPhone 16 in Cupertino." },
      { labels: ["person", "organization", "product", "location"] },
    );
    for (const entity of result.entities) {
      console.log(`${entity.label}: ${entity.text} (score: ${entity.score.toFixed(2)})`);
    }
    // organization: Apple (score: 0.95)
    // person: Tim Cook (score: 0.93)
    // product: iPhone 16 (score: 0.89)
    // location: Cupertino (score: 0.87)

    await client.close();

For model recommendations, see the full model catalog.
Input Types
Item accepts three input modes depending on the model:

- text: plain string. Used by GLiNER, GLiREL, GLiClass, and the rest of the text-only extractors.
- images: list of image bytes (or {data, format} dicts in Python). Used by Florence-2, Donut, GroundingDINO, OWL-v2, and image-input OCR models like zai-org/GLM-OCR, lightonai/LightOnOCR-2-1B, and PaddlePaddle/PaddleOCR-VL-1.5. See Vision Tasks and OCR.
- document: raw file bytes (PDF, DOCX, HTML, MD, TXT, RTF, ODT, PPTX, XLSX, CSV). Used by the multi-page docling parser. The Python SDK auto-detects the format from a path suffix; bytes-based callers pass format explicitly. See OCR → Docling.
Named Entity Recognition
GLiNER models support zero-shot NER: define any entity types you need at query time, with no predefined schema.
Custom Entity Types
Python:

    result = client.extract(
        "urchade/gliner_multi-v2.1",
        Item(text="The merger between Acme Corp and Beta Inc requires FTC approval."),
        labels=["company", "regulatory_body", "legal_action"]
    )
    for entity in result["entities"]:
        print(f"{entity['label']}: {entity['text']}")
    # company: Acme Corp
    # company: Beta Inc
    # regulatory_body: FTC

TypeScript:

    const result = await client.extract(
      "urchade/gliner_multi-v2.1",
      { text: "The merger between Acme Corp and Beta Inc requires FTC approval." },
      { labels: ["company", "regulatory_body", "legal_action"] },
    );
    for (const entity of result.entities) {
      console.log(`${entity.label}: ${entity.text}`);
    }
    // company: Acme Corp
    // company: Beta Inc
    // regulatory_body: FTC

Entity Positions
Each entity includes character positions for highlighting or downstream processing:

Python:

    result = client.extract(
        "urchade/gliner_multi-v2.1",
        Item(text="Tim Cook works at Apple."),
        labels=["person", "organization"]
    )
    for entity in result["entities"]:
        print(f"{entity['label']}: '{entity['text']}' [{entity['start']}:{entity['end']}]")
    # person: 'Tim Cook' [0:8]
    # organization: 'Apple' [18:23]

TypeScript:

    const result = await client.extract(
      "urchade/gliner_multi-v2.1",
      { text: "Tim Cook works at Apple." },
      { labels: ["person", "organization"] },
    );
    for (const entity of result.entities) {
      console.log(`${entity.label}: '${entity.text}' [${entity.start}:${entity.end}]`);
    }
    // person: 'Tim Cook' [0:8]
    // organization: 'Apple' [18:23]

Batch Extraction
Python:

    documents = [
        Item(id="doc-1", text="Microsoft acquired Activision for $69 billion."),
        Item(id="doc-2", text="Sundar Pichai leads Google's AI initiatives."),
    ]
    results = client.extract(
        "urchade/gliner_multi-v2.1",
        documents,
        labels=["person", "organization", "money"]
    )

TypeScript:

    const documents = [
      { id: "doc-1", text: "Microsoft acquired Activision for $69 billion." },
      { id: "doc-2", text: "Sundar Pichai leads Google's AI initiatives." },
    ];
    const results = await client.extract(
      "urchade/gliner_multi-v2.1",
      documents,
      { labels: ["person", "organization", "money"] },
    );

Response Format
The ExtractResult contains different fields depending on the extraction type used:

| Field | Type | When present |
|---|---|---|
| id | str or None | Always (if provided in input) |
| entities | list[Entity] | NER models (GLiNER) |
| relations | list[Relation] | Relation extraction (GLiREL) |
| classifications | list[Classification] | Classification models (GLiClass) |
| objects | list[DetectedObject] | Object detection (GroundingDINO, OWLv2) |
| data | dict | Document/composite extractors (Docling, Donut, document-mode Florence-2) |
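
Because the populated fields vary by model, downstream code typically checks which ones are present before reading them. The following is a minimal sketch of that pattern, not part of the SDK: it assumes each result behaves like a plain dict keyed by the field names in the table above, and the helper name summarize_results and the sample data (including the scores) are hypothetical.

```python
from typing import Any

# Field names taken from the Response Format table.
LIST_FIELDS = ("entities", "relations", "classifications", "objects")


def summarize_results(results: list[dict[str, Any]]) -> dict[str, dict[str, int]]:
    """Count the items in each populated list field, keyed by result id.

    Assumption: every result is dict-like. Results without an id fall back
    to a key derived from their position in the batch.
    """
    summary: dict[str, dict[str, int]] = {}
    for index, result in enumerate(results):
        key = result.get("id") or f"item-{index}"
        # Only report fields that are present and non-empty.
        counts = {name: len(result[name]) for name in LIST_FIELDS if result.get(name)}
        summary[key] = counts
    return summary


# Hypothetical batch output shaped like the documented fields.
sample_results = [
    {
        "id": "doc-1",
        "entities": [
            {"text": "Microsoft", "label": "organization", "score": 0.9, "start": 0, "end": 9},
            {"text": "Activision", "label": "organization", "score": 0.9, "start": 19, "end": 29},
        ],
    },
    {"id": None, "entities": []},
]

summary = summarize_results(sample_results)
print(summary)
```

Keying by id rather than by list position keeps the summary stable if a batch is later filtered or reordered, which is the main reason to set id on each input item.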
Entity Fields
| Field | Type | Description |
|---|---|---|
| text | str | Extracted text span |
| label | str | Entity type label |
| score | float | Confidence score from 0 to 1 |
| start | int | Start character position |
| end | int | End character position |
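
To show how these fields are typically used together, here is a hedged sketch that filters entities by score and wraps each span in a marker using start and end. It assumes entities are dict-like and that end is exclusive, which matches the [0:8] and [18:23] offsets shown earlier. The helper names filter_by_score and highlight, the bracket marker format, and the sample scores are hypothetical.

```python
def filter_by_score(entities, min_score=0.5):
    """Keep only entities whose confidence is at least min_score."""
    return [entity for entity in entities if entity["score"] >= min_score]


def highlight(text, entities):
    """Wrap each entity span in a [label: span] marker.

    Assumption: start is inclusive and end is exclusive, so
    text[start:end] equals the entity text.
    """
    parts = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["start"]):
        if entity["start"] < cursor:
            continue  # skip spans that overlap one already emitted
        parts.append(text[cursor:entity["start"]])
        span = text[entity["start"]:entity["end"]]
        parts.append(f"[{entity['label']}: {span}]")
        cursor = entity["end"]
    parts.append(text[cursor:])
    return "".join(parts)


# Sample data shaped like the Entity fields; the scores are made up.
text = "Tim Cook works at Apple."
entities = [
    {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
    {"text": "Apple", "label": "organization", "score": 0.95, "start": 18, "end": 23},
]

highlighted = highlight(text, entities)
confident = filter_by_score(entities, min_score=0.94)
print(highlighted)
```

Sorting by start and tracking a cursor avoids corrupting offsets, since the original text is sliced rather than modified in place.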
HTTP API
The server defaults to msgpack. To send and receive JSON instead, set the Content-Type and Accept headers to application/json:

    curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "items": [{"text": "Tim Cook is the CEO of Apple."}],
        "params": {"labels": ["person", "organization"]}
      }'

See the HTTP API Reference.
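
For callers who want to reach the endpoint without the SDK, the same request can be built with Python's standard library. This is a sketch under assumptions: it mirrors the URL, headers, and body of the curl command, the helper name build_extract_request is hypothetical, and the response shape is not verified here. The send step is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_extract_request(base_url, model, texts, labels):
    """Build a JSON POST request mirroring the curl example."""
    payload = {
        "items": [{"text": text} for text in texts],
        "params": {"labels": labels},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/extract/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",  # ask for JSON instead of msgpack
        },
        method="POST",
    )


request = build_extract_request(
    "http://localhost:8080",
    "urchade/gliner_multi-v2.1",
    ["Tim Cook is the CEO of Apple."],
    ["person", "organization"],
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.loads(response.read())
```

Separating request construction from sending makes the payload easy to inspect or unit test without a live server.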
Framework Integrations
Extraction is available through all major framework integrations, not just the native SDK:
| Framework | Component | Returns |
|---|---|---|
| LangChain | SIEExtractor | Dict with entities, relations, classifications, objects |
| LlamaIndex | create_sie_extractor_tool | Dict with entities, relations, classifications, objects |
| Haystack | SIEExtractor | Typed outputs: Entity, Relation, Classification, DetectedObject |
| DSPy | SIEExtractor | dspy.Prediction with extraction fields |
| CrewAI | SIEExtractorTool | Formatted string with all extraction types |
Frequently Asked Questions
What is zero-shot NER?
Zero-shot NER means you can define your entity types at query time without fine-tuning a model. GLiNER models like urchade/gliner_multi-v2.1 accept arbitrary label strings and extract matching spans from text. There is no fixed list of entity types.
Does SIE support relation extraction?
Yes. GLiREL models extract relationships between entities (for example, “CEO of”, “acquired by”). See Relations and Classification.
Can SIE extract data from PDFs and images?
Yes. SIE supports four dedicated OCR models: zai-org/GLM-OCR, lightonai/LightOnOCR-2-1B, PaddlePaddle/PaddleOCR-VL-1.5, and docling (multi-page PDF/DOCX/HTML). They convert documents to Markdown while preserving tables and layout. Donut and Florence-2 are also available for image captioning and visual QA. See OCR and Vision Tasks.
Which model should I use for entity extraction?
urchade/gliner_multi-v2.1 is a strong default for multilingual NER. It handles zero-shot extraction across 100+ languages. Browse all extraction models in the model catalog.