📋 Schema Extractor

Uses an AI model to extract structured data from unstructured text — such as emails, articles, invoices, or PDF content. You define a JSON Schema describing the fields you want; the model reads the source text and fills in the schema. The result is a clean, structured object ready for use in downstream steps.

Category: Data Retrieval & Processing · Type identifier: schema_extractor

Overview

A huge amount of valuable information lives in unstructured text: customer emails, support tickets, scanned invoices, meeting notes, news articles. The Schema Extractor node bridges the gap between free-form prose and structured data your workflow can act on.

You provide two things: the source text (from a previous step or the trigger), and a JSON Schema that describes the shape of the output you want. The AI model reads the text, finds the relevant information, and returns a structured object matching your schema.

This is more specialised — and more reliable — than asking a general Agent node to extract data via JSON Mode. The Schema Extractor is purpose-built for extraction tasks, handling edge cases like missing fields, varied formatting, and ambiguous values more robustly than a general agent prompt.

Configuration

Field	Status	Description
Provider	Required	The AI Provider to use for extraction. Any provider with a capable language model works. For best results, use a model known for instruction-following (e.g. GPT-4o, Claude).
Source Text	Required	The text to extract from. Supports `{{ variable }}` references — for example `{{ email_trigger.output.body }}`, `{{ pdf_reader.output.content }}`, or `{{ agent.output.text }}`.
Schema	Required	A JSON Schema object that defines the expected output structure. Write this as a valid JSON object in the field. The model will attempt to fill every property defined in the schema from the source text.

Output Data

Field	Type	Description
Schema fields	varies	Each property defined in your schema becomes a top-level field in the output object. For example, if your schema defines `vendor`, `total`, and `line_items`, those three keys will be present in the output.
`raw_text`	string	The original source text that was passed to the extractor. Useful for debugging or audit trails.

    // If your schema defines: vendor, invoice_number, total, line_items
    {{ extract.output.vendor }}
    {{ extract.output.invoice_number }}
    {{ extract.output.total }}
    {{ extract.output.line_items }}

    // Access the source text that was processed
    {{ extract.output.raw_text }}
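To make the shape of the output object concrete, here is a minimal Python sketch that models the references above as dotted-path lookups into nested dictionaries. The `resolve` helper, the sample values, and the rule that unknown paths resolve to `None` are all assumptions made for illustration; they are not the platform's actual template engine.

```python
from typing import Any


def resolve(path: str, context: dict[str, Any]) -> Any:
    """Follow a dotted path such as 'extract.output.vendor' through nested dicts."""
    value: Any = context
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # assumption: an unknown path resolves to nothing
        value = value[key]
    return value


# Made-up sample: schema fields plus raw_text sit side by side in the output.
context = {
    "extract": {
        "output": {
            "vendor": "Example Supplies Ltd",
            "invoice_number": "INV-0001",
            "total": 120.0,
            "line_items": [{"description": "Widget", "amount": 120.0}],
            "raw_text": "Invoice INV-0001 from Example Supplies Ltd ...",
        }
    }
}

print(resolve("extract.output.vendor", context))
print(resolve("extract.output.raw_text", context))
```

The point of the sketch is only that schema-defined keys and `raw_text` are siblings at the top level of the output.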

Example Usage

Extracting invoice data from a PDF

A workflow receives a PDF invoice via a file-watch trigger, extracts text from it, and then uses the Schema Extractor to pull out the key financial fields.

Read the PDF content. Use a Local File node (or Google Drive node) with the PDF path from the trigger. The output will contain the extracted text.
Add a Schema Extractor node. Set Provider to an AI Provider with a capable model. Set Source Text to `{{ read_pdf.output.content }}`.
Define the schema:
    {
      "type": "object",
      "properties": {
        "vendor_name": { "type": "string", "description": "Name of the company issuing the invoice" },
        "invoice_number": { "type": "string", "description": "Invoice reference number" },
        "invoice_date": { "type": "string", "description": "Date of the invoice (ISO format if possible)" },
        "due_date": { "type": "string", "description": "Payment due date" },
        "subtotal": { "type": "number", "description": "Amount before tax" },
        "tax": { "type": "number", "description": "Tax amount" },
        "total": { "type": "number", "description": "Total amount due" },
        "line_items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string" },
              "quantity": { "type": "number" },
              "unit_price": { "type": "number" },
              "amount": { "type": "number" }
            }
          }
        }
      },
      "required": ["vendor_name", "invoice_number", "total"]
    }
Use the extracted data. Reference `{{ extract_invoice.output.total }}`, `{{ extract_invoice.output.vendor_name }}`, etc. in downstream steps to create database records, send notifications, or trigger approval workflows.
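Before extracted figures reach a database or an approval workflow, a downstream step may want to check that they are internally consistent. The Python sketch below is one way to do that, assuming the output object is available as a dictionary with the field names from the schema above. The function name, the tolerance, and the sample invoice are hypothetical.

```python
import math
from typing import Any


def check_invoice_totals(invoice: dict[str, Any], tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means no mismatch was found."""
    problems: list[str] = []
    subtotal = invoice.get("subtotal")
    tax = invoice.get("tax")
    total = invoice.get("total")
    line_items = invoice.get("line_items") or []

    # Compare line items to the subtotal only when every amount is present.
    amounts = [item.get("amount") for item in line_items]
    if subtotal is not None and amounts and all(a is not None for a in amounts):
        if not math.isclose(sum(amounts), subtotal, abs_tol=tolerance):
            problems.append("line items do not add up to subtotal")

    # Optional fields may be null, so skip the check if any figure is missing.
    if subtotal is not None and tax is not None and total is not None:
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            problems.append("subtotal + tax does not equal total")
    return problems


# Made-up sample shaped like the invoice schema.
sample_invoice = {
    "vendor_name": "Example Supplies Ltd",
    "invoice_number": "INV-0001",
    "subtotal": 100.0,
    "tax": 20.0,
    "total": 120.0,
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 50.0, "amount": 100.0}
    ],
}

print(check_invoice_totals(sample_invoice))
```

A non-empty result is a reasonable signal to send the invoice for manual review rather than writing it straight to a record.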

Parsing a support email

    // Schema for extracting structured data from a customer support email:
    {
      "type": "object",
      "properties": {
        "customer_name": { "type": "string" },
        "product_affected": { "type": "string" },
        "issue_summary": { "type": "string" },
        "urgency": { "type": "string", "enum": ["low", "medium", "high", "critical"] },
        "contact_email": { "type": "string" }
      }
    }

    // Then reference in a routing step:
    {{ parse_email.output.urgency }}
    {{ parse_email.output.issue_summary }}
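The routing step mentioned above could be sketched in Python as follows, assuming the parsed email is available as a dictionary. The urgency values come from the schema's `enum`; the queue names and the `manual_triage` fallback are hypothetical and only illustrate handling a null or unexpected value.

```python
from typing import Any

# Hypothetical destinations keyed by the urgency values in the schema's enum.
ROUTES = {
    "critical": "on_call",
    "high": "priority_queue",
    "medium": "standard_queue",
    "low": "standard_queue",
}


def route_ticket(parsed: dict[str, Any]) -> str:
    """Pick a destination from the extracted urgency, with a safe fallback."""
    urgency = parsed.get("urgency")
    if not isinstance(urgency, str):
        return "manual_triage"  # null or malformed urgency
    return ROUTES.get(urgency.strip().lower(), "manual_triage")


print(route_ticket({"urgency": "critical", "issue_summary": "Checkout is down"}))
```

None of the fields in this schema are marked as required, so the fallback branch matters: an email that never states its urgency should still end up somewhere a person will see it.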

Tips & Notes

Schema Extractor vs. Agent JSON Mode. Both can extract structured data, but the Schema Extractor is the right choice for extraction tasks. It includes system instructions optimised for extraction, handles missing or ambiguous fields more gracefully, and is more reliable when the schema is complex. Use Agent JSON Mode when you also need reasoning, summarisation, or generation alongside extraction.
Use description hints in your schema. Adding a description property to each schema field significantly improves extraction accuracy. The model uses these hints to understand what each field means and where to look for it in the text.
Mark only truly required fields as required. If a field might legitimately not appear in every document (e.g. a discount on an invoice), do not mark it as required. Optional fields that are absent will be returned as null, which is easier to handle downstream than a failed extraction.
Test with varied samples. Different instances of the same document type (different vendors, different email styles) can format information very differently. Test your schema with several representative examples before deploying.
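Two of the tips above lend themselves to small helpers. The Python sketch below lists schema properties that have no `description` hint, and fills in defaults for optional fields that came back as null. Both function names are hypothetical, and the walker only covers the `properties` and array `items` keywords used in the examples on this page, not the full JSON Schema vocabulary.

```python
from typing import Any


def missing_descriptions(schema: dict[str, Any], prefix: str = "") -> list[str]:
    """List property paths in a schema object that lack a 'description'."""
    missing: list[str] = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if "description" not in spec:
            missing.append(path)
        if spec.get("type") == "object":
            missing.extend(missing_descriptions(spec, prefix=f"{path}."))
        elif spec.get("type") == "array" and isinstance(spec.get("items"), dict):
            missing.extend(missing_descriptions(spec["items"], prefix=f"{path}[]."))
    return missing


def with_defaults(output: dict[str, Any], defaults: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the output with null or absent fields replaced by defaults."""
    merged = dict(output)
    for key, default in defaults.items():
        if merged.get(key) is None:
            merged[key] = default
    return merged


# Made-up schema fragment with one documented and two undocumented fields.
schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string", "description": "Name of the issuing company"},
        "discount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"amount": {"type": "number"}}},
        },
    },
}

print(missing_descriptions(schema))
print(with_defaults({"vendor_name": "Example Supplies Ltd", "discount": None}, {"discount": 0.0}))
```

Running the first helper over a schema before deployment is a cheap way to spot fields where the model has no hint to work from.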

Related Nodes

Agent — use JSON Mode for extraction tasks that also require reasoning or generation.
Transform — reshape the extractor's output further if needed.