📋 Schema Extractor

Uses an AI model to extract structured data from unstructured text, such as emails, articles, invoices, or PDF content. You define a JSON Schema describing the fields you want; the model reads the source text and fills in the schema. The result is a clean, structured object ready for use in downstream steps.

Category: Data Retrieval & Processing  ·  Type identifier: schema_extractor

Overview

A huge amount of valuable information lives in unstructured text: customer emails, support tickets, scanned invoices, meeting notes, news articles. The Schema Extractor node bridges the gap between free-form prose and structured data your workflow can act on.

You provide two things: the source text (from a previous step or the trigger), and a JSON Schema that describes the shape of the output you want. The AI model reads the text, finds the relevant information, and returns a structured object matching your schema.

This is more specialised, and more reliable, than asking a general Agent node to extract data via JSON Mode. The Schema Extractor is purpose-built for extraction tasks, handling edge cases like missing fields, varied formatting, and ambiguous values more robustly than a general agent prompt.

Configuration

Field | Status | Description
Provider | Required | The AI Provider to use for extraction. Any provider with a capable language model works. For best results, use a model known for instruction-following (e.g. GPT-4o, Claude).
Source Text | Required | The text to extract from. Supports {{ variable }} references, for example {{ email_trigger.output.body }}, {{ pdf_reader.output.content }}, or {{ agent.output.text }}.
Schema | Required | A JSON Schema object that defines the expected output structure. Write this as a valid JSON object in the field. The model will attempt to fill every property defined in the schema from the source text.
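
Because the Schema field must contain valid JSON, it can help to check the text before pasting it in. The following is a minimal Python sketch of such a check, not part of the node itself. The function name check_schema_text is hypothetical, and the checks for a top-level "object" type with "properties" mirror the examples on this page rather than a documented requirement.

```python
import json


def check_schema_text(schema_text: str) -> list[str]:
    """Return a list of problems found in a schema string (empty if none)."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        # Typical causes: trailing commas, single quotes, unquoted keys.
        return [f"not valid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]

    if not isinstance(schema, dict):
        return ["top level must be a JSON object"]

    problems = []
    # Assumption: schemas follow the shape used in this page's examples.
    if schema.get("type") != "object":
        problems.append('top-level "type" should be "object"')

    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        problems.append('"properties" should be a non-empty object')
        properties = {}

    required = schema.get("required", [])
    if not isinstance(required, list):
        problems.append('"required" should be an array of field names')
        required = []
    for name in required:
        if name not in properties:
            problems.append(f'required field "{name}" is not defined in "properties"')

    return problems
```

An empty list means the text parsed cleanly and matches the shape assumed above; anything else is a human-readable hint about what to fix before saving the node.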

Output Data

Field | Type | Description
Schema fields | varies | Each property defined in your schema becomes a top-level field in the output object. For example, if your schema defines vendor, total, and line_items, those three keys will be present in the output.
raw_text | string | The original source text that was passed to the extractor. Useful for debugging or audit trails.

// If your schema defines: vendor, invoice_number, total, line_items
{{ extract.output.vendor }}
{{ extract.output.invoice_number }}
{{ extract.output.total }}
{{ extract.output.line_items }}

// Access the source text that was processed
{{ extract.output.raw_text }}
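
To make the output shape concrete, here is an illustrative Python sketch of the object a node named extract might produce for that schema, and of how a dotted reference such as extract.output.vendor maps onto it. The sample values, the context dictionary, and the resolve helper are all hypothetical; the actual template engine is not described here and may behave differently.

```python
from functools import reduce

# Hypothetical workflow context: one entry per node, each exposing an "output" object.
context = {
    "extract": {
        "output": {
            # Fields defined by the schema become top-level keys.
            "vendor": "Acme Ltd",
            "invoice_number": "INV-0042",
            "total": 122.0,
            "line_items": [{"description": "Widget", "amount": 100.0}],
            # The source text is returned alongside the extracted fields.
            "raw_text": "Invoice INV-0042 from Acme Ltd ...",
        }
    }
}


def resolve(path: str, ctx: dict):
    """Follow a dotted path such as 'extract.output.vendor' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), ctx)


print(resolve("extract.output.vendor", context))
print(sorted(context["extract"]["output"]))
```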

Example Usage

Extracting invoice data from a PDF

A workflow receives a PDF invoice via a file-watch trigger, extracts text from it, and then uses the Schema Extractor to pull out the key financial fields.

  1. Read the PDF content. Use a Local File node (or Google Drive node) with the PDF path from the trigger. The output will contain the extracted text.
  2. Add a Schema Extractor node. Choose a Provider with a capable model, and set Source Text to {{ read_pdf.output.content }} (this assumes the file-reading node from step 1 is named read_pdf).
  3. Define the schema:
    {
      "type": "object",
      "properties": {
        "vendor_name": { "type": "string", "description": "Name of the company issuing the invoice" },
        "invoice_number": { "type": "string", "description": "Invoice reference number" },
        "invoice_date": { "type": "string", "description": "Date of the invoice (ISO format if possible)" },
        "due_date": { "type": "string", "description": "Payment due date" },
        "subtotal": { "type": "number", "description": "Amount before tax" },
        "tax": { "type": "number", "description": "Tax amount" },
        "total": { "type": "number", "description": "Total amount due" },
        "line_items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string" },
              "quantity": { "type": "number" },
              "unit_price": { "type": "number" },
              "amount": { "type": "number" }
            }
          }
        }
      },
      "required": ["vendor_name", "invoice_number", "total"]
    }
  4. Use the extracted data. Reference {{ extract_invoice.output.total }}, {{ extract_invoice.output.vendor_name }}, etc. in downstream steps to create database records, send notifications, or trigger approval workflows (this assumes the Schema Extractor node is named extract_invoice).
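
Since the values come from a language model, a downstream step may want to sanity-check them before creating records. The sketch below shows one way to do that in Python, assuming the extracted object is available as a plain dictionary with the field names from the schema above. The function name invoice_problems and the 0.01 tolerance are illustrative choices, not part of the node.

```python
import math


def invoice_problems(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of consistency problems in an extracted invoice (empty if none)."""
    problems = []

    # Fields listed under "required" in the schema should be present and non-empty.
    for key in ("vendor_name", "invoice_number", "total"):
        if invoice.get(key) in (None, ""):
            problems.append(f"missing required field: {key}")

    subtotal = invoice.get("subtotal")
    tax = invoice.get("tax")
    total = invoice.get("total")

    # Only compare amounts when all three were actually extracted.
    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            problems.append("subtotal + tax does not match total")

    # Line item amounts should add up to the subtotal when both are available.
    amounts = [item.get("amount") for item in invoice.get("line_items") or []]
    if subtotal is not None and amounts and None not in amounts:
        if not math.isclose(sum(amounts), subtotal, abs_tol=tolerance):
            problems.append("line item amounts do not add up to subtotal")

    return problems
```

An invoice that fails these checks could be sent to a manual approval step instead of being written straight to a database.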

Parsing a support email

// Schema for extracting structured data from a customer support email:
{
  "type": "object",
  "properties": {
    "customer_name": { "type": "string" },
    "product_affected": { "type": "string" },
    "issue_summary": { "type": "string" },
    "urgency": { "type": "string", "enum": ["low", "medium", "high", "critical"] },
    "contact_email": { "type": "string" }
  }
}

// Then reference in a routing step:
{{ parse_email.output.urgency }}
{{ parse_email.output.issue_summary }}
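
The routing decision itself can be sketched in a few lines of Python. This is an illustration only: the queue names and the route_ticket function are hypothetical, and it assumes the parsed email is available as a dictionary. Because this schema declares no required fields, the sketch treats a missing or unexpected urgency value as a case for manual review rather than guessing.

```python
# The values allowed by the "enum" in the schema above.
ALLOWED_URGENCY = ("low", "medium", "high", "critical")

# Hypothetical queue names, chosen only for this example.
QUEUE_BY_URGENCY = {
    "low": "backlog",
    "medium": "standard",
    "high": "priority",
    "critical": "on_call",
}


def route_ticket(parsed: dict) -> str:
    """Pick a queue from the extracted urgency, falling back to manual review."""
    urgency = str(parsed.get("urgency") or "").strip().lower()
    if urgency not in ALLOWED_URGENCY:
        # Missing or out-of-enum value: let a person decide.
        return "manual_review"
    return QUEUE_BY_URGENCY[urgency]
```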

Tips & Notes

Related Nodes