Structured Outputs - JSON Schema
Source:vignettes/structured_outputs_json_schema.Rmd
NOTE: This vignette was generated by Claude 4 Sonnet, and then edited by hand. The JSON Schema code was written by hand and then adapted by the same model. If you think your code was copied, or used by Sonnet, please let us know and we will credit accordingly.
Introduction to Structured Outputs
Structured outputs ensure LLMs return data in exactly the format you need. Instead of parsing messy text, you get validated JSON matching your schema.
Why use structured outputs?
- Eliminates parsing errors and inconsistent formats
- Guarantees data types (numbers vs strings, booleans vs text)
- Enables reliable data extraction pipelines
- Reduces prompt engineering overhead
Basic workflow:
1. Define a schema using the helper functions
2. Send a request to OpenAI with the schema passed in
3. Validate the response with validate_response()
Quick Start
WARNING: if using the schema for structured outputs from OpenAI, all fields MUST be required:
library(EndpointR)
library(httr2) # req_perform(), resp_body_json()
library(purrr) # pluck()

# define the schema: every property is listed in `required`
contact_schema <- create_json_schema(
  name = "contact_info",
  schema = schema_object(
    name = schema_string("person's full name"),
    email = schema_string("email address"),
    phone = schema_string("phone number"),
    required = list("name", "email", "phone"),
    additional_properties = FALSE
  )
)

# build the request, passing the schema in
req <- oai_build_completions_request(
  input = "Am I speaking with Margaret Phillips? Yes, ok, and your email is mphil@hotmail.co.uk. Ok perfect, and your phone number? Was that 07564789789? Ok great. Just a second please Margaret, you're verified",
  schema = contact_schema
)

# send the request
resp <- req_perform(req)

# extract the message content, validate it against the schema, then tidy it
resp |>
  resp_body_json() |>
  pluck("choices", 1, "message", "content") |>
  validate_response(schema = contact_schema) |>
  tibble::as_tibble()
NOTE: The first time you send a request with a schema, it will take longer than usual. “Typical schemas take under 10 seconds to process on the first request, but more complex schemas may take up to a minute.” 1
Schema Types
EndpointR provides several schema types to match different data extraction needs. Each type enforces specific constraints and validations:
- schema_string() - text data
- schema_number() / schema_integer() - numeric values with optional min/max
- schema_boolean() - true/false values
- schema_enum() - predefined choices
- schema_array() - lists of items
- schema_object() - nested structures
Let’s explore each type with practical examples:
# text classification with enums
sentiment_schema <- create_json_schema(
  name = "sentiment_analysis",
  schema = schema_object(
    sentiment = schema_enum(
      c("positive", "negative", "neutral"),
      "overall sentiment of text"
    ),
    confidence = schema_number(
      "confidence score",
      minimum = 0,
      maximum = 1
    ),
    is_spam = schema_boolean("contains spam content"),
    required = c("sentiment", "confidence", "is_spam")
  )
)
You can inspect the schema, then print it in a human-readable form, with json_dump() and jsonlite::toJSON().
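As a rough illustration of what the printed result might look like, the sketch below serialises a hand-built list with jsonlite::toJSON(). The list `sentiment_list` is a stand-in written by hand using standard JSON Schema keywords (enum, minimum, maximum); it is an assumption about the shape, not EndpointR's actual output. With the package loaded, you would presumably pass the result of json_dump(sentiment_schema) instead.

```r
library(jsonlite)

# Hand-built stand-in for a dumped schema (an assumption, not EndpointR output):
# standard JSON Schema keywords for an enum, a bounded number and a boolean.
sentiment_list <- list(
  type = "object",
  properties = list(
    sentiment = list(
      type = "string",
      enum = c("positive", "negative", "neutral"),
      description = "overall sentiment of text"
    ),
    confidence = list(
      type = "number",
      minimum = 0,
      maximum = 1,
      description = "confidence score"
    ),
    is_spam = list(type = "boolean", description = "contains spam content")
  ),
  required = c("sentiment", "confidence", "is_spam")
)

# auto_unbox = TRUE writes length-one vectors as scalars rather than arrays
sentiment_json <- toJSON(sentiment_list, pretty = TRUE, auto_unbox = TRUE)
cat(sentiment_json, "\n")
```

Reading the pretty-printed JSON is often the quickest way to spot a missing bound or a misspelt enum value before sending a request.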
Another, more complicated, schema for product review extraction, where we introduce schema_array:
rating_schema <- create_json_schema(
  name = "product_review",
  schema = schema_object(
    rating = schema_integer("star rating", minimum = 1, maximum = 5),
    title = schema_string("review title"),
    pros = schema_array(
      schema_string(),
      "positive aspects mentioned"
    ),
    cons = schema_array(
      schema_string(),
      "negative aspects mentioned"
    ),
    would_recommend = schema_boolean("recommends product"),
    required = c("rating", "title", "pros", "cons", "would_recommend")
  )
)
Complex Nested Structures
Finally, we have a fairly complex schema: supplier and customer are each a schema_object nested inside the top-level schema_object, and line_items is a schema_array whose items are themselves schema_objects containing several schema_* fields.
# invoice parsing with line items
invoice_schema <- create_json_schema(
  name = "invoice_data",
  schema = schema_object(
    # header information
    invoice_number = schema_string("invoice reference number"),
    issue_date = schema_string("date issued (YYYY-MM-DD format)"),
    due_date = schema_string("payment due date (YYYY-MM-DD format)"),
    # billing details (nested objects list all of their own fields as required)
    supplier = schema_object(
      name = schema_string("supplier company name"),
      address = schema_string("supplier address"),
      vat_number = schema_string("VAT registration number"),
      required = c("name", "address", "vat_number")
    ),
    customer = schema_object(
      name = schema_string("customer name"),
      address = schema_string("customer address"),
      required = c("name", "address")
    ),
    # line items array
    line_items = schema_array(
      schema_object(
        description = schema_string("item description"),
        quantity = schema_integer("quantity ordered", minimum = 1),
        unit_price = schema_number("price per unit", minimum = 0),
        line_total = schema_number("total for this line", minimum = 0),
        required = c("description", "quantity", "unit_price", "line_total")
      ),
      "invoice line items",
      min_items = 1
    ),
    # totals
    subtotal = schema_number("subtotal before tax", minimum = 0),
    vat_amount = schema_number("VAT amount", minimum = 0),
    total_amount = schema_number("final total amount", minimum = 0)
  )
)

# This helper line marks ALL top-level properties as required, which is
# mandatory for OpenAI's structured outputs. Without this, the API will reject
# the schema. Use this pattern when you want all fields to be required rather
# than listing them individually. Nested objects set their own `required` above.
invoice_schema@schema$required <- names(invoice_schema@schema$properties)
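The one-liner above only touches the top level of the schema. For deeply nested schemas, the same idea can be applied recursively. The sketch below works on a plain R list shaped like JSON Schema; `require_all_properties()` is a hypothetical helper written for this illustration and is not part of EndpointR, and whether it applies directly to the list returned by json_dump() is an assumption.

```r
# Recursively mark every property as required on a plain list-based schema.
# `require_all_properties` is a hypothetical helper, not part of EndpointR.
require_all_properties <- function(node) {
  if (!is.list(node)) {
    return(node)
  }
  # objects: recurse into each property, then require all of them
  if (identical(node$type, "object") && is.list(node$properties)) {
    node$properties <- lapply(node$properties, require_all_properties)
    node$required <- names(node$properties)
  }
  # arrays: recurse into the item definition
  if (identical(node$type, "array") && is.list(node$items)) {
    node$items <- require_all_properties(node$items)
  }
  node
}

# small hand-built schema with a partially required nested object
schema <- list(
  type = "object",
  properties = list(
    invoice_number = list(type = "string"),
    supplier = list(
      type = "object",
      properties = list(
        name = list(type = "string"),
        address = list(type = "string")
      ),
      required = "name"
    ),
    line_items = list(
      type = "array",
      items = list(
        type = "object",
        properties = list(
          description = list(type = "string"),
          quantity = list(type = "integer")
        )
      )
    )
  )
)

fixed <- require_all_properties(schema)
fixed$required
fixed$properties$supplier$required
```

Listing required fields by hand, as in the schema above, keeps the intent visible; a recursive helper of this kind is mainly useful as a safety net for large schemas.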
Validation
Each schema type enforces specific constraints. We have a method for validating whether specific responses meet the schema’s constraints.
How Validation Works
When using structured outputs with LLM providers:
- API-side enforcement: The provider ensures generated responses match your schema
- Local validation: validate_response() double-checks data integrity locally
This dual approach catches both generation errors and data transmission issues.
Here’s a comprehensive example:
user_profile_schema <- create_json_schema(
  name = "user_profile",
  schema = schema_object(
    # string fields
    name = schema_string("full name"),
    bio = schema_string("user biography"),
    # numeric fields
    age = schema_integer("age in years", minimum = 13, maximum = 120),
    account_balance = schema_number("balance in pounds", minimum = 0),
    # boolean fields
    is_verified = schema_boolean("account verified status"),
    newsletter_opt_in = schema_boolean("subscribed to newsletter"),
    # enum fields (string and integer choices)
    subscription_tier = schema_enum(
      c("free", "premium", "enterprise"),
      "subscription level"
    ),
    priority = schema_enum(
      c(1, 2, 3),
      "support priority level",
      type = "integer"
    ),
    # array fields
    interests = schema_array(
      schema_string(),
      "user interests",
      min_items = 1,
      max_items = 10
    ),
    required = c("name", "age", "is_verified", "subscription_tier")
  )
)
QUESTION: what’s wrong with this schema if we want to use it with the OpenAI API?
ANSWER: Not all properties are set as required fields (bio, account_balance, newsletter_opt_in, priority and interests are missing from required), meaning we cannot use this schema for OpenAI’s structured outputs.
Now let’s see what happens when we try to validate a mocked response object which conforms to the schema:
valid_user <- '{
  "name": "Alice Smith",
  "age": 28,
  "account_balance": 156.75,
  "is_verified": true,
  "newsletter_opt_in": false,
  "subscription_tier": "premium",
  "priority": 2,
  "interests": ["data science", "functional programming", "statistics"]
}'
validated_data <- validate_response(user_profile_schema, valid_user)
str(validated_data)
And when we try to validate a mocked response object which does not conform to the schema:
invalid_age <- '{
  "name": "Young User",
  "age": 10,
  "is_verified": true,
  "subscription_tier": "free"
}'
validate_response(user_profile_schema, invalid_age)
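To make the failing constraint concrete, here is a minimal hand-rolled check using only {jsonlite} and base R. It is a sketch of the kind of test local validation performs (required fields present, age within its minimum and maximum); `check_profile()` is a hypothetical function for illustration, and validate_response() may well work differently internally.

```r
library(jsonlite)

# Hypothetical stand-in for local validation; EndpointR's validate_response()
# may work quite differently internally.
check_profile <- function(json_string, required, age_min = 13, age_max = 120) {
  data <- fromJSON(json_string)
  problems <- character()

  # every required field must be present in the parsed object
  missing_fields <- setdiff(required, names(data))
  if (length(missing_fields) > 0) {
    problems <- c(problems, paste("missing required field:", missing_fields))
  }

  # age, when present, must sit inside the schema's bounds
  if (!is.null(data$age) && (data$age < age_min || data$age > age_max)) {
    problems <- c(
      problems,
      sprintf("age %s is outside [%s, %s]", data$age, age_min, age_max)
    )
  }

  problems
}

invalid_age <- '{
  "name": "Young User",
  "age": 10,
  "is_verified": true,
  "subscription_tier": "free"
}'

check_profile(
  invalid_age,
  required = c("name", "age", "is_verified", "subscription_tier")
)
```

Here all four required fields are present, so the only problem reported is the age of 10 falling below the minimum of 13.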
Working with S7 Objects
EndpointR uses S7 objects for its schema system. This provides better type safety and validation, but it means some familiar S3 methods won’t work as expected. Understanding how to work with these objects will help you debug issues and customise your schemas.
For example, you cannot call jsonlite::toJSON(contact_schema). If you try, you’ll get this error: Error: No method asJSON S3 class: S7_object
Instead, you should use the S7 method json_dump, which has been defined for the json_schema class. This converts the schema into an R list which is ready to be converted into JSON. I won’t print the list as-is because it is long and ugly; instead, we can check out the structure:
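A sketch of that inspection step, using base R's str(). The list `dumped_stand_in` below is built by hand from the JSON shown later in this vignette and stands in for whatever json_dump(contact_schema) returns; treating the two as equivalent is an assumption.

```r
# Stand-in for the list json_dump(contact_schema) is expected to return,
# hand-built from the JSON shown in this vignette (an assumption).
dumped_stand_in <- list(
  type = "json_schema",
  json_schema = list(
    name = "contact_info",
    schema = list(
      type = "object",
      properties = list(
        name = list(type = "string", description = "person's full name"),
        email = list(type = "string", description = "email address"),
        phone = list(type = "string", description = "phone number")
      ),
      additionalProperties = FALSE,
      required = list("name", "email", "phone")
    ),
    strict = TRUE
  )
)

# max.level keeps the printout short by hiding the deepest nesting
str(dumped_stand_in, max.level = 3)
```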
Alongside json_dump, the EndpointR::json_schema class has a validate_response method, which we saw used earlier in the Quick Start.
Converting Schema Objects to JSON
You can convert the dumped schema to a JSON object using {jsonlite}’s toJSON function. This object will now be of the class ‘json’.
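# dump the S7 schema to a plain R list, then serialise that list to JSON
contact_json_dump <- json_dump(contact_schema)
contact_json_schema <- jsonlite::toJSON(
  contact_json_dump,
  pretty = TRUE,
  auto_unbox = TRUE
)
class(contact_json_schema)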
View pretty-printed Schema in JSON form
{
  "type": "json_schema",
  "json_schema": {
    "name": "contact_info",
    "schema": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "description": "person's full name"
        },
        "email": {
          "type": "string",
          "description": "email address"
        },
        "phone": {
          "type": "string",
          "description": "phone number"
        }
      },
      "additionalProperties": false,
      "required": ["name", "email", "phone"]
    },
    "strict": true
  }
}
And we can convert our schema back to a regular R list with {jsonlite}’s fromJSON function:
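A minimal sketch of that round trip, using a shortened JSON string typed by hand rather than the real `contact_json_schema` object (the string and the name `schema_json` are illustrative assumptions). With the vignette's objects in scope, you would presumably pass contact_json_schema to fromJSON() directly.

```r
library(jsonlite)

# Shortened, hand-typed JSON in the same shape as the schema shown above
schema_json <- '{
  "name": "contact_info",
  "schema": {
    "type": "object",
    "properties": {
      "email": {"type": "string", "description": "email address"},
      "phone": {"type": "string", "description": "phone number"}
    },
    "additionalProperties": false,
    "required": ["email", "phone"]
  },
  "strict": true
}'

# JSON objects become named lists; the array of strings becomes a character vector
schema_list <- fromJSON(schema_json)

class(schema_list)
names(schema_list$schema$properties)
schema_list$schema$required
```

Note that fromJSON() simplifies JSON arrays of strings into character vectors by default, so `required` comes back as a vector rather than a list.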
Best Practices
Schema design principles:
- Use descriptive field names and descriptions
- Set appropriate constraints (min/max values, required fields)
- Prefer enums over free text for categories
- Nest objects logically for complex data
- Validate some mock responses in advance
Troubleshooting tips:
- Use json_dump() to inspect final schema structure
- Use jsonlite::toJSON(x, pretty = TRUE) to view the schema in a human-readable form
- Test schemas with mock data using validate_response()
- Start simple and add complexity incrementally
- Check enum values match expected model outputs
- Validate required fields cover essential data
- Make sure all properties are required if using OpenAI API for structured outputs
Types Reference
Type | Use Case | Example |
---|---|---|
schema_string() | Text, names, descriptions | schema_string("email address") |
schema_integer() | Whole numbers, counts | schema_integer("age", minimum = 0, maximum = 120) |
schema_number() | Decimals, prices, scores | schema_number("price", minimum = 0) |
schema_boolean() | Yes/no, true/false flags | schema_boolean("is_active") |
schema_enum() | Fixed choices | schema_enum(c("small", "medium", "large"), "size") |
schema_array() | Lists, multiple values | schema_array(schema_string(), "tags") |
schema_object() | Nested structures | schema_object(name = schema_string("..."), ...) |