Structured Outputs - JSON Schema • EndpointR

NOTE: This vignette was generated by Claude 4 Sonnet, and then edited by hand. The JSON Schema code was written by hand and then adapted by the same model. If you think your code was copied, or used by Sonnet, please let us know and we will credit accordingly.

library(EndpointR)
library(S7)
library(cli)
library(jsonlite)
library(httr2)
library(purrr)

Introduction to Structured Outputs

Structured outputs ensure LLMs return data in exactly the format you need. Instead of parsing messy text, you get validated JSON matching your schema.

Why use structured outputs?

- eliminates parsing errors and inconsistent formats
- guarantees data types (numbers vs strings, booleans vs text)
- enables reliable data extraction pipelines
- reduces prompt engineering overhead

Basic workflow:

1. define schema using helper functions
2. send requests to OpenAI with schema passed in
3. validate response with validate_response()

Quick Start

WARNING: if using the schema for structured outputs from OpenAI, all fields MUST be required:

contact_schema <- create_json_schema(
 name = "contact_info",
 schema = schema_object(
   name = schema_string("person's full name"),
   email = schema_string("email address"),
   phone = schema_string("phone number"),
   required = list("name", "email", "phone"),
   additional_properties = FALSE
 )
)

req <- oai_build_completions_request(
  input = "Am I speaking with Margaret Phillips? Yes, ok, and your email is mphil@hotmail.co.uk. Ok perfect, and your phone number? Was that 07564789789? Ok great. Just a second please Margaret, you're verified",
  schema = contact_schema
)

resp <- req_perform(req)

resp |>
  resp_body_json() |> 
  pluck("choices", 1, "message", "content") |> 
  validate_response(schema = contact_schema) |> 
  tibble::as_tibble()

NOTE: The first time you send a request with a schema, it will take longer than usual. “Typical schemas take under 10 seconds to process on the first request, but more complex schemas may take up to a minute.” ¹

Schema Types

EndpointR provides several schema types to match different data extraction needs. Each type enforces specific constraints and validations:

schema_string() - text data
schema_number() / schema_integer() - numeric values with optional min/max
schema_boolean() - true/false values
schema_enum() - predefined choices
schema_array() - lists of items
schema_object() - nested structures

Let’s explore each type with practical examples:

# text classification with enums
sentiment_schema <- create_json_schema(
  name = "sentiment_analysis",
  schema = schema_object(
    sentiment = schema_enum(
      c("positive", "negative", "neutral"),
      "overall sentiment of text"
    ),
    confidence = schema_number(
      "confidence score", 
      minimum = 0, 
      maximum = 1
    ),
    is_spam = schema_boolean("contains spam content"),
    required = c("sentiment", "confidence", "is_spam")
  )
)

You can inspect the schema by converting it to a list with json_dump(), then print it in a human-readable form with jsonlite::toJSON():

json_dump(sentiment_schema) |> 
  jsonlite::toJSON(pretty = TRUE, auto_unbox = TRUE)
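For orientation, the constraints used above correspond to standard JSON Schema keywords (`enum`, `minimum`, `maximum`). The sketch below builds an equivalent object definition by hand as a plain R list and prints it with jsonlite. It illustrates the target JSON Schema only; it is not the exact output of json_dump(), which may order fields differently or wrap them in additional fields.

```r
library(jsonlite)

# Hand-built equivalent of the sentiment schema's object definition
# (illustrative only; not produced by EndpointR)
sentiment_properties <- list(
  type = "object",
  properties = list(
    sentiment = list(
      type = "string",
      enum = c("positive", "negative", "neutral"),
      description = "overall sentiment of text"
    ),
    confidence = list(
      type = "number",
      minimum = 0,
      maximum = 1,
      description = "confidence score"
    ),
    is_spam = list(type = "boolean", description = "contains spam content")
  ),
  required = c("sentiment", "confidence", "is_spam")
)

# auto_unbox = TRUE writes length-one vectors as scalars rather than arrays
sentiment_json <- toJSON(sentiment_properties, pretty = TRUE, auto_unbox = TRUE)
sentiment_json
```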

A more involved schema, for product review extraction, introduces schema_array():

rating_schema <- create_json_schema(
  name = "product_review",
  schema = schema_object(
    rating = schema_integer("star rating", minimum = 1, maximum = 5),
    title = schema_string("review title"),
    pros = schema_array(
      schema_string(),
      "positive aspects mentioned"
    ),
    cons = schema_array(
      schema_string(),
      "negative aspects mentioned"
    ),
    would_recommend = schema_boolean("recommends product"),
    required = c("rating","title", "pros", "cons", "would_recommend")
  )
)

Complex Nested Structures

Finally, a fairly complex schema: supplier and customer are schema_objects nested inside the top-level schema_object, and line_items is a schema_array whose items are themselves schema_objects built from several schema_* fields.

# invoice parsing with line items
invoice_schema <- create_json_schema(
  name = "invoice_data",
  schema = schema_object(
    # header information
    invoice_number = schema_string("invoice reference number"),
    issue_date = schema_string("date issued (YYYY-MM-DD format)"),
    due_date = schema_string("payment due date (YYYY-MM-DD format)"),
    # billing details
    supplier = schema_object(
      name = schema_string("supplier company name"),
      address = schema_string("supplier address"),
      vat_number = schema_string("VAT registration number"),
      required = c("name")
    ),
    customer = schema_object(
      name = schema_string("customer name"),
      address = schema_string("customer address"),
      required = c("name")
    ),
    # line items array
    line_items = schema_array(
      schema_object(
        description = schema_string("item description"),
        quantity = schema_integer("quantity ordered", minimum = 1),
        unit_price = schema_number("price per unit", minimum = 0),
        line_total = schema_number("total for this line", minimum = 0),
        required = c("description", "quantity", "unit_price", "line_total")
      ),
      "invoice line items",
      min_items = 1
    ),
    # totals
    subtotal = schema_number("subtotal before tax", minimum = 0),
    vat_amount = schema_number("VAT amount", minimum = 0),
    total_amount = schema_number("final total amount", minimum = 0)
  )
)

# Mark every top-level property as required, which is mandatory for OpenAI's
# structured outputs; without this, the API will reject the schema. Use this
# pattern when you want all fields to be required rather than listing them
# individually.
invoice_schema@schema$required <- names(invoice_schema@schema$properties)
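In the invoice schema, the nested supplier and customer objects each list only "name" as required, and the assignment to `required` changes the outermost object only. If nested objects also need every field required, one option is a small recursive helper. The sketch below works on a plain-list JSON Schema; `require_all_properties()` is a hypothetical helper, not part of EndpointR, and whether it can be applied directly to the list stored inside the S7 object is an assumption to check against your own schema.

```r
# Hypothetical helper (not part of EndpointR): walk a plain-list JSON Schema
# and mark every property of every object as required
require_all_properties <- function(node) {
  if (!is.list(node)) return(node)
  if (identical(node$type, "object") && is.list(node$properties)) {
    # recurse into each property first, then require all of them
    node$properties <- lapply(node$properties, require_all_properties)
    node$required <- names(node$properties)
  }
  if (identical(node$type, "array") && is.list(node$items)) {
    node$items <- require_all_properties(node$items)
  }
  node
}

# Small hand-built schema with a nested object and an array of objects
schema <- list(
  type = "object",
  properties = list(
    supplier = list(
      type = "object",
      properties = list(
        name = list(type = "string"),
        address = list(type = "string")
      ),
      required = "name"
    ),
    line_items = list(
      type = "array",
      items = list(
        type = "object",
        properties = list(description = list(type = "string"))
      )
    )
  )
)

fixed <- require_all_properties(schema)
fixed$required                      # "supplier" "line_items"
fixed$properties$supplier$required  # "name" "address"
```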

Validation

Each schema type enforces specific constraints. The validate_response() method checks whether a given response meets them.

How Validation Works

When using structured outputs with LLM providers:

API-side enforcement: The provider ensures generated responses match your schema
Local validation: validate_response() double-checks data integrity locally

This dual approach catches both generation errors and data transmission issues.
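To make the local step concrete, here is a minimal sketch of the kind of check a local validator can perform: parse the JSON string, then compare a field against its bounds. `check_range()` is a hypothetical helper written for this illustration; it is not how validate_response() is implemented.

```r
library(jsonlite)

# Hypothetical helper: stop with an informative error if a numeric field
# is missing, non-numeric, or outside [minimum, maximum]
check_range <- function(data, field, minimum = -Inf, maximum = Inf) {
  value <- data[[field]]
  if (is.null(value) || !is.numeric(value)) {
    stop(sprintf("field '%s' is missing or not numeric", field), call. = FALSE)
  }
  if (value < minimum || value > maximum) {
    stop(sprintf("field '%s' = %s is outside [%s, %s]",
                 field, value, minimum, maximum), call. = FALSE)
  }
  invisible(TRUE)
}

# A mocked model response; fromJSON() returns a named list here
response <- fromJSON('{"sentiment": "positive", "confidence": 0.93, "is_spam": false}')

# Passes silently because 0.93 lies within [0, 1]
check_range(response, "confidence", minimum = 0, maximum = 1)
```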

Here’s a comprehensive example:

user_profile_schema <- create_json_schema(
 name = "user_profile",
 schema = schema_object(
   # string fields
   name = schema_string("full name"),
   bio = schema_string("user biography"),
   
   # numeric fields
   age = schema_integer("age in years", minimum = 13, maximum = 120),
   account_balance = schema_number("balance in pounds", minimum = 0),
   is_verified = schema_boolean("account verified status"),
   newsletter_opt_in = schema_boolean("subscribed to newsletter"),
   subscription_tier = schema_enum(
     c("free", "premium", "enterprise"),
     "subscription level"),
   priority = schema_enum(
     c(1, 2, 3),
     "support priority level",
     type = "integer"
   ),
   interests = schema_array(
     schema_string(),
     "user interests",
     min_items = 1,
     max_items = 10
   ),
   
   required = c("name", "age", "is_verified", "subscription_tier")
 )
)

QUESTION: what’s wrong with this schema if we want to use it with the OpenAI API?

ANSWER: Not all properties are set as required fields (bio, account_balance, newsletter_opt_in, priority and interests are missing from required), meaning we cannot use this schema for OpenAI’s structured outputs.
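This kind of gap can be spotted programmatically by comparing property names against the required list. The sketch below uses a hand-built stand-in with the same property names as user_profile_schema; `unrequired_properties()` is a hypothetical helper, and applying it to real json_dump() output assumes the object schema sits under `$json_schema$schema`, as the pretty-printed JSON later in this vignette suggests.

```r
# Hypothetical helper: list properties that are not marked as required
# in a plain-list object schema
unrequired_properties <- function(object_schema) {
  setdiff(names(object_schema$properties), unlist(object_schema$required))
}

# Hand-built stand-in mirroring the property names of user_profile_schema
profile <- list(
  type = "object",
  properties = list(
    name = list(), bio = list(), age = list(), account_balance = list(),
    is_verified = list(), newsletter_opt_in = list(),
    subscription_tier = list(), priority = list(), interests = list()
  ),
  required = c("name", "age", "is_verified", "subscription_tier")
)

unrequired_properties(profile)
# "bio" "account_balance" "newsletter_opt_in" "priority" "interests"
```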

Now let’s see what happens when we try to validate a mocked response object which conforms to the schema:

valid_user <- '{
  "name": "Alice Smith",
  "age": 28,
  "account_balance": 156.75,
  "is_verified": true,
  "newsletter_opt_in": false,
  "subscription_tier": "premium",
  "priority": 2,
  "interests": ["data science", "functional programming", "statistics"]
}'

validated_data <- validate_response(user_profile_schema, valid_user)
str(validated_data)

And when we try to validate a mocked response object which does not conform to the schema (here, age is 10, below the schema’s minimum of 13):

invalid_age <- '{
  "name": "Young User",
  "age": 10,
  "is_verified": true,
  "subscription_tier": "free"
}'

validate_response(user_profile_schema, invalid_age)
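In a pipeline processing many responses, a single failing validation should usually not halt the whole run. Assuming the validator signals a standard R error on non-conforming data (an assumption; the failure behaviour of validate_response() is not shown here), a generic wrapper around tryCatch() can turn failures into NULL plus a warning. Both `safe_validate()` and the stand-in validator `reject_under_13()` are hypothetical names used only for this sketch.

```r
# Hypothetical wrapper: run any validator and return NULL (with a warning)
# instead of stopping when validation fails
safe_validate <- function(validator, ...) {
  tryCatch(
    validator(...),
    error = function(e) {
      warning("validation failed: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Stand-in validator used only for this illustration
reject_under_13 <- function(data) {
  if (data$age < 13) stop("age must be at least 13")
  data
}

# The failing record yields NULL rather than an error
result <- suppressWarnings(
  safe_validate(reject_under_13, list(name = "Young User", age = 10))
)
is.null(result)
```

Under the same assumption, the call would take the form safe_validate(validate_response, user_profile_schema, invalid_age).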

Working with S7 Objects

EndpointR uses S7 objects for its schema system. This provides better type safety and validation, but it means some familiar S3 methods won’t work as expected. Understanding how to work with these objects will help you debug issues and customise your schemas.

For example, you cannot call: jsonlite::toJSON(contact_schema). If you try, you’ll get this error: Error: No method asJSON S3 class: S7_object

Instead, you should use the S7 method json_dump, which has been defined for the json_schema class. This converts the schema into an R list which is ready to be converted into JSON. The list itself is long and ugly when printed, so we check its structure instead:

contact_json_dump <- json_dump(contact_schema)
str(contact_json_dump)

Alongside json_dump, the EndpointR::json_schema class has a validate_response method, which we saw used earlier in the Quick Start.

Converting Schema Objects to JSON

You can convert the dumped schema to a JSON object using {jsonlite}’s toJSON function. This object will now be of the class "json".

contact_json_schema <- toJSON(
  contact_json_dump,
  pretty = TRUE,
  auto_unbox = TRUE
)

class(contact_json_schema)

View pretty-printed Schema in JSON form

{
  "type": "json_schema",
  "json_schema": {
    "name": "contact_info",
    "schema": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "description": "person's full name"
        },
        "email": {
          "type": "string",
          "description": "email address"
        },
        "phone": {
          "type": "string",
          "description": "phone number"
        }
      },
      "additionalProperties": false,
      "required": ["name", "email", "phone"]
    },
    "strict": true
  }
}

And we can convert our schema back to a regular R list with {jsonlite}’s fromJSON function:

from_contact_json_schema <- fromJSON(contact_json_schema)

class(from_contact_json_schema)
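One detail worth knowing about the round trip: by default, fromJSON() simplifies JSON arrays of scalars into atomic vectors, so the result is not necessarily identical to the list you started with. The sketch below uses a small hand-built list rather than the actual json_dump() output, and shows how `simplifyVector = FALSE` keeps arrays as lists.

```r
library(jsonlite)

# Hand-built stand-in for a dumped schema; `required` is an unnamed list
schema_list <- list(
  type = "object",
  properties = list(name = list(type = "string")),
  required = list("name", "email")
)

schema_json <- toJSON(schema_list, auto_unbox = TRUE)

simplified <- fromJSON(schema_json)                            # default behaviour
unsimplified <- fromJSON(schema_json, simplifyVector = FALSE)  # keep arrays as lists

class(simplified$required)    # "character"
class(unsimplified$required)  # "list"
```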

Best Practices

Schema design principles:

Use descriptive field names and descriptions
Set appropriate constraints (min/max values, required fields)
Prefer enums over free text for categories
Nest objects logically for complex data
Validate some mock responses in advance

Troubleshooting tips:

Use json_dump() to inspect final schema structure
Use jsonlite::toJSON(x, pretty = TRUE), where x is the output of json_dump(), to view the schema in a human-readable form
Test schemas with mock data using validate_response()
Start simple and add complexity incrementally
Check enum values match expected model outputs
Validate required fields cover essential data
Make sure all properties are required if using OpenAI API for structured outputs

Types Reference

Type	Use Case	Example
`schema_string()`	Text, names, descriptions	`schema_string("email address")`
`schema_integer()`	Whole numbers, counts	`schema_integer("age", minimum = 0, maximum = 120)`
`schema_number()`	Decimals, prices, scores	`schema_number("price", minimum = 0)`
`schema_boolean()`	Yes/no, true/false flags	`schema_boolean("is_active")`
`schema_enum()`	Fixed choices	`schema_enum(c("small", "medium", "large"), "size")`
`schema_array()`	Lists, multiple values	`schema_array(schema_string(), "tags")`
`schema_object()`	Nested structures	`schema_object(name = schema_string("..."), ...)`