
High-level function to generate embeddings for texts in a data frame using OpenAI's embedding API. It handles the whole process, from request creation to response processing, with options for batching and concurrent requests.

Usage

oai_embed_df(
  df,
  text_var,
  id_var,
  model = "text-embedding-3-small",
  dimensions = 1536,
  key_name = "OPENAI_API_KEY",
  batch_size = 10,
  concurrent_requests = 1,
  max_retries = 5,
  timeout = 20,
  endpoint_url = "https://api.openai.com/v1/embeddings",
  progress = TRUE
)

Arguments

df

Data frame containing texts to embed

text_var

Column name (unquoted) containing texts to embed

id_var

Column name (unquoted) for unique row identifiers

model

OpenAI embedding model to use (default: "text-embedding-3-small")

dimensions

Number of embedding dimensions (default: 1536; NULL uses the model default)

key_name

Name of environment variable containing the API key

batch_size

Number of texts to process in one batch (default: 10)

concurrent_requests

Number of concurrent requests (default: 1)

max_retries

Maximum retry attempts per request (default: 5)

timeout

Request timeout in seconds (default: 20)

endpoint_url

OpenAI API endpoint URL

progress

Whether to display a progress bar (default: TRUE)
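As a rough illustration of what batch_size means, the sketch below splits a vector of texts into batches in base R. It is not the package's internal code: the `texts` vector and the splitting approach are assumptions made for the example, and it only shows how 25 texts with a batch size of 10 would fall into three groups.

```r
# Illustrative only: how a batch size partitions a vector of texts.
# `texts` is hypothetical input, not part of the package.
texts <- sprintf("text %d", 1:25)
batch_size <- 10

# Assign each text a batch index (1 for the first 10, 2 for the next 10, ...)
batch_index <- ceiling(seq_along(texts) / batch_size)
batches <- split(texts, batch_index)

n_batches <- length(batches)        # number of groups the texts fall into
batch_lengths <- lengths(batches)   # how many texts each group holds
```

Under this reading, a larger batch_size means fewer, larger groups of texts, while concurrent_requests controls how many requests are in flight at once.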

Value

The original data frame with additional columns for the embedding dimensions (V1, V2, ..., Vn), plus .error and .error_message columns indicating any failures
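One way to inspect failures in the returned data frame is sketched below. The `result` object is a hand-built mock shaped like the return value described above, not real API output; the sketch assumes .error is a logical column, and the error message text is invented for illustration.

```r
# Mock result shaped like the described return value; NOT real API output.
result <- data.frame(
  id = 1:3,
  text = c("First example", "Second example", "Third example"),
  V1 = c(0.12, NA, -0.40),
  V2 = c(0.05, NA, 0.31),
  .error = c(FALSE, TRUE, FALSE),                    # assumed to be logical
  .error_message = c(NA, "request timed out", NA)    # invented message
)

# Rows that failed, with their messages
failed <- result[result$.error, c("id", ".error_message")]

# Rows that can be used downstream
succeeded <- result[!result$.error, ]
```

The ids in `failed` could then be used to select rows of the original data frame for a retry.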

Details

This function extracts texts from a specified column, generates embeddings using oai_embed_batch(), and joins the results back to the original data frame using a specified ID column.

The function preserves the original data frame structure and adds new columns for embedding dimensions (V1, V2, ..., Vn). If the number of rows doesn't match after processing (due to errors), it returns the results with a warning.
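The join-back step can be pictured with a small base R sketch. It is an assumption about the general technique (a left join on the ID column), not the function's actual implementation; `df` and `emb` are hypothetical, with `emb` standing in for embeddings that came back out of order and with one row missing.

```r
# Hypothetical input data frame
df <- data.frame(
  id = c("a", "b", "c"),
  text = c("First example", "Second example", "Third example")
)

# Hypothetical embeddings: returned out of order, and "b" is missing
emb <- data.frame(
  id = c("c", "a"),
  V1 = c(0.3, 0.1),
  V2 = c(0.4, 0.2)
)

# Left join keeps every original row; unmatched rows get NA embeddings
joined <- merge(df, emb, by = "id", all.x = TRUE)

# Restore the original row order explicitly
joined <- joined[match(df$id, joined$id), ]
rownames(joined) <- NULL
```

A unique id_var matters here: duplicated IDs would make a join like this return more rows than the input.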

OpenAI's embedding API allows you to specify the number of dimensions for the output embeddings, which can be useful for reducing memory usage and storage costs, or for matching specific downstream requirements. The default is model-specific (1536 for text-embedding-3-small). See: OpenAI Embedding Updates
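To give a sense of the memory effect, the sketch below estimates the in-memory size of the embedding columns alone, assuming they are stored as R doubles (8 bytes each) and ignoring data frame overhead. The row count of 100,000 and the helper `embedding_bytes()` are hypothetical, chosen only for the arithmetic.

```r
# Back-of-the-envelope size of the embedding columns, stored as doubles.
bytes_per_double <- 8
n_texts <- 100000   # hypothetical number of rows

# Hypothetical helper: rows x dimensions x bytes per value
embedding_bytes <- function(n_rows, n_dims) {
  n_rows * n_dims * bytes_per_double
}

full    <- embedding_bytes(n_texts, 1536)  # 1,228,800,000 bytes (about 1.23 GB)
reduced <- embedding_bytes(n_texts, 360)   #   288,000,000 bytes (about 0.29 GB)

ratio <- reduced / full                    # 360 / 1536 = 0.234375
```

Size scales linearly with the dimension count, so 360 dimensions take a little under a quarter of the space of 1536.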

Examples

if (FALSE) { # \dontrun{
  df <- data.frame(
    id = 1:3,
    text = c("First example", "Second example", "Third example")
  )

  # Generate embeddings with default dimensions
  embeddings_df <- oai_embed_df(
    df = df,
    text_var = text,
    id_var = id
  )

  # Generate embeddings with custom dimensions
  embeddings_df <- oai_embed_df(
    df = df,
    text_var = text,
    id_var = id,
    dimensions = 360,  # smaller embeddings
    batch_size = 5
  )

  # Use with concurrent requests for faster processing
  embeddings_df <- oai_embed_df(
    df = df,
    text_var = text,
    id_var = id,
    model = "text-embedding-3-large",
    concurrent_requests = 3
  )
} # }