High-level function to generate embeddings for texts in a data frame using OpenAI's embedding API. It handles the entire process, from request creation to response processing, with options for batching and concurrent requests.
Usage
oai_embed_df(
df,
text_var,
id_var,
model = "text-embedding-3-small",
dimensions = 1536,
key_name = "OPENAI_API_KEY",
batch_size = 10,
concurrent_requests = 1,
max_retries = 5,
timeout = 20,
endpoint_url = "https://api.openai.com/v1/embeddings",
progress = TRUE
)
Arguments
- df
Data frame containing texts to embed
- text_var
Column name (unquoted) containing texts to embed
- id_var
Column name (unquoted) for unique row identifiers
- model
OpenAI embedding model to use (default: "text-embedding-3-small")
- dimensions
Number of embedding dimensions (default: 1536; NULL uses the model's default)
- key_name
Name of environment variable containing the API key
- batch_size
Number of texts to process in one batch (default: 10)
- concurrent_requests
Number of concurrent requests (default: 1)
- max_retries
Maximum retry attempts per request (default: 5)
- timeout
Request timeout in seconds (default: 20)
- endpoint_url
OpenAI API endpoint URL
- progress
Whether to display a progress bar (default: TRUE)
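As a rough illustration of what batch_size means for the number of requests, the sketch below splits a vector of texts into groups of at most batch_size elements using base R. It is an assumption about the general batching idea, not a description of how oai_embed_df() is implemented internally; the texts vector is invented for the example.

```r
# Hypothetical input: 23 short texts (invented for illustration)
texts <- paste("Document", 1:23)
batch_size <- 10

# Assign each text to a batch number: 1,1,...,1,2,...,2,3,3,3
batch_index <- ceiling(seq_along(texts) / batch_size)

# Split the texts into a list with one element per batch
batches <- split(texts, batch_index)

# Number of batches equals ceiling(length(texts) / batch_size)
n_batches <- length(batches)
batch_sizes <- unname(lengths(batches))

print(n_batches)    # 3
print(batch_sizes)  # 10 10 3
```

With concurrent_requests greater than 1, several such batches would be in flight at once, so larger values trade API rate-limit headroom for speed.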
Value
Original data frame with additional columns for embeddings (V1, V2, etc.), plus .error and .error_message columns indicating any failures
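The sketch below shows one way to work with a result of this shape: separating failed rows from successful ones and pulling the V columns into a numeric matrix. The data frame is a hand-built mock with invented values, and it assumes that .error is a logical column and that .error_message is NA for successful rows; check the actual output of oai_embed_df() before relying on either assumption.

```r
# Mock of the documented return shape; all values are invented for illustration
embeddings_df <- data.frame(
  id = 1:3,
  text = c("First example", "Second example", "Third example"),
  V1 = c(0.12, NA, -0.07),
  V2 = c(0.03, NA, 0.21),
  .error = c(FALSE, TRUE, FALSE),            # assumed to be logical
  .error_message = c(NA, "request timed out", NA),
  stringsAsFactors = FALSE
)

# Rows that failed, with the reason reported for each
failed <- embeddings_df[embeddings_df$.error, c("id", ".error_message")]

# Rows that succeeded
ok <- embeddings_df[!embeddings_df$.error, ]

# Collect the embedding columns (V1, V2, ...) into a numeric matrix
embedding_cols <- grep("^V[0-9]+$", names(ok), value = TRUE)
embedding_matrix <- as.matrix(ok[, embedding_cols])

print(failed)
print(dim(embedding_matrix))
```

Failed rows can then be passed back through oai_embed_df() on their own, since the ID column is preserved.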
Details
This function extracts texts from a specified column, generates embeddings using oai_embed_batch(), and joins the results back to the original data frame using a specified ID column.
The function preserves the original data frame structure and adds new columns for embedding dimensions (V1, V2, ..., Vn). If the number of rows doesn't match after processing (due to errors), it returns the results with a warning.
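The extract, embed, and join-by-ID pattern described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: fake_embed() is a hypothetical stand-in that returns deterministic placeholder values without contacting any API, and merge() is used here only as one possible way to join on the ID column.

```r
# Hypothetical stand-in for an embedding call (NOT part of the package).
# Returns one row per text and one column per dimension, named V1..Vn.
fake_embed <- function(texts, dimensions = 4) {
  m <- t(vapply(
    texts,
    function(x) nchar(x) / seq_len(dimensions),
    numeric(dimensions),
    USE.NAMES = FALSE
  ))
  out <- as.data.frame(m)
  names(out) <- paste0("V", seq_len(dimensions))
  out
}

df <- data.frame(
  id = 1:3,
  text = c("First example", "Second example", "Third example"),
  stringsAsFactors = FALSE
)

# 1. Extract the texts and generate one embedding row per text
emb <- fake_embed(df$text, dimensions = 4)

# 2. Attach the IDs so the embeddings can be matched to their source rows
emb$id <- df$id

# 3. Join back to the original data frame by ID, keeping every original row
result <- merge(df, emb, by = "id", all.x = TRUE)

# 4. Warn if the row count changed during processing
if (nrow(result) != nrow(df)) {
  warning("Row count changed after joining embeddings")
}

print(names(result))
```

Joining on an explicit ID rather than on row position keeps each embedding aligned with its text even if some requests fail or results come back in a different order.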
OpenAI's embedding API allows you to specify the number of dimensions for the output embeddings, which can be useful for reducing memory usage, lowering storage costs, or matching specific downstream requirements. The default is model-specific (1536 for text-embedding-3-small). See: OpenAI Embedding Updates.
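To give a sense of the memory savings, the sketch below estimates the raw size of an embedding table at two dimension settings. It assumes each value is stored as an R double (8 bytes) and ignores data frame overhead; embedding_bytes() is a hypothetical helper written for this example, and the row count is invented.

```r
# Hypothetical helper: raw bytes needed for n_texts embeddings,
# assuming each value is stored as an 8-byte double
embedding_bytes <- function(n_texts, dimensions, bytes_per_value = 8) {
  n_texts * dimensions * bytes_per_value
}

n_texts <- 100000

full_size <- embedding_bytes(n_texts, dimensions = 1536)
reduced_size <- embedding_bytes(n_texts, dimensions = 360)

# Express both in megabytes (1 MB = 1e6 bytes) and compare
print(full_size / 1e6)           # 1228.8
print(reduced_size / 1e6)        # 288
print(reduced_size / full_size)  # 0.234375
```

Under these assumptions, dropping from 1536 to 360 dimensions cuts raw storage to under a quarter of the original; whether retrieval quality remains acceptable at the smaller size depends on the task and should be checked empirically.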
Examples
if (FALSE) { # \dontrun{
df <- data.frame(
id = 1:3,
text = c("First example", "Second example", "Third example")
)
# Generate embeddings with default dimensions
embeddings_df <- oai_embed_df(
df = df,
text_var = text,
id_var = id
)
# Generate embeddings with custom dimensions
embeddings_df <- oai_embed_df(
df = df,
text_var = text,
id_var = id,
dimensions = 360, # smaller embeddings
batch_size = 5
)
# Use with concurrent requests for faster processing
embeddings_df <- oai_embed_df(
df = df,
text_var = text,
id_var = id,
model = "text-embedding-3-large",
concurrent_requests = 3
)
} # }