High-level function that generates embeddings for multiple text strings using OpenAI's embedding API. It handles batching, concurrent requests, and errors, and reports progress for large collections of texts.
Usage
oai_embed_batch(
  texts,
  model = "text-embedding-3-small",
  dimensions = 1536,
  batch_size = 10,
  concurrent_requests = 1,
  max_retries = 5,
  timeout = 20,
  endpoint_url = "https://api.openai.com/v1/embeddings",
  key_name = "OPENAI_API_KEY",
  include_texts = TRUE,
  relocate_col = 2,
  verbose = FALSE
)
Arguments
- texts
Vector or list of character strings to generate embeddings for
- model
OpenAI embedding model to use (default: "text-embedding-3-small")
- dimensions
Number of embedding dimensions (default: 1536 for text-embedding-3-small)
- batch_size
Number of texts to process in one API request (default: 10)
- concurrent_requests
Number of requests to send simultaneously (default: 1)
- max_retries
Maximum retry attempts for failed requests (default: 5)
- timeout
Request timeout in seconds (default: 20)
- endpoint_url
OpenAI API endpoint URL (default: OpenAI's embedding endpoint)
- key_name
Name of environment variable containing the API key (default: "OPENAI_API_KEY")
- include_texts
Whether to include original texts in the result (default: TRUE)
- relocate_col
Column position at which to place the error columns (default: 2)
- verbose
Whether to enable verbose request logging (default: FALSE)
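Because key_name holds the name of an environment variable rather than the key itself, it can be worth confirming that the variable is set before starting a long run. A minimal sketch in base R; the helper has_api_key() is hypothetical and not part of the package:

```r
# Hypothetical helper (not part of the package): TRUE if the named
# environment variable exists and is non-empty.
has_api_key <- function(key_name = "OPENAI_API_KEY") {
  nzchar(Sys.getenv(key_name, unset = ""))
}

if (!has_api_key()) {
  message("OPENAI_API_KEY is not set; set it with Sys.setenv() or in .Renviron")
}
```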
Value
A tibble containing:
- Embedding vectors as columns (V1, V2, ..., Vn)
- .error: Logical column indicating if embedding failed
- .error_message: Character column with error details
- text: Original texts (if include_texts = TRUE)
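To illustrate how a result of this shape might be inspected, the sketch below builds a small mock data frame with the columns listed above and separates failed rows from usable embeddings. The values, the column order, and the use of a plain data.frame in place of a tibble are assumptions made for illustration only.

```r
# Mock result with the documented columns; a real call returns a tibble,
# and every value here is made up.
res <- data.frame(
  text = c("First text", "Second text", "Third text"),
  .error = c(FALSE, TRUE, FALSE),
  .error_message = c(NA, "request timed out", NA),
  V1 = c(0.1, NA, 0.3),
  V2 = c(0.2, NA, 0.4),
  stringsAsFactors = FALSE
)

# Rows whose embedding failed, together with the reported reason
failed <- res[res$.error, c("text", ".error_message")]

# Numeric matrix of the successful embeddings (columns V1..Vn)
embedding_cols <- grep("^V[0-9]+$", names(res), value = TRUE)
emb <- as.matrix(res[!res$.error, embedding_cols])
```

Failed rows could then be passed to a second call to retry only those texts.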
Details
This function efficiently processes multiple texts by:
- Splitting texts into batches of the specified size
- Creating concurrent requests (if configured) for faster processing
- Handling individual batch failures gracefully
- Pre-allocating memory for embeddings to improve performance
- Providing detailed success/failure reporting
If a batch fails, only the texts in that batch are marked as failed; texts in other batches are unaffected. Failed embeddings are filled with NA values and flagged in the error columns.
The function returns a tibble with embedding columns (V1, V2, ..., Vn), error tracking columns (.error, .error_message), and optionally the original texts.
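The batching and per-batch failure behaviour described above can be sketched in base R. This is not the package's implementation: fake_embed() is a hypothetical stand-in for a single API request, embed_in_batches() is an illustrative name, and batches are processed sequentially rather than concurrently.

```r
# Hypothetical stand-in for one API request: returns a length(batch) x dims
# matrix, or signals an error for any batch containing the string "bad".
fake_embed <- function(batch, dims) {
  if (any(batch == "bad")) stop("simulated request failure")
  matrix(seq_along(batch), nrow = length(batch), ncol = dims)
}

embed_in_batches <- function(texts, batch_size = 10, dims = 3) {
  # Split row indices into consecutive batches of at most batch_size
  idx <- split(seq_along(texts), ceiling(seq_along(texts) / batch_size))

  # Pre-allocate the outputs; rows that are never filled stay NA
  emb <- matrix(NA_real_, nrow = length(texts), ncol = dims,
                dimnames = list(NULL, paste0("V", seq_len(dims))))
  error <- rep(FALSE, length(texts))
  error_message <- rep(NA_character_, length(texts))

  for (rows in idx) {
    out <- tryCatch(fake_embed(texts[rows], dims), error = function(e) e)
    if (inherits(out, "error")) {
      # Only the rows of this batch are marked as failed
      error[rows] <- TRUE
      error_message[rows] <- conditionMessage(out)
    } else {
      emb[rows, ] <- out
    }
  }

  data.frame(text = texts, .error = error, .error_message = error_message,
             emb, stringsAsFactors = FALSE)
}

res <- embed_in_batches(c("a", "b", "bad", "d", "e"), batch_size = 2)
```

With five texts and batch_size = 2, the indices split into three batches (1-2, 3-4, 5). The second batch contains "bad", so rows 3 and 4 are flagged and keep NA embeddings, while the other rows are filled normally.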
Examples
if (FALSE) { # \dontrun{
  # Basic batch embedding
  texts <- c("First text", "Second text", "Third text")
  embeddings <- oai_embed_batch(texts)

  # Large-scale processing with concurrent requests
  large_texts <- rep("Sample text", 100)
  embeddings <- oai_embed_batch(
    texts = large_texts,
    batch_size = 20,
    concurrent_requests = 5,
    dimensions = 512
  )

  # Custom model and settings
  embeddings <- oai_embed_batch(
    texts = texts,
    model = "text-embedding-3-large",
    dimensions = 1024,
    include_texts = FALSE,
    timeout = 30
  )
} # }