High-level function to generate embeddings for texts in a data frame. It handles the entire process from request creation to response processing, with options for batching and parallel execution. The number of retry attempts for failed requests can be set with max_retries.
Reduce the risk of data loss by setting a relatively small chunk_size (e.g. 5,000 or 10,000). Each chunk is written to a .parquet file in the output_dir directory, which also contains a metadata.json file that records important information such as the endpoint URL used. Be sure to add any output directories to your .gitignore! Chunks can be recombined after a run; see the sketch below.
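A minimal read-back sketch, assuming the arrow and dplyr packages and a hypothetical output directory named "embeddings_chunks":

# Collect every .parquet chunk in the output directory and row-bind them
chunk_files <- list.files("embeddings_chunks", pattern = "\\.parquet$", full.names = TRUE)
embeddings <- dplyr::bind_rows(lapply(chunk_files, arrow::read_parquet))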
Usage
hf_embed_df(
  df,
  text_var,
  id_var,
  endpoint_url,
  key_name,
  output_dir = "auto",
  chunk_size = 5000L,
  concurrent_requests = 1L,
  max_retries = 5L,
  timeout = 15L,
  progress = TRUE
)
Arguments
- df
A data frame containing the texts to embed.
- text_var
Name of the column containing the text to embed.
- id_var
Name of the column to use as the ID.
- endpoint_url
The URL of the Hugging Face Inference API endpoint.
- key_name
Name of the environment variable containing the API key (see the note after this list).
- output_dir
Path to the directory for the .parquet chunks.
- chunk_size
The number of rows in each chunk; each chunk is processed and then written to a file.
- concurrent_requests
Number of requests to send concurrently. Some APIs do not allow multiple simultaneous requests.
- max_retries
Maximum number of retry attempts for failed requests.
- timeout
Request timeout in seconds.
- progress
Whether to display a progress bar.
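Note that key_name is the name of an environment variable, not the key itself. A minimal sketch of providing the key for the current session, assuming a hypothetical variable name HF_API_KEY:

# Store the API key in an environment variable for this session;
# for persistence, add a line like HF_API_KEY=... to your .Renviron instead
Sys.setenv(HF_API_KEY = "hf_xxx")
# Pass the *name* of the variable, not the key, to hf_embed_df()
# e.g. hf_embed_df(..., key_name = "HF_API_KEY")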
Examples
if (FALSE) { # \dontrun{
# Generate embeddings for a data frame
df <- data.frame(
  id = 1:3,
  text = c("First example", "Second example", "Third example")
)
# Use batching without parallel processing
embeddings_df <- hf_embed_df(
  df = df,
  text_var = text,
  endpoint_url = "https://my-endpoint.huggingface.cloud",
  id_var = id,
  key_name = "HF_API_KEY"  # example name of the env var holding your API key
)
# Use both chunking and parallel processing
embeddings_df <- hf_embed_df(
  df = df,
  text_var = text,
  endpoint_url = "https://my-endpoint.huggingface.cloud",
  id_var = id,
  key_name = "HF_API_KEY",  # example name of the env var holding your API key
  chunk_size = 10000,
  concurrent_requests = 50
)
} # }