High-level function to generate embeddings for texts in a data frame. This function handles the entire process, from request creation to response processing, with options for batching and concurrent requests. The number of retry attempts for failed requests is set with max_retries.

Usage

hf_embed_df(
  df,
  text_var,
  id_var,
  endpoint_url,
  key_name,
  batch_size = 8,
  concurrent_requests = 1,
  max_retries = 5,
  timeout = 15,
  progress = TRUE
)

Arguments

df

A data frame containing texts to embed

text_var

Name of the column containing text to embed

id_var

Name of the column to use as ID

endpoint_url

The URL of the Hugging Face Inference API endpoint

key_name

Name of the environment variable containing the API key

batch_size

Number of texts to process in one batch (NULL for no batching)

concurrent_requests

Number of requests to send concurrently. Some APIs do not allow multiple simultaneous requests.

max_retries

Maximum number of retry attempts for failed requests.

timeout

Request timeout in seconds

progress

Whether to display a progress bar
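
To illustrate how batch_size relates to the number of requests, the sketch below splits a character vector into batches using base R. It shows the general technique only and is not taken from the package's internals; split_into_batches is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper (not part of the package): split texts into batches
# of at most `batch_size` elements.
split_into_batches <- function(texts, batch_size) {
  stopifnot(is.numeric(batch_size), batch_size >= 1)
  # Assign each text a batch index: 1, 1, ..., 2, 2, ...
  batch_index <- ceiling(seq_along(texts) / batch_size)
  unname(split(texts, batch_index))
}

texts <- paste("Example text", 1:20)
batches <- split_into_batches(texts, batch_size = 8)

length(batches)   # 3 batches
lengths(batches)  # 8 8 4
```

With 20 texts and a batch size of 8, three batches would be needed, so a smaller batch_size generally means more requests for the same data.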

Value

A data frame with the original data plus embedding columns

Examples

if (FALSE) { # \dontrun{
  # Generate embeddings for a data frame
  df <- data.frame(
    id = 1:3,
    text = c("First example", "Second example", "Third example")
  )

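  # Use concurrent requests without batching
  # ("HF_API_KEY" is a placeholder; use the name of your own environment variable)
  embeddings_df <- hf_embed_df(
    df = df,
    text_var = text,
    id_var = id,
    endpoint_url = "https://my-endpoint.huggingface.cloud",
    key_name = "HF_API_KEY",
    concurrent_requests = 2,
    batch_size = NULL
  )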

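  # Use batching without concurrent requests
  # ("HF_API_KEY" is a placeholder; use the name of your own environment variable)
  embeddings_df <- hf_embed_df(
    df = df,
    text_var = text,
    id_var = id,
    endpoint_url = "https://my-endpoint.huggingface.cloud",
    key_name = "HF_API_KEY",
    concurrent_requests = 1,
    batch_size = 10
  )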

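  # Use both batching and concurrent requests
  # ("HF_API_KEY" is a placeholder; use the name of your own environment variable)
  embeddings_df <- hf_embed_df(
    df = df,
    text_var = text,
    id_var = id,
    endpoint_url = "https://my-endpoint.huggingface.cloud",
    key_name = "HF_API_KEY",
    concurrent_requests = 2,
    batch_size = 10
  )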
} # }
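
Because key_name is the name of an environment variable rather than the key itself, it may help to confirm that the variable is set before calling the function. The following is a minimal sketch, assuming a placeholder variable name HF_API_KEY and a hypothetical helper check_api_key; neither is defined by the package.

```r
# Hypothetical helper: confirm that an environment variable holding the
# API key is set and non-empty.
check_api_key <- function(key_name) {
  key <- Sys.getenv(key_name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable '", key_name, "' is not set.", call. = FALSE)
  }
  invisible(TRUE)
}

# Placeholder value for illustration only. In practice, set the variable
# outside the script (for example in .Renviron) so the key is not hard-coded.
Sys.setenv(HF_API_KEY = "hf_placeholder")
check_api_key("HF_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the endpoint.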