High-level function to generate embeddings for texts in a data frame. This function handles the entire process from request creation to response processing, with options for batching and concurrent requests. The max_retries argument sets the maximum number of retry attempts for failed requests.
Usage
hf_embed_df(
  df,
  text_var,
  id_var,
  endpoint_url,
  key_name,
  batch_size = 8,
  concurrent_requests = 1,
  max_retries = 5,
  timeout = 15,
  progress = TRUE
)
Arguments
- df
A data frame containing texts to embed
- text_var
Name of the column containing text to embed
- id_var
Name of the column to use as ID
- endpoint_url
The URL of the Hugging Face Inference API endpoint
- key_name
Name of the environment variable containing the API key
- batch_size
Number of texts to process in one batch (NULL for no batching)
- concurrent_requests
Number of requests to send concurrently. Note that some APIs do not allow multiple concurrent requests.
- max_retries
Maximum number of retry attempts for failed requests.
- timeout
Request timeout in seconds
- progress
Whether to display a progress bar
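Because key_name takes the *name* of an environment variable rather than the key itself, the key must be available in the environment before the function is called. A minimal sketch (the variable name "HF_API_KEY" is just an example, not required by the package):

```r
# Store the API key in an environment variable. In practice this is
# usually done in ~/.Renviron rather than in a script, so the key
# never appears in code or in your command history.
Sys.setenv(HF_API_KEY = "hf_xxxxxxxxxxxx")  # placeholder key

# Pass the NAME of the variable, not the key itself:
# hf_embed_df(..., key_name = "HF_API_KEY")
```

Keeping the key out of function arguments means it is never captured in saved workspaces or rendered documents that record the call.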
Examples
if (FALSE) { # \dontrun{
# Generate embeddings for a data frame
df <- data.frame(
id = 1:3,
text = c("First example", "Second example", "Third example")
)
# Send requests concurrently, without batching
# (key_name names an environment variable holding the API key)
embeddings_df <- hf_embed_df(
  df = df,
  text_var = text,
  id_var = id,
  endpoint_url = "https://my-endpoint.huggingface.cloud",
  key_name = "HF_API_KEY",
  concurrent_requests = 4,
  batch_size = NULL
)
# Use batching with sequential requests
embeddings_df <- hf_embed_df(
  df = df,
  text_var = text,
  id_var = id,
  endpoint_url = "https://my-endpoint.huggingface.cloud",
  key_name = "HF_API_KEY",
  concurrent_requests = 1,
  batch_size = 10
)
# Use both batching and concurrent requests
embeddings_df <- hf_embed_df(
  df = df,
  text_var = text,
  id_var = id,
  endpoint_url = "https://my-endpoint.huggingface.cloud",
  key_name = "HF_API_KEY",
  concurrent_requests = 4,
  batch_size = 10
)
} # }