Embed text chunks through Hugging Face Inference Embedding Endpoints
Source:R/hf_embed.R
hf_embed_chunks.Rd

This function processes large volumes of text through Hugging Face's Inference Embedding Endpoints. Results are written to disk in chunks to avoid out-of-memory issues.
Usage
hf_embed_chunks(
texts,
ids,
endpoint_url,
output_dir = "auto",
chunk_size = 5000L,
concurrent_requests = 5L,
max_retries = 5L,
timeout = 10L,
key_name = "HF_API_KEY",
id_col_name = "id"
)

Arguments
- texts
Character vector of texts to process
- ids
Vector of unique identifiers corresponding to each text (same length as texts)
- endpoint_url
URL of the Hugging Face Inference Embedding Endpoint
- output_dir
Path to the directory where the .parquet chunks are written (default: "auto")
- chunk_size
Number of texts to process in each chunk before writing to disk (default: 5000)
- concurrent_requests
Number of concurrent requests (default: 5)
- max_retries
Maximum retry attempts per failed request (default: 5)
- timeout
Request timeout in seconds (default: 10)
- key_name
Name of environment variable containing the API key (default: "HF_API_KEY")
- id_col_name
Name for the ID column in output (default: "id"). When called from hf_embed_df(), this preserves the original column name.
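A usage sketch is shown below. The endpoint URL and API key are placeholders, and reading the chunks back with the arrow package assumes the output files are standard .parquet (an assumption, not documented above):

```r
library(arrow)

# Store the API key in the environment variable the function expects
Sys.setenv(HF_API_KEY = "hf_...")  # placeholder key

texts <- c("First document.", "Second document.")
ids   <- c("doc-1", "doc-2")

hf_embed_chunks(
  texts        = texts,
  ids          = ids,
  endpoint_url = "https://example.endpoints.huggingface.cloud",  # placeholder URL
  output_dir   = "embeddings",
  chunk_size   = 1000L
)

# Combine the written .parquet chunks back into a single data frame
files <- list.files("embeddings", pattern = "\\.parquet$", full.names = TRUE)
embeddings <- do.call(rbind, lapply(files, arrow::read_parquet))
```

Because each chunk is written to disk as soon as it is embedded, an interrupted run leaves the already-completed chunks in `output_dir` rather than losing all progress.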