
Processes large volumes of text through Hugging Face's Inference Embedding Endpoints. Results are written to disk in chunks, as .parquet files, to avoid out-of-memory issues.

Usage

hf_embed_chunks(
  texts,
  ids,
  endpoint_url,
  output_dir = "auto",
  chunk_size = 5000L,
  concurrent_requests = 5L,
  max_retries = 5L,
  timeout = 10L,
  key_name = "HF_API_KEY",
  id_col_name = "id"
)

Arguments

texts

Character vector of texts to process

ids

Vector of unique identifiers corresponding to each text (same length as texts)

endpoint_url

URL of the Hugging Face Embedding Endpoint

output_dir

Path to the directory where the .parquet chunk files are written (default: "auto")

chunk_size

Number of texts to process in each chunk before writing to disk (default: 5000)

concurrent_requests

Number of concurrent requests (default: 5)

max_retries

Maximum retry attempts per failed request (default: 5)

timeout

Request timeout in seconds (default: 10)

key_name

Name of environment variable containing the API key (default: "HF_API_KEY")

id_col_name

Name for the ID column in output (default: "id"). When called from hf_embed_df(), this preserves the original column name.

Value

A tibble with columns:

  • ID column (name specified by id_col_name): Original identifier from input

  • .error: Logical indicating if request failed

  • .error_msg: Error message if failed, NA otherwise

  • .chunk: Chunk number for tracking

  • Embedding columns (V1, V2, etc.)
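As a rough illustration of how the returned columns might be used, the base R sketch below builds a small mock result by hand and separates failed rows from successful embeddings. The values are invented, the ID column is assumed to be named "id", and the assumption that failed rows carry NA in the embedding columns is ours rather than something this page states.

```r
# Mock result with the documented columns; all values are made up.
# Assumption: failed requests have NA in their embedding columns.
result <- data.frame(
  id = c("a", "b", "c"),
  .error = c(FALSE, TRUE, FALSE),
  .error_msg = c(NA, "request timed out", NA),
  .chunk = c(1L, 1L, 2L),
  V1 = c(0.12, NA, -0.40),
  V2 = c(0.98, NA, 0.33),
  stringsAsFactors = FALSE
)

# Rows whose request failed, with the message and chunk for diagnosis.
failed <- result[result$.error, c("id", ".error_msg", ".chunk")]

# Identifiers that could be passed to a later retry.
retry_ids <- failed$id

# Numeric matrix of embeddings for the successful rows only.
embedding_cols <- grep("^V[0-9]+$", names(result), value = TRUE)
embedding_matrix <- as.matrix(result[!result$.error, embedding_cols])
```

Selecting the embedding columns by pattern rather than by position keeps the code independent of how many dimensions the endpoint returns.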

Details

This function processes texts in chunks, creating individual requests for each text within a chunk. The chunk size determines how many texts are processed before writing results to disk. Within each chunk, requests are sent with the specified level of concurrency.
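The chunk-then-write pattern described above can be sketched in base R as follows. This is not the package's implementation: fake_embed() and embed_in_chunks() are hypothetical names, the "embedding" is a placeholder vector rather than an endpoint response, requests run sequentially rather than concurrently, and chunks are saved as .rds files instead of .parquet so the example needs no extra packages.

```r
# Hypothetical stand-in for one embedding request; returns a fixed-length
# numeric vector instead of calling a real endpoint.
fake_embed <- function(text) c(nchar(text), 0, 1)

# Illustrative only: process texts chunk by chunk, saving one file per chunk
# so that only a single chunk's results are held in memory at a time.
embed_in_chunks <- function(texts, ids, chunk_size = 2L, output_dir = tempdir()) {
  stopifnot(length(texts) == length(ids))
  # Assign each text to a chunk number: 1, 1, 2, 2, 3, ... for chunk_size = 2.
  chunk_index <- ceiling(seq_along(texts) / chunk_size)
  chunk_files <- character(0)
  for (chunk in unique(chunk_index)) {
    in_chunk <- which(chunk_index == chunk)
    # One "request" per text within the chunk (sequential in this sketch).
    embeddings <- do.call(rbind, lapply(texts[in_chunk], fake_embed))
    colnames(embeddings) <- paste0("V", seq_len(ncol(embeddings)))
    chunk_result <- data.frame(id = ids[in_chunk], .chunk = chunk, embeddings)
    # Write the finished chunk to disk before moving on to the next one.
    path <- file.path(output_dir, sprintf("chunk_%03d.rds", chunk))
    saveRDS(chunk_result, path)
    chunk_files <- c(chunk_files, path)
  }
  chunk_files
}

# Five texts with chunk_size = 2 are split into three chunks.
files <- embed_in_chunks(c("a", "bb", "ccc", "dddd", "eeeee"), ids = 1:5, chunk_size = 2L)

# Reassemble the chunk files into a single data frame.
combined <- do.call(rbind, lapply(files, readRDS))
```

A smaller chunk size lowers peak memory use at the cost of more files on disk; a larger one does the reverse.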