This vignette shows how to generate text embeddings using EndpointR with both Hugging Face and OpenAI providers.
What are Text Embeddings?
Text embeddings are numerical representations of text that capture semantic meaning. Think of them as coordinates in a high-dimensional space where similar texts are closer together. They’re the foundation for:
- Semantic search
- Clustering similar documents
- Finding duplicates
- Building recommendation systems
- Powering RAG (Retrieval-Augmented Generation) applications
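To make this notion of similarity concrete, here’s a toy cosine-similarity calculation in base R. The vectors are made up for illustration and far shorter than real embeddings:
# two made-up 4-dimensional "embeddings" -- real ones have hundreds of dimensions
vec_a <- c(0.1, 0.8, 0.3, 0.4)
vec_b <- c(0.2, 0.7, 0.4, 0.3)
vec_c <- c(0.9, 0.1, 0.0, 0.2)
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_similarity(vec_a, vec_b) # high: the texts behind these vectors would be semantically similar
cosine_similarity(vec_a, vec_c) # lower: much less related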
EndpointR makes it easy to generate embeddings from your text data using either Hugging Face or OpenAI APIs.
Setup
library(EndpointR)
library(dplyr)
library(tibble)
sample_texts <- tibble(
id = 1:3,
text = c(
"Machine learning is transforming how we process information",
"I love building applications with embeddings",
"Natural language processing enables computers to understand text"
),
category = c("ML", "embeddings", "NLP")
)
Provider Comparison
Before diving into code, let’s understand the key differences:
Feature | Hugging Face | OpenAI |
---|---|---|
Models | Many open-source models | text-embedding-3-small/large, ada-002 |
Dimensions | Model-dependent (often 384, 768) | Configurable (512-3072) |
Pricing | Free tier available, pay for dedicated | Pay per token |
Rate Limits | Varies by tier | Generous for most use cases |
Max Input | Model-dependent | 8,192 tokens per request |
Batching (multiple documents per request) | Supported | Supported |
Hugging Face Embeddings
Setting Up
First, get your API key from Hugging Face and set it. Then set your endpoint’s URL. Here we’ve chosen an endpoint for accessing embeddings from the all-mpnet-base-v2 model.
set_api_key("HF_TEST_API_KEY")
embed_url <- "https://router.huggingface.co/hf-inference/models/sentence-transformers/all-mpnet-base-v2/pipeline/feature-extraction"
Single Text
The simplest case - embed one piece of text:
embedding <- hf_embed_text(
text = "I want to understand the meaning of this sentence",
endpoint_url = embed_url,
key_name = "HF_TEST_API_KEY"
)
dim(embedding) # result: a tibble with 768 columns (V1 to V768)
embedding
Batch Processing
For multiple texts, use hf_embed_batch(), which handles batching automatically. We feed in a vector of inputs and a batch_size, and the function takes care of splitting our vector into as many batches as necessary.
texts_to_embed <- c(
"First document about machine learning",
"Second document about deep learning",
"Third document about neural networks",
"Fourth document about data science"
)
batch_embeddings <- hf_embed_batch(
texts = texts_to_embed,
endpoint_url = embed_url,
key_name = "HF_TEST_API_KEY",
batch_size = 2, # process 2 texts per API call
concurrent_requests = 2 # run 2 requests in parallel
)
# Check results
glimpse(batch_embeddings[1, 1:10]) # truncated for ease of reading
The result includes:
- text: your original text
- .error and .error_message: error tracking
- V1 to V768: the embedding dimensions
Data Frame Integration
Most commonly, you’ll want to embed a column from a data frame:
embedded_df <- hf_embed_df(
df = sample_texts,
text_var = text, # column containing text
id_var = id, # unique identifier column
endpoint_url = embed_url,
key_name = "HF_TEST_API_KEY",
batch_size = 3,
concurrent_requests = 1
)
# Original data + embeddings
names(embedded_df)[1:10] # shows: id, text, category, .error, .error_message, V1, V2...
embedded_df
OpenAI Embeddings
Single Text
OpenAI offers configurable embedding dimensions:
# Default dimensions (1536 for text-embedding-3-small)
embedding <- oai_embed_text(
text = "I want to understand the meaning of this sentence"
)
# Custom dimensions for smaller embeddings
small_embedding <- oai_embed_text(
text = "I want to understand the meaning of this sentence",
model = "text-embedding-3-small",
dimensions = 512 # reduce size by ~67%
)
dim(small_embedding) # 1 row, 512 embedding columns + index
Batch Processing
OpenAI allows multiple texts in a single API call, which oai_embed_batch() leverages.
texts_to_embed <- c(
"First document about machine learning",
"Second document about deep learning",
"Third document about neural networks",
"Fourth document about data science"
)
batch_embeddings <- oai_embed_batch(
texts = texts_to_embed,
model = "text-embedding-3-small",
dimensions = 1536, # default for this model
batch_size = 10, # texts per API request
concurrent_requests = 3 # parallel requests
)
batch_embeddings |>
reframe(
total = n(),
succeeded = sum(!.error),
failed = sum(.error)
)
Data Frame Integration
embedded_df <- oai_embed_df(
df = sample_texts,
text_var = text,
id_var = id,
model = "text-embedding-3-large", # higher quality embeddings
dimensions = 3072, # maximum dimensions for this model
batch_size = 20,
concurrent_requests = 5
)
# Extract just the embeddings for downstream use
embedded_df |>
select(starts_with("V"))
Handling Sequence Length
OpenAI Limits
OpenAI has a limit of 8,192 tokens per request. Since 1 token is roughly 4 characters:
long_texts <- tibble(
id = 1:3,
text = c(
paste(rep("word", 100), collapse = " "), # ~400 chars, safe
paste(rep("word", 8000), collapse = " "), # ~32k chars, near limit
paste(rep("word", 10000), collapse = " ") # ~40k chars, too long!
)
)
long_texts |>
mutate(
char_count = nchar(text),
approx_tokens = char_count / 4,
will_fail = approx_tokens > 8192
)
We can truncate our texts with the substr() function, but in practice we would of course want to use a more intelligent splitting procedure.
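As a rough sketch of truncation, using the ~4 characters per token heuristic from above:
truncated_texts <- long_texts |>
  mutate(text = substr(text, 1, 8000 * 4)) # keep roughly the first 8,000 tokens' worth of characters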
Hugging Face Limits
When using the Inference API, limits vary by model, so check the model card. Most models handle ~512 tokens well; more modern models can handle more. Dedicated Inference Endpoints will accept as many requests as the assigned hardware can handle.
This code chunk shows you how to chunk up your texts if you’re finding errors due to payload size:
chunk_text <- function(text, max_chars = 2000) {
if (nchar(text) <= max_chars) return(list(text))
# very simple chunking (consider sentence boundaries or more intelligent chunking in production)
chunks <- substring(text,
seq(1, nchar(text), max_chars),
seq(max_chars, nchar(text) + max_chars - 1, max_chars))
as.list(chunks[nchar(chunks) > 0])
}
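For example, you could chunk a single long document and embed each chunk separately (whether you then average, store, or search over the chunk embeddings depends on your use case):
long_document <- paste(rep("word", 10000), collapse = " ")
chunks <- unlist(chunk_text(long_document, max_chars = 2000))
chunk_embeddings <- hf_embed_batch(
  texts = chunks,
  endpoint_url = embed_url,
  key_name = "HF_TEST_API_KEY",
  batch_size = 2
)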
Best Practices
Error Handling
Always check for errors in your results. This chunk shows how to send your failures off in another batch request, but be aware that you’ll need to combine the resulting data frames yourself.
results <- oai_embed_batch(texts = texts_to_embed)
# Check overall success
if (any(results$.error)) {
failed <- results |>
filter(.error) |>
select(text, .error_message)
print(failed)
# Retry failed texts with adjusted parameters
retry_texts <- failed$text
retry_results <- oai_embed_batch(
texts = retry_texts,
batch_size = 1, # one at a time
timeout = 30 # longer timeout
)
}
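One way to stitch the two passes back together, assuming the retries succeeded (a sketch only; you may want to re-check .error on the retried rows first):
# combine the successes from the first pass with the retried texts
final_results <- bind_rows(
  results |> filter(!.error),
  retry_results
)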
Performance Tips
- Start small: begin with batch_size = 5 and concurrent_requests = 1
- Scale gradually: increase parameters whilst monitoring errors
- Model selection:
  - Hugging Face: all-MiniLM-L6-v2 for speed (384 dims)
  - OpenAI: text-embedding-3-small with custom dimensions for flexibility
- Consider dedicated endpoints for production Hugging Face deployments
TIP: Check your organisation’s tier on OpenAI; tier 5 organisations can send many more requests than tier 1. See the OpenAI Rate Limits documentation for details.
Cost Optimisation
OpenAI: Reduce dimensions to save storage and computation
compact_embeddings <- oai_embed_batch(
texts = texts_to_embed,
model = "text-embedding-3-small",
dimensions = 360
)
WARNING: You’ll need to compare the performance vs. size trade-off for your particular use case.
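As a quick check on the size side of that trade-off, you could compare the in-memory footprint of the full-dimension and compact results (object.size() is a rough guide only):
full_embeddings <- oai_embed_batch(
  texts = texts_to_embed,
  model = "text-embedding-3-small",
  dimensions = 1536
)
object.size(full_embeddings)
object.size(compact_embeddings) # considerably smaller, at the cost of some retrieval quality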
Common Use Cases
Semantic Search
A very basic implementation of semantic search (also known as dense embedding search, vector search, or neural search) follows these steps, sketched in code below:
- Embed your document corpus
- Embed search query
- Find similar documents (cosine similarity)
- Extract top 5 most similar
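Here’s a minimal sketch of those steps using EndpointR output and base R cosine similarity; the query text and the cosine_similarity helper are illustrative, not part of the package:
# 1. embed the document corpus
corpus_df <- hf_embed_df(
  df = sample_texts,
  text_var = text,
  id_var = id,
  endpoint_url = embed_url,
  key_name = "HF_TEST_API_KEY"
)
corpus_matrix <- corpus_df |>
  select(starts_with("V")) |>
  as.matrix()
# 2. embed the search query with the same model
query_embedding <- hf_embed_text(
  text = "how do computers understand language?",
  endpoint_url = embed_url,
  key_name = "HF_TEST_API_KEY"
) |>
  select(starts_with("V")) |>
  as.matrix() |>
  as.numeric()
# 3. cosine similarity between the query and every document
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
similarities <- apply(corpus_matrix, 1, cosine_similarity, b = query_embedding)
# 4. rank and keep the most similar documents (top 5 in a larger corpus)
corpus_df |>
  mutate(similarity = similarities) |>
  select(id, text, similarity) |>
  arrange(desc(similarity)) |>
  slice_head(n = 5)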
NOTE: For most use-cases, a hybrid approach combining full-text search and semantic search will yield the best results.
Clustering
- Generate embeddings
- Extract embeddings matrix for clustering algorithm
- Run clustering algorithm
- Add clusters to data frame
embeddings_for_clustering <- hf_embed_df(
df = sample_texts,
text_var = text,
id_var = id,
endpoint_url = embed_url,
key_name = "HF_TEST_API_KEY"
)
embedding_matrix <- embeddings_for_clustering |>
select(starts_with("V")) |>
as.matrix()
kmeans_result <- kmeans(embedding_matrix, centers = 2)
clustered_texts <- sample_texts |>
mutate(cluster = kmeans_result$cluster)
TIP: In practice you need to inspect the outputs of the clustering model and tune it (for example, by comparing different numbers of clusters, as sketched below). The code to run the model is only a small part of the modelling process.
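A common first check is to compare the total within-cluster sum of squares across a few values of k; this assumes a corpus with more documents than our three toy rows:
k_values <- 2:6
withinss <- sapply(k_values, function(k) {
  kmeans(embedding_matrix, centers = k, nstart = 10)$tot.withinss
})
plot(k_values, withinss, type = "b",
     xlab = "number of clusters (k)",
     ylab = "total within-cluster sum of squares") # look for an 'elbow'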
Troubleshooting
Rate Limits
Use fewer concurrent_requests if you’re running into rate limit issues.
results <- hf_embed_batch(
texts = large_text_collection,
batch_size = 3,
concurrent_requests = 1 # sequential processing
)
results <- oai_embed_batch(
texts = large_text_collection,
batch_size = 5, # fewer requests with larger batch size
concurrent_requests = 2
)
Some rate limits are per request, others are per token. If you’re hitting token limits, the solution is to wait longer between requests; if you’re hitting request limits, you can increase batch_size to embed more data with fewer requests.
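If you do need to wait, a minimal sketch is to split the texts and pause between chunks (the chunk size and pause length are arbitrary):
library(purrr)
text_chunks <- split(large_text_collection,
                     ceiling(seq_along(large_text_collection) / 100))
all_results <- map(text_chunks, function(chunk) {
  res <- oai_embed_batch(texts = chunk, batch_size = 10, concurrent_requests = 1)
  Sys.sleep(10) # let the token-per-minute window reset before the next chunk
  res
}) |>
  list_rbind()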
Timeouts
Increase the value of the timeout parameter if you’re sending many requests or if responses begin timing out.
results <- oai_embed_batch(
texts = texts_to_embed,
timeout = 60,
max_retries = 5
)
Memory Issues
Process in chunks for very large datasets
library(purrr)
text_chunks <- split(large_text_vector,
ceiling(seq_along(large_text_vector) / 1000))
all_embeddings <- map(text_chunks, ~{
oai_embed_batch(.x, batch_size = 10)
}) |>
list_rbind()
TIP: You could also write your splits to individual files, and iterate through the files to avoid reading your data into memory all at once.
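A sketch of that file-based approach, reusing text_chunks from above (the directory name and file format are just illustrative):
# write each chunk to its own file...
dir.create("embedding_chunks", showWarnings = FALSE)
iwalk(text_chunks, function(chunk, i) {
  saveRDS(chunk, file.path("embedding_chunks", paste0("chunk_", i, ".rds")))
})
# ...then read and embed one file at a time, saving the results as you go
chunk_files <- list.files("embedding_chunks", pattern = "^chunk_", full.names = TRUE)
walk(chunk_files, function(path) {
  chunk <- readRDS(path)
  embeddings <- oai_embed_batch(texts = chunk, batch_size = 10)
  saveRDS(embeddings, sub("chunk_", "embeddings_", path))
})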
Next Steps
- See the Improving Performance vignette for optimisation tips
- Check out Hugging Face Inference for classification tasks
- Explore different embedding models for your specific use case
Remember: embeddings are the foundation for many NLP applications. Choose your provider based on your needs for quality, speed, cost, and flexibility.