Using Hugging Face Inference Endpoints
Source: vignettes/hugging_face_inference.Rmd
This vignette shows how to embed and classify text with EndpointR using Hugging Face’s inference services.
Setup
library(EndpointR)
library(dplyr)
library(httr2)
library(tibble)
my_data <- tibble(
id = 1:3,
text = c(
"Machine learning is fascinating",
"I love working with embeddings",
"Natural language processing is powerful"
),
category = c("ML", "embeddings", "NLP")
)
Follow Hugging Face’s docs to generate a Hugging Face token, and then register it with EndpointR:
set_api_key("HF_API_KEY")
Choosing Your Service
Hugging Face offers two inference options:
- Inference API: Free, good for testing
- Dedicated Endpoints: Paid, reliable, fast
For this vignette, we’ll use the Inference API. To switch to dedicated endpoints, just change the URL.
Getting Started
Go to Hugging Face’s models hub and fetch the Inference API URL for the model you want to use to embed your data. Not all models are available via the Hugging Face Inference API; if the model you need is not available, you may need to deploy a Dedicated Inference Endpoint.
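As an illustrative sketch (the exact URL is also shown verbatim in the next section), feature-extraction URLs on the Inference API follow a predictable pattern, so you can assemble one from a model id:
# illustrative only: assemble an Inference API feature-extraction URL from a model id,
# following the pattern used in the embeddings example below
model_id <- "sentence-transformers/all-mpnet-base-v2"
feature_extraction_url <- paste0(
  "https://router.huggingface.co/hf-inference/models/",
  model_id,
  "/pipeline/feature-extraction"
)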
Embeddings
Single Text
Embed one piece of text:
# inference api url for embeddings
embed_url <- "https://router.huggingface.co/hf-inference/models/sentence-transformers/all-mpnet-base-v2/pipeline/feature-extraction"
result <- hf_embed_text(
text = "This is a sample text to embed",
endpoint_url = embed_url,
key_name = "HF_API_KEY"
)
The result is a tibble with one row and 768 columns (V1 to V768). Each column is an embedding dimension.
Note: The number of columns depends on your model. Check the model’s Hugging Face page for its embedding size.
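A quick sanity check, assuming the request succeeded, is to inspect the shape of the returned tibble:
# one row for the input text, one column per embedding dimension
dim(result)
ncol(result)  # the embedding dimensionality of the chosen model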
List of Texts
Embed multiple texts at once using batching:
texts <- c(
"First text to embed",
"Second text to embed",
"Third text to embed"
)
batch_result <- hf_embed_batch(
texts,
endpoint_url = embed_url,
key_name = "HF_API_KEY",
batch_size = 3 # process 3 texts per request
)
The result includes:
- text: your original text
- .error: TRUE if something went wrong
- .error_message: what went wrong (if anything)
- V1 to V768: the embedding values
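A common follow-up is to split successes from failures before analysing the embeddings, for example:
# keep rows that embedded successfully; inspect failures separately
successes <- batch_result |> filter(!.error)
failures <- batch_result |> filter(.error) |> select(text, .error_message)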
Data Frame
Most commonly, you’ll want to embed a column in a data frame:
embedding_result <- hf_embed_df(
df = my_data,
text_var = text, # column with your text
id_var = id, # column with unique ids
endpoint_url = embed_url,
key_name = "HF_API_KEY"
)
Check for errors:
embedding_result |> count(.error)
Extract just the embeddings:
embeddings_only <- embedding_result |> select(V1:V768)
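As an illustrative downstream use, assuming every row embedded successfully and rows are returned in input order, you can compute pairwise cosine similarity between the documents with base R:
# pairwise cosine similarity between document embeddings
emb_mat <- as.matrix(embeddings_only)
norms <- sqrt(rowSums(emb_mat^2))
cosine_sim <- (emb_mat %*% t(emb_mat)) / (norms %o% norms)
rownames(cosine_sim) <- colnames(cosine_sim) <- my_data$category  # assumes input order is preserved
round(cosine_sim, 2)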
Classification
Classification works the same way as embeddings, just with a different URL and output format. If necessary, you can also provide a custom function for tidying the output.
Single Text
classify_url <- "https://router.huggingface.co/hf-inference/models/distilbert/distilbert-base-uncased-finetuned-sst-2-english"
sentiment <- hf_classify_text(
text = "I love this package!",
endpoint_url = classify_url,
key_name = "HF_API_KEY"
)
Data Frame
classification_result <- hf_classify_df(
df = my_data,
text_var = text,
id_var = id,
endpoint_url = classify_url,
key_name = "HF_API_KEY"
)
The result includes:
- Your original id column
- Classification labels (e.g., POSITIVE, NEGATIVE)
- Confidence scores
- Error tracking columns
NOTE: Classification labels are model and task specific.
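As with embeddings, it is worth checking the error tracking columns before relying on the labels:
# confirm how many rows classified successfully
classification_result |> count(.error)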
Using Dedicated Endpoints
To use dedicated endpoints instead of the Inference API:
- Deploy your model to a dedicated endpoint (see Hugging Face docs)
- Get your endpoint URL
- Replace the URL in any function:
# just change this line
dedicated_url <- "https://your-endpoint-name.endpoints.huggingface.cloud"
# everything else stays the same
result <- hf_embed_text(
text = "Sample text",
endpoint_url = dedicated_url, # <- only change
key_name = "HF_API_KEY"
)
Note: Dedicated endpoints take 20-30 seconds to start if they’re idle. Set max_retries = 5 to give them time to wake up.
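For example, a minimal sketch that passes max_retries alongside the dedicated endpoint URL:
# allow extra retries so an idle endpoint has time to wake up
result <- hf_embed_text(
  text = "Sample text",
  endpoint_url = dedicated_url,
  key_name = "HF_API_KEY",
  max_retries = 5
)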
Tips
- Start with small batch sizes (3-5) and increase gradually
- The Inference API has rate limits; dedicated endpoints are constrained by their hardware, so provision more hardware for higher limits
- For production use, choose dedicated endpoints
- Check the Improving Performance vignette for speed tips
Common Issues
Rate limits: Reduce batch size or add delays between requests
Model not available: Not all models work with the Inference API. Check the model page or use dedicated endpoints.
Timeouts: Increase max_retries or reduce batch size
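If rate limits persist, one workaround (a sketch, not a built-in EndpointR feature) is to process texts in small chunks and pause between requests:
# split the texts into chunks and sleep briefly between requests
chunks <- split(texts, ceiling(seq_along(texts) / 3))
results <- lapply(chunks, function(chunk) {
  out <- hf_embed_batch(
    chunk,
    endpoint_url = embed_url,
    key_name = "HF_API_KEY",
    batch_size = length(chunk)
  )
  Sys.sleep(2)  # small delay between chunks
  out
})
all_results <- bind_rows(results)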
Improving Performance
EndpointR’s functions come with knobs and dials you can turn to improve throughput and performance. Visit the Improving Performance vignette for more information.
Appendix
Comparison of Inference API vs Dedicated Inference Endpoints
| Feature | Inference API | Dedicated Inference Endpoints |
|---|---|---|
| Accessibility | Public, shared service | Private, dedicated hardware |
| Cost | Free (with paid tiers) | Paid service - rent specific hardware |
| Hardware | Shared computing resources | Dedicated hardware allocation |
| Wait Times | Variable, unknowable in advance | Predictable, minimal queuing, ~30s for first request |
| Production Ready | Not recommended for production | Recommended for production use |
| Use Case | Casual usage, testing, prototyping | Production applications, consistent performance |
| Scalability | Limited by shared resources | Scales with dedicated allocation |
| Availability | Subject to shared infrastructure limits | Guaranteed availability during rental period |
| Model Coverage | Commonly-used models, models selected by Hugging Face | Virtually all models on the Hub are available |