Embed your documents — bt_do_embedding • BertopicR

Takes a document, or list of documents, and returns a numerical embedding which can be used as features for a machine learning model or for semantic similarity search. If you have pre-computed your embeddings, you can skip this step. The bt_do_embedding function is designed to be used as one step in a topic modelling pipeline.

Usage

bt_do_embedding(
  embedder,
  documents,
  ...,
  accelerator = NULL,
  progress_bar = TRUE
)

Arguments

embedder: An embedding model (output of bt_make_embedder)
documents: A character vector of the documents to be embedded, e.g. your text variable
...: Optional or additional parameters passed to SentenceTransformer's encode function, e.g. batch_size
accelerator: A string containing the name of a hardware accelerator, e.g. "mps", "cuda". This is currently applied only if the embedder is a sentence transformer or comes from the flair library. If NULL, no accelerator is used for sentence transformer or flair embeddings. GPU usage for spacy embeddings should be specified when the embedder is created (bt_make_embedder_spacy).
progress_bar: A logical value indicating whether a progress bar is shown in the console. This only applies to embedders from the sentence-transformers package.
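
As a rough illustration of how these arguments combine, the sketch below wraps the call in a helper. The helper name embed_documents and the batch size of 64 are assumptions made for this example only; batch_size is forwarded through ... to the encode function as described above, and "cuda" should be swapped for whichever accelerator your machine actually provides.

```r
# Hypothetical helper (not part of BertopicR): embeds documents with a
# sentence-transformers embedder, a larger batch size and a GPU.
embed_documents <- function(docs, accelerator = "cuda") {
  embedder <- BertopicR::bt_make_embedder_st("all-MiniLM-L6-v2")
  BertopicR::bt_do_embedding(
    embedder,
    docs,
    batch_size = 64,            # passed on via ... (assumed value)
    accelerator = accelerator,  # e.g. "mps" on Apple silicon, NULL for CPU
    progress_bar = FALSE        # silence the console progress bar
  )
}
```

Calling embed_documents(docs, accelerator = NULL) would run the same pipeline without an accelerator.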

Value

An array of floating point numbers
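
To make the semantic similarity use case concrete, here is a minimal base R sketch. It assumes the embeddings behave like a numeric matrix with one row per document, which this page does not state explicitly. The matrix below is a made-up stand-in rather than real model output, and cosine_similarity is a helper written for this example.

```r
# Stand-in for the output of bt_do_embedding(): 4 documents x 3 dimensions.
# Real embeddings have far more dimensions; these values are invented.
embeddings <- matrix(
  c(1.0, 0.0, 0.0,
    0.9, 0.1, 0.0,
    0.0, 1.0, 0.0,
    0.0, 0.0, 1.0),
  nrow = 4, byrow = TRUE
)

# Cosine similarity between every pair of rows.
cosine_similarity <- function(x) {
  unit <- x / sqrt(rowSums(x^2))  # scale each row to unit length
  unit %*% t(unit)
}

sims <- cosine_similarity(embeddings)

# Find the document most similar to document 1, excluding itself.
query <- 1
scores <- sims[query, ]
scores[query] <- -Inf
nearest <- which.max(scores)
```

With this toy matrix, nearest is 2, because the second row points in almost the same direction as the first.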

Details

This function was initially built on the sentence_transformers Python library, but it may be expanded to accept other frameworks. Supply your documents as a character vector. You can use hardware accelerators, e.g. GPUs, to speed up computation.

The function currently returns an object with two additional attributes, embedding_model and n_documents. They are attached to the embeddings so that they can be extracted at later steps in the pipeline. For example, when merging data frames later on, it is important to check how many documents were embedded.
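
The check described above might look like the following sketch. The embeddings object here is a hand-built stand-in carrying the two attribute names mentioned in this section; what the real embedding_model attribute contains is an assumption, and the data frame is hypothetical.

```r
docs <- c("i am", "a list of", "documents", "to be embedded")

# Stand-in for the object returned by bt_do_embedding(); values are made up.
embeddings <- matrix(0, nrow = length(docs), ncol = 3)
attr(embeddings, "embedding_model") <- "all-MiniLM-L6-v2"  # assumed content
attr(embeddings, "n_documents") <- length(docs)

# Hypothetical data frame the embeddings will later be merged with.
data <- data.frame(id = seq_along(docs), text = docs)

# Extract the document count and confirm it matches before merging.
n_embedded <- attr(embeddings, "n_documents")
stopifnot(n_embedded == nrow(data))
```

If the counts differ, stopifnot() raises an error before any misaligned merge can happen.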

Examples

docs <- c("i am", "a list of", "documents", "to be embedded")

# Create a sentence-transformers embedder (the model is fetched over HTTPS if it is not already cached)
embedder <- bt_make_embedder_st("all-MiniLM-L6-v2")

# Embed the documents without a hardware accelerator
embeddings <- bt_do_embedding(embedder, docs, accelerator = NULL)