
Use Hugging Face models to create topic representations

Usage

bt_representation_hf(
fitted_model,
documents,
task,
hf_model,
...,
default_prompt = "keywords",
nr_samples = 500,
nr_repr_docs = 20,
diversity = 10,
custom_prompt = NULL
)

Arguments

fitted_model

The fitted bertopic model.

documents

The documents the topic model was fitted to.

task

Task defining the pipeline that will be returned. See https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html for more information. Use "text-generation" for GPT-like models and "text2text-generation" for T5-like models.

hf_model

The model that will be used by the pipeline to make predictions.

...

Additional arguments passed to the transformers.pipeline function.

default_prompt

Whether to use the "keywords" or "documents" default prompt. If a custom_prompt is supplied, this argument is ignored. Defaults to the "keywords" prompt.

nr_samples

Number of documents sampled from each topic, from which the representative documents are chosen.

nr_repr_docs

Number of representative documents to be sent to the Hugging Face model.

diversity

Diversity of the documents sent to the Hugging Face model: 0 = no diversity, 1 = maximum diversity.

custom_prompt

The custom prompt to be used in the pipeline. If not specified, the "keywords" or "documents" default_prompt will be used. Use "[KEYWORDS]" and "[DOCUMENTS]" in the prompt to mark where the keywords and documents should be inserted; see the example below.
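
For example, a custom prompt might look like the following (an illustrative string only):

# Illustrative custom prompt; the function substitutes [KEYWORDS] and
# [DOCUMENTS] with each topic's keywords and representative documents.
my_prompt <- "I have a topic described by the keywords: [KEYWORDS].
The topic contains these documents: [DOCUMENTS].
Based on the keywords and documents, give a short label for this topic."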

Value

The updated representation of each topic.

Details

Representative documents are chosen for each topic by sampling nr_samples documents from the topic and calculating which of those documents are most representative of the topic, using c-TF-IDF cosine similarity between the topic and the individual documents. The most representative documents (the number is set by nr_repr_docs) are then extracted and passed to the Hugging Face model, which predicts a description for each topic.
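
A minimal usage sketch, assuming fitted_model is a bertopic model already fitted to the character vector docs elsewhere in your workflow (the model id shown is illustrative, not a package default):

# Sketch only: fitted_model and docs come from an earlier fitting step;
# "google/flan-t5-base" is an illustrative T5-like model id.
representation <- bt_representation_hf(
  fitted_model   = fitted_model,
  documents      = docs,
  task           = "text2text-generation",  # use "text-generation" for GPT-like models
  hf_model       = "google/flan-t5-base",
  default_prompt = "keywords",
  nr_samples     = 500,
  nr_repr_docs   = 20
)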