
Use Hugging Face models to create topic representations

Usage

bt_representation_hf(
fitted_model,
documents,
task,
hf_model,
...,
default_prompt = "keywords",
nr_samples = 500,
nr_repr_docs = 20,
diversity = 10,
custom_prompt = NULL
)

Arguments

fitted_model

The fitted bertopic model.

documents

The documents the topic model was fitted to.

task

Task defining the pipeline that will be returned. See https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html for more information. Use "text-generation" for GPT-like models and "text2text-generation" for T5-like models.

hf_model

The model that will be used by the pipeline to make predictions.

...

Additional arguments passed to the transformers.pipeline function.

default_prompt

Whether to use the "keywords" or "documents" default prompt. If a custom_prompt is supplied, this argument is ignored. Defaults to the "keywords" prompt.

nr_samples

Number of documents sampled from each topic, from which the representative documents are chosen.

nr_repr_docs

Number of representative documents to be sent to the Hugging Face model.

diversity

Diversity of the documents sent to the Hugging Face model: 0 = no diversity, 1 = maximum diversity.

custom_prompt

The custom prompt to be used in the pipeline. If not specified, the "keywords" or "documents" default_prompt will be used. Use "[KEYWORDS]" and "[DOCUMENTS]" in the prompt to mark where the keywords and documents should be inserted; see the example below.
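
For example, a custom prompt might look like the following (an illustrative string only):

# Illustrative custom prompt; the function substitutes [KEYWORDS] and
# [DOCUMENTS] with each topic's keywords and representative documents.
my_prompt <- "I have a topic described by the keywords: [KEYWORDS].
The topic contains these documents: [DOCUMENTS].
Based on the keywords and documents, give a short label for this topic."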

Value

The updated representation of each topic.

Details

Representative documents are chosen for each topic by sampling nr_samples documents from the topic and calculating which of those documents are most representative of the topic, using c-TF-IDF cosine similarity between the topic and the individual documents. The most representative documents (the number is set by nr_repr_docs) are then extracted and passed to the Hugging Face model, which predicts a description for each topic.
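
A minimal usage sketch, assuming fitted_model is a bertopic model already fitted to the character vector docs elsewhere in your workflow (the model id shown is illustrative, not a package default):

# Sketch only: fitted_model and docs come from an earlier fitting step;
# "google/flan-t5-base" is an illustrative T5-like model id.
representation <- bt_representation_hf(
  fitted_model   = fitted_model,
  documents      = docs,
  task           = "text2text-generation",  # use "text-generation" for GPT-like models
  hf_model       = "google/flan-t5-base",
  default_prompt = "keywords",
  nr_samples     = 500,
  nr_repr_docs   = 20
)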