This creates topic representations based on the KeyBERT algorithm.
Usage
bt_representation_keybert(
  fitted_model,
  documents,
  document_embeddings,
  embedding_model,
  top_n_words = 10,
  nr_repr_docs = 50,
  nr_samples = 500,
  nr_candidate_words = 100
)
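For illustration, a call with the defaults written out might look like the following. Here fitted_model, docs, doc_embeddings and embedder are placeholders for objects created earlier in a BertopicR pipeline (for example with bt_fit_model()):

# Hypothetical call; the input objects are assumed to exist already
keybert_representation <- bt_representation_keybert(
  fitted_model = fitted_model,
  documents = docs,
  document_embeddings = doc_embeddings,
  embedding_model = embedder,
  top_n_words = 10,
  nr_repr_docs = 50,
  nr_samples = 500,
  nr_candidate_words = 100
)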
Arguments
- fitted_model
Output of bt_fit_model() or another BERTopic topic model. The model must have been fitted to data.
- documents
The documents to which the fitted_model was fitted.
- document_embeddings
The embeddings used to fit the model. These should have the same dimensions as those produced by the embedder you pass as embedding_model.
- embedding_model
The model used to create the document embeddings passed above. It is also used to create the word embeddings that are compared to topic embeddings using cosine similarity.
- top_n_words
Number of keywords/phrases to extract per topic.
- nr_repr_docs
Number of representative documents used to create each topic embedding.
- nr_samples
Number of documents sampled from each topic, from which the representative documents are selected.
- nr_candidate_words
Number of candidate words to examine per topic.
Details
KeyBERT is a Python package for extracting keywords from documents. This representation works through the following steps (a sketch of steps 3 and 4 appears after the list):
1. Representative documents (nr_repr_docs) are selected for each topic based on the cosine similarity between each document's c-TF-IDF representation and that of its topic. This is done by sampling nr_samples documents from each topic, calculating their c-TF-IDF scores, and keeping the top nr_repr_docs.
2. Candidate words (nr_candidate_words) are selected for each topic based on their c-TF-IDF scores for that topic.
3. A topic embedding is created by averaging the embeddings of the topic's representative documents; it is then compared, using cosine similarity, with the candidate word embeddings to give a similarity score for each word and topic.
4. The top_n_words with the highest cosine similarity to a topic are used to represent that topic.
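To make steps 3 and 4 concrete, here is a minimal R sketch of the scoring logic, not BertopicR's internal code. It assumes repr_doc_embeddings is a matrix of embeddings for one topic's representative documents, and word_embeddings is a matrix of embeddings for that topic's candidate words with the words as row names (both object names are hypothetical):

# Cosine similarity between two vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Step 3: the topic embedding is the average of the representative
# document embeddings
topic_embedding <- colMeans(repr_doc_embeddings)

# Score each candidate word by its cosine similarity to the topic embedding
scores <- apply(word_embeddings, 1, cosine_sim, b = topic_embedding)

# Step 4: keep the top_n_words most similar words as the representation
top_n_words <- 10
representation <- names(sort(scores, decreasing = TRUE))[seq_len(top_n_words)]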