Topic Modelling without Optimisation • BertopicR

This vignette will show you how to get a topic model up and running using BertopicR relatively quickly. It should be noted that this approach significantly simplifies and generalises certain steps, and will rarely produce an optimised topic model. To get the most out of your topic modelling, you should refer to the Interacting with Individual Modules vignette.

Preparing the Data

library(BertopicR)
library(dplyr)
library(stringr)
library(tidyr)

First we should load the data to which we would like to fit the model.

sentences <- stringr::sentences

Compiling the Model

If you have read the Modular Approach vignette, you will have seen that we specified each individual component of our topic model (embedding_model, ctfidf_model etc.) and fed those to bt_compile_model. If we wish, we can use entirely default parameters (or a combination of default parameters and specified components) with the same function.

model <- bt_compile_model()
#> 
#> No embedding model provided, defaulting to 'all-miniLM-L6-v2' model as embedder.
#> 
#> No reduction_model provided, using default 'bt_reducer_umap' parameters.
#> 
#> No clustering model provided, using hdbscan with default parameters.
#> 
#> No vectorising model provided, creating model with default parameters
#> 
#> No ctfidf model provided, creating model with default parameters
#> 
#> Model built

Fitting the Model

Now that we have created a model that uses all default parameters, we can simply use the bt_fit_model function to fit the model to our sentences data. It is important to note that as we have not created document embeddings or reduced those embeddings, this will be done internally which can be quite a time consuming process if you choose to run the topic modelling process multiple times.

NOTE: The bertopic model you are working with is a pointer to a python object at a point in memory. This means that the input and the output model cannot be differentiated between without explicitly saving the model before performing this operation. We do not need to specify an output to the bt_fit_model function as the function changes the input model in place. See the Note under the Fit the Model section in the Interacting with Individual Modules vignette for more detail.

bt_fit_model(model, sentences)
#> UMAP(low_memory=False, min_dist=0.0, n_components=5, random_state=42, verbose=True)
#> Mon Mar 31 10:24:08 2025 Construct fuzzy simplicial set
#> Mon Mar 31 10:24:08 2025 Finding Nearest Neighbors
#> Mon Mar 31 10:24:18 2025 Finished Nearest Neighbor Search
#> Mon Mar 31 10:24:20 2025 Construct embedding
#> Mon Mar 31 10:24:22 2025 Finished embedding
#> 
#> The input model has been fitted to the data and updated accordingly

model$get_topic_info() %>% select(-Representative_Docs, - Representation)
#>    Topic Count                                Name
#> 1     -1   290             -1_tall_wall_broke_door
#> 2      0    67               0_tea_corn_ripe_taste
#> 3      1    57                 1_ink_box_pack_cord
#> 4      2    54             2_store_pearl_ring_bank
#> 5      3    52               3_sun_wind_leaves_sky
#> 6      4    38               4_men_words_laugh_fun
#> 7      5    30              5_smell_soap_dust_wood
#> 8      6    29            6_book_wrote_seven_write
#> 9      7    18              7_fell_storm_river_log
#> 10     8    18                8_cat_mouse_dog_cats
#> 11     9    16              9_ship_loss_boat_shore
#> 12    10    15            10_strong_march_end_just
#> 13    11    14         11_score_play_player_struck
#> 14    12    11      12_school_child_children_young
#> 15    13    11 13_music_talked_straight home_swing

That’s it, you have a topic model up and running! If you decided that you wanted to adjust factors, like the minimum size of a topic, or the number of topics you want, you should refer to the Interacting with Individual Modules vignette. You can also refer to the Manipulating the Model vignette to see how you can interpret the topics and reduce the number of outliers identified (if using hdbscan (default) clustering).