Topic Modeling with BERT
The key steps in BERTopic modeling are as follows.
- Use a sentence-embedding model to embed the sentences of each article
- Reduce the dimensionality of the embeddings using UMAP
- Cluster the documents (in the reduced-dimension space) using HDBSCAN
- Use c-TF-IDF to extract keywords for each cluster, based on their frequency within the cluster and their IDF.
- MMR (Maximal Marginal Relevance): select a small, diverse set of words that best represents each topic.
- Intertopic Distance Map
- Use a similarity matrix (heatmap) and a dendrogram (hierarchical map) to visualize the topics and their keywords.
- Track the traction of each topic over time. Some topics may become irrelevant, while the traction of others may be increasing or decreasing.
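The c-TF-IDF and MMR steps above can be illustrated with a minimal pure-Python sketch. It is not BERTopic's own code. It assumes the commonly documented c-TF-IDF weighting, the term frequency within a cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. The function names, the toy clusters, and the two-dimensional word vectors are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens):
    """Score every term per cluster, treating each cluster as one document."""
    counts = {c: Counter(tokens) for c, tokens in cluster_tokens.items()}
    total = Counter()  # frequency of each term across all clusters
    for counter in counts.values():
        total.update(counter)
    avg_words = sum(total.values()) / len(counts)  # A: average words per cluster

    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, word_vecs, top_n=3, diversity=0.3):
    """Greedily pick words that are relevant to the topic but not redundant."""
    candidates = list(word_vecs)
    selected = []
    while candidates and len(selected) < top_n:

        def score(word):
            relevance = cosine(word_vecs[word], topic_vec)
            # Highest similarity to any word already chosen (0 if none yet).
            redundancy = max(
                (cosine(word_vecs[word], word_vecs[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (hypothetical): two clusters of tokens and three word vectors.
clusters = {
    "sports": ["game", "team", "win", "team", "the"],
    "finance": ["bank", "rate", "the", "rate", "market"],
}
scores = c_tf_idf(clusters)
print(sorted(scores["sports"], key=scores["sports"].get, reverse=True))

topic_vec = [1.0, 0.0]
word_vecs = {"bank": [1.0, 0.1], "banks": [1.0, 0.12], "loan": [0.7, 0.7]}
print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.0))
print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.7))
```

In this toy data, "team" ranks highest for the sports cluster, while "the", which appears in both clusters, ranks lowest. With `diversity=0.0` MMR keeps the near-duplicates "bank" and "banks", while a higher diversity swaps "banks" for the less redundant "loan".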
# Installation, with sentence-transformers, can be done using PyPI:
pip install bertopic

# If you want to install BERTopic with other embedding models, you can choose one of the following:
# Choose an embedding backend
pip install bertopic[flair,gensim,spacy,use]

# Topic modeling with images
pip install bertopic[vision]
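Once the package is installed, the steps listed earlier can be wired together explicitly. The following is a sketch under stated assumptions rather than a recommended configuration: the embedding model name `all-MiniLM-L6-v2`, the UMAP and HDBSCAN settings, and the helper names `build_topic_model` and `explore_topics` are illustrative choices, not something the article prescribes. The imports sit inside the functions so that the heavy dependencies are only loaded when the helpers are actually called.

```python
def build_topic_model(min_cluster_size=15, diversity=0.3):
    """Assemble a BERTopic pipeline with each step made explicit."""
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        # Step 1: sentence embeddings (model name is an assumption).
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        # Step 2: dimensionality reduction.
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        # Step 3: clustering in the reduced space.
        hdbscan_model=HDBSCAN(
            min_cluster_size=min_cluster_size,
            metric="euclidean",
            cluster_selection_method="eom",
            prediction_data=True,
        ),
        # Step 4: tokenization and c-TF-IDF keyword extraction.
        vectorizer_model=CountVectorizer(stop_words="english"),
        ctfidf_model=ClassTfidfTransformer(),
        # Step 5: MMR to diversify the keywords of each topic.
        representation_model=MaximalMarginalRelevance(diversity=diversity),
    )


def explore_topics(docs, timestamps):
    """Fit the model on docs and build the visualizations from the step list."""
    topic_model = build_topic_model()
    topics, probs = topic_model.fit_transform(docs)
    over_time = topic_model.topics_over_time(docs, timestamps)
    figures = {
        "intertopic_distance": topic_model.visualize_topics(),
        "heatmap": topic_model.visualize_heatmap(),
        "hierarchy": topic_model.visualize_hierarchy(),
        "over_time": topic_model.visualize_topics_over_time(over_time),
    }
    return topic_model, topics, figures
```

Calling `explore_topics(docs, timestamps)` with a list of document strings and a matching list of timestamps would fit the model and return the figures. Fitting downloads the embedding model on first use and can take a while on large corpora.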
Supported Topic Modeling Techniques
BERTopic supports a range of topic modeling techniques, including the following.
- Multi-topic distributions
- Text Generation/LLM
- Merge Models
- Advanced Topic Modeling with BERTopic by Pinecone
- BERTopic by spaCy
- BERTopic on GitHub
- BERTopic by Hugging Face
Tools in BERTopic
Best Topic Modeling Tool in BERTopic
BERTopic Model Building
- arXiv Dataset (1.7M+ STEM papers)
- Historical Documents
- News articles