A Detailed Comparative Study of NMF, LDA, and BERT-Based Topic Modeling Using Coherence, Entropy, Jaccard Similarity, and Silhouette Scores
Topic modeling plays an essential role in extracting latent structures from large text corpora, and both the choice of model and the number of topics can strongly influence the performance and interpretability of the results. In this work, I compare three widely used approaches to topic modeling: Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and a clustering approach based on Bidirectional Encoder Representations from Transformers (BERT) embeddings. The outputs of the models are evaluated using coherence, entropy, Jaccard similarity, and silhouette scores across a wide range of topic counts. The results show that NMF consistently produces the most interpretable and distinct topics, achieving the highest coherence score, with optimal performance observed at k = 15. LDA yields broader and less coherent topics, while BERT-based clustering shows low silhouette scores, indicating weak cluster separation.
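To make the evaluation pipeline concrete, the sketch below shows one plausible way to compute the four metrics for a single model and topic count, here NMF at a given k. It is a minimal illustration, not the paper's exact procedure: the toy corpus, the top-word count `topn`, the UMass coherence variant (chosen because it needs only a bag-of-words corpus), and the hard topic assignment used for the silhouette score are all assumptions for demonstration purposes.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline): fit NMF for
# a given k, then score the topics with entropy, pairwise Jaccard similarity,
# UMass coherence (via gensim), and a silhouette score on document-topic weights.
from itertools import combinations

import numpy as np
from scipy.stats import entropy
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Toy corpus standing in for the real dataset (hypothetical).
docs = [
    "stock market trading shares prices",
    "market investors stock exchange profit",
    "football match goal team league",
    "team players football season coach",
    "election vote government policy debate",
    "government policy election campaign vote",
]
k, topn = 3, 5  # number of topics and top words per topic (assumed values)

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = np.array(vec.get_feature_names_out())

nmf = NMF(n_components=k, random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-word weights

top_words = [list(vocab[row.argsort()[::-1][:topn]]) for row in H]

# Entropy: mean Shannon entropy of the topic-word distributions
# (scipy normalizes each row to a probability distribution).
mean_entropy = np.mean([entropy(row) for row in H])

# Jaccard similarity: mean overlap between top-word sets;
# lower values indicate more distinct topics.
mean_jaccard = np.mean([
    len(set(top_words[i]) & set(top_words[j]))
    / len(set(top_words[i]) | set(top_words[j]))
    for i, j in combinations(range(k), 2)
])

# Coherence: UMass variant, which requires only the bag-of-words corpus.
texts = [d.split() for d in docs]
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
coherence = CoherenceModel(topics=top_words, corpus=bow,
                           dictionary=dictionary,
                           coherence="u_mass").get_coherence()

# Silhouette: hard-assign each document to its dominant topic, then score
# the separation of those clusters in the document-topic space.
labels = W.argmax(axis=1)
sil = silhouette_score(W, labels) if len(set(labels)) > 1 else float("nan")

print(f"entropy={mean_entropy:.3f}  jaccard={mean_jaccard:.3f}  "
      f"coherence={coherence:.3f}  silhouette={sil:.3f}")
```

Sweeping this loop over a range of k values (and substituting LDA's topic-word matrix or BERT-embedding cluster labels in place of the NMF factors) yields the per-model metric curves compared in this study.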