Topic Modeling with Gensim's LDA on Text Data

2024-12-16 12:08:46

Code introduction

This function uses gensim's LDA model to perform topic modeling on a list of tokenized texts, returning the highest-probability words for each topic.

Technology Stack : gensim, numpy

Code Type : Text analysis

Code Difficulty : Intermediate

                
                    
import gensim
import numpy as np

def topic_modeling(texts, num_topics=10, num_words=5):
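    """
    Perform topic modeling on a list of tokenized texts using gensim's LDA
    (Latent Dirichlet Allocation) model.

    `texts` must be a list of token lists (e.g. [["first", "document"], ...]),
    not raw strings. Returns, for each topic, a list of its `num_words`
    highest-probability words.
    """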
    # Create a dictionary representation of the documents.
    dictionary = gensim.corpora.Dictionary(texts)
    
    # Create a Bag-of-Words (BoW) representation of the documents.
    corpus = [dictionary.doc2bow(text) for text in texts]
    
    # Train the LDA model.
    lda_model = gensim.models.ldamodel.LdaModel(corpus,
                                                id2word=dictionary,
                                                num_topics=num_topics,
                                                random_state=100,
                                                update_every=1,
                                                passes=10,
                                                alpha='auto',
                                                per_word_topics=True)
    
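    # Print and collect the highest-probability words for each topic.
    # show_topics(formatted=False) yields (topic_id, [(word, probability), ...])
    # pairs; print_topics() returns pre-formatted strings, which cannot be
    # sliced into words.
    topic_words = []
    for idx, word_probs in lda_model.show_topics(num_topics=-1,
                                                 num_words=num_words,
                                                 formatted=False):
        words = [word for word, _ in word_probs]
        print('Topic: {} \nWords: {}'.format(idx, words))
        topic_words.append(words)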
    
    return topic_words

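# Example usage (each document must be tokenized first; Dictionary rejects raw strings):
# docs = ["This is the first document.", "This document is the second document.", "And this is the third one.",
#         "Is this the first document?", "This is the second document."]
# texts = [gensim.utils.simple_preprocess(doc) for doc in docs]
# topic_modeling(texts, num_topics=2)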
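
The dictionary and bag-of-words steps inside topic_modeling can be hard to picture, so here is a minimal standard-library sketch of the idea. The helper names build_vocabulary and to_bow are hypothetical and are not part of gensim. The id assignment shown here (order of first appearance) is an assumption made for illustration, so gensim's Dictionary may number tokens differently.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to an integer id, in order of first appearance.

    Illustrative stand-in for a token-to-id dictionary; not gensim's implementation.
    """
    vocab = {}
    for tokens in texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


# Each document is already a list of tokens, which is what the LDA pipeline expects.
texts = [
    ["this", "document", "is", "the", "second", "document"],
    ["this", "is", "the", "first", "document"],
]

vocab = build_vocabulary(texts)
corpus = [to_bow(tokens, vocab) for tokens in texts]

print(vocab)
print(corpus)
```

Each document becomes a sparse list of (token id, count) pairs, so "document" appearing twice in the first text is stored as a single pair with count 2. This is the shape of input that the LDA training step consumes.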