Extract Top N Frequent Words from Text using Eli5 and CountVectorizer

Code introduction


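The function takes a text string and an integer n and returns the n most frequently occurring words in the text, most frequent first. Words are the whitespace-separated tokens of the text, so matching is case-sensitive and punctuation stays attached to a word (for example, "cat" and "cat," are counted separately). A scikit-learn CountVectorizer is then restricted to those top n words.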


Technology Stack : Eli5, scikit-learn (CountVectorizer), NumPy

Code Type : Function

Code Difficulty : Intermediate


                
                    
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Eli5 is not needed for counting or display, so only scikit-learn and
# NumPy are imported here.


def random_word_frequency(text, n=5):
    # Split the text into whitespace-separated words
    words = text.split()
    if not words or n <= 0:
        return []  # CountVectorizer rejects an empty vocabulary

    # Count every word in one pass and keep the n most frequent;
    # ties keep the order in which the words first appear
    top_n_words = [word for word, _ in Counter(words).most_common(n)]

    # Restrict the vectorizer to the top n words. lowercase=False and the
    # \S+ token pattern make its tokenization match text.split().
    vectorizer = CountVectorizer(
        vocabulary=top_n_words, lowercase=False, token_pattern=r"\S+"
    )

    # Count the top n words in the original text (a fixed vocabulary
    # needs no fitting)
    X = vectorizer.transform([text])
    counts = np.ravel(X.toarray())

    # Display the top n words and their frequencies
    for word, count in zip(top_n_words, counts):
        print(f"{word}\t{count}")

    return top_n_words