NLTK-Based Tokenization, Stopword Removal, and Lemmatization Function


Code introduction


The function tokenizes a sentence with NLTK's word_tokenize, filters out English stop words using the stopwords corpus, and lemmatizes the remaining tokens with WordNetLemmatizer, returning the result as a list of lemmatized words.


Technology Stack : NLTK (Natural Language Toolkit), word_tokenize, stopwords, WordNetLemmatizer

Code Type : Python Function

Code Difficulty : Intermediate


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the 'punkt', 'stopwords', and 'wordnet' NLTK data packages.

def random_nltk_function(sentence):
    # Tokenize the sentence into words
    words = word_tokenize(sentence)

    # Remove English stop words (case-insensitive)
    stop_words = set(stopwords.words('english'))
    filtered_words = [w for w in words if w.lower() not in stop_words]

    # Lemmatize the remaining words
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]

    # Return the lemmatized words
    return lemmatized_words