Text Preprocessing: Stopword Removal and Lemmatization

Code introduction


This function tokenizes a text string, discards non-alphabetic tokens (punctuation, numbers) and English stopwords, lemmatizes the remaining words, and returns them as a list.


Technology Stack : NLTK (Natural Language Toolkit), word tokenization, stopword removal, lemmatization

Code Type : Function

Code Difficulty : Intermediate


                
                    
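import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time download of the NLTK data used below
# (recent NLTK releases may also need 'punkt_tab' for word_tokenize)
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

def remove_stopwords_and_lemmatize(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Compare in lowercase: NLTK's English stopword list is lowercase,
    # so a capitalized "The" would otherwise slip past the filter
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words
                      if word.isalpha() and word.lower() not in stop_words]
    # Lemmatize the remaining words (the default part of speech is noun)
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    return lemmatized_words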
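
The order of the steps matters: NLTK's English stopword list is lowercase, so tokens should be lowercased before they are compared against it. The sketch below illustrates the same tokenize, lowercase, filter, lemmatize pipeline without any NLTK data. TOY_STOPWORDS, TOY_LEMMAS, and toy_preprocess are hypothetical stand-ins invented for this illustration, not part of NLTK, and the regular expression only roughly approximates word_tokenize combined with isalpha().

```python
import re

# Hypothetical stand-ins so the example runs without downloading NLTK data
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "on", "in", "and", "of"}
TOY_LEMMAS = {"cats": "cat", "mats": "mat", "dogs": "dog"}

def toy_preprocess(text):
    # Keep alphabetic runs only, roughly mirroring the isalpha() filter
    tokens = re.findall(r"[A-Za-z]+", text)
    # Lowercase before filtering so "The" matches the lowercase stopword "the"
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in TOY_STOPWORDS]
    # Map each word to its base form; unknown words pass through unchanged
    return [TOY_LEMMAS.get(token, token) for token in kept]

result = toy_preprocess("The cats are sitting on the mats.")
print(result)
```

In this toy version "sitting" passes through unchanged because it has no entry in the lookup table. With the real WordNetLemmatizer, lemmatize() treats words as nouns by default, so verb forms are generally only reduced when a part-of-speech argument such as pos='v' is passed.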

# JSON Explanation