Text Tokenization, Stopword Removal, and Lemmatization in Python

  • Share this:

Code introduction


This function first tokenizes the input text into words using nltk's word_tokenize method, then removes stopwords, and finally lemmatizes the remaining words.


Technology Stack : nltk library's word_tokenize, stopwords, WordNetLemmatizer

Code Type : Text processing

Code Difficulty : Intermediate


                
                    
import random
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def remove_stopwords_and_lemmatize(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.isalnum() and word.lower() not in stop_words]
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    return ' '.join(lemmatized_words)