ML: Natural Language Processing (NLP) and Transformers

Introduction to NLP

Natural Language Processing (NLP) is the art of making computers understand human language—because clearly, humans struggle with that too. From chatbots that pretend to care about your problems to spam filters that fail miserably, NLP is everywhere. Common applications include sentiment analysis (reading between the lines of your passive-aggressive texts), machine translation (so you can pretend you speak French), and text summarization (because reading is hard). Of course, NLP isn’t perfect—ambiguity, context understanding, and polysemy make sure of that. But hey, we try.

Tokenization, Stopword Removal, and Stemming/Lemmatization

Tokenization

Tokenization is like cutting a cake into pieces—except instead of cake, it’s text, and instead of joy, it’s preprocessing. Words or subwords are split so machines can digest them without choking.
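
A minimal sketch using NLTK (one option among several; spaCy and Hugging Face tokenizers do the same job), assuming the punkt tokenizer data has been downloaded once:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

tokens = word_tokenize("NLP lets machines chew on text, one token at a time.")
print(tokens)  # ['NLP', 'lets', 'machines', 'chew', 'on', 'text', ',', 'one', ...]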

Stopword Removal

Imagine your friend telling a story but skipping all the boring words. Stopword removal does that. Words like “the,” “is,” and “and” usually carry little signal for downstream tasks (just like that one coworker in every meeting), so they get dropped before modeling.
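
A quick sketch using NLTK's built-in English stopword list (downloaded once, like the tokenizer data):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The movie is long and the plot is thin")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['movie', 'long', 'plot', 'thin']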

Stemming vs. Lemmatization

Stemming chops words down to their roots with crude suffix rules, like a lazy butcher (e.g., “running” becomes “run,” but “studies” becomes the non-word “studi”). Lemmatization, the overachiever, uses a vocabulary and the word’s part of speech to return a proper dictionary form (e.g., “better” turns into “good”). Here’s a quick Python example:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # the lemmatizer needs the WordNet corpus
ps = PorterStemmer()
lemma = WordNetLemmatizer()
print(ps.stem("running"))                   # rule-based chop -> "run"
print(lemma.lemmatize("running", pos="v"))  # dictionary lookup -> "run"
print(lemma.lemmatize("better", pos="a"))   # -> "good", the overachiever at work

Word Embeddings: Word2Vec, GloVe, FastText

Word2Vec

Word2Vec is like teaching a machine word relationships through gossip: a word is judged by the company it keeps. Trained with either the Skip-gram or CBOW objective, it learns that “king” is to “queen” as “man” is to “woman.”
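
A minimal training sketch, assuming the gensim library and a toy corpus (far too small for the analogy to actually hold, but enough to show the API):

from gensim.models import Word2Vec

# Each "document" is a list of tokens; a real corpus would have millions of them.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: Skip-gram, sg=0: CBOW

# The famous analogy: king - man + woman should land near queen (on a real corpus).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))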

GloVe

GloVe (Global Vectors) builds word embeddings from global co-occurrence statistics: it counts how often words appear together across the whole corpus and factorizes those counts into dense vectors. It turns words into numerical representations, proving that words can, in fact, be reduced to numbers.
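
GloVe vectors are usually consumed pretrained rather than trained from scratch; one convenient route, assuming gensim’s downloader and its hosted “glove-wiki-gigaword-100” vectors, looks like this:

import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword; downloaded on first use.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("coffee", topn=3))
print(glove.similarity("king", "queen"))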

FastText

FastText is Word2Vec but with superpowers. It represents each word as a bag of character n-grams (subwords), so it can build vectors for rare words, unseen words, and misspellings. So next time someone types “thier” instead of “their,” FastText has their back.
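
A small sketch, again assuming gensim; the point is that a word the model never saw still gets a vector assembled from its character n-grams:

from gensim.models import FastText

sentences = [
    ["their", "dog", "chased", "their", "cat"],
    ["the", "cat", "ignored", "the", "dog"],
]

# min_n / max_n set the character n-gram lengths used to build subword vectors.
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# "thier" is not in the training data, but its n-grams overlap with "their",
# so FastText can still produce a vector and a (toy-corpus) similarity score.
print(model.wv.similarity("their", "thier"))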

Sentiment Analysis using LSTMs and GRUs

Why RNNs Suck (But We Use Them Anyway)

Recurrent Neural Networks (RNNs) process sequences one step at a time, but vanishing gradients give them “short-term memory loss”: anything more than a few steps back fades away, like a goldfish. Enter LSTMs and GRUs, whose gating mechanisms let them hold on to important information for longer than five seconds.

Building a Sentiment Analysis Model

Let’s train a model to detect if movie reviews are positive or negative—because obviously, reading the review is too much work.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, SpatialDropout1D

model = Sequential([
    Embedding(5000, 128, input_length=100),         # top 5,000 words -> 128-dim vectors
    SpatialDropout1D(0.2),                          # drop whole embedding channels, not single units
    LSTM(196, dropout=0.2, recurrent_dropout=0.2),  # the part that actually remembers things
    Dense(1, activation='sigmoid')                  # one output: positive vs. negative
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
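
To actually train it, here is a minimal sketch using the IMDb review dataset that ships with Keras; the vocabulary size and sequence length match the model above, while the epoch and batch-size values are just illustrative defaults.

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Reviews arrive pre-tokenized as integer word indices; keep the 5,000 most frequent words.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)
x_train = pad_sequences(x_train, maxlen=100)  # pad/truncate every review to 100 tokens
x_test = pad_sequences(x_test, maxlen=100)

model.fit(x_train, y_train, validation_split=0.2, epochs=3, batch_size=64)
print(model.evaluate(x_test, y_test))  # [test loss, test accuracy]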

Transformer-Based Models (BERT, GPT, T5)

Transformers: Because RNNs Were a Mistake

Unlike RNNs, Transformers don’t process text sequentially like a slow-moving queue at Starbucks. They look at all tokens in parallel and use self-attention to let every token weigh every other token in the sentence, which makes training faster and long-range context easier to capture.
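
At its core, self-attention is scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. Here is a toy NumPy sketch of a single head with random projection matrices (no batching, masking, or multi-head machinery), just to show the shapes and the mechanics:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # every token scores every other token
    weights = softmax(scores, axis=-1)       # rows sum to 1: how much attention to pay
    return weights @ V                       # context-aware token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)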

BERT: The Context King

BERT reads text bidirectionally (like reading the entire sentence instead of just the first few words and guessing), and it is pretrained by filling in masked-out words. This makes it fantastic for understanding-heavy NLP tasks like text classification and question answering.
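
For a quick taste of that bidirectionality, a fill-mask pipeline (here with the bert-base-uncased checkpoint as an example) predicts the blanked-out word from the context on both sides:

from transformers import pipeline

# Masked language modeling: the words on both sides of [MASK] decide what fills the blank.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The movie was absolutely [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))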

GPT: The Chatbot Overlord

GPT is autoregressive, meaning it predicts the next word based on previous words—kind of like autocomplete on steroids. It’s the reason AI chatbots seem a little too real.
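
A minimal sketch with the small gpt2 checkpoint shows the autoregressive loop in action (the sampled continuation will differ from run to run):

from transformers import pipeline

# Each new token is predicted from everything generated so far.
generator = pipeline("text-generation", model="gpt2")
print(generator("The secret to a good movie review is", max_new_tokens=30)[0]["generated_text"])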

T5: The Overachiever

T5 reframes every NLP task as a text-to-text problem. Translation? Text-to-text. Summarization? Text-to-text. Making your boss believe you actually worked today? Probably text-to-text.
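
A quick sketch with the t5-small checkpoint, where the task really is just a prefix on the input string:

from transformers import pipeline

# One model, many tasks: the task is encoded as a plain-text prefix.
t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The meeting could have been an email."))
print(t5("summarize: The quarterly meeting covered revenue, hiring, and the coffee "
         "budget, but the only decision anyone remembers is switching coffee suppliers."))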

# The same pipeline API handles sentiment classification with a fine-tuned DistilBERT checkpoint.
from transformers import pipeline
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I absolutely love this movie!"))  # [{'label': 'POSITIVE', 'score': ...}]

Hands-On Exercises

Exercise 1: Text Preprocessing with NLTK and SpaCy

  • Tokenize text using NLTK and SpaCy.
  • Remove stopwords and compare stemming vs. lemmatization.

Exercise 2: Training Word Embeddings

  • Train Word2Vec embeddings on a text corpus.
  • Visualize word similarities using PCA or t-SNE.

Exercise 3: Sentiment Analysis with LSTMs

  • Train an LSTM-based sentiment classifier on the IMDb dataset.

Exercise 4: Fine-Tuning BERT for Text Classification

  • Fine-tune a BERT model on a classification dataset using Hugging Face.

Summary

  • We explored NLP techniques from tokenization to transformers.
  • Learned that RNNs have memory issues, but LSTMs and GRUs try their best.
  • Used transformer models like BERT and GPT to process text like a boss.
