ML: Natural Language Processing (NLP) and Transformers

Introduction to NLP

Natural Language Processing (NLP) is the art of making computers understand human language—because clearly, humans struggle with that too. From chatbots that pretend to care about your problems to spam filters that fail miserably, NLP is everywhere. Common applications include sentiment analysis (reading between the lines of your passive-aggressive texts), machine translation (so you can pretend you speak French), and text summarization (because reading is hard). Of course, NLP isn’t perfect—ambiguity, context understanding, and polysemy make sure of that. But hey, we try.

Tokenization, Stopword Removal, and Stemming/Lemmatization

Tokenization

Tokenization is like cutting a cake into pieces—except instead of cake, it’s text, and instead of joy, it’s preprocessing. Words or subwords are split so machines can digest them without choking.
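
A minimal sketch using NLTK (one option among several; spaCy and Hugging Face tokenizers do the same job), assuming the punkt tokenizer data has been downloaded once:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

tokens = word_tokenize("NLP lets machines chew on text, one token at a time.")
print(tokens)  # ['NLP', 'lets', 'machines', 'chew', 'on', 'text', ',', 'one', ...]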

Stopword Removal

Imagine your friend telling a story but skipping all the boring words. Stopword removal does that. Words like “the,” “is,” and “and” usually carry little signal for downstream tasks (just like that one coworker in every meeting), so they get dropped before modeling.
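
A quick sketch using NLTK's built-in English stopword list (downloaded once, like the tokenizer data):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The movie is long and the plot is thin")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['movie', 'long', 'plot', 'thin']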

Stemming vs. Lemmatization

Stemming chops words down to their roots with crude suffix rules, like a lazy butcher (e.g., “running” becomes “run,” but “studies” becomes the non-word “studi”). Lemmatization, the overachiever, uses a vocabulary and the word’s part of speech to return a proper dictionary form (e.g., “better” turns into “good”). Here’s a quick Python example:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # the lemmatizer needs the WordNet corpus
ps = PorterStemmer()
lemma = WordNetLemmatizer()
print(ps.stem("running"))                   # rule-based chop -> "run"
print(lemma.lemmatize("running", pos="v"))  # dictionary lookup -> "run"
print(lemma.lemmatize("better", pos="a"))   # -> "good", the overachiever at work

Word Embeddings: Word2Vec, GloVe, FastText

Word2Vec

Word2Vec is like teaching a machine word relationships through gossip: a word is judged by the company it keeps. Trained with either the Skip-gram or CBOW objective, it learns that “king” is to “queen” as “man” is to “woman.”
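
A minimal training sketch, assuming the gensim library and a toy corpus (far too small for the analogy to actually hold, but enough to show the API):

from gensim.models import Word2Vec

# Each "document" is a list of tokens; a real corpus would have millions of them.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: Skip-gram, sg=0: CBOW

# The famous analogy: king - man + woman should land near queen (on a real corpus).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))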

GloVe

GloVe (Global Vectors) builds word embeddings from global co-occurrence statistics: it counts how often words appear together across the whole corpus and factorizes those counts into dense vectors. It turns words into numerical representations, proving that words can, in fact, be reduced to numbers.
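
GloVe vectors are usually consumed pretrained rather than trained from scratch; one convenient route, assuming gensim’s downloader and its hosted “glove-wiki-gigaword-100” vectors, looks like this:

import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword; downloaded on first use.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("coffee", topn=3))
print(glove.similarity("king", "queen"))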

FastText

FastText is Word2Vec but with superpowers. It represents each word as a bag of character n-grams (subwords), so it can build vectors for rare words, unseen words, and misspellings. So next time someone types “thier” instead of “their,” FastText has their back.
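
A small sketch, again assuming gensim; the point is that a word the model never saw still gets a vector assembled from its character n-grams:

from gensim.models import FastText

sentences = [
    ["their", "dog", "chased", "their", "cat"],
    ["the", "cat", "ignored", "the", "dog"],
]

# min_n / max_n set the character n-gram lengths used to build subword vectors.
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# "thier" is not in the training data, but its n-grams overlap with "their",
# so FastText can still produce a vector and a (toy-corpus) similarity score.
print(model.wv.similarity("their", "thier"))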

Sentiment Analysis using LSTMs and GRUs

Why RNNs Suck (But We Use Them Anyway)

Recurrent Neural Networks (RNNs) process sequences one step at a time, but vanishing gradients give them “short-term memory loss”: anything more than a few steps back fades away, like a goldfish. Enter LSTMs and GRUs, whose gating mechanisms let them hold on to important information for longer than five seconds.

Building a Sentiment Analysis Model

Let’s train a model to detect if movie reviews are positive or negative—because obviously, reading the review is too much work.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, SpatialDropout1D

model = Sequential([
    Embedding(5000, 128, input_length=100),         # top 5,000 words -> 128-dim vectors
    SpatialDropout1D(0.2),                          # drop whole embedding channels, not single units
    LSTM(196, dropout=0.2, recurrent_dropout=0.2),  # the part that actually remembers things
    Dense(1, activation='sigmoid')                  # one output: positive vs. negative
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
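
To actually train it, here is a minimal sketch using the IMDb review dataset that ships with Keras; the vocabulary size and sequence length match the model above, while the epoch and batch-size values are just illustrative defaults.

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Reviews arrive pre-tokenized as integer word indices; keep the 5,000 most frequent words.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)
x_train = pad_sequences(x_train, maxlen=100)  # pad/truncate every review to 100 tokens
x_test = pad_sequences(x_test, maxlen=100)

model.fit(x_train, y_train, validation_split=0.2, epochs=3, batch_size=64)
print(model.evaluate(x_test, y_test))  # [test loss, test accuracy]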

Transformer-Based Models (BERT, GPT, T5)

Transformers: Because RNNs Were a Mistake

Unlike RNNs, Transformers don’t process text sequentially like a slow-moving queue at Starbucks. They look at all tokens in parallel and use self-attention to let every token weigh every other token in the sentence, which makes training faster and long-range context easier to capture.
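
At its core, self-attention is scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. Here is a toy NumPy sketch of a single head with random projection matrices (no batching, masking, or multi-head machinery), just to show the shapes and the mechanics:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # every token scores every other token
    weights = softmax(scores, axis=-1)       # rows sum to 1: how much attention to pay
    return weights @ V                       # context-aware token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)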

BERT: The Context King

BERT reads text bidirectionally (like reading the entire sentence instead of just the first few words and guessing), and it is pretrained by filling in masked-out words. This makes it fantastic for understanding-heavy NLP tasks like text classification and question answering.
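
For a quick taste of that bidirectionality, a fill-mask pipeline (here with the bert-base-uncased checkpoint as an example) predicts the blanked-out word from the context on both sides:

from transformers import pipeline

# Masked language modeling: the words on both sides of [MASK] decide what fills the blank.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The movie was absolutely [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))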

GPT: The Chatbot Overlord

GPT is autoregressive, meaning it predicts the next word based on previous words—kind of like autocomplete on steroids. It’s the reason AI chatbots seem a little too real.
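
A minimal sketch with the small gpt2 checkpoint shows the autoregressive loop in action (the sampled continuation will differ from run to run):

from transformers import pipeline

# Each new token is predicted from everything generated so far.
generator = pipeline("text-generation", model="gpt2")
print(generator("The secret to a good movie review is", max_new_tokens=30)[0]["generated_text"])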

T5: The Overachiever

T5 reframes every NLP task as a text-to-text problem. Translation? Text-to-text. Summarization? Text-to-text. Making your boss believe you actually worked today? Probably text-to-text.
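
A quick sketch with the t5-small checkpoint, where the task really is just a prefix on the input string:

from transformers import pipeline

# One model, many tasks: the task is encoded as a plain-text prefix.
t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The meeting could have been an email."))
print(t5("summarize: The quarterly meeting covered revenue, hiring, and the coffee "
         "budget, but the only decision anyone remembers is switching coffee suppliers."))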

# The same pipeline API handles sentiment classification with a fine-tuned DistilBERT checkpoint.
from transformers import pipeline
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I absolutely love this movie!"))  # [{'label': 'POSITIVE', 'score': ...}]

Hands-On Exercises

Exercise 1: Text Preprocessing with NLTK and SpaCy

  • Tokenize text using NLTK and SpaCy.
  • Remove stopwords and compare stemming vs. lemmatization.

Exercise 2: Training Word Embeddings

  • Train Word2Vec embeddings on a text corpus.
  • Visualize word similarities using PCA or t-SNE.

Exercise 3: Sentiment Analysis with LSTMs

  • Train an LSTM-based sentiment classifier on the IMDb dataset.

Exercise 4: Fine-Tuning BERT for Text Classification

  • Fine-tune a BERT model on a classification dataset using Hugging Face.

Summary

  • We explored NLP techniques from tokenization to transformers.
  • Learned that RNNs have memory issues, but LSTMs and GRUs try their best.
  • Used transformer models like BERT and GPT to process text like a boss.
