ML Projects: Sentiment Analysis with NLP
Sentiment Analysis with NLP: A Hands-On Deep Learning Project
Introduction
Sentiment analysis is a popular Natural Language Processing (NLP) application that determines the sentiment expressed in a piece of text. In this project, we will preprocess text data, train an LSTM-based deep learning model, and deploy it as a REST API for real-time sentiment analysis.
Step 1: Dataset Overview
For this project, we will use the IMDB Movie Reviews dataset, available on Kaggle (kaggle.com).
Dataset Description
The dataset contains 50,000 movie reviews labeled as positive or negative. This binary classification problem makes it ideal for sentiment analysis using deep learning techniques.
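Before training a binary classifier, it helps to confirm how many rows carry each label. The following is a minimal sketch using only the standard library, assuming the CSV has the `review` and `sentiment` columns used later in this project. The helper name `label_counts` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file):
    """Count rows per sentiment label in a CSV with 'review' and 'sentiment' columns."""
    reader = csv.DictReader(csv_file)
    return Counter(row["sentiment"] for row in reader)

# Tiny in-memory stand-in for the real file (made-up rows, for illustration only)
sample = io.StringIO(
    "review,sentiment\n"
    '"Loved it, start to finish.",positive\n'
    '"Dull and far too long.",negative\n'
    '"A fine cast, wasted.",negative\n'
)
print(label_counts(sample))
```

To run the same check on the real file, open it with `open('IMDB Dataset.csv', newline='', encoding='utf-8')` and pass the file object to `label_counts`. Roughly equal counts for the two labels would indicate a balanced dataset, in which case plain accuracy is a reasonable metric.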
Accessing the Dataset
Download the Dataset:
- Visit the Kaggle page: IMDB Movie Reviews
- Download the IMDB Dataset.csv file.
Load the Dataset:
import pandas as pd
# Load dataset
df = pd.read_csv('IMDB Dataset.csv')
# Display first five rows
print(df.head())
Step 2: Text Preprocessing
Text preprocessing is crucial for NLP tasks. We will clean the text data by:
- Converting text to lowercase
- Removing punctuation and stopwords
- Tokenizing and stemming/lemmatizing words
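To see what these steps do to a single review, here is a dependency-free sketch. It is an illustration under stated assumptions, not the project's pipeline: the stopword list is a tiny hand-picked set, tokenization is a plain whitespace split, and lemmatization is left out. The name `simple_clean` is hypothetical. It also strips HTML tags, on the assumption that the raw reviews contain markup such as `<br />`.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "a", "an", "was", "is", "and", "but", "it", "this", "of"}

def simple_clean(text):
    """Lowercase, strip HTML tags and non-letters, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", "", text)   # keep only letters and whitespace
    tokens = text.split()                  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_clean("This movie was GREAT!<br />The acting, however, was flat."))
# → ['movie', 'great', 'acting', 'however', 'flat']
```

Note that the tag is replaced with a space rather than an empty string, so the words on either side of it are not glued together.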
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by newer NLTK releases
nltk.download('wordnet')
# Initialize lemmatizer and build the stopword set once (set lookups are fast)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML tags such as <br />
    text = re.sub(r'[^a-z\s]', '', text)  # Remove punctuation and digits
    tokens = word_tokenize(text)
    # Lemmatize each token and drop stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)
df['cleaned_review'] = df['review'].apply(preprocess_text)
Step 3: Preparing Data for Model Training
We will split the dataset into training and testing sets and tokenize the text for deep learning models.
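As a rough picture of what tokenizing and padding produce, the sketch below imitates the idea in plain Python: frequent words get small integer ids, unknown words are dropped, and every sequence is forced to a fixed length. The function names are hypothetical, and the details (ids starting at 1, zeros added at the front, truncation from the front) follow our understanding of the Keras defaults rather than a guaranteed specification.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Map the most frequent words to integer ids starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Keep num_words - 1 words so that every id stays below num_words
    most_common = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def encode_and_pad(text, word_index, maxlen):
    """Convert text to ids, drop unknown words, then left-truncate or left-pad to maxlen."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-maxlen:]                      # keep the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids   # pad with zeros at the front

train_texts = ["great movie great cast great", "bad movie"]
word_index = build_word_index(train_texts, num_words=3)
print(word_index)                                            # → {'great': 1, 'movie': 2}
print(encode_and_pad("great bad movie", word_index, maxlen=4))  # → [0, 0, 1, 2]
```

The vocabulary is built from the training texts only, which is why the project code fits the tokenizer on `X_train` and merely applies it to `X_test`.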
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Define features and labels
X = df['cleaned_review']
y = df['sentiment'].map({'positive': 1, 'negative': 0})
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Tokenize text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=200)
Step 4: Training an LSTM Model
We will train an LSTM model for sentiment classification.
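To get a feel for the size of the network built in this step, the trainable parameters can be estimated by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with biases, applied to the layer sizes from the model definition. The helper names are hypothetical, and the totals are an estimate to compare against `model.summary()`, not output copied from it.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

layers = {
    "embedding": embedding_params(5000, 128),
    "lstm_128": lstm_params(128, 128),
    "lstm_64": lstm_params(128, 64),
    "dense": dense_params(64, 1),
}
total = sum(layers.values())
print(layers)
print(total)  # → 821057
```

Most of the parameters sit in the embedding layer, so the vocabulary size (`num_words`) and embedding dimension are the first settings to revisit if the model needs to be smaller.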
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
# Build LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    LSTM(128, return_sequences=True),
    LSTM(64),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))
Step 5: Deploying the Model as a REST API
Once trained, we will deploy our sentiment analysis model using Flask.
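An endpoint that reads `data['text']` directly will fail with a server error if the request body is missing, is not JSON, or lacks the `text` field. One possible guard is a small validation helper, sketched below. The name `extract_text` and the error messages are hypothetical and not part of Flask.

```python
def extract_text(payload):
    """Return (text, error); exactly one of the two is None."""
    if not isinstance(payload, dict):
        return None, "Request body must be a JSON object."
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, "Field 'text' must be a non-empty string."
    return text.strip(), None

print(extract_text({"text": "  This movie was amazing!  "}))
print(extract_text({"review": "wrong field name"}))
```

Inside the route, this could be used by calling `request.get_json(silent=True)`, passing the result to `extract_text`, and returning the error message with a 400 status code when one is present.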
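# Run once at the end of the training script: save the model and tokenizer
model.save('sentiment_model.h5')
with open('tokenizer.json', 'w') as f:
    f.write(tokenizer.to_json())
# app.py
from flask import Flask, request, jsonify
import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Load the saved model and tokenizer
model = tf.keras.models.load_model('sentiment_model.h5')
with open('tokenizer.json') as f:
    tokenizer = tokenizer_from_json(f.read())
def predict_sentiment(text):
    # Apply the same cleaning used for training; preprocess_text is the
    # function from Step 2 and must be defined or imported in app.py
    sequence = tokenizer.texts_to_sequences([preprocess_text(text)])
    padded = pad_sequences(sequence, maxlen=200)
    # model.predict returns an array of shape (1, 1); extract the scalar score
    score = float(model.predict(padded)[0][0])
    return 'Positive' if score > 0.5 else 'Negative'
app = Flask(__name__)
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    text = data['text']
    result = predict_sentiment(text)
    return jsonify({'Sentiment': result})
if __name__ == '__main__':
    app.run(debug=True)  # debug mode is for local development only
Testing the API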
Run app.py and test using Postman or cURL:
curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"text": "This movie was amazing!"}'
Conclusion
This project demonstrated how to preprocess text data, train LSTMs for sentiment classification, and deploy the model as a REST API. Future improvements include using transformer-based models like BERT for better accuracy.