Sentiment Analysis with NLP: A Hands-On Machine Learning Project
Introduction
Sentiment analysis is a powerful Natural Language Processing (NLP) technique used to determine whether a piece of text expresses a positive or negative opinion. In this project, we will preprocess text data, train a deep learning model (an LSTM), and deploy it as a REST API for real-time sentiment analysis; transformer-based models such as BERT are discussed as an extension at the end.
Step 1: Dataset Overview
For this project, we will use the IMDB Movie Reviews Dataset, which is publicly available on Kaggle.
Dataset Description
The dataset contains 50,000 movie reviews labeled as positive or negative. This will be used as our training data for sentiment classification.
Accessing the Dataset
Download the Dataset:
- Visit the Kaggle page: IMDB Movie Reviews Dataset
- Click on the “Download” button to obtain the IMDB Dataset.csv file (or use the programmatic alternative sketched below).
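If you prefer a programmatic download, the kagglehub package offers a one-line alternative. This is a minimal sketch; the dataset slug is an assumption based on the common 50K-review upload, so verify it against the Kaggle page before use.

import kagglehub

# Download the dataset from Kaggle (slug assumed; verify on the dataset page)
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print("Dataset downloaded to:", path)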
Load the Dataset:
import pandas as pd
# Load dataset
df = pd.read_csv('IMDB Dataset.csv')
# Display first five rows
print(df.head())
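Before preprocessing, it is worth confirming that the two classes are balanced. A quick check, using the sentiment column that the rest of this project relies on:

# Verify the 50/50 split between positive and negative reviews
print(df['sentiment'].value_counts())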
Step 2: Data Preprocessing
Before training our sentiment analysis model, we need to clean and preprocess the text data.
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))  # Build the stopword set once instead of per word

def preprocess_text(text):
    text = text.lower()                     # Convert to lowercase
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace anything that is not a letter with a space
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return ' '.join(words)
df['cleaned_review'] = df['review'].apply(preprocess_text)
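As a quick sanity check, the cleaning function can be run on a single sentence before inspecting the full column:

# Example: lowercased, punctuation and digits stripped, stopwords removed
sample = "This movie was absolutely fantastic!!! 10/10"
print(preprocess_text(sample))  # movie absolutely fantastic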
Step 3: Splitting Data into Training and Testing Sets
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Define features and target variable
X = df['cleaned_review']
y = df['sentiment'].map({'positive': 1, 'negative': 0})
# Tokenize text
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
X = pad_sequences(X, maxlen=100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
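A quick shape check confirms the 80/20 split of the 50,000 reviews and the padded sequence length:

# Expect (40000, 100) for training and (10000, 100) for testing
print(X_train.shape, X_test.shape)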
Step 4: Training a Deep Learning Model (LSTM)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
# Define model
model = Sequential([
    Embedding(input_dim=10000, output_dim=128, input_length=100),
    LSTM(64, return_sequences=True),  # Return the full sequence so it can feed the next LSTM layer
    LSTM(32),
    Dense(1, activation='sigmoid')    # Single probability: positive vs. negative
])
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=32)
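After training, the model can be scored on the held-out test set for a final accuracy figure:

# Evaluate loss and accuracy on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")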
Step 5: Deploying the Model as an API using FastAPI
Once we have a trained model, we can deploy it as a REST API using FastAPI.
Create a FastAPI Service
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Save the trained model and tokenizer (run once, at the end of the training script)
model.save('sentiment_model.h5')
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# Load the model and tokenizer in the API service
model = load_model('sentiment_model.h5')
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

app = FastAPI()

class Review(BaseModel):
    review: str

@app.post("/predict")
def predict_sentiment(item: Review):
    # Apply the same tokenization and padding used during training
    sequence = tokenizer.texts_to_sequences([item.review])
    padded_sequence = pad_sequences(sequence, maxlen=100)
    prediction = model.predict(np.array(padded_sequence))
    sentiment = "Positive" if prediction[0][0] > 0.5 else "Negative"
    return {"Sentiment": sentiment}
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Testing the API
Save the above script as app.py and run it. Then, test the API using Postman or cURL.
curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"review": "This movie was absolutely fantastic!"}'
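The same request can be sent from Python, which is convenient for scripted testing. A minimal sketch, assuming the requests package is installed and the API is running locally:

import requests

# Post a review to the running API and print the predicted sentiment
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"review": "This movie was absolutely fantastic!"}
)
print(response.json())  # e.g. {'Sentiment': 'Positive'}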
Conclusion
In this project, we successfully built and deployed a sentiment analysis model using deep learning. We covered data preprocessing, model training, evaluation, and API deployment. This project can be extended by integrating more advanced transformer-based models like BERT for improved accuracy.
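As a starting point for that extension, here is a minimal sketch using the Hugging Face transformers library, whose sentiment-analysis pipeline downloads a default pre-trained model (a DistilBERT variant fine-tuned on SST-2) on first use:

from transformers import pipeline

# Load a pre-trained transformer sentiment classifier (weights download on first run)
classifier = pipeline("sentiment-analysis")

result = classifier("This movie was absolutely fantastic!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]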