Sentiment Analysis with NLP: A Hands-On Machine Learning Project

Introduction

Sentiment analysis is a Natural Language Processing (NLP) technique for determining the sentiment expressed in a text, for example whether a movie review is positive or negative. In this project, we will preprocess text data, train an LSTM-based deep learning model, and deploy it as a REST API for real-time sentiment analysis.


Step 1: Dataset Overview

For this project, we will use the IMDB Movie Reviews Dataset, which is publicly available on Kaggle (kaggle.com).

Dataset Description

The dataset contains 50,000 movie reviews, each labeled as positive or negative. We will split it into training and test sets for sentiment classification.

Accessing the Dataset

  1. Download the Dataset: get IMDB Dataset.csv from the Kaggle dataset page and place it in your working directory.

  2. Load the Dataset:

import pandas as pd

# Load dataset
df = pd.read_csv('IMDB Dataset.csv')

# Display first five rows
print(df.head())
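
After loading, it is worth checking that the row count and label balance match the description above. The following is a minimal sketch, assuming the CSV exposes the review and sentiment columns used later in this project. The helper name summarize_labels is hypothetical, and the four-row DataFrame only stands in for the real file.

```python
import pandas as pd

def summarize_labels(df: pd.DataFrame, label_col: str = "sentiment") -> dict:
    """Return the row count, number of missing labels, and per-class counts."""
    counts = df[label_col].value_counts()
    return {
        "rows": len(df),
        "missing_labels": int(df[label_col].isna().sum()),
        "class_counts": {str(label): int(n) for label, n in counts.items()},
    }

# Tiny stand-in for the real dataset (illustrative data only)
sample = pd.DataFrame({
    "review": ["Great film", "Terrible plot", "Loved it", "Boring"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})
summary = summarize_labels(sample)
print(summary)
```

On the real data you would call summarize_labels(df) right after pd.read_csv. A strongly skewed class count or any missing labels would be worth resolving before training.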

Step 2: Data Preprocessing

Before training our sentiment analysis model, we need to clean and preprocess the text data.

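import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # Assumption: newer NLTK releases load the tokenizer data from this resource

# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML tags such as <br />
    text = re.sub(r'[^a-z]', ' ', text)  # Replace everything except letters with spaces
    words = word_tokenize(text)
    words = [word for word in words if word not in STOP_WORDS]  # Drop stopwords
    return ' '.join(words)

df['cleaned_review'] = df['review'].apply(preprocess_text)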
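
To see what these steps do to a single review, here is a dependency-free sketch of the same pipeline. It is an approximation under stated assumptions: SMALL_STOPWORDS is a hand-picked stand-in for NLTK's English list, str.split stands in for word_tokenize, and preprocess_sketch is a hypothetical name, so the real function may produce slightly different tokens.

```python
import re

# Hand-picked stand-in for NLTK's English stopword list (assumption, not the real list)
SMALL_STOPWORDS = {"the", "was", "and", "a", "it", "i", "this", "but", "not"}

def preprocess_sketch(text: str) -> str:
    text = text.lower()                      # Convert to lowercase
    text = re.sub(r"[^a-z]", " ", text)      # Replace non-letters with spaces
    words = [w for w in text.split() if w not in SMALL_STOPWORDS]
    return " ".join(words)

example = "This movie was NOT good, 2/10!"
cleaned = preprocess_sketch(example)
print(cleaned)  # → movie good
```

Note what happened to "NOT": NLTK's English stopword list also contains negations such as "not" and "no", so stopword removal can erase exactly the words that flip a review's sentiment. If that matters for your results, one option is to remove those negations from the stopword set before filtering.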

Step 3: Splitting Data into Training and Testing Sets

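from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define features and target variable
X = df['cleaned_review']
y = df['sentiment'].map({'positive': 1, 'negative': 0})

# Split first so the tokenizer never sees the test reviews
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the tokenizer on the training text only, limiting the vocabulary to the most frequent words
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# Convert text to integer sequences, then pad or truncate each one to 100 tokens
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=100)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=100)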
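
To make the tokenization and padding step concrete, the sketch below imitates it in plain Python. It is an illustration, not the Keras implementation: build_word_index and to_padded are hypothetical helpers that mimic the Tokenizer's frequency-ranked indices (starting at 1, keeping indices below num_words) and the default behavior of pad_sequences (zeros added on the left, long sequences truncated from the front).

```python
from collections import Counter

def build_word_index(texts, num_words=None):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    index = {word: i + 1 for i, word in enumerate(ranked)}
    if num_words is not None:
        # Keep only indices strictly below num_words
        index = {word: i for word, i in index.items() if i < num_words}
    return index

def to_padded(text, word_index, maxlen):
    """Map words to indices (unknown words are dropped), then pad/truncate."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-maxlen:]                      # Keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq   # Left-pad with zeros

train_texts = ["great movie", "bad movie", "great acting"]
word_index = build_word_index(train_texts)
short = to_padded("great unknown movie", word_index, maxlen=5)
long = to_padded("bad movie great acting great movie", word_index, maxlen=4)
print(word_index)
print(short)  # → [0, 0, 0, 1, 2]
print(long)   # → [1, 4, 1, 2]
```

Two consequences follow for the real pipeline: words outside the fitted vocabulary silently disappear, and with maxlen=100 only the last 100 tokens of a long review reach the model.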

Step 4: Training a Deep Learning Model (LSTM)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

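# Define model
model = Sequential([
    Embedding(input_dim=10000, output_dim=128),  # input_dim matches the tokenizer's num_words
    LSTM(64, return_sequences=True),  # Return the full sequence so the next LSTM can consume it
    LSTM(32),
    Dense(1, activation='sigmoid')  # Output: probability that the review is positive
])

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model. The test set doubles as validation data here, so if you tune anything
# based on val_accuracy, hold out a separate validation split instead.
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=32)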
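
Accuracy during training is only part of the picture, so it can help to compute precision, recall, and F1 on the held-out reviews as well. Below is a minimal sketch in plain Python. The helper binary_metrics is hypothetical, the labels and probabilities are made-up illustrative values, and the 0.5 threshold is the same one the API uses later. In practice the probabilities would come from something like model.predict(X_test).ravel().

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Illustrative values only, not results from the IMDB model
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

In this toy example one positive review is missed and one negative review is flagged as positive, so all four metrics come out at two thirds.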

Step 5: Deploying the Model as an API using FastAPI

Once we have a trained model, we can deploy it as a REST API using FastAPI.

Create a FastAPI Service

# --- At the end of the training script: save the model and tokenizer ---
import pickle

model.save('sentiment_model.h5')
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# --- app.py: load the saved artifacts and serve predictions ---
import pickle
from fastapi import FastAPI
from pydantic import BaseModel
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained model and the fitted tokenizer
model = load_model('sentiment_model.h5')
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

app = FastAPI()

class ReviewRequest(BaseModel):
    review: str  # Expected JSON body: {"review": "..."}

@app.post("/predict")
def predict_sentiment(request: ReviewRequest):
    # For consistency with training, consider applying preprocess_text to the review first
    sequence = tokenizer.texts_to_sequences([request.review])
    padded_sequence = pad_sequences(sequence, maxlen=100)
    prediction = model.predict(padded_sequence)
    sentiment = "Positive" if prediction[0][0] > 0.5 else "Negative"
    return {"Sentiment": sentiment}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Testing the API

Save the service code as app.py and start it with python app.py. Then test the API using Postman or cURL:

curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"review": "This movie was absolutely fantastic!"}'
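
The same request can be sent from Python. The sketch below uses only the standard library and assumes the service is reachable at http://127.0.0.1:8000 and accepts the JSON body shown in the cURL command. The helper names build_predict_request and predict are hypothetical.

```python
import json
import urllib.request

def build_predict_request(review, base_url="http://127.0.0.1:8000"):
    """Build a POST request carrying the review as a JSON body."""
    payload = json.dumps({"review": review}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(review, base_url="http://127.0.0.1:8000", timeout=10.0):
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_predict_request(review, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request does not contact the server, so this runs offline
req = build_predict_request("This movie was absolutely fantastic!")
print(req.get_method(), req.full_url)
```

With the server running, calling predict("This movie was absolutely fantastic!") should return a dictionary with a Sentiment key, matching the response built in the service code.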

Conclusion

In this project, we built and deployed a sentiment analysis model using deep learning. We covered data preprocessing, model training and validation, and API deployment. The project can be extended by integrating more advanced Transformer-based models such as BERT, which may improve accuracy.