ML Projects: Sentiment Analysis with NLP

Sentiment Analysis with NLP: A Hands-On Deep Learning Project

Introduction

Sentiment analysis is a popular Natural Language Processing (NLP) application used to determine the sentiment behind a piece of text. In this project, we will preprocess text data, train an LSTM-based deep learning model, and deploy it as a REST API for real-time sentiment analysis.


Step 1: Dataset Overview

For this project, we will use the IMDB Movie Reviews dataset, available on Kaggle (kaggle.com).

Dataset Description

The dataset contains 50,000 movie reviews labeled as positive or negative. Because every review carries one of two labels, the task is a binary classification problem, which makes it well suited to sentiment analysis with deep learning techniques.

Accessing the Dataset

  1. Download the Dataset: get 'IMDB Dataset.csv' from Kaggle and place it in your working directory.

  2. Load the Dataset:

import pandas as pd

# Load dataset
df = pd.read_csv('IMDB Dataset.csv')

# Display first five rows
print(df.head())
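
Before preprocessing, it can help to confirm that the two classes are roughly balanced. The following is a minimal sketch using only the standard library; label_balance is a hypothetical helper and sample_labels is a made-up stand-in for the real label column. With the loaded DataFrame you would pass df['sentiment'] instead.

```python
from collections import Counter

def label_balance(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Made-up stand-in for df['sentiment'] (assumption, not real data)
sample_labels = ['positive', 'negative', 'positive', 'negative']

balance = label_balance(sample_labels)
for label, (n, share) in balance.items():
    print(f'{label}: {n} reviews ({share:.0%})')
```

A heavily skewed split would be a reason to look at metrics other than accuracy later on.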

Step 2: Text Preprocessing

Text preprocessing is crucial for NLP tasks. We will clean the text data by:

  • Converting text to lowercase
  • Removing punctuation and stopwords
  • Tokenizing and lemmatizing words

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may also need 'punkt_tab'
nltk.download('wordnet')

# Initialize the lemmatizer and build the stopword set once;
# calling stopwords.words() for every token is very slow on 50,000 reviews
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML tags such as <br />
    text = re.sub(r'[^a-z\s]', '', text)  # Remove punctuation and digits
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['cleaned_review'] = df['review'].apply(preprocess_text)
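
To see what the cleaning steps do to a single review, here is a dependency-free sketch. It is an approximation, not the NLTK pipeline: STOPWORDS is a tiny hypothetical list standing in for stopwords.words('english'), whitespace splitting replaces word_tokenize, lemmatization is omitted, and the HTML tag stripping is an assumption added because raw reviews may contain markup such as <br />.

```python
import re

# Tiny hypothetical stopword list (assumption; NLTK's list is much longer)
STOPWORDS = {'the', 'was', 'a', 'and', 'it', 'is', 'this', 'i'}

def simple_clean(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # replace HTML tags with a space
    text = re.sub(r'[^a-z\s]', '', text)   # keep only letters and whitespace
    tokens = text.split()                  # whitespace split, not word_tokenize
    return ' '.join(t for t in tokens if t not in STOPWORDS)

cleaned = simple_clean('This movie was GREAT!<br />I loved it.')
print(cleaned)  # movie great loved
```

Note that removing stopwords also removes negations such as "not" in NLTK's English list, which can flip the apparent sentiment of a phrase; whether that matters is worth checking on your own data.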

Step 3: Preparing Data for Model Training

We will split the dataset into training and testing sets and tokenize the text for deep learning models.

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define features and labels
X = df['cleaned_review']
y = df['sentiment'].map({'positive': 1, 'negative': 0})

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenize text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=200)
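
The tokenizing and padding step can be hard to picture, so here is a small pure-Python sketch of the idea. It mimics what Tokenizer(num_words=...) and pad_sequences do under their default settings as I understand them (frequency-ranked indices starting at 1, unknown or out-of-range words dropped, zeros added at the front, truncation from the front). The helper names are hypothetical and the real Keras classes handle more cases.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so tied words keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

train_texts = ['great movie great cast', 'boring movie']
word_index = build_word_index(train_texts)
print(word_index)  # {'great': 1, 'movie': 2, 'cast': 3, 'boring': 4}

seq = to_sequence('boring movie with great cast', word_index, num_words=4)
print(seq)              # [2, 1, 3]
print(pad_pre(seq, 5))  # [0, 0, 2, 1, 3]
```

Index 0 is reserved for padding, which is why word indices start at 1. Fitting the index on the training texts only, as the code above does, keeps test-set vocabulary from leaking into training.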

Step 4: Training an LSTM Model

We will train an LSTM model for sentiment classification.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Build LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128),  # input_dim matches Tokenizer(num_words=5000); input_length is deprecated in recent Keras
    LSTM(128, return_sequences=True),
    LSTM(64),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))
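
The final Dense layer outputs a probability between 0 and 1, so a threshold is needed to turn it into a class label. The sketch below shows that step and a simple confusion count in plain Python. The values in probs and y_true are made up for illustration; with the trained model, the probabilities would come from something like model.predict(X_test).ravel().

```python
def classify(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up model outputs and true labels (assumption, not real results)
probs = [0.91, 0.12, 0.55, 0.40]
y_true = [1, 0, 0, 1]

y_pred = classify(probs)
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(y_pred)    # [1, 0, 1, 0]
print(accuracy)  # 0.5
```

Looking at false positives and false negatives separately often says more than accuracy alone, especially if the threshold is later moved away from 0.5.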

Step 5: Deploying the Model as a REST API

Once the model is trained, we will save it together with the tokenizer and serve predictions from a small Flask application.

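# --- In the training script: save the trained model and tokenizer ---
model.save('sentiment_model.h5')

with open('tokenizer.json', 'w') as f:
    f.write(tokenizer.to_json())

# --- app.py: load the saved artifacts and serve predictions ---
from flask import Flask, request, jsonify
import tensorflow as tf
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the model and the tokenizer fitted during training
model = tf.keras.models.load_model('sentiment_model.h5')
with open('tokenizer.json') as f:
    tokenizer = tokenizer_from_json(f.read())

def predict_sentiment(text):
    # preprocess_text (Step 2) must be defined or imported in app.py so that
    # inference applies the same cleaning as training
    sequence = tokenizer.texts_to_sequences([preprocess_text(text)])
    padded = pad_sequences(sequence, maxlen=200)
    probability = float(model.predict(padded)[0][0])  # single sigmoid output
    return 'Positive' if probability > 0.5 else 'Negative'

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True) or {}
    text = data.get('text')
    if not isinstance(text, str):
        # Reject requests without a usable 'text' field
        return jsonify({'error': "Request body must be JSON with a 'text' string"}), 400
    return jsonify({'Sentiment': predict_sentiment(text)})

if __name__ == '__main__':
    app.run(debug=True)  # debug mode is for local development only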

Testing the API

Run app.py and test using Postman or cURL:

curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"text": "This movie was amazing!"}'
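
The same request can be sent from Python. This is a minimal sketch using only the standard library; build_request and get_sentiment are hypothetical helper names, and the URL is the one from the cURL command above. Building the request is separated from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

API_URL = 'http://127.0.0.1:5000/predict'  # same endpoint as the cURL example

def build_request(text, url=API_URL):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps({'text': text}).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

def get_sentiment(text):
    """Send the request; requires the Flask app to be running locally."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.loads(resp.read().decode('utf-8'))['Sentiment']

req = build_request('This movie was amazing!')
print(req.get_method(), req.full_url)
```

With app.py running, get_sentiment('This movie was amazing!') would return the value of the 'Sentiment' field from the JSON response.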

Conclusion

This project demonstrated how to preprocess text data, train LSTMs for sentiment classification, and deploy the model as a REST API. Future improvements include using transformer-based models like BERT for better accuracy.