![[StormlightArchive.jpg|banner]]
# **Book Review Sentiment Analysis**
**Project Overview**: This project focuses on sentiment analysis of reviews of a popular fantasy novel, *The Way of Kings* by Brandon Sanderson. As a fan of the series, I wanted to analyze what other readers felt about the book. Reviews were collected from Goodreads, focusing on review content, ratings, and dates, while excluding user-specific information.
---
### **Project Goals**
1. **Data Collection**: Extracted Goodreads reviews using Python.
2. **Data Preparation**: Cleaned and formatted the data, preparing the `content` field for feature engineering.
3. **Sentiment Analysis**: Identified sentiment trends, correlations between ratings and sentiment, key themes in reviews, and temporal effects using the `VADER` and `RoBERTa` pre-trained models.
4. **Feedback Extraction**: Derived actionable feedback from the sentiment analysis and review content.
---
**Project Tasks**
- [x] Data Collection
- [x] Data Cleaning
- [x] EDA
- [x] Sentiment Analysis
- [x] Model Comparison
- [x] Identify key themes
- [x] Actionable feedback
---
### **Skills Demonstrated**
- **API and Automation**: Automated review extraction using web scraping techniques
- **Data Cleaning**: Processed raw data, removed inconsistencies, and prepared data for analysis
- **Forecasting & Business Intelligence**: Implemented pre-trained models to derive sentiment insights
- **Data Visualization**: Created visualizations to present findings clearly
---
### **Data Preparation and Cleaning Process**
Cleaned the input DataFrame by:
1. Dropping rows with missing `content`.
2. Stripping HTML tags and converting `content` to lowercase strings.
3. Converting `date` to datetime format.
4. Removing duplicate rows.
5. Saving the cleaned DataFrame to `cleaned_reviews.csv`.
---
### **Data Management & Python Highlights**
- **Python Proficiency**: Utilized Python libraries like `Pandas`, `Numpy`, `Matplotlib` for data manipulation, cleaning, and visualizations.
- **Sentiment Analysis**: Performed sentiment analysis using `VADER` (Valence Aware Dictionary and sEntiment Reasoner) for sentiment scoring and `RoBERTa` for advanced sentiment classification.
- **Aggregations**: Aggregated sentiment data by rating and time to find trends and identify patterns in the reviews.
---
### **Key Analyses and Insights**
**1. Trends Over Time**
Analyzed review trends to understand how sentiment and ratings evolved over key events:
1. Initial Launch (2010).
2. Release of sequels (_Words of Radiance_, 2014; _Oathbringer_, 2017; _Rhythm of War_, 2020).
**2. Sentiment Analysis**
- Conducted sentiment analysis on review content using:
    - **VADER**: A lexicon-based sentiment analysis tool.
    - **RoBERTa**: A transformer-based pre-trained model.
- Analyzed the correlation between ratings and sentiment scores.

Findings:
- Neutral sentiment scores remained relatively high even at low ratings, indicating that many critical reviews were worded in non-negative language.
**3. Visualization Highlights**
- Sentiment Distribution by Rating
- Visualized sentiment scores (positive, neutral, negative) across ratings.
- Trends over Time
- Explored the distribution of ratings and sentiment changes over months and years
---
### **Actionable Feedback**
Key Takeaways:
- There is a significant spike in reviews around the release of newer books in the series, indicating renewed interest in the older titles.
- Readers struggled to finish the book or felt it dragged on.
- Some readers were overwhelmed by the plot and the number of characters, leading to confusion.
**Recommendations**:
- Create marketing strategies that promote sales of previous releases (e.g., store placement, special editions, and discounts).
- Create unique merchandise based on interests expressed in reviews (e.g., characters, memorable phrases).
- Improve pacing and structure in future installments to reduce perceived drag.
- Provide supplementary material (e.g., character guides, summaries, and other online resources) to assist readers.
---
### Data Visualizations & Key Metrics:
Created various visualizations to better understand the data:
- Sentiment Distribution by Rating: Used VADER and RoBERTa to calculate sentiment scores and visualize how they change with different ratings.
- Trends Over Time: Analyzed the distribution of ratings over time to spot major trends and events that impacted reviews.
---
# Data Cleaning & Exploration
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from bs4 import BeautifulSoup
plt.style.use('seaborn-v0_8-pastel')
import nltk
nltk.download('vader_lexicon')
df = pd.read_csv('reviews.csv')
df.head()
```
![[Screenshot 2024-11-23 at 1.28.18 AM.png]]
## Cleaning/Preparation Dataset
```python
from bs4 import BeautifulSoup
import pandas as pd
import re

def clean_data(df, output_path):
    def clean_text(text):
        text = BeautifulSoup(text, "html.parser").get_text()
        text = re.sub(r"http\S+|www\S+", "", text)
        text = re.sub(r"[^a-zA-Z\s]", "", text)
        return text.lower().strip()

    df_cleaned = df.dropna(subset=['content']).copy()
    df_cleaned['content'] = df_cleaned['content'].astype(str).apply(clean_text)
    if 'date' in df_cleaned.columns:
        df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], unit='ms', errors='coerce')
    df_cleaned = df_cleaned.drop_duplicates()
    df_cleaned.to_csv(output_path, index=False)
    return df_cleaned

output_path = '/Users/alpha/Desktop/cleaned_reviews.csv'
df_cleaned = clean_data(df, output_path)
```
![[Screenshot 2024-12-03 at 12.17.27 AM.png]]
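As a quick, hypothetical sanity check of the text-cleaning steps (HTML stripping approximated here with a bare tag regex instead of BeautifulSoup, so the snippet stays dependency-free):

```python
import re

def clean_text_lite(text):
    # Approximation of the pipeline's clean_text without BeautifulSoup
    text = re.sub(r"<[^>]+>", "", text)          # strip HTML tags
    text = re.sub(r"http\S+|www\S+", "", text)   # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)      # keep letters/whitespace only
    return text.lower().strip()

sample = '<b>Amazing</b> book! 5/5 stars. See https://example.com'
cleaned = clean_text_lite(sample)
print(cleaned)
```

Tags, the rating fragment, and the URL are all removed, leaving only lowercase words.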
## EDA
```python
ax = df['rating'].value_counts().sort_index().plot(kind='bar',
                                                   title='Count of Reviews by Stars',
                                                   figsize=(10, 5))
ax.set_ylabel('Count')
ax.set_xlabel('Review Stars')
plt.show()
```
![[Pasted image 20241123013857.png]]
```python
df_cleaned['year_month'] = df_cleaned['date'].dt.to_period('M')  # derive the month bucket used below
rating_trends = df_cleaned.groupby(['year_month', 'rating']).size().unstack(fill_value=0)
rating_trends.plot(kind='line', figsize=(10, 6), title='Rating Distribution Over Time (1-5)')
plt.ylabel('Count')
plt.xlabel('Year-Month')
plt.tight_layout()
plt.show()
```
![[Pasted image 20241129151540.png]]
Chronological events that coincided with spikes in reviews:
1. Initial launch of the book (2010)
2. Release of the 2nd book, *Words of Radiance* (2014)
3. Release of the 3rd book, *Oathbringer* (2017)
4. Release of the 4th book, *Rhythm of War* (2020), overlapping with COVID-19 lockdowns
# Sentiment Analysis
- VADER (Valence Aware Dictionary and sEntiment Reasoner)
- RoBERTa pre-trained model
## VADER Sentiment Scoring
```python
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm import tqdm

sia = SentimentIntensityAnalyzer()

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['content']
    myid = i
    res[myid] = sia.polarity_scores(text)

vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'id'})
df = df.reset_index().rename(columns={'index': 'id'})
vaders = vaders.merge(df, how='left', on='id')
```
![[Screenshot 2024-11-23 at 1.36.52 AM.png]]
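The `pd.DataFrame(res).T` step is the workhorse here: `res` is a dict of per-review score dicts keyed by id, and transposing turns each id into a row. A toy sketch (scores invented purely for illustration):

```python
import pandas as pd

# Toy per-review polarity dicts, keyed by review id (values are made up)
res = {
    0: {'neg': 0.1, 'neu': 0.6, 'pos': 0.3, 'compound': 0.4},
    1: {'neg': 0.5, 'neu': 0.4, 'pos': 0.1, 'compound': -0.6},
}

scores = pd.DataFrame(res).T                                  # ids become rows
scores = scores.reset_index().rename(columns={'index': 'id'})
print(scores.columns.tolist())  # → ['id', 'neg', 'neu', 'pos', 'compound']
```

With an `id` column in place, the scores merge cleanly back onto the review DataFrame.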
## RoBERTa Pre-trained Model Scoring
```python
import torch
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

# 3-class RoBERTa sentiment model (negative/neutral/positive), matching the roberta_* keys below
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Fine-tune a pretrained model: https://huggingface.co/docs/transformers/training

def polarity_scores_roberta(example):
    try:
        encoded_text = tokenizer(example,
                                 max_length=512,
                                 padding='max_length',
                                 truncation=True,
                                 return_tensors='pt')
        output = model(**encoded_text)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)
        return {
            "roberta_neg": scores[0],
            "roberta_neu": scores[1],
            "roberta_pos": scores[2],
        }
    # Surface errors without halting the full run
    except Exception as e:
        print(f"Error processing text: {example}, Error: {e}")
        return {"roberta_neg": None, "roberta_neu": None, "roberta_pos": None}

res = {}
for i, row in tqdm(df_cleaned.iterrows(), total=len(df_cleaned)):
    text = row['content']
    myid = i
    vader_result = {f"vader_{key}": value for key, value in sia.polarity_scores(text).items()}
    roberta_result = polarity_scores_roberta(text)
    res[myid] = {**vader_result, **roberta_result}
# Runtime: 110m 37.1s
```
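The model returns raw logits, and `softmax` rescales them into probabilities that sum to 1, which is what makes the `roberta_*` scores comparable across reviews. A dependency-light sketch of what `scipy.special.softmax` computes (logits invented for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

logits = np.array([-1.2, 0.3, 2.1])   # hypothetical neg/neu/pos logits
probs = softmax(logits)
print(probs.sum())  # → 1.0 (up to floating point)
```

The largest logit always maps to the largest probability, so the predicted class is unchanged by the rescaling.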
# Model Evaluation and Comparison
```python
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
sns.barplot(data=results_df, x='rating', y='vader_pos', ax=axs[0, 0])
sns.barplot(data=results_df, x='rating', y='vader_neu', ax=axs[0, 1])
sns.barplot(data=results_df, x='rating', y='vader_neg', ax=axs[1, 0])
sns.barplot(data=results_df, x='rating', y='vader_compound', ax=axs[1, 1])
axs[0, 0].set_title('Positive')
axs[0, 1].set_title('Neutral')
axs[1, 0].set_title('Negative')
axs[1, 1].set_title('Compound')
plt.tight_layout()
plt.show()
```
![[Pasted image 20241203000523.png]]
RoBERTa Sentiment by Rating
```python
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
sns.barplot(data=results_df, x='rating', y='roberta_pos', ax=axs[0])
sns.barplot(data=results_df, x='rating', y='roberta_neu', ax=axs[1])
sns.barplot(data=results_df, x='rating', y='roberta_neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.show()
```
![[Pasted image 20241128133452.png]]
```python
sns.pairplot(data=results_df,
vars=['vader_neg', 'vader_neu', 'vader_pos', 'vader_compound', 'roberta_neg', 'roberta_neu', 'roberta_pos'],
hue='rating',
palette='tab10')
plt.show()
```
![[Pasted image 20241202235534.png]]
**Consistency of Sentiment and Rating**: Are sentiment scores strongly correlated with ratings? High ratings should generally correlate with higher positive sentiment and lower negative sentiment.
**Model Agreement**: Do VADER and RoBERTa agree, or do they differ significantly? Differences could provide insight into each model's strengths and weaknesses.
**Sentiment Distribution by Rating**: Can we observe clusters of specific sentiment for specific rating categories?
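The consistency question can be made concrete with `DataFrame.corr`; a minimal sketch on invented numbers shaped like `results_df` (column names assumed from the plots above):

```python
import pandas as pd

# Toy frame mimicking results_df's layout (values invented for illustration)
toy = pd.DataFrame({
    'rating':         [1, 2, 3, 4, 5],
    'vader_compound': [-0.6, -0.2, 0.1, 0.5, 0.8],
    'roberta_pos':    [0.05, 0.15, 0.40, 0.70, 0.90],
})

corr = toy.corr(method='pearson')
print(corr.loc['rating', 'vader_compound'])  # strongly positive for these toy values
```

On the real data, a weak correlation for one model but not the other would point directly at the model-agreement question.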
## Average Sentiment Score Breakdown
- Created a new DataFrame grouping reviews by rating and averaging each sentiment score
- Saved the aggregated results to `aggregated_table.csv`
- Created an Excel table:
    - Formatted numbers
    - Applied conditional formatting, color-grading values from lowest to highest
```python
aggregated_results = results_df.groupby('rating')[['vader_neg', 'vader_neu', 'vader_pos', 'vader_compound', 'roberta_neg', 'roberta_neu', 'roberta_pos']].mean()
print(aggregated_results)
```
![[Screenshot 2024-11-29 at 1.13.26 PM.png]]
```python
aggregated_results.to_csv('aggregated_table.csv')  # keep the rating index so each row stays labeled
```
![[Screenshot 2024-12-02 at 11.53.14 PM.png]]
**When comparing the average VADER & RoBERTa scores:**
- **VADER Sentiment Scores:**
    - Negative Sentiment (`vader_neg`): Lower ratings correspond to higher negative sentiment
    - Neutral Sentiment (`vader_neu`): Remains fairly consistent, declining only slightly at higher ratings
    - Positive Sentiment (`vader_pos`): Higher ratings correspond to higher positive sentiment
- **RoBERTa Sentiment Scores:**
    - Negative Sentiment (`roberta_neg`): Behaves similarly to VADER.
    - Neutral Sentiment (`roberta_neu`): Peaks around ratings 2-3 and declines significantly at higher ratings.
    - Positive Sentiment (`roberta_pos`): Peaks around rating 3, then declines slightly at rating 5.
# Negative Sentiment Analysis
Negative Reviews Collection
```python
import pandas as pd
file_path = '/Users/alpha/Desktop/sentiment_analysis.csv'
data = pd.read_csv(file_path)
negative_threshold = 0.80
negative_content = data[
(data['roberta_neg'] > negative_threshold)
]['content']
collected_reviews = negative_content.tolist()
print("Collected Negative Reviews:")
for review in collected_reviews:
    print(review)
```
Negative Phrases Collection
```python
from collections import Counter
from itertools import islice
import re
import pandas as pd
def get_ngrams(texts, stopwords, n=2):
    ngram_counts = Counter()
    for text in texts:
        words = [word for word in re.findall(r'\b\w+\b', text.lower()) if word not in stopwords]
        ngrams = zip(*(islice(words, i, None) for i in range(n)))
        ngram_counts.update([' '.join(ngram) for ngram in ngrams])
    return ngram_counts

file_path = '/Users/alpha/Desktop/sentiment_analysis.csv'
data = pd.read_csv(file_path)

negative_threshold = 0.7
negative_content = data[
    (data['roberta_neg'] > negative_threshold)
]['content']

custom_stopwords = set([
    "the", "and", "to", "of", "a", "in", "that", "it", "is", "was",
    "i", "for", "with", "as", "on", "this", "but", "are", "at", "by", "an", "be"
])

small_sample = negative_content.sample(n=min(10000, len(negative_content)), random_state=42)
negative_ngram_counts = get_ngrams(small_sample, custom_stopwords, n=4)  # 4-word phrases
negative_common_ngrams = negative_ngram_counts.most_common(50)
for phrase, _ in negative_common_ngrams:
    print(phrase)
```
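The zip-of-slices trick inside `get_ngrams` is compact enough to verify on a toy sentence; with `n=2` it produces overlapping bigrams:

```python
from collections import Counter
from itertools import islice
import re

def bigram_counts(text, stopwords):
    # Same sliding-window idea as get_ngrams, fixed at n=2
    words = [w for w in re.findall(r'\b\w+\b', text.lower()) if w not in stopwords]
    pairs = zip(*(islice(words, i, None) for i in range(2)))
    return Counter(' '.join(p) for p in pairs)

counts = bigram_counts("The book dragged on and the book dragged", {"the", "and"})
print(counts.most_common(1))  # → [('book dragged', 2)]
```

Each `islice` starts the word list one position later, so zipping them walks a window across the text.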
## **Key criticisms from reviews**:
1. Length of the book:
- Readers struggled to finish the book or felt it dragged on.
2. Confusion:
- Some readers were overwhelmed by the plot and the number of characters.
**Recommendations**:
- Improve pacing and structure in future installments to reduce perceived drag.
- Provide supplementary material (e.g., character guides, summaries, and other online resources) to assist readers.