![[StormlightArchive.jpg|banner]]

# **Book Review Sentiment Analysis**

**Project Overview**: This project focuses on sentiment analysis of reviews of the popular fantasy novel *The Way of Kings* by Brandon Sanderson. As a fan of the series, I wanted to analyze how other readers felt about the book. Reviews were collected from Goodreads, capturing review content, ratings, and dates while excluding user-specific information.

---

### **Project Goals**

1. **Data Collection**: Extracted Goodreads reviews using Python.
2. **Data Preparation**: Cleaned and formatted the data, optimizing the 'content' column for feature engineering.
3. **Sentiment Analysis**: Identified sentiment trends, correlations between ratings and sentiment, key themes in reviews, and temporal effects using the `VADER` and `RoBERTa` pre-trained models.
4. **Feedback Extraction**: Derived actionable feedback from the sentiment analysis and review content.

---

**Project Tasks**

- [x] Data Collection
- [x] Data Cleaning
- [x] EDA
- [x] Sentiment Analysis
- [x] Model Comparison
- [x] Identify key themes
- [x] Actionable feedback

---

### **Skills Demonstrated**

- **API and Automation**: Automated review extraction using web-scraping techniques
- **Data Cleaning**: Processed raw data, removed inconsistencies, and prepared data for analysis
- **Forecasting & Business Intelligence**: Applied pre-trained models to derive sentiment insights
- **Data Visualization**: Created visualizations to present findings clearly

---

### **Data Preparation and Cleaning Process**

Cleaned the input DataFrame by:

1. Dropping rows with missing 'content'.
2. Stripping HTML tags and converting 'content' to lowercase strings.
3. Converting 'date' to datetime format.
4. Removing duplicate rows.
5. Saving the cleaned DataFrame to `cleaned_reviews.csv`.
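The collection step isn't shown in code later in this write-up, so here is a minimal, hypothetical sketch of what the scraping could look like. The URL handling, the `div.review-text` selector, and both function names are illustrative assumptions, not the actual script used; Goodreads markup changes frequently, so any selector has to be verified against the live page.

```python
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Extract review text blocks from one page of review HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.review-text" is a placeholder selector; inspect the real page
    # markup before relying on it.
    return [node.get_text(strip=True) for node in soup.select("div.review-text")]

def fetch_reviews(url):
    """Download one page of reviews and parse out the review bodies."""
    import requests  # imported here so parse_reviews stays usable offline
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_reviews(response.text)
```

Pages collected this way would be accumulated into the `reviews.csv` file that the cleaning step reads.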
---

### **Data Management & Python Highlights**

- **Python Proficiency**: Utilized Python libraries such as `Pandas`, `NumPy`, and `Matplotlib` for data manipulation, cleaning, and visualization.
- **Sentiment Analysis**: Performed sentiment analysis using `VADER` (Valence Aware Dictionary and sEntiment Reasoner) for sentiment scoring and `RoBERTa` for advanced sentiment classification.
- **Aggregations**: Aggregated sentiment data by rating and over time to surface trends and patterns in the reviews.

---

### **Key Analyses and Insights**

**1. Trends Over Time**

Analyzed review trends to understand how sentiment and ratings evolved around key events:

1. Initial launch (2010).
2. Release of sequels (_Words of Radiance_, 2014; _Oathbringer_, 2017; _Rhythm of War_, 2020).

**2. Sentiment Analysis**

- Conducted sentiment analysis on review content using:
    - **VADER**: A lexicon-based sentiment analysis tool.
    - **RoBERTa**: A transformer-based pre-trained model.
- Analyzed the correlation between ratings and sentiment scores.

Findings:

- Neutral sentiment scores were higher than expected for lower ratings, indicating a large volume of non-negative language even in critical reviews.

**3. Visualization Highlights**

- Sentiment Distribution by Rating
    - Visualized sentiment scores (positive, neutral, negative) across ratings.
- Trends Over Time
    - Explored the distribution of ratings and sentiment changes over months and years.

---

### **Actionable Feedback**

Key Takeaways:

- Reviews spike sharply around the release of newer books in the series, indicating renewed interest in earlier entries.
- Readers struggled to finish the book or felt it dragged on.
- Some readers were overwhelmed by the plot and the number of characters, leading to confusion.

**Recommendations**:

- Create marketing strategies that promote sales of previous releases (e.g., store placement, special editions, and discounts).
- Create unique merchandise based on interests expressed in reviews (e.g., characters, phrases, etc.).
- Improve pacing and structure in future installments to reduce perceived drag.
- Provide supplementary material (e.g., character guides, summaries, and other online resources) to assist readers.

---

### Data Visualizations & Key Metrics:

Created various visualizations to better understand the data:

- Sentiment Distribution by Rating: Used VADER and RoBERTa to calculate sentiment scores and visualize how they change across ratings.
- Trends Over Time: Analyzed the distribution of ratings over time to spot major trends and events that impacted reviews.

---

# Data Cleaning & Exploration

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from bs4 import BeautifulSoup
import nltk

plt.style.use('seaborn-v0_8-pastel')
nltk.download('vader_lexicon')

df = pd.read_csv('reviews.csv')
df.head()
```

![[Screenshot 2024-11-23 at 1.28.18 AM.png]]

## Cleaning/Preparation Dataset

```python
def clean_data(df, output_path):
    def clean_text(text):
        text = BeautifulSoup(text, "html.parser").get_text()  # strip HTML tags
        text = re.sub(r"http\S+|www\S+", "", text)            # remove URLs
        text = re.sub(r"[^a-zA-Z\s]", "", text)               # keep letters and whitespace only
        return text.lower().strip()

    df_cleaned = df.dropna(subset=['content']).copy()
    df_cleaned['content'] = df_cleaned['content'].astype(str).apply(clean_text)

    if 'date' in df_cleaned.columns:
        # Timestamps are stored as milliseconds since the epoch
        df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], unit='ms', errors='coerce')

    df_cleaned = df_cleaned.drop_duplicates()
    df_cleaned.to_csv(output_path, index=False)
    return df_cleaned

output_path = '/Users/alpha/Desktop/cleaned_reviews.csv'
df_cleaned = clean_data(df, output_path)
```

![[Screenshot 2024-12-03 at 12.17.27 AM.png]]

## EDA

```python
ax = df['rating'].value_counts().sort_index().plot(kind='bar',
                                                   title='Count of Reviews by Stars',
                                                   figsize=(10, 5))
ax.set_ylabel('Count')
ax.set_xlabel('Review Stars')
plt.show()
```

![[Pasted image 20241123013857.png]]

```python
# Derive a Year-Month column so reviews can be grouped by month
df_cleaned['year_month'] = df_cleaned['date'].dt.to_period('M')

rating_trends = df_cleaned.groupby(['year_month', 'rating']).size().unstack(fill_value=0)
rating_trends.plot(kind='line', figsize=(10, 6), title='Rating Distribution Over Time (1-5)')
plt.ylabel('Count')
plt.xlabel('Year-Month')
plt.tight_layout()
plt.show()
```

![[Pasted image 20241129151540.png]]

Chronological events that drove spikes in reviews:

1. Initial launch of the book (2010)
2. Release of the 2nd book, _Words of Radiance_ (2014)
3. Release of the 3rd book, _Oathbringer_ (2017)
4. Release of the 4th book, _Rhythm of War_, during COVID (2020)

# Sentiment Analysis

- VADER (Valence Aware Dictionary and sEntiment Reasoner)
- RoBERTa pre-trained model

## VADER Sentiment Scoring

```python
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

sia = SentimentIntensityAnalyzer()

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    res[i] = sia.polarity_scores(row['content'])

vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'id'})
df = df.reset_index().rename(columns={'index': 'id'})
vaders = vaders.merge(df, how='left', on='id')
```

![[Screenshot 2024-11-23 at 1.36.52 AM.png]]

## RoBERTa Pre-trained Model Scoring

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# Three-class RoBERTa sentiment model; labels are ordered negative, neutral, positive
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Fine-tune a pretrained model: https://huggingface.co/docs/transformers/training

def polarity_scores_roberta(example):
    try:
        encoded_text = tokenizer(example, max_length=512, padding='max_length',
                                 truncation=True, return_tensors='pt')
        output = model(**encoded_text)
        scores = softmax(output[0][0].detach().numpy())
        return {
            "roberta_neg": scores[0],
            "roberta_neu": scores[1],
            "roberta_pos": scores[2],
        }
    # Surface errors encountered while running the model
    except Exception as e:
        print(f"Error processing text: {example}, Error: {e}")
        return {"roberta_neg": None, "roberta_neu": None, "roberta_pos": None}

# Give the cleaned frame an 'id' column so scores can be merged back later
df_cleaned = df_cleaned.reset_index().rename(columns={'index': 'id'})

res = {}
for i, row in tqdm(df_cleaned.iterrows(), total=len(df_cleaned)):
    text = row['content']
    myid = row['id']
    vader_result = {f"vader_{key}": value for key, value in sia.polarity_scores(text).items()}
    roberta_result = polarity_scores_roberta(text)
    res[myid] = {**vader_result, **roberta_result}
# Runtime: 110m 37.1s

# Combine both models' scores with the review metadata (e.g., rating)
results_df = pd.DataFrame(res).T.reset_index().rename(columns={'index': 'id'})
results_df = results_df.merge(df_cleaned, how='left', on='id')
```

# Model Evaluation and Comparison

```python
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
sns.barplot(data=results_df, x='rating', y='vader_pos', ax=axs[0, 0])
sns.barplot(data=results_df, x='rating', y='vader_neu', ax=axs[0, 1])
sns.barplot(data=results_df, x='rating', y='vader_neg', ax=axs[1, 0])
sns.barplot(data=results_df, x='rating', y='vader_compound', ax=axs[1, 1])
axs[0, 0].set_title('Positive')
axs[0, 1].set_title('Neutral')
axs[1, 0].set_title('Negative')
axs[1, 1].set_title('Compound')
plt.tight_layout()
plt.show()
```

![[Pasted image 20241203000523.png]]

**RoBERTa Sentiment by Rating**

```python
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
sns.barplot(data=results_df, x='rating', y='roberta_pos', ax=axs[0])
sns.barplot(data=results_df, x='rating', y='roberta_neu', ax=axs[1])
sns.barplot(data=results_df, x='rating', y='roberta_neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.show()
```

![[Pasted image 20241128133452.png]]

```python
sns.pairplot(data=results_df,
             vars=['vader_neg', 'vader_neu', 'vader_pos', 'vader_compound',
                   'roberta_neg', 'roberta_neu', 'roberta_pos'],
             hue='rating', palette='tab10')
plt.show()
```

![[Pasted image 20241202235534.png]]

**Consistency of Sentiment and Rating**: Are sentiment scores strongly correlated with rating? High ratings should generally correlate with higher positive sentiment and lower negative sentiment.

**Model Agreement**: Are VADER and RoBERTa in agreement, or do they differ significantly?
Differences could provide insight into each model's strengths and weaknesses.

**Sentiment Distribution by Rating**: Can you observe clusters of specific sentiments for specific rating categories?

## Average Sentiment Score Breakdown

- Created a new DataFrame grouping the data by rating and averaging the sentiment scores
- Saved the aggregated results to `aggregated_table.csv`
- Created an Excel table:
    - Formatted numbers
    - Conditional formatting, color-graded from lowest to highest values

```python
aggregated_results = results_df.groupby('rating')[['vader_neg', 'vader_neu', 'vader_pos', 'vader_compound',
                                                   'roberta_neg', 'roberta_neu', 'roberta_pos']].mean()
print(aggregated_results)
```

![[Screenshot 2024-11-29 at 1.13.26 PM.png]]

```python
# Keep the rating index so each row stays labeled in the CSV
aggregated_results.to_csv('aggregated_table.csv')
```

![[Screenshot 2024-12-02 at 11.53.14 PM.png]]

**When comparing the average VADER & RoBERTa scores:**

- **VADER Sentiment Scores:**
    - Negative Sentiment (`vader_neg`): Rises as ratings decrease
    - Neutral Sentiment (`vader_neu`): Remains consistent, declining only slightly at higher ratings
    - Positive Sentiment (`vader_pos`): Rises as ratings increase
- **RoBERTa Sentiment Scores:**
    - Negative Sentiment (`roberta_neg`): Behaves similarly to VADER.
    - Neutral Sentiment (`roberta_neu`): Peaks around ratings 2-3 and declines significantly at higher ratings.
    - Positive Sentiment (`roberta_pos`): Peaks around rating 3, then declines slightly at rating 5.
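The rating-versus-sentiment patterns summarized above can also be quantified directly with a rank correlation. A minimal sketch, assuming a `results_df`-style frame with a `rating` column and per-model score columns; the demo frame and its values below are invented for illustration, not the project's real data:

```python
import pandas as pd

def sentiment_rating_correlations(df, score_cols, rating_col="rating"):
    """Spearman correlation between the star rating and each sentiment score."""
    return {col: df[rating_col].corr(df[col], method="spearman")
            for col in score_cols}

# Toy illustration: positive sentiment rises with rating, negative falls
demo = pd.DataFrame({
    "rating":    [1, 2, 3, 4, 5],
    "vader_pos": [0.05, 0.10, 0.20, 0.30, 0.40],
    "vader_neg": [0.30, 0.25, 0.15, 0.10, 0.05],
})
corrs = sentiment_rating_correlations(demo, ["vader_pos", "vader_neg"])
```

On real data this gives one number per model and polarity, making the "consistency of sentiment and rating" question checkable rather than purely visual.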
# Negative Sentiment Analysis

Negative Reviews Collection

```python
import pandas as pd

file_path = '/Users/alpha/Desktop/sentiment_analysis.csv'
data = pd.read_csv(file_path)

# Keep only reviews the RoBERTa model scored as strongly negative
negative_threshold = 0.80
negative_content = data[data['roberta_neg'] > negative_threshold]['content']

collected_reviews = negative_content.tolist()
print("Collected Negative Reviews:")
for review in collected_reviews:
    print(review)
```

Negative Phrases Collection

```python
from collections import Counter
from itertools import islice
import re
import pandas as pd

def get_ngrams(texts, stopwords, n=2):
    """Count n-word phrases across texts, skipping stopword tokens."""
    ngram_counts = Counter()
    for text in texts:
        words = [word for word in re.findall(r'\b\w+\b', text.lower())
                 if word not in stopwords]
        ngrams = zip(*(islice(words, i, None) for i in range(n)))
        ngram_counts.update([' '.join(ngram) for ngram in ngrams])
    return ngram_counts

file_path = '/Users/alpha/Desktop/sentiment_analysis.csv'
data = pd.read_csv(file_path)

negative_threshold = 0.7
negative_content = data[data['roberta_neg'] > negative_threshold]['content']

custom_stopwords = set([
    "the", "and", "to", "of", "a", "in", "that", "it", "is", "was", "i",
    "for", "with", "as", "on", "this", "but", "are", "at", "by", "an", "be"
])

small_sample = negative_content.sample(n=min(10000, len(negative_content)), random_state=42)

# n=4 counts four-word phrases
negative_ngram_counts = get_ngrams(small_sample, custom_stopwords, n=4)
negative_common_ngrams = negative_ngram_counts.most_common(50)
for phrase, _ in negative_common_ngrams:
    print(phrase)
```

## **Key criticisms from reviews**:

1. Length of the book:
    - Readers struggled to finish the book or felt it dragged on.
2. Confusion:
    - Some readers were overwhelmed by the plot and the number of characters.

**Recommendations**:

- Improve pacing and structure in future installments to reduce perceived drag.
- Provide supplementary material (e.g., character guides, summaries, and other online resources) to assist readers.
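As a quick sanity check, the n-gram counting logic used for the negative-phrase extraction can be exercised on a toy sample. The helper is repeated here so the snippet runs on its own, and the example texts and stopword set are invented:

```python
from collections import Counter
from itertools import islice
import re

def get_ngrams(texts, stopwords, n=2):
    """Count n-word phrases across texts, skipping stopword tokens."""
    ngram_counts = Counter()
    for text in texts:
        words = [w for w in re.findall(r'\b\w+\b', text.lower())
                 if w not in stopwords]
        ngrams = zip(*(islice(words, i, None) for i in range(n)))
        ngram_counts.update(' '.join(ngram) for ngram in ngrams)
    return ngram_counts

texts = ["The book dragged on and on", "This book dragged on forever"]
counts = get_ngrams(texts, stopwords={"the", "and", "this"}, n=2)
# "book dragged" and "dragged on" each appear in both sentences
```

Recurring phrases like these, counted over thousands of strongly negative reviews, are what surfaced the length and pacing criticisms listed above.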