# **Online Retail Sales Analysis**
**Project Overview:** This project involves analyzing transactional data from a UK-based online retailer specializing in unique occasion gifts, covering sales transactions from December 2010 to December 2011. The dataset includes essential columns such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. The analysis was conducted using Excel, Python, PostgreSQL, and Tableau to uncover key performance indicators, trends, and insights.
---
### **Project Goals**
1. **Calculate Quarterly KPIs**: Track total sales, quantities sold, and average order value.
2. **Identify Products**: Analyze revenue and quantities for the top 10 products per quarter.
3. **Perform RFM Analysis**: Segment customers for targeted marketing strategies.
4. **Create Data Visualizations**: Create interactive dashboards using Tableau.
---
### **Skills Demonstrated**
- **Database Management & SQL**: Structured queries, CTEs, aggregations, data cleaning, and filtering.
- **Data Analysis**: RFM segmentations, KPIs, customer retention, and churn analysis.
- **Visualization & Insights**: Dashboards in Tableau with actionable KPIs.
---
### **Data Preparation and Cleaning Process**
**SQL**
1. **Data Import and Formatting**: Created a `sales_history` table, reformatted data types, and resolved import errors.
2. **NULL Value Handling**:
   - Used `StockCode` correlations to fill missing descriptions.
   - Filled `CustomerID` values based on stock-to-customer linkages.
   - **Result**: Reduced remaining NULL values to less than 0.06%, minimizing skew in downstream analysis.
**Python**
3. **Created** a `Sales` column (`Quantity * UnitPrice`) for revenue analysis.
4. **Converted** the `InvoiceDate` column to datetime format.
5. **Flagged** canceled orders by cleaning `InvoiceNo` rather than removing them, so cancellations remain reflected in sales totals.
---
### **Data Management & SQL Highlights**
- **SQL Proficiency**: Utilized SELECT, JOIN, GROUP BY, and ORDER BY for data analysis.
- **CTEs for Intermediate Calculations**: Created views for customer sales, product sales, and quarterly AOV.
- **Aggregations**: Employed SUM, COUNT, and AVG to analyze sales data by quarter, month, and product.
---
### **Data Visualizations & Key Metrics**
Created comprehensive Tableau dashboards to visualize:
- Quarterly KPI metrics (total sales, quantity, AOV)
- Product sales and quantities across top-performing items
- Daily, weekly, and monthly sales trends
Custom Calculated Fields and Parameters
- Date parameters for time-based filtering.
- Fields for current/previous quarter comparisons and quarterly/yearly filters.
- Visual indicators for trend deviations and average benchmarks.
This project effectively leveraged Python, PostgreSQL, and Tableau to produce valuable insights into the retailer’s sales performance, ensuring accurate, efficient, and visually compelling data presentation.
# Data Cleaning / Preparation
### SQL
Identifying NULL values:
- `Description`: 1,454 of 541,909 rows (0.29%)
- `CustomerID`: 135,080 of 541,909 rows (24.93%)

`CustomerID` had a significant number of NULL values.
```SQL
SELECT
    COUNT(CASE WHEN InvoiceNo IS NULL THEN 1 END) AS NullInvoiceNo,
    COUNT(CASE WHEN StockCode IS NULL THEN 1 END) AS NullStockCode,
    COUNT(CASE WHEN Description IS NULL THEN 1 END) AS NullDescription,
    COUNT(CASE WHEN Quantity IS NULL THEN 1 END) AS NullQuantity,
    COUNT(CASE WHEN InvoiceDate IS NULL THEN 1 END) AS NullInvoiceDate,
    COUNT(CASE WHEN UnitPrice IS NULL THEN 1 END) AS NullUnitPrice,
    COUNT(CASE WHEN CustomerID IS NULL THEN 1 END) AS NullCustomerID,
    COUNT(CASE WHEN Country IS NULL THEN 1 END) AS NullCountry
FROM sales_history;
```
Output:
![[Screenshot 2024-10-28 at 2.09.31 PM.png]]
Since NULL values affect around 25% of the data, it was better to fill in the blanks than to drop rows. After examining the correlations between columns, I determined that matching `StockCode` to `Description` and `CustomerID` was the best way to fill in missing values. A StockCode (SKU) is a unique code assigned to each product in inventory, which in turn ties it closely to product information and sales transactions.
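The same StockCode-based fill can be sketched in pandas (a toy frame with illustrative stock codes, not the project's full data):

```python
import pandas as pd

# Toy order lines; the second and third rows are missing a description
df = pd.DataFrame({
    'StockCode': ['85123A', '85123A', '71053', '71053'],
    'Description': ['WHITE HANGING HEART', None, None, 'WHITE METAL LANTERN'],
})

# Build a StockCode -> Description lookup from the non-null rows,
# then map it over the codes to fill the gaps
lookup = (df.dropna(subset=['Description'])
            .drop_duplicates('StockCode')
            .set_index('StockCode')['Description'])
df['Description'] = df['Description'].fillna(df['StockCode'].map(lookup))
```

Rows with a code that never appears alongside a description stay NULL, which is why a small residue remains after the SQL updates below.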
Filling in NULL values for `Description`
Of 1,454 NULL values, only 122 remained after the update.
```SQL
UPDATE sales_history AS s1
SET Description = (
    SELECT Description
    FROM sales_history AS s2
    WHERE s1.StockCode = s2.StockCode
      AND s2.Description IS NOT NULL
      AND s2.Description <> ''
    LIMIT 1  -- LIMIT without ORDER BY picks an arbitrary matching row
)
WHERE Description IS NULL
  AND StockCode IS NOT NULL;
```
Filling in NULL values for `CustomerID`
Of 135,080 NULL values, only 3,009 remained after the update.
```SQL
UPDATE sales_history AS s1
SET CustomerID = (
    SELECT CustomerID
    FROM sales_history AS s2
    WHERE s1.StockCode = s2.StockCode
      AND s2.CustomerID IS NOT NULL
    LIMIT 1
)
WHERE CustomerID IS NULL
  AND StockCode IS NOT NULL;
```
As a result, the share of remaining NULL values fell below 0.06%, a small enough amount to remove without skewing further analysis.
### Python
Creating Sales Columns for Monetary segment of RFM analysis
```python
new_df['Sales'] = new_df['Quantity'] * new_df['UnitPrice']
```
Formatting date
```python
new_df['InvoiceDate'] = pd.to_datetime(new_df['InvoiceDate'])
```
Creating a `CleanedInvoiceNo` column so canceled orders are taken into account when calculating monetary value, along with an `IsCanceled` flag marking which rows are cancellations for potential returns analysis.
```python
# Cancellations are prefixed with 'C'; strip only the leading 'C',
# not every 'C' in the string
new_df['IsCanceled'] = new_df['InvoiceNo'].str.startswith('C')
new_df['CleanedInvoiceNo'] = new_df['InvoiceNo'].str.replace('^C', '', regex=True)
```
![[Screenshot 2024-12-13 at 3.55.10 PM.png]]
# Data Analysis
### Market Analysis
Customer First Transaction
```SQL
SELECT customerid, MIN(invoicedate) AS acquisition_date
FROM sales_analysis_view
GROUP BY customerid
ORDER BY acquisition_date;
```
![[Screenshot 2024-11-08 at 7.45.59 PM.png]]
Registration Rolling Total
```SQL
WITH reg_dates AS (
SELECT
customerid,
-- Earliest registration date using MIN() function
MIN(invoicedate) AS reg_date
FROM sales_analysis_view
GROUP BY customerid),
regs AS (
SELECT
-- Truncating by month
DATE_TRUNC('month', reg_date) :: DATE AS reg_month,
COUNT(DISTINCT customerid) AS regs
FROM reg_dates
GROUP BY reg_month)
SELECT
-- registrations running total by month
TO_CHAR(reg_month, 'FMMonth') AS month_name,
SUM(regs) OVER (ORDER BY reg_month ASC) AS regs_rt
FROM regs
-- Order by month in ascending order
ORDER BY reg_month ASC;
```
![[Screenshot 2024-11-01 at 3.43.20 PM.png]]
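The window-function running total behaves like this cumulative loop (toy monthly counts, not the real registration figures):

```python
# Toy monthly registration counts; the running total accumulates in month order
monthly_regs = {'2010-12': 948, '2011-01': 421, '2011-02': 380}

running = {}
total = 0
for month in sorted(monthly_regs):
    total += monthly_regs[month]
    running[month] = total
# running == {'2010-12': 948, '2011-01': 1369, '2011-02': 1749}
```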
MoM Customer Acquisition
```SQL
-- MoM Customer Acquisition
SELECT
-- TO_CHAR used to directly extract month name
TO_CHAR(DATE_TRUNC('month', acquisition_date), 'Month') AS acquisition_month,
COUNT(*) AS new_customers
FROM (
SELECT customerid, MIN(invoicedate) AS acquisition_date
FROM sales_analysis_view
GROUP BY customerid
) AS customer_acquisitions
GROUP BY DATE_TRUNC('month', acquisition_date)
ORDER BY DATE_TRUNC('month', acquisition_date);
```
![[Screenshot 2024-11-01 at 2.10.43 PM 2.png]]
MoM Retention Rate
```SQL
WITH user_monthly_activity AS (
SELECT DISTINCT
DATE_TRUNC('month', invoicedate) AS purchase_month,
customerid
FROM sales_analysis_view
)
SELECT
TO_CHAR(previous.purchase_month, 'FMMonth') AS month,
ROUND(COUNT(DISTINCT current.customerid) :: NUMERIC /
GREATEST(COUNT(DISTINCT previous.customerid), 1), 2) AS retention_rate
FROM user_monthly_activity AS previous
LEFT JOIN user_monthly_activity AS current
ON previous.customerid = current.customerid
AND current.purchase_month = previous.purchase_month + INTERVAL '1 month'
GROUP BY previous.purchase_month
ORDER BY previous.purchase_month ASC;
```
![[Screenshot 2024-11-01 at 2.58.47 PM.png]]
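The self-join above reduces to set intersection: of the customers active in one month, how many are also active in the next. A pure-Python sketch with made-up customer IDs:

```python
# Month -> set of customers who purchased that month (hypothetical IDs)
activity = {
    '2011-01': {'c1', 'c2', 'c3', 'c4'},
    '2011-02': {'c2', 'c3', 'c5'},
    '2011-03': {'c3'},
}

retention = {}
months = sorted(activity)
for prev, cur in zip(months, months[1:]):
    retained = activity[prev] & activity[cur]          # active in both months
    retention[prev] = round(len(retained) / max(len(activity[prev]), 1), 2)
# retention == {'2011-01': 0.5, '2011-02': 0.33}
```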
Quarterly Sales
```SQL
SELECT
    TO_CHAR(DATE_TRUNC('quarter', invoicedate), 'YYYY "Q"Q') AS quarter,
    SUM(sales) AS total_sales
FROM sales_analysis_view
GROUP BY quarter
ORDER BY quarter;
```
![[Screenshot 2024-11-08 at 7.53.40 PM.png]]
Monthly Sales
```SQL
SELECT
    TO_CHAR(DATE_TRUNC('month', invoicedate), 'Month') AS month,
    SUM(sales) AS total_sales
FROM sales_analysis_view
GROUP BY DATE_TRUNC('month', invoicedate)
ORDER BY DATE_TRUNC('month', invoicedate);
```
![[Screenshot 2024-11-08 at 7.51.34 PM.png]]
Weekly Sales
```SQL
SELECT
    DATE_TRUNC('week', invoicedate) :: DATE AS week,
    SUM(sales) AS total_sales
FROM sales_analysis_view
GROUP BY week
ORDER BY week;
```
![[Screenshot 2024-11-08 at 7.55.11 PM.png]]
Market Share
```SQL
SELECT Description,
SUM(UnitPrice * Quantity) AS TotalSales,
ROUND((SUM(UnitPrice * Quantity) / (SELECT SUM(UnitPrice * Quantity) FROM sales_history) * 100), 2) AS MarketShare
FROM sales_history
GROUP BY Description
ORDER BY TotalSales DESC;
```
![[Screenshot 2024-11-08 at 7.48.19 PM.png]]
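The market-share query divides each product's revenue by the grand total; the same percentage-of-total logic in miniature (hypothetical products and revenues):

```python
# Product -> revenue (toy values); market share = revenue / grand total
sales = {'PRODUCT A': 500.0, 'PRODUCT B': 300.0, 'PRODUCT C': 200.0}

total = sum(sales.values())
share = {name: round(rev / total * 100, 2) for name, rev in sales.items()}
# share == {'PRODUCT A': 50.0, 'PRODUCT B': 30.0, 'PRODUCT C': 20.0}
```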
## RFM Analysis
Initially, I attempted to perform this analysis using SQL but was dissatisfied with the final output. SQL lacked the feature engineering capabilities provided by certain Python libraries, which prompted me to explore alternative approaches for transforming the data. Ultimately, I settled on this method.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler
# Fixed reference date (just after the last transaction) for recency calculation
reference_date = pd.to_datetime('2012-01-01')
# Aggregated RFM metrics
rfm = new_df.groupby('CustomerID').agg(
Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
Frequency=('CleanedInvoiceNo', 'count'),
Monetary=('Sales', 'sum')
).reset_index()
# Log transformations
rfm['LogFrequency'] = np.log1p(rfm['Frequency']) # log1p handles zero values
# RobustScaler for Monetary
robust_scaler = RobustScaler()
rfm['MonetaryScaled'] = robust_scaler.fit_transform(rfm[['Monetary']])
rfm['MonetaryScore'] = pd.qcut(rfm['MonetaryScaled'], q=5, labels=[1, 2, 3, 4, 5])
# Min-Max Scaling for Scores
minmax_scaler = MinMaxScaler(feature_range=(1, 5))
rfm['RecencyScore'] = minmax_scaler.fit_transform(rfm[['Recency']])
rfm['RecencyScore'] = (6 - rfm['RecencyScore']).round() # Reverse and round
rfm['FrequencyScore'] = minmax_scaler.fit_transform(rfm[['LogFrequency']]).round()
# Combine Scores into RFM Segment
rfm['RFM_Segment'] = (
rfm['RecencyScore'].astype(int).astype(str) +
rfm['FrequencyScore'].astype(int).astype(str) +
rfm['MonetaryScore'].astype(int).astype(str)
)
conditions = [
rfm['RFM_Segment'].isin(['555', '554', '544', '545', '454', '455', '445']),
rfm['RFM_Segment'].isin(['543', '444', '435', '355', '354', '345', '344', '335']),
rfm['RFM_Segment'].isin([
'553', '551', '552', '541', '542', '533', '532', '531', '452', '451', '442', '441', '431', '453', '433', '432',
'423', '353', '352', '351', '342', '341', '333', '323'
]),
rfm['RFM_Segment'].isin(['512', '511', '422', '421', '412', '411', '311']),
rfm['RFM_Segment'].isin([
'525', '524', '523', '522', '521', '515', '514', '513', '425', '424', '413', '414', '415', '315', '314', '313'
]),
rfm['RFM_Segment'].isin(['535', '534', '443', '434', '343', '334', '325', '324']),
rfm['RFM_Segment'].isin(['331', '321', '312', '221', '213', '231', '241', '251']),
rfm['RFM_Segment'].isin([
'255', '254', '245', '244', '253', '252', '243', '242', '235', '234', '225', '224', '153', '152', '145', '143',
'142', '135', '134', '133', '125', '124'
]),
rfm['RFM_Segment'].isin(['155', '144', '214', '215', '115', '114', '113']),
rfm['RFM_Segment'].isin(['332', '322', '231', '241', '251', '233', '232', '223', '222', '132', '123', '122', '212', '211']),
rfm['RFM_Segment'].isin(['111', '112', '121', '131', '141', '151'])
]
labels = [
'Champions', 'Loyal', 'Potential Loyalist', 'New Customers', 'Promising',
'Need Attention', 'About To Sleep', 'At Risk', 'Cannot Lose Them',
'Hibernating Customers', 'Lost Customers'
]
# Assign labels to RFM segments
rfm['RFM_Label'] = np.select(conditions, labels, default='Unknown')
rfm_sorted = rfm.sort_values(by='RFM_Label')
rfm_sorted.to_csv('rfm_analysis.csv', index=False)
```
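The recency scoring above scales days-since-last-purchase into [1, 5] and then reverses it so recent buyers score highest; the trick in isolation, with toy day counts rather than the project's data:

```python
def minmax_scale(values, lo=1.0, hi=5.0):
    """Linearly rescale values into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

recency = [2, 30, 100, 365]              # days since last purchase (toy data)
scaled = minmax_scale(recency)           # 1 = most recent, 5 = least recent
scores = [round(6 - s) for s in scaled]  # reverse so recent buyers score high
# scores == [5, 5, 4, 1]
```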
```python
segment_summary = rfm['RFM_Label'].value_counts(normalize=True).reset_index()
segment_summary.columns = ['RFM_Label', 'Proportion']
segment_summary['Percentage'] = (segment_summary['Proportion'] * 100).round(2)
print(segment_summary[['RFM_Label', 'Percentage']])
```
![[Screenshot 2024-12-20 at 12.41.35 AM.png]]
Insights and Recommendations
1. Need Attention (27.77%)
   - This is the largest segment, suggesting that a significant portion of the customer base is disengaging but hasn't yet moved to the 'At Risk' or 'Lost Customers' segments.
   - Recommendations
     - Develop ways to re-engage customers, possibly through promotions or direct emails.
     - Analyze common characteristics (low frequency, long recency) and address these issues.
     - Offer loyalty programs or incentives.
2. Potential Loyalist (17.25%)
   - This segment represents customers who are actively engaging but haven't yet reached the 'Loyal' or 'Champions' segments.
   - Recommendations
     - Provide personalized offers or benefits to encourage customers to buy more.
     - Analyze their purchasing patterns to create targeted marketing campaigns.
     - Offer loyalty rewards such as early access to products or discounts.
3. Promising (16.95%)
   - These customers have shown some activity but aren't fully engaged.
   - Recommendations
     - Focus on building relationships through one-on-one conversations and a deeper understanding of their needs.
     - Track purchase patterns and provide tailored promotions.
# Data Validation
Chart 1-3 (Total Sales, Quantity, & Average Order Value) Quarterly KPI Bins
Problem 1: The Q4 sales/quantity KPI was significantly overvalued in Tableau because data from Q4 2010 and Q4 2011 merged into one bin.
Solution: Remove the 2010 data or add a filter by year.
Problem 2: Average order value didn't match in Tableau; `invoiceno` counts differed between PostgreSQL and Tableau because Tableau imported the invoice column as a numeric data type rather than a string.
Solution: Set the invoice column to a string data type when importing into Tableau.
```SQL
SELECT
TO_CHAR(DATE_TRUNC('quarter', invoicedate), 'YYYY "Q"Q') AS quarter,
SUM(sales) AS total_sales,
SUM(quantity) AS total_quantity,
ROUND(SUM(sales) / COUNT(DISTINCT invoiceno), 2) AS average_order_value,
COUNT(DISTINCT(invoiceno)) AS unique_orders
FROM sales_analysis_view
GROUP BY quarter;
```
![[Screenshot 2024-10-27 at 1.26.43 AM.png]]
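Average order value is total sales over the distinct invoice count, with cancellations kept in so their negative amounts reduce the total. A small worked example with hypothetical invoices:

```python
# (invoice, line amount); two lines share invoice 536365, and the
# 'C'-prefixed invoice is a cancellation with a negative amount
orders = [
    ('536365', 15.30),
    ('536365', 22.00),
    ('C536379', -27.50),
    ('536380', 34.80),
]

total_sales = sum(amount for _, amount in orders)
unique_orders = len({invoice for invoice, _ in orders})  # 3 distinct invoices
aov = round(total_sales / unique_orders, 2)
# aov == 14.87
```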
Chart 4 (Product Sales & Quantity) Bar Chart
```SQL
SELECT
EXTRACT(quarter FROM invoicedate) AS quarter,
description,
SUM(Sales) AS TotalSales,
SUM(Quantity) AS TotalQuantity
FROM sales_analysis_view
-- Filtering by quarter
WHERE EXTRACT(quarter FROM invoicedate) = 4
GROUP BY quarter, description
ORDER BY quarter, totalsales DESC;
```
![[Screenshot 2024-10-27 at 1.32.06 AM.png]]
Chart 5 (Sales/Quantity Trends Over Time) Line Chart
```SQL
SELECT
DATE_TRUNC('day', invoicedate) :: DATE AS day,
SUM(sales) AS totalsales,
SUM(quantity) AS totalqty
FROM sales_analysis_view
GROUP BY day
ORDER BY day;
```
![[Screenshot 2024-11-08 at 7.57.07 PM.png]]
---
# Data Visualization Creation
Custom Parameters / Calculated Fields
- Created a date parameter that filters visuals by a specific time frame.
- Custom fields filtering current/previous quarter data.
- Custom fields filtering specific yearly and quarterly data.
- Custom field calculating the % difference between quarters.
- Custom indicators implemented through conditional formulas:
  - a '⬤' marker indicating a loss in value relative to the current point in time.
  - categorization of a data point as 'above' or 'below' the average of a value within a given range.
### Tableau Dashboard | [Link](https://public.tableau.com/views/AnalysisEndtoEnd/Dashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)
<div class='tableauPlaceholder' id='viz1732939945002'></div>