# **Online Retail Sales Analysis**
**Project Overview:** This project involves analyzing transactional data from a UK-based online retailer specializing in unique occasion gifts, covering sales transactions from December 2010 to December 2011. The dataset includes essential columns such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. The analysis was conducted using Excel, Python, PostgreSQL, and Tableau to uncover key performance indicators, trends, and insights.
---
### **Project Goals**
1. **Calculate Quarterly KPIs**: Track total sales, quantities sold, and average order value.
2. **Identify Products**: Analyze revenue and quantities for the top 10 products per quarter.
3. **Perform RFM Analysis**: Segment customers for targeted marketing strategies.
4. **Create Data Visualizations**: Create interactive dashboards using Tableau.
---
### **Skills Demonstrated**
- **Database Management & SQL**: Structured queries, CTEs, aggregations, data cleaning, and filtering.
- **Data Analysis**: RFM segmentations, KPIs, customer retention, and churn analysis.
- **Visualization & Insights**: Dashboards in Tableau with actionable KPIs.
---
### **Data Preparation and Cleaning Process**
**SQL**
1. **Data Import and Formatting**: Created a `sales_history` table, reformatted data types, and resolved import errors.
2. **NULL Value Handling**:
   - Used `StockCode` correlations to fill missing descriptions.
   - Filled `CustomerID` values based on stock-to-customer linkages.
   - **Result**: Reduced remaining NULL values to less than 0.06%, minimizing skew in downstream analysis.
**Python**
3. **Created** a `Sales` column (`Quantity * UnitPrice`) for revenue analysis.
4. **Converted** the `InvoiceDate` column to datetime format.
5. **Flagged** canceled orders by cleaning `InvoiceNo` rather than removing them, so cancellations remain reflected in sales totals.
---
### **Data Management & SQL Highlights**
- **SQL Proficiency**: Utilized SELECT, JOIN, GROUP BY, and ORDER BY for data analysis.
- **CTEs for Intermediate Calculations**: Created views for customer sales, product sales, and quarterly AOV.
- **Aggregations**: Employed SUM, COUNT, and AVG to analyze sales data by quarter, month, and product.
---
### **Data Visualizations & Key Metrics**
Created comprehensive Tableau dashboards to visualize:
- Quarterly KPI metrics (total sales, quantity, AOV)
- Product sales and quantities across top-performing items
- Daily, weekly, and monthly sales trends
Custom Calculated Fields and Parameters
- Date parameters for time-based filtering.
- Fields for current/previous quarter comparisons and quarterly/yearly filters.
- Visual indicators for trend deviations and average benchmarks.
This project effectively leveraged Python, PostgreSQL, and Tableau to produce valuable insights into the retailer’s sales performance, ensuring accurate, efficient, and visually compelling data presentation.
# Data Cleaning / Preparation
### SQL
Identifying NULL values:
- `Description`: 1,454 of 541,909 rows (0.29%)
- `CustomerID`: 135,080 of 541,909 rows (24.93%)

`CustomerID` had a significant number of NULL values.
```SQL
SELECT
    COUNT(CASE WHEN InvoiceNo IS NULL THEN 1 END) AS NullInvoiceNo,
    COUNT(CASE WHEN StockCode IS NULL THEN 1 END) AS NullStockCode,
    COUNT(CASE WHEN Description IS NULL THEN 1 END) AS NullDescription,
    COUNT(CASE WHEN Quantity IS NULL THEN 1 END) AS NullQuantity,
    COUNT(CASE WHEN InvoiceDate IS NULL THEN 1 END) AS NullInvoiceDate,
    COUNT(CASE WHEN UnitPrice IS NULL THEN 1 END) AS NullUnitPrice,
    COUNT(CASE WHEN CustomerID IS NULL THEN 1 END) AS NullCustomerID,
    COUNT(CASE WHEN Country IS NULL THEN 1 END) AS NullCountry
FROM sales_history;
```
Output:
![[Screenshot 2024-10-28 at 2.09.31 PM.png]]
Since NULL values affect around 25% of the data, it was better to fill in the blanks than to drop rows. After examining the correlations between columns, I determined that matching `StockCode` to `Description` and `CustomerID` was the best way to fill in missing values. A StockCode (SKU) is a unique code assigned to each product in inventory, which in turn ties it closely to product information and sales transactions.
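The same StockCode-based fill can be sketched in pandas (a toy frame with illustrative stock codes, not the project's full data):

```python
import pandas as pd

# Toy order lines; the second and third rows are missing a description
df = pd.DataFrame({
    'StockCode': ['85123A', '85123A', '71053', '71053'],
    'Description': ['WHITE HANGING HEART', None, None, 'WHITE METAL LANTERN'],
})

# Build a StockCode -> Description lookup from the non-null rows,
# then map it over the codes to fill the gaps
lookup = (df.dropna(subset=['Description'])
            .drop_duplicates('StockCode')
            .set_index('StockCode')['Description'])
df['Description'] = df['Description'].fillna(df['StockCode'].map(lookup))
```

Rows with a code that never appears alongside a description stay NULL, which is why a small residue remains after the SQL updates below.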
Filling in NULL values for `Description`
Of 1,454 NULL values, only 122 remained after the update.
```SQL
UPDATE sales_history AS s1
SET Description = (
    SELECT Description
    FROM sales_history AS s2
    WHERE s1.StockCode = s2.StockCode
      AND s2.Description IS NOT NULL
      AND s2.Description <> ''
    LIMIT 1  -- LIMIT without ORDER BY picks an arbitrary matching row
)
WHERE Description IS NULL
  AND StockCode IS NOT NULL;
```
Filling in NULL values for `CustomerID`
Of 135,080 NULL values, only 3,009 remained after the update.
```SQL
UPDATE sales_history AS s1
SET CustomerID = (
    SELECT CustomerID
    FROM sales_history AS s2
    WHERE s1.StockCode = s2.StockCode
      AND s2.CustomerID IS NOT NULL
    LIMIT 1
)
WHERE CustomerID IS NULL
  AND StockCode IS NOT NULL;
```
As a result, the share of remaining NULL values fell below 0.06%, a small enough amount to remove without skewing further analysis.
### Python
Creating Sales Columns for Monetary segment of RFM analysis
```python
new_df['Sales'] = new_df['Quantity'] * new_df['UnitPrice']
```
Formatting date
```python
new_df['InvoiceDate'] = pd.to_datetime(new_df['InvoiceDate'])
```
Creating a `CleanedInvoiceNo` column so canceled orders are taken into account when calculating monetary value, along with an `IsCanceled` flag marking which rows are cancellations for potential returns analysis.
```python
# Cancellations are prefixed with 'C'; strip only the leading 'C',
# not every 'C' in the string
new_df['IsCanceled'] = new_df['InvoiceNo'].str.startswith('C')
new_df['CleanedInvoiceNo'] = new_df['InvoiceNo'].str.replace('^C', '', regex=True)
```
![[Screenshot 2024-12-13 at 3.55.10 PM.png]]
# Data Analysis
### Market Analysis
Customer First Transaction
```SQL
SELECT customerid, MIN(invoicedate) AS acquisition_date
FROM sales_analysis_view
GROUP BY customerid
ORDER BY acquisition_date;
```
![[Screenshot 2024-11-08 at 7.45.59 PM.png]]
Registration Rolling Total
```SQL
WITH reg_dates AS (
SELECT
customerid,
-- Earliest registration date using MIN() function
MIN(invoicedate) AS reg_date
FROM sales_analysis_view
GROUP BY customerid),
regs AS (
SELECT
-- Truncating by month
DATE_TRUNC('month', reg_date) :: DATE AS reg_month,
COUNT(DISTINCT customerid) AS regs
FROM reg_dates
GROUP BY reg_month)
SELECT
-- registrations running total by month
TO_CHAR(reg_month, 'FMMonth') AS month_name,
SUM(regs) OVER (ORDER BY reg_month ASC) AS regs_rt
FROM regs
-- Order by month in ascending order
ORDER BY reg_month ASC;
```
![[Screenshot 2024-11-01 at 3.43.20 PM.png]]
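The window-function running total behaves like this cumulative loop (toy monthly counts, not the real registration figures):

```python
# Toy monthly registration counts; the running total accumulates in month order
monthly_regs = {'2010-12': 948, '2011-01': 421, '2011-02': 380}

running = {}
total = 0
for month in sorted(monthly_regs):
    total += monthly_regs[month]
    running[month] = total
# running == {'2010-12': 948, '2011-01': 1369, '2011-02': 1749}
```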
MoM Customer Acquisition
```SQL
-- MoM Customer Acquisition
SELECT
-- TO_CHAR used to directly extract month name
TO_CHAR(DATE_TRUNC('month', acquisition_date), 'Month') AS acquisition_month,
COUNT(*) AS new_customers
FROM (
SELECT customerid, MIN(invoicedate) AS acquisition_date
FROM sales_analysis_view
GROUP BY customerid
) AS customer_acquisitions
GROUP BY DATE_TRUNC('month', acquisition_date)
ORDER BY DATE_TRUNC('month', acquisition_date);
```
![[Screenshot 2024-11-01 at 2.10.43 PM 2.png]]
MoM Retention Rate
```SQL
WITH user_monthly_activity AS (
SELECT DISTINCT
DATE_TRUNC('month', invoicedate) AS purchase_month,
customerid
FROM sales_analysis_view
)
SELECT
TO_CHAR(previous.purchase_month, 'FMMonth') AS month,
ROUND(COUNT(DISTINCT current.customerid) :: NUMERIC /
GREATEST(COUNT(DISTINCT previous.customerid), 1), 2) AS retention_rate
FROM user_monthly_activity AS previous
LEFT JOIN user_monthly_activity AS current
ON previous.customerid = current.customerid
AND current.purchase_month = previous.purchase_month + INTERVAL '1 month'
GROUP BY previous.purchase_month
ORDER BY previous.purchase_month ASC;
```
![[Screenshot 2024-11-01 at 2.58.47 PM.png]]
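The self-join above reduces to set intersection: of the customers active in one month, how many are also active in the next. A pure-Python sketch with made-up customer IDs:

```python
# Month -> set of customers who purchased that month (hypothetical IDs)
activity = {
    '2011-01': {'c1', 'c2', 'c3', 'c4'},
    '2011-02': {'c2', 'c3', 'c5'},
    '2011-03': {'c3'},
}

retention = {}
months = sorted(activity)
for prev, cur in zip(months, months[1:]):
    retained = activity[prev] & activity[cur]          # active in both months
    retention[prev] = round(len(retained) / max(len(activity[prev]), 1), 2)
# retention == {'2011-01': 0.5, '2011-02': 0.33}
```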
Quarterly Sales
```SQL
SELECT
    TO_CHAR(DATE_TRUNC('quarter', invoicedate), 'YYYY "Q"Q') AS quarter,
    SUM(sales) AS total_sales
FROM sales_analysis_view
GROUP BY quarter
ORDER BY quarter;
```
![[Screenshot 2024-11-08 at 7.53.40 PM.png]]
Monthly Sales
```SQL
SELECT
    TO_CHAR(DATE_TRUNC('month', invoicedate), 'Month') AS month,
    SUM(sales) AS total_sales
FROM sales_analysis_view
GROUP BY DATE_TRUNC('month', invoicedate)
ORDER BY DATE_TRUNC('month', invoicedate);
```
![[Screenshot 2024-11-08 at 7.51.34 PM.png]]
Weekly Sales
```SQL
SELECT
    DATE_TRUNC('week', invoicedate) :: DATE AS week,
    SUM(sales) AS total_sales
FROM sales_analysis_view
GROUP BY week
ORDER BY week;
```
![[Screenshot 2024-11-08 at 7.55.11 PM.png]]
Market Share
```SQL
SELECT Description,
SUM(UnitPrice * Quantity) AS TotalSales,
ROUND((SUM(UnitPrice * Quantity) / (SELECT SUM(UnitPrice * Quantity) FROM sales_history) * 100), 2) AS MarketShare
FROM sales_history
GROUP BY Description
ORDER BY TotalSales DESC;
```
![[Screenshot 2024-11-08 at 7.48.19 PM.png]]
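The market-share query divides each product's revenue by the grand total; the same percentage-of-total logic in miniature (hypothetical products and revenues):

```python
# Product -> revenue (toy values); market share = revenue / grand total
sales = {'PRODUCT A': 500.0, 'PRODUCT B': 300.0, 'PRODUCT C': 200.0}

total = sum(sales.values())
share = {name: round(rev / total * 100, 2) for name, rev in sales.items()}
# share == {'PRODUCT A': 50.0, 'PRODUCT B': 30.0, 'PRODUCT C': 20.0}
```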
## RFM Analysis
Initially, I attempted to perform this analysis using SQL but was dissatisfied with the final output. SQL lacked the feature engineering capabilities provided by certain Python libraries, which prompted me to explore alternative approaches for transforming the data. Ultimately, I settled on this method.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler
# Fixed reference date (just after the last transaction) for recency calculation
reference_date = pd.to_datetime('2012-01-01')
# Aggregated RFM metrics
rfm = new_df.groupby('CustomerID').agg(
Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
Frequency=('CleanedInvoiceNo', 'count'),
Monetary=('Sales', 'sum')
).reset_index()
# Log transformations
rfm['LogFrequency'] = np.log1p(rfm['Frequency']) # log1p handles zero values
# RobustScaler for Monetary
robust_scaler = RobustScaler()
rfm['MonetaryScaled'] = robust_scaler.fit_transform(rfm[['Monetary']])
rfm['MonetaryScore'] = pd.qcut(rfm['MonetaryScaled'], q=5, labels=[1, 2, 3, 4, 5])
# Min-Max Scaling for Scores
minmax_scaler = MinMaxScaler(feature_range=(1, 5))
rfm['RecencyScore'] = minmax_scaler.fit_transform(rfm[['Recency']])
rfm['RecencyScore'] = (6 - rfm['RecencyScore']).round() # Reverse and round
rfm['FrequencyScore'] = minmax_scaler.fit_transform(rfm[['LogFrequency']]).round()
# Combine Scores into RFM Segment
rfm['RFM_Segment'] = (
rfm['RecencyScore'].astype(int).astype(str) +
rfm['FrequencyScore'].astype(int).astype(str) +
rfm['MonetaryScore'].astype(int).astype(str)
)
conditions = [
rfm['RFM_Segment'].isin(['555', '554', '544', '545', '454', '455', '445']),
rfm['RFM_Segment'].isin(['543', '444', '435', '355', '354', '345', '344', '335']),
rfm['RFM_Segment'].isin([
'553', '551', '552', '541', '542', '533', '532', '531', '452', '451', '442', '441', '431', '453', '433', '432',
'423', '353', '352', '351', '342', '341', '333', '323'
]),
rfm['RFM_Segment'].isin(['512', '511', '422', '421', '412', '411', '311']),
rfm['RFM_Segment'].isin([
'525', '524', '523', '522', '521', '515', '514', '513', '425', '424', '413', '414', '415', '315', '314', '313'
]),
rfm['RFM_Segment'].isin(['535', '534', '443', '434', '343', '334', '325', '324']),
rfm['RFM_Segment'].isin(['331', '321', '312', '221', '213', '231', '241', '251']),
rfm['RFM_Segment'].isin([
'255', '254', '245', '244', '253', '252', '243', '242', '235', '234', '225', '224', '153', '152', '145', '143',
'142', '135', '134', '133', '125', '124'
]),
rfm['RFM_Segment'].isin(['155', '144', '214', '215', '115', '114', '113']),
rfm['RFM_Segment'].isin(['332', '322', '231', '241', '251', '233', '232', '223', '222', '132', '123', '122', '212', '211']),
rfm['RFM_Segment'].isin(['111', '112', '121', '131', '141', '151'])
]
labels = [
'Champions', 'Loyal', 'Potential Loyalist', 'New Customers', 'Promising',
'Need Attention', 'About To Sleep', 'At Risk', 'Cannot Lose Them',
'Hibernating Customers', 'Lost Customers'
]
# Assign labels to RFM segments
rfm['RFM_Label'] = np.select(conditions, labels, default='Unknown')
rfm_sorted = rfm.sort_values(by='RFM_Label')
rfm_sorted.to_csv('rfm_analysis.csv', index=False)
```
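The recency scoring above scales days-since-last-purchase into [1, 5] and then reverses it so recent buyers score highest; the trick in isolation, with toy day counts rather than the project's data:

```python
def minmax_scale(values, lo=1.0, hi=5.0):
    """Linearly rescale values into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

recency = [2, 30, 100, 365]              # days since last purchase (toy data)
scaled = minmax_scale(recency)           # 1 = most recent, 5 = least recent
scores = [round(6 - s) for s in scaled]  # reverse so recent buyers score high
# scores == [5, 5, 4, 1]
```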
```python
segment_summary = rfm['RFM_Label'].value_counts(normalize=True).reset_index()
segment_summary.columns = ['RFM_Label', 'Proportion']
segment_summary['Percentage'] = (segment_summary['Proportion'] * 100).round(2)
print(segment_summary[['RFM_Label', 'Percentage']])
```
![[Screenshot 2024-12-20 at 12.41.35 AM.png]]
Insights and Recommendations
1. Need Attention (27.77%)
   - This is the largest segment, suggesting that a significant portion of the customer base is disengaging but hasn't yet moved to the 'At Risk' or 'Lost Customers' segments.
   - Recommendations
     - Develop ways to re-engage customers, possibly through promotions or direct emails.
     - Analyze common characteristics (low frequency, long recency) and address these issues.
     - Offer loyalty programs or incentives.
2. Potential Loyalist (17.25%)
   - This segment represents customers who are actively engaging but haven't yet reached the 'Loyal' or 'Champions' segments.
   - Recommendations
     - Provide personalized offers or benefits to encourage customers to buy more.
     - Analyze their purchasing patterns to create targeted marketing campaigns.
     - Offer loyalty rewards such as early access to products or discounts.
3. Promising (16.95%)
   - These customers have shown some activity but aren't fully engaged.
   - Recommendations
     - Focus on building relationships through one-on-one conversations and a deeper understanding of their needs.
     - Track purchase patterns and provide tailored promotions.
# Data Validation
Chart 1-3 (Total Sales, Quantity, & Average Order Value) Quarterly KPI Bins
Problem 1: The Q4 sales/quantity KPI was significantly overvalued in Tableau because data from Q4 2010 and Q4 2011 merged into one bin.
Solution: Remove the 2010 data or add a filter by year.
Problem 2: Average order value didn't match in Tableau; `invoiceno` counts differed between PostgreSQL and Tableau because Tableau imported the invoice column as a numeric data type rather than a string.
Solution: Set the invoice column to a string data type when importing into Tableau.
```SQL
SELECT
TO_CHAR(DATE_TRUNC('quarter', invoicedate), 'YYYY "Q"Q') AS quarter,
SUM(sales) AS total_sales,
SUM(quantity) AS total_quantity,
ROUND(SUM(sales) / COUNT(DISTINCT invoiceno), 2) AS average_order_value,
COUNT(DISTINCT(invoiceno)) AS unique_orders
FROM sales_analysis_view
GROUP BY quarter;
```
![[Screenshot 2024-10-27 at 1.26.43 AM.png]]
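Average order value is total sales over the distinct invoice count, with cancellations kept in so their negative amounts reduce the total. A small worked example with hypothetical invoices:

```python
# (invoice, line amount); two lines share invoice 536365, and the
# 'C'-prefixed invoice is a cancellation with a negative amount
orders = [
    ('536365', 15.30),
    ('536365', 22.00),
    ('C536379', -27.50),
    ('536380', 34.80),
]

total_sales = sum(amount for _, amount in orders)
unique_orders = len({invoice for invoice, _ in orders})  # 3 distinct invoices
aov = round(total_sales / unique_orders, 2)
# aov == 14.87
```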
Chart 4 (Product Sales & Quantity) Bar Chart
```SQL
SELECT
EXTRACT(quarter FROM invoicedate) AS quarter,
description,
SUM(Sales) AS TotalSales,
SUM(Quantity) AS TotalQuantity
FROM sales_analysis_view
-- Filtering by quarter
WHERE EXTRACT(quarter FROM invoicedate) = 4
GROUP BY quarter, description
ORDER BY quarter, totalsales DESC;
```
![[Screenshot 2024-10-27 at 1.32.06 AM.png]]
Chart 5 (Sales/Quantity Trends Over Time) Line Chart
```SQL
SELECT
DATE_TRUNC('day', invoicedate) :: DATE AS day,
SUM(sales) AS totalsales,
SUM(quantity) AS totalqty
FROM sales_analysis_view
GROUP BY day
ORDER BY day;
```
![[Screenshot 2024-11-08 at 7.57.07 PM.png]]
---
# Data Visualization Creation
Custom Parameters / Calculated Fields
- Created a date parameter that filters visuals by a specific time frame.
- Custom fields filtering current/previous quarter data.
- Custom fields filtering specific yearly and quarterly data.
- Custom field calculating the % difference between quarters.
- Custom indicators implemented through conditional formulas:
  - a '⬤' marker indicating a loss in value relative to the current point in time.
  - categorization of a data point as 'above' or 'below' the average of a value within a given range.
### Tableau Dashboard | [Link](https://public.tableau.com/views/AnalysisEndtoEnd/Dashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)
<div class='tableauPlaceholder' id='viz1732939945002'></div>