🛒 Shop Data Explorer – End-to-End Data Engineering & Analytics App

🎯 Project Goal

Shop Data Explorer is an interactive analytics application that simulates a retail sales environment.
It was built to demonstrate an end-to-end pipeline: from raw data with inconsistencies → through ETL in Python & SQL → to clean, structured datasets → to business insights presented in dashboards.

The goal was to practice Data Engineering concepts (ETL, SQL query building, Python pipelines, cleaning) while also adding Data Analytics & Visualization layers for business users.

🧰 Technologies and Tools

| Category | Technology | Purpose |
| --- | --- | --- |
| Database | `SQLite` | Simulated transactional database |
| Backend / ETL | `Python`, `pandas` | Data extraction, cleaning, transformation |
| Query Layer | `SQL` | Dynamic query building & aggregation |
| Visualization | `Streamlit`, `Seaborn`, `Matplotlib`, `HTML/CSS`, `Plotly` | Interactive dashboards & custom leaderboards |
| Version Control | `GitHub` | Code versioning and documentation |

🗃️ Database Creation

The database was created from scratch to simulate real-world messy data:

Inconsistent category names (e.g. "electronics" vs "Electronics" vs "elektronika")
Country mismatches (e.g. "PL", "Polska", "Poland")
Missing values in customers and orders
Fake nulls ("null", "none", empty strings)

➡️ This design forced the ETL pipeline to handle cleaning, mapping, and transformation, just like in real business systems.
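The cleaning and mapping steps above can be sketched roughly as follows. This is a minimal illustration rather than the app's actual code: `FAKE_NULLS`, `CATEGORY_MAP`, `COUNTRY_MAP`, and `clean_value` are hypothetical names, and the mappings only cover the examples listed in this section.

```python
# Tokens treated as missing values; taken from the "fake nulls" listed above.
FAKE_NULLS = {"", "null", "none"}

# Hypothetical lookup tables: the real mappings used by the app are not shown.
CATEGORY_MAP = {"electronics": "Electronics", "elektronika": "Electronics"}
COUNTRY_MAP = {"pl": "Poland", "polska": "Poland", "poland": "Poland"}


def clean_value(value, mapping):
    """Return a canonical label, or None when the value is missing."""
    # `value != value` is only true for NaN, which pandas uses for missing data.
    if value is None or value != value:
        return None
    text = str(value).strip()
    key = text.lower()
    if key in FAKE_NULLS:
        return None
    # Labels without a mapping pass through unchanged (whitespace-trimmed).
    return mapping.get(key, text)


raw_countries = ["PL", "Polska", " Poland ", "null", "", None, "Germany"]
print([clean_value(v, COUNTRY_MAP) for v in raw_countries])
```

Because the helper is plain Python, it could be applied to a pandas column with `Series.map`, for example `df["Country"].map(lambda v: clean_value(v, COUNTRY_MAP))`, assuming a DataFrame `df` with a `Country` column.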

🗃️ UML diagram of the database schema:

📊 Visualization Section

1. 🥇 Top Performance Mode

This mode shows leaderboards of best or worst performing entities, filtered by year:

Top Customers
Top Products
Top Categories
Top Months
Top Countries

Each leaderboard is generated dynamically with SQL CTE queries, cleaned in Python, and rendered in Streamlit with custom HTML badges (🥇 🥈 🥉 🚫 📉).
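As a rough idea of how one leaderboard row with a badge might be turned into HTML, here is a small sketch. The function name `leaderboard_row`, the CSS class names, and the fallback label for ranks outside the top three are assumptions made for illustration, not the app's real markup.

```python
import html

# Medal badges for the top three places; other ranks fall back to "#<rank>".
BADGES = {1: "🥇", 2: "🥈", 3: "🥉"}


def leaderboard_row(rank, name, total_sales):
    """Build the HTML for a single leaderboard row (hypothetical markup)."""
    badge = BADGES.get(rank, f"#{rank}")
    # Escape the name so data from the database cannot inject markup.
    safe_name = html.escape(name)
    return (
        '<div class="leaderboard-row">'
        f'<span class="badge">{badge}</span> '
        f'<span class="name">{safe_name}</span> '
        f'<span class="total">{total_sales:,.2f}</span>'
        "</div>"
    )


print(leaderboard_row(1, "Adam Zielony", 1234.5))
```

In Streamlit, a string like this could be displayed with `st.markdown(row_html, unsafe_allow_html=True)`.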

📸 Customers – Best in Sales Over the Years (Leaderboard)

Leaderboard ranking customers by total sales per year, calculated with SQL window functions and visualized in Streamlit with custom HTML badges 🥇🥈🥉.
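A ranking query of this kind might look like the sketch below. It reuses the table and column names from the Deep Dive query in this document, but the CTE name `YearlySales`, the helper functions, and the sample rows (apart from the customer Adam Zielony) are made up for the demo. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

BADGES = {1: "🥇", 2: "🥈", 3: "🥉"}

RANKING_SQL = """
WITH YearlySales AS (
    SELECT
        c.FirstName || ' ' || c.LastName AS Customer,
        strftime('%Y', o.OrderDate) AS OrderYear,
        SUM(od.UnitPrice * od.Quantity) AS TotalSales
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
    JOIN OrderDetails od ON od.OrderID = o.OrderID
    GROUP BY c.CustomerID, OrderYear
)
SELECT Customer, TotalSales,
       RANK() OVER (ORDER BY TotalSales DESC) AS SalesRank
FROM YearlySales
WHERE OrderYear = ?
ORDER BY SalesRank;
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
        CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT);
        CREATE TABLE OrderDetails (OrderID INTEGER, UnitPrice REAL, Quantity INTEGER);
        INSERT INTO Customers VALUES (1, 'Adam', 'Zielony'), (2, 'Ewa', 'Nowak'), (3, 'Jan', 'Kowalski');
        INSERT INTO Orders VALUES (1, 1, '2024-01-15'), (2, 2, '2024-02-10'),
                                  (3, 3, '2024-03-05'), (4, 2, '2023-06-01');
        INSERT INTO OrderDetails VALUES (1, 100.0, 2), (2, 50.0, 1), (3, 30.0, 4), (4, 999.0, 1);
    """)
    return conn


def top_customers(conn, year):
    """Return (badge, customer, total_sales) tuples for one year, best first."""
    rows = conn.execute(RANKING_SQL, (str(year),)).fetchall()
    return [(BADGES.get(rank, ""), name, total) for name, total, rank in rows]


for row in top_customers(build_demo_db(), 2024):
    print(row)
```

The year is passed as a bound parameter, so the same query text serves every year selected in the filter.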

📸 Screenshot: Best-Selling Products Over the Years (Leaderboard)

This leaderboard dynamically ranks products based on total sales per year, using SQL CTEs and window functions.
It is then cleaned in Python (pandas) and displayed in Streamlit with custom HTML badges 🥇🥈🥉.

📸 Screenshot: Worst-Selling Months in 2024 (Leaderboard)

This leaderboard dynamically ranks months by total sales in 2024, lowest first, using SQL CTEs and window functions.
It is then cleaned in Python (pandas) and displayed in Streamlit with custom HTML badges 🥇🥈🥉.


2. 🔍 Deep Dive: Client (Work in Progress)

This mode allows a detailed look into a single customer’s activity:

Monthly sales trends (total and average)
Categories purchased by the client
Product purchase breakdown

📸 Monthly Sales Overview – Adam Zielony

This chart shows monthly total and average sales for the customer Adam Zielony.
Data is extracted using a SQL CTE, aggregated by month, and then cleaned in Python:

WITH CustomerMonthlySales AS (
    SELECT 
        strftime('%Y', o.OrderDate) AS OrderYear,
        strftime('%m', o.OrderDate) AS Month,
        SUM(od.UnitPrice * od.Quantity) AS TotalSales,
        AVG(od.UnitPrice * od.Quantity) AS AvgSales
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
    JOIN OrderDetails od ON od.OrderID = o.OrderID
    WHERE c.FirstName || ' ' || c.LastName = 'Adam Zielony'
    GROUP BY OrderYear, Month
)
SELECT * FROM CustomerMonthlySales ORDER BY OrderYear, Month;
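To reuse this query for any customer, the name can be passed as a bound parameter instead of being written into the SQL text. The sketch below runs against a throwaway in-memory database. The function name `monthly_sales` and the sample rows are hypothetical, and it concatenates names with `||` because the `CONCAT` function is only available in recent SQLite releases.

```python
import sqlite3

MONTHLY_SALES_SQL = """
WITH CustomerMonthlySales AS (
    SELECT
        strftime('%Y', o.OrderDate) AS OrderYear,
        strftime('%m', o.OrderDate) AS Month,
        SUM(od.UnitPrice * od.Quantity) AS TotalSales,
        AVG(od.UnitPrice * od.Quantity) AS AvgSales
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
    JOIN OrderDetails od ON od.OrderID = o.OrderID
    WHERE c.FirstName || ' ' || c.LastName = ?
    GROUP BY OrderYear, Month
)
SELECT * FROM CustomerMonthlySales ORDER BY OrderYear, Month;
"""


def monthly_sales(conn, full_name):
    """Return (year, month, total, average) rows for one customer."""
    return conn.execute(MONTHLY_SALES_SQL, (full_name,)).fetchall()


# Made-up sample data, only to show the shape of the result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT);
    CREATE TABLE OrderDetails (OrderID INTEGER, UnitPrice REAL, Quantity INTEGER);
    INSERT INTO Customers VALUES (1, 'Adam', 'Zielony'), (2, 'Ewa', 'Nowak');
    INSERT INTO Orders VALUES (1, 1, '2024-01-15'), (2, 1, '2024-02-03'), (3, 2, '2024-01-20');
    INSERT INTO OrderDetails VALUES (1, 100.0, 2), (1, 50.0, 1), (2, 30.0, 1), (3, 10.0, 5);
""")

for row in monthly_sales(conn, "Adam Zielony"):
    print(row)
```

Note that `AvgSales` here is the average value per order line within the month, which follows from averaging `UnitPrice * Quantity` over the rows of `OrderDetails`.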

📸 Monthly Sales Overview – Adam Zielony (screenshot):

3. 📦 Product Insights (Work in Progress)

A new module for analyzing product-level performance in more detail:

Cross-year product comparisons
Price and sales evolution
Category-product breakdowns

🚀 Possible Updates & Improvements

🌐 Deployment of the app in the cloud (Streamlit Cloud or DigitalOcean VPS)
🗄️ Migrating from SQLite → PostgreSQL on AWS RDS / Azure SQL
☁️ Using cloud storage (AWS S3, Azure Blob) for input data instead of local DB
🔄 Automating ETL jobs with Apache Airflow / Prefect
🧩 Integrating external data sources (e.g. fetching datasets via Python API or SAP connectors)
📦 Expanding Product Insights with advanced visualizations (time-series, cohort analysis, cross-sell analysis)
🛠️ Adding unit tests & data validation with pytest or Great Expectations
🔐 Handling data security and access control (roles, masking sensitive fields)
⚡ Scaling ETL pipelines with PySpark for distributed data processing
🔄 Migrating transformations from pandas → PySpark DataFrames for larger datasets
🏗️ Testing Spark SQL alongside SQLAlchemy queries for performance comparison

🧠 What I Learned

🗄️ Designing a database schema with realistic dirty data
🐍 Building dynamic SQL query builders in Python (SELECT, WHERE, filters)
⚙️ Implementing ETL pipelines (cleaning, mapping, joining multiple tables)
📊 Ranking with SQL RANK() OVER and aggregations
🎨 Creating interactive dashboards in Streamlit with custom HTML & CSS
🚀 Applying a Data Engineering mindset: structured, reusable, scalable solutions
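As an illustration of the dynamic query builder idea mentioned in this list, here is a minimal sketch. The names `ALLOWED_TABLES`, `ALLOWED_COLUMNS`, and `build_query` are hypothetical, and the column names are assumed for the example. The point is the pattern: identifiers are checked against a whitelist, and values are always passed as bound parameters.

```python
# Hypothetical whitelists: identifiers cannot be bound as parameters,
# so they are validated before being placed into the SQL text.
ALLOWED_TABLES = {"Customers", "Orders", "OrderDetails"}
ALLOWED_COLUMNS = {"Country", "CustomerID", "OrderID"}


def build_query(table, filters):
    """Build a SELECT with an optional WHERE clause from a dict of filters.

    Returns the SQL text and the list of values to bind to its placeholders.
    """
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Unknown table: {table}")
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"Unknown column: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_query("Customers", {"Country": "Poland"})
print(sql, params)
```

The returned pair can be passed directly to `sqlite3`, for example `conn.execute(sql, params)`.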

🎯 For Recruiters

This project demonstrates that I can deliver an end-to-end data solution:

🗃️ From raw, inconsistent data
⚙️ Through SQL + Python ETL pipelines
📊 To clean dashboards in Streamlit

➡️ Core skills: SQL | Python (pandas) | ETL | Streamlit | Data Visualization