πŸ›’ Shop Data Explorer – End-to-End Data Engineering & Analytics AppΒΆ

🎯 Project Goal¢

Shop Data Explorer is an interactive analytics application that simulates a retail sales environment.
It was built to demonstrate an end-to-end pipeline: from raw data with inconsistencies β†’ through ETL in Python & SQL β†’ to clean, structured datasets β†’ to business insights presented in dashboards.

The goal was to practice Data Engineering concepts (ETL, SQL query building, Python pipelines, cleaning) while also adding Data Analytics & Visualization layers for business users.


🧰 Technologies and Tools¢

Category Technology Purpose
Database SQLite Simulated transactional database
Backend / ETL Python, pandas Data extraction, cleaning, transformation
Query Layer SQL Dynamic query building & aggregation
Visualization Streamlit, Seaborn, Matplotlib, HTML/CSS, Plotly Interactive dashboards & custom leaderboards
Version Control GitHub Code versioning and documentation

πŸ—ƒοΈ Database CreationΒΆ

The database was created from scratch to simulate real-world messy data:

  • Inconsistent category names (e.g. "electronics" vs "Electronics" vs "elektronika")
  • Country mismatches (e.g. "PL", "Polska", "Poland")
  • Missing values in customers and orders
  • Fake nulls ("null", "none", empty strings)

➑️ This design forced the ETL pipeline to handle cleaning, mapping, and transformation, just like in real business systems.

πŸ—ƒοΈUML diagram of the database schema:

image.png


πŸ“Š Visualization SectionΒΆ

1. πŸ₯‡ Top Performance ModeΒΆ

This mode shows leaderboards of best or worst performing entities, filtered by year:

  • Top Customers
  • Top Products
  • Top Categories
  • Top Months
  • Top Countries

Each leaderboard is generated dynamically with SQL CTE queries, cleaned in Python, and rendered in Streamlit with custom HTML badges (πŸ₯‡ πŸ₯ˆ πŸ₯‰ 🚫 πŸ“‰).

πŸ“Έ Customers – Best in Sales Over the Years (Leaderboard)ΒΆ

Leaderboard ranking customers by total sales per year, calculated with SQL window functions and visualized in Streamlit with custom HTML badges πŸ₯‡πŸ₯ˆπŸ₯‰.

image-2.png

πŸ“Έ Screenshot: Products that have been selling the Best over the Years (Leaderboard)ΒΆ

This leaderboard dynamically ranks products based on total sales per year, using SQL CTEs and window functions.
It is then cleaned in Python (pandas) and displayed in Streamlit with custom HTML badges πŸ₯‡πŸ₯ˆπŸ₯‰.

image-3.png

πŸ“Έ Screenshot: Months that had worst selling scores in 2024 (Leaderboard)ΒΆ

This leaderboard dynamically ranks products based on slaes in 2024, using SQL CTEs and window functions.
It is then cleaned in Python (pandas) and displayed in Streamlit with custom HTML badges πŸ₯‡πŸ₯ˆπŸ₯‰.

image-4.pngΒΆ

2. πŸ” Deep Dive: Client (Work in progress)ΒΆ

This mode allows a detailed look into a single customer’s activity:

  • Monthly sales trends (total and average)
  • Categories purchased by the client
  • Product purchase breakdown

πŸ“Έ Monthly Sales Overview – Adam ZielonyΒΆ

This chart shows monthly total and average sales for the customer Adam Zielony.
Data is extracted using a SQL CTE, aggregated by month, and then cleaned in Python:

WITH CustomerMonthlySales AS (
    SELECT 
        strftime('%Y', o.OrderDate) AS OrderYear,
        strftime('%m', o.OrderDate) AS Month,
        SUM(od.UnitPrice * od.Quantity) AS TotalSales,
        AVG(od.UnitPrice * od.Quantity) AS AvgSales
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
    JOIN OrderDetails od ON od.OrderID = o.OrderID
    WHERE CONCAT(c.FirstName, ' ', c.LastName) = 'Adam Zielony'
    GROUP BY OrderYear, Month
)
SELECT * FROM CustomerMonthlySales;

πŸ“Έ Monthly Sales Overview – Adam Zielony (screanshot):

image-5.png


3. πŸ“¦ Product Insights (Work in Progress)ΒΆ

A new module for analyzing product-level performance in more detail:

  • Cross-year product comparisons
  • Price and sales evolution
  • Category-product breakdowns

πŸš€ Possible Updates & ImprovementsΒΆ

  • 🌐 Deployment of the app in the cloud (Streamlit Cloud or DigitalOcean VPS)
  • πŸ—„οΈ Migrating from SQLite β†’ PostgreSQL on AWS RDS / Azure SQL
  • ☁️ Using cloud storage (AWS S3, Azure Blob) for input data instead of local DB
  • πŸ”„ Automating ETL jobs with Apache Airflow / Prefect
  • 🧩 Integrating external data sources (e.g. fetching datasets via Python API or SAP connectors)
  • πŸ“¦ Expanding Product Insights with advanced visualizations (time-series, cohort analysis, cross-sell analysis)
  • πŸ› οΈ Adding unit tests & data validation with pytest or Great Expectations
  • πŸ” Handling data security and access control (roles, masking sensitive fields)
  • ⚑ Scaling ETL pipelines with PySpark for distributed data processing
  • πŸ”„ Migrating transformations from pandas β†’ PySpark DataFrames for larger datasets
  • πŸ—οΈ Testing Spark SQL alongside SQLAlchemy queries for performance comparison

🧠 What I Learned¢

  • πŸ—„οΈ Designing a database schema with realistic dirty data
  • 🐍 Building dynamic SQL query builders in Python (SELECT, WHERE, filters)
  • βš™οΈ Implementing ETL pipelines (cleaning, mapping, joining multiple tables)
  • πŸ“Š Ranking with SQL RANK() OVER and aggregations
  • 🎨 Creating interactive dashboards in Streamlit with custom HTML & CSS
  • πŸš€ Applying a Data Engineering mindset: structured, reusable, scalable solutions

🎯 For Recruiter¢

This project proves I can deliver an end-to-end data solution:

  • πŸ—ƒοΈ From raw, inconsistent data
  • βš™οΈ Through SQL + Python ETL pipelines
  • πŸ“Š To clean dashboards in Streamlit

➑️ Core skills: SQL | Python (pandas) | ETL | Streamlit | Data Visualization