π Shop Data Explorer β End-to-End Data Engineering & Analytics AppΒΆ
π― Project GoalΒΆ
Shop Data Explorer is an interactive analytics application that simulates a retail sales environment.
It was built to demonstrate an end-to-end pipeline: from raw data with inconsistencies β through ETL in Python & SQL β to clean, structured datasets β to business insights presented in dashboards.
The goal was to practice Data Engineering concepts (ETL, SQL query building, Python pipelines, cleaning) while also adding Data Analytics & Visualization layers for business users.
π§° Technologies and ToolsΒΆ
Category | Technology | Purpose |
---|---|---|
Database | SQLite |
Simulated transactional database |
Backend / ETL | Python , pandas |
Data extraction, cleaning, transformation |
Query Layer | SQL |
Dynamic query building & aggregation |
Visualization | Streamlit , Seaborn , Matplotlib , HTML/CSS , Plotly |
Interactive dashboards & custom leaderboards |
Version Control | GitHub |
Code versioning and documentation |
ποΈ Database CreationΒΆ
The database was created from scratch to simulate real-world messy data:
- Inconsistent category names (e.g. "electronics" vs "Electronics" vs "elektronika")
- Country mismatches (e.g. "PL", "Polska", "Poland")
- Missing values in customers and orders
- Fake nulls (
"null"
,"none"
, empty strings)
β‘οΈ This design forced the ETL pipeline to handle cleaning, mapping, and transformation, just like in real business systems.
ποΈUML diagram of the database schema:
π Visualization SectionΒΆ
1. π₯ Top Performance ModeΒΆ
This mode shows leaderboards of best or worst performing entities, filtered by year:
- Top Customers
- Top Products
- Top Categories
- Top Months
- Top Countries
Each leaderboard is generated dynamically with SQL CTE queries, cleaned in Python, and rendered in Streamlit with custom HTML badges (π₯ π₯ π₯ π« π).
πΈ Customers β Best in Sales Over the Years (Leaderboard)ΒΆ
Leaderboard ranking customers by total sales per year, calculated with SQL window functions and visualized in Streamlit with custom HTML badges π₯π₯π₯.
πΈ Screenshot: Products that have been selling the Best over the Years (Leaderboard)ΒΆ
This leaderboard dynamically ranks products based on total sales per year, using SQL CTEs and window functions.
It is then cleaned in Python (pandas) and displayed in Streamlit with custom HTML badges π₯π₯π₯.
πΈ Screenshot: Months that had worst selling scores in 2024 (Leaderboard)ΒΆ
This leaderboard dynamically ranks products based on slaes in 2024, using SQL CTEs and window functions.
It is then cleaned in Python (pandas) and displayed in Streamlit with custom HTML badges π₯π₯π₯.
ΒΆ
2. π Deep Dive: Client (Work in progress)ΒΆ
This mode allows a detailed look into a single customerβs activity:
- Monthly sales trends (total and average)
- Categories purchased by the client
- Product purchase breakdown
πΈ Monthly Sales Overview β Adam ZielonyΒΆ
This chart shows monthly total and average sales for the customer Adam Zielony.
Data is extracted using a SQL CTE, aggregated by month, and then cleaned in Python:
WITH CustomerMonthlySales AS (
SELECT
strftime('%Y', o.OrderDate) AS OrderYear,
strftime('%m', o.OrderDate) AS Month,
SUM(od.UnitPrice * od.Quantity) AS TotalSales,
AVG(od.UnitPrice * od.Quantity) AS AvgSales
FROM Customers c
JOIN Orders o ON c.CustomerID = o.CustomerID
JOIN OrderDetails od ON od.OrderID = o.OrderID
WHERE CONCAT(c.FirstName, ' ', c.LastName) = 'Adam Zielony'
GROUP BY OrderYear, Month
)
SELECT * FROM CustomerMonthlySales;
πΈ Monthly Sales Overview β Adam Zielony (screanshot):
3. π¦ Product Insights (Work in Progress)ΒΆ
A new module for analyzing product-level performance in more detail:
- Cross-year product comparisons
- Price and sales evolution
- Category-product breakdowns
π Possible Updates & ImprovementsΒΆ
- π Deployment of the app in the cloud (Streamlit Cloud or DigitalOcean VPS)
- ποΈ Migrating from SQLite β PostgreSQL on AWS RDS / Azure SQL
- βοΈ Using cloud storage (AWS S3, Azure Blob) for input data instead of local DB
- π Automating ETL jobs with Apache Airflow / Prefect
- π§© Integrating external data sources (e.g. fetching datasets via Python API or SAP connectors)
- π¦ Expanding Product Insights with advanced visualizations (time-series, cohort analysis, cross-sell analysis)
- π οΈ Adding unit tests & data validation with
pytest
orGreat Expectations
- π Handling data security and access control (roles, masking sensitive fields)
- β‘ Scaling ETL pipelines with PySpark for distributed data processing
- π Migrating transformations from pandas β PySpark DataFrames for larger datasets
- ποΈ Testing Spark SQL alongside SQLAlchemy queries for performance comparison
π§ What I LearnedΒΆ
- ποΈ Designing a database schema with realistic dirty data
- π Building dynamic SQL query builders in Python (
SELECT
,WHERE
, filters) - βοΈ Implementing ETL pipelines (cleaning, mapping, joining multiple tables)
- π Ranking with SQL
RANK() OVER
and aggregations - π¨ Creating interactive dashboards in Streamlit with custom HTML & CSS
- π Applying a Data Engineering mindset: structured, reusable, scalable solutions
π― For RecruiterΒΆ
This project proves I can deliver an end-to-end data solution:
- ποΈ From raw, inconsistent data
- βοΈ Through SQL + Python ETL pipelines
- π To clean dashboards in Streamlit
β‘οΈ Core skills: SQL | Python (pandas) | ETL | Streamlit | Data Visualization