🚀 Project Goal
The application predicts marathon completion time from user data such as gender, 5 km time, and 10 km time. The prediction comes from a regression model trained with machine learning.
The user:
- enters their data (for free-text input, an AI model converts it into structured data for the ML model)
- can compare their target time with the predicted time
- receives a predicted half-marathon time
- gets feedback based on statistical data
🧰 Technologies & Tools
Category | Technology | Usage |
---|---|---|
ML / Prediction | pycaret.regression | Creating and training a regression model |
Interface | streamlit | UI for data input and result presentation |
Validation | pandera | Input data validation |
Storage | DigitalOcean Spaces (S3) | Model and data storage |
AI & LLM | OpenAI GPT-4o, instructor | Generating personalized feedback |
Monitoring | langfuse | Monitoring AI model behavior |
Visualization | matplotlib, pandas | Chart generation, data analysis |
Other | joblib, boto3, dotenv | Model serialization, cloud integration, environment |
📅 Development Process
1. 🔢 Data & Analysis:
- Dataset includes: gender, age, age group, times for 5 km/10 km/15 km/20 km/full distance, and the pace for each
- Data cleaning and preparation, removing outliers
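The text does not say which outlier rule was used. As a minimal sketch, assuming a standard 1.5 × IQR rule applied to finish times in seconds, the cleaning step could look like this. The function name `remove_outliers_iqr` and the sample values are hypothetical, and the standard library stands in for pandas.

```python
from statistics import quantiles


def remove_outliers_iqr(times_s, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(times_s, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [t for t in times_s if low <= t <= high]


# Hypothetical finish times in seconds; the last one looks like a timing error.
finish_times = [5400, 5700, 6000, 6300, 6600, 6900, 7200, 30000]
cleaned = remove_outliers_iqr(finish_times)
print(cleaned)
```

The multiplier `k` is left as a parameter because the right cutoff depends on how skewed the race results are.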
2. 🤖 Model Creation:
- Using pycaret to choose the best regression model
- Evaluation using metrics (MAE, MSE)
- Achieved MAE of 162 seconds (approx. 2.7 minutes)
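To make the metric concrete, here is a small sketch of how a mean absolute error in seconds is computed and converted to minutes. The actual and predicted values are made up for illustration and are not taken from the project's dataset.

```python
def mean_absolute_error(actual_s, predicted_s):
    """Average absolute difference between actual and predicted times (seconds)."""
    if not actual_s or len(actual_s) != len(predicted_s):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) for a, p in zip(actual_s, predicted_s))
    return total / len(actual_s)


# Hypothetical finish times and predictions, in seconds.
actual = [7200, 6900, 8100]
predicted = [7050, 7080, 7950]

mae_seconds = mean_absolute_error(actual, predicted)
mae_minutes = mae_seconds / 60
print(f"MAE: {mae_seconds:.0f} s ({mae_minutes:.1f} min)")

# The reported figure converts the same way: 162 s / 60 = 2.7 min.
reported_minutes = round(162 / 60, 1)
```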
3. 📂 Serialization:
- Save the model as a .pkl file via save_model from pycaret.regression
- Upload model and reference data to DigitalOcean Spaces
4. 🌐 Streamlit App:
- UI for data input
- Integration with ML model
- Loads data from cloud
- Charts comparing user time vs target time
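📸 Screenshots & Code
1️⃣ Data Exploration, Model Creation and Cloud Upload
🧪 1. ML Setup and Data Preparation
# Imports used by steps 1-5
from pycaret.regression import (
    setup, compare_models, tune_model, finalize_model, save_model
)

# df_to_model is the prepared DataFrame; "time" is the column to predict
setup(
    data=df_to_model,
    session_id=123,  # fixed seed so results are reproducible
    categorical_features=["sex"],
    target="time",
)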
🤖 2. Selecting Best Model from Multiple Algorithms
best_model = compare_models()
🔧 3. Hyperparameter Tuning
best_model = tune_model(best_model)
🏁 4. Finalizing (Train on Full Dataset)
best_model = finalize_model(best_model)
💾 5. Save Final Model to .pkl
save_model(best_model, "marathon_pipeline")
☁️ 6. Upload to DigitalOcean Cloud
import os

import boto3

# Credentials and endpoint are read from environment variables
session = boto3.session.Session()
client = session.client(
    "s3",
    region_name=os.environ.get("REGION_NAME"),
    endpoint_url=os.environ.get("ENPOINT_URL_KEY"),
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)

# save_model() wrote marathon_pipeline.pkl; upload it to the Spaces bucket
with open("marathon_pipeline.pkl", "rb") as model_file:
    client.put_object(
        Bucket="maraton.data",
        Key="maraton_model/model.pkl",
        Body=model_file,
        ContentType="application/octet-stream",
    )
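Because `os.environ.get` returns `None` for an unset variable, a missing setting only shows up later as a confusing connection or authentication error. A minimal sketch of an up-front check, using the four variable names from the upload code, is shown here. The helper name `missing_env_vars` and the sample value `fra1` are hypothetical.

```python
import os

# Variable names exactly as they are read in the upload code above.
REQUIRED_ENV_VARS = (
    "REGION_NAME",
    "ENPOINT_URL_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
example_env = {"REGION_NAME": "fra1"}
print(missing_env_vars(example_env))
```

Calling `missing_env_vars()` with no argument checks the real environment. Raising an error when the returned list is non-empty makes the failure explicit before any upload is attempted.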
2️⃣ UI: User Data Input
The app offers two data entry modes:
🧠 Option 1: Semantic (Natural Language) Input
The user can describe themselves in plain language, for example:
"I’m a woman, 28 years old, I run 5 km in 27 minutes and 10 km in 58 minutes."
The AI model (GPT-4o + Pydantic) transforms this into a JSON structure:
{
    "sex": 0,
    "time_5km": "00:27:00",
    "time_10km": "00:58:00"
}
📸 Screen: Semantic data input
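Once the JSON structure above is available, it still has to be checked and turned into numbers before it reaches the regression model. The sketch below illustrates that step with a standard-library dataclass standing in for the Pydantic model. The class name `RunnerInput`, the helper `hms_to_seconds`, and the assumption that `sex` is encoded as 0 or 1 are illustrative, not taken from the app's code.

```python
import json
from dataclasses import dataclass


def hms_to_seconds(value):
    """Convert an "HH:MM:SS" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid time: {value!r}")
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class RunnerInput:
    sex: int          # assumed encoding: 0 or 1
    time_5km_s: int   # 5 km time in seconds
    time_10km_s: int  # 10 km time in seconds

    @classmethod
    def from_json(cls, raw):
        data = json.loads(raw)
        if data["sex"] not in (0, 1):
            raise ValueError("sex must be 0 or 1")
        runner = cls(
            sex=data["sex"],
            time_5km_s=hms_to_seconds(data["time_5km"]),
            time_10km_s=hms_to_seconds(data["time_10km"]),
        )
        # A 10 km time cannot be shorter than the 5 km time.
        if runner.time_10km_s <= runner.time_5km_s:
            raise ValueError("10 km time must exceed 5 km time")
        return runner


raw = '{"sex": 0, "time_5km": "00:27:00", "time_10km": "00:58:00"}'
runner = RunnerInput.from_json(raw)
print(runner)
```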
📝 Option 2: Classic Streamlit Form
The user can also enter their data manually:
📸 Screen: Form
🎯 3️⃣ Additional Functionality: User Goal Setting
The user can enter a target half-marathon time (e.g., 2:30).
The system compares this target with the model’s predicted time.
📸 Screen: Goal selection
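The comparison itself is simple arithmetic once both times are in seconds. The sketch below assumes the goal is typed as "H:MM" (as in the 2:30 example) or "H:MM:SS". The function names and the predicted value are hypothetical.

```python
def parse_goal(value):
    """Parse a goal written as "H:MM" or "H:MM:SS" into seconds."""
    parts = [int(part) for part in value.split(":")]
    if len(parts) == 2:
        parts.append(0)  # no seconds given
    if len(parts) != 3:
        raise ValueError(f"unsupported time format: {value!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def goal_gap_seconds(predicted_s, goal_s):
    """Positive result: the predicted time is slower than the goal."""
    return predicted_s - goal_s


goal_s = parse_goal("2:30")
predicted_s = 2 * 3600 + 41 * 60 + 30  # hypothetical prediction of 02:41:30
gap = goal_gap_seconds(predicted_s, goal_s)
print(f"Gap to goal: {gap} s")
```

A negative gap would mean the predicted time already beats the target.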
📊 4️⃣ Result Visualization and Personalized Feedback
After the data is processed, the app:
- predicts the half-marathon time (e.g., 04:32:15)
- compares it to the user’s goal (e.g., 04:00:00)
- generates feedback based on race data:
  - shows the predicted time in hh:mm:ss and in words, e.g., “1h 20min”
  - compares it with average results in a similar user group
  - shows how much the user needs to improve to hit their goal
📸 Screen: Feedback
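The two display formats mentioned above can be produced from a time in seconds with `divmod`. This is a minimal sketch with hypothetical function names, reusing the example values from the text.

```python
def format_hms(total_seconds):
    """Format seconds as zero-padded hh:mm:ss."""
    hours, rest = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def format_words(total_seconds):
    """Format seconds as a short phrase such as "1h 20min" (seconds dropped)."""
    hours, rest = divmod(int(total_seconds), 3600)
    return f"{hours}h {rest // 60}min"


predicted_s = 4 * 3600 + 32 * 60 + 15  # the 04:32:15 example
goal_s = 4 * 3600                      # the 04:00:00 example

print(format_hms(predicted_s))
print(format_words(4800))
# Time the user would need to gain to reach the goal.
print(format_hms(predicted_s - goal_s))
```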