🚀 Project Goal

The application predicts half-marathon completion time from user data such as gender, 5 km time, and 10 km time. The prediction comes from a regression model trained with machine learning.

The user:

  • enters their data (an AI model converts free-text input into the structured data passed to the ML model)
  • receives a predicted half-marathon time
  • can compare their target time with the predicted time
  • gets feedback based on statistical data

🧰 Technologies & Tools

| Category | Technology | Usage |
| --- | --- | --- |
| ML / Prediction | pycaret.regression | Creating and training a regression model |
| Interface | streamlit | UI for data input and result presentation |
| Validation | pandera | Input data validation |
| Storage | DigitalOcean Spaces (S3) | Model and data storage |
| AI & LLM | OpenAI GPT-4o, instructor | Generating personalized feedback |
| Monitoring | langfuse | Monitoring AI model behavior |
| Visualization | matplotlib, pandas | Chart generation, data analysis |
| Other | joblib, boto3, dotenv | Model serialization, cloud integration, environment variables |

📅 Development Process

1. 🔢 Data & Analysis:

  • The dataset includes gender, age, age group, split times at 5 km, 10 km, 15 km, and 20 km, the finish time for the full distance, and the pace for each of them
  • Data cleaning and preparation, including outlier removal
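
The source does not say which outlier rule was used. As one common option, the sketch below keeps only finish times inside Tukey's fences (1.5 × IQR around the quartiles). The function name `remove_outliers_iqr` and the sample times are illustrative assumptions, not taken from the project.

```python
from statistics import quantiles


def remove_outliers_iqr(times_s, k=1.5):
    """Keep only values within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(times_s, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [t for t in times_s if low <= t <= high]


# Hypothetical finish times in seconds; the last one looks like a data error.
finish_times = [5400, 6000, 6300, 6600, 7200, 7500, 8100, 40000]
cleaned = remove_outliers_iqr(finish_times)
print(cleaned)
```

On a real dataset the same filter would more likely be applied to a pandas column, but the rule itself stays the same.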

2. 🤖 Model Creation:

  • Using pycaret to choose the best regression model
  • Evaluation using metrics (MAE, MSE)
  • Achieved MAE of 162 seconds (approx. 2.7 minutes)
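
To make the two metrics concrete, here is a minimal sketch in plain Python. The three finish times are made-up numbers used only for illustration. Only the 162-second figure comes from the project.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same unit as the target (seconds)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference, in squared seconds; penalizes large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted finish times in seconds.
actual = [7200, 6600, 8100]
predicted = [7080, 6750, 8310]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)

# The reported MAE of 162 seconds expressed in minutes.
reported_mae_minutes = 162 / 60
print(mae, mse, round(reported_mae_minutes, 1))
```

Because MAE stays in seconds, it can be read directly as the typical prediction error, which is why the 162-second figure converts cleanly to minutes.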

3. 📂 Serialization:

  • Save the model as a .pkl file with save_model from pycaret.regression
  • Upload model and reference data to DigitalOcean Spaces

4. 🌐 Streamlit App:

  • UI for data input
  • Integration with ML model
  • Loading the model and reference data from the cloud
  • Charts comparing the user's predicted time with their target time

📸 Screenshots & Code

1️⃣ Data Exploration, Model Creation and Cloud Upload

🧪 1. ML Setup and Data Preparation

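from pycaret.regression import (
    compare_models, finalize_model, save_model, setup, tune_model,
)

# Initialize the experiment: "time" is the target, "sex" is categorical.
setup(
    data=df_to_model,
    session_id=123,
    categorical_features=["sex"],
    target="time",
)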

🤖 2. Selecting Best Model from Multiple Algorithms

# Cross-validate the available regressors and keep the top-ranked one.
best_model = compare_models()

🔧 3. Hyperparameter Tuning

# Search for better hyperparameters using cross-validation.
best_model = tune_model(best_model)

🏁 4. Finalizing (Train on Full Dataset)

# Refit the chosen pipeline on the full dataset, including the hold-out set.
best_model = finalize_model(best_model)

💾 5. Save Final Model to .pkl

# Writes marathon_pipeline.pkl to the working directory.
save_model(best_model, "marathon_pipeline")

☁️ 6. Upload to DigitalOcean Cloud

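import os

import boto3

# DigitalOcean Spaces is S3-compatible, so the regular boto3 S3 client works.
session = boto3.session.Session()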

client = session.client(
    's3',
    region_name=os.environ.get("REGION_NAME"),
    endpoint_url=os.environ.get("ENPOINT_URL_KEY"),
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY")
)

with open("marathon_pipeline.pkl", "rb") as model_file:
    client.put_object(
        Bucket="maraton.data",
        Key="maraton_model/model.pkl",
        Body=model_file,
        ContentType="application/octet-stream"
    )
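
The Streamlit app later has to read the same object back from Spaces. A minimal sketch of that step is shown below. The helper name `download_model` is hypothetical and the project's actual loading code is not shown in the source. The function takes any client exposing boto3's `download_file(Bucket, Key, Filename)` method, which keeps it easy to test without network access.

```python
import os


def download_model(client, bucket, key, dest_path):
    """Fetch one object from S3-compatible storage and return the local path."""
    # Create the target directory if the destination is not the current folder.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    client.download_file(bucket, key, dest_path)  # boto3 S3 client method
    return dest_path


# With the boto3 `client` configured as in the upload step, the call could be:
# download_model(client, "maraton.data", "maraton_model/model.pkl",
#                "marathon_pipeline.pkl")
```

After the download, `load_model("marathon_pipeline")` from pycaret.regression can load the local .pkl file (the name is passed without the extension).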

2️⃣ UI: User Data Input

The app offers two data entry modes:

🧠 Option 1: Semantic (Natural Language) Input

The user can describe themselves in plain language, for example:

"I’m a woman, 28 years old, I run 5 km in 27 minutes and 10 km in 58 minutes."

The AI model (GPT-4o + Pydantic) transforms this into a JSON structure:

{
  "sex": 0,
  "time_5km": "00:27:00",
  "time_10km": "00:58:00"
}
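
The regression model needs numeric inputs, so a payload like this has to be checked and its "hh:mm:ss" strings converted to seconds at some point. The sketch below shows one way to do that with the standard library only. The project itself relies on Pydantic with instructor, so `RunnerInput`, `parse_runner_json`, and `hms_to_seconds` are hypothetical stand-ins. The meaning of the `sex` codes is not stated in the source and is passed through unchanged.

```python
import json
from dataclasses import dataclass


def hms_to_seconds(value: str) -> int:
    """Convert an "hh:mm:ss" string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid time: {value!r}")
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class RunnerInput:
    sex: int
    time_5km_s: int
    time_10km_s: int


def parse_runner_json(payload: str) -> RunnerInput:
    """Parse the LLM output and run a basic plausibility check."""
    data = json.loads(payload)
    runner = RunnerInput(
        sex=int(data["sex"]),
        time_5km_s=hms_to_seconds(data["time_5km"]),
        time_10km_s=hms_to_seconds(data["time_10km"]),
    )
    # A 10 km time shorter than the 5 km time signals a parsing or input error.
    if runner.time_10km_s <= runner.time_5km_s:
        raise ValueError("time_10km must be greater than time_5km")
    return runner


runner = parse_runner_json(
    '{"sex": 0, "time_5km": "00:27:00", "time_10km": "00:58:00"}'
)
print(runner)
```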

📸 Screen: Semantic data input
image.png


📝 Option 2: Classic Streamlit Form

The user can also enter their data manually:
📸 Screen: Form
image-2.png


🎯 3️⃣ Additional Functionality: User Goal Setting

The user can enter a target half-marathon time (e.g., 2:30).
The system compares it with the model’s predicted time.
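
A minimal sketch of this comparison follows. It assumes that a goal such as "2:30" means 2 hours 30 minutes; the source does not spell out the accepted formats. The function names and the predicted time of 2:37:00 are illustrative.

```python
def parse_goal(value: str) -> int:
    """Parse a goal given as "h:mm" or "h:mm:ss" into seconds."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 2:
        parts.append(0)  # no seconds given
    if len(parts) != 3:
        raise ValueError(f"expected h:mm or h:mm:ss, got {value!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def goal_gap(predicted_s: int, goal_s: int) -> int:
    """Positive result: the runner must get faster by that many seconds."""
    return predicted_s - goal_s


goal_s = parse_goal("2:30")
predicted_s = 2 * 3600 + 37 * 60  # hypothetical model output, 2:37:00
gap_s = goal_gap(predicted_s, goal_s)
print(gap_s)
```

A negative gap would mean the prediction already beats the goal.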

📸 Screen: Goal selection
image-4.png


📊 4️⃣ Result Visualization and Personalized Feedback

After data is processed, the app:

  • predicts the half-marathon time (e.g., 04:32:15)
  • compares it to the user’s goal (e.g., 04:00:00)
  • generates feedback based on race data:
    • shows the predicted time in hh:mm:ss and in words, e.g., “1h 20min”
    • compares it with the average results of a similar user group
    • shows how much the user needs to improve to hit their goal
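
The numeric part of this feedback can be sketched as follows, reusing the example times from the list. The helper names and the reference-group times are hypothetical, and expressing the required improvement as seconds per kilometre is only one possible way to present it.

```python
from statistics import mean

HALF_MARATHON_KM = 21.0975  # official half-marathon distance


def format_hms(total_s: int) -> str:
    """Format seconds as hh:mm:ss."""
    hours, rest = divmod(int(total_s), 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def format_words(total_s: int) -> str:
    """Format seconds as a short phrase such as "1h 20min"."""
    hours, rest = divmod(int(total_s), 3600)
    return f"{hours}h {rest // 60}min"


def pace_gain_per_km(predicted_s, goal_s, distance_km=HALF_MARATHON_KM):
    """Seconds per kilometre the runner has to gain to reach the goal."""
    return max(predicted_s - goal_s, 0) / distance_km


predicted_s = 4 * 3600 + 32 * 60 + 15  # 04:32:15
goal_s = 4 * 3600                      # 04:00:00
group_times = [15600, 16200, 16800]    # hypothetical times of similar runners

vs_group_s = predicted_s - mean(group_times)  # positive: slower than the group
print(format_hms(predicted_s), format_words(predicted_s))
print(round(pace_gain_per_km(predicted_s, goal_s), 1), vs_group_s)
```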

📸 Screen: Feedback
image-3.png