Compare commits

...

10 Commits

Author SHA1 Message Date
Alexis Bruneteau
ff71d052e6 Track individual model files instead of single multitask model
Some checks failed
MLOps CI/CD Pipeline / test (push) Failing after 5m3s
MLOps CI/CD Pipeline / train (push) Has been skipped
MLOps CI/CD Pipeline / deploy (push) Has been skipped
The training script creates separate model files for each task
(match_winner, map_winner, score_team1, score_team2, round_diff, total_maps)
so DVC needs to track each file individually.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:55:28 +02:00
Alexis Bruneteau
9520395ee9 Fix DVC output path overlap in train stage
Changed from tracking the entire models/ directory to specific model files
to resolve the conflict with models/metrics.json metric tracking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:51:16 +02:00
Alexis Bruneteau
9440f4eecd Implement multi-task learning pipeline for CSGO predictions
Created comprehensive multi-objective modeling system:

**6 Prediction Tasks:**
1. Match Winner (Binary Classification) - Who wins the match?
2. Map Winner (Binary Classification) - Who wins this specific map?
3. Team 1 Score (Regression) - Predict exact round score for team 1
4. Team 2 Score (Regression) - Predict exact round score for team 2
5. Round Difference (Regression) - Predict score margin
6. Total Maps (Regression) - Predict number of maps in match

**Implementation:**
- Updated preprocessing to generate all target variables
- Created train_multitask.py with separate models per task
- Classification tasks use Random Forest Classifier
- Regression tasks use Random Forest Regressor
- All models logged to MLflow experiment 'csgo-match-prediction-multitask'
- Metrics tracked per task (accuracy/precision for classification, MAE/RMSE for regression)
- Updated DVC pipeline to use new training script

**No Data Leakage:**
- All features are pre-match only (rankings, map, starting side)
- Target variables properly separated and saved with 'target_' prefix

This enables comprehensive match analysis and multiple betting/analytics use cases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:28:06 +02:00
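The per-task approach described in this commit amounts to fitting one independent estimator per target column. A minimal sketch, assuming scikit-learn and the `target_`-prefixed CSV layout the commit introduces (the task selection here is illustrative, not the full pipeline):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

train = pd.read_csv("data/processed/train.csv")
features = train[[c for c in train.columns if not c.startswith("target_")]]

# One independent model per prediction objective:
# classifiers for winner tasks, regressors for score-style tasks.
tasks = {
    "match_winner": RandomForestClassifier(n_estimators=100, random_state=42),
    "score_team1": RandomForestRegressor(n_estimators=100, random_state=42),
}

models = {}
for name, estimator in tasks.items():
    models[name] = estimator.fit(features, train[f"target_{name}"])
```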
Alexis Bruneteau
a28a363dd9 Add comprehensive pre-match features for better predictions
Enhanced feature engineering with legitimate pre-match information:

New features:
- Map one-hot encoding (Dust2, Mirage, Inferno, etc.)
- rank_sum: Combined team strength indicator
- rank_ratio: Relative team strength
- team1_is_favorite: Whether team 1 has better ranking
- both_top_tier: Both teams in top 10
- underdog_matchup: Large ranking difference (>50)

All features are known before match starts - no data leakage.
Expected to improve model performance while maintaining integrity.

Current feature count: ~20 (4 base + 3 rank + ~10 maps + 3 indicators)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:24:07 +02:00
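The feature engineering described above can be summarised as follows; this is a sketch, assuming the `rank_1`, `rank_2`, `starting_ct`, and `_map` columns of the raw dataset (a lower rank number means a stronger team):

```python
import pandas as pd

def add_prematch_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build only features that are known before the match starts."""
    out = df[["starting_ct", "rank_1", "rank_2"]].copy()
    out["rank_diff"] = out["rank_1"] - out["rank_2"]
    out["rank_sum"] = out["rank_1"] + out["rank_2"]
    out["rank_ratio"] = out["rank_1"] / (out["rank_2"] + 1)  # +1 avoids division by zero
    out["team1_is_favorite"] = (out["rank_1"] < out["rank_2"]).astype(int)
    out["both_top_tier"] = ((out["rank_1"] <= 10) & (out["rank_2"] <= 10)).astype(int)
    out["underdog_matchup"] = (out["rank_diff"].abs() > 50).astype(int)
    # One-hot encode the map name (Dust2, Mirage, Inferno, ...)
    return pd.concat([out, pd.get_dummies(df["_map"], prefix="map")], axis=1)
```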
Alexis Bruneteau
6995102d76 Remove map_wins features - they contain match outcome data
The map_wins_1 and map_wins_2 columns represent maps won DURING
the current match, not historical performance. This is data leakage
as these values are only known during/after the match.

Now using only truly pre-match features:
- rank_1, rank_2: Team rankings before match
- starting_ct: Which team starts CT side
- rank_diff: Derived ranking difference

This should finally give realistic model performance based solely
on information available before the match begins.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:17:07 +02:00
Alexis Bruneteau
efaf5ff0e1 Fix critical data leakage in feature engineering
Removed features that contain match outcome information:
- result_1, result_2 (actual match scores - only known after match)
- ct_1, t_2, t_1, ct_2 (rounds won per side - only known after match)
- total_rounds, round_diff (derived from results)

These features caused perfect 1.0 accuracy because the model was
essentially "cheating" by knowing the match outcome.

Now using only pre-match information:
- Team rankings (rank_1, rank_2)
- Historical map performance (map_wins_1, map_wins_2)
- Starting side (starting_ct)
- Derived: rank_diff, map_wins_diff

This will give realistic model performance based on what would
actually be known before a match starts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:01:46 +02:00
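The fix boils down to dropping every column that encodes the match outcome before building features. A minimal sketch, using the leaky column names listed in the commit message (the helper name is illustrative):

```python
import pandas as pd

# Columns only known during or after the match -> must not be used as features
LEAKY_COLUMNS = [
    "result_1", "result_2",
    "ct_1", "t_1", "ct_2", "t_2",
    "total_rounds", "round_diff",
]

def drop_outcome_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only information available before the match starts."""
    return df.drop(columns=[c for c in LEAKY_COLUMNS if c in df.columns])
```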
Alexis Bruneteau
cb7b80ca6a Fix MLflow model logging warnings
Added the input_example parameter to auto-infer the model signature and
explicitly set the artifact_path parameter to remove deprecation warnings.

This improves MLflow tracking by:
- Auto-generating model signature from training data
- Using correct parameter names for MLflow 3.x
- Enabling better model serving and inference validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 20:01:05 +02:00
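The change amounts to handing MLflow one row of training data so it can infer the model signature. A self-contained sketch with toy data (the actual change is visible in the diff further down the page):

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real training data and model
X_train = pd.DataFrame({"rank_1": [1, 5], "rank_2": [3, 2], "starting_ct": [1, 2]})
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, [0, 1])

with mlflow.start_run():
    # Passing an input example lets MLflow infer the input schema automatically
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        input_example=X_train.head(1),
    )
```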
Alexis Bruneteau
22db96b3eb Simplify MLflow auth to use native env var support
Reverted to simpler approach - MLflow natively supports
MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment
variables for HTTP Basic Auth.

Removed the manual URI construction since it's not needed.
The workflow already sets these env vars correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 19:57:15 +02:00
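With this approach the script only sets the tracking URI; MLflow picks up the credentials from the environment on its own. A minimal sketch, assuming the CI workflow exports the two variables (the server URL is the one used elsewhere in this repository):

```python
import os
import mlflow

# MLflow reads MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD automatically
# when they are present in the environment (e.g. exported by the CI workflow).
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "https://mlflow.sortifal.dev"))

if os.getenv("MLFLOW_TRACKING_USERNAME") and os.getenv("MLFLOW_TRACKING_PASSWORD"):
    print("MLflow basic-auth credentials found in the environment")
```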
Alexis Bruneteau
a4ddfb57be Use HTTP Basic Auth for MLflow authentication
Changed MLflow authentication to use HTTP Basic Auth by embedding
credentials in the tracking URI (https://user:pass@host).

This is the standard authentication method for MLflow when using
basic auth, rather than relying on environment variables alone.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 19:53:02 +02:00
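The URI-embedding approach this commit describes looks roughly like the sketch below; it was simplified again in the follow-up commit above, since MLflow already honours the credential environment variables. URL-quoting the credentials is an added safeguard, not part of the original change:

```python
import os
from urllib.parse import quote

import mlflow

# Embed HTTP Basic Auth credentials directly in the tracking URI (https://user:pass@host)
user = quote(os.environ["MLFLOW_TRACKING_USERNAME"], safe="")
password = quote(os.environ["MLFLOW_TRACKING_PASSWORD"], safe="")
mlflow.set_tracking_uri(f"https://{user}:{password}@mlflow.sortifal.dev")
```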
Alexis Bruneteau
bc5d96981a Fix MLflow authentication in training script
Added explicit environment variable configuration for MLflow credentials.
The credentials are now properly passed through from the CI/CD environment
to the MLflow client.

Changes:
- Check for MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD env vars
- Explicitly set them in os.environ for MLflow to use
- Added connection success message for debugging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-01 19:47:22 +02:00
4 changed files with 362 additions and 33 deletions

View File

@@ -16,9 +16,9 @@ stages:
           cache: false
   train:
-    cmd: python src/models/train.py
+    cmd: python src/models/train_multitask.py
     deps:
-      - src/models/train.py
+      - src/models/train_multitask.py
       - data/processed/train.csv
       - data/processed/test.csv
     params:
@@ -26,7 +26,12 @@ stages:
       - train.max_depth
       - train.random_state
     outs:
-      - models/model.pkl
+      - models/model_match_winner.pkl
+      - models/model_map_winner.pkl
+      - models/model_score_team1.pkl
+      - models/model_score_team2.pkl
+      - models/model_round_diff.pkl
+      - models/model_total_maps.pkl
     metrics:
       - models/metrics.json:
           cache: false

View File

@@ -21,23 +21,50 @@ def load_raw_data():
 def engineer_features(df):
     """Create features for match prediction"""
-    # Basic features from results
+    # Only use features that would be known BEFORE the match starts
+
+    # Base features
     features = df[[
-        'result_1', 'result_2', 'starting_ct',
-        'ct_1', 't_2', 't_1', 'ct_2',
-        'rank_1', 'rank_2', 'map_wins_1', 'map_wins_2'
+        'starting_ct',  # Which team starts as CT (known before match)
+        'rank_1', 'rank_2',  # Team rankings (known before match)
     ]].copy()
 
-    # Engineered features
+    # Rank-based features
     features['rank_diff'] = features['rank_1'] - features['rank_2']
-    features['map_wins_diff'] = features['map_wins_1'] - features['map_wins_2']
-    features['total_rounds'] = features['result_1'] + features['result_2']
-    features['round_diff'] = features['result_1'] - features['result_2']
+    features['rank_sum'] = features['rank_1'] + features['rank_2']
+    features['rank_ratio'] = features['rank_1'] / (features['rank_2'] + 1)  # +1 to avoid division by zero
 
-    # Target: match_winner (1 or 2) -> convert to 0 or 1
-    target = df['match_winner'] - 1
+    # Map encoding (one-hot encoding for map types)
+    map_dummies = pd.get_dummies(df['_map'], prefix='map')
+    features = pd.concat([features, map_dummies], axis=1)
 
-    return features, target
+    # Team strength indicators
+    features['team1_is_favorite'] = (features['rank_1'] < features['rank_2']).astype(int)
+    features['both_top_tier'] = ((features['rank_1'] <= 10) & (features['rank_2'] <= 10)).astype(int)
+    features['underdog_matchup'] = (abs(features['rank_diff']) > 50).astype(int)
+
+    # Multi-task targets
+    targets = {}
+
+    # Task 1: Match Winner (Binary Classification)
+    targets['match_winner'] = df['match_winner'] - 1  # Convert 1/2 to 0/1
+
+    # Task 2: Exact Score (Regression - two outputs)
+    targets['score_team1'] = df['result_1']
+    targets['score_team2'] = df['result_2']
+
+    # Task 3: Round Difference (Regression)
+    targets['round_diff'] = df['result_1'] - df['result_2']
+
+    # Task 4: Map Count for the match (Multi-class)
+    # Group by match_id to get total maps played
+    match_maps = df.groupby('match_id').size().to_dict()
+    targets['total_maps'] = df['match_id'].map(match_maps)
+
+    # Task 5: Map Winner (Binary Classification for this specific map)
+    targets['map_winner'] = df['map_winner'] - 1  # Convert 1/2 to 0/1
+
+    return features, targets
 
 
 def save_metrics(X_train, X_test, y_train, y_test):
     """Save dataset metrics"""
@@ -63,44 +90,67 @@ def main():
     print("Loading raw data...")
     df = load_raw_data()
-    print(f"Loaded {len(df)} matches")
+    print(f"Loaded {len(df)} maps")
 
     print("Engineering features...")
-    X, y = engineer_features(df)
+    X, targets = engineer_features(df)
     print(f"Created {X.shape[1]} features")
+    print(f"Created {len(targets)} prediction targets:")
+    for target_name in targets.keys():
+        print(f"  - {target_name}")
 
     print("Splitting data...")
-    X_train, X_test, y_train, y_test = train_test_split(
-        X, y,
+    # Use match_winner for stratification
+    X_train, X_test, idx_train, idx_test = train_test_split(
+        X, X.index,
         test_size=params["test_size"],
         random_state=params["random_state"],
-        stratify=y
+        stratify=targets['match_winner']
     )
 
     print("Saving processed data...")
     Path("data/processed").mkdir(parents=True, exist_ok=True)
 
-    # Save full features
-    full_features = X.copy()
-    full_features['target'] = y
-    full_features.to_csv("data/processed/features.csv", index=False)
-
-    # Save train set
+    # Save train set with all targets
    train_data = X_train.copy()
-    train_data['target'] = y_train
+    for target_name, target_values in targets.items():
+        train_data[f'target_{target_name}'] = target_values.iloc[idx_train].values
     train_data.to_csv("data/processed/train.csv", index=False)
 
-    # Save test set
+    # Save test set with all targets
     test_data = X_test.copy()
-    test_data['target'] = y_test
+    for target_name, target_values in targets.items():
+        test_data[f'target_{target_name}'] = target_values.iloc[idx_test].values
     test_data.to_csv("data/processed/test.csv", index=False)
 
-    # Save metrics
-    save_metrics(X_train, X_test, y_train, y_test)
-
-    print("Preprocessing completed successfully!")
+    # Save full features with all targets
+    full_features = X.copy()
+    for target_name, target_values in targets.items():
+        full_features[f'target_{target_name}'] = target_values.values
+    full_features.to_csv("data/processed/features.csv", index=False)
+
+    # Save metrics
+    print("\nDataset statistics:")
     print(f"Train set: {len(X_train)} samples")
     print(f"Test set: {len(X_test)} samples")
+    print(f"Features: {X.shape[1]}")
+
+    metrics = {
+        "n_samples": len(X),
+        "n_train": len(X_train),
+        "n_test": len(X_test),
+        "n_features": X.shape[1],
+        "targets": list(targets.keys()),
+        "class_balance_match_winner": {
+            "class_0": int((targets['match_winner'] == 0).sum()),
+            "class_1": int((targets['match_winner'] == 1).sum())
+        }
+    }
+
+    with open("data/processed/data_metrics.json", "w") as f:
+        json.dump(metrics, f, indent=2)
+
+    print("Preprocessing completed successfully!")
 
 
 if __name__ == "__main__":
     main()

View File

@@ -14,12 +14,20 @@ from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
 import pandas as pd
 
 # Configure MLflow
-mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "https://mlflow.sortifal.dev"))
+# MLflow will automatically use MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD env vars
+tracking_uri = os.getenv("MLFLOW_TRACKING_URI", "https://mlflow.sortifal.dev")
+mlflow.set_tracking_uri(tracking_uri)
+
+if os.getenv("MLFLOW_TRACKING_USERNAME") and os.getenv("MLFLOW_TRACKING_PASSWORD"):
+    print(f"MLflow configured with authentication for {tracking_uri}")
+else:
+    print(f"MLflow configured without authentication for {tracking_uri}")
 
 # Try to set experiment, but handle auth errors gracefully
 USE_MLFLOW = True
 try:
     mlflow.set_experiment("csgo-match-prediction")
+    print(f"Connected to MLflow at {mlflow.get_tracking_uri()}")
 except Exception as e:
     print(f"Warning: Could not connect to MLflow: {e}")
     print("Training will continue without MLflow tracking.")
@@ -129,7 +137,13 @@ def main():
 
     # Try to log model to MLflow (if permissions allow)
     try:
-        mlflow.sklearn.log_model(model, "model")
+        # Create input example for model signature
+        input_example = X_train.head(1)
+        mlflow.sklearn.log_model(
+            model,
+            artifact_path="model",
+            input_example=input_example
+        )
         print("\nModel logged to MLflow successfully!")
     except Exception as e:
         print(f"\nWarning: Could not log model to MLflow: {e}")

View File

@@ -0,0 +1,260 @@
"""
Multi-task model training pipeline for CSGO match prediction.
Trains separate models for different prediction objectives and logs to MLflow.
"""
import mlflow
import mlflow.sklearn
import yaml
import json
import pickle
import os
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score
)
import pandas as pd
import numpy as np

# Configure MLflow
# MLflow will automatically use MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD env vars
tracking_uri = os.getenv("MLFLOW_TRACKING_URI", "https://mlflow.sortifal.dev")
mlflow.set_tracking_uri(tracking_uri)

if os.getenv("MLFLOW_TRACKING_USERNAME") and os.getenv("MLFLOW_TRACKING_PASSWORD"):
    print(f"MLflow configured with authentication for {tracking_uri}")
else:
    print(f"MLflow configured without authentication for {tracking_uri}")

# Try to set experiment, but handle auth errors gracefully
USE_MLFLOW = True
try:
    mlflow.set_experiment("csgo-match-prediction-multitask")
    print(f"Connected to MLflow at {mlflow.get_tracking_uri()}")
except Exception as e:
    print(f"Warning: Could not connect to MLflow: {e}")
    print("Training will continue without MLflow tracking.")
    USE_MLFLOW = False


def load_params():
    """Load training parameters from params.yaml"""
    with open("params.yaml") as f:
        params = yaml.safe_load(f)
    return params["train"]


def load_data():
    """Load preprocessed training and test data"""
    train_df = pd.read_csv("data/processed/train.csv")
    test_df = pd.read_csv("data/processed/test.csv")

    # Separate features and targets
    feature_cols = [col for col in train_df.columns if not col.startswith('target_')]
    target_cols = [col for col in train_df.columns if col.startswith('target_')]

    X_train = train_df[feature_cols]
    X_test = test_df[feature_cols]

    # Extract all targets
    targets_train = {col.replace('target_', ''): train_df[col] for col in target_cols}
    targets_test = {col.replace('target_', ''): test_df[col] for col in target_cols}

    return X_train, X_test, targets_train, targets_test


def train_classification_model(X_train, y_train, params, task_name):
    """Train a classification model"""
    print(f"\n[{task_name}] Training Random Forest Classifier...")
    model = RandomForestClassifier(
        n_estimators=params["n_estimators"],
        max_depth=params["max_depth"],
        random_state=params["random_state"],
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    return model


def train_regression_model(X_train, y_train, params, task_name):
    """Train a regression model"""
    print(f"\n[{task_name}] Training Random Forest Regressor...")
    model = RandomForestRegressor(
        n_estimators=params["n_estimators"],
        max_depth=params["max_depth"],
        random_state=params["random_state"],
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    return model


def evaluate_classification(model, X_test, y_test, task_name):
    """Evaluate classification model"""
    print(f"[{task_name}] Evaluating...")
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    metrics = {
        f"{task_name}_accuracy": float(accuracy_score(y_test, y_pred)),
        f"{task_name}_precision": float(precision_score(y_test, y_pred, zero_division=0)),
        f"{task_name}_recall": float(recall_score(y_test, y_pred, zero_division=0)),
        f"{task_name}_f1_score": float(f1_score(y_test, y_pred, zero_division=0)),
        f"{task_name}_roc_auc": float(roc_auc_score(y_test, y_pred_proba))
    }
    return metrics


def evaluate_regression(model, X_test, y_test, task_name):
    """Evaluate regression model"""
    print(f"[{task_name}] Evaluating...")
    y_pred = model.predict(X_test)

    metrics = {
        f"{task_name}_mae": float(mean_absolute_error(y_test, y_pred)),
        f"{task_name}_mse": float(mean_squared_error(y_test, y_pred)),
        f"{task_name}_rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        f"{task_name}_r2": float(r2_score(y_test, y_pred))
    }
    return metrics


def save_models(models, all_metrics):
    """Save all models and metrics locally"""
    Path("models").mkdir(parents=True, exist_ok=True)

    # Save each model
    for task_name, model in models.items():
        model_path = f"models/model_{task_name}.pkl"
        with open(model_path, "wb") as f:
            pickle.dump(model, f)
        print(f"Saved {task_name} model to {model_path}")

    # Save all metrics
    with open("models/metrics.json", "w") as f:
        json.dump(all_metrics, f, indent=2)
    print(f"Metrics saved to models/metrics.json")


def main():
    """Main multi-task training pipeline"""
    print("=" * 70)
    print("CSGO Match Prediction - Multi-Task Model Training")
    print("=" * 70)

    # Load parameters and data
    params = load_params()
    X_train, X_test, targets_train, targets_test = load_data()

    print(f"\nDataset info:")
    print(f"  Training samples: {len(X_train)}")
    print(f"  Test samples: {len(X_test)}")
    print(f"  Features: {X_train.shape[1]}")
    print(f"  Prediction tasks: {len(targets_train)}")

    # Define tasks
    tasks = {
        'match_winner': {'type': 'classification', 'description': 'Match Winner Prediction'},
        'map_winner': {'type': 'classification', 'description': 'Map Winner Prediction'},
        'score_team1': {'type': 'regression', 'description': 'Team 1 Score Prediction'},
        'score_team2': {'type': 'regression', 'description': 'Team 2 Score Prediction'},
        'round_diff': {'type': 'regression', 'description': 'Round Difference Prediction'},
        'total_maps': {'type': 'regression', 'description': 'Total Maps Prediction'}
    }

    models = {}
    all_metrics = {}

    if USE_MLFLOW:
        with mlflow.start_run(run_name="multitask-rf-csgo"):
            # Log parameters
            mlflow.log_params(params)
            mlflow.log_param("n_features", X_train.shape[1])
            mlflow.log_param("n_train_samples", len(X_train))
            mlflow.log_param("n_test_samples", len(X_test))
            mlflow.log_param("n_tasks", len(tasks))

            # Train and evaluate each task
            for task_name, task_config in tasks.items():
                print(f"\n{'='*70}")
                print(f"Task: {task_config['description']}")
                print(f"{'='*70}")

                if task_name not in targets_train:
                    print(f"Warning: {task_name} not found in training data, skipping...")
                    continue

                y_train = targets_train[task_name]
                y_test = targets_test[task_name]

                # Train model based on task type
                if task_config['type'] == 'classification':
                    model = train_classification_model(X_train, y_train, params, task_name)
                    metrics = evaluate_classification(model, X_test, y_test, task_name)
                else:
                    model = train_regression_model(X_train, y_train, params, task_name)
                    metrics = evaluate_regression(model, X_test, y_test, task_name)

                models[task_name] = model
                all_metrics.update(metrics)

                # Log metrics to MLflow
                mlflow.log_metrics(metrics)

                # Print results
                print(f"\n{task_name} Results:")
                for metric, value in metrics.items():
                    print(f"  {metric}: {value:.4f}")

            # Save models and metrics
            save_models(models, all_metrics)

            # Print summary
            print("\n" + "=" * 70)
            print("Training Summary:")
            print("=" * 70)
            print(f"Models trained: {len(models)}")
            print(f"Total metrics: {len(all_metrics)}")
            print("=" * 70)
            print(f"\nMLflow run ID: {mlflow.active_run().info.run_id}")
            print(f"View run at: {mlflow.get_tracking_uri()}")
    else:
        # Train without MLflow
        for task_name, task_config in tasks.items():
            print(f"\n{'='*70}")
            print(f"Task: {task_config['description']}")
            print(f"{'='*70}")

            if task_name not in targets_train:
                print(f"Warning: {task_name} not found in training data, skipping...")
                continue

            y_train = targets_train[task_name]
            y_test = targets_test[task_name]

            # Train model based on task type
            if task_config['type'] == 'classification':
                model = train_classification_model(X_train, y_train, params, task_name)
                metrics = evaluate_classification(model, X_test, y_test, task_name)
            else:
                model = train_regression_model(X_train, y_train, params, task_name)
                metrics = evaluate_regression(model, X_test, y_test, task_name)

            models[task_name] = model
            all_metrics.update(metrics)

            # Print results
            print(f"\n{task_name} Results:")
            for metric, value in metrics.items():
                print(f"  {metric}: {value:.4f}")

        # Save models and metrics
        save_models(models, all_metrics)

        print("\n" + "=" * 70)
        print("Training Summary:")
        print("=" * 70)
        print(f"Models trained: {len(models)}")
        print(f"Total metrics: {len(all_metrics)}")
        print("=" * 70)

    print("\nMulti-task training pipeline completed successfully!")


if __name__ == "__main__":
    main()