The training script creates a separate model file for each task
(match_winner, map_winner, score_team1, score_team2, round_diff, total_maps),
so DVC needs to track each file individually.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
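The per-file tracking described above could be expressed in dvc.yaml roughly as follows. This is a sketch: the stage name, script path, and .pkl filenames are assumptions, not the repository's actual pipeline definition.

```yaml
stages:
  train:
    cmd: poetry run python train_multitask.py
    outs:
      # one tracked output per task, instead of the whole models/ directory
      - models/match_winner.pkl
      - models/map_winner.pkl
      - models/score_team1.pkl
      - models/score_team2.pkl
      - models/round_diff.pkl
      - models/total_maps.pkl
    metrics:
      - models/metrics.json:
          cache: false
```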
Changed from tracking the entire models/ directory to tracking specific
model files, to resolve a conflict with the models/metrics.json metrics tracking.
Created a comprehensive multi-objective modeling system:
**6 Prediction Tasks:**
1. Match Winner (Binary Classification) - Who wins the match?
2. Map Winner (Binary Classification) - Who wins this specific map?
3. Team 1 Score (Regression) - Predict exact round score for team 1
4. Team 2 Score (Regression) - Predict exact round score for team 2
5. Round Difference (Regression) - Predict score margin
6. Total Maps (Regression) - Predict number of maps in match
**Implementation:**
- Updated preprocessing to generate all target variables
- Created train_multitask.py with separate models per task
- Classification tasks use Random Forest Classifier
- Regression tasks use Random Forest Regressor
- All models logged to MLflow experiment 'csgo-match-prediction-multitask'
- Metrics tracked per task (accuracy/precision for classification, MAE/RMSE for regression)
- Updated DVC pipeline to use new training script
**No Data Leakage:**
- All features are pre-match only (rankings, map, starting side)
- Target variables properly separated and saved with 'target_' prefix
This enables comprehensive match analysis and multiple betting/analytics use cases.
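The per-task loop can be sketched as follows. This is a minimal illustration on synthetic data with assumed column names (the real logic lives in train_multitask.py, and the real targets are saved with a 'target_' prefix as described above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error

# Synthetic stand-in for the preprocessed dataset (column names assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rank_1": rng.integers(1, 100, 200),
    "rank_2": rng.integers(1, 100, 200),
    "starting_ct": rng.integers(0, 2, 200),
})
df["rank_diff"] = df["rank_1"] - df["rank_2"]
# Hypothetical targets; the real ones come from preprocessing.
df["target_match_winner"] = (df["rank_diff"] < 0).astype(int)
df["target_round_diff"] = -df["rank_diff"] / 10.0

features = ["rank_1", "rank_2", "starting_ct", "rank_diff"]
tasks = {
    "match_winner": ("classification", "target_match_winner"),
    "round_diff": ("regression", "target_round_diff"),
}

metrics = {}
for task, (kind, target) in tasks.items():
    X, y = df[features], df[target]
    if kind == "classification":
        model = RandomForestClassifier(n_estimators=50, random_state=42)
        model.fit(X, y)
        metrics[task] = {"accuracy": accuracy_score(y, model.predict(X))}
    else:
        model = RandomForestRegressor(n_estimators=50, random_state=42)
        model.fit(X, y)
        metrics[task] = {"mae": mean_absolute_error(y, model.predict(X))}
```

In the actual script each fitted model would also be saved to its own file and logged to MLflow with its per-task metrics.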
Enhanced feature engineering with legitimate pre-match information:
New features:
- Map one-hot encoding (Dust2, Mirage, Inferno, etc.)
- rank_sum: Combined team strength indicator
- rank_ratio: Relative team strength
- team1_is_favorite: Whether team 1 has better ranking
- both_top_tier: Both teams in top 10
- underdog_matchup: Large ranking difference (>50)
All features are known before match starts - no data leakage.
Expected to improve model performance while maintaining integrity.
Current feature count: ~20 (4 base + 3 rank + ~10 maps + 3 indicators)
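A minimal sketch of these features in pandas, assuming column names like rank_1/rank_2/map (the preprocessing script may use different names):

```python
import pandas as pd

df = pd.DataFrame({
    "rank_1": [3, 45, 8],
    "rank_2": [12, 7, 80],
    "map": ["Dust2", "Mirage", "Inferno"],
})

# Map one-hot encoding.
df = pd.concat([df, pd.get_dummies(df["map"], prefix="map")], axis=1)

# Ranking-based features (lower rank number = stronger team).
df["rank_sum"] = df["rank_1"] + df["rank_2"]
df["rank_ratio"] = df["rank_1"] / df["rank_2"]
df["team1_is_favorite"] = (df["rank_1"] < df["rank_2"]).astype(int)
df["both_top_tier"] = ((df["rank_1"] <= 10) & (df["rank_2"] <= 10)).astype(int)
df["underdog_matchup"] = ((df["rank_1"] - df["rank_2"]).abs() > 50).astype(int)
```

All of these are derived purely from pre-match columns, so they preserve the no-leakage property.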
The map_wins_1 and map_wins_2 columns represent maps won DURING
the current match, not historical performance. This is data leakage,
as these values are only known during or after the match.
Now using only truly pre-match features:
- rank_1, rank_2: Team rankings before match
- starting_ct: Which team starts CT side
- rank_diff: Derived ranking difference
This should finally give realistic model performance based solely
on information available before the match begins.
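The feature selection this describes amounts to an explicit allow-list of pre-match columns, roughly like this (column names assumed from the description above):

```python
import pandas as pd

# Toy frame mixing pre-match columns with the leaky map_wins_* columns.
raw = pd.DataFrame({
    "rank_1": [3, 45],
    "rank_2": [12, 7],
    "starting_ct": [1, 2],
    "map_wins_1": [1, 0],  # leaky: maps won during the current match
    "map_wins_2": [0, 1],  # leaky
})

PRE_MATCH = ["rank_1", "rank_2", "starting_ct"]
features = raw[PRE_MATCH].copy()
features["rank_diff"] = features["rank_1"] - features["rank_2"]
```

An allow-list is safer than dropping known-bad columns, since any new leaky column added later is excluded by default.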
Removed features that contain match outcome information:
- result_1, result_2 (actual match scores - only known after match)
- ct_1, t_2, t_1, ct_2 (rounds won per side - only known after match)
- total_rounds, round_diff (derived from results)
These features caused perfect 1.0 accuracy because the model was
essentially "cheating" by knowing the match outcome.
Now using only pre-match information:
- Team rankings (rank_1, rank_2)
- Historical map performance (map_wins_1, map_wins_2)
- Starting side (starting_ct)
- Derived: rank_diff, map_wins_diff
This will give realistic model performance based on what would
actually be known before a match starts.
Added the input_example parameter to auto-infer the model signature and
explicitly set the artifact_path parameter to remove deprecation warnings.
This improves MLflow tracking by:
- Auto-generating model signature from training data
- Using correct parameter names for MLflow 3.x
- Enabling better model serving and inference validation
Reverted to a simpler approach: MLflow natively supports the
MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment
variables for HTTP Basic Auth.
Removed the manual URI construction since it's not needed.
The workflow already sets these env vars correctly.
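The workflow env block this relies on would look roughly like the following; the host and secret names are placeholders, not values from the actual repository:

```yaml
env:
  MLFLOW_TRACKING_URI: https://mlflow.example.com        # placeholder host
  MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USER }}   # assumed secret name
  MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
```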
Changed MLflow authentication to use HTTP Basic Auth by embedding
credentials in the tracking URI (https://user:pass@host).
This is the standard authentication method for MLflow when using
basic auth, rather than relying on environment variables alone.
Added explicit environment variable configuration for the MLflow credentials.
The credentials are now properly passed through from the CI/CD environment
to the MLflow client.
Changes:
- Check for MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD env vars
- Explicitly set them in os.environ for MLflow to use
- Added connection success message for debugging
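A small sketch of this check-and-set pattern (the function name is hypothetical; the real script may do this inline):

```python
import os

def configure_mlflow_credentials(env=os.environ):
    """Ensure MLflow basic-auth credentials are present before connecting.

    Reads the variables, fails loudly if either is missing, and sets them
    explicitly in os.environ so the MLflow client picks them up.
    """
    user = env.get("MLFLOW_TRACKING_USERNAME")
    password = env.get("MLFLOW_TRACKING_PASSWORD")
    if not user or not password:
        raise RuntimeError("MLflow credentials not found in environment")
    os.environ["MLFLOW_TRACKING_USERNAME"] = user
    os.environ["MLFLOW_TRACKING_PASSWORD"] = password
    print("MLflow credential check passed")  # debugging aid per the commit
    return user
```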
Changed cache configuration:
- Moved Install Poetry step before cache setup
- Updated cache path to ~/.cache/pypoetry/virtualenvs (actual venv location)
- Removed **/poetry.lock wildcard in favor of direct poetry.lock reference
- This ensures the virtualenv itself is cached, not just metadata
This should significantly speed up CI/CD runs by reusing installed packages.
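The resulting step order would look roughly like this; the paths and key pattern follow the description above, while the Poetry install command is an assumption:

```yaml
- name: Install Poetry
  run: pipx install poetry   # assumed install method
- name: Cache Poetry virtualenvs
  uses: actions/cache@v3
  with:
    path: ~/.cache/pypoetry/virtualenvs
    key: ${{ runner.os }}-poetry-${{ hashFiles('poetry.lock') }}
    restore-keys: |
      ${{ runner.os }}-poetry-
```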
Changed dvc pull to pull only data/raw.dvc instead of all
outputs. The processed data and model files are generated by the
DVC pipeline (dvc repro), not pulled from remote storage.
This prevents errors about missing processed files that haven't
been generated yet.
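As workflow steps, the targeted pull looks roughly like this (step names are illustrative):

```yaml
- name: Pull raw data only
  run: dvc pull data/raw.dvc
- name: Regenerate processed data and models
  run: dvc repro
```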
DVC needs credentials configured via the 'dvc remote modify' command
rather than environment variables alone. This fixes 403 Forbidden errors
when accessing MinIO/S3 storage.
Changes:
- Added dvc remote modify commands to set access_key_id and secret_access_key
- Applied to both pull and push operations in test and train jobs
- Added .dvc/config.local to .gitignore to prevent credential leaks
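A sketch of the added step; the remote name "storage" is an assumption, and --local writes the credentials into the gitignored .dvc/config.local:

```yaml
- name: Configure DVC remote credentials
  run: |
    dvc remote modify --local storage access_key_id "$AWS_ACCESS_KEY_ID"
    dvc remote modify --local storage secret_access_key "$AWS_SECRET_ACCESS_KEY"
```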
Configured DVC to use AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
from Gitea secrets (DVC_ID and DVC_PASSWORD) for MinIO/S3 access.
Changes:
- Added DVC credentials to all DVC operations (pull/push)
- Changed poetry install to use --no-root flag for faster installs
- Credentials applied to both test and train jobs
Added actions/cache@v3 to cache Poetry and pip dependencies across
workflow runs. This significantly speeds up CI/CD by avoiding
full reinstallation when poetry.lock hasn't changed.
Cache strategy:
- Cache key based on OS and poetry.lock hash
- Caches ~/.cache/pypoetry and ~/.cache/pip
- Falls back to OS-specific cache if exact match not found
Added dvc-s3>=3.2.0 to dependencies to enable DVC to work with
S3-compatible storage backends like MinIO.
Regenerated the lock file to include pyyaml>=6.0.0, which was added to the
dependencies. This resolves the poetry.lock sync issue with pyproject.toml.