Remove map_wins features - they contain match outcome data

The map_wins_1 and map_wins_2 columns represent maps won DURING
the current match, not historical performance. This is data leakage,
because these values are only known during or after the match.

Now using only truly pre-match features:
- rank_1, rank_2: Team rankings before match
- starting_ct: Which team starts CT side
- rank_diff: Derived ranking difference (rank_1 - rank_2)

This should finally give realistic model performance based solely
on information available before the match begins.
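
One way to keep this kind of leakage from creeping back in is a small
guard that rejects known post-match columns. The sketch below is only an
illustration, not part of this commit: the column names come from the
diff, while LEAKY_COLUMNS, assert_no_leakage and derive_features are
hypothetical helpers, and a match is assumed to be a plain dict.

```python
# Hypothetical guard: column names are taken from this commit,
# the helpers themselves are illustrative assumptions.
LEAKY_COLUMNS = frozenset({
    "result_1", "result_2",          # round scores
    "ct_1", "t_1", "ct_2", "t_2",    # per-side round scores
    "map_wins_1", "map_wins_2",      # maps won in the current match
    "map_wins_diff",                 # derived from the two columns above
})

PRE_MATCH_COLUMNS = ("starting_ct", "rank_1", "rank_2")


def assert_no_leakage(feature_columns):
    """Raise ValueError if any column is only known during/after the match."""
    leaked = sorted(set(feature_columns) & LEAKY_COLUMNS)
    if leaked:
        raise ValueError(f"Post-match columns in feature set: {leaked}")
    return list(feature_columns)


def derive_features(row):
    """Build the pre-match feature dict for one match record (a plain dict)."""
    features = {name: row[name] for name in PRE_MATCH_COLUMNS}
    features["rank_diff"] = row["rank_1"] - row["rank_2"]
    assert_no_leakage(features)  # iterating a dict yields its keys
    return features
```

The guard only matches exact names, so any new column derived from
match results would have to be added to LEAKY_COLUMNS by hand. With a
pandas DataFrame, passing features.columns to assert_no_leakage should
work the same way.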

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Alexis Bruneteau 2025-10-01 20:17:07 +02:00
parent efaf5ff0e1
commit 6995102d76

@@ -22,16 +22,16 @@ def load_raw_data():
 def engineer_features(df):
     """Create features for match prediction"""
     # Only use features that would be known BEFORE the match starts
-    # Removing result_1, result_2, ct_1, t_2, t_1, ct_2 (data leakage!)
+    # Removing ALL match outcome features (data leakage):
+    # - result_1, result_2, ct_1, t_2, t_1, ct_2 (round scores)
+    # - map_wins_1, map_wins_2 (maps won in THIS match, not historical)
     features = df[[
         'starting_ct',  # Which team starts as CT (known before match)
         'rank_1', 'rank_2',  # Team rankings (known before match)
-        'map_wins_1', 'map_wins_2'  # Historical map performance (known before match)
     ]].copy()
     # Engineered features based on pre-match information
     features['rank_diff'] = features['rank_1'] - features['rank_2']
-    features['map_wins_diff'] = features['map_wins_1'] - features['map_wins_2']
     # Target: match_winner (1 or 2) -> convert to 0 or 1
     target = df['match_winner'] - 1