Calculating Winning or Losing Streak of Players in Python DataFrame
Problem Description
In this article, we discuss how to calculate the winning or losing streak of players in a tennis match DataFrame. The DataFrame has the columns tourney_date, player1_id, player2_id, and target. The target column records whether player 1 won (1) or lost (0).
Introduction
We want each player's winning or losing streak going into every match. The approach is to walk through the DataFrame row by row, keep a running streak counter per player that is updated from target, and finally join the recorded streaks back onto the original DataFrame.
Problem Context
We are working with tennis_data_processed, shown below as a plain dictionary that Step 1 converts into a DataFrame. It contains the columns tourney_date, player1_id, player2_id, and target. The target column indicates whether player 1 won (1) or lost (0).
tennis_data_processed = {
    'tourney_date': ['2022-10-31', '2023-02-06', '2023-02-06', '2023-02-06', '2023-02-06', '2023-02-20', '2023-02-20'],
    'player1_id': [100000, 123456, 100000, 100000, 345612, 432154, 100000],
    'player2_id': [209950, 100000, 543210, 876543, 100000, 100000, 929292],
    'target': [0, 0, 1, 1, 1, 1, 0]
}
Requirements and Assumptions
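We assume that each ID identifies exactly one player, whether it appears in player1_id or player2_id, so the same player always maps to the same streak counter. We also assume the rows are already in chronological order, because a streak depends on the order in which matches are processed. Both assumptions are crucial for calculating streaks correctly.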
Step-by-Step Solution
Step 1: Data Preparation
First, we import the necessary libraries: pandas (for data manipulation) and numpy (for numerical operations).
import pandas as pd
import numpy as np
# For convenience: df instead of tennis_data_processed
df = pd.DataFrame(tennis_data_processed)
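Because a streak depends on the order in which matches are processed, it can help to make the chronological order explicit before going further. The following is a minimal sketch, assuming the dates are ISO-formatted strings as in the sample data; the shuffled matches frame is hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical, deliberately shuffled sample; column names follow the article.
matches = pd.DataFrame({
    "tourney_date": ["2023-02-20", "2022-10-31", "2023-02-06"],
    "player1_id": [100000, 100000, 123456],
    "player2_id": [929292, 209950, 100000],
    "target": [0, 0, 0],
})

# Parse the strings into timestamps so the sort is chronological by type.
matches["tourney_date"] = pd.to_datetime(matches["tourney_date"])

# A stable sort keeps the existing relative order of matches that share a date.
# Resetting the index keeps a later index-based join aligned with the rows.
matches = matches.sort_values("tourney_date", kind="stable").reset_index(drop=True)

print(matches)
```

Matches played on the same date keep whatever order they had in the input, so if the order within a tournament matters, an additional sort key (for example a round or match number, if the data has one) would be needed.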
Step 2: Initialize Dictionary to Track Streaks
We create a dictionary that tracks the current streak for each player, starting every player at 0. df.filter(like="player") selects the columns whose names contain "player" (here player1_id and player2_id), np.unique returns the sorted distinct values across those columns, and dict.fromkeys maps each of those IDs to 0.
cnt = dict.fromkeys(np.unique(df.filter(like="player")), 0)
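To make the shape of this dictionary concrete, here is a small pure-Python sketch of the same idea. The two short lists are a hypothetical miniature of the ID columns, and cnt_sketch is an illustrative name, not part of the solution.

```python
# Hypothetical miniature of the two ID columns (plain lists for illustration).
player1_ids = [100000, 123456, 100000]
player2_ids = [209950, 100000, 543210]

# Collect every distinct ID from both columns, then start each streak at 0.
all_ids = sorted(set(player1_ids) | set(player2_ids))
cnt_sketch = dict.fromkeys(all_ids, 0)

print(cnt_sketch)
# {100000: 0, 123456: 0, 209950: 0, 543210: 0}
```

A player who appears in both columns, such as 100000 here, gets a single entry, which is what lets one counter follow that player across matches regardless of which side they are listed on.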
Step 3: Calculate Streaks for Each Player
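We iterate through the rows in order. For each match we first record both players' current streaks, so the stored values describe each player's form going into that match, and then update the counters from target. A positive count is a run of consecutive wins and a negative count is a run of consecutive losses; a result that breaks the run restarts the count at +1 or -1.
streaks = []
for p1, p2, t in df[["player1_id", "player2_id", "target"]].to_numpy():
    # Record both streaks as they stood before this match
    streaks.append([cnt[p1], cnt[p2]])
    # target == 1 means player 1 won; otherwise player 2 won
    winner, loser = (p1, p2) if t == 1 else (p2, p1)
    # A win extends a winning run or restarts the count at +1
    cnt[winner] = max(cnt[winner], 0) + 1
    # A loss extends a losing run or restarts the count at -1
    cnt[loser] = min(cnt[loser], 0) - 1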
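The update rule for a single player can be isolated and traced by hand. The sketch below assumes the convention that positive values count consecutive wins and negative values count consecutive losses; next_streak is a hypothetical helper introduced only for illustration.

```python
def next_streak(current: int, won: bool) -> int:
    """Return a player's streak after one more match (hypothetical helper)."""
    if won:
        # Extend a winning run, or start a new one at +1 after a loss or no history.
        return max(current, 0) + 1
    # Extend a losing run, or start a new one at -1 after a win or no history.
    return min(current, 0) - 1


# Trace one player through the results: win, win, loss, loss, win.
streak = 0
history = []
for won in [True, True, False, False, True]:
    streak = next_streak(streak, won)
    history.append(streak)

print(history)  # [1, 2, -1, -2, 1]
```

Clamping the current value at 0 before adding or subtracting 1 is what makes a result of the opposite sign reset the run instead of merely shrinking it.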
Step 4: Join Streak Information with Original DataFrame
Finally, we join the streak information with the original DataFrame.
out = df.join(pd.DataFrame(streaks, columns=["player1_streak", "player2_streak"]))
The resulting out DataFrame contains the original columns plus two new columns, player1_streak and player2_streak, which hold each player's streak going into that match.
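For reference, the whole pipeline can be wrapped in one function and run on the sample data. This is a sketch under the stated assumptions (rows already in chronological order, target equal to 1 when player 1 won). The add_streaks name is hypothetical, and collections.defaultdict is swapped in for the pre-built dictionary so that unseen players start at 0 without collecting the IDs first.

```python
from collections import defaultdict

import pandas as pd


def add_streaks(df: pd.DataFrame) -> pd.DataFrame:
    """Append pre-match streak columns (hypothetical helper, not from the article)."""
    cnt = defaultdict(int)  # players not seen yet start at 0
    streaks = []
    for p1, p2, t in zip(df["player1_id"], df["player2_id"], df["target"]):
        # Snapshot both streaks before the current match updates them.
        streaks.append([cnt[p1], cnt[p2]])
        winner, loser = (p1, p2) if t == 1 else (p2, p1)
        cnt[winner] = max(cnt[winner], 0) + 1
        cnt[loser] = min(cnt[loser], 0) - 1
    cols = ["player1_streak", "player2_streak"]
    # Reuse the caller's index so the join aligns row by row.
    return df.join(pd.DataFrame(streaks, columns=cols, index=df.index))


sample = pd.DataFrame({
    "tourney_date": ["2022-10-31", "2023-02-06", "2023-02-06", "2023-02-06",
                     "2023-02-06", "2023-02-20", "2023-02-20"],
    "player1_id": [100000, 123456, 100000, 100000, 345612, 432154, 100000],
    "player2_id": [209950, 100000, 543210, 876543, 100000, 100000, 929292],
    "target": [0, 0, 1, 1, 1, 1, 0],
})

out = add_streaks(sample)
print(out[["player1_id", "player2_id", "target", "player1_streak", "player2_streak"]])
```

On this sample, player 100000 loses the first match, wins the next three, and therefore enters the fifth match with a streak of 3 in player2_streak. After losing the fifth and sixth matches, the same player enters the last match with a streak of -2 in player1_streak.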
Last modified on 2024-09-04