Drop certain rows based on quantity of rows with specific values

September 24, 2022

I am newer data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere between 1 row to 12 rows representing their season statistics over however many seasons they’ve played.

Example:

    Player  Season  Pos GP  G   A   P   +/- PIM P/GP    ... PPG PPP SHG SHP OTG GWG S   S%  TOI/GP  FOW%
0   Nathan MacKinnon    2022    1   65  32  56  88  22  42  1.35    ... 7   27  0   0   1   5   299 10.7    21.07   45.4
1   Nathan MacKinnon    2021    1   48  20  45  65  22  37  1.35    ... 8   25  0   0   0   2   206 9.7 20.37   48.5
2   Nathan MacKinnon    2020    1   69  35  58  93  13  12  1.35    ... 12  31  0   0   2   4   318 11.0    21.22   43.1
3   Nathan MacKinnon    2019    1   82  41  58  99  20  34  1.21    ... 12  37  0   0   1   6   365 11.2    22.08   43.7
4   Nathan MacKinnon    2018    1   74  39  58  97  11  55  1.31    ... 12  32  0   1   3   12  284 13.7    19.90   41.9
5   Nathan MacKinnon    2017    1   82  16  37  53  -14 16  0.65    ... 2   14  2   2   2   4   251 6.4 19.95   50.6
6   Nathan MacKinnon    2016    1   72  21  31  52  -4  20  0.72    ... 7   16  0   1   0   6   245 8.6 18.87   48.4
7   Nathan MacKinnon    2015    1   64  14  24  38  -7  34  0.59    ... 3   7   0   0   0   2   192 7.3 17.05   47.0
8   Nathan MacKinnon    2014    1   82  24  39  63  20  26  0.77    ... 8   17  0   0   0   5   241 10.0    17.35   42.9
9   J.T. Compher    2022    2   70  18  15  33  6   25  0.47    ... 4   6   1   1   0   0   102 17.7    16.32   51.4
10  J.T. Compher    2021    2   48  10  8   18  10  19  0.38    ... 1   2   0   0   0   2   47  21.3    14.22   45.9
11  J.T. Compher    2020    2   67  11  20  31  9   18  0.46    ... 1   5   0   3   1   3   106 10.4    16.75   47.7
12  J.T. Compher    2019    2   66  16  16  32  -8  31  0.48    ... 4   9   3   3   0   3   118 13.6    17.48   49.2
13  J.T. Compher    2018    2   69  13  10  23  -29 20  0.33    ... 4   7   2   2   2   3   131 9.9 16.00   45.1
14  J.T. Compher    2017    2   21  3   2   5   0   4   0.24    ... 1   1   0   0   0   1   30  10.0    14.93   47.6
15  Darren Helm 2022    1   68  7   8   15  -5  14  0.22    ... 0   0   1   2   0   1   93  7.5 10.55   44.2
16  Darren Helm 2021    1   47  3   5   8   -3  10  0.17    ... 0   0   0   0   0   0   83  3.6 14.68   66.7
17  Darren Helm 2020    1   68  9   7   16  -6  37  0.24    ... 0   0   1   2   0   0   102 8.8 13.73   53.6
18  Darren Helm 2019    1   61  7   10  17  -11 20  0.28    ... 0   0   1   4   0   0   107 6.5 14.57   44.4
19  Darren Helm 2018    1   75  13  18  31  3   39  0.41    ... 0   0   2   4   0   0   141 9.2 15.57   44.1

[sample of my dataset][1]
[1]: https://i.stack.imgur.com/7CsUd.png

If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened the season and it is causing issues as I work with averages.

As you can see from the screenshot, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games. Due to the shortened season of 2021, he only got 48 games.
Removing Season 2021 results in an Average Games Played of 73.75.
Keeping Season 2021 in the data, the Average Games Played becomes 70.89.

While not drastic, it compounds into the other metrics as well.

I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish this.

I don’t want to just completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years’ worth of data and for those players, I need to use as much data as I can and remove 1 row from a player with only 2 seasons would also negatively skew averages.

I would really appreciate some assistance from anyone more experienced than me!

>Solution :

This can be accomplished by using groupby and apply. For example:

edited_players = (players
                  .groupby("Player")
                  .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)

Round brackets for formatting purposes.

The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc.

The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I’ll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".