Tidy up a list and remove duplicates

November 28, 2021

Hello, I want to tidy up this list so that only the highest result is kept for each entry.
For example, 'M', 'm' and 'O', 'o' each appear twice in the list. For each of them I want to keep only the row with the highest result.

Here is my list :

list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

and I want this as output:

list_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

The problem is that one row has a dash instead of a result, and I don't know how to sort the list while keeping that row.

Thanks.

>Solution :

The Python package pandas (https://pandas.pydata.org/) is a great way to do this, especially when your 2d list gets very large.

To start we would need to import the package:

import pandas as pd

We can put your 2d list in a pandas DataFrame and give it some sensible column names:

list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]
df = pd.DataFrame(data=list_no_ok, columns=["col1", "col2", "status", "date", "value"])
# Sort on the numeric value, highest first; '-' cannot be parsed, becomes NaN and goes last
df = df.sort_values(by="value", key=lambda s: pd.to_numeric(s, errors="coerce"),
                    ascending=False, na_position="last")
df

    col1    col2    status      date                        value
3   K       k       Finished    1 October 2020 11:32 AM     10.00
5   M       m       Finished    1 October 2020 2:20 PM      9.10
0   A       a       Finished    30 September 2020 12:46 PM  8.08
6   N       n       Finished    1 October 2020 3:38 PM      7.69
4   M       m       Finished    1 October 2020 2:12 PM      6.15
8   O       o       Finished    1 October 2020 5:37 PM      5.90
7   O       o       Finished    1 October 2020 5:06 PM      4.87
1   B       b       Finished    30 September 2020 2:52 PM   3.46
2   J       j       In progress -                           -

Here we put the 2d list in a DataFrame, gave it column names, and sorted it from high to low on value. The value column holds strings, so the sort key converts it to numbers with pd.to_numeric. With errors='coerce' the dash of the row that is still in progress becomes NaN, and na_position='last' puts that row at the end. We use 'coerce' rather than 'ignore' because 'ignore' hands back the original strings as soon as one value cannot be parsed, and the sort would then be alphabetical ('10.00' before '3.46'). The column itself is not modified, so the dash is preserved.

We sorted on value so that we can make use of the drop_duplicates function of pandas:

df = df.drop_duplicates(['col1','col2'],keep='first')
df = df.sort_values(by=['col1','col2'])
df
    col1    col2    status      date                         value
0   A       a       Finished    30 September 2020 12:46 PM   8.08
1   B       b       Finished    30 September 2020 2:52 PM    3.46
2   J       j       In progress -                            -
3   K       k       Finished    1 October 2020 11:32 AM      10.00
5   M       m       Finished    1 October 2020 2:20 PM       9.10
6   N       n       Finished    1 October 2020 3:38 PM       7.69
8   O       o       Finished    1 October 2020 5:37 PM       5.90

Rows get dropped when they have identical col1 and col2. Whenever this happens, the first occurrence is kept. In this case that is the row where value is the highest, since we sorted on value from high to low before dropping rows. The row with the dash has no duplicate, so it is kept as well.
Sorting again on col1 and col2 brings us back to the order of your list_ok.
If you really want to go back to a 2d list, that can be done with:

list_ok = df.values.tolist()
list_ok
[['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
 ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
 ['J', 'j', 'In progress', '-', '-'],
 ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
 ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
 ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
 ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]
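
If you would rather not depend on pandas, the same idea can be sketched in plain Python with a dictionary keyed on the first two elements. This is only a minimal sketch: keep_highest and score are names made up for illustration, and it assumes that the result is always the fifth element and that anything that is not a number (like the dash) should rank below every real result.

```python
list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
              ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
              ['J', 'j', 'In progress', '-', '-'],
              ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
              ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
              ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
              ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
              ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
              ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]


def score(row):
    """Numeric result of a row; non-numeric results rank below every number."""
    try:
        return float(row[4])
    except ValueError:
        # Assumption: a '-' (still in progress) loses against any real result.
        return float("-inf")


def keep_highest(rows):
    """Keep one row per (first, second) pair, preferring the highest result."""
    best = {}  # dicts keep insertion order (Python 3.7+)
    for row in rows:
        key = (row[0], row[1])
        # Store the row if the key is new or this row beats the stored one.
        if key not in best or score(row) > score(best[key]):
            best[key] = row
    return list(best.values())


list_ok = keep_highest(list_no_ok)
```

Because a dictionary keeps its keys in the order they were first inserted, the rows come out in the order of their first appearance, so no extra sort is needed for this input. The rows are also never converted, so the dash and the original strings stay exactly as they were.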