tidy up a list and remove duplicates

Hello, I want to tidy up this list so that each entry appears only once, with its highest result.
For example, 'M', 'm' and 'O', 'o' are duplicated in the list; for each of them I want to keep only the row with the greatest result.

Here is my list :

list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

and I want this as output:

list_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

The problem is that one row has a dash instead of a result, and I don't know how to sort the list while keeping that row.

Thanks.

>Solution :

The Python package pandas (https://pandas.pydata.org/) is a great way to do this, especially when your 2d list gets very large.

To start we would need to import the package:

import pandas as pd

We can put your 2d list in a pandas DataFrame and give it some sensible column names:

list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
           ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
           ['J', 'j', 'In progress', '-', '-'],
           ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
           ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
           ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
           ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
           ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
           ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]
df = pd.DataFrame(data=list_no_ok, columns=["col1", "col2", "status", "date", "value"])
# '-' is not a number, so errors='coerce' turns it into NaN in the helper column
df["value_num"] = pd.to_numeric(df["value"], errors='coerce')
# highest value first; rows with NaN are placed last by default
df = df.sort_values(by=["value_num"], ascending=False)
df

    col1    col2    status      date                        value   value_num
3   K       k       Finished    1 October 2020 11:32 AM     10.00   10.00
5   M       m       Finished    1 October 2020 2:20 PM      9.10    9.10
0   A       a       Finished    30 September 2020 12:46 PM  8.08    8.08
6   N       n       Finished    1 October 2020 3:38 PM      7.69    7.69
4   M       m       Finished    1 October 2020 2:12 PM      6.15    6.15
8   O       o       Finished    1 October 2020 5:37 PM      5.90    5.90
7   O       o       Finished    1 October 2020 5:06 PM      4.87    4.87
1   B       b       Finished    30 September 2020 2:52 PM   3.46    3.46
2   J       j       In progress -                           -       NaN

Here we have put the 2d list in a DataFrame, given it column names, and converted the value column from strings to numbers in a helper column, value_num. The conversion matters because strings sort alphabetically, which would put '10.00' before '3.46'. We pass errors='coerce' so that the '-' of rows that are still in progress becomes NaN instead of raising an error; the original value column is left untouched. Finally we have sorted from high to low on value_num, with the NaN rows at the end.

We sorted on the value so that we can make use of the drop_duplicates function of pandas:

df = df.drop_duplicates(['col1','col2'], keep='first')
df = df.sort_values(by=['col1','col2'])
# the helper column is no longer needed
df = df.drop(columns=['value_num'])
df

    col1    col2    status      date                         value
0   A       a       Finished    30 September 2020 12:46 PM   8.08
1   B       b       Finished    30 September 2020 2:52 PM    3.46
2   J       j       In progress -                            -
3   K       k       Finished    1 October 2020 11:32 AM      10.00
5   M       m       Finished    1 October 2020 2:20 PM       9.10
6   N       n       Finished    1 October 2020 3:38 PM       7.69
8   O       o       Finished    1 October 2020 5:37 PM       5.90

Rows get dropped when they have identical col1 and col2. Whenever this happens, pandas keeps the first occurrence. In this case that is the row where the value is the highest, since we sorted on the value from high to low before we dropped rows.
We get back to the order of your list_ok by sorting again on col1 and col2.
If you really want to go back to a 2d list, that can be done with:

list_ok = df.values.tolist()
list_ok

[['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
 ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
 ['J', 'j', 'In progress', '-', '-'],
 ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
 ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
 ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
 ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]
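
If the list is small and you would rather not depend on pandas, the same result can probably be reached with a plain dictionary. This is only a minimal sketch of a different technique, not the pandas approach above: the function name keep_highest and the helper score are made up for illustration, and it assumes the result is always the last field and that '-' is the only non-numeric value you want ranked lowest.

```python
def keep_highest(rows):
    """Keep, for each (first, second) field pair, the row with the highest result."""

    def score(row):
        # Assumption: the last field holds the result. A value such as '-'
        # (still in progress) is not a number, so it ranks below every result.
        try:
            return float(row[-1])
        except ValueError:
            return float("-inf")

    best = {}  # (first, second) -> best row so far; dicts keep insertion order
    for row in rows:
        key = (row[0], row[1])
        if key not in best or score(row) > score(best[key]):
            best[key] = row
    return list(best.values())


list_no_ok = [['A', 'a', 'Finished', '30 September 2020  12:46 PM', '8.08'],
              ['B', 'b', 'Finished', '30 September 2020  2:52 PM', '3.46'],
              ['J', 'j', 'In progress', '-', '-'],
              ['K', 'k', 'Finished', '1 October 2020  11:32 AM', '10.00'],
              ['M', 'm', 'Finished', '1 October 2020  2:12 PM', '6.15'],
              ['M', 'm', 'Finished', '1 October 2020  2:20 PM', '9.10'],
              ['N', 'n', 'Finished', '1 October 2020  3:38 PM', '7.69'],
              ['O', 'o', 'Finished', '1 October 2020  5:06 PM', '4.87'],
              ['O', 'o', 'Finished', '1 October 2020  5:37 PM', '5.90']]

list_ok = keep_highest(list_no_ok)
```

Because a dictionary keeps its keys in the order they were first inserted, the rows come back in their original order and no second sort is needed.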