Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Dataframe sort and remove on date

I have the following data frame

import pandas as pd
from pandas import Timestamp

df=pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03}, 'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85}, 
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8}, 
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37}, 
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04}, 
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08}, 
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62}, 
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77}, 
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})

I scrape these from a website several times a day
I am looking for a way to clean the dataframe, so that for each day only the latest entry is kept.
So for this dataframe 2022-07-12 has 5 entries for 2027-07-12 but I want to keep the last one i.e. 2022-07-12 15:59:31
The entries on the previous day are made already okay manually 🙁
I intent to do this once a month so each day has several entries

I already tried

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')

But that gives me al the records back because the time is different

Any one an idea how to do this?

>Solution :

You can use .max() with datetime columns like this:

dfclean = df.loc[
    (df['Datum'].dt.date < df['Datum'].max().date()) | 
    (df['Datum'] == df['Datum'].max())
]

Output:

  Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds  \
0                    63.57                80.22               44.80   
1                    63.57                80.22                44.8   
6                    61.03                79.85                44.8   

  Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie  \
0              50.43               70.20                    46.03   
1              50.43                70.2                    46.03   
6              50.37               70.04                    46.08   

  Tactische Strategie Aandelen Groei Strategie               Datum  
0               48.69                    52.91 2022-07-08 18:00:00  
1               48.69                    52.91 2022-07-11 19:42:55  
6               48.62                    52.77 2022-07-12 15:59:31  
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading