Pandas: Check in which month's filename unique id was last seen

I have 12 files:

['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']

My CSV file structure:

sampleCSVFile in my mypath:


id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98

Each file has an "id" column with unique values. I want to scan across all the files to determine in which file name a given id was last seen. Could someone help with this?

My Code so far:

import os
import pandas as pd

#get your working directory and target folder that contains all your files
path = os.path.join(os.getcwd(),'folder')

files = [os.path.join(path,i) for i in os.listdir(path) if os.path.isfile(os.path.join(path,i))]
# drop macOS's .DS_Store if it is present (path + '.DS_Store' would not
# match any entry, since the list holds full joined paths)
files = [f for f in files if not f.endswith('.DS_Store')]
files.sort()

# I'm stuck here: the code below adds the file name as a column, but it
# stacks rows from all the files into one data frame. How should I find,
# for each unique id, the file name in which it was last seen? In this
# example the months span 2020 and 2021.
frames = []
for file in files:
    _df = pd.read_csv(file)
    _df['fileName'] = os.path.split(file)[-1]
    frames.append(_df)

# DataFrame.append was removed in pandas 2.0, so collect the frames in a
# list and concatenate once
df = pd.concat(frames, ignore_index=True)

Expected finalFile.csv format:

    id      lastSeen
    1       06_2021
    2       12_2020 
    3       10_2021
    ...
    45000   07_2021

Thanks in advance for any help with this!

>Solution :

Try:

  1. read only the necessary ("id") column with pd.read_csv and insert a "lastSeen" column holding the file name.
  2. concatenate the per-file DataFrames into one master DataFrame with pd.concat.
  3. use pd.to_datetime to convert the file names to dates.
  4. group by "id" and keep only the rows where the date column is at its maximum.

path = os.path.join(".", "folder")
files = [f for f in os.listdir(path) if f.endswith(".csv")]

frames = []
for file in files:
    temp = pd.read_csv(os.path.join(path, file), usecols=[0])
    temp["lastSeen"] = file.replace(".csv", "")
    frames.append(temp)

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
master = pd.concat(frames, ignore_index=True)

master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")    
output = master[master["date"]==master.groupby("id")["date"].transform("max")].drop("date", axis=1)

>>> output
   id lastSeen
0   1  01_2021
2   2  02_2021
3   3  02_2021
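As an alternative to the groupby/transform step, sorting by date and keeping the last row per id gives the same result. The sketch below uses hypothetical in-memory frames as stand-ins for the monthly CSVs, assuming the same "id" column and MM_YYYY file names as in the question:

```python
import pandas as pd

# Hypothetical stand-ins for the monthly CSV files
monthly = {
    "12_2020": pd.DataFrame({"id": [1, 2]}),
    "01_2021": pd.DataFrame({"id": [1, 3]}),
    "06_2021": pd.DataFrame({"id": [1]}),
}

# Tag each frame with its source file name, then stack them
frames = []
for name, df in monthly.items():
    temp = df[["id"]].copy()
    temp["lastSeen"] = name
    frames.append(temp)
master = pd.concat(frames, ignore_index=True)

# Sort chronologically and keep the latest occurrence of each id
master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")
output = (
    master.sort_values("date")
          .drop_duplicates("id", keep="last")
          .drop(columns="date")
          .sort_values("id")
          .reset_index(drop=True)
)
```

Here id 1 resolves to "06_2021", id 2 to "12_2020", and id 3 to "01_2021". drop_duplicates(keep="last") avoids the transform("max") comparison, at the cost of an explicit sort.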