Pandas: Check in which month's filename unique id was last seen

I have 12 files:

['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']

My CSV file structure:

sampleCSVFile in my mypath:


id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98

Each file has an "id" column with unique values. I want to scan across all the files to determine in which file name a given id was last seen. Could someone help with this?

My Code so far:

import os
import pandas as pd

#get your working directory and target folder that contains all your files
path = os.path.join(os.getcwd(),'folder')

files = [os.path.join(path,i) for i in os.listdir(path) if os.path.isfile(os.path.join(path,i))]
# drop macOS's .DS_Store if it is present (path + '.DS_Store' would not
# match any entry, since the list holds full joined paths)
files = [f for f in files if not f.endswith('.DS_Store')]
files.sort()

# I'm stuck here: the code below adds the file name as a column, but it
# stacks rows from all the files into one data frame. How should I find,
# for each unique id, the file name in which it was last seen? In this
# example the months span 2020 and 2021.
frames = []
for file in files:
    _df = pd.read_csv(file)
    _df['fileName'] = os.path.split(file)[-1]
    frames.append(_df)

# DataFrame.append was removed in pandas 2.0, so collect the frames in a
# list and concatenate once
df = pd.concat(frames, ignore_index=True)

Expected finalFile.csv format:

    id      lastSeen
    1       06_2021
    2       12_2020 
    3       10_2021
    ...
    45000   07_2021

Thanks in advance for any help with this!

>Solution :

Try:

  1. read only the necessary ("id") column with pd.read_csv and insert a "lastSeen" column holding the file name.
  2. concatenate the per-file DataFrames into one master DataFrame with pd.concat.
  3. use pd.to_datetime to convert the file names to dates.
  4. group by "id" and keep only the rows where the date column is at its maximum.

path = os.path.join(".", "folder")
files = [f for f in os.listdir(path) if f.endswith(".csv")]

frames = []
for file in files:
    temp = pd.read_csv(os.path.join(path, file), usecols=[0])
    temp["lastSeen"] = file.replace(".csv", "")
    frames.append(temp)

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
master = pd.concat(frames, ignore_index=True)

master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")    
output = master[master["date"]==master.groupby("id")["date"].transform("max")].drop("date", axis=1)

>>> output
   id lastSeen
0   1  01_2021
2   2  02_2021
3   3  02_2021
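As an alternative to the groupby/transform step, sorting by date and keeping the last row per id gives the same result. The sketch below uses hypothetical in-memory frames as stand-ins for the monthly CSVs, assuming the same "id" column and MM_YYYY file names as in the question:

```python
import pandas as pd

# Hypothetical stand-ins for the monthly CSV files
monthly = {
    "12_2020": pd.DataFrame({"id": [1, 2]}),
    "01_2021": pd.DataFrame({"id": [1, 3]}),
    "06_2021": pd.DataFrame({"id": [1]}),
}

# Tag each frame with its source file name, then stack them
frames = []
for name, df in monthly.items():
    temp = df[["id"]].copy()
    temp["lastSeen"] = name
    frames.append(temp)
master = pd.concat(frames, ignore_index=True)

# Sort chronologically and keep the latest occurrence of each id
master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")
output = (
    master.sort_values("date")
          .drop_duplicates("id", keep="last")
          .drop(columns="date")
          .sort_values("id")
          .reset_index(drop=True)
)
```

Here id 1 resolves to "06_2021", id 2 to "12_2020", and id 3 to "01_2021". drop_duplicates(keep="last") avoids the transform("max") comparison, at the cost of an explicit sort.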