How to create new column in Python pandas dataframe with column labels based on the content of those columns

I have a bunch of data files from a scientific instrument that come with checksums. I’ve read the files into a combined pandas dataframe, but each row’s checksum must be compared against the column label of its checksum column, and that label is unique to each file. This results in a dataframe with on the order of 10 checksum columns whose labels and values are very long strings (with values only in the rows that came from the matching file). Now I need to combine these columns into 2 columns: one for the values, and one for the headers the checksums are compared against.

Simplified Example Starting Table, CH=Checksum Header, CV=Checksum Value

CH_A CH_B CH_C
CV_A1 NaN NaN
CV_A2 NaN NaN
NaN CV_B1 NaN
NaN NaN CV_C1

This should combine into 2 columns:


ChecksumHeaders ChecksumValues
CH_A CV_A1
CH_A CV_A2
CH_B CV_B1
CH_C CV_C1

The ‘ChecksumValues’ column was easy enough to get with df.ffill, but I’m struggling to find a good method for creating the ‘ChecksumHeaders’ column. Since 'string'*1 == 'string' and 'string'*0 == '', I tried a number of variations of

np.sum([df[CH].notnull().astype(int)*CH for CH in ListOfHeaders])

but found that the string*int trick only works on a single value at a time; trying it on an array results in errors like:

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U9'), dtype('int32')) -> None

So, to ensure it was applied only one element at a time, I used a nested list comprehension:

df['ChecksumHeaders'] = [
    ''.join([
        ListOfHeaders[idx] * df[CH].notnull().iloc[rowidx].astype(int)
        for idx, CH in enumerate(ListOfHeaders)
    ])
    for rowidx in np.arange(len(df))
]

This works fine if you only have a handful of rows and columns to go through, but takes a long time when it needs to run through ~10 columns and ~100k rows. Surely there’s a better way to do this, any ideas?

>Solution :

IIUC, you can do:

print(
    df.stack()
    .reset_index(level=1)
    .rename(columns={"level_1": "ChecksumHeaders", 0: "ChecksumValues"})
)

Prints:

  ChecksumHeaders ChecksumValues
0            CH_A          CV_A1
1            CH_A          CV_A2
2            CH_B          CV_B1
3            CH_C          CV_C1
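An equivalent route, if you prefer to keep a flat index, is df.melt with dropna — a minimal sketch, assuming the simplified example dataframe from the question (the data here is made up to match it):

```python
import numpy as np
import pandas as pd

# Reconstruct the simplified example table from the question.
df = pd.DataFrame({
    "CH_A": ["CV_A1", "CV_A2", np.nan, np.nan],
    "CH_B": [np.nan, np.nan, "CV_B1", np.nan],
    "CH_C": [np.nan, np.nan, np.nan, "CV_C1"],
})

# melt turns each column label into a row value alongside that
# column's cell value; dropna removes the NaN placeholders so only
# the real checksum values remain.
out = (
    df.melt(var_name="ChecksumHeaders", value_name="ChecksumValues")
    .dropna(subset=["ChecksumValues"])
    .reset_index(drop=True)
)
print(out)
```

Both melt and stack are vectorized, so either should handle ~10 columns by ~100k rows far faster than the per-row list comprehension; note that melt orders the result column by column rather than row by row.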