How to create new column in Python pandas dataframe with column labels based on the content of those columns

I have a bunch of data files from a scientific instrument that come with checksums. I’ve read the files into a combined pandas dataframe, but the checksum for each row is for comparison to the column label of the checksum column and is unique to each file. This results in a dataframe with O10 columns with very large strings in the column labels and column values (for the rows from the file with that column label). Now I need to combine these O10 columns into 2 columns, one for the values, and one for the headers the checksums are compared against.

Simplified Example Starting Table, CH=Checksum Header, CV=Checksum Value

CH_A CH_B CH_C
CV_A1 NaN NaN
CV_A2 NaN NaN
NaN CV_B1 NaN
NaN NaN CV_C1

Should combine to be 2 columns

ChecksumHeaders ChecksumValues
CH_A CV_A1
CH_A CV_A2
CH_B CV_B1
CH_C CV_C1

The ‘ChecksumValues’ table was easy enough to get with df.ffill, but I’m struggling to determine a good method for creating the ‘ChecksumHeaders’ column. Seeing as how 'string'*1 =='string' and 'string'*0 ==''. I tied using a number of variations of np.sum([df[CH].notnull().astype(np.int)*CH for CH in ListOfHeaders]) but found that the string*int only works as a single equation and trying to do an array results in errors like:
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U9'), dtype('int32')) -> None

So to ensure it was only 1 at a time I used nested list comprehension:
df['ChecksumHeaders'] = [''.join([ListOfHeaders[idx]*df[CH].notnull().iloc[rowidx].astype(np.int) for idx,CH in enumerate(ListOfHeaders)]) for rowidx in np.arange(len(df))] which works fine if you only have a handful of rows and columns to go through, but takes a long time when it needs to run through O10 columns and O100k rows. Surely there’s a better way to do this, any ideas?

>Solution :

IIUC, you can do:

print(
    df.stack()
    .reset_index(level=1)
    .rename(columns={"level_1": "ChecksumHeaders", 0: "ChecksumValues"})
)

Prints:

  ChecksumHeaders ChecksumValues
0            CH_A          CV_A1
1            CH_A          CV_A2
2            CH_B          CV_B1
3            CH_C          CV_C1

Leave a Reply