How to create new column in Python pandas dataframe with column labels based on the content of those columns

I have a bunch of data files from a scientific instrument that come with checksums. I’ve read the files into a combined pandas dataframe, but each row’s checksum must be compared against the column label of its checksum column, and that label is unique to each file. This results in a dataframe with on the order of 10 checksum columns whose labels and values are very long strings (with values only in the rows that came from the matching file). Now I need to combine these columns into 2 columns: one for the values, and one for the headers the checksums are compared against.

Simplified Example Starting Table, CH=Checksum Header, CV=Checksum Value

CH_A CH_B CH_C
CV_A1 NaN NaN
CV_A2 NaN NaN
NaN CV_B1 NaN
NaN NaN CV_C1

This should combine into 2 columns:


ChecksumHeaders ChecksumValues
CH_A CV_A1
CH_A CV_A2
CH_B CV_B1
CH_C CV_C1

The ‘ChecksumValues’ column was easy enough to get with df.ffill, but I’m struggling to find a good method for creating the ‘ChecksumHeaders’ column. Since 'string'*1 == 'string' and 'string'*0 == '', I tried a number of variations of

np.sum([df[CH].notnull().astype(int)*CH for CH in ListOfHeaders])

but found that the string*int trick only works on a single value at a time; trying it on an array results in errors like:

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U9'), dtype('int32')) -> None

So, to ensure it was applied only one element at a time, I used a nested list comprehension:

df['ChecksumHeaders'] = [
    ''.join([
        ListOfHeaders[idx] * df[CH].notnull().iloc[rowidx].astype(int)
        for idx, CH in enumerate(ListOfHeaders)
    ])
    for rowidx in np.arange(len(df))
]

This works fine if you only have a handful of rows and columns to go through, but takes a long time when it needs to run through ~10 columns and ~100k rows. Surely there’s a better way to do this, any ideas?

>Solution :

IIUC, you can do:

print(
    df.stack()
    .reset_index(level=1)
    .rename(columns={"level_1": "ChecksumHeaders", 0: "ChecksumValues"})
)

Prints:

  ChecksumHeaders ChecksumValues
0            CH_A          CV_A1
1            CH_A          CV_A2
2            CH_B          CV_B1
3            CH_C          CV_C1
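An equivalent route, if you prefer to keep a flat index, is df.melt with dropna — a minimal sketch, assuming the simplified example dataframe from the question (the data here is made up to match it):

```python
import numpy as np
import pandas as pd

# Reconstruct the simplified example table from the question.
df = pd.DataFrame({
    "CH_A": ["CV_A1", "CV_A2", np.nan, np.nan],
    "CH_B": [np.nan, np.nan, "CV_B1", np.nan],
    "CH_C": [np.nan, np.nan, np.nan, "CV_C1"],
})

# melt turns each column label into a row value alongside that
# column's cell value; dropna removes the NaN placeholders so only
# the real checksum values remain.
out = (
    df.melt(var_name="ChecksumHeaders", value_name="ChecksumValues")
    .dropna(subset=["ChecksumValues"])
    .reset_index(drop=True)
)
print(out)
```

Both melt and stack are vectorized, so either should handle ~10 columns by ~100k rows far faster than the per-row list comprehension; note that melt orders the result column by column rather than row by row.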