I have a bunch of data files from a scientific instrument that come with checksums. I’ve read the files into a combined pandas dataframe, but the checksum for each row is for comparison to the column label of the checksum column and is unique to each file. This results in a dataframe with O10 columns with very large strings in the column labels and column values (for the rows from the file with that column label). Now I need to combine these O10 columns into 2 columns, one for the values, and one for the headers the checksums are compared against.
Simplified Example Starting Table, CH=Checksum Header, CV=Checksum Value
| CH_A | CH_B | CH_C |
|---|---|---|
| CV_A1 | NaN | NaN |
| CV_A2 | NaN | NaN |
| NaN | CV_B1 | NaN |
| NaN | NaN | CV_C1 |
Should combine to be 2 columns
| ChecksumHeaders | ChecksumValues |
|---|---|
| CH_A | CV_A1 |
| CH_A | CV_A2 |
| CH_B | CV_B1 |
| CH_C | CV_C1 |
The ‘ChecksumValues’ table was easy enough to get with df.ffill, but I’m struggling to determine a good method for creating the ‘ChecksumHeaders’ column. Seeing as how 'string'*1 =='string' and 'string'*0 ==''. I tied using a number of variations of np.sum([df[CH].notnull().astype(np.int)*CH for CH in ListOfHeaders]) but found that the string*int only works as a single equation and trying to do an array results in errors like:
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U9'), dtype('int32')) -> None
So to ensure it was only 1 at a time I used nested list comprehension:
df['ChecksumHeaders'] = [''.join([ListOfHeaders[idx]*df[CH].notnull().iloc[rowidx].astype(np.int) for idx,CH in enumerate(ListOfHeaders)]) for rowidx in np.arange(len(df))] which works fine if you only have a handful of rows and columns to go through, but takes a long time when it needs to run through O10 columns and O100k rows. Surely there’s a better way to do this, any ideas?
>Solution :
IIUC, you can do:
print(
df.stack()
.reset_index(level=1)
.rename(columns={"level_1": "ChecksumHeaders", 0: "ChecksumValues"})
)
Prints:
ChecksumHeaders ChecksumValues
0 CH_A CV_A1
1 CH_A CV_A2
2 CH_B CV_B1
3 CH_C CV_C1