Pandas : Create new column based on text value of another column

May 16, 2022

This might be very simple question, but here’s my dataframe:

    id      text                    position        labels
0   39088   skin melanoma           [58.0, 71.0]    indication
1   39088   proteinase              [137.0, 147.0]  protein
2   39088   plasminogen activator   [170.0, 191.0]  protein
3   39088   NaN                     [nan, nan]      NaN
4   39088   NaN                     [nan, nan]      NaN
5   39088   proteinase substrates   [36.0, 57.0]    protein
6   39088   tumors                  [67.0, 73.0]    indication
7   39088   NaN                     [nan, nan]      NaN
8   39088   Melanoma                [0.0, 8.0]      indication
9   39088   EDTA                    [172.0, 176.0]  protein

{'pmid': [39088,
  39088,
  39088,
  39088,
  39088,
  39088,
  39088,
  39088,
  39088,
  39088],
 'text': ['skin melanoma',
  'proteinase',
  'plasminogen activator',
  nan,
  nan,
  'proteinase substrates',
  'tumors',
  nan,
  'Melanoma',
  'EDTA'],
 'position': ['[58.0, 71.0]',
  '[137.0, 147.0]',
  '[170.0, 191.0]',
  '[nan, nan]',
  '[nan, nan]',
  '[36.0, 57.0]',
  '[67.0, 73.0]',
  '[nan, nan]',
  '[0.0, 8.0]',
  '[172.0, 176.0]'],
 'labels': ['indication',
  'protein',
  'protein',
  nan,
  nan,
  'protein',
  'indication',
  nan,
  'indication',
  'protein']}

And here’s the WANTED OUTPUT , where I want to create 2 new columns based on the values of the labels column, and put as values the corresponding text and position depending on whether they are indication or protein, and for the rest NaN.

    id      indication     indication.position       protein                 protein.position 
0   39088   skin melanoma   [58.0, 71.0]             NaN                     [nan, nan]
1   39088   NaN             [nan, nan]               proteinase              [137.0, 147.0]
2   39088   NaN             [nan, nan]               plasminogen activator   [170.0, 191.0]  
3   39088   NaN             [nan, nan]               NaN                     [nan, nan]
4   39088   NaN             [nan, nan]               NaN                     [nan, nan]
5   39088   NaN             [nan, nan]               proteinase substrates   [36.0, 57.0] 
6   39088   tumors          [67.0, 73.0]             NaN                     [nan, nan]
7   39088   NaN             [nan, nan]               NaN                     [nan, nan]
8   39088   Melanoma                [0.0, 8.0]       NaN                     [nan, nan]
9   39088   NaN             [nan, nan]               EDTA                    [172.0, 176.0]

What is the best way to do this? Can someone help?

>Solution :

You can use:

out = (df
   .drop(columns=['position', 'result.value.labels'])
   .join(
 df.reset_index().dropna(subset=['result.value.labels'])
   .pivot(index='index', columns='result.value.labels', values='position')
   .reindex(df.index)
   .fillna('[nan, nan]')
   .add_suffix('.position')
   )
)

output:

    pmid      result.value.text indication.position protein.position
0  39088          skin melanoma        [58.0, 71.0]       [nan, nan]
1  39088             proteinase          [nan, nan]   [137.0, 147.0]
2  39088  plasminogen activator          [nan, nan]   [170.0, 191.0]
3  39088                    NaN          [nan, nan]       [nan, nan]
4  39088                    NaN          [nan, nan]       [nan, nan]
5  39088  proteinase substrates          [nan, nan]     [36.0, 57.0]
6  39088                 tumors        [67.0, 73.0]       [nan, nan]
7  39088                    NaN          [nan, nan]       [nan, nan]
8  39088               Melanoma          [0.0, 8.0]       [nan, nan]
9  39088                   EDTA          [nan, nan]   [172.0, 176.0]