

I have an Excel sheet which has four columns in which each cell of a column is related to the next column's cell in a cascaded manner

by MR

February 21, 2024

I have an Excel sheet with four columns, in which each cell of a column is related to the next column's cell in a cascaded manner.

Read into pandas, the sheet looks like this:

  Base Version   OS Package Name             Description  Version
0            A  NaN          NaN                     NaN      NaN
1          NaN    B          NaN                     NaN      NaN
2          NaN  NaN     b-01.zip  description about B-01      NaN
3          NaN  NaN     b-02.zip  description about B-02      NaN
4            X  NaN          NaN                     NaN      NaN
5          NaN    Y          NaN                     NaN      NaN
6          NaN  NaN     y-01.zip  description about Y-01      NaN
7          NaN  NaN     y-02.zip  description about Y-02      NaN

I want a DataFrame that looks like this when written back to an Excel sheet:

  Base Version   OS Package Name             Description  Version
2            A    B     b-01.zip  description about B-01      NaN
3          NaN  NaN     b-02.zip  description about B-02      NaN
6            X    Y     y-01.zip  description about Y-01      NaN
7          NaN  NaN     y-02.zip  description about Y-02      NaN

Is there a way in Pandas to achieve this?

Solution:

You can use ffill and boolean indexing:

# identify rows with data
m = df['Package Name'].notna()

# ffill and keep first value
df[['Base Version', 'OS']] = df[['Base Version', 'OS']].ffill()[m].drop_duplicates()

# only select the rows of interest
df = df[m]
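To see why this works, it can help to look at the intermediate values. The following is a minimal sketch on a trimmed copy of the question's data (only three columns, which is an assumption made for brevity); the names filled and kept are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Same layout as the question's sheet, trimmed to three columns for brevity.
df = pd.DataFrame({
    'Base Version': ['A', np.nan, np.nan, np.nan, 'X', np.nan, np.nan, np.nan],
    'OS': [np.nan, 'B', np.nan, np.nan, np.nan, 'Y', np.nan, np.nan],
    'Package Name': [np.nan, np.nan, 'b-01.zip', 'b-02.zip',
                     np.nan, np.nan, 'y-01.zip', 'y-02.zip'],
})

# True only on the rows that actually describe a package (2, 3, 6, 7).
m = df['Package Name'].notna()

# Forward fill: every row now carries the last seen Base Version and OS.
filled = df[['Base Version', 'OS']].ffill()

# Keep the package rows, then only the first row of each (Base Version, OS) pair.
kept = filled[m].drop_duplicates()

print(kept)
```

Because kept only has index labels 2 and 6, assigning it back to df[['Base Version', 'OS']] aligns on the index and leaves every other row as NaN in those two columns.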

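Alternative if you want a copy and want to leave df untouched:

m = df['Package Name'].notna()
# .copy() makes out independent of df and avoids SettingWithCopyWarning
out = df[m].copy()
out.loc[:, ['Base Version', 'OS']] = (df[['Base Version', 'OS']].ffill()
                                      [m].drop_duplicates()
                                     )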

Output:

  Base Version   OS Package Name             Description  Version
2            A    B     b-01.zip  description about B-01      NaN
3          NaN  NaN     b-02.zip  description about B-02      NaN
6            X    Y     y-01.zip  description about Y-01      NaN
7          NaN  NaN     y-02.zip  description about Y-02      NaN
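One caveat: drop_duplicates compares values across the whole frame, so if the same Base Version and OS pair heads two separate groups, the second group's first row would also be blanked. If that can happen in your sheet, a possible variant is to mark the first data row of each block positionally instead. This is only a sketch under that assumption; the sample data and the names cols and first are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sheet in which the pair ('A', 'B') heads two separate groups.
df = pd.DataFrame({
    'Base Version': ['A', np.nan, np.nan, 'A', np.nan, np.nan, np.nan],
    'OS': [np.nan, 'B', np.nan, np.nan, 'B', np.nan, np.nan],
    'Package Name': [np.nan, np.nan, 'b-01.zip',
                     np.nan, np.nan, 'b-03.zip', 'b-04.zip'],
})
cols = ['Base Version', 'OS']

# Rows that describe a package.
m = df['Package Name'].notna()

# A row starts a block when it has data and the row directly above it does not.
first = m & ~m.shift(fill_value=False)

# Work on a copy of the package rows, fill in the parent values,
# then blank them again on every row that is not the first of its block.
out = df[m].copy()
out[cols] = df[cols].ffill()[m]
out.loc[~first[m], cols] = np.nan

print(out)
```

Here rows 2 and 5 both keep A and B, while row 6 is blanked, which matches the layout requested in the question.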

Reproducible input:

import pandas as pd
from numpy import nan

df = pd.DataFrame({'Base Version': ['A', nan, nan, nan, 'X', nan, nan, nan],
                   'OS': [nan, 'B', nan, nan, nan, 'Y', nan, nan],
                   'Package Name': [nan, nan, 'b-01.zip', 'b-02.zip', nan, nan, 'y-01.zip', 'y-02.zip'],
                   'Description': [nan, nan, 'description about B-01', 'description about B-02', nan, nan, 'description about Y-01', 'description about Y-02'],
                   'Version': [nan, nan, nan, nan, nan, nan, nan, nan]})