I have an excel sheet which has four columns in which each cell of the column is related to the next columns cell in a cascaded manner.
The excel sheet looks like below:-
Base Version OS Package Name Description Version
0 A NaN NaN NaN NaN
1 NaN B NaN NaN NaN
2 NaN NaN b-01.zip description about B-01 NaN
3 NaN NaN b-02.zip description about B-02 NaN
4 X NaN NaN NaN NaN
5 NaN Y NaN NaN NaN
6 NaN NaN y-01.zip description about Y-01 NaN
7 NaN NaN y-02.zip description about Y-02 NaN
I want to have a DataFrame that should look like below when output to an excel sheet.
Base Version OS Package Name Description Version
2 A B b-01.zip description about B-01 NaN
3 NaN NaN b-02.zip description about B-02 NaN
6 X Y y-01.zip description about Y-01 NaN
7 NaN NaN y-02.zip description about Y-02 NaN
Is there a way in Pandas to achieve this?
>Solution :
You can use ffill and boolean indexing:
# identify rows with data
m = df['Package Name'].notna()
# ffill and keep first value
df[['Base Version', 'OS']] = df[['Base Version', 'OS']].ffill()[m].drop_duplicates()
# only select the rows of interest
df = df[m]
Alternative if you want a copy:
m = df['Package Name'].notna()
out = df[m]
out.loc[:, ['Base Version', 'OS']] = (df[['Base Version', 'OS']].ffill()
[m].drop_duplicates()
)
Output:
Base Version OS Package Name Description Version
2 A B b-01.zip description about B-01 NaN
3 NaN NaN b-02.zip description about B-02 NaN
6 X Y y-01.zip description about Y-01 NaN
7 NaN NaN y-02.zip description about Y-02 NaN
Reproducible input:
from numpy import nan
df = pd.DataFrame({'Base Version': ['A', nan, nan, nan, 'X', nan, nan, nan],
'OS': [nan, 'B', nan, nan, nan, 'Y', nan, nan],
'Package Name': [nan, nan, 'b-01.zip', 'b-02.zip', nan, nan, 'y-01.zip', 'y-02.zip'],
'Description': [nan, nan, 'description about B-01', 'description about B-02', nan, nan, 'description about Y-01', 'description about Y-02'],
'Version': [nan, nan, nan, nan, nan, nan, nan, nan]})