Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas data frame transformations

I have this data frame:

import pandas as pd

url = "https://www-genesis.destatis.de/genesisWS/rest/2020/data/tablefile?username=DEB924AL95&password=P@ssword123&name=45213-0005&area=all&compress=false&transpose=false&startyear=1900&endyear=&timeslices=&regionalvariable=&regionalkey=&classifyingvariable1=WERTE4&classifyingkey1=REAL&classifyingvariable2=WERT03&classifyingkey2=BV4KSB&classifyingvariable3=WZ2008&classifyingkey3=WZ08-551&format=xlsx&job=false&stand=01.01.1970&language=en"


df = pd.read_excel(url)

#df.head(20)

df = df.iloc[7:-5]
df
                                        
Turnover from accommodation and food services: Germany,\nmonths, price types, original and adjusted data, economic\nactivities  Unnamed: 1  Unnamed: 2

WZ08-55 Accommodation       NaN     NaN  
                1994    January     121.9
                 NaN   February     122.0
                          March     121.4
                          April     122.1
                            May     120.1
                           June     123.1
                           July     125.5
                         August     126.1
                      September     127.8
                        October     124.3
                       November     121.8
                       December     121.7
               1995     January     120.9
                NaN    February     121.5
                          March     120.8

The expected outcome should be like this. WZ08-55 Accomodation is the name of an industry. There are many such industry names in that column. These Industry names ought to be Column Headers. And the Years which begins in the rows next to the Industry names need to be clubbed with Months and form DATE column.

The Industry names are followed by the Year in the next row and rest of the rows are blank. I have no idea how to proceed.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

enter image description here

>Solution :

Clean your data first then reshape your dataframe:

# Clean column names
df.columns = ['Variable', 'Date', 'Value']

# Boolean mask
m = df['Date'].isna()

# Clean data
df['Date'] += '-' + df['Variable'].ffill()
df['Variable'] = df['Variable'].where(m).ffill()

# Reshape your dataframe
out = (df[~m].replace('...', np.nan)
             .pivot_table(index='Date', columns='Variable',
                          values='Value', sort=False)
             .reset_index().rename_axis(columns=None))

Output:

>>> out
              Date  WZ08-55 Accommodation  ...  WZ08-561-01 Restaurants  WZ08-55-01 Accommodation and food and beverage service act.
0     January-1994                  121.9  ...                    193.2                                              152.5          
1    February-1994                  122.0  ...                    189.0                                              150.7          
2       March-1994                  121.4  ...                    192.2                                              152.2          
3       April-1994                  122.1  ...                    189.0                                              150.6          
4         May-1994                  120.1  ...                    189.9                                              150.1          
..             ...                    ...  ...                      ...                                                ...          
351     April-2023                   98.6  ...                     90.0                                               91.8          
352       May-2023                   90.9  ...                     85.4                                               87.3          
353      June-2023                   88.2  ...                     84.7                                               87.3          
354      July-2023                   84.4  ...                     84.3                                               84.7          
355    August-2023                   82.3  ...                     83.2                                               83.2          

[356 rows x 12 columns]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading