How to remove zeros in dataframe after being created from dictionary?

I have this dictionary with descriptive statistics of the data:

import pandas as pd


def summary_table(df):
    """
    Return a summary table with the descriptive statistics about the dataframe.
    """

    summary = {
        "Number of Days": [len(df)],
        "Missing Cells": [df.isnull().sum().sum()],
        "Missing Cells (%)": [round(df.isnull().sum().sum() / df.shape[0] * 100, 2)],
        "Duplicated Rows": [df.duplicated().sum()],
        "Duplicated Rows (%)": [round(df.duplicated().sum() / df.shape[0] * 100, 2)],
        "Length of Categorical Variables": [len([i for i in df.columns if df[i].dtype == object])],
        "Length of Numerical Variables": [len([i for i in df.columns if df[i].dtype != object])]
    }
    print(summary.items())
    df = pd.DataFrame(summary.items(), columns=['Description', 'Value'])
    df = df.applymap(lambda x: x[0] if isinstance(x, list) else x)
    return df

df=pd.read_csv('test.csv')
df2=summary_table(df)
print(df2)

and this creates the output:

dict_items([('Number of Days', [434]), ('Missing Cells', [108]), ('Missing Cells (%)', [24.88]), ('Duplicated Rows', [0]), ('Duplicated Rows (%)', [0.0]), ('Length of Categorical Variables', [1]), ('Length of Numerical Variables', [11])])
                       Description   Value
0                   Number of Days  434.00
1                    Missing Cells  108.00
2                Missing Cells (%)   24.88
3                  Duplicated Rows    0.00
4              Duplicated Rows (%)    0.00
5  Length of Categorical Variables    1.00
6    Length of Numerical Variables   11.00

When I print the dictionary items, the values have no trailing zeros. In the dataframe, however, every cell in the Value column is shown with two decimal places, which is confusing. How can I remove these extra zeros when converting the dictionary to a dataframe?

Solution:

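The Value column mixes ints and floats, so pandas upcasts it to float64 and prints every value with the same number of decimals. Use an object dtype to keep mixed ints and floats as they are, and store plain scalars instead of one-element lists: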

def summary_table(df):
    """
    Return a summary table with the descriptive statistics about the dataframe.
    """
    nulls = df.isnull().sum().sum()
    dups = df.duplicated().sum()
    summary = {
        "Number of Days": len(df),
        "Missing Cells": nulls,
        "Missing Cells (%)": round(nulls / df.shape[0] * 100, 2),
        "Duplicated Rows": dups,
        "Duplicated Rows (%)": round(dups / df.shape[0] * 100, 2),
        "Length of Categorical Variables": len([i for i in df.columns if df[i].dtype == object]),
        "Length of Numerical Variables": len([i for i in df.columns if df[i].dtype != object])
    }
    df = pd.DataFrame(summary.items(), columns=['Description', 'Value'], dtype=object)
    return df

Example:

print(summary_table(df))
                       Description Value
0                   Number of Days     8
1                    Missing Cells     0
2                Missing Cells (%)   0.0
3                  Duplicated Rows     0
4              Duplicated Rows (%)   0.0
5  Length of Categorical Variables     2
6    Length of Numerical Variables     1
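
To see why the dtype matters, here is a minimal sketch that builds the same values twice, once with default inference and once with dtype=object. The variable names (values, inferred, kept) are only illustrative and do not come from the code above.

```python
import pandas as pd

# A mix of ints and one float, like the values in the summary table
values = [434, 108, 24.88, 0]

# Default inference: ints and floats together are upcast to float64,
# so every entry is displayed with the same number of decimals
inferred = pd.Series(values)

# dtype=object keeps each Python value as it was passed in
kept = pd.Series(values, dtype=object)

print(inferred.dtype)
print(kept.dtype)
print(kept.tolist())
```

The trade-off is that an object column no longer behaves like a numeric one, so this suits a table that is only meant to be displayed.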

You could further improve your code by computing each indicator once and reusing it, instead of recomputing it for the percentage.

For instance:

nulls = df.isnull().sum().sum()
...
        "Missing Cells": nulls,
        "Missing Cells (%)": round(nulls / df.shape[0] * 100, 2),
...
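
If you would rather keep the Value column numeric, one possible alternative (not part of the answer above) is to format the values as text only when displaying them. The sketch below assumes a small hypothetical table and uses the general format specifier 'g', which drops trailing zeros. Note that 'g' keeps 6 significant digits by default and switches to scientific notation for large numbers, and that it prints 0.0 as 0, so the int/float distinction is lost.

```python
import pandas as pd

# Hypothetical table with a float64 Value column
table = pd.DataFrame({
    "Description": ["Number of Days", "Missing Cells (%)", "Duplicated Rows (%)"],
    "Value": [434, 24.88, 0.0],
})

# The original column stays float64; only the displayed copy becomes text.
# '{:g}' drops trailing zeros, e.g. 434.0 is rendered as '434'.
display = table.assign(Value=table["Value"].map("{:g}".format))

print(display)
```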