Different behavior of apply(str) and astype(str) for datetime64[ns] pandas columns

I’m working with datetime information in pandas and wanted to convert a bunch of datetime64[ns] columns to str. I noticed a different behavior from the two approaches that I expected to yield the same result.

Here’s an MCVE.

import pandas as pd

# Create a dataframe with dates according to ISO8601
df = pd.DataFrame(
    {
        "dt_column": [
            "2023-01-01",
            "2023-01-02",
            "2023-01-02",
        ]
    }
)

# Convert the dates to datetime columns
# (I expect the time portion to be 00:00:00)
df["dt_column"] = pd.to_datetime(df["dt_column"])

df["str_from_astype"] = df["dt_column"].astype(str)
df["str_from_apply"] = df["dt_column"].apply(str)

print(df)
print("")
print(f"Datatypes of the dataframe \n{df.dtypes}")

Output

   dt_column str_from_astype       str_from_apply
0 2023-01-01      2023-01-01  2023-01-01 00:00:00
1 2023-01-02      2023-01-02  2023-01-02 00:00:00
2 2023-01-02      2023-01-02  2023-01-02 00:00:00

Datatypes of the dataframe 
dt_column          datetime64[ns]
str_from_astype            object
str_from_apply             object
dtype: object

With .astype(str) the time information is dropped, whereas with .apply(str) it is retained (or inferred).

Why is that?

(Pandas v1.5.2, Python 3.9.15)

Solution:

The time information is never lost. If you change one of the values to 2023-01-02 12:00, you’ll see that the time is shown for every row with astype, and that it is also visible in the original datetime column:

            dt_column      str_from_astype       str_from_apply
0 2023-01-01 00:00:00  2023-01-01 00:00:00  2023-01-01 00:00:00
1 2023-01-02 00:00:00  2023-01-02 00:00:00  2023-01-02 00:00:00
2 2023-01-02 12:00:00  2023-01-02 12:00:00  2023-01-02 12:00:00
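
To reproduce the table above, here is a minimal sketch. It assumes the question's data with the last value changed to 12:00, and it builds the column from pd.Timestamp objects rather than strings so that no date-format inference is involved:

```python
import pandas as pd

# Same data as the question, except the last value carries a time of 12:00.
df = pd.DataFrame(
    {
        "dt_column": [
            pd.Timestamp("2023-01-01"),
            pd.Timestamp("2023-01-02"),
            pd.Timestamp("2023-01-02 12:00"),
        ]
    }
)

# Column-level conversion: pandas chooses one format for the whole column.
df["str_from_astype"] = df["dt_column"].astype(str)

# Element-level conversion: str() is called on each Timestamp separately.
df["str_from_apply"] = df["dt_column"].apply(str)

print(df)
```

Once a single value has a time other than midnight, both conversions should show the time for every row.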

With apply, Python’s built-in str is called on each Timestamp object, which always produces the full format:

str(pd.Timestamp('2023-01-01'))
# '2023-01-01 00:00:00'

With astype, the formatting is handled by pandas.io.formats.format.SeriesFormatter, which is a bit smarter and decides on the output format depending on the context (here, the other values in the Series and whether any of them has a time other than midnight).
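
The context-dependent decision can be illustrated with a small check. This is only a sketch of the idea and not pandas' internal code; all_midnight is a hypothetical helper introduced here for illustration:

```python
import pandas as pd


def all_midnight(s: pd.Series) -> bool:
    """Hypothetical helper: True if every non-null value has a time of 00:00:00."""
    s = s.dropna()
    # dt.normalize() resets the time part to midnight, so a value that is
    # unchanged by it had no time component to begin with.
    return bool((s == s.dt.normalize()).all())


dates_only = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))
with_time = pd.Series(pd.to_datetime(["2023-01-01 00:00", "2023-01-02 12:00"]))

print(all_midnight(dates_only))  # True
print(all_midnight(with_time))   # False
```

A column for which this check is true is the kind that astype(str) renders as dates only in the question's output.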

In any case, the canonical way to be explicit about the format is to use dt.strftime:

# without time
df["dt_column"].dt.strftime('%Y-%m-%d')

# with time
df["dt_column"].dt.strftime('%Y-%m-%d %H:%M:%S')
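
As a self-contained illustration of the two format strings above, here is a short sketch on a made-up two-element Series (the variable names are only for this example):

```python
import pandas as pd

s = pd.Series([pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02 12:00")])

# The format string decides the output, regardless of the other values.
date_only = s.dt.strftime("%Y-%m-%d")
date_time = s.dt.strftime("%Y-%m-%d %H:%M:%S")

print(date_only.tolist())  # ['2023-01-01', '2023-01-02']
print(date_time.tolist())  # ['2023-01-01 00:00:00', '2023-01-02 12:00:00']
```

Because the format is stated explicitly, the result does not change when a value with a different time is added to the column.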