Transforming automated text to a DataFrame

April 4, 2023

I have a set of computer-generated notices saved in a text file – it looks like this:

MY COMPANY LLC: Statement# 123456 for $10000.99 for the month of 2023 02 (FEB)

MY COMPANY LLC: Statement# 123457 for $100.01 for the month of 2022 09 (SEP)

MY COMPANY LLC: Statement# 123458 for -$51.00 for the month of 2022 10 (OCT)

Ideal Output – as DataFrame

Statement	Amount	Month
123456	10000.99	2023 02 (FEB)
123457	100.01	2022 09 (SEP)
123458	-51	2022 10 (OCT)

I know I can turn this into a CSV outside of Python, then import it with pandas.

But can I load the text file into Python as-is, and transform convert it into a DataFrame?

One "hint" – the word for can work as my column separator in this case. It splits every row reliably into the fields I want. It’s a bit of luck.

>Solution :

The format of your expected dataframe is not clear but here is an option with extract :

df = (
        pd.read_csv("input2.txt", header=None, sep="|").squeeze()
            .str.extract("(.*): Statement# (.*) for (.*) for the month of (\d+) (\d+) \((\w+)\)")
            .set_axis(["company_name", "statement", "amount", "year", "day", "month"], axis=1)
      )

Output :

print(df)

     company_name statement     amount  year day month
0  MY COMPANY LLC    123456  $10000.99  2023  02   FEB
1  MY COMPANY LLC    123457    $100.01  2022  09   SEP
2  MY COMPANY LLC    123458    -$51.00  2022  10   OCT