I have a set of computer-generated notices saved in a text file – it looks like this:
- MY COMPANY LLC: Statement# 123456 for $10000.99 for the month of 2023 02 (FEB)
- MY COMPANY LLC: Statement# 123457 for $100.01 for the month of 2022 09 (SEP)
- MY COMPANY LLC: Statement# 123458 for -$51.00 for the month of 2022 10 (OCT)
Ideal Output – as DataFrame
| Statement | Amount | Month |
|---|---|---|
| 123456 | 10000.99 | 2023 02 (FEB) |
| 123457 | 100.01 | 2022 09 (SEP) |
| 123458 | -51 | 2022 10 (OCT) |
I know I can turn this into a CSV outside of Python, then import it with pandas.
But can I load the text file into Python as-is, and transform convert it into a DataFrame?
One "hint" – the word for can work as my column separator in this case. It splits every row reliably into the fields I want. It’s a bit of luck.
>Solution :
The format of your expected dataframe is not clear but here is an option with extract :
df = (
pd.read_csv("input2.txt", header=None, sep="|").squeeze()
.str.extract("(.*): Statement# (.*) for (.*) for the month of (\d+) (\d+) \((\w+)\)")
.set_axis(["company_name", "statement", "amount", "year", "day", "month"], axis=1)
)
Output :
print(df)
company_name statement amount year day month
0 MY COMPANY LLC 123456 $10000.99 2023 02 FEB
1 MY COMPANY LLC 123457 $100.01 2022 09 SEP
2 MY COMPANY LLC 123458 -$51.00 2022 10 OCT