I have some text extracted from beautiful soup, unfortunately all of the text is in one and I’m saving it down as a list.
example:
souplist = [
[' Date', '', ' Fri 30th Apr 2021', '', ' 60084096-1', 'Type', '', ' Staff Travel (Rail)', '', 'Description', '', '', ' Stratford International', 'Amount', '', '', '', '', ' £25.10 Paid '],
[' Date', '', ' Tue 27th Apr 2021', '', ' 60084096-3', 'Type', '', ' Office Costs (Stationery & printing)', '', 'Description', '', '', ' AMAZON.CO.UK [***]', 'Amount', '', '', '', '', ' £42.98 Paid '],
[' Date', '', ' Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '', ' Office Costs (Rent)', '', ' Amount', '', '', '', '', ' £3,500.00 Paid '],
[' Date', '', ' Wed 14th Oct 2020', '', ' 60064831-1', 'Type', '', ' Office Costs (Software & applications)', '', 'Description', '', '', ' MAILCHIMP', 'MISC', 'Amount', '', '', '', '', ' £38.13 Paid ']
]
I wanted to create this into a data frame with columns, date, id, type, description, amount.
I tried to do a for loop something like:
claims= {'id':[],'date':[],'type':[],'description':[],'amount':[]}
for i in range(len(souplist)):
claims['id'].append(newclaims[i][4])
claims['date'].append(newclaims[i][2])
claims['type'].append(newclaims[i][7])
claims['description'].append(newclaims[i][12])
claims['amount'].append(newclaims[i][18])
however in the larger dataset the position in the list changes and not all of the lists are the same length. I’m not really sure how to clean the list.
please could you provide a better way?
>Solution :
TL;DR use this:
souplist = [[val.strip() for val in lst if val] for lst in souplist]
claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}
for key in claims:
for lst in souplist:
try:
claims[key].append(lst[lst.index(key.title())+1])
except ValueError:
if key == 'id':
claims[key].append(lst[2])
else:
claims[key].append(None)
souplist = [[val.strip() for val in lst if val] for lst in souplist]
will strip down your souplist to not contain any empty strings ('') and to strip whitespace off your non-empty strings.
It seems some of your intended fields are labelled; e.g. 'Date' is the element immediately prior to the value you want. You can use this to your advantage:
claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}
for key in claims:
for lst in souplist:
try:
# using key.title() to translate e.g. 'amount' -> 'Amount'
# get the field from lst immediately after the label matching key.title()
claims[key].append(lst[lst.index(key.title())+1])
except ValueError:
# the key doesn't exist as a label in the soup list
claims[key].append(None)
# change the above as desired; this will make the lists all be the same length
This gives:
>>> claims
{'id': [None, None],
'date': ['Fri 30th Apr 2021', 'Tue 27th Apr 2021'],
'type': ['Staff Travel (Rail)', 'Office Costs (Stationery & printing)'],
'description': ['Stratford International', 'AMAZON.CO.UK [***]'],
'amount': ['£25.10 Paid', '£42.98 Paid']
}
If you wanted claims to be a list of dicts, which might make further processing more straightforward, you might try:
keys = ['id', 'date', 'type', 'description', 'amount']
claims = []
for lst in souplist:
claim = {}
for key in keys:
try:
claim[key] = lst[lst.index(key.title())+1]
except ValueError:
# somehow figure out how to capture 'id'?
claim[key] = None
claims.append(claim)
which would give
>>> claims
[
{'id': None,
'date': 'Fri 30th Apr 2021',
'type': 'Staff Travel (Rail)',
'description': 'Stratford International',
'amount': '£25.10 Paid'
},
{'id': None,
'date': 'Tue 27th Apr 2021',
'type': 'Office Costs (Stationery & printing)',
'description': 'AMAZON.CO.UK [***]',
'amount': '£42.98 Paid'
}
]
Bonus points:
If you’re going to use pandas (an awesome library!) you probably want to convert to Datetime your date column, index on your id column, and convert your amount to a number:
df = pd.DataFrame(claims)
df.date = pd.to_datetime(df.date)
df.amount = df.amount.str.replace(r'(£)|( Paid)|(\.)|,','').astype(int)
# treat your amount as an integer to get exact maths; then /100 whenever printing