Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

a better way of cleaning lists in python

I have some text extracted from beautiful soup, unfortunately all of the text is in one and I’m saving it down as a list.

example:

souplist = [
['   Date', '', '  Fri 30th Apr 2021', '', ' 60084096-1', 'Type', '', '  Staff Travel (Rail)', '', 'Description', '', '', '  Stratford International', 'Amount', '', '', '', '', '   £25.10 Paid  '],

['   Date', '', '  Tue 27th Apr 2021', '', ' 60084096-3', 'Type', '', '  Office Costs (Stationery & printing)', '', 'Description', '', '', '  AMAZON.CO.UK [***]', 'Amount', '', '', '', '', '   £42.98 Paid  '],
['   Date', '', '  Tue 1st Dec 2020', '', ' 90012371-0', 'Type', '', '  Office Costs (Rent)', '', '  Amount', '', '', '', '', '   £3,500.00 Paid  '],
['   Date', '', '  Wed 14th Oct 2020', '', ' 60064831-1', 'Type', '', '  Office Costs (Software & applications)', '', 'Description', '', '', '  MAILCHIMP', 'MISC', 'Amount', '', '', '', '', '   £38.13 Paid  ']
]

I wanted to create this into a data frame with columns, date, id, type, description, amount.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I tried to do a for loop something like:

claims= {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for i in range(len(souplist)):
      
    claims['id'].append(newclaims[i][4])
    claims['date'].append(newclaims[i][2])
    claims['type'].append(newclaims[i][7])
    claims['description'].append(newclaims[i][12])
    claims['amount'].append(newclaims[i][18])
    

however in the larger dataset the position in the list changes and not all of the lists are the same length. I’m not really sure how to clean the list.
please could you provide a better way?

>Solution :

TL;DR use this:

souplist = [[val.strip() for val in lst if val] for lst in souplist]
claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for key in claims:
    for lst in souplist:
        try:
            claims[key].append(lst[lst.index(key.title())+1])
        except ValueError:
            if key == 'id':
                claims[key].append(lst[2])
            else:
                claims[key].append(None)

souplist = [[val.strip() for val in lst if val] for lst in souplist]

will strip down your souplist to not contain any empty strings ('') and to strip whitespace off your non-empty strings.


It seems some of your intended fields are labelled; e.g. 'Date' is the element immediately prior to the value you want. You can use this to your advantage:

claims = {'id':[],'date':[],'type':[],'description':[],'amount':[]}

for key in claims:
    for lst in souplist:
        try:
            # using key.title() to translate e.g. 'amount' -> 'Amount'
            # get the field from lst immediately after the label matching key.title()
            claims[key].append(lst[lst.index(key.title())+1])
        except ValueError:
            # the key doesn't exist as a label in the soup list
            claims[key].append(None)
            # change the above as desired; this will make the lists all be the same length

This gives:

>>> claims
{'id':          [None, None], 
 'date':        ['Fri 30th Apr 2021', 'Tue 27th Apr 2021'], 
 'type':        ['Staff Travel (Rail)', 'Office Costs (Stationery & printing)'],
 'description': ['Stratford International', 'AMAZON.CO.UK [***]'],
 'amount':      ['£25.10 Paid', '£42.98 Paid']
}

If you wanted claims to be a list of dicts, which might make further processing more straightforward, you might try:

keys = ['id', 'date', 'type', 'description', 'amount']
claims = []

for lst in souplist:
    claim = {}
    for key in keys:
        try:
            claim[key] = lst[lst.index(key.title())+1]
        except ValueError:
            # somehow figure out how to capture 'id'?
            claim[key] = None
    claims.append(claim)

which would give

>>> claims
[
 {'id': None,
  'date': 'Fri 30th Apr 2021',
  'type': 'Staff Travel (Rail)',
  'description': 'Stratford International',
  'amount': '£25.10 Paid'
 }, 
 {'id': None,
  'date': 'Tue 27th Apr 2021',
  'type': 'Office Costs (Stationery & printing)',
  'description': 'AMAZON.CO.UK [***]',
  'amount': '£42.98 Paid'
 }
]

Bonus points:
If you’re going to use pandas (an awesome library!) you probably want to convert to Datetime your date column, index on your id column, and convert your amount to a number:

df = pd.DataFrame(claims)
df.date = pd.to_datetime(df.date)
df.amount = df.amount.str.replace(r'(£)|( Paid)|(\.)|,','').astype(int)
# treat your amount as an integer to get exact maths; then /100 whenever printing
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading