Delete numbers smaller than 3 digits in a list while the number of items stays the same

I want to normalize a list of years. It is important that the number of items in the list stays the same, because I'm going to convert the list to a dataframe and the rows need to align with the other variables. This is the list I have. It contains many different ways of notating the year:

['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850', '1856–60', '1861-07-XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']

Now, I would like to get only 1 year per item in the list. For example:

['1817', '1800', '1825', '1850', '1856', '1861', '1824', '1767', '1718']

If an item contains two years, choose the first one. (Bonus points if you could get the mean when an item contains 2 years.)

To get the desired result, I removed everything within parentheses and replaced hyphens and en dashes with spaces.

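import re

# data is the list of year strings shown above
data2 = []

for item in data:
    # Drop anything in parentheses, e.g. "(1817p)"
    no_parens = re.sub(r"\([^()]*\)", "", item)
    # Replace en dashes and hyphens with spaces
    data2.append(re.sub(r"[–-]", " ", no_parens))
print(data2)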

Output 1:

['1817 ', '1800 1824 ', '1825 1849', 'ca. 1850', '1856 60', '1861 07 XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']

Then I iterated through the items, but I ended up with more items in the list than I started with, because every word longer than 3 characters is appended separately.

ls = data2
ls2 = []
 
for i in ls:
    res = re.findall(r'\w+', i)
    for w in res:
        if len(w) > 3:
            ls2.append(w)
print(ls2)

Output 2:

['1817', '1800', '1824', '1825', '1849', '1850', '1856', '1861', 'copied', 'between', '1824', '1845', 'copied', '14tn', 'Merz', '1767', '1718']
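The list grows because the inner loop appends once per matching word rather than once per item. A minimal sketch of one way around that, assuming a year is always a run of exactly four digits: search each item for its first such run and append exactly one value per item. The helper name first_year and the None placeholder for items without a year are illustrative choices, not part of the original code.

```python
import re

data = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850',
        '1856–60', '1861-07-XX', 'copied between 1824 and 1845',
        'copied d. 14tn Merz 1767', '1718']

def first_year(text):
    """Return the first run of exactly four digits in text, or None (hypothetical helper)."""
    match = re.search(r"(?<!\d)\d{4}(?!\d)", text)
    return match.group(0) if match else None

# One result per input item, so the list length never changes.
years = [first_year(item) for item in data]
print(years)
# ['1817', '1800', '1825', '1850', '1856', '1861', '1824', '1767', '1718']
```

Because every item yields exactly one entry, the result can be used as a dataframe column next to the other variables, and items with no year show up as None instead of shifting the rows.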

Solution:

What I can think of is using a combination of regex and numpy modules:

import re
import numpy as np
myList = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850', '1856–60', '1861-07-XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']
[np.array(re.findall(r"\d{4}", x)).astype("int").mean() for x in myList]

Output

[1817.0, 1812.0, 1837.0, 1850.0, 1856.0, 1861.0, 1834.5, 1767.0, 1718.0]

This gives the mean of all four-digit numbers in each item, so the output has the same number of items as the input.
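One caveat with the NumPy version: an item that contains no four-digit number produces an empty array, whose mean is nan (with a runtime warning). The following is a sketch of a standard-library alternative that returns None in that case; the function name mean_year and the 'undated' input are made up for illustration.

```python
import re
from statistics import mean

def mean_year(text):
    """Mean of all four-digit numbers in text, or None if there are none (hypothetical helper)."""
    years = [int(y) for y in re.findall(r"\d{4}", text)]
    return mean(years) if years else None

print(mean_year('copied between 1824 and 1845'))  # 1834.5
print(mean_year('1825-1849'))
print(mean_year('undated'))  # None
```

Applied in a list comprehension over the original list, this still returns one value per item.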
