I have a regex that extracts the numbers that precede certain words, but those words are currently hard-coded.
Is it also possible to extract the numbers regardless of the text that follows them?
This is my example string:
text = "[' \n\na)\n\n \n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr. : 71201 Koopliedenweg 33\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 10-12-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 77553 Loading date : 09-12-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK50\nD.C. Schoolfruit\n16 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 123,20\n360 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 2.772,00\n6 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,/0 € 46,20\n75 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 577,50\n9 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 69,30\n688 Appels Royal Gala 13kg 60/65 Generica PL I € 5,07 € 3.488,16\n22 Sinaasappels Valencias 15kg 105 Elara ZAI € 6,25 € 137,50\n80 Sinaasappels Valencias 15kg 105 Elara ZAI € 6,25 € 500,00\n160 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 1.000,00\n320 Sinaasappels Valencias 15kg 105 Generica ZAI € 6,25 € 2.000,00\n160 Sinaasappels Valencias 15kg 105 Noordhoek ZA I € 6,25 € 1.000,00\n61 Sinaasappels Valencias 15kg 105 Noordhoek ZA I € 6,25 € 381,25\nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedrag\n€ 12.095,11 1.088,56\nBetaling binnen 30 dagen\nAchterstand wordt gemeld bij de kredietverzekeringsmaatschappij\nVerDi Import BV ING Bank NV. Rotterdam IBAN number: NL17INGB0006959173 ~~\n\n \n\nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01 i\nTel, +31 (0}1 80 61 88 11, Fax +31 (0)1 8061 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDi\n\nE-mail: sales@verdiimport.nl, www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.\n\nrut ard wegetables\n\x0c']"
and these are my search words:
fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
'Tomaten Cherry', 'Sinaasappels',
'Watermeloenen', 'Rettich']
and this is the regex:
import re

regex = r"(\d*(?:\.\d+)*)\s*(?:" + '|'.join(re.escape(word)
                                            for word in fruit_words) + ')'
number_found = re.findall(regex, text)  # text is the example string above
print(number_found)
and the output is then like this:
['16', '360', '6', '75', '9', '688', '22', '80', '160', '320', '160', '61']
My question:
Is it also possible to get the same output without the fruit_words list?
Or maybe without regex?
Thank you
>Solution :
One approach without regex: first, split the text on \n, because every number we need is at the start of a line. Then discard the lines that do not start with a digit. Finally, split each remaining line on whitespace and take the first token, which is the number.
text = "[' \n\na)\n\n \n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr. : 71201 Koopliedenweg 33\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 10-12-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 77553 Loading date : 09-12-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK50\nD.C. Schoolfruit\n16 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 123,20\n360 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 2.772,00\n6 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,/0 € 46,20\n75 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 577,50\n9 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 69,30\n688 Appels Royal Gala 13kg 60/65 Generica PL I € 5,07 € 3.488,16\n22 Sinaasappels Valencias 15kg 105 Elara ZAI € 6,25 € 137,50\n80 Sinaasappels Valencias 15kg 105 Elara ZAI € 6,25 € 500,00\n160 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 1.000,00\n320 Sinaasappels Valencias 15kg 105 Generica ZAI € 6,25 € 2.000,00\n160 Sinaasappels Valencias 15kg 105 Noordhoek ZA I € 6,25 € 1.000,00\n61 Sinaasappels Valencias 15kg 105 Noordhoek ZA I € 6,25 € 381,25\nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedrag\n€ 12.095,11 1.088,56\nBetaling binnen 30 dagen\nAchterstand wordt gemeld bij de kredietverzekeringsmaatschappij\nVerDi Import BV ING Bank NV. Rotterdam IBAN number: NL17INGB0006959173 ~~\n\n \n\nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01 i\nTel, +31 (0}1 80 61 88 11, Fax +31 (0)1 8061 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDi\n\nE-mail: sales@verdiimport.nl, www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.\n\nrut ard wegetables\n\x0c']"
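# Split the text into lines.
a = text.split('\n')
# Keep only non-empty lines whose first character is a digit.
b = [x for x in a if x and x[0].isdigit()]
# The first whitespace-separated token of each kept line is the number.
c = [x.split()[0] for x in b]
print(c)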
['16', '360', '6', '75', '9', '688', '22', '80', '160', '320', '160', '61']
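If you would rather stay with regex, the same "number at the start of a line" idea can probably be written without any word list. The following is only a sketch: the function name leading_quantities and the shortened sample string are made up for illustration, and it assumes that every quantity is a run of digits at the very start of a line, followed by whitespace.

```python
import re

# Shortened, hypothetical excerpt in the same layout as the invoice text.
sample = (
    "Factuur nr. : 71201 Koopliedenweg 33\n"
    "WK50\n"
    "16 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 123,20\n"
    "688 Appels Royal Gala 13kg 60/65 Generica PL I € 5,07 € 3.488,16\n"
    "€ 12.095,11 1.088,56\n"
    "Betaling binnen 30 dagen\n"
)


def leading_quantities(text):
    # With re.MULTILINE, ^ matches at the start of every line.
    # (\d+) captures the leading digits; (?=\s) requires whitespace right after them.
    return re.findall(r"^(\d+)(?=\s)", text, flags=re.MULTILINE)


print(leading_quantities(sample))  # ['16', '688']
```

Like the split-based version, this would also pick up any other line that happens to start with a number followed by a space (an address line, for example), so it relies on the invoice layout staying the same.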