Why does Python's regex capture everything after .* instead of stopping at the first occurrence of a certain character?

I want to write a program that counts the likes of a YouTube channel.
This is my code.

import re
import requests 
from bs4 import BeautifulSoup

r = requests.get("https://filmot.com/channel/UCX6OQ3DkcsbYNE6H8uQQuVA")

soup = BeautifulSoup(r.text , "html.parser")
val=soup.find_all("span",attrs={"class":"badge"})
res = re.findall(r"class=\"fa fa-thumbs-up\"></i>(.*)\<" , str(val))

print(res)

But it returns this result:

['404.1K</span>, <span class="badge">Entertainment</span>, <span class="badge">8m1s</span>, <span class="badge">18 Dec 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>10M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>957.2K</span>, <span class="badge">Entertainment</span>, <span class="badge">12m9s</span>, <span class="badge">16 Dec 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>14.6M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>1.4M</span>, <span class="badge">Entertainment</span>, <span class="badge">12m4s</span>, <span class="badge">10 Dec 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>11.3M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>1.1M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>5.1K</span>, <span class="badge">Entertainment</span>, <span class="badge">11m1s</span>, <span class="badge">24 Nov 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>17.5M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>2.8M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>3.5K</span>, <span class="badge">Entertainment</span>, <span class="badge">25m41s</span>, <span class="badge">29 Oct 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>17M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>2M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>6K</span>, <span class="badge">Entertainment</span>, <span class="badge">4m55s</span>, <span class="badge">23 Oct 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>19.4M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>1.4M</span>, <span class="badge"><i aria-hidden="true" class="fa 
fa-thumbs-down"></i>12.5K</span>, <span class="badge">Entertainment</span>, <span class="badge">15m42s</span>, <span class="badge">12 Oct 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>127.7K</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>15.3K</span>, <span class="badge">Entertainment</span>, <span class="badge">5m20s</span>, <span class="badge">26 Sep 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>7.7M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>777.1K</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>6.1K</span>, <span class="badge">Entertainment</span>, <span class="badge">8m2s</span>, <span class="badge">04 Sep 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>48.4M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>2.5M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>24.1K</span>, <span class="badge">Entertainment</span>, <span class="badge">12m40s</span>, <span class="badge">31 Aug 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>69.8M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>3M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>38.6K</span>, <span class="badge">Entertainment</span>, <span class="badge">19m25s</span>, <span class="badge">07 Aug 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>53.3M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>2.2M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>29.1K</span>, <span class="badge">Entertainment</span>, <span class="badge">16m40s</span>, <span class="badge">24 Jul 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>44.6M</span>, <span class="badge"><i 
aria-hidden="true" class="fa fa-thumbs-up"></i>1.7M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>21.4K</span>, <span class="badge">Entertainment</span>, <span class="badge">10m45s</span>, <span class="badge">10 Jul 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>42.2M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>1.7M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>24.1K</span>, <span class="badge">Entertainment</span>, <span class="badge">11m34s</span>, <span class="badge">26 Jun 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>53.6M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>1.8M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>30.6K</span>, <span class="badge">Entertainment</span>, <span class="badge">12m33s</span>, <span class="badge">12 Jun 2021</span>, <span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>49.5M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>1.9M</span>, <span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>29.2K</span>, <span ....

I tested the pattern on regex101.com, and the result there was correct.


Solution:

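If you want to stay with regex, a positive lookbehind works well in this case, e.g.
(?<=class=\"fa fa-thumbs-up\"></i>)[\w.]+ as in res = re.findall(r"(?<=class=\"fa fa-thumbs-up\"></i>)[\w.]+" , str(val)). The .* is tricky because . matches any character except a newline and * repeats it zero or more times, as many times as possible (it’s a greedy quantifier). The match therefore runs to the end of the line and then backtracks only as far as the last <, not the first one. Since str(val) puts every badge on a single line, one match swallows almost the whole string. That is probably also why the pattern looked correct on regex101: if the test text there had each badge on its own line, . would stop at every line break.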
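To make the difference concrete, here is a minimal sketch that compares the greedy pattern with three common alternatives. The html string is a hypothetical sample shortened from the output in the question, and the variable names are only illustrative. It uses just the standard re module.

```python
import re

# Hypothetical sample, shortened from the output shown in the question.
# Everything sits on one line, just like str(val).
html = (
    '<span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>404.1K</span>, '
    '<span class="badge">Entertainment</span>, '
    '<span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>957.2K</span>, '
    '<span class="badge"><i aria-hidden="true" class="fa fa-thumbs-down"></i>5.1K</span>'
)

# Literal text that precedes every like count.
marker = r'class="fa fa-thumbs-up"></i>'

# Greedy: .* runs to the end of the line, then backtracks to the LAST "<".
greedy = re.findall(marker + r'(.*)<', html)

# Lazy: .*? expands one character at a time and stops at the FIRST "<".
lazy = re.findall(marker + r'(.*?)<', html)

# Negated class: [^<]* can never cross a "<", so no backtracking surprises.
negated = re.findall(marker + r'([^<]*)<', html)

# Lookbehind, as in the answer: match only the count that follows the marker.
lookbehind = re.findall(r'(?<=' + marker + r')[\w.]+', html)

print(len(greedy), "greedy match")  # a single, very long match
print(lazy)
print(negated)
print(lookbehind)
```

With this sample the greedy version returns a single long string, while the other three each return the two like counts. The lookbehind works here because the marker has a fixed length, which Python's re module requires for lookbehind patterns.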
