Scrapy Returning Data Outside of Specified Elements

I am trying to scrape the names of players from this page: https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard

To do that I first get the tables containing the batting scorecards:

batting_scorecard = response.xpath("//table[@class='ds-w-full ds-table ds-table-md ds-table-auto  ci-scorecard-table']")

Then I try to get the player names:

batting_scorecard.xpath("//a[contains(@href,'/player/')]/span/span/text()").getall()

This returns a list that contains all the player names (as well as some rubbish to be parsed) but it also contains names of players/umpires/referees who are not in the specified tables.

In the list below ‘Luke Wood’ (last occurrence), ‘Aleem Dar’, ‘Asif Yaqoob’, ‘Ahsan Raza’, ‘Rashid Riaz’, ‘Muhammad Javed’ should not be returned as they are in a different table. The batting_scorecard tables have class "ds-w-full ds-table ds-table-md ds-table-auto ci-scorecard-table" whereas this data is in a table with class "ds-w-full ds-table ds-table-sm ds-table-auto ".

Can anyone see what the problem is?

['Mohammad Rizwan',
 '\xa0',
 'Babar Azam',
 '\xa0',
 'Haider Ali',
 '\xa0',
 'Shan Masood',
 '\xa0',
 'Iftikhar Ahmed',
 '\xa0',
 'Mohammad Nawaz',
 '\xa0',
 'Khushdil Shah',
 '\xa0',
 'Naseem Shah',
 '\xa0',
 'Usman Qadir',
 '\xa0',
 'Haris Rauf',
 ',',
 '\xa0',
 'Shahnawaz Dahani',
 '\xa0',
 'Phil Salt',
 '\xa0',
 'Alex Hales',
 '\xa0',
 'Dawid Malan',
 '\xa0',
 'Ben Duckett',
 '\xa0',
 'Harry Brook',
 '\xa0',
 'Moeen Ali',
 '\xa0',
 'Sam Curran',
 ',',
 '\xa0',
 'David Willey',
 ',',
 '\xa0',
 'Adil Rashid',
 ',',
 '\xa0',
 'Luke Wood',
 ',',
 '\xa0',
 'Richard Gleeson',
 '\xa0',
 'Luke Wood',
 'Aleem Dar',
 'Asif Yaqoob',
 'Ahsan Raza',
 'Rashid Riaz',
 'Muhammad Javed',
 'Mohammad Rizwan',
 '\xa0',
 'Babar Azam',
 '\xa0',
 'Haider Ali',
 '\xa0',
 'Shan Masood',
 '\xa0',
 'Iftikhar Ahmed',
 '\xa0',
 'Mohammad Nawaz',
 '\xa0',
 'Khushdil Shah',
 '\xa0',
 'Naseem Shah',
 '\xa0',
 'Usman Qadir',
 '\xa0',
 'Haris Rauf',
 ',',
 '\xa0',
 'Shahnawaz Dahani',
 '\xa0',
 'Phil Salt',
 '\xa0',
 'Alex Hales',
 '\xa0',
 'Dawid Malan',
 '\xa0',
 'Ben Duckett',
 '\xa0',
 'Harry Brook',
 '\xa0',
 'Moeen Ali',
 '\xa0',
 'Sam Curran',
 ',',
 '\xa0',
 'David Willey',
 ',',
 '\xa0',
 'Adil Rashid',
 ',',
 '\xa0',
 'Luke Wood',
 ',',
 '\xa0',
 'Richard Gleeson',
 '\xa0',
 'Luke Wood',
 'Aleem Dar',
 'Asif Yaqoob',
 'Ahsan Raza',
 'Rashid Riaz',
 'Muhammad Javed']

Solution:

Change your selector to:

batting_scorecard.xpath(".//a[contains(@href,'/player/')]/span/span/text()").getall()

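An XPath expression that starts with // is absolute: it searches from the root of the document, even when you call it on a selector for a nested element. Adding a dot in front (.//) makes the expression relative, so it only searches inside the elements already selected, in this case the batting scorecard tables. This also explains why the whole list appears twice in your output: batting_scorecard matches two tables, and the absolute query was run against the full page once for each of them.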
