Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Selenium xpath: Trying to get original url from archived link

I am working on a project, trying to scrape articles from archive websites. For example, below is an archive url and the original url. I have the archive url. And I want to use Selenium to extract the original url.

Arhive url: https://archive.is/xXAoL

Original url: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://archive.is/xXAoL"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)

Any advice on how to get the original url?

Method 1

One thing that might work is that the canonical link is

https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU

I could just strip out things up until the second https. However, that method is not working so looking for another method not relying on meta.

>Solution :

To extract the original url you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

    driver.get('https://archive.is/xXAoL')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))
    
  • Using XPATH:

    driver.get('https://archive.is/xXAoL')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))
    
  • Console Output:

    https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading