I am working on a project that scrapes articles from archive websites. For example, below are an archive URL and the corresponding original URL. I have the archive URL and want to use Selenium to extract the original URL.
Archive URL: https://archive.is/xXAoL
Original URL: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
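from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://archive.is/xXAoL"
# Selenium 4 takes the driver path through a Service object;
# Selenium 3 accepted it as the first positional argument.
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get(url)  # load the archived page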
Any advice on how to get the original url?
Method 1
One thing that might work: the page's canonical link is
https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
I could strip out everything before the second https. However, that method is not working, so I am looking for another method that does not rely on the page's metadata.
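For reference, the string handling described in Method 1 could be sketched roughly as follows. This assumes the canonical link has already been obtained as a plain string and has the shape shown above; the helper name `original_from_canonical` is hypothetical and only for illustration.

```python
def original_from_canonical(canonical: str) -> str:
    """Return the part of a canonical archive link that starts at the second 'http'."""
    # Start searching at index 1 to skip the scheme of the archive link itself.
    second = canonical.find("http", 1)
    if second == -1:
        raise ValueError("no embedded URL found in the canonical link")
    return canonical[second:]


canonical = (
    "https://archive.is/2021.09.07-145059/"
    "https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-"
    "says-they-are-unsafe-and-no-longer-recommended-2676130.html"
    "?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU"
)
print(original_from_canonical(canonical))
```

This only covers the string slicing; it does not address how the canonical link is read from the page, which is where the method may be failing.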
>Solution :
To extract the original URL, use WebDriverWait with visibility_of_element_located() to wait for the input element named q, then read its value attribute, which holds the original URL. You can use either of the following locator strategies:
- Using CSS_SELECTOR:
    driver.get('https://archive.is/xXAoL')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))
- Using XPATH:
    driver.get('https://archive.is/xXAoL')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))
- Console Output:
    https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
- Note: You have to add the following imports:
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC