Extract URLs from webpages

I want to extract URLs from a webpage that contains multiple URLs and save the extracted URLs to a txt file.

The URLs in the webpage start with ‘127.0.0.1’, but I want to remove ‘127.0.0.1’ and extract only the URLs. When I run the PowerShell script below, it only saves ‘127.0.0.1’. Any help fixing this would be appreciated.

$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"

# Download the threat feed data
$threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl

# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1(?:[^\s]*)'

# Use the regular expression to find matches in the threat feed data
$matches = [regex]::Matches($threatFeedData.Content, $pattern)

# Create a list to store the matched URLs
$urlList = @()

# Populate the list with matched URLs
foreach ($match in $matches) {
    $urlList += $match.Value
}

# Specify the output file path
$outputFilePath = "output.txt"

# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath

Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."


Solution:

Preface:

  • The target URL happens to be a (semi-structured) plain-text resource, so regex-based processing is appropriate.

  • In general, however, with HTML content, using a dedicated parser is preferable, given that regexes aren’t capable of parsing HTML robustly.[1] See this answer for an example of extracting links from an HTML document.


'127\.0\.0\.1(?:[^\s]*)'
  • You’re mistakenly using a non-capturing group ((?:…)) rather than a capturing one ((…)).

  • In the downloaded content, there is a space after 127.0.0.1

  • Therefore use the following regex instead (\S is the simpler equivalent of [^\s]; \S+ matches only a non-empty run of non-whitespace characters):

    '127\.0\.0\.1 (\S+)'
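As a quick illustration of how the capture group isolates just the host names (using a made-up two-line sample, not the real feed):

```powershell
# Made-up sample lines mimicking the hosts-file format of the feed
$sample = "127.0.0.1 malicious.example`n127.0.0.1 another.example"

# .Groups[1].Value is the text matched by the capturing group (\S+),
# i.e. the host name without the leading '127.0.0.1 '
$hosts = [regex]::Matches($sample, '127\.0\.0\.1 (\S+)') |
    ForEach-Object { $_.Groups[1].Value }

$hosts   # outputs malicious.example and another.example
```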
    
$matches = …
  • While it technically doesn’t cause a problem here, $matches is the name of the automatic $Matches variable, and therefore shouldn’t be used for custom purposes.
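For context, the automatic $Matches variable is the hashtable that PowerShell's -match operator populates on a successful match, which is why shadowing it with a custom value is best avoided (hypothetical one-liner):

```powershell
# -match populates the automatic $Matches hashtable on success;
# $null = ... suppresses the Boolean result of the comparison
$null = '127.0.0.1 malicious.example' -match '127\.0\.0\.1 (\S+)'

$Matches[1]   # the capture-group text
```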
$match.Value
  • $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group.

  • Use $match.Groups[1].Value instead.

$urlList += 
  • Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration; simply use the foreach statement as an expression and let PowerShell collect the results for you. See this answer for more information.
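A minimal, generic sketch of this pattern (the numbers are arbitrary and unrelated to the feed data):

```powershell
# Instead of: $list = @(); foreach (...) { $list += ... }
# assign the foreach statement's output directly; PowerShell
# collects all loop output into a single array for you.
$squares = foreach ($n in 1..5) { $n * $n }

$squares -join ','   # 1,4,9,16,25
```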
Invoke-WebRequest -Uri $threatFeedUrl
  • Since you’re only interested in the text content of the response, it is simpler to use Invoke-RestMethod rather than Invoke-WebRequest; the former returns the content directly (no need to access a .Content property).

To put it all together:

$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
    
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
    
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 (\S+)'
    
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
    
# Create and populate the list with matched URLs
$urlList = 
  foreach ($match in $matchList) {
    $match.Groups[1].Value
  }
    
# Specify the output file path
$outputFilePath = 'output.txt'
    
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
    
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."

[1] See this blog post for background information.
