Extract URLs from webpages

I want to extract URLs from a webpage that contains multiple URLs and save them to a txt file.

The URLs in the webpage start with ‘127.0.0.1’, but I want to remove ‘127.0.0.1’ and extract only the URLs. When I run the PowerShell script below, it saves only ‘127.0.0.1’. Any help fixing this, please?

$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"
    
    # Download the threat feed data
    $threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl
    
    # Define a regular expression pattern to match URLs starting with '127.0.0.1'
    $pattern = '127\.0\.0\.1(?:[^\s]*)'
    
    # Use the regular expression to find matches in the threat feed data
    $matches = [regex]::Matches($threatFeedData.Content, $pattern)
    
    # Create a list to store the matched URLs
    $urlList = @()
    
    # Populate the list with matched URLs
    foreach ($match in $matches) {
        $urlList += $match.Value
    }
    
    # Specify the output file path
    $outputFilePath = "output.txt"
    
    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath
    
    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."

Solution:

Preface:

  • The target URL happens to be a (semi-structured) plain-text resource, so regex-based processing is appropriate.

  • In general, however, with HTML content, using a dedicated parser is preferable, given that regexes aren’t capable of parsing HTML robustly.[1] See this answer for an example of extracting links from an HTML document.


'127\.0\.0\.1(?:[^\s]*)'
  • You’re mistakenly using a non-capturing group ((?:…)) rather than a capturing one ((…)).

  • In the downloaded content, there is a space after 127.0.0.1. Because [^\s]* cannot match that space, the match ends right after 127.0.0.1, which is why only that text was saved.

  • Therefore, use the following regex instead (\S is the simpler equivalent of [^\s], and + matches only a non-empty run of non-whitespace characters):

    '127\.0\.0\.1 (\S+)'
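
To see the difference without downloading anything, here is a minimal sketch that runs both patterns against a made-up line in the feed's format. The host name ads.example.com is an assumption for illustration and is not taken from the real list.

```powershell
# Hypothetical sample line in the same "127.0.0.1 <host>" format as the feed
$sample = '127.0.0.1 ads.example.com'

# Original pattern: [^\s]* stops at the space, so it matches zero further characters
$originalValue = [regex]::Match($sample, '127\.0\.0\.1(?:[^\s]*)').Value

# Corrected pattern: the literal space is matched, then \S+ consumes the host
$fixedMatch = [regex]::Match($sample, '127\.0\.0\.1 (\S+)')

# + requires at least one character, so a bare "127.0.0.1 " does not match
$matchesEmptyHost = [regex]::IsMatch('127.0.0.1 ', '127\.0\.0\.1 (\S+)')

$originalValue        # 127.0.0.1
$fixedMatch.Success   # True
$matchesEmptyHost     # False
```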
    
$matches = …
  • While it technically doesn’t cause a problem here, $matches is the name of the automatic $Matches variable, and therefore shouldn’t be used for custom purposes.
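
As a brief illustration of why the name is best avoided: a successful -match on a single string fills the automatic $Matches variable as a side effect. The sample string below is hypothetical, and $matchList is just one possible alternative name.

```powershell
# Hypothetical sample line
$sample = '127.0.0.1 ads.example.com'

# A successful -match on a single string populates the automatic $Matches hashtable
$found = $sample -match '127\.0\.0\.1 (\S+)'

$found        # True
$Matches[0]   # the whole match
$Matches[1]   # the first capture group

# A differently named variable keeps custom results apart from $Matches
$matchList = [regex]::Matches($sample, '127\.0\.0\.1 (\S+)')
$matchList.Count
```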
$match.Value
  • $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group.

  • Use $match.Groups[1].Value instead.
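
The distinction can be sketched on a small, made-up sample (both host names are assumptions for illustration):

```powershell
# Hypothetical two-line sample in the feed's format
$sample = "127.0.0.1 ads.example.com`n127.0.0.1 tracker.example.net"

$matchList = [regex]::Matches($sample, '127\.0\.0\.1 (\S+)')

# .Value (the same as .Groups[0].Value) is the whole match, prefix included
$wholeMatches = foreach ($m in $matchList) { $m.Value }

# .Groups[1].Value is only what the first capture group matched
$hostsOnly = foreach ($m in $matchList) { $m.Groups[1].Value }

$wholeMatches   # 127.0.0.1 ads.example.com, 127.0.0.1 tracker.example.net
$hostsOnly      # ads.example.com, tracker.example.net
```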

$urlList += 
  • Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration. Simply use the foreach statement as an expression, and let PowerShell collect the results for you. See this answer for more information.
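
A minimal sketch of the two approaches side by side, using squares of a few numbers as stand-in data (the variable names are illustrative only):

```powershell
$numbers = 1..5

# Inefficient: per the point above, each += creates a new array and copies the old elements
$squaresSlow = @()
foreach ($n in $numbers) {
    $squaresSlow += $n * $n
}

# Preferred: assign the foreach statement itself; PowerShell collects its output
$squaresFast = foreach ($n in $numbers) {
    $n * $n
}

# Wrap in @(...) if the result must be an array even for zero or one item
$single = @(foreach ($n in 1..1) { $n * $n })

$squaresFast -join ','   # 1,4,9,16,25
$single.Count            # 1
```

Both loops produce the same elements; the difference is only in how the result is accumulated.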
Invoke-WebRequest -Uri $threatFeedUrl
  • Since you’re only interested in the text content of the response, it is simpler to use Invoke-RestMethod rather than Invoke-WebRequest; the former returns the content directly (no need to access a .Content property).

To put it all together:

$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
    
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
    
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 (\S+)'
    
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
    
# Create and populate the list with matched URLs
$urlList = 
  foreach ($match in $matchList) {
    $match.Groups[1].Value
  }
    
# Specify the output file path
$outputFilePath = 'output.txt'
    
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
    
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."

[1] See this blog post for background information.
