I have a very large text (.sql) file, and I want to extract all the links from it into a clean text file, with one link per line.
I have found the following command
grep -Eo "https?://\S+?\.html" filename.txt > newFile.txt
from anubhava, which nearly works for me, link:
Extract all URLs that start with http or https and end with html from text file
Unfortunately, it does not quite work:
Problem 1: In the linked question, the URLs end with .html. Not so in my case. They do not have a common ending, so the match just has to stop before the second ' symbol.
Problem 2: I do not want it to copy the ' symbol.
To give an example (because I think I am explaining this rather badly):
Say, my file says things like this:
Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as.
I would want
https://I_want_this
https://I_want_this_too
as the output file.
Sorry for the easy question, but I am new to this whole thing, and grep/sed etc. are not so easy for me to understand, especially when I want to search for special characters such as /, ', " etc.
>Solution :
You can use a GNU grep command like
grep -Po "'\Khttps?://[^\s']+" file
Details:
-P enables the PCRE regex engine
-o outputs matches only, not the whole matched lines
'\Khttps?://[^\s']+ matches a ', then omits it from the match with \K, then matches http, then an optional s, then ://, and then one or more chars other than whitespace and ' chars.
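To make the effect of \K more concrete, here is a small sketch that runs the same pattern with and without it. The sample string and the variable names are made up for illustration, and it assumes a GNU grep built with PCRE support (-P).

```shell
#!/bin/bash
# Hypothetical sample line, made up for this illustration.
sample="text 'https://example.com/page' more text"

# Without \K the leading quote is part of the reported match.
with_quote=$(grep -Po "'https?://[^\s']+" <<< "$sample")

# With \K everything matched before it is dropped from the reported match.
without_quote=$(grep -Po "'\Khttps?://[^\s']+" <<< "$sample")

echo "$with_quote"     # 'https://example.com/page
echo "$without_quote"  # https://example.com/page
```

In both cases the closing ' is never included, because the character class [^\s'] stops right before it.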
See the online demo:
#!/bin/bash
s="Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as."
grep -Po "'\Khttps?://[^\s']+" <<< "$s"
Output:
https://I_want_this
https://I_want_this_too
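If your grep does not support -P (this is often the case outside GNU grep), a possible fallback is to match with -E, let the leading ' be part of the match, and strip it afterwards with sed. This is only a sketch under that assumption, not part of the command above; the shortened sample string and the variable name links are made up for illustration.

```shell
#!/bin/bash
# Hypothetical input, shortened from the example in the question.
s="foo 'https://I_want_this' bar 'https://I_want_this_too'. baz"

# -E and -o are widely supported; the match still includes the leading ',
# so sed removes it from the start of each output line.
links=$(printf '%s\n' "$s" | grep -Eo "'https?://[^[:space:]']+" | sed "s/^'//")

printf '%s\n' "$links"
```

To write the result to a file instead of a variable, run the same pipeline on your file and redirect it, as in the command from the question (> newFile.txt).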