I'm looking for a regex that extracts URLs when they are not separated by a space or any other delimiter, but keeps the "redirect" ones (a URL carrying another URL as a query parameter) as one complete URL.
Let me show you an example:
http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz
should result in the following array:
['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']
I am able to separate the joined URLs thanks to this regex:
'~(?:https?:)?//.*?(?=$|(?:https?:)?//)~'
from this answer: Extract urls from string without spaces between
But I am struggling to also keep the ones containing =http intact.
Thanks,
>Solution :
EDIT: for Python
Use re.split with the regex (?<!=)(?<!^)(?=https?://).
It splits at the beginning of each new URL, unless that URL is preceded by = or is the first in the line (this avoids a redundant split at the beginning of the string, which would otherwise produce an empty first element).
>>> re.split(r'(?<!=)(?<!^)(?=https?://)', 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz')
['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']
Demo and explanation at regex101.
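As a rough sketch, the split above can be wrapped in a small helper. The names URL_BOUNDARY, split_urls and find_urls are made up for illustration. Note that re.split only splits on zero-width matches in Python 3.7 and later, so for older versions the sketch also includes a re.findall variant; that second pattern is my own alternative, not part of the original answer, and it assumes the input is a single line.

```python
import re

# Zero-width split point: followed by http:// or https://,
# not preceded by '=' and not at the very start of the string.
URL_BOUNDARY = re.compile(r'(?<!=)(?<!^)(?=https?://)')


def split_urls(text):
    """Hypothetical helper: split concatenated URLs (needs Python 3.7+)."""
    return URL_BOUNDARY.split(text)


def find_urls(text):
    """Hypothetical alternative: match each URL lazily up to the next
    URL start that is not preceded by '=', or up to the end of the string."""
    return re.findall(r'https?://.*?(?=(?<!=)https?://|$)', text)


sample = 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz'
print(split_urls(sample))
print(find_urls(sample))
```

Both calls should print the same three-element list as in the example above. Either variant treats any http:// directly after = as part of the preceding URL, so a URL that legitimately ends in = right before the next one would not be split.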
Assuming (based on the regex provided in the question) you are using PHP:
Use preg_split with a lookahead for https?:// and a negative lookbehind on =|^. The lookbehind excludes both the beginning of a URL preceded by = and a redundant split at the beginning of the line.
<?php
$keywords = preg_split("~(?<!=|^)(?=https?://)~", "http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz");
print_r($keywords);
?>
Outputs:
Array
(
[0] => http://foo.bar
[1] => https://foo.baz
[2] => http://foo.bar?url=http://foo.baz
)
Online demo here.
Demo and explanation at regex101.