Extract joined urls but not if redirect exists

I’m looking for a regex that extracts URLs which are joined together without a space or any other separator, but keeps the "redirect" ones (a URL that carries another URL in its query string) as one complete URL.

Let me show you an example:

http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz

should result in the following array:

['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']

I am able to separate the joined URLs thanks to this regex:

'~(?:https?:)?//.*?(?=$|(?:https?:)?//)~'

from this answer: Extract urls from string without spaces between

But I'm struggling to keep the redirect URLs intact, i.e. to avoid splitting where http is directly preceded by =.
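
To make the problem concrete, here is a small sketch of where that regex cuts the redirect URL. Python is used purely for illustration (an assumption, the pattern itself is the one above without the ~ delimiters), and the variable names are made up for this example.

```python
import re

# The pattern from the question, minus the PHP "~" delimiters.
QUESTION_PATTERN = r'(?:https?:)?//.*?(?=$|(?:https?:)?//)'

joined = 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz'

# findall returns every non-overlapping match from left to right.
matches = re.findall(QUESTION_PATTERN, joined)
print(matches)
# ['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=', 'http://foo.baz']
```

The third URL is cut right after "url=", which is exactly the part I want to keep together.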

Thanks,

>Solution :

EDIT: for Python:

Use re.split and regex (?<!=)(?<!^)(?=https?://).

This splits at the start of each new URL, unless that URL is directly preceded by = or sits at the very start of the string (which avoids a redundant empty first element). Note that splitting on a zero-width match requires Python 3.7 or later.

>>> re.split(r'(?<!=)(?<!^)(?=https?://)', 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz')
['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']

Demo and explanation at regex101.
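
If you need this in more than one place, the split can be wrapped in a small helper. This is only a sketch: the names URL_BOUNDARY and split_joined_urls are made up for illustration, and it assumes Python 3.7+ so that re.split accepts a zero-width pattern.

```python
import re

# Split before every http:// or https:// that is neither at the start
# of the string nor directly preceded by "=".
URL_BOUNDARY = re.compile(r'(?<!=)(?<!^)(?=https?://)')


def split_joined_urls(text):
    """Return the URLs in `text`, keeping "=http..." redirects attached."""
    # Drop empty pieces so that an empty input gives an empty list.
    return [part for part in URL_BOUNDARY.split(text) if part]


if __name__ == '__main__':
    sample = 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz'
    print(split_joined_urls(sample))
```

Keep in mind that only an = directly in front of http counts as a redirect here; any other character before the nested URL still triggers a split.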


Assuming (based on regex provided in question) you are using PHP:

Use preg_split with a lookahead for https?:// and a negative lookbehind for =|^. The lookbehind prevents a split before a URL that is directly preceded by =, and also prevents a redundant split at the beginning of the string.

<?php
$keywords = preg_split("~(?<!=|^)(?=https?://)~", "http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz");
print_r($keywords);
?>

Outputs:

Array
(
    [0] => http://foo.bar
    [1] => https://foo.baz
    [2] => http://foo.bar?url=http://foo.baz
)

Online demo here.

Demo and explanation at regex101.
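
A side note in case you port the PHP pattern to Python: as far as I know, Python's re module only accepts fixed-width lookbehinds, and (?<!=|^) mixes a one-character alternative with a zero-width one, so it should be rejected. That is why the Python version uses two separate lookbehinds. A quick way to check on your own interpreter (the helper name compiles is made up for this sketch):

```python
import re


def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Combined lookbehind, as used in the PHP answer.
print(compiles(r'(?<!=|^)(?=https?://)'))

# Two separate lookbehinds, as used in the Python answer.
print(compiles(r'(?<!=)(?<!^)(?=https?://)'))
```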
