I would like to use preg_match in PHP to parse the site language out of an HTML document.
My preg_match:
$sitelang = preg_match('!<html lang="(.*?)">!i', $result, $matches) ? $matches[1] : 'Site Language not detected';
This works when the html tag has only the lang attribute, without any class or id.
For example:
Input:
<html lang="de">
Output:
de
But when the html tag has other attributes, like this:
Input:
<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">
Output:
en" class="desktop-view not-mobile-device text-size-normal anon
I need just the lang code (en, de, en-En, de-DE).
Thanks for your advice or code.
>Solution :
Standard disclaimer about using regex to parse HTML aside, there are two things you likely want. First, get rid of the closing bracket in your pattern. Once you have the close quote, the rest of the line doesn’t matter. Second, make sure what’s inside the quotes doesn’t itself contain quotes.
Current, open quote, then anything, then close quote:
preg_match('!<html lang="(.*?)">!i', $result, $matches)
This means if you have lang="foo" class="bar"> you get foo" class="bar as a match. The .*? is lazy, but the pattern still requires "> immediately after the capture, and . matches quote characters too, so the capture keeps growing until it reaches the last quote before the closing bracket.
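The over-capture can be reproduced with a quick check. This is only a sketch: the two input strings are shortened versions of the ones in the question, and $inputs and $captured are illustrative names.

```php
<?php
// Original pattern from the question: a capture followed by a required ">
$pattern = '!<html lang="(.*?)">!i';

$inputs = [
    '<html lang="de">',
    '<html lang="en" class="desktop-view not-mobile-device">',
];

$captured = [];
foreach ($inputs as $html) {
    // Store the first capture group, or null when the pattern does not match
    $captured[] = preg_match($pattern, $html, $m) ? $m[1] : null;
}

var_dump($captured);
// $captured[0] is 'de'
// $captured[1] is 'en" class="desktop-view not-mobile-device'
```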
New, inside the quotes, one or more of anything but a quote:
preg_match('!<html lang="([^"]+)"!i', $result, $matches)
If you want to be more resilient, change the literal space after html to one or more whitespace chars:
preg_match('!<html\s+lang="([^"]+)"!i', $result, $matches)
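Put together, the final pattern could be wrapped in a small helper along these lines. This is a minimal sketch: detectSiteLang is a hypothetical name, and returning null instead of the 'Site Language not detected' string is an assumption made to keep the example easy to check.

```php
<?php
// Hypothetical helper, not part of the original question or answer.
function detectSiteLang(string $html): ?string
{
    // \s+ allows any whitespace between "html" and "lang";
    // [^"]+ stops the capture at the first closing quote.
    if (preg_match('!<html\s+lang="([^"]+)"!i', $html, $matches)) {
        return $matches[1];
    }
    return null; // no lang attribute found directly after <html
}

$simple    = detectSiteLang('<html lang="de">');
$withClass = detectSiteLang('<html lang="en" class="desktop-view not-mobile-device text-size-normal anon">');
$missing   = detectSiteLang('<html class="no-lang">');
```

Note that this still assumes lang is the first attribute on the tag and is wrapped in double quotes. If the markup can vary more than that, a real HTML parser is the safer choice.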