Find similar words in an array and eliminate them

September 20, 2022

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

foreach ($a as $name) {
    echo $name;
    echo '<br>';
}

Output:

paris
london
paris
london tour
london tours
london
londonn

I can eliminate identical words with array_unique:

foreach (array_unique($a) as $name) {
    echo $name;
    echo '<br>';
}

Output:

paris
london
london tour
london tours
londonn

I want to take this further and eliminate similar words. For example, if there is a "london", I want to eliminate "londonn".

So the output will be:

paris
london
london tour

I tried similar_text($name, $name, $percent) but it did not help.

Here is what I tried with my limited knowledge:

foreach (array_unique($a) as $name) {
    $test = $a;
    foreach ($test as $test1) {
        similar_text($name, $test1, $percent);
        if ($percent > 90) {
            echo $name;
            echo '<br>';
        }
    }
}

Output:

paris
paris
london
london
london
london tour
london tour
london tours
london tours
londonn
londonn
londonn

The source of the words is a search list:

$a[] = "$popular_search";

Solution:

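The main problem is the way the two nested loops are used: the inner loop compares each name against the whole of $a, including the name itself, and echoes it on every match, so nothing is ever filtered out. Here’s a very explicit example, without anything fancy, showing how you could do this: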

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

$b = [];
foreach($a as $outerName) {
    // start optimistic, no similar string found
    $isUnique = true;
    foreach($b as $innerName) {
        // check whether the string already has a similar entry
        similar_text($outerName, $innerName, $percent);
        if ($percent > 90) {
            $isUnique = false;
            break;
        }
    }
    if ($isUnique) {
        $b[] = $outerName;
    }
}

print_r($b);


The output is:

Array
(
    [0] => paris
    [1] => london
    [2] => london tour
)
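
To see why the 90% threshold gives this result, it can help to print the similarity scores for a few pairs from the array. This is only a small sketch using the built-in similar_text(); the $pairs and $scores variables are names made up for this illustration:

```php
<?php
// Pairs taken from the example array above.
$pairs = [
    ["london", "londonn"],
    ["london tour", "london tours"],
    ["london", "london tour"],
    ["paris", "london"],
];

$scores = [];
foreach ($pairs as [$first, $second]) {
    // similar_text() writes the similarity percentage into its third argument.
    similar_text($first, $second, $percent);
    $scores["$first / $second"] = round($percent, 1);
    printf("%-28s %5.1f%%\n", "$first / $second", $percent);
}
```

The first two pairs score above 90 (roughly 92.3 and 95.7), so the later string of each pair is dropped. "london" and "london tour" score only about 70.6, so both are kept.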

How does it work? The outer loop simply goes through all the strings in array $a. Inside that loop, an inner loop goes through the strings in $b, which are the ones already accepted as unique enough. If a string from $a is more than 90% similar to any string already in $b, we skip it; otherwise we append it to $b. That’s all.
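
If you need this in more than one place, the same logic could be wrapped in a function with an adjustable threshold. This is a minimal sketch, not a definitive implementation; the function name filterSimilar and its $threshold parameter are hypothetical and not part of PHP:

```php
<?php
/**
 * Hypothetical helper: keeps the first string of each group of similar strings.
 * Two strings count as similar when similar_text() reports more than $threshold percent.
 */
function filterSimilar(array $words, float $threshold = 90.0): array
{
    $kept = [];
    foreach ($words as $candidate) {
        // start optimistic, no similar string found
        $isUnique = true;
        foreach ($kept as $existing) {
            similar_text($candidate, $existing, $percent);
            if ($percent > $threshold) {
                $isUnique = false;
                break;
            }
        }
        if ($isUnique) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

$a = ["paris", "london", "paris", "london tour", "london tours", "london", "londonn"];

$default = filterSimilar($a);        // same 90% threshold as above
$strict  = filterSimilar($a, 95.0);  // stricter: fewer strings count as similar

print_r($default);
print_r($strict);
```

Note that the result depends on the order of the input: the first string of a similar group is the one that survives, so if "londonn" came before "london", it would be "londonn" that is kept.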