let’s say i have tableA.points which has a bunch of geographic points.
and i have tableB.points which has a bunch of geographic points, as well.
i want to compare every row in tableA.points to every row in tableB.points, and list the closest (shortest distance) point for each row in tableA.points.
so, actually two points:
- the only way i can think of is a cartesian product (cross join) or a aggregate correlated subquery.
- the issue with the correlated subquery is that i can’t attach
tableB.Identiferto the outer query.
for example:
SELECT
tableA.*
,(SELECT
CONCAT(tableB.IDENTIFIER,':',MIN(ST_DISTANCE(tableA.points, tableB.points))
FROM tableB) as closest_point
FROM tableA
The big problem here is that i have over a million rows for each of these tables…
Any advice?
edit: i could also use zip codes to maybe trim down each comparison in the subquery, for example WHERE tableA.zip = tableB.zip but still not sure how to get that identifier…
>Solution :
A simple way to find the nearest point is to aggregate sorting by distance, and take the first element, something like:
SELECT
table_a.idenfitier,
ARRAY_AGG(table_b.identifier
ORDER BY ST_Distance(table_a.geog, table_b.geog)
LIMIT 1)
[ORDINAL(1)]
FROM ...
GROUP BY table_a.idenfitier
But as you correctly mentioned, cross product is very slow, so if you can – restrict the join to a limited radius using ST_DWithin.
Couple blog posts describe the ideas how to do this in BigQuery in details
https://mentin.medium.com/nearest-neighbor-in-bigquery-gis-7d50ebd5d63
https://mentin.medium.com/nearest-neighbor-using-bq-scripting-373241f5b2f5