I am working on a project for university, where I have two pandas dataframes:
# Libraries
import pandas as pd
from geopy import distance
# Dataframes
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})
I need to calculate the distance between every pair of latitude/longitude coordinates across the two dataframes, so I used geopy. If any df2 point lies closer than a threshold of 100 meters to a df1 point, I must assign the value 1 to that row of df1 in the 'nearby' column. I wrote the following code:
threshold = 100 # meters
df1['nearby'] = 0
for i in range(0, len(df1)):
    for j in range(0, len(df2)):
        coord_geo_1 = (df1['lat'].iloc[i], df1['long'].iloc[i])
        coord_geo_2 = (df2['lat'].iloc[j], df2['long'].iloc[j])
        var_distance = (distance.distance(coord_geo_1, coord_geo_2).km) * 1000
        if(var_distance < threshold):
            df1['nearby'].iloc[i] = 1
Although a warning appears, the code works. However, I would like to find a way to avoid the for loops. Is that possible?
# Output:
id lat long nearby
1 -23.48 -46.36 0
2 -22.94 -45.40 0
3 -23.22 -45.80 1
>Solution :
You can cross-merge the two dataframes to get the distance between each id in df1 and each id in df2 (how = 'cross' requires pandas 1.2 or later):
dfm = pd.merge(df1, df2, how = 'cross', suffixes = ['','_2'])
dfm['dist'] = dfm.apply(lambda r: distance.distance((r['lat'],r['long']),(r['lat_2'],r['long_2'])).km * 1000 , axis=1)
dfm
The result looks like this (dist is in meters):
id lat long id_2 lat_2 long_2 dist
-- ---- ------ ------ ------ ------- -------- --------
0 1 -23.48 -46.36 100 -28.48 -46.36 553941
1 1 -23.48 -46.36 200 -22.94 -46.4 59943.4
2 1 -23.48 -46.36 300 -23.22 -45.8 64095.5
3 2 -22.94 -45.4 100 -28.48 -46.36 621251
4 2 -22.94 -45.4 200 -22.94 -46.4 102568
5 2 -22.94 -45.4 300 -23.22 -45.8 51393.4
6 3 -23.22 -45.8 100 -28.48 -46.36 585430
7 3 -23.22 -45.8 200 -22.94 -46.4 68854.7
8 3 -23.22 -45.8 300 -23.22 -45.8 0
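The apply call above still invokes geopy once per row pair in Python. If a spherical approximation is acceptable, the whole distance matrix can be computed with NumPy broadcasting instead. The following is only a sketch under that assumption: it swaps geopy's geodesic distance for the haversine formula with a mean Earth radius of 6,371 km, so the values differ slightly from the table above (typically by well under one percent). The names haversine_matrix and EARTH_RADIUS_M are made up for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})

EARTH_RADIUS_M = 6_371_000  # assumption: mean Earth radius, spherical model
threshold = 100  # meters


def haversine_matrix(lat1, lon1, lat2, lon2):
    """Return an (n, m) matrix of great-circle distances in meters.

    Hypothetical helper: haversine formula, not geopy's geodesic distance.
    """
    # Column vector for the first set, row vector for the second set,
    # so every arithmetic operation broadcasts to shape (n, m).
    lat1, lon1 = np.radians(lat1)[:, None], np.radians(lon1)[:, None]
    lat2, lon2 = np.radians(lat2)[None, :], np.radians(lon2)[None, :]
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


dist = haversine_matrix(df1['lat'].to_numpy(), df1['long'].to_numpy(),
                        df2['lat'].to_numpy(), df2['long'].to_numpy())

# A df1 row is "nearby" if any df2 point lies within the threshold.
df1['nearby'] = (dist < threshold).any(axis=1).astype(int)
print(df1)
```

Because the matrix has len(df1) * len(df2) entries, this trades memory for speed. For pairs that fall very close to the 100-meter threshold, rechecking them with geopy would be the safer choice.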
You can test whether the 'dist' column is below the threshold for each pair, but if the requirement is to aggregate by id from df1, you can do, for example:
res = df1.merge(dfm.groupby('id').apply(lambda g:any(g['dist'] < threshold)*1).rename('nearby'), on = 'id')
res
The result now looks like this:
id lat long nearby
-- ---- ------ ------ --------
0 1 -23.48 -46.36 0
1 2 -22.94 -45.4 0
2 3 -23.22 -45.8 1
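The same aggregation can also be written without groupby.apply, by taking the minimum distance per id and comparing it to the threshold. This is a minimal sketch rather than the only way to do it. To stay self-contained it rebuilds dfm from the distances printed in the table above instead of calling geopy, and the name nearest is an assumption introduced here.

```python
import pandas as pd

threshold = 100  # meters

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})

# Pairwise distances in meters, copied from the dfm table shown earlier.
dfm = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'id_2': [100, 200, 300] * 3,
    'dist': [553941, 59943.4, 64095.5,
             621251, 102568, 51393.4,
             585430, 68854.7, 0],
})

# Smallest distance from each df1 id to any df2 point.
nearest = dfm.groupby('id')['dist'].min()

# Map that minimum back onto df1 and flag ids under the threshold.
res = df1.assign(nearby=df1['id'].map(nearest).lt(threshold).astype(int))
print(res)
```

Comparing the per-id minimum against the threshold is equivalent to asking whether any pair for that id is below it, and it keeps the nearest distance available if it is needed later.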