I have a dataframe dfA like this:
chromosome basepair
chrA 500
chrA 1000
chrA 7000
chrA 20000
chrA 23000
chrA 24000
chrA 35000
chrB 13000
chrB 14000
chrB 14500
For each chrA position in dfA, I would like to scan the basepair column of the adjacent chrA rows to identify groups of positions separated by at most 5000 basepairs (i.e. a separation of 1-5000). I would then like to repeat this for chrB and write a new dataframe, dfB, listing all the groups identified.
The output for dfB should be:
chromosome basepair Group ID
chrA 500 1
chrA 1000 1
chrA 20000 2
chrA 23000 2
chrA 24000 2
chrA 23000 3
chrA 24000 3
chrB 13000 4
chrB 14000 4
chrB 14500 4
Solution:
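Assuming you want to start a new group whenever the difference between consecutive basepair values within a chromosome is greater than 5000, or when the values go backwards (the first row of each chromosome also starts a new group):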
# A row continues the current group only if its step from the previous row is in [0, 5000]
df['Group ID'] = (~df.groupby('chromosome')['basepair']
                     .diff().between(0, 5000)
                  ).cumsum()
Output:
chromosome basepair Group ID
0 chrA 500 1
1 chrA 1000 1
2 chrA 20000 2
3 chrA 23000 2
4 chrA 24000 2
5 chrA 23000 3
6 chrA 24000 3
7 chrB 13000 4
8 chrB 14000 4
9 chrB 14500 4
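
To see why this works, the one-liner can be broken into its intermediate steps. The following is a minimal sketch, not part of the original answer: the helper name `assign_groups`, its `max_gap` parameter and the variables `step` and `same_group` are illustrative. The last part applies the same logic to dfA as posted in the question and drops single-row groups; treating isolated positions such as 7000 and 35000 as unwanted is an assumption based on the expected output.

```python
import pandas as pd


def assign_groups(df, max_gap=5000):
    """Return a copy of df with a 'Group ID' column (hypothetical helper)."""
    out = df.copy()
    # Step from the previous row within each chromosome;
    # NaN on the first row of every chromosome.
    step = out.groupby('chromosome')['basepair'].diff()
    # True where the row continues the current group (0 <= step <= max_gap).
    # NaN is never "between", so each chromosome starts a new group.
    same_group = step.between(0, max_gap)
    # Every False marks the start of a group; the running count of
    # starts is the group ID.
    out['Group ID'] = (~same_group).cumsum()
    return out


# The rows shown in the output above.
df = pd.DataFrame({
    'chromosome': ['chrA'] * 7 + ['chrB'] * 3,
    'basepair': [500, 1000, 20000, 23000, 24000, 23000, 24000,
                 13000, 14000, 14500],
})
result = assign_groups(df)
print(result)  # Group ID column: 1, 1, 2, 2, 2, 3, 3, 4, 4, 4

# dfA exactly as posted in the question.
dfA = pd.DataFrame({
    'chromosome': ['chrA'] * 7 + ['chrB'] * 3,
    'basepair': [500, 1000, 7000, 20000, 23000, 24000, 35000,
                 13000, 14000, 14500],
})
grouped = assign_groups(dfA)
# Assumption: groups with a single row are not wanted, so keep only
# groups that contain at least two rows.
sizes = grouped.groupby('Group ID')['basepair'].transform('size')
multi = grouped[sizes > 1]
print(multi)
```

On dfA this keeps 500/1000, 20000/23000/24000 and the three chrB rows, with group IDs 1, 3 and 5, because the dropped single-row groups leave gaps in the numbering. If consecutive IDs are needed, they can be renumbered afterwards, for example with `pd.factorize`.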