For the DataFrame below:
df = pd.DataFrame({'Name': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes'],
'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','-','-','Alpha','Alpha','S-Class'],
'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020],
'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000],
'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black']
})
df
Name Type Year Price Color
0 Tesla Model X 2015 85000 White
1 Tesla Model X 2015 90000 White
2 Tesla Model X 2015 95000 White
3 Toyota Corolla 2017 20000 Red
4 Ford Bronco 2018 35000 Blue
5 Ford Bronco 2018 35000 Blue
6 Ford Mustang 2020 45000 Yellow
7 BMW 3 Series 2015 40000 Silver
8 BMW - 2015 40000 Silver
9 BMW - 2017 65000 Black
10 Mercedes Alpha 2018 50000 White
11 Mercedes Alpha 2018 50000 White
12 Mercedes S-Class 2020 75000 Black
For each Name column groups, I am trying to find the corresponding start index and end index of consecutive common values in other columns. if the value repeat consecutively it should be considered as common and store it to a dictionary with key as start index and value as end index.
For example Tesla has Model X common in the Type column, so in the Type column I should get the start index of Model X and end index of Model X as the first value in a dictionary with start index : `end index as the key value pair of the dictionary.
Similarly, for the Color column Blue should have start index as 4 and end index as 5 and for the Year column 2015 should have start index as 0 and end index as 2.
Expected Output:
Name :
{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}
Type :
{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}
Year :
{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12] }
Price:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Color:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12:12}
What I tried
def find_indices(df, column):
df1 = df.index.to_series().groupby(df[column]).agg(['first', 'last']).reset_index()
df1 = df1.sort_values("first").reset_index()
first_last_rows = df1.set_index('first')['last'].to_dict()
print(column + ":")
print(first_last_rows)
Output I got
Name:
{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}
Type:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}
Year:
{0: 8, 3: 9, 4: 11, 6: 12}
Price:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Color:
{0: 11, 3: 3, 4: 5, 6: 6, 7: 8, 9: 12}
In the output I got Year and Color columns end values are from disjoint group so it is not finding the consecutive common values(or the values are for different Name)
>Solution :
It seems you want a group of consecutive values. For that a well known method is to use shift and cumsum and then groupby, and then for your problem you take first index and last index of each group and then build a dict as below:
col = "Year"
col_dict = dict(
df.groupby(df[col].ne(df[col].shift()).cumsum())[col]
.agg(lambda x: (x.index[0], x.index[-1]))
.values
)
print(col_dict)
{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}