Here is my data –
inp = [{'father_husband_mother_name': [['Father s Name', 0.8603670001029968],
['Shripati', 0.8603670001029968],
['Father s Name', 0.8903670001029969],
['Shpppati', 0.8903670001029969]],
'doc_id': [['GGX2176', 0.8435981869697571],
['GGC2176', 0.8835981869697571]],
'name': [['Elector s Name', 0.8301510810852051],
['Shibshankar Ghosh', 0.8301510810852051],
['Elector s Name', 0.8501510810852051],
['Shibshankar Ghosh', 0.8501510810852051]],
'date_of_birth': [['Age as on 1.1.2000', 0.8067844915390014],
['15', 0.8067844915390014],
['Age as on 1.1.2000', 0.8267844915390015],
['15', 0.8267844915390015]],
'gender_sex': [['Sex', 0.7784658074378967],
['M', 0.7784658074378967],
['Sex', 0.8784658074378967],
['M', 0.8784658074378967]]}]
STOPWORDS = ['Sex', 'Father s Name', 'Elector s Name', 'Address', 'Name', 'Gender', 'Mother s Name',
'Husband s Name']
The output that I expect:
{'father_husband_mother_name': 'Shpppati',
'doc_id': 'GGC2176',
'name': 'Shibshankar Ghosh',
'date_of_birth': 'Age as on 1.1.2000,15',
'gender_sex': 'M'}
Here is the logic –
Retrieve the value that has the highest confidence score [the float inside the list of lists] that is not present in STOPWORDS for each key.
What I have tried –
def process_kie_dict(voter_raw_labels, threshold=0.7):
cleaned_dict = {}
intermediate_dict = {}
for entity_dict in voter_raw_labels:
for entity, val in entity_dict.items():
conf_val = [item[1] for item in val]
unique_val = list(set(conf_val))
max_conf = max(unique_val)
if max_conf > threshold:
if len(unique_val)==1:
add_val = [item[0] for item in val]
else:
max_conf_index = conf_val.index(max_conf)
add_val = [item[0] for item in val[max_conf_index:]]
if entity not in intermediate_dict.keys():
intermediate_dict[entity] = [add_val,max_conf]
else:
if intermediate_dict[entity][1] < max_conf:
intermediate_dict[entity] = [add_val,max_conf]
# print(intermediate_dict)
for key, val in intermediate_dict.items():
final_value = ''
for value in val[0]:
m = len(str.strip(value))
edit_dist_list = []
for word in STOPWORDS:
n = len(word)
edit_dist = editDistDP(value, word, m, n)
edit_dist_list.append(edit_dist)
if min(edit_dist_list) < 2:
value=''
final_value = final_value + value + ','
clean_value = final_value.strip(",")
cleaned_dict[key]=clean_value
return cleaned_dict
def editDistDP(str1, str2, m, n):
# Create a table to store results of subproblems
dp = [[0 for x in range(n + 1)] for x in range(m + 1)]
# Fill d[][] in bottom up manner
for i in range(m + 1):
for j in range(n + 1):
# If first string is empty, only option is to
# insert all characters of second string
if i == 0:
dp[i][j] = j # Min. operations = j
# If second string is empty, only option is to
# remove all characters of second string
elif j == 0:
dp[i][j] = i # Min. operations = i
# If last characters are same, ignore last char
# and recur for remaining string
elif str1[i-1] == str2[j-1]:
dp[i][j] = dp[i-1][j-1]
# If last character are different, consider all
# possibilities and find minimum
else:
dp[i][j] = 1 + min(dp[i][j-1], # Insert
dp[i-1][j], # Remove
dp[i-1][j-1]) # Replace
return dp[m][n]
You can forget about the edit distance implementation, not important. What I want to know is given nested for loops, this won’t work at scale. Looking for a more efficient implementation.
>Solution :
Here is a parser for your data
result = {k: sorted(v, key=lambda x: x[1] if x[0] not in STOPWORDS else 0)[-1][0] for k, v in inp[0].items()}
In short, it takes a key and sorts the rest of the dictionary based on the confidence value, unless the first element of the list is included in STOPWORDS. Then adds the first element of that sorted list to the result dictionary as a value.