I am trying to load data (with embeddings) into my Azure Cognitive Search index. This is my process after adding the embedding fields to my Pandas dataframe:
input_data = df.to_json()  # where df is the Pandas dataframe with the embedding fields
# Use SearchIndexingBufferedSender to upload the documents in batches optimized for indexing
with SearchIndexingBufferedSender(
    endpoint=service_endpoint,
    index_name=index_name,
    credential=credential,
) as batch_client:
    # Add upload actions for all documents
    batch_client.upload_documents(documents=input_data)
print(f"Uploaded {len(input_data)} documents in total")
I am getting the following error:
File /packages/azure/search/documents/_search_indexing_buffered_sender.py:322, in SearchIndexingBufferedSender._retry_action(self, action)
320 self._callback_fail(action)
321 return
--> 322 key = action.additional_properties.get(self._index_key)
323 counter = self._retry_counter.get(key)
324 if not counter:
325 # first time that fails
AttributeError: 'str' object has no attribute 'get'
Since my input data is relatively small, I have also tried loading the data without batches:
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(input_data, timeout = 50)
And this gives me a different error:
File /packages/azure/search/documents/_generated/operations/_documents_operations.py:1251, in DocumentsOperations.index(self, batch, request_options, **kwargs)
1249 map_error(status_code=response.status_code, response=response, error_map=error_map)
1250 error = self._deserialize.failsafe_deserialize(_models.SearchError, pipeline_response)
-> 1251 raise HttpResponseError(response=response, model=error)
1253 if response.status_code == 200:
1254 deserialized = self._deserialize("IndexDocumentsResult", pipeline_response)
HttpResponseError: () The request is invalid. Details: A null value was found with the expected type 'search.documentFields[Nullable=False]'. The expected type 'search.documentFields[Nullable=False]' does not allow null values.
Code:
Message: The request is invalid. Details: A null value was found with the expected type 'search.documentFields[Nullable=False]'. The expected type 'search.documentFields[Nullable=False]' does not allow null values.
But my dataframe does not have any empty values, so that makes me think there is something wrong with the format of the file I am sending. I have tried both of these with no success:
input_data = df.to_json()
input_data = df.to_json(orient="records")
Here is my index definition:
index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
fields = [
    SimpleField(name="Id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="Field1", type=SearchFieldDataType.String),
    SearchableField(name="Field2", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="Field3", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="Field4", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="Field5", type=SearchFieldDataType.String, filterable=True),
    SearchField(name="Field4_vec", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=384, vector_search_profile="myHnswProfile"),
    SearchField(name="Field5_vec", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=384, vector_search_profile="myHnswProfile")
]
# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswVectorSearchAlgorithmConfiguration(
            name="myHnsw",
            kind=VectorSearchAlgorithmKind.HNSW,
            parameters=HnswParameters(
                m=4,
                ef_construction=400,
                ef_search=500,
                metric="cosine"
            )
        ),
        ExhaustiveKnnVectorSearchAlgorithmConfiguration(
            name="myExhaustiveKnn",
            kind=VectorSearchAlgorithmKind.EXHAUSTIVE_KNN,
            parameters=ExhaustiveKnnParameters(
                metric="cosine"
            )
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm="myHnsw",
        ),
        VectorSearchProfile(
            name="myExhaustiveKnnProfile",
            algorithm="myExhaustiveKnn",
        )
    ]
)
# Create the search index
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')
I am unable to post a sample of the data, but it is a Pandas dataframe with the same fields as the index:
Id (string)
Field1 (string)
Field2 (string)
Field3 (string)
Field4 (string)
Field5 (string)
Field4_vec (contents in the shape of [-0.01168345008045435, -0.0396871380507946, -0...]) with dimension 384
Field5_vec (contents in the shape of [-0.01168345008045435, -0.0396871380507946, -0...]) with dimension 384
Any advice is appreciated. Thanks!
>Solution :
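For the first error, the documents parameter of batch_client.upload_documents expects a list of dictionaries (one per document), not a JSON string. df.to_json() returns a single str, which is consistent with the AttributeError: 'str' object has no attribute 'get' at the end of your traceback. Try converting your dataframe to a list of dictionaries instead: input_data = df.to_dict(orient="records").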
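To make the difference concrete, here is a minimal sketch using a made-up two-row dataframe (the column values and the 3-element vectors are placeholders, not your data). It only compares what the two pandas conversions return and does not call the Azure SDK:

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe; real vectors would have 384 values.
df = pd.DataFrame(
    {
        "Id": ["1", "2"],
        "Field1": ["alpha", "beta"],
        "Field4_vec": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    }
)

as_json = df.to_json(orient="records")      # a single JSON-encoded string
as_records = df.to_dict(orient="records")   # a list with one dict per row

print(type(as_json).__name__)     # str
print(type(as_records).__name__)  # list
print(len(as_records))            # 2
print(as_records[0]["Id"])        # 1
```

A side effect worth noting: with the JSON string, len(input_data) in your print statement counts characters, not documents. If the embedding columns hold NumPy arrays rather than Python lists, they may also need converting (for example with .tolist()) before upload; that is an assumption about your data, since the sample was not posted.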
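For the second error, you may be right that the null values are being detected because of the format you are uploading rather than because of your data, so the same df.to_dict(orient="records") change is worth trying with SearchClient.upload_documents as well. Note that your vector fields can be an empty array [] but cannot be null.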
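If the error persists after switching to a list of dictionaries, it may help to confirm that no field is actually null before uploading. The helper below is a hypothetical sketch, not part of the Azure SDK. It flags None values, and also NaN on the assumption that missing values coming out of pandas are unwanted too:

```python
import math
from typing import Any


def find_missing_fields(documents: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (document position, field name) for every None or NaN value."""
    missing = []
    for position, doc in enumerate(documents):
        for name, value in doc.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or is_nan:
                missing.append((position, name))
    return missing


# Made-up documents; real vectors would have 384 entries.
sample_docs = [
    {"Id": "1", "Field4": "text", "Field4_vec": [0.1, 0.2]},
    {"Id": "2", "Field4": None, "Field4_vec": []},             # null string field
    {"Id": "3", "Field4": "text", "Field4_vec": None},         # null vector field
    {"Id": "4", "Field4": float("nan"), "Field4_vec": [0.3]},  # pandas-style NaN
]

print(find_missing_fields(sample_docs))
# [(1, 'Field4'), (2, 'Field4_vec'), (3, 'Field4')]
```

The empty list in the second document is not reported, matching the note above that an empty array is acceptable for a vector field while null is not. You could run it as find_missing_fields(df.to_dict(orient="records")) before calling upload_documents.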