- ⚙️ Struct fields in PySpark can be filtered using dot notation, getField(), or expr() for clean and flexible querying.
- 🧪 pyspark array_contains lets you filter arrays inside structs by checking if a value exists in a nested list.
- 🧭 Flattening structs using withColumn simplifies and speeds up repeated access to deeply nested fields.
- 🚨 Schema changes and null fields in structs can make filters fail if you do not handle them with schema checks.
- ⚡ Using struct filters correctly makes things faster in large ETL, analytics, or log file processing pipelines.
Working with nested data in PySpark is a common need, especially when handling data from JSON, NoSQL, or APIs. To query this nested data well, you need to know how to filter struct fields precisely. This article provides a detailed guide on how to filter a struct by field in PySpark using methods like dot notation, getField(), expr(), and the array_contains() function.
Understanding Structs in PySpark
In PySpark, a StructType is one of the data types that hold nested data. It is made of individual StructFields. Structs provide a way to represent multiple fields under a single column. This makes them very useful for semi-structured data, such as that coming from JSON or columnar storage formats like Parquet or Avro.
Struct types allow you to package several fields as a single logical column:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField("location", StructType([
StructField("city", StringType(), True),
StructField("country", StringType(), True)
]))
])
This way of organizing data is key when processing log data, telemetry data, or API responses that have deeply nested schemas. Structs also help save space and run faster due to how Spark's Catalyst and Tungsten engines work.
When Filtering on a Struct Field Becomes Necessary
In real-world pipelines, schemas tend to mirror the nested shape of the source data. Consider user demographic information stored in nested JSON:
{
"name": "Alice",
"user": {
"age": 29,
"address": {
"city": "New York",
"zip": "10001"
}
}
}
If your business logic requires filtering users from "New York", you need to filter on user.address.city.
Filtering top-level fields is simple:
df.filter(col("city") == "New York")
But for nested fields, you reference the full path:
df.filter(col("user.address.city") == "New York")
This kind of filtering—called "filter struct by field" in PySpark—is key for building fast ETL pipelines and business reporting dashboards.
How to Filter Struct Fields in PySpark
There are several common ways to access and filter struct fields in PySpark. Each works best in different situations, depending on readability, complexity, and maintainability.
1. Dot Notation and getField()
Dot notation is the simplest option when the struct's layout is known and unambiguous:
from pyspark.sql.functions import col
df.filter(col("location.city") == "London")
Alternatively, use getField() for more control, especially if field names contain special characters or periods:
df.filter(col("location").getField("city") == "London")
Advantages of getField():
- Avoids ambiguity when field names contain periods (.)
- Safer for programmatically generated queries
- Works with evolving schemas
You can also chain multiple getField() calls:
df.filter(col("user").getField("address").getField("city") == "London")
This is a good way to work when dealing with deeply nested data.
2. SQL Expression via expr()
The expr() function lets you use SQL-like expressions. This can make filters with many conditions easier:
from pyspark.sql.functions import expr
df.filter(expr("location.city = 'London'"))
It works well when you combine many conditions:
df.filter(expr("location.city = 'London' OR location.city = 'Paris'"))
Pros:
- Good for dynamically built filters
- Works with logic defined as text
- Supports conditional logic (CASE statements)
But for complex nesting or programmatic query building, getField() or dot notation may still be clearer and less error-prone.
3. Flattening Structs with withColumn()
If you need to get to the same nested field many times, flattening the struct can make processing faster and your code shorter:
df = df.withColumn("city", col("location.city"))
df_filtered = df.filter(col("city") == "London")
This approach:
- Makes it easier to read (especially when used many times)
- Makes things faster by avoiding repeated nested access
- Lets you give shorter names to fields with long paths
Useful tip: Do this after confirming your fields with df.printSchema().
Example: Filtering Struct Field with Dot Notation and getField()
Let’s walk through a real code example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.getOrCreate()
data = [
("Alice", {"city": "London", "country": "UK"}),
("Bob", {"city": "Paris", "country": "France"})
]
schema = StructType([
StructField("name", StringType(), True),
StructField("location", StructType([
StructField("city", StringType(), True),
StructField("country", StringType(), True)
]), True)
])
df = spark.createDataFrame(data, schema)
# Dot notation
df.filter(col("location.city") == "London").show()
# getField notation
df.filter(col("location").getField("city") == "London").show()
Both filters return the same rows, confirming that the two access styles are equivalent.
Using array_contains Within Struct Fields
The pyspark array_contains function is often misunderstood: it filters on array columns, including arrays nested inside structs.
Example: Filtering Nested Arrays with array_contains
from pyspark.sql.functions import array_contains
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
data = [
("Sam", {"skills": ["Python", "Spark"]}),
("Lisa", {"skills": ["Java", "SQL"]})
]
schema = StructType([
StructField("name", StringType(), True),
StructField("profile", StructType([
StructField("skills", ArrayType(StringType()), True)
]), True)
])
df = spark.createDataFrame(data, schema)
df.filter(array_contains(col("profile.skills"), "Spark")).show()
This returns the row where the skills array contains "Spark".
Key Notes:
- array_contains must be used on array-typed fields
- You cannot apply it to a struct field itself
- It works with nested fields, but verify the type first
⚠️ Tip: Always check your schema with df.printSchema() before using array_contains() to confirm the field is actually an array.
Filtering Struct Fields with Complex Conditions
Complex filters often need you to combine many nested fields using logical operators. SQL-like queries via expr() make this easier to read:
df.filter(
expr("location.city = 'London' AND location.country = 'UK'")
).show()
This is the same as:
df.filter(
(col("location.city") == "London") &
(col("location.country") == "UK")
).show()
Always enclose conditions in parentheses to avoid logical errors when combining AND/OR expressions.
You often need to use many nested fields together for things like:
- Address checks
- Finding fraud
- Grouping by location
Pitfalls to Avoid
- Confusing arrays and structs: pyspark array_contains does not work on struct fields.
- Ignoring nulls: A missing subfield can make filters silently drop rows.
- Over-nesting: Deeply nested fields are harder to filter and validate.
- Misnaming: Field names containing periods (.) break dot notation. Use getField() to access them.
Recommendations:
- Run printSchema() in every notebook/script to verify the structure.
- Replace dots in nested field names with underscores during ingestion, if you can.
- Use coalesce or other null-safe expressions wherever nulls are possible.
Performance Notes for Struct Filtering
While Spark SQL is generally fast, how you apply filters to struct fields affects query performance.
Optimization Tips:
- 🔁 If you filter repeatedly on the same nested field, extract it once using withColumn.
- 🧊 Caching: Use df.cache() after flattening when the data is reused many times.
- 💾 Leverage Catalyst optimization: Prefer the DataFrame API over UDFs where possible.
- 🚫 Avoid explode() or complex joins on deeply nested data unless absolutely needed.
When you handle struct filtering correctly, Spark stays fast—up to 100x faster processing than traditional Hadoop MapReduce tasks (Armbrust et al., 2015).
Handling Unknown or Changing Struct Fields
Filtering becomes more complicated when schemas evolve, for example:
- JSON ingestion where keys vary
- Log files from different sources
- APIs that send different kinds of data
Handling Changing Schemas:
import json

# df.schema.json() serializes the schema; parse it so we can walk the tree
schema_json = json.loads(df.schema.json())

def traverse_struct(schema_obj, path=None):
    path = path or []  # avoid a mutable default argument
    for field in schema_obj.get("fields", []):
        current_path = path + [field["name"]]
        field_type = field["type"]
        # Nested structs serialize as dicts; primitive types as plain strings
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            traverse_struct(field_type, current_path)
        else:
            print(".".join(current_path))

traverse_struct(schema_json)
This recursion prints the dotted path of every leaf subfield, letting you inspect and filter fields programmatically even when the schema is not known in advance.
Real-Life Scenarios: Applying Struct Filtering
In production, filtering by struct fields becomes very important in:
- ETL Pipelines: Clean and check input data inside nested records
- Log Analysis: Process only events where metadata shows high importance or certain services
- User Tracking and Analytics: Filter user sessions with specific actions within nested "events" or "actions" fields
- Compliance and Audit: Find records with private information in nested fields so you can remove or encrypt it.
Companies handling different or semi-structured data formats (media, banking, IoT) depend a lot on strong struct filtering.
Best Practices to Follow
- ✅ Prefer dot notation unless dealing with ambiguous field names
- 🔁 Use getField() when building expressions dynamically
- 🧱 Flatten selectively with withColumn() to tame nested data
- 🧪 Run printSchema() checks often while testing
- ⛔ Do not use array_contains() on non-array fields; check types first
- 🛠️ Use expr() for complex logic involving multiple nested fields
- 🚀 Apply filters early, before expensive operations like joins or explodes
Recap
Filtering structured nested data is a core skill when working with PySpark. With dot notation, getField(), expr(), and pyspark array_contains, you can build pipelines that are both readable and fast. Whether you're drilling into logs, processing user sessions, or preparing data for complex joins, struct filtering is a tool worth mastering.
References
- Armbrust, M., Das, T., Xin, R. S., et al. (2015). Apache Spark SQL: A Unified Data Processing Engine for Big Data Workloads. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 1381-1392. https://doi.org/10.1145/2723372.2742797
- Databricks. (2023). StructType – Nested Data Structures. Retrieved from https://docs.databricks.com/data/data-sources/read-json.html
- Apache Spark Documentation. (2023). Functions – array_contains. Retrieved from https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html