- ⚙️ Struct fields in PySpark can be filtered using dot notation, getField(), or expr() for clean and flexible querying.
- 🧪 pyspark array_contains lets you filter arrays inside structs by checking if a value exists in a nested list.
- 🧭 Flattening structs using withColumn simplifies and speeds up repeated access to deeply nested fields.
- 🚨 Schema changes and null fields in structs can make filters fail if you do not handle them with schema checks.
- ⚡ Using struct filters correctly makes things faster in large ETL, analytics, or log file processing pipelines.
Working with nested data in PySpark is a common need, especially when handling data from JSON, NoSQL, or APIs. To query this nested data well, you need to know how to filter struct fields precisely. This article provides a detailed guide on how to filter a struct by field in PySpark using methods like dot notation, getField(), expr(), and the array_contains() function.
Understanding Structs in PySpark
In PySpark, a StructType is one of the data types that hold nested data. It is made of individual StructFields. Structs provide a way to represent multiple fields under a single column. This makes them very useful for semi-structured data, such as that coming from JSON or columnar storage formats like Parquet or Avro.
Struct types allow you to package several fields as a single logical column:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField("location", StructType([
StructField("city", StringType(), True),
StructField("country", StringType(), True)
]))
])
This way of organizing data is key when processing log data, telemetry data, or API responses that have deeply nested schemas. Structs also help save space and run faster due to how Spark's Catalyst and Tungsten engines work.
When Filtering on a Struct Field Becomes Necessary
In real-world pipelines, schemas tend to mirror the nested shape of the source data. Consider user demographic information stored in nested JSON:
{
"name": "Alice",
"user": {
"age": 29,
"address": {
"city": "New York",
"zip": "10001"
}
}
}
If your business logic requires filtering users from "New York", you need to filter on user.address.city.
Filtering top-level fields is simple:
df.filter(col("city") == "New York")
But for nested fields, you reference the full path:
df.filter(col("user.address.city") == "New York")
This kind of filtering—called "filter struct by field" in PySpark—is key for building fast ETL pipelines and business reporting dashboards.
How to Filter Struct Fields in PySpark
There are several common ways to access and filter struct fields in PySpark. Each works best in different situations, depending on readability, complexity, and maintainability.
1. Dot Notation and getField()
Dot notation is the simplest option when the struct's layout is known and unambiguous:
from pyspark.sql.functions import col
df.filter(col("location.city") == "London")
Alternatively, use getField() for more control, especially if field names contain special characters or periods:
df.filter(col("location").getField("city") == "London")
Advantages of getField():
- Avoids ambiguity when field names contain periods (.)
- Safer for programmatically generated queries
- Works with evolving schemas
You can also chain multiple getField() calls:
df.filter(col("user").getField("address").getField("city") == "London")
This is a good way to work when dealing with deeply nested data.
2. SQL Expression via expr()
The expr() function lets you use SQL-like expressions. This can make filters with many conditions easier:
from pyspark.sql.functions import expr
df.filter(expr("location.city = 'London'"))
It works well when you combine many conditions:
df.filter(expr("location.city = 'London' OR location.city = 'Paris'"))
Pros:
- Good for dynamically built filters
- Works with logic defined as text
- Supports conditional logic (CASE statements)
But for complex nesting or programmatic query building, getField() or dot notation may still be clearer and less error-prone.
3. Flattening Structs with withColumn()
If you need to get to the same nested field many times, flattening the struct can make processing faster and your code shorter:
df = df.withColumn("city", col("location.city"))
df_filtered = df.filter(col("city") == "London")
This approach:
- Makes it easier to read (especially when used many times)
- Makes things faster by avoiding repeated nested access
- Lets you give shorter names to fields with long paths
Useful tip: Do this after confirming your fields with df.printSchema().
Example: Filtering Struct Field with Dot Notation and getField()
Let’s walk through a real code example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.getOrCreate()
data = [
("Alice", {"city": "London", "country": "UK"}),
("Bob", {"city": "Paris", "country": "France"})
]
schema = StructType([
StructField("name", StringType(), True),
StructField("location", StructType([
StructField("city", StringType(), True),
StructField("country", StringType(), True)
]), True)
])
df = spark.createDataFrame(data, schema)
# Dot notation
df.filter(col("location.city") == "London").show()
# getField notation
df.filter(col("location").getField("city") == "London").show()
Both filters return the same rows, confirming that the two access styles are equivalent.
Using array_contains Within Struct Fields
The pyspark array_contains function is often misunderstood: it filters on array columns, including arrays nested inside structs.
Example: Filtering Nested Arrays with array_contains
from pyspark.sql.functions import array_contains
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
data = [
("Sam", {"skills": ["Python", "Spark"]}),
("Lisa", {"skills": ["Java", "SQL"]})
]
schema = StructType([
StructField("name", StringType(), True),
StructField("profile", StructType([
StructField("skills", ArrayType(StringType()), True)
]), True)
])
df = spark.createDataFrame(data, schema)
df.filter(array_contains(col("profile.skills"), "Spark")).show()
This returns the row where the skills array contains "Spark".
Key Notes:
- array_contains must be used on array-typed fields
- You cannot apply it to a struct field itself
- It works with nested fields, but verify the type first
⚠️ Tip: Always check your schema with df.printSchema() before using array_contains() to confirm the field is actually an array.
Filtering Struct Fields with Complex Conditions
Complex filters often need you to combine many nested fields using logical operators. SQL-like queries via expr() make this easier to read:
df.filter(
expr("location.city = 'London' AND location.country = 'UK'")
).show()
This is the same as:
df.filter(
(col("location.city") == "London") &
(col("location.country") == "UK")
).show()
Always enclose conditions in parentheses to avoid logical errors when combining AND/OR expressions.
You often need to use many nested fields together for things like:
- Address checks
- Finding fraud
- Grouping by location
Pitfalls to Avoid
- Confusing arrays and structs: pyspark array_contains does not work on struct fields.
- Ignoring nulls: A missing subfield can make filters silently drop rows.
- Over-nesting: Deeply nested fields are harder to filter and validate.
- Misnaming: Field names containing periods (.) break dot notation. Use getField() to access them.
Recommendations:
- Run printSchema() in every notebook/script to verify the structure.
- Replace dots in nested field names with underscores during ingestion, if you can.
- Use coalesce or other null-safe expressions wherever nulls are possible.
Performance Notes for Struct Filtering
While Spark SQL is generally fast, how you apply filters to struct fields affects query performance.
Optimization Tips:
- 🔁 If you filter repeatedly on the same nested field, extract it once using withColumn.
- 🧊 Caching: Use df.cache() after flattening when the data is reused many times.
- 💾 Leverage Catalyst optimization: Prefer the DataFrame API over UDFs where possible.
- 🚫 Avoid explode() or complex joins on deeply nested data unless absolutely needed.
When you handle struct filtering correctly, Spark stays fast—up to 100x faster processing than traditional Hadoop MapReduce tasks (Armbrust et al., 2015).
Handling Unknown or Changing Struct Fields
Filtering becomes more complicated when schemas evolve, for example:
- JSON ingestion where keys vary
- Log files from different sources
- APIs that send different kinds of data
Handling Changing Schemas:
import json

# df.schema.json() serializes the schema; parse it so we can walk the tree
schema_json = json.loads(df.schema.json())

def traverse_struct(schema_obj, path=None):
    path = path or []  # avoid a mutable default argument
    for field in schema_obj.get("fields", []):
        current_path = path + [field["name"]]
        field_type = field["type"]
        # Nested structs serialize as dicts; primitive types as plain strings
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            traverse_struct(field_type, current_path)
        else:
            print(".".join(current_path))

traverse_struct(schema_json)
This recursion prints the dotted path of every leaf subfield, letting you inspect and filter fields programmatically even when the schema is not known in advance.
Real-Life Scenarios: Applying Struct Filtering
In production, filtering by struct fields becomes very important in:
- ETL Pipelines: Clean and check input data inside nested records
- Log Analysis: Process only events where metadata shows high importance or certain services
- User Tracking and Analytics: Filter user sessions with specific actions within nested "events" or "actions" fields
- Compliance and Audit: Find records with private information in nested fields so you can remove or encrypt it.
Companies handling different or semi-structured data formats (media, banking, IoT) depend a lot on strong struct filtering.
Best Practices to Follow
- ✅ Prefer dot notation unless dealing with ambiguous field names
- 🔁 Use getField() when building expressions dynamically
- 🧱 Flatten selectively with withColumn() to tame nested data
- 🧪 Run printSchema() checks often while testing
- ⛔ Do not use array_contains() on non-array fields; check types first
- 🛠️ Use expr() for complex logic involving multiple nested fields
- 🚀 Apply filters early, before expensive operations like joins or explodes
Recap
Filtering structured nested data is a core skill when working with PySpark. With dot notation, getField(), expr(), and pyspark array_contains, you can build pipelines that are both readable and fast. Whether you're drilling into logs, processing user sessions, or preparing data for complex joins, struct filtering is a tool worth mastering.
References
- Armbrust, M., Das, T., Xin, R. S., et al. (2015). Apache Spark SQL: A Unified Data Processing Engine for Big Data Workloads. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 1381-1392. https://doi.org/10.1145/2723372.2742797
- Databricks. (2023). StructType – Nested Data Structures. Retrieved from https://docs.databricks.com/data/data-sources/read-json.html
- Apache Spark Documentation. (2023). Functions – array_contains. Retrieved from https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html