I have the following PySpark DataFrame.
column_a
name, varchar(10) country, age
name, age, decimal(15) percentage
name, varchar(12) country, age
name, age, decimal(10) percentage
I need to remove the varchar(n) and decimal(n) tokens from the DataFrame above, irrespective of the length value inside the parentheses. Below is the expected output.
column_a
name, country, age
name, age, percentage
name, country, age
name, age, percentage
How can I achieve this in PySpark?
> Solution:
You can remove the patterns matching decimal(n) and varchar(n) using regexp_replace with a regular expression that matches either keyword followed by a parenthesized length and a trailing space.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("name, varchar(10) country, age",),
        ("name, age, decimal(15) percentage",),
        ("name, varchar(12) country, age",),
        ("name, age, decimal(10) percentage",)]
df = spark.createDataFrame(data, ("column_a",))

# Replace "varchar(<digits>) " or "decimal(<digits>) " with an empty string.
df.withColumn("column_a",
              F.regexp_replace("column_a", r"varchar\(\d*\)\s|decimal\(\d*\)\s", "")) \
  .show(truncate=False)
"""
+---------------------+
|column_a |
+---------------------+
|name, country, age |
|name, age, percentage|
|name, country, age |
|name, age, percentage|
+---------------------+
"""