Create a new column in Spark dataframe that is a list of other column values

I have a dataframe called ‘df’, structured as follows:

ID name lv1 lv2
abb name1 40.34 21.56
bab name2 21.30 67.45
bba name3 32.45 45.44

In Pandas, I can use the following code to create a new column that contains a list of the lv1 and lv2 values:

cols = ['lv1', 'lv2']
df['new_col'] = df[cols].values.tolist()

Because of memory issues caused by the size of the data, I am now using Databricks instead (which I have never used before) and need to replicate the above. I’ve successfully created a Spark dataframe by mounting the location of my data and then loading it:

file_location = 'dbfs:/mnt/<mountname>/filename.csv'
file_type = "csv"

# With inferSchema disabled, every column is read as a string
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# Parentheses let the method chain span several lines
df = (spark.read.format(file_type)
      .option("inferSchema", infer_schema)
      .option("header", first_row_is_header)
      .option("sep", delimiter)
      .load(file_location))

display(df)

This loads the data, but I’m stuck on the next step. I’ve found a function called struct in Spark, but I can’t seem to find the corresponding function in PySpark. Any suggestions?

Solution:

You’re probably looking for the array function, which packs several columns into a single array column.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('abb', 'name1', 40.34, 21.56),
     ('bab', 'name2', 21.30, 67.45),
     ('bba', 'name3', 32.45, 45.44)],
    ['ID', 'name', 'lv1', 'lv2'])

df = df.withColumn('new_col', F.array('lv1', 'lv2'))

df.show()
# +---+-----+-----+-----+--------------+
# | ID| name|  lv1|  lv2|       new_col|
# +---+-----+-----+-----+--------------+
# |abb|name1|40.34|21.56|[40.34, 21.56]|
# |bab|name2| 21.3|67.45| [21.3, 67.45]|
# |bba|name3|32.45|45.44|[32.45, 45.44]|
# +---+-----+-----+-----+--------------+