Home creating a table with string<array> and string in pyspark

Questions

creating a table with string<array> and string in pyspark

March 2, 2022

I have a dataset as below.

+-------------------------------------------------------+-------------------------------------------------------------------+---------------------------------------------------------------------------------------+-------------+---------------------------------------------------------+-------------+
|emp_id                                                 |sik_id                                                             |modification_date                                                                      |file_name    |org_path                                                 |received_date|
+-------------------------------------------------------+-------------------------------------------------------------------+---------------------------------------------------------------------------------------+-------------+---------------------------------------------------------+-------------+
|[85627230-s387s09, 98722016-s015s05, 40022035-s008s21] |[f13c1320-5c8f3daas5cd, f13c1384-6659-4831, 4831-aaf1-5c8f3da]     |[2021-04-19T11:43:32.617953Z, 2021-04-19T11:43:32.858290Z, 2021-04-19T11:43:34.027082Z]|test1.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-24/    |2022-01-25   |
|[67dm34-4334, 8723gv6-2022, 6f7m99-2244-ki856]         |[66d9-4888-aaf1, aaf1-5c8f3da1d5cd, f13c1884-66d9]                 |[2020-11-12T23:22:05.433107Z, 2020-11-12T20:16:51.339437Z, 2020-11-11T20:59:03.758126Z]|test2.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-25/    |2022-01-25   |
+-------------------------------------------------------+-------------------------------------------------------------------+---------------------------------------------------------------------------------------+-------------+---------------------------------------------------------+-------------+

Whose schema contains array and string fields as below

>>> df.printSchema()
root
 |-- emp_id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sik_id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- modification_date: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- file_name: string (nullable = false)
 |-- org_path: string (nullable = false)
 |-- received_date: string (nullable = false)

I would like to get a result something like below where every emp_id, sik_id, modification_date gets the right file_name, org_path , received_date

+-----------------+--------------------------+-----------------------------+-------------+---------------------------------------------------------+-------------+
|emp_id           |sik_id                    |modification_date            |file_name    |org_path                                                 |received_date|
+-----------------+--------------------------+-----------------------------+-------------+---------------------------------------------------------+-------------+
|85627230-s387s09 |f13c1320-5c8f3daas5cd     |2021-04-19T11:43:32.617953Z  |test1.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-24/    |2022-01-25   |
|98722016-s015s05 |f13c1384-6659-4831        |2021-04-19T11:43:32.858290Z  |test1.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-24/    |2022-01-25   |
|40022035-s008s21 |4831-aaf1-5c8f3da         |2021-04-19T11:43:34.027082Z  |test1.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-24/    |2022-01-25   |
|67dm34-4334      |66d9-4888-aaf1            |2020-11-12T23:22:05.433107Z  |test2.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-25/    |2022-01-25   |
|8723gv6-2022     |aaf1-5c8f3da1d5cd         |2020-11-12T20:16:51.339437Z  |test2.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-25/    |2022-01-25   |
|6f7m99-2244-ki856|f13c1884-66d9             |2020-11-11T20:59:03.758126Z  |test2.json   |s3://my-bucket/test_prefix/vualt2/rcvd_dt=2022-01-25/    |2022-01-25   |
+-----------------+--------------------------+-----------------------------+-------------+---------------------------------------------------------+-------------+

I tried using zip() on these fields but looks like zip doesn’t work on array and string fields. As I was seeing a type mismatch exception.

Can someone please help me with the right solution.

Thanks in advance.

>Solution :

Combining the SQL functions arrays_zip and inline.

df = df.selectExpr('inline(arrays_zip(emp_id, sik_id, modification_date))', 'file_name', 'org_path', 'received_date')
df.show(truncate=False)

apache-spark-sql

byMR

Published March 02, 2022

Add a comment

ObjectOutputStream writes same instances differently depending on how i open stream

byMR

March 2, 2022

Questions

How to copy the elements of the list within the list?

byMR

March 2, 2022

Questions

How to make a simple calculator using Python?

byMR

March 2, 2022

Questions

Making a Call From UIButton

byMR

March 2, 2022

Questions

Making a Call From UIButton

byMR

March 2, 2022

Questions

How to write multiple values to on a row to txt and avoid tuple?

byMR

March 2, 2022

creating a table with string<array> and string in pyspark

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Like this:

Leave a ReplyCancel reply

Read more

ObjectOutputStream writes same instances differently depending on how i open stream

How to copy the elements of the list within the list?

How to make a simple calculator using Python?

Making a Call From UIButton

Making a Call From UIButton

How to write multiple values to on a row to txt and avoid tuple?

Keep Up to Date with the Most Important News

creating a table with string<array> and string in pyspark

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Share this:

Like this:

Leave a ReplyCancel reply

Keep Up to Date with the Most Important News

Read more

ObjectOutputStream writes same instances differently depending on how i open stream

How to copy the elements of the list within the list?

How to make a simple calculator using Python?

Making a Call From UIButton

Making a Call From UIButton

How to write multiple values to on a row to txt and avoid tuple?

Discover more from Dev solutions