PySpark DataFrame schema shows string for every column

I am reading a CSV file with the code snippet below:

df_pyspark = spark.read.csv("sample_data.csv")
df_pyspark

When I print the DataFrame, the output looks like this:

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]

Every column's data type is shown as string, even though the columns contain different data types, as shown below:

df_pyspark.show()

+---+----------+---------+--------------------+-----------+----------+
|_c0|       _c1|      _c2|                 _c3|        _c4|       _c5|
+---+----------+---------+--------------------+-----------+----------+
| id|first_name|last_name|               email|     gender|     phone|
|  1|    Bidget| Mirfield|bmirfield0@scient...|     Female|5628618353|
|  2|   Gonzalo|    Vango|    gvango1@ning.com|       Male|9556535457|
|  3|      Rock| Pampling|rpampling2@guardi...|   Bigender|4472741337|
|  4|   Dorella|  Edelman|dedelman3@histats...|     Female|4303062344|
|  5|     Faber|  Thwaite|fthwaite4@google....|Genderqueer|1348658809|
|  6|     Debee| Philcott|dphilcott5@cafepr...|     Female|7906881842|
+---+----------+---------+--------------------+-----------+----------+

How can I print the actual data type of every column?

I am new to PySpark and don't know much about it yet.

Thank you!

>Solution :

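Use the inferSchema parameter when reading the CSV file. Without it, Spark reads every CSV column as string; with inferSchema=True, it infers each column's data type from the values in that column. Passing header=True also makes Spark use the first row as column names instead of treating it as data.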

    df_pyspark = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
    df_pyspark.show(5)

    +---+----------+---------+--------------------+-----------+----------+
    | id|first_name|last_name|               email|     gender|     phone|
    +---+----------+---------+--------------------+-----------+----------+
    |  1|    Bidget| Mirfield|bmirfield0@scient...|     Female|5628618353|
    |  2|   Gonzalo|    Vango|    gvango1@ning.com|       Male|9556535457|
    |  3|      Rock| Pampling|rpampling2@guardi...|   Bigender|4472741337|
    |  4|   Dorella|  Edelman|dedelman3@histats...|     Female|4303062344|
    |  5|     Faber|  Thwaite|fthwaite4@google....|Genderqueer|1348658809|
    +---+----------+---------+--------------------+-----------+----------+
    only showing top 5 rows

    df_pyspark.printSchema()

    root
     |-- id: integer (nullable = true)
     |-- first_name: string (nullable = true)
     |-- last_name: string (nullable = true)
     |-- email: string (nullable = true)
     |-- gender: string (nullable = true)
     |-- phone: long (nullable = true)