Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Spark Shell: SQL Query doesn't return any results when data is integer/double

I am using the MongoDB Spark Connector to import data from MongoDB and then perform some SQL queries. I will describe the whole process before getting into the actual problem in case I have done something wrong since it’s the first time I am using these tools.

I initialize spark-shell with the specific Collection, including the connector package:

spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection"      
 --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1

I enable the connector:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import com.mongodb.spark._

I make an RDD of the collection I want:

val rdd = MongoSpark.load(sc)

I make a Dataframe out of it so I can use SQL on it:

val df = rdd.toDF()

I create a temporary view of the dataset:

df.createOrReplaceTempView("sales")

The database has 1 million entries in this format:

_id: 61a6540c3838fe02b81e5339
Region: "Sub-Saharan Africa"
Country: "Ghana"
Item Type: "Office Supplies"
Sales Channel: "Online"
Order Priority: "L"
Order Date: 2017-03-22T22:00:00.000+00:00
Order ID: 601245963
Ship Date: 2017-04-14T21:00:00.000+00:00
Units Sold: 896
Unit Price: 651.21
Unit Cost: 524.96
Total Revenue: 583484.16
Total Cost: 470364.16
Total Profit: 113120

The datatypes are string/date/number/double.

A simple query like this works just fine:

spark.sql("SELECT * FROM sales").show()

Queries on strings work fine too:

spark.sql("SELECT Country FROM sales WHERE Region = 'Sub-Saharan Africa'").show()

But when I do a query on numerical values I get back an empty table:

spark.sql("SELECT * FROM sales WHERE 'Units Sold' > 4000").show()

I have tried assigning the query to a val and then using show() on it but it’s the same thing.

>Solution :

If it is not a typo/cut-n-paste error in your SELECT, the WHERE clause in it compares string "Units Sold" to a numeric value 4000 which is never true. The proper way to escape column names in SparkSQL is using a ` (backticks) not an ' (apostrophes).

So use the following query

spark.sql('SELECT * FROM sales WHERE `Unit Sold` > 1').show()
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading