In this post, we’ll see how implicit casting in Spark SQL can lead to unexpected and wrong results. Spark SQL actually follows MySQL and Hive in performing implicit type conversion for binary comparisons.

Suppose we have this DataFrame with schema below:
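For illustration, assume a minimal DataFrame with a single StringType column _c2, registered as a temporary view t (the column name, the sample data and the view name are assumptions for this sketch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("implicit-cast-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: _c2 is a StringType column holding mostly numeric text
val df = Seq("5", "10", "abc").toDF("_c2")
df.printSchema()
// root
//  |-- _c2: string (nullable = true)

df.createOrReplaceTempView("t")
```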

Let’s run this query and see the result:
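With the hypothetical view above, an equality filter against an integer literal just works, even though _c2 is a string:

```scala
spark.sql("SELECT * FROM t WHERE _c2 = 5").show()
// +---+
// |_c2|
// +---+
// |  5|
// +---+
```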

This is pretty cool. We don’t have to explicitly cast the column _c2 (which is of StringType). But wait, now let’s run this:
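Sticking with the same hypothetical data, a negated comparison is where it gets surprising:

```scala
spark.sql("SELECT * FROM t WHERE _c2 != 5").show()
// +---+
// |_c2|
// +---+
// | 10|
// +---+
```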

There is something fishy here. As per general logic, the output should have been like this:
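That is, with the hypothetical data above, the non-numeric row should also survive a `!=` filter:

```
+---+
|_c2|
+---+
| 10|
|abc|
+---+
```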

It happened due to the same implicit casting logic. Let’s take the help of SQL EXPLAIN.
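Running EXPLAIN on the hypothetical query shows a cast being injected around the string column (the exact target type may differ across Spark versions):

```scala
spark.sql("SELECT * FROM t WHERE _c2 != 5").explain(true)
// The analyzed logical plan contains a predicate along the lines of
//   NOT (cast(_c2 as double) = 5.0)
// i.e. the StringType column is implicitly cast before the comparison.
```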

The above implicit cast is added during the analysis phase of the logical plan, by the PromoteStrings rule in TypeCoercion.scala.

During the actual cast, the NumberFormatException raised for an unparseable string is swallowed and null is returned instead, and those null rows are then silently dropped during filtering.
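This null-on-failure behaviour of CAST is easy to see in isolation:

```scala
spark.sql("SELECT cast('abc' AS int) AS casted").show()
// +------+
// |casted|
// +------+
// |  null|
// +------+
```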


Also, due to loss of precision, this type of error can happen:
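Here is a hypothetical reproduction, assuming a string column id holding a large numeric value; 19157170390056971 and 19157170390056973 are different longs, yet on affected Spark versions the filter can still match:

```scala
val df2 = Seq("19157170390056971").toDF("id")
df2.createOrReplaceTempView("t2")

spark.sql("SELECT * FROM t2 WHERE id = 19157170390056973").show()
// On affected versions this returns the '19157170390056971' row,
// even though 19157170390056971 != 19157170390056973.
```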

because
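both long values collapse onto the same double, which is what the promoted comparison actually compares (quick check in a Scala REPL):

```scala
scala> 19157170390056971L.toDouble == 19157170390056973L.toDouble
res0: Boolean = true

scala> 19157170390056971L.toDouble
res1: Double = 1.9157170390056972E16
```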

Meanwhile, Spark developers are coming up with a fix for this second case; refer to https://github.com/apache/spark/pull/15880. But the first case is not covered by this bug fix.

So be careful while handling these types of cases where implicit casting happens. The ultimate solution would be strict type-safety, but that would break Hive compatibility badly.

You are probably thinking that we can impose explicit type casting using UDFs, which could throw an exception in such cases, but the problem is the same with UDFs.

UDFs are also not type-safe.
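A quick hypothetical sketch of why: even if the UDF declares Long parameters, the analyzer inserts the same implicit cast in front of it instead of rejecting the string column, so the values still pass through a cast before the UDF runs:

```scala
import org.apache.spark.sql.functions.{col, lit, udf}

// Hypothetical UDF that only accepts Long inputs
val notEq = udf((x: Long, y: Long) => x != y)

df.filter(notEq(col("_c2"), lit(5L))).explain(true)
// The analyzed plan shows cast(_c2 as bigint) feeding the UDF:
// the string column is silently cast rather than rejected, so the UDF
// adds no type safety over the plain comparison.
```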