Why Spark/MR Does Not Consider UTF-8 Encoding

Reading/Writing UTF-8 Encoded Files

Sometimes we encounter issues in which Spark returns non-ASCII characters in the wrong format. Especially in CDH-flavored Hadoop versions (say, CDH 6.*), Spark/MR jobs decode non-ASCII characters incorrectly: rather than using the UTF-8 encoding, Spark/MR uses “ANSI_X3.4-1968”.

On investigating further, we can see in the YARN logs of the Spark/MR job that the Java encoding properties file.encoding and sun.jnu.encoding are set to ANSI_X3.4-1968.

For example,

grep "ANSI"  /application_*************_*****.log 
    file.encoding = ANSI_X3.4-1968
    sun.jnu.encoding = ANSI_X3.4-1968 

But if we collect the Java properties using the command “java -XshowSettings -version”, we can see that Java has UTF-8 set as its encoding:

Property settings:

    awt.toolkit = sun.awt.X11.XToolkit
    file.encoding = UTF-8
    file.encoding.pkg = sun.io
    sun.io.unicode.encoding = UnicodeLittle
    sun.java.launcher = SUN_STANDARD
    sun.jnu.encoding = UTF-8
We can also see the same behavior in the output of the Unix command “locale”, which returns the system locale settings:

$ locale

Why, then, does the Spark/MR application return characters in the wrong format? Why does Spark/MR not use the expected UTF-8 encoding, but “ANSI_X3.4-1968”, in its applications?

The reason behind this issue is that in some CDH versions (especially CDH 6.*), CDH blocks certain environment variables on the Linux system by default. The encoding of most Linux systems is UTF-8, but because CDH 6 restricts these variables by default, Java falls back to the JVM default, ANSI_X3.4-1968.
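This fallback can be reproduced outside the cluster (a sketch, assuming a Java 8 JDK is on the PATH): the JVM derives file.encoding from the process locale, so clearing the locale environment produces the same ANSI_X3.4-1968 default that the containers see.

```shell
# With the POSIX/C locale, Java 8 falls back to ANSI_X3.4-1968:
LANG=C LC_ALL=C java -XshowSettings:properties -version 2>&1 | grep 'file.encoding'

# With a UTF-8 locale, the same JVM reports UTF-8:
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 java -XshowSettings:properties -version 2>&1 | grep 'file.encoding'
```

This is exactly what happens inside a YARN container when LANG is not passed through: the container process starts with an empty locale, and the JVM picks the fallback encoding.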

We can confirm this behavior with the following simple Spark code:

import java.nio.charset.Charset

println("Default charset: " + Charset.defaultCharset().displayName())
println("File encoding: " + System.getProperty("file.encoding"))
println("Scala encoding: " + scala.util.Properties.encodingString)

On an affected cluster, these lines report ANSI_X3.4-1968 instead of UTF-8.

How to resolve this issue?

We can resolve this issue at the job level by passing file.encoding and sun.jnu.encoding as UTF-8 to the Spark/MR application:

spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
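For example (a sketch; the JAR and class names below are placeholders), the Spark options can be supplied on the command line with spark-submit, and the MapReduce equivalents go into the map/reduce task JVM options:

```shell
# Spark: pass the encoding flags to both the driver and executor JVMs.
spark-submit \
  --conf spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8" \
  --conf spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8" \
  --class com.example.MyJob \
  my-job.jar

# MapReduce: append the same flags to the task JVM options
# (the job's main class must support generic options, i.e. implement Tool).
hadoop jar my-job.jar com.example.MyJob \
  -Dmapreduce.map.java.opts="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8" \
  -Dmapreduce.reduce.java.opts="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
```

Note that mapreduce.map.java.opts and mapreduce.reduce.java.opts replace any cluster-side defaults for those properties, so existing heap settings (e.g. -Xmx) may need to be repeated alongside the encoding flags.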

But to resolve this issue at the cluster level, we can whitelist the following environment variables in YARN so that containers inherit the system defaults:

CM -> YARN -> Configuration -> search for “Containers Environment Variables Whitelist” -> add the variables HADOOP_HOME, PATH, LANG, TZ -> click “Save Changes”. Then restart all NodeManagers.
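Outside of Cloudera Manager, this change corresponds to the NodeManager environment whitelist in yarn-site.xml (a sketch for unmanaged clusters; on CM-managed clusters this file is generated, so use the CM setting above instead). The value below extends the Hadoop default whitelist with the variables from this article:

```xml
<!-- yarn-site.xml: let containers inherit LANG (and PATH, TZ) from the host -->
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ</value>
</property>
```

With LANG whitelisted, container JVMs start with the host's UTF-8 locale and derive file.encoding correctly, without any per-job flags.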
