Why Spark/MR does not consider UTF-8 encoding
Reading/writing a UTF-8 encoded file
Sometimes Spark returns non-ASCII characters in the wrong format. Especially on CDH-flavored Hadoop versions (say CDH 6.*), Spark/MR jobs decode non-ASCII characters incorrectly: instead of the UTF-8 encoding, Spark/MR ends up using "ANSI_X3.4-1968".
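To make the symptom concrete, here is a minimal, self-contained Scala sketch (illustrative only, not taken from an actual job) that decodes the same UTF-8 bytes once with UTF-8 and once with US-ASCII, which is the charset ANSI_X3.4-1968 refers to; the string "café" is just a stand-in for any non-ASCII data:

import java.nio.charset.StandardCharsets

// UTF-8 bytes of a non-ASCII string, e.g. data read from a UTF-8 encoded file
val utf8Bytes = "café".getBytes(StandardCharsets.UTF_8)

// Decoded with the expected charset, the text round-trips correctly
println(new String(utf8Bytes, StandardCharsets.UTF_8))    // café

// Decoded with US-ASCII (ANSI_X3.4-1968), the two bytes of 'é' cannot be
// mapped and come out as replacement characters
println(new String(utf8Bytes, StandardCharsets.US_ASCII)) // caf??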
On investigating further, we can see in the YARN logs of the Spark/MR job that the Java encoding properties file.encoding and sun.jnu.encoding are set to ANSI_X3.4-1968.
For example,
$ grep "ANSI" /application_*************_*****.log
file.encoding = ANSI_X3.4-1968
sun.jnu.encoding = ANSI_X3.4-1968
But if we collect the Java properties using the command "java -XshowSettings -version", we can see that Java has UTF-8 set as its encoding:
Property settings:
    awt.toolkit = sun.awt.X11.XToolkit
    file.encoding = UTF-8
    file.encoding.pkg = sun.io
    ..
    ..
    sun.io.unicode.encoding = UnicodeLittle
    sun.java.launcher = SUN_STANDARD
    sun.jnu.encoding = UTF-8

We can see the same behavior in the output of the Unix command "locale":

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
Why, then, does the Spark/MR application return characters in the wrong format? Why does Spark/MR not use the expected UTF-8 encoding, but "ANSI_X3.4-1968" instead?
The reason is that some CDH versions (especially CDH 6.*) block certain environment parameters on the Linux system by default. The locale of most Linux systems is UTF-8, but because CDH 6 does not pass these locale settings through to the containers, Java falls back to the JVM default, ANSI_X3.4-1968.
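Since the fix described below works by whitelisting environment variables, one way to confirm the diagnosis is to check whether the locale variables are visible inside the Spark JVM at all. This is a small illustrative sketch, not part of any CDH tooling; on an affected container the variables are expected to be absent:

// Check whether locale environment variables reach the JVM inside the
// container; if none are set, the JVM falls back to ANSI_X3.4-1968
Seq("LANG", "LC_ALL", "LC_CTYPE").foreach { name =>
  println(name + " = " + sys.env.getOrElse(name, "<not set>"))
}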
We can verify this behavior with the simple Spark code below:
import java.nio.charset.Charset

println("Default charset: " + Charset.defaultCharset().displayName())
println("File encoding: " + System.getProperty("file.encoding"))
println("Scala encoding: " + scala.util.Properties.encodingString)
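On an affected CDH 6 cluster all three lines typically report ANSI_X3.4-1968; after applying either of the fixes below they should report UTF-8.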
How to resolve this issue?
We can resolve this issue at the job level by passing file.encoding and sun.jnu.encoding as UTF-8 to the Spark/MR application:
spark.driver.extraJavaOptions="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
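The same two options can be supplied on the spark-submit command line with --conf, or set for all jobs in spark-defaults.conf. For a MapReduce job, the equivalent would be to append the same -D flags to mapreduce.map.java.opts, mapreduce.reduce.java.opts and yarn.app.mapreduce.am.command-opts.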
But to resolve this issue at the cluster level, we can whitelist the following environment variables in YARN so that containers pick up the system default locale:
CM -> YARN -> Configuration -> Search for "Containers Environment Variables Whitelist" -> Add the parameters HADOOP_HOME, PATH, LANG, TZ -> Click on "Save Changes" -> Restart all NodeManagers.
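On a plain Apache Hadoop cluster (without Cloudera Manager), the corresponding setting is typically the yarn.nodemanager.env-whitelist property in yarn-site.xml; adding LANG there lets containers inherit the host locale, so the JVM defaults to UTF-8 without per-job overrides.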