FIX – TypeError: an integer is required (got type bytes)

In this article, we will look at the “TypeError: an integer is required (got type bytes)” error that occurs in PySpark and explore the methods to resolve it.

The “TypeError: an integer is required (got type bytes)” error typically arises when executing a PySpark 2.x script in a Python 3.8 or later environment. To resolve this issue, running the PySpark script within a supported Python environment is crucial (a short sketch of the root cause follows the list below):

  • Spark 2.4 supports Python 2.7 and 3.4-3.7.
  • Spark 3.0 supports Python 2.7 and 3.4 and higher.
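
The root cause: Spark 2.x bundles an old cloudpickle whose _make_cell_set_template_code calls types.CodeType() with the positional argument order of Python <= 3.7. Python 3.8 inserted a new posonlyargcount parameter, so every later argument shifts by one and co_code (a bytes object) lands in the slot where the integer flags is expected. The snippet below is my own minimal illustration of that failure, not Spark code:

import types

def rebuild(fn):
    code = fn.__code__
    # types.CodeType positional order as of Python <= 3.7. On 3.8+,
    # co_code (bytes) shifts into the integer "flags" slot.
    return types.CodeType(
        code.co_argcount, code.co_kwonlyargcount, code.co_nlocals,
        code.co_stacksize, code.co_flags, code.co_code, code.co_consts,
        code.co_names, code.co_varnames, code.co_filename, code.co_name,
        code.co_firstlineno, code.co_lnotab, code.co_freevars,
        code.co_cellvars)

def f(x):
    return x + 1

rebuild(f)  # Python <= 3.7: returns a new code object.
            # Python 3.8/3.9: TypeError: an integer is required (got type bytes)

This is exactly the call that the traceback later in this article points at.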

Impact: TypeError: an integer is required (got type bytes)

This TypeError will result in Spark job failure, which I will show you in a few seconds 🙂

Replicating the Issue

To replicate the issue, I have installed Python 2.7, 3.7, and 3.9 in my lab environment.

# yum install python3.9

================================================================================
 Package  Arch   Version                               Repository          Size
================================================================================
Installing:
 python39 x86_64 3.9.2-1.module+el8.4.0+10237+bdc77aac Internal-Lab  32 k
Installing dependencies:
 python39-libs
          x86_64 3.9.2-1.module+el8.4.0+10237+bdc77aac Internal-Lab 8.1 M
 python39-pip-wheel


]# ls /usr/bin/py*
pydoc2.7                  python2-config            python3.8
pydoc3                    python2.7                 python3.8-config
pydoc3.6                  python2.7-config          python3.8-x86_64-config
pydoc3.8                  python3                   python3.9
pydoc3.9                 

Now, we can replicate the issue by pointing PySpark 2.x to Python 3.9.

NOTE: I am using the PySpark interactive shell for easier understanding. You can also replicate this by running a PySpark script with spark-submit (non-interactive mode) in a Python 3.9 environment, and the result will be the same.
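
For example, the non-interactive equivalent would look like this (my_script.py is a placeholder for any PySpark 2.x script) and fails with the same traceback:

]# spark-submit --master yarn --conf spark.pyspark.python=/usr/bin/python3.9 my_script.py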

]# pyspark --master yarn --conf spark.pyspark.python=/usr/bin/python3.9

Python 3.9.2 (default, Mar  5 2021, 01:49:45) 
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/shell.py", line 31, in <module>
    f1.cdh/lib/spark/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
>>> 

Resolution

To resolve the issue, we need to make sure we use compatible Spark and Python versions.

]# pyspark --master yarn --conf spark.pyspark.python=/usr/bin/python2.7

Python 2.7.18 (default, Jan 22 2021, 11:25:37)
[GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/spark-atlas-connector-assembly-.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4
      /_/

Using Python version 2.7.18 (default, Jan 22 2021 11:25:37)
SparkSession available as 'spark'.
>>> 

Based on the compatibility matrix:

Spark Version    Python Version
Spark 2.4        Python 2.7 and 3.4-3.7
Spark 3.0        Python 2.7 and 3.4 and higher

By pointing to the correct version, we can resolve the TypeError: an integer is required (got type bytes).

In the above example, I pointed PySpark to Python 2.7 and it worked as expected.

If you want to make these changes global, so that you do not need to specify the version every time you run a job, you can set the correct version in the below files on all the Spark nodes:

spark-env.sh:

export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7

spark-defaults.conf (this file takes Spark properties, not export statements):

spark.pyspark.python /usr/bin/python2.7
spark.pyspark.driver.python /usr/bin/python2.7
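
To confirm the setting took effect, here is a quick sanity check from the pyspark shell (a minimal sketch; sc is the SparkContext the shell provides). It prints the interpreter version seen by the driver and by an executor:

import sys
print("driver Python:   " + sys.version.split()[0])
print("executor Python: " + sc.range(1).map(lambda _: __import__("sys").version.split()[0]).first())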

If you are using Spark 3, you can point the Python version to /usr/bin/python3.9 instead.

If you are using a Cloudera distribution, you can add this through the Cloudera Manager web UI as below.

Example for Spark 3 and Python 3.9 on a Cloudera distribution:

Requirements:
CDS 3 requires Python 3.4 or higher.

In Cloudera Manager -> Spark -> Configuration tab -> Spark 3 Client Advanced Configuration Snippet,
add the spark-env.sh variables to the spark-env.sh snippet property ->

export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9

Add the corresponding properties to the spark-defaults.conf snippet property (spark-defaults.conf takes Spark properties, not export statements) ->

spark.pyspark.python /usr/bin/python3.9
spark.pyspark.driver.python /usr/bin/python3.9
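
Alternatively, if you only need this for a single job rather than globally, the same properties can be passed per run. Assuming the CDS 3 binary names (pyspark3/spark3-submit), a per-job invocation would look like:

]# pyspark3 --master yarn --conf spark.pyspark.python=/usr/bin/python3.9 --conf spark.pyspark.driver.python=/usr/bin/python3.9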

Conclusion

In conclusion, encountering the error “TypeError: an integer is required (got type bytes)” is a common occurrence when attempting to execute a PySpark 2.x script within a Python 3.8+ environment. This error can be resolved by ensuring that the PySpark script is executed within a Python environment the Spark version actually supports, per the compatibility matrix above.

By running the script in a supported environment, users can avoid this error and ensure the successful execution of their PySpark code.

Good luck with your learning!! Feel free to comment if you have any questions.

Related Topics:

Resolve “org.apache.hadoop.hive.serde2.SerDeException: Unexpected tag” in Spark and Hive

Total size of serialized results of tasks (1024.5 MB) is bigger than spark.driver.maxResultSize

Resolve “Task serialization failed: java.lang.StackOverflowError” in Spark

“Futures timed out” issue in spark

Resolve the “User application exited with status 1” issue in Spark
