java.lang.OutOfMemoryError: UTF16 String size exceeding default value

I was trying to load TSV files from URLs (max file size was 1.05 GB, or 1129672402 bytes).

I used java.net.URL for it.

But it threw the error below (for the largest file):

java.lang.OutOfMemoryError: UTF16 String size is 1129672402, should be less than 1073741823

Is there any way to increase the default String size in Spark, or is there any other solution for processing this?

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.stream.Collectors

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

def getGeoFeedsDataNew(spark: SparkSession, url: String, schema: StructType): DataFrame = {
    val inputStream = new URL(url).openStream()
    val reader = new BufferedReader(new InputStreamReader(inputStream))
    // Joins the entire response into one String before splitting -- this is the line that overflows
    var data = reader.lines().collect(Collectors.joining("\n")).split("\\n").map("1\t".concat(_).concat("\t2")).map(_.split("\t"))
    inputStream.close()
    val size = data.length
    logger.info(s"records found: ${size-1}")
    if (size < 2) {
      return spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
    }
    data = data.slice(1, size)   // drop the header row
    val rowsRDD = spark.sparkContext.parallelize(data).map(row => Row.fromSeq(row.toSeq))
    val geoDataDF: DataFrame = spark.createDataFrame(rowsRDD, schema)
    geoDataDF
  }

My current Spark configs:

  "spark.driver.cores": "1",
  "spark.driver.memory": "30G",
  "spark.executor.cores": "8",
  "spark.executor.memory": "30G",
  "spark.network.timeout": "120",
  "spark.executor.instances": "8",
  "spark.rpc.message.maxSize": "1024",
  "spark.driver.maxResultSize": "2G",
  "spark.sql.adaptive.enabled": "true",
  "spark.sql.broadcastTimeout": "10000",
  "spark.sql.shuffle.partitions": "200",
  "spark.shuffle.useOldFetchProtocol": "true",
  "spark.hadoop.fs.s3a.committer.name": "magic",
  "spark.sql.adaptive.skewJoin.enabled": "true"

>Solution:

Unfortunately you seem to have hit https://bugs.openjdk.org/browse/JDK-8190429, and there is no way around this limitation. The same limit would apply if you had strings inside a row field. Instead of building one giant String, you must save the data to a file and refer to that. (Breaking the string up into different fields could work for a while, but then you would be fighting a 2 GB limit per row anyway; no configuration can change either limit, because both come from the maximum size of a Java byte array.)
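
As a rough sketch of that file-based approach (not from the original post): stream the URL straight to a file and let Spark parse the TSV itself, so the feed never has to fit into a single JVM String. The function name getGeoFeedsDataViaFile and the temp path are made up for illustration, the target location has to be readable by the executors (e.g. HDFS or S3 on a real cluster rather than the driver's local /tmp), and the original code's extra "1"/"2" columns would have to be added after loading, e.g. with lit.

import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

def getGeoFeedsDataViaFile(spark: SparkSession, url: String, schema: StructType): DataFrame = {
    // Hypothetical temp location; on a cluster this should be a path the executors can read
    val tmp = Files.createTempFile("geofeed-", ".tsv")
    val in = new URL(url).openStream()
    try {
      // Copy the response body to disk without ever materializing it in memory
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
    } finally {
      in.close()
    }
    // Let Spark read and split the TSV; header = true skips the first line,
    // matching the row that the original code slices off
    spark.read
      .option("sep", "\t")
      .option("header", "true")
      .schema(schema)
      .csv(tmp.toString)
  }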
