TIL: Reading Gzip Compressed JSON in Spark

Posted on Jul 25, 2022

Today I needed to read some newline-delimited JSON files that were compressed with gzip. I knew Spark could do this out of the box, since I had done it some years ago.

However, I was surprised that there is no option to tell Spark that your files are compressed. There is a compression option, and Spark doesn't complain when you run spark.read.option("compression", "gzip").json(path), but this option is only meant for writing data.
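To make that concrete, here is a minimal sketch (the app name, paths, and output directory are made up for illustration): the same option is a no-op on read but takes effect on write.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-json-til").getOrCreate()

# On read, the option is accepted without complaint, but it has no
# effect: Spark decides how to decompress purely from the file name.
df = spark.read.option("compression", "gzip").json("some_data.json.gz")

# On write, the same option is honored and produces gzip-compressed
# part files with a .json.gz suffix.
df.write.option("compression", "gzip").json("output_dir")
```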

So how does Spark know?

Spark infers the compression from your filename. However, if your files, like mine, end with .gzip (e.g., some_data.json.gzip), you are out of luck. You will get relatively cryptic error messages about malformed columns without a straightforward way to inspect the problem. Or, even worse, an error about endianness in UCS-4 encoding. Both are rather misleading.
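A quick sketch of both cases, with hypothetical file names and reusing the spark session from above:

```python
# Works: Spark recognizes the .gz suffix and decompresses transparently.
spark.read.json("data/some_data.json.gz").show()

# Breaks: the .gzip suffix is not recognized, so Spark parses the raw
# gzip bytes as if they were text. Depending on the file, every row
# lands in the _corrupt_record column, or the JSON parser errors out
# on the binary bytes.
spark.read.json("data/some_data.json.gzip").show()
```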

The Solution?

Excluding some rather hacky workarounds, the only way is to rename your files to *.json.gz.
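For local files, the rename can be as simple as the sketch below (the data directory is hypothetical); on HDFS or S3 you would do the equivalent with that filesystem's own rename or move API.

```python
from pathlib import Path

# Rename *.json.gzip to *.json.gz so Spark's extension-based codec
# detection kicks in. Path.with_suffix() swaps only the last extension,
# so some_data.json.gzip becomes some_data.json.gz.
for f in Path("data").glob("*.json.gzip"):
    f.rename(f.with_suffix(".gz"))

spark.read.json("data/*.json.gz")  # now decompresses as expected
```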

Interesting reads: