A case for a lower-footprint JSON specification

Jaco Jansen van Vuuren

Software Developer

As human beings we tend not to think about things too much once they work. We all have that one application rotting away on a server somewhere that "just works" - we should probably revisit it every now and again - but we don't - because it's a mental load we just don't need.

I'd argue that we apply the same mindset to our daily tools and established patterns - without ever thinking about them too much either. Recently, while downloading a 4.26 GB CSV file from an Azure Databricks instance, I was reminded of an idea I had a few months ago: JSON can be optimized.

JSON can be optimized?

Yeah - at least - I believe it can. I'm not an expert on the JSON specification or compression by any means - but I still wanted to explore the problem to see if I could bring down the footprint - even if only marginally.

The problem

A lot of what we transfer with APIs today is redundant data - namely, property names in JSON. You might think this is a negligible detail of the JSON specification and that the amount of extra data transferred is minimal - but it adds up over time. As an example of the redundancy - let's take my 4.26 GB CSV file, convert it to JSON and compare the difference in size.

You may be wondering why I am comparing against a CSV file, and the reason is rather simple - CSV does something right in the way it transfers data: property names are only ever sent once.
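
A tiny, made-up example illustrates the point - the column names appear once in the CSV header, while the equivalent JSON repeats every property name for every single row:

id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com

[
  { "id": "1", "name": "Alice", "email": "alice@example.com" },
  { "id": "2", "name": "Bob", "email": "bob@example.com" }
]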

Converting and comparing the difference in size

To convert the CSV file into JSON - I wrote a little utility using Node.js. I was too tired to make it elegant - so I opted to just write out a JSON file for every 100000 lines of CSV.

I can already hear someone saying that the extra pair of square brackets per file is going to taint the result - but at the size of the data - I really don't think it will matter.

const csv = require("csv-parser");
const fs = require("fs");

let results = [];
let i = 0;

fs.createReadStream("large.csv")
  .pipe(csv())
  .on("data", (data) => {
    // Flush a JSON file for every 100000 rows of CSV
    if (i > 0 && i % 100000 === 0) {
      writeResultsToFile();
    }
    results.push(data);
    i++;
  })
  .on("end", () => {
    // Write whatever rows are left over
    writeResultsToFile();
  });

function writeResultsToFile() {
  fs.writeFileSync(`json/${i}.json`, JSON.stringify(results));
  results = [];
}

The difference in size between CSV and JSON

| Type | Size (GB) | Size (MB) | Size Gain (GB) | Size Gain (MB) | Size Gain (%) | Information Gain (%) |
|------|-----------|-----------|----------------|----------------|---------------|----------------------|
| CSV  | 4.26      | 4583.37   | -              | -              | -             | -                    |
| JSON | 7.54      | 8096.81   | +3.28          | +3513.44       | 43.39%        | 0%                   |

As expected - a rather massive jump: roughly 43% of the JSON file is overhead that adds absolutely no extra information. Redundant data - killing polar bears as it travels through our networks.

The solution

We need a JSON specification that removes as much of the redundancy as possible - whilst keeping the ease of use of JSON that we have all come to know and love.

My (probably bad) attempt at solving the problem

Instead of repeating the property names for every object in an array - we can define a map once and use it to look the names back up. This removes the need to send the redundant data over the wire.

Current
[
  {
    "propertyOne": "valueOne",
    "propertyTwo": "valueTwo",
    "propertyThree": "valueThree",
    "propertyFour": {
      "ChildOne": 1,
      "ChildTwo": false,
      "ChildThree": "E"
    }
  },
  {
    "propertyOne": "valueOne",
    "propertyTwo": "valueTwo",
    "propertyThree": "valueThree",
    "propertyFour": {
      "ChildOne": 1,
      "ChildTwo": false,
      "ChildThree": "E"
    }
  },
  {
    "propertyOne": "valueOne",
    "propertyTwo": "valueTwo",
    "propertyThree": "valueThree",
    "propertyFour": {
      "ChildOne": 1,
      "ChildTwo": false,
      "ChildThree": "E"
    }
  }
  ...
]
Proposed

I have dubbed it json-b, and you can read my bad attempt at implementing it here.

//jsonb//
{"1": "propertyOne", "2": "propertyTwo", "3": "propertyThree", "4": "propertyFour", "4.1": "ChildOne", "4.2": "ChildTwo", "4.3": "ChildThree"}
//jsonb//
[
  {
    "1": "valueOne",
    "2": "valueTwo",
    "3": "valueThree",
    "4": {
      "4.1": 1,
      "4.2": false,
      "4.3": "E"
    }
  },
  {
    "1": "valueOne",
    "2": "valueTwo",
    "3": "valueThree",
    "4": {
      "4.1": 1,
      "4.2": false,
      "4.3": "E"
    }
  },
  {
    "1": "valueOne",
    "2": "valueTwo",
    "3": "valueThree",
    "4": {
      "4.1": 1,
      "4.2": false,
      "4.3": "E"
    }
  }
  ...
]

Converting

To compare the results - I wrote a naive (and very bad, and not feature complete, and definitely not close to production ready) implementation of my proposal and applied it to one of the files from the previous CSV -> JSON conversion.

const fs = require("fs");
const JSONB = require("json-b");
// super optimized stuff ;)
const data = JSON.parse(fs.readFileSync("100000.json"));
fs.writeFileSync("100000.jsonb", JSONB.stringify(data));
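
To give a concrete idea of what a stringify like this might do under the hood, here is a rough sketch of the encoding step. It is a simplification for illustration, not the actual json-b code, and it assumes every row shares the shape of the first one:

// A simplified sketch of the encoding idea - not the real json-b library.
// It derives the key map from the first object in the array and then
// replaces every property name with its short numeric id.
function encode(rows) {
  const idForKey = new Map(); // "prefix|originalKey" -> short id
  const header = {};          // short id -> original key (the published map)

  // Assign ids by walking the first row.
  (function assignIds(obj, prefix) {
    let n = 0;
    for (const [key, value] of Object.entries(obj)) {
      n++;
      const id = prefix ? `${prefix}.${n}` : `${n}`;
      idForKey.set(`${prefix}|${key}`, id);
      header[id] = key;
      if (value && typeof value === "object" && !Array.isArray(value)) {
        assignIds(value, id);
      }
    }
  })(rows[0], "");

  // Rewrite every row using the short ids.
  const remap = (obj, prefix) =>
    Object.fromEntries(
      Object.entries(obj).map(([key, value]) => {
        const id = idForKey.get(`${prefix}|${key}`);
        const mapped =
          value && typeof value === "object" && !Array.isArray(value)
            ? remap(value, id)
            : value;
        return [id, mapped];
      })
    );

  const body = rows.map((row) => remap(row, ""));
  return `//jsonb//\n${JSON.stringify(header)}\n//jsonb//\n${JSON.stringify(body)}`;
}

Running a function like this over the three-object example from earlier produces the proposed output shown above: the key map first, then the remapped array.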

The difference in size between JSON and JSON-B

| Type   | Size (MB) | Size Reduction (MB) | Size Reduction (%) | Information Lost (%) |
|--------|-----------|---------------------|--------------------|----------------------|
| JSON   | 68.8      | -                   | -                  | -                    |
| JSON-B | 51.5      | -17.3               | -25.15%            | 0%*                  |

* If you use my json-b implementation on real world data, you'll probably lose information

Yeah - ok. But GZIP fixes the issue, right?

For the most part - it does. But we are trying to squeeze out every byte we can.

| Type          | Size (MB) | Size Reduction (MB) | Size Reduction (%) | Information Lost (%) |
|---------------|-----------|---------------------|--------------------|----------------------|
| JSON (GZIP)   | 8.84      | -                   | -                  | -                    |
| JSON-B (GZIP) | 8.33      | -0.51               | -5.77%             | 0%*                  |
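
For what it's worth, this kind of gzip comparison is easy to reproduce with Node's built-in zlib module (the file names below assume the output of the earlier steps):

const fs = require("fs");
const zlib = require("zlib");

// Compare the gzipped size of the plain JSON file and the json-b file.
for (const file of ["100000.json", "100000.jsonb"]) {
  const gzipped = zlib.gzipSync(fs.readFileSync(file));
  console.log(`${file}: ${(gzipped.length / 1024 / 1024).toFixed(2)} MB gzipped`);
}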

Ok - so how much bandwidth are we talking about saving using json-b and gzip?

In the real world you'll probably not return 100000 rows at a time - so let's use a more realistic example with a small response. We'll also pretend you have a very popular blog that gets 100000 hits per day.

| Type          | Size (KB) | Bandwidth Per Day (KB) | Bandwidth For 30 Days (MB) | Difference Over 30 Days (MB) |
|---------------|-----------|------------------------|----------------------------|------------------------------|
| JSON (GZIP)   | 6.94      | 694000                 | 20820                      | -                            |
| JSON-B (GZIP) | 6.72      | 672000                 | 20160                      | 660                          |
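
The table is just back-of-the-envelope arithmetic - response size times hits per day times 30 days, with 1 MB taken as 1000 KB:

// Rough arithmetic behind the table above (sizes from the gzip comparison).
const hitsPerDay = 100000;
const days = 30;
const monthlyMB = (sizeKB) => (sizeKB * hitsPerDay * days) / 1000;
console.log(Math.round(monthlyMB(6.94) - monthlyMB(6.72))); // 660 MB saved over 30 days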

Ok. So should we seriously consider doing this?

Probably not. The electricity saved by transferring less data might well be outweighed by the extra processing power needed to parse/stringify json-b files. GZIP also does a pretty good job already. Furthermore - in some cases, like where the response is only a single object - json-b might even make things larger, because the key map itself has to be sent.

Still, it is an interesting thought experiment, and I find it fascinating to imagine what the global bandwidth savings could be if we all used a more optimized JSON.