From JSON to Protobuf: a low level tour

Disclaimer: This is not a guide on how to use protobuf. It's just a brief tour on what happens under the hood when using some common encoding formats.

Transferring data between computers is an essential operation in our daily lives, and at the core of this process, information is ultimately transmitted as binary digits, aka 0s and 1s, across networks. Computers rely on binary to communicate and process information, but how is the data we interact with translated into this binary format? The answer lies in the encoding format used.

JSON format

One of the most recognized encoding formats is JSON (JavaScript Object Notation), which is widely used across the internet thanks to its simplicity and its support in most programming languages. Consider the following example, with spaces removed for the sake of simplicity:

{"id":45,"name":"elie"}

If we want to encode each character based on the ASCII table, we will obtain the following:

7B 22 69 64 22 3A 34 35 2C 22 6E 61 6D 65 22 3A 22 65 6C 69 65 22 7D

  1. 7B: The opening curly brace { is represented by the hexadecimal value 7B.
  2. 22: The double quote " is represented by the hexadecimal value 22.
  3. 69: The lowercase letter i is represented by the hexadecimal value 69.
  4. etc.
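
If you want to verify this mapping yourself, here is a quick sketch (assuming Node.js, which provides the Buffer class) that reproduces the hex dump above:

const json = JSON.stringify({ id: 45, name: "elie" }); // -> {"id":45,"name":"elie"}
const bytes = Buffer.from(json, "utf8");               // UTF-8 bytes (identical to ASCII here)
console.log(bytes.toString("hex"));                    // 7b226964223a34352c226e616d65223a22656c6965227d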

As you can see, for the message to be properly reconstructed on the receiving end, we must transmit the entire text: the structural characters such as { } and " ", as well as the name of each field ("id" and "name" in this case).

The result of this encoding process is a significant increase in payload size: in this example, 23 bytes go over the wire even though the values we actually wanted to deliver, 45 and elie, amount to only 6 bytes. This overhead becomes even more substantial when dealing with data representing thousands or even millions of people.

So the question is: Is there a more efficient and faster way to transmit data? Something that reduces the metadata we need to transfer in order to deliver our intended payload? The answer is yes! [otherwise I wouldn't be writing this article]

Protocol Buffers aka Protobuf

Protobuf was created at Google in 2001. They were looking for an internal solution that was faster and more efficient than the XML they were using at the time.

Protocol Buffers is a language-agnostic binary format, which makes it efficient and versatile across different programming languages. The main idea behind protobuf is to reduce the payload size by removing the need for special characters and field names, which we saw inflate the JSON encoding. This is done by using a schema that defines the structure of the message; both the sender and the receiver need access to this schema in order to encode and decode the message.

Enough gibberish, let's see an example using the same data we wanted to transfer with JSON:

{"id":45,"name":"elie"}

The 3 general steps to be followed when working with protobuf are:

  1. Define the message structure using the protobuf schema definition language in a .proto file.
  2. Use the protobuf compiler (protoc) to generate language-specific code for serialization and deserialization.
  3. Include the generated code in your application and use it to serialize and deserialize your data.

The first two steps happen once, when setting up your project; the third happens every time you send or receive data.

1. Defining the schema in a .proto file

In protobuf, the first step is to define a schema that determines the structure of the messages being sent. This schema lives in a file with the .proto extension.

Let's call it: userSchema.proto:

syntax = "proto3";
message User {
    int32 id = 1;
    string name = 2;
}

The schema above defines a User message with two fields: an id field of type int32 and a name field of type string. The numbers 1 and 2 are field numbers, which are crucial for the encoding and decoding process. They uniquely identify each field, which lets us avoid including the field names "id" and "name" when sending the message.

If you have other fields, such as email, password, username, etc., you can give them numbers in sequential order: 3, 4, 5, and so on, as in the sketch below.
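
For example, a hypothetical extension of our schema with such fields might look like this:

syntax = "proto3";
message User {
    int32 id = 1;
    string name = 2;
    string email = 3;    // hypothetical extra fields,
    string password = 4; // numbered sequentially
    string username = 5;
}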

2. Using the compiler protoc

As mentioned earlier, protobuf is language-neutral, which means regardless of what language you're coding with, all you have to do is use the compiler (protoc) to generate code for the target programming language from the .proto file. This will automatically create the necessary classes, types, and methods for working with the serialized data.

You can get the compiler from this page: https://protobuf.dev/downloads/

For example, if you want to generate code for Python, you would use the following command:

protoc --python_out=. userSchema.proto
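
This generates a userSchema_pb2.py file (the Python generator appends _pb2 to the output name), exposing a User class you can import in your application.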

For TypeScript (note that protoc has no built-in TypeScript generator, so this assumes a third-party plugin such as protoc-gen-ts is installed):

protoc --ts_out=. userSchema.proto

3. Encode data to be transferred

Assuming we are working in TypeScript, we can create a new instance of User, serialize it, and write it to a file to be sent over the wire:

import * as fs from "fs";
import { User } from "./userSchema_pb";

const elie = new User();
elie.setId(45);
elie.setName("elie");
const data = elie.serializeBinary(); // returns a Uint8Array
fs.writeFileSync("protobufData", data);
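
On the receiving end, the generated code can decode the bytes back into a User object. A minimal sketch, assuming the same google-protobuf-style generated module as above:

import * as fs from "fs";
import { User } from "./userSchema_pb"; // same generated code as on the sender side

const raw = fs.readFileSync("protobufData");
const user = User.deserializeBinary(raw); // static decode method on the message class
console.log(user.getId());   // 45
console.log(user.getName()); // "elie"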

Now here is what's happening inside the data file:

1: 45
2: 'elie'

As mentioned above, in the .proto file we gave id the field number 1 and name the field number 2. So now, instead of saying "id": 45 and "name": "elie", it's enough to use the field number corresponding to each key: 1 for id and 2 for name.

If you've been following along since the beginning of this article, you should already see the advantage of this approach over the JSON format. In addition to reducing field names to numbers, we also got rid of special characters such as the curly braces and the quotation marks.

The binary representation of the above message would be:

08 2D 12 04 65 6C 69 65

Compared to the following in JSON format:

7B 22 69 64 22 3A 34 35 2C 22 6E 61 6D 65 22 3A 22 65 6C 69 65 22 7D

For curious and sharp readers: if you noticed that the first byte is 0x08, which is not the hexadecimal representation of 1, then good job! That's because each field's tag byte encodes the wire type alongside the field number. More about this here: https://protobuf.dev/programming-guides/encoding/
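
Concretely, each tag byte is computed as (field_number << 3) | wire_type, so our 8 bytes break down as follows:

08          -> (1 << 3) | 0 = field 1, wire type 0 (varint)
2D          -> 45, the value of id
12          -> (2 << 3) | 2 = field 2, wire type 2 (length-delimited)
04          -> the length of the string: 4 bytes
65 6C 69 65 -> "elie" in ASCII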

Last point: Varint

One last technical difference that I will mention briefly is that protobuf uses varints for its integers. Varint means "variable-length integer". Suppose you are coding in a language that reserves 4 bytes for an int regardless of the integer's value: if you send the number 4, you end up sending 4 bytes to represent it. With protobuf, however, the number of bytes is dynamic, which means that if the number we are sending fits in 1 byte, then only 1 byte is sent (this is an oversimplification of how it really works, but it's a good mental model; the toy encoder below shows the mechanics). I hope you can already see the advantage of this technique in some cases.
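
To make this concrete, here is a toy varint encoder, for illustration only (real protobuf libraries handle this for you). Each output byte carries 7 bits of the number, and the high bit signals whether more bytes follow:

function encodeVarint(n: number): number[] {
  const out: number[] = [];
  do {
    let byte = n & 0x7f;        // take the lowest 7 bits
    n >>>= 7;                   // shift off the bits we consumed
    if (n !== 0) byte |= 0x80;  // set the continuation bit if more bytes follow
    out.push(byte);
  } while (n !== 0);
  return out;
}

console.log(encodeVarint(45).map(b => b.toString(16)));  // [ '2d' ]      -> 1 byte
console.log(encodeVarint(300).map(b => b.toString(16))); // [ 'ac', '2' ] -> 2 bytes

Notice that 45 fits in a single byte (0x2D, the very byte we saw in the encoded message above), while 300 needs two.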

Summary

There is much more to discuss when it comes to the advantages and disadvantages of different encoding formats. Protobuf has its own disadvantages in some situations as well, and no solution is the best for every situation. The appropriate choice requires a case-by-case study, depending on the specific needs and circumstances of each project. JSON is the most popular, but Protobuf is also widely used, with millions of downloads (check the npm download stats: https://www.npmjs.com/package/protobufjs). While 12M weekly downloads is not a small number, it doesn't necessarily mean that all use cases are justified and correct. It's up to you to delve deep into the details of how these technologies work, weigh their pros and cons, and make an informed decision. The primary aim of this article was to emphasize the importance and the beauty of understanding how things work under the hood! Some people memorize that X is faster than Z, while others understand why X is faster than Z. Strive to be among the latter! Happy coding :]