Data Serialization Formats: JSON vs MessagePack vs Protocol Buffers vs Avro vs Parquet

Data · 2026-02-09 · 7 min read · json · protobuf · messagepack · avro · serialization

Every application serializes data. APIs encode responses, services pass messages, databases store records, and pipelines transform datasets. JSON is the default choice, but it's not always the best one. Depending on your use case, a binary format can cut payload sizes by 60-80% and make serialization several times faster.

This guide compares the five most practical serialization formats, with real benchmarks, code examples, and clear guidance on when each one makes sense.

Quick Comparison

Feature | JSON | MessagePack | Protocol Buffers | Avro | Parquet
Format | Text | Binary | Binary | Binary | Binary (columnar)
Schema required | No | No | Yes (.proto) | Yes (.avsc) | Yes (embedded)
Human-readable | Yes | No | No | No | No
Self-describing | Yes | Yes | No | Yes (with header) | Yes (with footer)
Language support | Universal | Very broad | Very broad | Broad (best in JVM) | Broad
Typical size vs JSON | 1x | 0.5-0.7x | 0.3-0.5x | 0.3-0.5x | 0.1-0.3x (columnar)
Schema evolution | N/A | N/A | Good | Excellent | Good
Best for | APIs, config, debugging | Drop-in JSON replacement | RPCs, microservices | Data pipelines, Kafka | Analytics, data lakes

JSON

JSON is human-readable, universally supported, and good enough for most use cases. Its weaknesses are verbosity (field names repeated in every record), lack of a native binary type (base64 encoding is a workaround), and no built-in schema.

// JSON serialization — nothing to install
const data = {
  userId: 12345,
  name: "Alice",
  email: "alice@example.com",
  roles: ["admin", "editor"],
  createdAt: "2026-01-15T10:30:00Z",
};

const encoded = JSON.stringify(data);    // 128 bytes
const decoded = JSON.parse(encoded);
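
The lack of a binary type shows up as soon as you need to ship raw bytes. A minimal sketch of the base64 workaround (assuming Node.js Buffers) and the roughly one-third size overhead it adds:

// Hypothetical example: embedding binary data in JSON via base64
const avatar = new Uint8Array(1024).fill(0xab);        // 1,024 raw bytes

const json = JSON.stringify({
  fileName: "avatar.png",
  data: Buffer.from(avatar).toString("base64"),        // 1,368 characters after encoding
});

console.log(json.length);  // ~1,400 bytes to carry 1,024 bytes of payload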

When JSON Is the Right Choice

- Public and third-party APIs, where universal tooling and readability matter more than bytes
- Configuration files and anything a human edits or reads directly
- Debugging and logging, where you want to inspect payloads without extra tools
- Low-volume traffic and small payloads, where size and parse speed are not bottlenecks

When to Move Beyond JSON

- Payload size measurably affects bandwidth or latency (mobile clients, chatty internal services)
- Serialization or parsing shows up in CPU profiles
- You need enforced schemas and typed contracts between services
- You're storing or streaming large volumes of structured records

MessagePack

MessagePack is "JSON but binary." It has the same data model as JSON (maps, arrays, strings, numbers, booleans, null) but encodes it more compactly. No schema required. It's a drop-in optimization for any JSON workload.

import { encode, decode } from "@msgpack/msgpack";

const data = {
  userId: 12345,
  name: "Alice",
  email: "alice@example.com",
  roles: ["admin", "editor"],
  createdAt: "2026-01-15T10:30:00Z",
};

const encoded = encode(data);   // ~85 bytes (vs 128 for JSON)
const decoded = decode(encoded);

Size Savings

MessagePack achieves 30-50% size reduction over JSON by:

- Encoding small integers in a single byte instead of as ASCII digits
- Prefixing strings and field names with a compact length header instead of wrapping them in quotes
- Dropping structural punctuation (braces, brackets, colons, commas) in favor of type/length headers
- Supporting a native binary type, so raw bytes don't need base64 (see the sketch below)
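
Because MessagePack has a native bin type, the binary-data workaround from the JSON section disappears. A minimal sketch using the same @msgpack/msgpack package:

import { encode, decode } from "@msgpack/msgpack";

const avatar = new Uint8Array(1024).fill(0xab);

// Uint8Array values are encoded as MessagePack's bin format: a small header plus the raw bytes
const encoded = encode({ fileName: "avatar.png", data: avatar });
console.log(encoded.byteLength);  // ~1,050 bytes, versus ~1,400 for base64-in-JSON

const decoded = decode(encoded) as { fileName: string; data: Uint8Array };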

When to Pick MessagePack

- Internal APIs and service-to-service calls where you control both ends
- Cache values in Redis or Memcached, where smaller entries mean more data per node (see the Redis example below)
- Real-time and WebSocket messages where payload size adds up quickly
- Any JSON workload you want to shrink without introducing schemas

// Redis caching with MessagePack instead of JSON
import { encode, decode } from "@msgpack/msgpack";
import Redis from "ioredis";

const redis = new Redis();

async function cacheSet(key: string, value: any, ttl: number) {
  await redis.setex(key, ttl, Buffer.from(encode(value)));
}

async function cacheGet<T>(key: string): Promise<T | null> {
  const buf = await redis.getBuffer(key);
  return buf ? decode(buf) as T : null;
}
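
Call sites look exactly like a JSON-backed cache; only the encoding changed. A hypothetical usage, inside an async function:

interface CachedUser {
  userId: number;
  name: string;
  email: string;
}

await cacheSet("user:12345", { userId: 12345, name: "Alice", email: "alice@example.com" }, 300);

const cached = await cacheGet<CachedUser>("user:12345");
if (cached) {
  console.log(cached.name);  // "Alice"
}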

Protocol Buffers (Protobuf)

Protocol Buffers are Google's schema-driven serialization format. You define your data structure in a .proto file, generate code for your target language, and get compact binary encoding with strong typing and backward-compatible schema evolution.

Defining a Schema

// user.proto
syntax = "proto3";

package myapp;

import "google/protobuf/timestamp.proto";

message User {
  int32 user_id = 1;
  string name = 2;
  string email = 3;
  repeated string roles = 4;
  google.protobuf.Timestamp created_at = 5;

  enum Status {
    ACTIVE = 0;
    SUSPENDED = 1;
    DELETED = 2;
  }
  Status status = 6;
}

message UserList {
  repeated User users = 1;
  int32 total_count = 2;
}

Code Generation and Usage

# Install protoc compiler
# macOS
brew install protobuf

# Generate TypeScript code (using ts-proto)
protoc --plugin=./node_modules/.bin/protoc-gen-ts_proto \
  --ts_proto_out=./src/generated \
  --ts_proto_opt=outputServices=false \
  ./proto/user.proto

import { User, User_Status } from "./generated/user";

// Encode
const user = User.create({
  userId: 12345,
  name: "Alice",
  email: "alice@example.com",
  roles: ["admin", "editor"],
  status: User_Status.ACTIVE,
});

const bytes = User.encode(user).finish();  // ~45 bytes (vs 128 for JSON)
const decoded = User.decode(bytes);

Schema Evolution Rules

Protobuf handles schema changes gracefully if you follow these rules:

- Never change the number of an existing field; the number, not the name, is what goes on the wire
- Never reuse a field number after removing a field; mark it reserved instead
- Add new fields with fresh numbers; old readers skip fields they don't know about
- Don't change a field's type in place; add a new field and migrate
- Rely on proto3 defaults: readers see zero values ("", 0, empty list) for fields that are absent

The example below shows the reserved keyword protecting removed field numbers and names:

message User {
  reserved 7, 8;           // Don't reuse these field numbers
  reserved "phone_number"; // Document removed field names

  int32 user_id = 1;
  string name = 2;
  string email = 3;
  repeated string roles = 4;
  string avatar_url = 9;   // New field — safe to add
}
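
To make the compatibility guarantees concrete, here's a hypothetical sketch that assumes you've kept generated code from both schema versions (a UserV1 generated before avatar_url existed, and a UserV2 generated after it was added as field 9):

import { User as UserV1 } from "./generated/v1/user";  // hypothetical: generated from the old schema
import { User as UserV2 } from "./generated/v2/user";  // hypothetical: generated from the new schema

// Bytes written by an old producer decode cleanly with the new schema;
// the missing field comes back as its proto3 default
const oldBytes = UserV1.encode(UserV1.create({ userId: 12345, name: "Alice" })).finish();
console.log(UserV2.decode(oldBytes).avatarUrl);  // "" (empty string, not an error)

// Bytes written by a new producer decode with the old schema too;
// the unknown field number 9 is simply skipped
const newBytes = UserV2.encode(
  UserV2.create({ userId: 12345, name: "Alice", avatarUrl: "https://example.com/a.png" })
).finish();
console.log(UserV1.decode(newBytes).name);  // "Alice"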

When to Pick Protobuf

- gRPC and other internal RPC APIs, where it's the de facto standard
- Service-to-service communication at scale, where compact payloads and fast encode/decode pay off
- Cross-team, cross-language contracts that need strong typing and controlled schema evolution

Avro

Avro is an Apache project designed for data-intensive applications. Its key differentiator is that the schema is embedded in the data file (or stored in a schema registry), making data self-describing. This is critical for data pipelines where producers and consumers evolve independently.

Schema Definition

{
  "type": "record",
  "name": "User",
  "namespace": "com.myapp",
  "fields": [
    { "name": "userId", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "roles", "type": { "type": "array", "items": "string" } },
    {
      "name": "status",
      "type": { "type": "enum", "name": "Status", "symbols": ["ACTIVE", "SUSPENDED", "DELETED"] },
      "default": "ACTIVE"
    },
    {
      "name": "avatarUrl",
      "type": ["null", "string"],
      "default": null
    }
  ]
}
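
Outside of Kafka, encoding and decoding locally is a short exercise. A minimal round trip using the avsc package (an assumption; any Avro library works), with the schema above saved as user.avsc:

import { readFileSync } from "node:fs";
import avro from "avsc";

// Build an Avro type from the schema above (assumed to be saved as user.avsc)
const userType = avro.Type.forSchema(JSON.parse(readFileSync("user.avsc", "utf8")));

const buf = userType.toBuffer({
  userId: 12345,
  name: "Alice",
  email: "alice@example.com",
  roles: ["admin", "editor"],
  status: "ACTIVE",
  avatarUrl: null,
});

const user = userType.fromBuffer(buf);  // decode back into a plain object
console.log(user.name);                 // "Alice"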

Usage with Kafka

Avro shines in Kafka ecosystems. The Confluent Schema Registry stores schemas and ensures producers and consumers are compatible.

import { SchemaRegistry } from "@kafkajs/confluent-schema-registry";
import { Kafka } from "kafkajs";

const registry = new SchemaRegistry({ host: "http://localhost:8081" });
const kafka = new Kafka({ brokers: ["localhost:9092"] });

// Producer
async function produce() {
  const producer = kafka.producer();
  await producer.connect();

  const schemaId = await registry.getLatestSchemaId("user-value");

  const encoded = await registry.encode(schemaId, {
    userId: 12345,
    name: "Alice",
    email: "alice@example.com",
    roles: ["admin"],
    status: "ACTIVE",
    avatarUrl: null,
  });

  await producer.send({
    topic: "users",
    messages: [{ key: "12345", value: encoded }],
  });
}

// Consumer — schema is resolved automatically
async function consume() {
  const consumer = kafka.consumer({ groupId: "user-service" });
  await consumer.connect();
  await consumer.subscribe({ topic: "users" });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const user = await registry.decode(message.value);
      console.log(user.name); // Fully typed with schema
    },
  });
}
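
One step the snippet above assumes has already happened: the schema must be registered under the subject the producer looks up (user-value here). A minimal sketch using the same @kafkajs/confluent-schema-registry client:

import { readFileSync } from "node:fs";
import { SchemaRegistry, SchemaType } from "@kafkajs/confluent-schema-registry";

const registry = new SchemaRegistry({ host: "http://localhost:8081" });

// Register the Avro schema (or a new version of it) under the user-value subject.
// The returned id is what registry.encode() stamps into each message.
async function registerUserSchema(): Promise<number> {
  const schema = readFileSync("user.avsc", "utf8");
  const { id } = await registry.register(
    { type: SchemaType.AVRO, schema },
    { subject: "user-value" },
  );
  return id;
}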

When to Pick Avro

- Kafka-based pipelines, where the Confluent Schema Registry manages compatibility between producers and consumers
- Data pipelines where producers and consumers deploy and evolve independently
- Long-lived data files that must stay readable as schemas change, since the writer's schema travels with the data

Parquet

Parquet is a columnar storage format designed for analytics. Instead of storing data row by row, it stores each column together, which enables massive compression and lets query engines read only the columns they need.

Parquet isn't a serialization format for APIs or RPCs. It's a file format for storing and querying large datasets.

When Parquet Wins

Consider a table with 100 columns and 1 billion rows. A query that only needs 3 columns reads ~3% of the data in Parquet, versus 100% in a row-oriented format like JSON or Avro.

# Writing Parquet with PyArrow
import random
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": range(1_000_000),
    "name": [f"user_{i}" for i in range(1_000_000)],
    "signup_date": [datetime.now()] * 1_000_000,
    "lifetime_value": [random.uniform(0, 1000) for _ in range(1_000_000)],
})

# Write with snappy compression (default)
pq.write_table(table, "users.parquet")
# Result: ~8 MB (vs ~90 MB as JSON)

# Reading only specific columns
table = pq.read_table("users.parquet", columns=["user_id", "lifetime_value"])
# Only reads the two requested columns from disk

When to Pick Parquet

- Data lakes and warehouse storage queried by columnar engines such as Spark, Trino, or DuckDB
- Analytical workloads that touch a few columns out of many, where column pruning avoids reading most of the file
- Cold storage of large structured datasets, where columnar compression keeps costs down

Benchmark Summary

Rough benchmarks for serializing/deserializing a typical API payload (1 KB JSON equivalent, 10,000 iterations, Node.js):

Format | Encoded Size | Serialize (ops/s) | Deserialize (ops/s)
JSON | 1,000 B | 500,000 | 400,000
MessagePack | 650 B | 600,000 | 500,000
Protobuf | 380 B | 1,200,000 | 1,500,000
Avro | 400 B | 800,000 | 900,000

These numbers vary significantly by payload shape, language, and library. Always benchmark with your actual data.
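
If you want to reproduce this against your own payloads, a minimal harness comparing JSON and MessagePack in Node.js might look like the sketch below (illustrative only; a serious benchmark should account for warm-up and JIT effects):

import { encode, decode } from "@msgpack/msgpack";

// Swap in a payload that is representative of your real traffic
const payload = {
  userId: 12345,
  name: "Alice",
  email: "alice@example.com",
  roles: ["admin", "editor"],
  createdAt: "2026-01-15T10:30:00Z",
};

function bench(label: string, fn: () => unknown, iterations = 10_000) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  console.log(`${label}: ${Math.round(iterations / seconds).toLocaleString()} ops/s`);
}

console.log("JSON size:", JSON.stringify(payload).length, "B");
console.log("MessagePack size:", encode(payload).byteLength, "B");

const json = JSON.stringify(payload);
const packed = encode(payload);

bench("JSON.stringify", () => JSON.stringify(payload));
bench("msgpack encode", () => encode(payload));
bench("JSON.parse", () => JSON.parse(json));
bench("msgpack decode", () => decode(packed));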

Decision Framework

Use JSON when human readability matters, you're building public APIs, or payload size isn't a concern. It's the safe default.

Use MessagePack when you want smaller payloads without adding schemas. It's the lowest-effort optimization for internal APIs and caching.

Use Protobuf when you're building gRPC services, need strong typing, or want compact encoding with well-defined schema evolution. It's the standard for service-to-service communication at scale.

Use Avro when you're building Kafka-based data pipelines and need a schema registry for compatibility management.

Use Parquet when you're storing large analytical datasets and querying them with columnar engines.

Most applications should start with JSON and only switch to a binary format when they have a concrete reason: measurable latency improvement, bandwidth savings, or schema enforcement needs. Premature optimization toward binary formats adds complexity without proportional benefit.