
📦 Serialization Deep Dive Performance Critical

The Cost of Data: why sizeof(struct) does not work across the network, and why Protobuf serializes 21x faster than JSON in this chapter's benchmark.

Why not send the struct directly?

Many new programmers think: "I have a 24-byte struct, so I will just send those 24 bytes over the socket."

cpp
// ❌ DANGEROUS - don't do this!
struct LoginRequest {
    int32_t user_id;      // 4 bytes
    char username[16];    // 16 bytes
    int32_t flags;        // 4 bytes
};                        // sizeof = 24 bytes?

// Send the in-memory bytes straight over the socket
LoginRequest request{};
send(socket, &request, sizeof(request), 0);

Problem #1: Endianness

┌─────────────────────────────────────────────────────────────────────────┐
│                    ENDIANNESS HELL                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   int32_t value = 0x12345678;                                           │
│                                                                         │
│   Little-Endian (Intel x86):     0x78 0x56 0x34 0x12                    │
│   Big-Endian (network order):    0x12 0x34 0x56 0x78                    │
│                                                                         │
│   LE → BE host:  0x12345678  becomes  0x78563412  (WRONG!)              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
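One common way around the byte-order problem is to never copy an integer's memory onto the wire, and instead write its bytes in a fixed order using shifts. The following is a minimal sketch, assuming big-endian (network) order on the wire; `put_u32_be` and `get_u32_be` are hypothetical helper names, not part of any library.

```cpp
#include <array>
#include <cstdint>

// Write a 32-bit value in big-endian (network) order.
// Shifts operate on the value, so the result is the same on any host.
std::array<std::uint8_t, 4> put_u32_be(std::uint32_t v) {
    return {
        static_cast<std::uint8_t>(v >> 24),
        static_cast<std::uint8_t>(v >> 16),
        static_cast<std::uint8_t>(v >> 8),
        static_cast<std::uint8_t>(v),
    };
}

// Rebuild the value from four big-endian bytes.
std::uint32_t get_u32_be(const std::array<std::uint8_t, 4>& b) {
    return (std::uint32_t{b[0]} << 24) | (std::uint32_t{b[1]} << 16) |
           (std::uint32_t{b[2]} << 8)  |  std::uint32_t{b[3]};
}
```

On POSIX systems `htonl` and `ntohl` perform the same conversion for 32-bit values.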

Problem #2: Struct Padding

cpp
struct Example {
    char a;        // 1 byte
    // 3 bytes padding (alignment)
    int32_t b;     // 4 bytes
    char c;        // 1 byte
    // 3 bytes padding
};
// sizeof = 12 bytes on typical ABIs, NOT 6 bytes!

┌─────────────────────────────────────────────────────────────────────────┐
│                    STRUCT PADDING VISUALIZATION                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Memory Layout:                                                        │
│   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐                    │
│   │ a │PAD│PAD│PAD│ b │ b │ b │ b │ c │PAD│PAD│PAD│                    │
│   └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘                    │
│     0   1   2   3   4   5   6   7   8   9  10  11                       │
│                                                                         │
│   Typical ABI:        3 bytes of padding after 'a' and after 'c'        │
│   #pragma pack, -fpack-struct, other ABIs: different layout             │
│   → Struct layout is NOT PORTABLE                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
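To keep padding off the wire, one option is to serialize field by field into a byte buffer. The sketch below reuses the `Example` struct from above; `serialize_example` is a hypothetical helper, the big-endian order for `b` is an arbitrary choice, and the sizes in the comments assume a typical ABI where `int32_t` is 4-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Example {
    char a;
    std::int32_t b;
    char c;
};

// Emit exactly 6 bytes: a, then b in big-endian order, then c.
// Padding bytes are never written because each field is copied individually.
std::vector<std::uint8_t> serialize_example(const Example& e) {
    const auto b = static_cast<std::uint32_t>(e.b);
    return {
        static_cast<std::uint8_t>(e.a),
        static_cast<std::uint8_t>(b >> 24),
        static_cast<std::uint8_t>(b >> 16),
        static_cast<std::uint8_t>(b >> 8),
        static_cast<std::uint8_t>(b),
        static_cast<std::uint8_t>(e.c),
    };
}

// Padding is visible through sizeof/offsetof (values depend on the ABI).
constexpr std::size_t kInMemorySize = sizeof(Example);       // typically 12
constexpr std::size_t kOffsetOfB    = offsetof(Example, b);  // typically 4
```

The wire size is now defined by the code (6 bytes) instead of by whatever the compiler decides for the in-memory layout.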

Problem #3: Versioning

cpp
// Version 1
struct LoginRequest_v1 {
    int32_t user_id;
    char username[16];
};

// Version 2 - adds a field
struct LoginRequest_v2 {
    int32_t user_id;
    char username[16];
    char email[32];      // NEW!
};

// A v2 server expects sizeof(LoginRequest_v2) (typically 52 bytes) but a v1 client sends 20
// → short read, garbage in email, or a CRASH!
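The failure is visible from sizes alone: a v2 reader expects more bytes than a v1 writer ever sends. The sketch below assumes both structs have no padding on the target ABI (true for these fields on common platforms); `can_read_as_v2` is a hypothetical check that shows the limitation, not a fix.

```cpp
#include <cstddef>
#include <cstdint>

struct LoginRequest_v1 { std::int32_t user_id; char username[16]; };
struct LoginRequest_v2 { std::int32_t user_id; char username[16]; char email[32]; };

// A raw-struct reader can only compare byte counts; there are no field tags
// on the wire to tell it which fields the sender actually included.
constexpr bool can_read_as_v2(std::size_t bytes_received) {
    return bytes_received >= sizeof(LoginRequest_v2);
}

static_assert(sizeof(LoginRequest_v1) == 20, "assumes no padding on this ABI");
static_assert(sizeof(LoginRequest_v2) == 52, "assumes no padding on this ABI");
```

A 20-byte v1 payload fails the check, so the v2 server must either reject it or keep waiting for 32 bytes that never arrive.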

Solution: Serialization Formats

Comparison Table

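Format        Size        Speed      Human readable  Schema                  Versioning
────────────  ──────────  ─────────  ──────────────  ──────────────────────  ──────────
Raw struct    Smallest    Fastest    No              None                    None
JSON          Largest     Slowest    Yes             Optional (JSON Schema)  Manual
XML           Very large  Very slow  Yes             Yes (XSD)               Manual
Protobuf      Small       Fast       No              Required (.proto)       Built in
FlatBuffers   Smallest    Fastest    No              Required (.fbs)         Built in
MessagePack   Small       Fast       No              None                    Manual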

🎯 HPN RECOMMENDATION

  • External API (public-facing): JSON (for compatibility)
  • Internal microservices: Protobuf (for performance)
  • Extreme low-latency (HFT): FlatBuffers or custom binary

Protocol Buffers (Protobuf)

What is Protobuf?

Protobuf is a binary serialization format developed by Google, which has used it internally since 2001.

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROTOBUF WORKFLOW                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   1. Define Schema (.proto file)                                        │
│      ↓                                                                  │
│   2. protoc compiler generates C++/Python/Go/... code                   │
│      ↓                                                                  │
│   3. Use generated classes in your application                          │
│      ↓                                                                  │
│   4. Serialize to binary → Send over network → Deserialize              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Installation

bash
# Ubuntu/Debian
sudo apt install protobuf-compiler libprotobuf-dev

# macOS
brew install protobuf

# Via Conan 1.x (with Conan 2.x: conan install --requires=protobuf/3.21.12)
conan install protobuf/3.21.12@

Lab: LoginRequest — Protobuf vs JSON

Step 1: Define .proto file

protobuf
// auth.proto
syntax = "proto3";

package hpn.auth;

message LoginRequest {
    string username = 1;        // Field number 1
    string password = 2;        // Field number 2
    optional string mfa_token = 3;  // Optional field (explicit presence, protoc 3.15+)
}

message LoginResponse {
    enum Status {
        SUCCESS = 0;
        INVALID_CREDENTIALS = 1;
        MFA_REQUIRED = 2;
        ACCOUNT_LOCKED = 3;
    }
    
    Status status = 1;
    string access_token = 2;
    int64 expires_at = 3;       // Unix timestamp
    string error_message = 4;
}

Step 2: Compile to C++

bash
# Generate C++ code
protoc --cpp_out=. auth.proto

# Output:
# auth.pb.h   - Header file
# auth.pb.cc  - Implementation

Step 3: CMake Integration

cmake
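# CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(auth_server CXX)

find_package(Protobuf REQUIRED)

# Generate auth.pb.cc / auth.pb.h into the build directory
protobuf_generate_cpp(PROTO_SRCS PROTO_HDRS auth.proto)

add_executable(auth_server
    main.cpp
    ${PROTO_SRCS}
    ${PROTO_HDRS}
)

# Generated headers live in the build directory, so add it to the include path
target_include_directories(auth_server PRIVATE
    ${CMAKE_CURRENT_BINARY_DIR}
)

target_link_libraries(auth_server PRIVATE
    protobuf::libprotobuf
)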

Step 4: Use in C++

cpp
// main.cpp
#include "auth.pb.h"
#include <iostream>
#include <string>

int main() {
    // Create message
    hpn::auth::LoginRequest request;
    request.set_username("hpn_user");
    request.set_password("secure_password_123");
    
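    // Serialize to binary
    std::string binary_data;
    if (!request.SerializeToString(&binary_data)) {
        std::cerr << "Failed to serialize LoginRequest\n";
        return 1;
    }
    
    std::cout << "Protobuf size: " << binary_data.size() << " bytes\n";
    // Output: Protobuf size: 31 bytes (2 + 8 for username, 2 + 19 for password)
    
    // Deserialize; ParseFromString returns false on malformed input
    hpn::auth::LoginRequest parsed;
    if (!parsed.ParseFromString(binary_data)) {
        std::cerr << "Failed to parse LoginRequest\n";
        return 1;
    }
    
    std::cout << "Username: " << parsed.username() << "\n";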
    
    return 0;
}

Size Comparison: Protobuf vs JSON

cpp
#include <nlohmann/json.hpp>
#include "auth.pb.h"
#include <chrono>
#include <iostream>
#include <string>

void benchmark() {
    // === JSON ===
    nlohmann::json json_request = {
        {"username", "hpn_user"},
        {"password", "secure_password_123"}
    };
    std::string json_str = json_request.dump();
    std::cout << "JSON size: " << json_str.size() << " bytes\n";
    // Output: JSON size: 56 bytes
    
    // === Protobuf ===
    hpn::auth::LoginRequest proto_request;
    proto_request.set_username("hpn_user");
    proto_request.set_password("secure_password_123");
    
    std::string proto_str;
    proto_request.SerializeToString(&proto_str);
    std::cout << "Protobuf size: " << proto_str.size() << " bytes\n";
    // Output: Protobuf size: 31 bytes
    
    // === Speed Test ===
    constexpr int ITERATIONS = 100000;
    
    // JSON serialization
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < ITERATIONS; ++i) {
        std::string s = json_request.dump();
    }
    auto json_time = std::chrono::high_resolution_clock::now() - start;
    
    // Protobuf serialization
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < ITERATIONS; ++i) {
        std::string s;
        proto_request.SerializeToString(&s);
    }
    auto proto_time = std::chrono::high_resolution_clock::now() - start;
    
    std::cout << "JSON time: " 
              << std::chrono::duration_cast<std::chrono::milliseconds>(json_time).count() 
              << " ms\n";
    std::cout << "Protobuf time: " 
              << std::chrono::duration_cast<std::chrono::milliseconds>(proto_time).count() 
              << " ms\n";
}

Benchmark Results

┌─────────────────────────────────────────────────────────────────────────┐
│                    BENCHMARK: 100K SERIALIZATIONS                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Metric              JSON            Protobuf        Improvement       │
│   ──────────────────  ──────────────  ──────────────  ───────────────   │
│   Size                56 bytes        31 bytes        1.81x smaller     │
│   Serialize time      127 ms          6 ms            21x faster        │
│   Deserialize time    142 ms          5 ms            28x faster        │
│   Total time          269 ms          11 ms           24x faster        │
│                                                                         │
│   Per 1M round trips: 2.69 s CPU      0.11 s CPU      24x less CPU      │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
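To turn the totals into a per-request cost, divide each total by the 100,000 iterations, then multiply by a request rate to get CPU-seconds consumed per wall-clock second. The sketch below uses the timings reported in the table, which are specific to the machine they were measured on; the function and constant names are illustrative.

```cpp
// Convert a total time for `iterations` operations into microseconds per operation.
constexpr double us_per_op(double total_ms, int iterations) {
    return total_ms * 1000.0 / iterations;
}

// CPU-seconds needed per second of wall-clock time at a given request rate.
constexpr double cpu_seconds_per_second(double us_per_request, double requests_per_second) {
    return us_per_request * requests_per_second / 1e6;
}

// Serialize + deserialize cost per message, from the "Total time" row.
constexpr double kJsonUs  = us_per_op(269.0, 100000);  // about 2.69 µs
constexpr double kProtoUs = us_per_op(11.0, 100000);   // about 0.11 µs
```

At 1M requests per second this works out to roughly 2.69 CPU-seconds per second for JSON (close to three fully busy cores) versus about 0.11 for Protobuf.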

Protobuf Wire Format

Field Encoding

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROTOBUF WIRE FORMAT                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Each field encoded as: [Tag][Length (if needed)][Value]               │
│                                                                         │
│   Tag = (field_number << 3) | wire_type                                 │
│                                                                         │
│   Wire Types:                                                           │
│   ───────────                                                           │
│   0 = Varint (int32, int64, bool, enum)                                 │
│   1 = 64-bit (fixed64, double)                                          │
│   2 = Length-delimited (string, bytes, embedded messages)               │
│   5 = 32-bit (fixed32, float)                                           │
│                                                                         │
│   Example: username = "hpn" (field 1, string)                           │
│   ─────────────────────────────────                                     │
│   0x0A          = Tag (field 1, wire type 2)                            │
│   0x03          = Length (3 bytes)                                      │
│   0x68 0x70 0x6E = "hpn" in UTF-8                                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
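The tag, length, and value layout can be reproduced by hand for short strings. This is a hand-rolled sketch for illustration, not the code that protoc generates; `make_tag_byte` and `encode_short_string` are hypothetical names, and the helper only covers the case where tag and length each fit in one byte.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Tag byte for small field numbers: (field_number << 3) | wire_type.
constexpr std::uint8_t make_tag_byte(std::uint32_t field_number, std::uint32_t wire_type) {
    return static_cast<std::uint8_t>((field_number << 3) | wire_type);
}

// Encode a string field whose tag and length each fit in one byte
// (field number <= 15, value shorter than 128 bytes).
std::vector<std::uint8_t> encode_short_string(std::uint32_t field_number,
                                              const std::string& value) {
    std::vector<std::uint8_t> out;
    out.push_back(make_tag_byte(field_number, 2));           // wire type 2 = length-delimited
    out.push_back(static_cast<std::uint8_t>(value.size()));  // length prefix
    out.insert(out.end(), value.begin(), value.end());       // raw UTF-8 bytes
    return out;
}
```

Calling `encode_short_string(1, "hpn")` yields `0x0A 0x03 0x68 0x70 0x6E`. Applying the same rule to the lab's LoginRequest gives 2 + 8 bytes for username and 2 + 19 bytes for password, 31 bytes in total.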

Varint Encoding (Clever!)

┌─────────────────────────────────────────────────────────────────────────┐
│                    VARINT ENCODING                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Small numbers use fewer bytes:                                        │
│                                                                         │
│   Value           Bytes       Encoding                                  │
│   ────────────    ─────────   ────────────────────                      │
│   1               1 byte      0x01                                      │
│   127             1 byte      0x7F                                      │
│   128             2 bytes     0x80 0x01                                 │
│   16383           2 bytes     0xFF 0x7F                                 │
│   16384           3 bytes     0x80 0x80 0x01                            │
│                                                                         │
│   Most real-world IDs are small → Very efficient!                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
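The table follows from a simple rule: emit 7 bits at a time, lowest bits first, and set the high bit on every byte except the last. A minimal sketch of that rule for unsigned values; `encode_varint` and `decode_varint` are illustrative names, not the protobuf library API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode: low 7 bits first; the high bit of a byte means "more bytes follow".
std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(value | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}

// Decode: returns false if the input ends before a terminating byte
// or runs past the 10 bytes a 64-bit varint can occupy.
bool decode_varint(const std::vector<std::uint8_t>& in, std::uint64_t& value) {
    value = 0;
    for (std::size_t i = 0; i < in.size() && i < 10; ++i) {
        value |= static_cast<std::uint64_t>(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) return true;
    }
    return false;
}
```

For example, `encode_varint(128)` produces `0x80 0x01`, matching the third row of the table.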

Schema Evolution (Versioning)

Safe Changes

protobuf
// Version 1
message User {
    string name = 1;
    int32 age = 2;
}

// Version 2 - SAFE additions
message User {
    string name = 1;
    int32 age = 2;
    string email = 3;        // NEW - old clients ignore
    string phone = 4;        // NEW - old clients ignore
    reserved 5, 6;           // Numbers of removed fields; protoc rejects reuse
    reserved "old_field";    // Name of a removed field; protoc rejects reuse
}
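Old clients can ignore new fields because every field carries its wire type, which tells a reader how many bytes to skip even when the field number is unknown. The sketch below is a hand-written "version 1" reader that only knows field 1 (`name`); it is an illustration of the skipping rule, not the generated protobuf parser, and all names in it are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Read one varint at pos and advance pos. Returns false on malformed input.
bool read_varint(const Bytes& in, std::size_t& pos, std::uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// Extract field 1 (string name) and skip every other field by wire type.
bool parse_user_v1(const Bytes& in, std::string& name) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::uint64_t tag = 0;
        if (!read_varint(in, pos, tag)) return false;
        const std::uint64_t field = tag >> 3;
        const std::uint64_t wire_type = tag & 0x7;
        std::uint64_t len = 0;
        switch (wire_type) {
            case 0:  // varint value: consume it and move on
                if (!read_varint(in, pos, len)) return false;
                continue;
            case 1: len = 8; break;  // fixed 64-bit
            case 5: len = 4; break;  // fixed 32-bit
            case 2:                  // length-delimited: length prefix first
                if (!read_varint(in, pos, len)) return false;
                break;
            default: return false;   // other wire types are not handled here
        }
        if (len > in.size() - pos) return false;  // truncated payload
        if (field == 1 && wire_type == 2) {
            const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
            name.assign(first, first + static_cast<std::ptrdiff_t>(len));
        }
        pos += static_cast<std::size_t>(len);
    }
    return true;
}
```

Fed a version 2 message containing name, age, and email, this reader returns the name and steps over the email bytes without knowing what they mean.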

Unsafe Changes (AVOID!)

protobuf
// ❌ DON'T: Change field numbers
message User {
    string name = 2;  // Was 1 → BREAKS compatibility!
}

// ❌ DON'T: Change field types
message User {
    int64 name = 1;   // Was string → BREAKS!
}

// ❌ DON'T: Remove fields without reserving
message User {
    // string name = 1;  // Removed without reserve → Future collision risk!
}
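The first two changes break compatibility because both the field number and the type's wire type are baked into the tag that precedes each value. A small sketch that computes the tags involved; the constant names are illustrative.

```cpp
#include <cstdint>

constexpr std::uint32_t kVarint = 0;           // wire type for int32/int64/bool/enum
constexpr std::uint32_t kLengthDelimited = 2;  // wire type for string/bytes/messages

// Tag value that precedes every field on the wire.
constexpr std::uint32_t make_tag(std::uint32_t field_number, std::uint32_t wire_type) {
    return (field_number << 3) | wire_type;
}

constexpr std::uint32_t kOriginalTag   = make_tag(1, kLengthDelimited);  // string name = 1
constexpr std::uint32_t kRenumberedTag = make_tag(2, kLengthDelimited);  // string name = 2
constexpr std::uint32_t kRetypedTag    = make_tag(1, kVarint);           // int64 name = 1

static_assert(kOriginalTag == 0x0A, "what existing senders write");
static_assert(kRenumberedTag == 0x12, "what a renumbered reader looks for");
static_assert(kRetypedTag == 0x08, "what a retyped reader looks for");
```

Existing senders keep writing tag 0x0A, so a reader built from either changed schema no longer finds `name` where it expects it.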

Best Practices

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROTOBUF BEST PRACTICES                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ✅ DO                                                                  │
│   ─────                                                                 │
│   • Use field numbers 1-15 for frequently used fields (1-byte tag)      │
│   • Use `optional` for fields that may be absent                        │
│   • Use `reserved` when removing fields                                 │
│   • Version your .proto files (auth_v1.proto, auth_v2.proto)            │
│   • Use packages to avoid name collisions                               │
│                                                                         │
│   ❌ DON'T                                                               │
│   ───────                                                               │
│   • Don't reuse field numbers                                           │
│   • Don't change field types                                            │
│   • Don't remove fields without reserving                               │
│   • Don't use required (removed in proto3)                              │
│   • Don't use default values for business logic                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
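The advice about field numbers 1-15 follows from the tag being a varint of `(field_number << 3) | wire_type`: only 4 bits remain for the field number in a single byte. A short sketch that checks the boundary at compile time; `varint_size` and `tag_size` are illustrative helpers.

```cpp
#include <cstdint>

// Number of bytes a varint needs for `value` (7 payload bits per byte).
constexpr int varint_size(std::uint64_t value) {
    int bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Size of a field's tag on the wire; the wire type occupies the low 3 bits
// and never changes how many bytes the tag needs.
constexpr int tag_size(std::uint32_t field_number) {
    return varint_size(static_cast<std::uint64_t>(field_number) << 3);
}

static_assert(tag_size(15) == 1, "fields 1-15 fit in a one-byte tag");
static_assert(tag_size(16) == 2, "field 16 needs a second byte");
static_assert(tag_size(2047) == 2, "two-byte tags last until field 2047");
```

Saving one byte per field matters most for fields that appear in every message or inside repeated elements.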

Alternative: FlatBuffers (Extreme Performance)

When Protobuf is still not fast enough (HFT, game engines):

┌─────────────────────────────────────────────────────────────────────────┐
│                    FLATBUFFERS vs PROTOBUF                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Protobuf:     Serialize → Send → Deserialize → Access                 │
│   FlatBuffers:  Serialize → Send → Access (NO DESERIALIZE!)             │
│                                                                         │
│   FlatBuffers reads data directly from buffer = Zero-copy               │
│                                                                         │
│   Trade-off:                                                            │
│   • Protobuf: Easier to use, more features                              │
│   • FlatBuffers: Faster, but more complex schema                        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
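The "no deserialize" access pattern can be illustrated with a view class that reads fields straight out of the received buffer on demand. This is a deliberately simplified sketch with a hypothetical fixed layout; it is not FlatBuffers' actual format, which locates fields through offsets and vtables, and `LoginView` is an invented name.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed layout: [user_id: u32 little-endian][flags: u32 little-endian].
// The view does not own or copy the buffer; it must not outlive it.
class LoginView {
public:
    LoginView(const std::uint8_t* data, std::size_t size) : data_(data), size_(size) {}

    // Check once that the buffer is large enough before calling any accessor.
    bool valid() const { return data_ != nullptr && size_ >= 8; }

    // Accessors decode on demand; no intermediate object is ever built.
    std::uint32_t user_id() const { return read_u32_le(0); }
    std::uint32_t flags() const { return read_u32_le(4); }

private:
    std::uint32_t read_u32_le(std::size_t offset) const {
        const std::uint8_t* p = data_ + offset;
        return std::uint32_t{p[0]} | (std::uint32_t{p[1]} << 8) |
               (std::uint32_t{p[2]} << 16) | (std::uint32_t{p[3]} << 24);
    }

    const std::uint8_t* data_;
    std::size_t size_;
};
```

The cost of reading one field is a bounds check plus a few loads, regardless of how many other fields the message contains, which is the property that makes this style attractive for latency-sensitive code.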

Next steps

Now that serialization is clear, move on to the transport layer:

🔌 gRPC Framework: RPC, service definition, async server implementation