Architecture

Tech stack

We have chosen to use the following tech stack:

Golang 1.23
BoltDB
eBPF

The Agent, the Dispatcher, and the TUI application share one code base and, for now, ship as a single binary. BoltDB stores the status of actions such as clone, pause, resume, and restore, the metadata of the servers where the Agent is running, and the dirty sectors being tracked.
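BoltDB stores keys and values as byte slices, so a record has to be serialized before it is written. The following is a minimal sketch of what an action status record could look like, assuming JSON encoding. The `Action` type, its fields, and the `encodeAction` and `decodeAction` helpers are hypothetical and are not taken from the blxrep source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Action is a hypothetical status record for one clone, pause, resume,
// or restore operation. The field names are assumptions for illustration.
type Action struct {
	ID       string `json:"id"`
	AgentID  string `json:"agent_id"`
	Type     string `json:"type"`     // e.g. "clone" or "restore"
	Status   string `json:"status"`   // e.g. "in_progress" or "paused"
	Progress int    `json:"progress"` // percent complete
}

// encodeAction returns the key and value bytes that could be stored
// in a key/value bucket for this action.
func encodeAction(a Action) (key, value []byte, err error) {
	value, err = json.Marshal(a)
	if err != nil {
		return nil, nil, err
	}
	return []byte(a.ID), value, nil
}

// decodeAction parses a stored value back into an Action.
func decodeAction(value []byte) (Action, error) {
	var a Action
	err := json.Unmarshal(value, &a)
	return a, err
}

func main() {
	key, value, err := encodeAction(Action{
		ID: "act-1", AgentID: "web-01", Type: "clone",
		Status: "in_progress", Progress: 42,
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s => %s\n", key, value)
}
```

Keying records by action ID keeps a status update to a single overwrite of one key; the actual key layout used by blxrep may differ.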

Overview

The blxrep architecture keeps Agents and the Dispatcher connected in real time over multiple WebSocket channels, so full disk backups and continuous incremental change tracking can run simultaneously. When an Agent connects, it authenticates over a secure WebSocket connection and immediately begins two parallel processes: creating a complete disk image and monitoring disk sectors for changes. During the full backup, the Agent streams disk data to the Dispatcher, which stores it as an .img file in the snapshot directory and tracks progress in the xmactions database. At the same time, the Agent sends the numbers of changed sectors to the Dispatcher through a dedicated WebSocket channel (/ws/live), and the Dispatcher saves them as .cst files in the incremental directory.
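To illustrate the change-monitoring side, here is a minimal sketch of how an Agent could collect changed sector numbers between sends. It assumes writes are reported as a starting sector plus a sector count; the `DirtySet` type and its methods are hypothetical and are not the actual blxrep implementation.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// DirtySet is a hypothetical in-memory set of changed sector numbers.
// It is safe for concurrent use by a monitor and a sender goroutine.
type DirtySet struct {
	mu      sync.Mutex
	sectors map[uint64]struct{}
}

// NewDirtySet returns an empty set.
func NewDirtySet() *DirtySet {
	return &DirtySet{sectors: make(map[uint64]struct{})}
}

// Mark records that count sectors starting at start were written.
// Sectors that are already marked are stored only once.
func (d *DirtySet) Mark(start, count uint64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for s := start; s < start+count; s++ {
		d.sectors[s] = struct{}{}
	}
}

// Drain returns the marked sectors in ascending order and empties the set,
// so the next batch contains only sectors changed after this call.
func (d *DirtySet) Drain() []uint64 {
	d.mu.Lock()
	defer d.mu.Unlock()
	out := make([]uint64, 0, len(d.sectors))
	for s := range d.sectors {
		out = append(out, s)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	d.sectors = make(map[uint64]struct{})
	return out
}

func main() {
	set := NewDirtySet()
	set.Mark(100, 3) // sectors 100, 101, 102
	set.Mark(101, 2) // overlaps the previous write
	fmt.Println(set.Drain())
}
```

Deduplicating before sending means a sector that is rewritten many times within one batch is reported to the Dispatcher only once.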

At regular intervals defined by live_sync_frequency, the Dispatcher reads the collected sector numbers and requests the corresponding data from the Agent. It then stores the received data in .bak files within the incremental directory, so every recorded change is backed by the sector contents themselves.
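One way to keep the number of live sync requests small is to merge the collected sector numbers into contiguous ranges before asking the Agent for data. The sketch below shows that idea only; the source does not say whether the Dispatcher coalesces sectors, and the `Range` type and `Coalesce` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a hypothetical request unit: Count sectors starting at Start.
type Range struct {
	Start uint64
	Count uint64
}

// Coalesce sorts the sector numbers, drops duplicates, and merges
// adjacent sectors into contiguous ranges.
func Coalesce(sectors []uint64) []Range {
	if len(sectors) == 0 {
		return nil
	}
	// Work on a copy so the caller's slice is left untouched.
	sorted := append([]uint64(nil), sectors...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	ranges := []Range{{Start: sorted[0], Count: 1}}
	for _, s := range sorted[1:] {
		last := &ranges[len(ranges)-1]
		switch {
		case s < last.Start+last.Count:
			// Duplicate sector number, already covered by the last range.
		case s == last.Start+last.Count:
			last.Count++ // extends the last range by one sector
		default:
			ranges = append(ranges, Range{Start: s, Count: 1})
		}
	}
	return ranges
}

func main() {
	for _, r := range Coalesce([]uint64{8, 3, 4, 5, 4, 9}) {
		fmt.Printf("request %d sectors starting at %d\n", r.Count, r.Start)
	}
}
```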

In the current implementation, if network connectivity between the Agent and Dispatcher is interrupted, the Agent initiates a new full disk snapshot upon reconnection. While this approach ensures data consistency, it's not optimized for network efficiency or storage resources. We are actively exploring more efficient approaches that would capture only the incremental changes that occurred during the network downtime, alongside the existing live change sector tracking mechanism.

This optimization would significantly reduce network bandwidth usage and backup time during reconnection scenarios. Instead of transferring the entire disk image again, the system would only need to synchronize the specific sectors that changed during the disconnection period. This enhancement would be particularly valuable in environments with unstable network connections or when dealing with large disk volumes.
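A rough calculation shows the scale of the potential saving. The sketch below compares the bytes needed for an incremental resync with the size of a full snapshot. It assumes 512-byte sectors, and the disk size and number of changed sectors in `main` are made-up example values, not measurements from blxrep.

```go
package main

import "fmt"

// sectorSize is the number of bytes per sector. 512 is an assumption;
// some disks use 4096-byte sectors.
const sectorSize = 512

// resyncBytes returns the bytes that must be transferred to resend
// only the sectors that changed during a disconnection.
func resyncBytes(changedSectors uint64) uint64 {
	return changedSectors * sectorSize
}

// fractionOfFull returns the incremental transfer size as a fraction
// of a full snapshot of a disk that is diskBytes large.
func fractionOfFull(changedSectors, diskBytes uint64) float64 {
	if diskBytes == 0 {
		return 0
	}
	return float64(resyncBytes(changedSectors)) / float64(diskBytes)
}

func main() {
	const diskBytes = 100 << 30 // example: a 100 GiB disk
	const changed = 2_000_000   // example: sectors changed while offline
	fmt.Printf("incremental resync: %d bytes (%.2f%% of a full snapshot)\n",
		resyncBytes(changed), 100*fractionOfFull(changed, diskBytes))
}
```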

sequenceDiagram
    participant A as Agent
    participant D as Dispatcher
    participant XMA as xmactions DB
    participant XMD as xmdispatcher DB
    participant SNAP as /data-dir/snapshot
    participant INC as /data-dir/incremental

    A->>+D: WS: /ws/config, /ws/snapshot, /ws/live, /ws/restore
    A->>D: Auth (secret)
    D-->>-A: Auth Success
    A->>D: Footprint Data
    D->>XMD: Store Footprint

    par Full Backup
        A->>A: Start disk clone
        A->>+D: Metadata (/ws/snapshot)
        D->>XMA: Create action
        D-->>-A: ACK
        loop Backup Progress
            A->>D: Disk chunks
            D->>XMA: Update progress
            D->>SNAP: Write .img file
        end
    and Change Monitor
        A->>A: Monitor sectors
        loop On Changes
            A->>D: Changed sectors (/ws/live)
            D->>INC: Write sectors (.cst)
        end
    end

    loop Live Sync (live_sync_frequency)
        D->>INC: Read .cst file
        D->>A: Request sector data
        A->>A: Read sectors
        A->>D: Send sector data
        D->>INC: Write .bak file
    end

The architecture utilizes four distinct WebSocket endpoints:

/ws/config for configuration management
/ws/snapshot for full disk backup operations
/ws/live for real-time change tracking
/ws/restore for data restoration processes

This separation of concerns allows for efficient handling of different types of operations while maintaining persistent connections between the Agent and Dispatcher. The combination of continuous change tracking and dedicated communication channels makes blxrep particularly effective for maintaining synchronized disk states across systems. The planned optimizations for handling network interruptions will further enhance the system's efficiency and reliability in real-world deployment scenarios.
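The four endpoints listed above could be registered on the Dispatcher roughly as follows. This is a sketch using only the Go standard library: the handlers are placeholders that return the purpose of each path, where a real handler would authenticate the Agent and upgrade the connection with a WebSocket library. The `newMux`, `placeholder`, and `probe` names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// endpoints maps each WebSocket path to its purpose.
var endpoints = map[string]string{
	"/ws/config":   "configuration management",
	"/ws/snapshot": "full disk backup operations",
	"/ws/live":     "real-time change tracking",
	"/ws/restore":  "data restoration processes",
}

// placeholder returns a handler that only reports the endpoint's purpose.
// A real handler would authenticate and upgrade to a WebSocket here.
func placeholder(purpose string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, purpose)
	}
}

// newMux registers one handler per endpoint so each channel is
// served independently of the others.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	for path, purpose := range endpoints {
		mux.HandleFunc(path, placeholder(purpose))
	}
	return mux
}

// probe sends an in-memory GET request to the mux and returns the
// status code and response body.
func probe(mux *http.ServeMux, path string) (int, string) {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(newMux(), "/ws/live")
	fmt.Println(code, body)
}
```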

Deployment Architecture

architecture-beta
    group dispatcher_system(cloud)[Dispatcher System]
        service dispatcher_core(server)[Dispatcher] in dispatcher_system
        service backup_storage(disk)[Backup Storage] in dispatcher_system

    group target_servers(server)[Target Servers]
        service agent1(server)[Agent 1] in target_servers


    service admin(internet)[Backup Administrator]

    dispatcher_core:B -- T:backup_storage

    agent1:R -- L:dispatcher_core
    admin:R -- L:dispatcher_core

The Dispatcher is deployed in a different subnet or datacenter than the target servers. The target servers can reach the Dispatcher over a private or a public connection. The backup storage is a disk mounted on the Dispatcher server, and it is where the backups are stored.