A balancing act – Performance and reliability in trading matching engines

The matching engine is the core technological pillar of any trading venue. It is the engine room where all the action happens, driving global markets that exchange trillions of dollars daily. With this in mind, Sergey Samushin, head of exchange solutions at Devexperts, delves into the intricate balance between performance and reliability in matching engines and how to engineer a system that excels at both.

A matching engine acts as a sophisticated state machine: every input changes its internal state and produces an output. It processes orders from clients and commands from exchanges, producing outcomes such as filled or rejected orders and various updates related to trades and instruments.
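To make the state-machine view concrete, here is a minimal sketch of a single-instrument engine with price-time priority. It is an illustration only, not the design of any particular venue: the names `Order`, `MatchingEngine` and `submit`, the event tuples, and the use of integer tick prices are all assumptions introduced for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    side: str   # "buy" or "sell"
    price: int  # limit price in ticks (integers avoid float rounding)
    qty: int


class MatchingEngine:
    """Toy engine: the state is the resting orders, each input is one order."""

    def __init__(self):
        self.bids = {}  # price -> deque of resting buy orders (FIFO)
        self.asks = {}  # price -> deque of resting sell orders (FIFO)

    def submit(self, order):
        """Apply one input, mutate the book, and return the output events."""
        if order.qty <= 0 or order.side not in ("buy", "sell"):
            return [("rejected", order.order_id)]
        events = []
        is_buy = order.side == "buy"
        book, opposite = (self.bids, self.asks) if is_buy else (self.asks, self.bids)
        while order.qty > 0 and opposite:
            # Best opposite price: lowest ask for a buy, highest bid for a sell.
            best = min(opposite) if is_buy else max(opposite)
            crosses = best <= order.price if is_buy else best >= order.price
            if not crosses:
                break
            queue = opposite[best]
            resting = queue[0]  # oldest order at that price goes first
            traded = min(order.qty, resting.qty)
            order.qty -= traded
            resting.qty -= traded
            events.append(("trade", resting.order_id, order.order_id, best, traded))
            if resting.qty == 0:
                queue.popleft()
            if not queue:
                del opposite[best]
        if order.qty > 0:
            # Any unfilled remainder rests in the book at its limit price.
            book.setdefault(order.price, deque()).append(order)
            events.append(("resting", order.order_id, order.price, order.qty))
        return events
```

Each call to `submit` is one state transition: the same sequence of orders applied to an empty book always yields the same book and the same events, which is the property that replication relies on.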

However, it’s the networking and storage elements that equip the matching engine with the capability to manage vast order volumes and maintain a durable record of trading sessions.

Performance and reliability 

Performance and reliability should not conflict in a well-designed exchange. Whether there are three or five working nodes, users should not experience any type of performance dip. 

The additional nodes should proactively ensure consistent performance in case the primary node fails. A heavily replicated system might require more maintenance effort, but because the primary node operates independently, the additional nodes will not slow down the system.

For example, a single-instance matching engine might suffice for a demo or test environment of a retail exchange with moderate latency requirements, but a production system should not rely on one node, because that node becomes a single point of failure. If the one node fails, everything fails.

Replication as a solution 

To prioritise reliability, a replicated system design is adopted where multiple instances of gateways, matching engines, and databases run simultaneously. Such an architecture enhances failure resilience, as replicated components can take over in case of individual malfunctions.

However, this replication comes at the cost of more resources, such as additional hosts and increased disk storage, because of the overhead involved in maintaining multiple synchronised datasets.

Addressing latency

Latency is a critical factor, especially for institutional players who engage in algorithmic trading and require swift order processing. Crypto exchanges and retail-centric trading platforms may operate comfortably with latencies ranging from 200-500 microseconds, often hosted in cloud environments for their cost-effectiveness and ease of setup. 

In contrast, institutional venues lean towards bare-metal installations with hardware acceleration to minimise latency further.

High-performance and high-reliability systems

The most demanding trading applications expect both stellar performance and robust reliability. To achieve this, state-of-the-art matching engines operate entirely in RAM, avoiding latency introduced by disk or solid-state drives.

For enhanced reliability, these systems use replication techniques, running multiple engine instances in parallel and employing consensus algorithms to ensure synchronised states across replicas. 
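The idea that replicas stay synchronised by applying the same inputs in the same order can be sketched as follows. This is a simplified illustration under stated assumptions: the `Replica` class, its command tuples and the digest comparison are hypothetical, and a real system would obtain the agreed input sequence from a consensus layer rather than from a local loop.

```python
import hashlib
import json


class Replica:
    """Hypothetical replica: applies sequenced inputs to an in-memory state."""

    def __init__(self, name):
        self.name = name
        self.open_orders = {}  # order_id -> remaining quantity, held in RAM
        self.last_seq = 0      # sequence number of the last applied input

    def apply(self, seq, command):
        # Refuse gaps or reordering: every replica must see the same sequence.
        if seq != self.last_seq + 1:
            raise ValueError(f"{self.name}: expected seq {self.last_seq + 1}, got {seq}")
        kind, order_id, qty = command
        if kind == "new":
            self.open_orders[order_id] = qty
        elif kind == "cancel":
            self.open_orders.pop(order_id, None)
        self.last_seq = seq

    def digest(self):
        # Canonical serialisation, so equal states produce equal hashes.
        payload = json.dumps([self.last_seq, sorted(self.open_orders.items())])
        return hashlib.sha256(payload.encode()).hexdigest()


inputs = [("new", 1, 10), ("new", 2, 5), ("cancel", 1, 0)]
replicas = [Replica("primary"), Replica("backup-1"), Replica("backup-2")]
for seq, command in enumerate(inputs, start=1):
    for replica in replicas:
        replica.apply(seq, command)

# All replicas applied the same inputs in the same order, so the digests match.
in_sync = len({replica.digest() for replica in replicas}) == 1
```

Comparing digests is one inexpensive way to detect a replica that has diverged; it only works if the state transitions are fully deterministic.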

Throughput and scalability 

Exchanges must also be designed to handle sudden surges in trading activity, such as those seen during “black swan” events or market movements driven by social media.

To ensure scalability and high throughput, venues deploy clusters of independent order-processing units and scale horizontally: the instrument list is segmented, and each segment is managed by its own engine instance.
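One simple way to segment an instrument list is to hash each symbol to an engine index. The sketch below assumes this hashing scheme purely for illustration; the function names `shard_for` and `partition` are hypothetical, and a real venue may instead assign instruments to engines explicitly, for example to balance expected volume.

```python
import zlib


def shard_for(symbol, shard_count):
    """Map an instrument symbol to an engine index with a stable hash."""
    # zlib.crc32 gives the same value in every process, whereas Python's
    # built-in hash() for strings is randomised per process by default.
    return zlib.crc32(symbol.encode("utf-8")) % shard_count


def partition(symbols, shard_count):
    """Group symbols by the engine instance that will own them."""
    shards = [[] for _ in range(shard_count)]
    for symbol in symbols:
        shards[shard_for(symbol, shard_count)].append(symbol)
    return shards


shards = partition(["AAPL", "MSFT", "EURUSD", "BTCUSD", "TSLA"], shard_count=3)
```

Because all orders for one instrument land on the same engine, each order book is still matched sequentially. Note that plain modulo hashing moves many symbols when `shard_count` changes, which is one reason an explicit assignment table can be preferable.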

The consensus challenge

Maintaining consensus across distributed systems, especially under high loads, is a complex task. The Raft protocol is currently the best solution for achieving consensus among the replicas in a matching engine cluster, in other words, for ensuring that all engine replicas agree on the sequence of inputs.

In Raft, the replicas elect a “leader” node responsible for propagating inputs to the others, and they elect a new leader if the current one fails, thus maintaining system consistency and reliability.
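A central rule in Raft is that an input counts as committed only once it is stored on a majority of the nodes. The sketch below shows just that majority calculation; the function name and arguments are hypothetical, and it deliberately omits the rest of the protocol, including leader election and Raft's additional requirement that the entry at that index belongs to the leader's current term.

```python
def majority_commit_index(leader_last_index, follower_match_index):
    """Highest log index stored on a majority of the cluster, leader included.

    leader_last_index: index of the last entry in the leader's own log.
    follower_match_index: for each follower, the highest index known to be
    replicated on it.
    """
    indexes = sorted([leader_last_index, *follower_match_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest index is held by at least `majority` nodes.
    return indexes[majority - 1]
```

In a five-node cluster where the leader is at index 10 and the followers are at 9, 8, 3 and 2, the function returns 8: three of the five nodes hold every entry up to index 8, so two slow followers do not hold back the rest of the cluster.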

Persistence, recovery, and storage needs

Exchange venues often have to fulfill extensive reporting obligations, necessitating a system that stores event histories without impairing performance. Regular snapshots of the matching engine’s state complement a full event log, allowing for quick recovery and state resumption. 
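Recovery from a snapshot plus the event log can be sketched as below. Everything here is an assumption made for illustration: the in-memory `EventLog`, the dictionary-shaped state and the helper names are hypothetical, and a production log would live on durable, replicated storage.

```python
import copy


class EventLog:
    """Hypothetical append-only log of (seq, order_id, qty_delta) events."""

    def __init__(self):
        self.events = []

    def append(self, order_id, qty_delta):
        seq = len(self.events) + 1
        self.events.append((seq, order_id, qty_delta))
        return seq

    def since(self, seq):
        # Events recorded strictly after the given sequence number.
        return [event for event in self.events if event[0] > seq]


def apply_event(state, event):
    seq, order_id, qty_delta = event
    state["open_qty"][order_id] = state["open_qty"].get(order_id, 0) + qty_delta
    state["last_seq"] = seq


def take_snapshot(state):
    return copy.deepcopy(state)


def recover(snapshot, log):
    """Rebuild state from a snapshot plus the events recorded after it."""
    state = copy.deepcopy(snapshot)
    for event in log.since(state["last_seq"]):
        apply_event(state, event)
    return state


log = EventLog()
live = {"open_qty": {}, "last_seq": 0}
for order_id, delta in [(1, 10), (2, 5)]:
    log.append(order_id, delta)
    apply_event(live, log.events[-1])
snapshot = take_snapshot(live)      # periodic snapshot
for order_id, delta in [(1, -4), (3, 7)]:
    log.append(order_id, delta)
    apply_event(live, log.events[-1])
recovered = recover(snapshot, log)  # simulated restart after a failure
```

The snapshot bounds recovery time: only the events after its sequence number need replaying, while the full log remains available for reporting.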

Additionally, separate storage solutions cater to the extensive querying needs without taxing the matching engine.

In conclusion, designing a matching engine that marries high performance with unwavering reliability is a complex yet achievable goal. It requires an understanding of how latency and throughput are connected: as an exchange grows in popularity, the throughput it must sustain increases, and to increase throughput, the engineering team should work on achieving the lowest latency possible. For example, an engine that processes orders strictly one at a time and spends 5 microseconds on each can handle at most 200,000 orders per second.

Other key technology considerations are state synchronisation alongside sophisticated replication and consensus strategies.

As the financial trading landscape continues to evolve, so too must the technological backbone that supports it, ensuring that trading venues can withstand the tests of both time and volume without sacrificing speed or stability.