During the course of betanet testing before release of the Radix Olympia mainnet, RDX Works engaged noted distributed systems safety researcher Jepsen on behalf of Radix Tokens Jersey to test and report on the performance of networked Radix validator and archive nodes.
While Jepsen’s work has historically focused mostly on non-DLT distributed databases without public decentralized networks, we were eager to have their famously rigorous testing process employed against Radix. We expected we would gain greater benefit from Jepsen’s approach of creating and running adverse test scenarios than we would from a typical eyeballs-only code audit. Jepsen is justifiably well-known for their consistent record of identifying issues even in heavily-used, well-trusted business products, and we believed this was the best outside voice we could possibly bring in to put our software through the wringer.
Jepsen did not disappoint, both in confirming correct behavior under conditions such as network partitions and forced crashes of validators, and in discovering some problems with our archive node implementation.
To summarize the findings: Our consensus system functions correctly under all conditions within its proof-of-stake security bounds. However, some issues were identified in the client layer - the archive node system - that could cause it to return inaccurate information about the underlying ledger state; for example, a transaction could appear to have failed when it was actually successfully confirmed, with a delay before the archive node corrected itself to show the transaction as committed.
RDX Works was unable to reproduce these issues on an assortment of networks when using the Radix Wallet (simulating a typical user) in a series of focused testing efforts; however, we were able to reproduce them when programmatically pushing transactions through a node as quickly as possible.
In investigating Jepsen’s discoveries, we determined that the best course of action was to shift our development priorities to immediately begin work on moving away from the archive node system as our mechanism of reporting ledger state, and switch to a new architecture in which any node could produce a full stream of observed events. Armed with such an event stream, we could create an aggregator service which consumed those events from multiple feeder nodes, maintained the results in a relational database, and served all consumer requests for information. Jepsen began testing a series of development builds which laid the foundation for this work and we confirmed that the new design was the best path forward.
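The aggregator pattern described above can be sketched in a few lines. This is an illustrative model only, not the actual Network Gateway code: the event shapes, names, and the divergence check are all assumptions, standing in for consuming the same committed-transaction stream from multiple feeder nodes, de-duplicating it, and maintaining a single authoritative view to serve queries from.

```python
# Hypothetical sketch of an aggregator consuming event streams from multiple
# feeder nodes. Each feed yields (state_version, tx_id, status) tuples as a
# stand-in for a node's stream of observed ledger events.

def aggregate(feeds):
    """Merge per-node event streams into one ordered, de-duplicated view."""
    ledger = {}  # state_version -> (tx_id, status); stand-in for the relational store
    for feed in feeds:
        for state_version, tx_id, status in feed:
            seen = ledger.get(state_version)
            if seen is None:
                ledger[state_version] = (tx_id, status)
            elif seen != (tx_id, status):
                # Two nodes disagreeing at the same state version would
                # indicate a consensus-level problem upstream, not an
                # aggregation problem.
                raise RuntimeError(f"feeder divergence at version {state_version}")
    return [ledger[v] for v in sorted(ledger)]

# One feeder lags behind the other; the merged view is still complete.
node_a = [(1, "tx1", "COMMITTED"), (2, "tx2", "COMMITTED")]
node_b = [(1, "tx1", "COMMITTED")]
print(aggregate([node_a, node_b]))
```

Because every committed event carries a unique, monotonically increasing state version, any number of feeders can be merged without double-counting, and a lagging or restarted feeder simply contributes events the aggregator has already seen.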
The resulting system addressed the issues identified by Jepsen and provided several additional benefits. The two key components are the Core API (which exposes the transaction stream on the node, released Jan 17) and the Network Gateway (the aggregator and query software, released Jan 20). Radix Tokens (Jersey) Limited began running a free-to-anyone instance of the Network Gateway, called the Gateway Service, on Jan 20, and updated versions of the Wallet and Explorer released on Jan 27 now make use of that service.
Using the updated architecture, we are no longer able to reproduce any of the issues identified in the Jepsen report, other than a throughput issue relating to test networks with unhealthy validators, which matches expected behavior with our current design.
Basics of the testing process
Jepsen’s testing was conducted by submitting normal user transactions to various network configurations running both released and unreleased builds of the node software. In some instances tests were performed against existing test networks; in others, against private clusters spun up and controlled by Jepsen, in order to introduce adverse conditions such as network partitions in a controlled way.
Initially, Jepsen read ledger state from the existing archive node API, but we eventually switched to a model in which we exposed an archive node’s raw transaction and balance logs which could be directly parsed by the test software. Jepsen’s software allows for the setting of expected system invariants, and then verifies whether those invariants are perfectly upheld in all test scenarios and operating conditions.
Jepsen’s testing did not descend to lower areas such as the communications protocol between validators; such layers were exercised indirectly through the course of submitting transactions, shutting down running node processes and varying the network conditions, then observing the results.
Why the move away from archive nodes?
A bit of background is necessary here. Prior to Jepsen’s testing, we already had plans to eventually move away from the archive node architecture. It was too tightly coupled to the node software, and couldn’t be updated without doing a whole new node release. It didn’t expose much detail about low-level network state, and any desire for a new query capability could only be fulfilled by time from a core network team developer - a precious resource! It was expensive to scale horizontally to meet high demand, and could not quickly respond to demand spikes.
When we began working through Jepsen’s early test results, the first step was to trace through the layers of the archive node software to determine whether the root of a problem was in the per-account transaction logs, the balance tracker, the indexing, the communication layer, or the query logic. This was cumbersome and time-consuming, given that problems were not predictably reproducible and also could not be replicated when attempting to attach to a node process and step through the logic with a debugger.
Given that we already had a desire to move to a model in which nodes could expose a transaction stream that could be consumed by some other service, we decided to prototype this directly with Jepsen, and began producing custom node builds which allowed for direct reading of account balances and transaction logs. This enabled us to start making changes to how those logs were produced without constantly breaking Jepsen’s tests, and to lay the groundwork for what would ultimately become a full stream of network events. This approach immediately showed promise, and we recognized the happy feeling that comes when you’ve hit upon the right way forward.
Despite our confidence that the event stream was the best long-term direction, we estimated the time and resources needed to address the issues identified by Jepsen within the confines of the archive node system, and found the cost was not much lower than our estimate for delivering the superior architecture. Since any work spent updating the archive node would be throwaway code - we would still want the new architecture as soon as the fixes were done - we decided to adjust our development priorities and immediately get cracking on the preferred implementation.
Why wait until now to disclose Jepsen’s findings?
Adversarial actions are extraordinarily common on public ledgers, in the form of both technical and social engineering attacks. Given that we were unable to replicate the issue under a typical wallet user’s usage profile, we determined that the risk to end users was greater if any disclosure was made before a fix had been successfully implemented, tested, and deployed. The only high-rate users we could reasonably expect to be impacted at that point were Instabridge and Bitfinex; Bitfinex was notified of the issue during development and adjusted their integration to avoid it until the solution was provided.
Where’s the full Jepsen report?
Anyone who has read this far down is probably familiar with common terms around consensus in the DLT space, and it should be noted that the definitions of safety and liveness used in the Jepsen report may be different from what you are accustomed to.
The DLT space typically defines a safety violation as two healthy validators disagreeing on what is the correct ledger state. Most notably, such a disagreement would result from a double-spend having been permitted. Specifically in the Olympia implementation, this means a single substate being successfully “downed” more than once. No such violations were discovered in Jepsen’s testing.
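The Olympia safety condition above can be expressed as a simple check over the committed transaction history. This is an illustrative sketch in a UTXO-like substate model; the field names are assumptions, not the node’s actual data structures.

```python
# Sketch of the safety condition: in Olympia's substate model, a safety
# violation means the same substate id is successfully "downed" (consumed)
# by more than one committed transaction.

def find_double_downs(committed_txs):
    """committed_txs: list of dicts, each with a 'downs' list of consumed substate ids."""
    downed = set()
    violations = []
    for tx in committed_txs:
        for substate_id in tx["downs"]:
            if substate_id in downed:
                violations.append(substate_id)  # consumed twice: a double-spend
            downed.add(substate_id)
    return violations

txs = [{"downs": ["s1", "s2"]}, {"downs": ["s3"]}, {"downs": ["s2"]}]
print(find_double_downs(txs))  # "s2" is downed by two transactions
```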
The DLT space defines a liveness break as the network being unable to process further transactions. No such breaks were discovered in Jepsen’s testing, except in cases where validators possessing greater than ⅓ of the validator set stake were forcibly taken offline. This is a known system invariant and is expected behavior; validators possessing at least ⅔ of the active set stake must participate in consensus in order to make progress.
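The ⅔-stake participation rule above reduces to a simple predicate. This sketch is illustrative only (the names are made up), and uses exact fractions to avoid floating-point rounding right at the threshold.

```python
# The liveness condition: consensus can proceed only if validators holding
# at least 2/3 of the total active-set stake are online and participating.
from fractions import Fraction

def can_make_progress(stakes, online):
    """stakes: {validator: stake}; online: set of currently participating validators."""
    total = sum(stakes.values())
    online_stake = sum(s for v, s in stakes.items() if v in online)
    return Fraction(online_stake, total) >= Fraction(2, 3)

stakes = {"v1": 40, "v2": 30, "v3": 30}
print(can_make_progress(stakes, {"v1", "v2"}))  # 70/100 >= 2/3 -> True
print(can_make_progress(stakes, {"v2", "v3"}))  # 60/100 <  2/3 -> False
```

Equivalently, forcing more than ⅓ of stake offline (as in Jepsen’s tests) necessarily drops participation below ⅔ and halts progress, which is the expected behavior rather than a liveness bug.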
Jepsen uses definitions of safety and liveness which come from the general distributed system space, which differ from the familiar, domain-specific interpretations.
Jepsen’s full report can be found here: http://jepsen.io/analyses/radix-dlt-1.0-beta.35.1
With the exception of an issue relating to performance in networks with unhealthy validators, all issues raised in the Jepsen report have been fully resolved with the updates released in January.
Conclusions
Testing with Jepsen was a great experience, and adversarial in exactly the constructive manner that we engaged them for. Jepsen tested the network in a way that few other public networks are ever tested until a much later point of maturity, when a large ecosystem has already been built around them and the incentives to attack are much greater. Doing this testing has armed us with a host of new tools to rapidly and reliably test an assortment of interesting network conditions going forward. After a long push to the mainnet release we certainly hadn’t planned on switching away from the archive node architecture quite so quickly, but with the work now behind us we’re extraordinarily happy with the results.
We are pleased that we submitted Radix technology to this level of scrutiny, and we encourage the creators of other layer-1 DLT networks to do the same.