Every two weeks, Radix DLT's founder Dan Hughes hosts a technical ask-me-anything (AMA) session on the main Radix Telegram Channel. There are always great discussions around Cerberus, Radix's next-generation consensus mechanism, general design approaches to the network, and of course broader industry questions.
Thank you to everyone who submitted questions and joined our AMA. You can find the full transcript below from the AMA session on March 2nd, 2021.
In case you missed it, there is a new paper on Fast-HotStuff, a variant of the HotStuff consensus algorithm that is faster than the original. The paper also describes a potential attack vector and a way to eliminate it. Are there plans to incorporate these insights into Cerberus, and in which release?
I've heard this paper was released but haven't had an opportunity to review it yet.
Mainly because the desire is to move away from strictly classical BFT if possible. There are various cans of worms down the road that we'll hit with HotStuff (including any variant) or any other classical consensus mechanism. While these issues are solvable, and we have solutions spec'd out for them, those same solutions add varying complexity and require compromises, which I'd rather avoid.
That's part of the reason for Cassandra: to put all of the research on these kinds of things together and try out solutions or novel ideas WAY before we need them. That way we can stay ahead of not only the competitor curve but also any surprises.
It's important work, even if some may consider what I'm doing pointless at the moment.
What is the throughput for a single trading pair on a DEX operating on RPN-3? For example, would the trading pair of tokens RTA and RTB be limited to the throughput of a single shard set (100-3k TPS)?
A fully optimized RPN-3 stack should be able to handle between 2-3k simple state-flipping DEX settlement transactions per second per validator set. That's making the assumption that the majority of the validators are of similar spec.
Beyond that, some tricks are needed. You can either split the order book across multiple shards and let arbitrage settle the difference between them (though there are eventually diminishing returns due to the arbitrage itself), or you can aggregate trades in a kind of rollup data structure.
My preference is the second option, which would probably yield the best overall performance.
Beyond that ... well, I dunno, but I guess things are pretty serious in the market as that's potentially upwards of 30k trades per second! A nice problem to have.
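To make the second option more concrete, here is a minimal sketch of rollup-style trade aggregation (in Python, with entirely hypothetical names; it illustrates the general idea, not Radix's actual implementation):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Trade:
    buyer: str       # account receiving RTA
    seller: str      # account receiving RTB
    amount_rta: int
    amount_rtb: int

def aggregate_trades(trades):
    """Collapse many matched trades into net balance changes.

    Instead of settling every trade as its own transaction, only the
    net position change per account is committed, so N trades cost
    roughly one settlement transaction.
    """
    net_rta = defaultdict(int)
    net_rtb = defaultdict(int)
    for t in trades:
        net_rta[t.buyer] += t.amount_rta
        net_rta[t.seller] -= t.amount_rta
        net_rtb[t.seller] += t.amount_rtb
        net_rtb[t.buyer] -= t.amount_rtb
    # One batched settlement object instead of len(trades) transactions.
    return {"rta_deltas": dict(net_rta), "rtb_deltas": dict(net_rtb)}
```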
In a few years after RPN-3, doesn't the global shard have to remember information about possibly millions of wannabe node runners? While the global shard is lightweight with regard to compute needs, doesn't it eventually run into a storage bottleneck that puts an end to unlimited scalability? Is it possible to (eventually) shard the data held by the global shard?
It's not anywhere near as bad as you might think.
Assume we have 10,000 validators, and let's assume that, in the worst case, 5,000 leave and 5,000 join every day.
That's 10,000 events per day going into the global shard, or around 3.65M per year. Those validator registration/deregistration events are quite light, but again, let's assume the worst case and say 500 bytes each.
That totals 1.8GB per year.
Adding the consensus information, and using signature aggregation, that grows to around 2GB per year ... basically nothing.
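Spelling that back-of-the-envelope calculation out (the constants are the worst-case assumptions above, not measured figures):

```python
events_per_day = 5_000 + 5_000              # worst-case joins + leaves
events_per_year = events_per_day * 365      # ~3.65M events per year
bytes_per_event = 500                       # generous upper bound

storage_per_year_gb = events_per_year * bytes_per_event / 1e9
print(storage_per_year_gb)                  # ~1.8 GB of events per year
```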
Someone could of course attempt to use the registration/deregistration process to bloat it with lots of these events. The solution for that is easy: make the fee to register a validator high. This brings another benefit: we don't want validators to be around for only short periods of time, so a high fee also means a higher level of commitment.
Can you please share your insights on what happens to an active node that is part of the epoch validator set if it gets restarted for reasons like crashes or configuration changes?
Does the network select the next available node in the list for participation in the epoch?
Will the node that has been restarted have to wait for the next epoch to be part of the validator set again?
This response is in the context of RPN-1.
The system expects validators to go offline / be uncontactable for periods of time as 100% guarantees on uptime are impossible. If a validator does crash, restart, lose internet access, whatever, it doesn't really have any negative effects unless that validator was the leader.
In the case the validator was the current leader, the other non-leader validators will eventually timeout the responses they were expecting from it (these timeouts are generally quite short). That will then trigger what's called a "view-change", which basically is the process to elect a new leader to pick up where the crashed one left off.
Once a new leader is elected, everything continues as normal.
The leader that "went away" is free to return and continue as a non-leader until it is elected leader again. So long as that absence is short, it's fine.
If that validator crashed and didn't come back for a long time, spanning an epoch perhaps, then things like slashing and penalties would take effect.
So, if you're gunna crash and restart, be quick about it.
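For intuition, here's a toy sketch of the timeout and view-change flow described above (hypothetical names throughout; the node's real pacemaker logic is more involved):

```python
import time

def await_proposal(validators, leader_index, view_timeout=2.0):
    """Simplified pacemaker: if the current leader produces no proposal
    before the timeout, rotate to the next leader (a "view-change")."""
    leader = validators[leader_index]
    deadline = time.monotonic() + view_timeout
    while time.monotonic() < deadline:
        proposal = leader.poll_proposal()  # hypothetical node API
        if proposal is not None:
            return proposal                # leader alive, continue as normal
        time.sleep(0.05)
    # Timeout expired: elect the next leader, who picks up where the
    # crashed one left off.
    return await_proposal(validators, (leader_index + 1) % len(validators))
```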
Mempool synchronization and the latency of processing atoms can differ between releases, right? Are there plans to publish a technical document around this topic for each release?
Mempool sync is an interesting subject with different faces across the various mainnet releases.
The core code will take care of syncing mempools to the required thresholds for that particular release, and it becomes an ever more critical component as the releases move on and become sharded, etc., almost becoming a "prepare" consensus in its own right.
I'm not sure why a technical document would be needed for this specifically for each release, at least from the perspective of node-runners.
It's an interesting topic for the "nerds" among us of course, but the different mempool strategies would probably end up being published as papers if they were so novel.
If 3000 TPS is possible on a single validator/shard set on Radix (as is the case for RPN-1), why doesn't RPN-1 release with this throughput?
Because of round trip times, message complexity, and the signature model.
RPN-1 is quite simple in how it manages message complexity and signatures. It's kept simple to ensure a solid baseline for us to build on.
The 3000 TPS is benchmarked from the point of view of what a validator could process if optimizations in various areas (such as constant communication costs, threshold signatures, vote representations, etc.) were in place, as per the later releases.
RPN-1 doesn't need 3000 TPS, therefore we don't need the complexity either.
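To see why message complexity alone makes such a difference at scale, compare naive all-to-all voting with leader-aggregated voting (illustrative arithmetic only, not RPN figures):

```python
def messages_all_to_all(n):
    # Naive BFT voting round: every validator sends its vote to every other.
    return n * (n - 1)

def messages_aggregated(n):
    # Each validator votes to a leader, who broadcasts one aggregated
    # signature back: linear rather than quadratic.
    return 2 * n

for n in (100, 1_000):
    print(n, messages_all_to_all(n), messages_aggregated(n))
# 100 validators:   9,900 vs 200 messages per round
# 1,000 validators: 999,000 vs 2,000 messages per round
```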
Can you please confirm that the node selection algorithm of RPN-1 and beyond picks nodes in such a manner that the network is evenly distributed across regions? What are the ramifications of a large stake weight exercising a monopoly over node selection? In other words, would there be a balance in an epoch set between physical node location and stake weight?
Feels like there are two questions there, but I can only really decipher one.
The validator selection doesn't consider geo-location at all. It's super easy for me to spoof my location via VPNs etc., and it really doesn't offer much benefit unless ALL validators are in the same area of landmass.
Also, bottom line, maximizing security should always take precedence over maximizing decentralization, especially if maximizing decentralization could lead to large amounts of stake not being in the validator set because they were in the "wrong" region.
And before anyone trolls me, yes I know that zero decentralization is detrimental to security ... you know what I mean!
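As a minimal sketch of what stake-weighted selection looks like (hypothetical code, just to show that location never enters the picture):

```python
import random

def select_validator_set(stakes, set_size, rng=random):
    """Pick a validator set with probability proportional to stake.

    `stakes` maps validator_id -> staked amount. Geo-location is not
    an input anywhere: only stake weight influences selection.
    """
    pool = dict(stakes)
    chosen = []
    while pool and len(chosen) < set_size:
        ids = list(pool)
        weights = [pool[v] for v in ids]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen
```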
If all nodes keep shifting around the shards they are serving, doesn't it essentially mean that all nodes need to store all the state? One thing I thought about sharding is that it lets you store only a subset of the state.
You store all the state in the shards you're serving, which collectively are a portion of all global state.
If there are 100 shards, and I store 2 shards, I'm storing 2% of the global state.
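In code terms, a node only persists state that falls into the shards it serves (a hypothetical sketch of the idea, not Radix's actual shard-mapping scheme):

```python
import hashlib

NUM_SHARDS = 100

def shard_of(state_key: bytes) -> int:
    # Deterministically map a piece of state to one of the shards.
    return int.from_bytes(hashlib.sha256(state_key).digest(), "big") % NUM_SHARDS

class Node:
    def __init__(self, served_shards):
        self.served_shards = set(served_shards)  # e.g. {3, 42} -> 2% of state
        self.store = {}

    def maybe_store(self, key: bytes, value):
        # Only persist state belonging to a shard this node serves.
        if shard_of(key) in self.served_shards:
            self.store[key] = value
```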
That covers all the questions Dan had time to answer this session. If you are keen to get more insights from Dan's AMAs, the full history of questions & answers can easily be found by searching the #AMA tag in the Radix Telegram Channel. If you would like to submit a question for the next session, just post a message with your questions in the main Telegram channel, using the #AMA tag!