Every couple of weeks, Radix Founder Dan Hughes climbs out of the coding cave to host a technical Ask Me Anything (AMA) session in the official Radix Telegram channel. These sessions are always informative, containing a wealth of knowledge drawn from exciting and challenging questions. These AMA sessions cover Cerberus, Cassandra, and other critical technical innovations of Radix.
Thank you to everyone who submitted questions and took part in the AMA. You can find the full transcript from the session on 22nd June 2021 below.
Have you seen https://twitter.com/ercwl/status/1164200569474621440? It’s an experiment showing that SSD disk speed limits scalability, not bandwidth, RAM, or CPU. Does NotorosVM or Radix suffer from the same problem? Why not?
He’s not wrong, but he’s not right either. The correct answer is “it depends.” It depends on a bunch of things.
For example, imagine a network with a fixed set of validators chosen beforehand whose sole purpose is to agree on the next random number in a sequence. Everything can be stored in memory as each state change is ephemeral, and there is very little “global state” at all as it’s predetermined.
Where is the bottleneck that determines how many numbers per second can be agreed upon? It’s certainly not SSD, nor CPU, nor memory; it’s bandwidth.
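To make the “it depends” point concrete, here is a rough back-of-the-envelope sketch of why bandwidth is the first resource to run out in that toy network. The validator count, vote size, and link speed below are made-up assumptions purely for illustration:

```python
# Back-of-the-envelope estimate for the toy "agree on the next random number"
# network described above. All figures are illustrative assumptions, not
# measurements of any real network.

VALIDATORS = 100              # fixed validator set
VOTE_SIZE_BYTES = 200         # assumed size of a signed vote message
BANDWIDTH_BPS = 100_000_000   # assumed 100 Mbit/s link per validator

# Naive all-to-all voting: each validator sends its vote to every peer and
# receives every peer's vote, once per round of agreement.
bytes_per_round = 2 * (VALIDATORS - 1) * VOTE_SIZE_BYTES

rounds_per_second = (BANDWIDTH_BPS / 8) / bytes_per_round
print(f"~{rounds_per_second:,.0f} agreements per second before the link saturates")

# Signing/verifying a few hundred tiny messages and holding one number in RAM
# is trivial for CPU, memory and SSD, so bandwidth runs out first here.
```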
Take another useless network, where each validator’s vote hash is required to have some quantity of leading zero bits (basically PoW) ... the bottleneck then is CPU.
Persistent IO is one of the things that can be a scalability bottleneck, but it is not the de facto scalability bottleneck.
Bitcoin, for example, is really IO light unless you’re syncing. Still, at its very best on commodity configurations it can only manage ~700-1000 TPS ... CPU, memory, and mainly bandwidth are the three horsemen of scalability there.
Suppose a cartel of dishonest validator nodes somehow gains sufficient stake to mount a Sybil attack. This question is specifically about Olympia. Assume they act maliciously only when they detect an opportunity for an exploit, and one of them happens to be the algorithmically chosen leader. When not acting maliciously, they are indistinguishable from an honest validator.
What is the worst they can do? I think they can affect liveness by how they pick transactions from the mempool.
Can they sneak a micropayment to themselves into the atom selected? A sufficiently small one that no one notices.
Can they force a double-spend through Cerberus and onto the ledger?
In the worst cases, they can break both liveness and safety. An actor with a sufficient amount of influence can always cause both, and it doesn’t matter what the consensus mechanism is; there are provable bounds of tolerance.
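For context, the “provable bounds of tolerance” referred to here are, in the standard deterministic BFT setting, the classic n ≥ 3f + 1 limit. A minimal sketch of that arithmetic (a textbook result, not anything Cerberus-specific):

```python
# Classic deterministic BFT bounds: n validators can tolerate at most f
# Byzantine validators where n >= 3f + 1, and any agreement needs a quorum
# of 2f + 1 votes. Beyond f, no consensus mechanism can guarantee both
# safety and liveness. (Textbook result, not Cerberus-specific.)

def bft_limits(n: int) -> tuple[int, int]:
    """Return (max Byzantine validators tolerated, quorum size) for n validators."""
    f = (n - 1) // 3        # largest f satisfying n >= 3f + 1
    quorum = 2 * f + 1      # votes required for any valid agreement
    return f, quorum

for n in (4, 100, 1000):
    f, q = bft_limits(n)
    print(f"n={n}: tolerates f={f}, quorum={q}")
```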
Things to consider under those circumstances are the properties the consensus mechanism manifests that can help you to recover from them (even if with some “cost”) or to dissuade actors from doing it in the first place.
The former is *really* difficult to do and is part of the Cassandra research I’ve been doing, which focuses on the notion that prevention/detection is better than cure. Ultimately, if you get a safety break, you’re in trouble (a 51% attack on probabilistic Nakamoto consensus == a safety break on deterministic multi-decree consensus).
The latter is about thinking like the adversary and gauging what the reward/cost relationship looks like.
Without getting too deep, a 51% attack on Bitcoin, for example, is potentially profitable multiple times, as the mining hardware has no “identity” on the network. If there IS an attack, there is zero way to determine where it came from, thus if you want the network to live on, you potentially still have the enemy within. I can then just perform multiple 51% attacks and ride the network all the way down to zero value, profiting on each iteration.
In something like Cerberus, once the attack is performed, you know explicitly who is involved and can punish/penalize/restrict them to prevent future attacks from them at no additional cost. That “one-time opportunity” becomes interesting from a game theory perspective because the attacker really needs to make it count, as they have one chance.
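As a toy illustration of that reward/cost asymmetry, here is a crude expected-value comparison. Every number is invented purely to show the shape of the argument, and modelling punishment as forfeiting the attacker’s stake is an assumption for illustration rather than a statement of exact protocol rules:

```python
# Crude comparison of the attack economics described above. Every number is
# made up for illustration; none are estimates of a real network.

attack_cost = 10       # cost of mounting one attack (arbitrary units)
attack_reward = 25     # loot from one successful attack

# Anonymous hash power (Nakamoto-style): the attacker keeps their hardware and
# anonymity, so they can strike repeatedly and profits compound.
repeats = 5
anonymous_profit = repeats * (attack_reward - attack_cost)

# Identified validators (Cerberus-style): one attack, after which they are
# punished/ejected -- modelled here as forfeiting their stake.
stake_at_risk = 100
identified_profit = attack_reward - attack_cost - stake_at_risk

print(f"anonymous attacker over {repeats} attacks: {anonymous_profit:+d}")
print(f"identified attacker, one shot:            {identified_profit:+d}")
```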
I can imagine edge cases where an epoch is just too long. For example, an exigent event overloads the ability of those nodes servicing a shard group to keep up with demand. Or worse, a black swan event (like a solar flare causing widespread EMP damage) puts many of the validators out of service.
If this were to happen close to the start of an epoch, just waiting two weeks (or even just four days) for a new epoch before the network recovers is unacceptable. One “obvious” solution is just to terminate the current epoch immediately and start a new one. But this is non-trivial to do on a permissionless network without introducing new attack vectors. Starting a new epoch forces the code through the start-of-epoch logic, loosely analogous to “if it isn’t working right, reboot it,” but without human intervention.
Are there any plans to implement such a black-swan recovery strategy?
This question, while asking about epochs, is actually exciting at a much deeper level. What this is really touching on are liveness breaks.
Say there is an event that causes a liveness break, and we wish the network to “reboot,” as the OP put it. If the remaining validators reboot autonomously, what if some of them don’t? What if those that don’t are eclipsed by an adversary, and for them, the network looks just fine?
So you need the remaining validators to come to an agreement to reboot ... no agreement, no reboot.
If, say, half of the validators suddenly vanished or became unresponsive, that poses a problem for the agreement. For any agreement to be valid, 67 of the 100 validators would need to vote, but we only have 50, so even if everyone left voted yes, it’s not enough.
Ok, so let’s first agree on who those 50 are, then agree to reboot … the same problem, you need 67 to agree on the 50.
Let’s have a consensus by conference call with all the validators, and we all agree to press a button on our validators that changes the required votes from 67 to 34 ... wonderful, we’re back in business and it’s still somewhat decentralized, even if not ideal. Still broken ... why?
Carol joins the network as a potential new validator just after the conference call ends. She is syncing her validator so she can stake to it. Her validator gets to the liveness break, and suddenly 50 of the validators are agreeing to things with only 34 as the minimum participation? Her validator is now stuck as it doesn’t know the conference call happened.
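A quick sketch of the quorum arithmetic behind that scenario (purely illustrative; this is not the actual Cassandra recovery mechanism):

```python
# Quorum arithmetic for the "reboot" scenario above (illustrative only).

total_validators = 100
quorum = 67                   # 2f + 1 with f = 33
responsive = 50               # half the validator set has vanished

# Even a unanimous "yes" from the 50 survivors cannot form a quorum, so the
# network cannot agree, in-protocol, to reboot.
print(responsive >= quorum)   # False

# Out-of-band fix from the conference call: lower the threshold to 34.
emergency_quorum = 34
print(responsive >= emergency_quorum)   # True for everyone who was on the call

# Carol's freshly syncing validator never saw the call and still requires 67
# votes, so every post-"reboot" agreement looks invalid to her and she is
# stuck -- which is why any threshold change has to be agreed in-protocol.
```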
However, I refer you back to Cassandra and this specific problem that I have spent over a year building a solution for. We actually tested the solution to these kinds of liveness issues a few times already, in quite extreme configurations, too, might I add, and it performed as expected.
Of particular note is that I believe it’s the only solution invented so far which can resolve liveness breaks in a truly permissionless way. I spent a lot of time looking, and it’s basically an empty field of research right now. The closest thing to what we have is Eth’s finality gadget (and similar approaches), which are MUCH more complex and a lot LESS flexible.
Unfortunately, that solution isn’t going to be battle-tested enough for the Alexandria release, but it will be present in a later release to deal with liveness issues in these extreme circumstances. So, for now, we’re left using more “analog” solutions such as the foundation staking and acting as the “safety of last resort.”
Radix claims to scale on-demand. I don’t dispute that it does, but isn’t the scaling only able to come into effect after the current epoch has ended?
Let’s imagine usage increases by 5% and capacity is at 100% before the 5% increase in usage. Is there a way to scale up during the current epoch, or does the network have to wait until the next epoch?
Extra considerations:
1) Likeliness of an event where more than the allocated capacity is needed (there might be buffers).
2) What happens in case such an event does happen?
A 5% increase in throughput demand isn’t anything to worry about, as validators won’t (and actually can’t) run at 100%.
For example, if the network is configured such that all validators are utilized to 100% at all times, that poses a problem for new validators that want to join or those that have been offline for a while. How will they ever catch up and be “in sync”?
Generally, we want validators to sit at < 20% utilization constantly, leaving plenty of resources left over for large demand spikes, housekeeping, and various other things.
Of course hardware capabilities can increase by multiples over the years, so there needs to be some kind of “constant” we can use to signal that the shard topology needs to change.
For that, you can rely on the speed of light, as given a set of n validators, the message complexity for each transaction can be calculated. Consequently, so can the absolute minimum time required for finality. That can then be compared to the average time of finality within shard groups, and if there is a large discrepancy, adjust.
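A rough sketch of that comparison follows. The three-round message pattern and the distance figure are simplifying assumptions for illustration, not the real Cerberus message complexity:

```python
# Using a physics-derived floor on finality as the "constant" described above.
# The round count and distances are assumptions, not Cerberus's actual values.

SPEED_OF_LIGHT_KM_S = 300_000
avg_validator_distance_km = 6_000   # assumed average pairwise distance
consensus_round_trips = 3           # assumed voting phases per commit

# Absolute minimum time to finality imposed by the speed of light for this
# message pattern:
min_finality_s = consensus_round_trips * (2 * avg_validator_distance_km
                                          / SPEED_OF_LIGHT_KM_S)

observed_finality_s = 1.8           # measured average within a shard group

# A large discrepancy means the shard group's validators, not physics, are the
# limiting factor, signalling that the shard topology should change.
ratio = observed_finality_s / min_finality_s
print(f"floor={min_finality_s:.2f}s, observed={observed_finality_s}s, "
      f"ratio={ratio:.0f}x -> {'adjust topology' if ratio > 10 else 'ok'}")
```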
In the Cassandra stream on Tuesday (May 11th), Dan presented a sharded website. He mentioned a big movie wouldn’t be possible at the moment (there wasn’t a filesystem in place) but would be if the movie were split into many smaller parts, each residing in a different shard with a different validator.
Reasoning further, why would or wouldn’t you build a sharded TCP/UDP/P2P system? That is, a network with a lot of small packages, each living on a specific shard with a specific validator (or multiple), where a header/pointer tells you exactly where to look. The system would (or could) be faster than other systems and could not easily be compromised.
This is one of those things that I’ve already mostly got specced out and know what I need to do, but it’s of low priority.
Essentially for something like a movie, you’d chop that movie up into a bunch of pieces equal in quantity to the # shard groups at the time it was presented to the network.
Then you perform large-ratio erasure coding on those pieces so that each validator holds 1/m of the pieces corresponding to the larger piece for its shard group.
To retrieve it, you need to ask n/m of the validators in that shard group to reconstruct that piece.
If there were 1000 shard groups, then a 1GB movie would be chopped up into 1MB chunks. Those chunks would then be encoded such that each validator stores approximately 22KB of data, with 66 of the 100 pieces needed to reconstruct the 1MB chunk.
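A quick sketch of that arithmetic with a generic k-of-n erasure code (the parameters mirror the example above; the exact per-validator size depends on the encoding’s overhead, and this is not the actual Cassandra scheme):

```python
# Chunking/erasure-coding arithmetic for the movie example above, using a
# generic k-of-n scheme. Numbers mirror the example and are approximate.

movie_bytes = 1_000_000_000     # ~1 GB movie
shard_groups = 1_000
validators_per_group = 100      # n: one piece per validator (assumed)
pieces_to_reconstruct = 66      # k: any 66 of the 100 pieces rebuild the chunk

chunk_bytes = movie_bytes // shard_groups           # ~1 MB per shard group
piece_bytes = chunk_bytes // pieces_to_reconstruct  # raw floor per validator;
                                                    # real encodings add overhead

redundancy = validators_per_group / pieces_to_reconstruct

print(f"chunk per shard group: {chunk_bytes / 1e6:.1f} MB")
print(f"piece per validator:  ~{piece_bytes / 1e3:.0f} KB (plus encoding overhead)")
print(f"storage overhead:      {redundancy:.2f}x; tolerates "
      f"{validators_per_group - pieces_to_reconstruct} missing validators per chunk")
```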
Done this way, you have lots of redundancy, and crucially, it can handle reconfigurations, new validators joining/leaving, etc., and maintain data integrity.
That covers all Dan had time to answer for this session. If you’re keen to learn more, you can find Dan’s previous AMAs simply by searching the #AMA tag in the Radix Telegram channel or reading the previous blog posts. If you would like to submit a question for future sessions, just post it in the official Telegram using the #AMA tag!
To stay up to date or learn more about Radix DLT, please follow the links below.
Join the Community:
Twitter: https://twitter.com/radixdlt
Telegram: https://t.me/radix_dlt
Reddit: www.reddit.com/r/Radix/
Discord: https://discord.com/invite/WkB2USt
Radix Resources:
Podcast: https://www.radixdlt.com/podcast
Blog: https://www.radixdlt.com/blog/
YouTube: https://www.youtube.com/c/RadixDLT