Welcome to the StellarVerse Forums!

Author Topic: Important: There was a ledger fork caused by consensus failure  (Read 17941 times)

jed

  • Stellar Development Foundation
  • Newbie
  • Posts: 9
  • Tech Interests: C++
Re: Important: There was a ledger fork caused by consensus failure
« Reply #15 on: December 08, 2014, 09:58:55 PM »
TL;DR version: The ripple paper and the code are not the same. The ripple code causes a node to determine “quorum” from the nodes it heard from in its last ledger close, not from its total UNL. This is likely what causes forks, and this code is live in both consensus systems.

I’m not trying to get into a back-and-forth of blame with Ripple Inc. We have an obligation to make clear to people what we have seen and what we believe the issues are. When we first reached out to Prof. Mazieres, it was to get an independent third-party review of the ripple algorithm from a respected computer scientist so we could be certain that it worked. This is what should be done for any complicated algorithm. Bitcoin has had its paper rigorously reviewed and has generally passed such review on a technical level. We wanted to do the same thing. Unfortunately for us, the algorithm did not pass Prof. Mazieres’s review, and we do not know of any distributed systems expert not employed by Ripple Inc who has reviewed the algorithm and thinks it works.

Background

We've seen the nodes exhibit a tendency to get out of sync since at least September. The network would split 3 or 4 ways and then come back together, relatively quickly and without loss. Last week’s fork was a case of this happening, but this time the ledger was not able to come back together quickly.

Let’s review the only commit that could be argued to change consensus.

https://github.com/stellar/stellard/commit/067d7158720331937fc782cbb230e8d422cd7341
This commit is the only change we made that could affect consensus. It was also deployed on only a minority of validators at the time of the fork.

Why this change has no impact on consensus

This simple change only causes a node to stop waiting if it realizes it is way behind the rest of the network. Waiting longer won't do anything positive:
* The majority of the network has already moved on to a different consensus phase.
* updateposition was already called, so the instance has already learned what it can from its peers.
* Waiting longer for other positions actually increases the chance of divergence (in the best case some of the positions the instance sees are from the majority; otherwise it's just random stuff). David Schwartz says as much here: “This may mean we occasionally are forked from the main ledger chain, but that's perfectly fine.” https://github.com/stellar/stellard/pull/176#issuecomment-64780903
* This partition will hit a timeout anyway that causes it to advance (see below), before it even sees that the majority network is proposing again.

What causes forks

We were running 7 validating nodes, all connected to each other. The validation_quorum for each node was set to 4. Despite this, the system got out of sync, most likely triggered by the existing ripple code below (per our log files).

LedgerTiming.cpp:157 (in stellard), LedgerTiming.cpp:121 (in rippled):
if (currentAgreeTime < (previousAgreeTime + LEDGER_MIN_CONSENSUS))

This code ignores the number of participants when, for some reason, the node missed proposals from other peers. This contradicts the Ripple paper, so pointing to the paper as the answer does not explain the issues: the code and the paper do not match.
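
To make that concrete, here is a minimal sketch of how a check built around the quoted comparison can behave. It is not the stellard or rippled source: the function name haveConsensusSketch, the 3/4 "thin round" ratio, the 80% agreement rule and the value given to LEDGER_MIN_CONSENSUS are all assumptions for illustration. Only the inner timing comparison comes from the line quoted above.

```cpp
// Placeholder value for illustration only (assumed, in milliseconds).
constexpr int LEDGER_MIN_CONSENSUS = 2000;

// Hypothetical, simplified consensus check. Proposer counts exclude the
// node itself. The 3/4 and 80% thresholds are assumptions, not quotes.
bool haveConsensusSketch(int previousProposers, // heard in the last close
                         int currentProposers,  // heard so far this round
                         int currentAgree,      // of those, agreeing with us
                         int previousAgreeTime, // ms the last round took
                         int currentAgreeTime)  // ms spent on this round
{
    if (currentProposers < previousProposers * 3 / 4)
    {
        // Few proposers heard: the only consequence is a longer wait.
        // This is the comparison quoted in the post.
        if (currentAgreeTime < (previousAgreeTime + LEDGER_MIN_CONSENSUS))
            return false;
    }

    // Agreement is measured against the proposers heard this round (plus
    // ourselves). The size of the UNL never enters the calculation.
    return (currentAgree * 100 + 100) / (currentProposers + 1) > 80;
}
```

Under these assumptions, a node that heard 6 peers last round and none this round keeps waiting while haveConsensusSketch(6, 0, 0, 3000, 4000) is evaluated, but haveConsensusSketch(6, 0, 0, 3000, 5000) returns true: once the extra wait has elapsed, the node agrees with itself and closes a ledger alone.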

When a fork like this occurs, the minority partition doesn't have enough validations to take over the network but still closes ledgers (while the majority network carries on, validations and all). At some point later, the majority network has a glitch that causes some of its participants to drop out of consensus and not rejoin. Those nodes, once caught up, may then decide to join the wrong network (in this case, if the former majority network does not look like it has a majority from an LCL point of view).

The interesting thing that happens at that point is that this new majority network (which has been closing ledgers for some time, but without validating them) may have enough participants to cross the validation threshold. When this happens, we end up with gaps in history, because from that fork's point of view the previous fully validated ledger dates back to the time the fork occurred.
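
The history gap can be illustrated with a small model. This is a sketch under stated assumptions, not stellard code: the Fork struct, closeLedger and unvalidatedBacklog are hypothetical names, and the ledger sequence numbers and the 3-node minority are made up for the example. Only the quorum of 4 (out of 7 validators) comes from the setup described above.

```cpp
constexpr int kValidationQuorum = 4; // validation_quorum from the post

// Hypothetical model of one side of a fork.
struct Fork
{
    int nodes;            // validators currently following this fork
    int lastClosedSeq;    // most recent ledger closed on this fork
    int lastValidatedSeq; // most recent ledger that reached the quorum
};

// Close one ledger. It counts as validated only if this fork currently
// holds at least kValidationQuorum validators.
Fork closeLedger(Fork f)
{
    f.lastClosedSeq += 1;
    if (f.nodes >= kValidationQuorum)
        f.lastValidatedSeq = f.lastClosedSeq;
    return f;
}

// Ledgers closed on this fork since its last validated ledger.
int unvalidatedBacklog(const Fork& f)
{
    return f.lastClosedSeq - f.lastValidatedSeq;
}
```

For example, a 3-node minority that forks at ledger 100 and closes five ledgers ends at lastClosedSeq 105 with lastValidatedSeq still 100. If one node from the other side then joins it, the next close validates ledger 106 directly, and ledgers 101 through 105 on that fork were never validated. That jump is the gap in history.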

The main misunderstanding about how this code works comes from the fact that "previousProposers" is not the UNL. It is just the subset of the UNL that participated in the last consensus round, which after a timeout can be any number between 0 and the actual size of the UNL.
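
A short sketch shows why a baseline that follows the last round behaves differently from one fixed at the UNL size. Everything here is illustrative: countThinRounds is a hypothetical helper, the 3/4 ratio is an assumption, and the per-round proposer counts are invented.

```cpp
#include <vector>

// Count the rounds flagged as "thin" (too few proposers heard).
// With drifting == true the baseline is whatever was heard in the previous
// round, which is how the post describes previousProposers. With
// drifting == false the baseline stays fixed at unlSize.
int countThinRounds(const std::vector<int>& heardPerRound,
                    int unlSize, bool drifting)
{
    int baseline = unlSize;
    int thin = 0;
    for (int heard : heardPerRound)
    {
        if (heard < baseline * 3 / 4) // assumed ratio
            ++thin;
        if (drifting)
            baseline = heard; // next round compares against this round
    }
    return thin;
}
```

With 6 peers on the UNL and a node that hears 6, 4, 3, 2 and then 1 proposers in successive rounds, countThinRounds({6, 4, 3, 2, 1}, 6, true) returns 0 while countThinRounds({6, 4, 3, 2, 1}, 6, false) returns 3. A gradual loss of peers never trips a check whose baseline erodes along with it.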

Not the big issue

To rely on a complicated system like this, you need a detailed specification for implementation, a paper that meets industry standards and a thorough proof (a mathematical proof, that is). These are all lacking for the ripple consensus algorithm.

A lot of us have invested time and other significant resources into these systems, so it is hard to accept that they might not be all they are cracked up to be. And believe me, I’d *much* rather not be rewriting stellard, but given the risks that our distributed systems experts have found, we must make the code safe. From here, the stellard team and I are focusing all our efforts on getting it right.

Jed.

mmathias

  • Global Moderator
  • Newbie
  • Posts: 37
  • Tech Interests: C++, PHP, Backend
  • Twitter: @mmathias_mmint
Re: Important: There was a ledger fork caused by consensus failure
« Reply #16 on: December 08, 2014, 10:39:56 PM »
Thank you for your explanation, Jed!

Is there a time frame for when the new stellard will be completed?

And does this change the time frame for the Bitcoin and XRP giveaways?
¯\_(ツ)_/¯
Support StellarVerse: gLVXANgQtoNPaK9Nr4egdvh36jqJ9LMG1A

celticwarrior72

  • Newbie
  • Posts: 8
    • Coinist
  • Tech Interests: Ex Trader | Founder of Coinist | President of IRBA | Bassist
  • Twitter: @_coinist
Re: Important: There was a ledger fork caused by consensus failure
« Reply #17 on: December 09, 2014, 03:07:19 AM »

Quote from mmathias: "And does this change the time frame for the Bitcoin and XRP giveaways?"

That's a fair question.  Many of us were affected by the May 22nd announcement of 'Selling my XRP'.

Maybe there is an opportunity to revisit the distribution mechanism for the XRP under the terms of the giveaway too? I think there are a couple of tweaks that might improve its fairness.