Overview
On May 11th and 12th, the Ethereum consensus layer experienced two temporary anomalies. imToken's analysis suggests that the main cause was an overload on several Ethereum consensus layer client nodes, which caused their validators to go offline.
As a direct result, validator votes in the affected Epochs fell short of the 2/3 threshold, so the consensus layer could not confirm finality. However, the Ethereum network quickly recovered on its own, which imToken believes demonstrates the resilience and self-healing capabilities of the Ethereum PoS consensus algorithm.
Event and Background
Under normal circumstances, the Ethereum PoS consensus network status would be finalized within 2 Epochs. However, last week, there were two instances of delayed Epoch finalization.
The first instance occurred on May 11th, with the Epoch finalization being delayed by 3 Epochs, approximately 20 minutes.
The second instance occurred on May 12th, with the Epoch finalization being delayed by 8 Epochs, approximately 51 minutes.
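The durations above follow from the mainnet timing parameters of 32 slots per Epoch and 12 seconds per slot, which make one Epoch last 6.4 minutes. A small Python sketch of the arithmetic (the function name is ours, for illustration only):

```python
SECONDS_PER_SLOT = 12  # mainnet slot time
SLOTS_PER_EPOCH = 32   # mainnet slots per Epoch


def epochs_to_minutes(epochs: int) -> float:
    """Convert a number of Epochs to wall-clock minutes."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60


for delay in (1, 3, 8):
    print(f"{delay} Epoch(s) = {epochs_to_minutes(delay):.1f} minutes")
# 1 Epoch(s) = 6.4 minutes
# 3 Epoch(s) = 19.2 minutes
# 8 Epoch(s) = 51.2 minutes
```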
During the event, the Ethereum network generated blocks and processed transactions normally. However, due to a low voting rate from validators, the Epoch could not be finalized (i.e., the Epoch did not receive the Ethereum PoS network consensus-level security guarantee).
An unfinalized Epoch means that if most of the validators act maliciously and cause a fork, the epoch could potentially be rolled back, resulting in transactions being rolled back as well.
No forks occurred on the Ethereum network during the event, and validators did not cast malicious votes. The inability to finalize the Epoch was solely due to many validators going offline, resulting in a low voting rate.
Upon further examination, it was found that the validators experienced CPU overloads, causing them to go offline.
In the second event, Epoch finalization was delayed by 8 Epochs. Because the finalization delay exceeded MIN_EPOCHS_TO_INACTIVITY_PENALTY (= 4), this triggered the Inactivity Leak mechanism of the Ethereum consensus algorithm, with the following effects:
- Penalties were imposed on offline validators, reducing their staked funds by approximately 28 ETH in total.
- Attestation rewards were canceled, resulting in approximately 50 ETH not being issued.
- This mechanism ensures that online Validators ultimately control 2/3 of the total staked funds in Ethereum, allowing the network status to be finalized eventually.
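The trigger condition described above can be sketched as follows. The two helpers are modeled on get_finality_delay and is_in_inactivity_leak from the consensus specification, simplified here to take plain integers instead of the beacon state. The epoch numbers in the example are hypothetical.

```python
MIN_EPOCHS_TO_INACTIVITY_PENALTY = 4  # value quoted in the text


def finality_delay(previous_epoch: int, finalized_epoch: int) -> int:
    """Number of Epochs between the previous Epoch and the last finalized one."""
    return previous_epoch - finalized_epoch


def is_in_inactivity_leak(previous_epoch: int, finalized_epoch: int) -> bool:
    """The leak starts only once the delay is strictly greater than the constant."""
    delay = finality_delay(previous_epoch, finalized_epoch)
    return delay > MIN_EPOCHS_TO_INACTIVITY_PENALTY


# Hypothetical epoch numbers: a delay of 4 is tolerated, a delay of 5 is not.
print(is_in_inactivity_leak(previous_epoch=1004, finalized_epoch=1000))  # False
print(is_in_inactivity_leak(previous_epoch=1005, finalized_epoch=1000))  # True
```

The strict inequality is why a short finalization delay does not start the leak, while a longer one, as in the second event, does.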
imToken's node services also detected this event. By monitoring Validator voting on the Ethereum consensus layer in real time, they provided early warning of abnormalities in the consensus network before the Epoch failed to finalize.
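As an illustration of this kind of early warning (not imToken's actual implementation), the sketch below classifies the voting weight of an Epoch against the 2/3 threshold. The function name, the status labels, and the 5-point warning margin are assumptions made for this example.

```python
def participation_status(attesting_gwei: int, total_active_gwei: int,
                         warn_margin_pct: int = 5) -> str:
    """Classify Epoch voting weight relative to the 2/3 supermajority.

    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    if total_active_gwei <= 0:
        raise ValueError("total_active_gwei must be positive")
    # Below 2/3 of the active stake: the Epoch cannot be justified.
    if attesting_gwei * 3 < total_active_gwei * 2:
        return "critical"
    # Within warn_margin_pct percentage points above 2/3: raise an early warning.
    if attesting_gwei * 300 < total_active_gwei * (200 + 3 * warn_margin_pct):
        return "warning"
    return "ok"


# Hypothetical stake figures, in arbitrary units.
print(participation_status(800, 1000))  # ok
print(participation_status(700, 1000))  # warning
print(participation_status(660, 1000))  # critical
```

Alerting in the "warning" band is what lets a monitor react before finality is actually lost.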
Under the PoW mechanism, the success of a transaction is determined by the probability that it will not be rolled back after a certain number of consecutive blocks. In PoS, the success of a transaction is determined by the block height returned by Safe Head. The current specification uses the Justified Checkpoint as the state recognized by Safe Head. Because this reflects the state of the previous Epoch, confirmation can lag by up to 6.4 minutes (one Epoch), which is a poor user experience.
imToken's self-developed Safe Head service calculates safe blocks for transaction confirmation based on real-time Ethereum consensus layer data, shortening transaction confirmation time while ensuring user transaction security. Under normal circumstances, the block height returned by imToken's Safe Head algorithm (as shown in the yellow section in the image above) is close to the latest block height (green), thus improving user experience.
Cause Analysis
The direct cause of the events mentioned above is that several Ethereum consensus layer client nodes were overloaded, causing validators to go offline and disrupting the normal consensus voting process. After analysis, the reasons for the overload of these nodes are:
- When receiving attestations pointing to stale blocks, nodes must recompute the Beacon Chain state to validate these attestations, which consumes a large amount of CPU and memory resources.
- When a large number of attestations pointing to stale blocks are received simultaneously, the nodes' CPU and memory resources are exhausted, causing the validators to go offline.
Caching states by the block that an attestation points to would normally absorb this cost. However, with the growing validator set and a flood of such attestations, the caches of the affected clients were overwhelmed, forcing nodes to spend large amounts of resources recomputing the Beacon Chain state.
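The cache failure mode described above can be illustrated with a toy least-recently-used cache. This is not the Prysm or Teku implementation: the class, the capacity, and the block roots are hypothetical, and the "state" is a placeholder string standing in for an expensive recomputation.

```python
from collections import OrderedDict


class StateCache:
    """Tiny LRU cache keyed by block root that counts expensive recomputations."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.recomputations = 0
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def state_for(self, block_root: str) -> str:
        if block_root in self._entries:
            self._entries.move_to_end(block_root)  # mark as recently used
            return self._entries[block_root]
        self.recomputations += 1  # cache miss: state must be rebuilt
        state = f"state@{block_root}"  # placeholder for the real replay
        self._entries[block_root] = state
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return state


def count_recomputations(capacity: int, attestation_roots: list) -> int:
    cache = StateCache(capacity)
    for root in attestation_roots:
        cache.state_for(root)
    return cache.recomputations


# 100 attestations over 2 distinct roots fit the cache: 2 recomputations.
print(count_recomputations(2, ["a", "b"] * 50))
# 150 attestations over 3 distinct roots thrash it: 150 recomputations.
print(count_recomputations(2, ["a", "b", "c"] * 50))
```

Once the number of distinct stale roots exceeds the cache capacity, every attestation in this toy model becomes a miss.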
The consensus layer clients Teku and Prysm have released patch versions to address this issue. Specifically, the patch version of the client implementation filters out these stale attestations, ignoring the attestations when the following conditions are met:
- The attestation points to a stale slot.
- The attestation points to a checkpoint the node has never seen before.
However, we still need to continuously observe the Beacon Chain finalization situation to confirm the patch’s effectiveness.
The consensus layer client patch versions for Teku and Prysm include the following:
- Prysm: v4.0.3-hotfix
- Teku: v23.5.0
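The filtering rule that the patches are described as applying can be sketched as follows. This is not the Prysm or Teku code: the data class, the function name, the max_slot_age cutoff, and the reading that both conditions must hold together are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    block_slot: int   # slot of the block the attestation points to
    target_root: str  # root of the checkpoint the attestation targets


def should_ignore(att: Attestation, current_slot: int,
                  known_checkpoints: set, max_slot_age: int = 32) -> bool:
    """Ignore an attestation that is both stale and targets an unseen checkpoint."""
    is_stale = current_slot - att.block_slot > max_slot_age  # hypothetical cutoff
    is_unknown = att.target_root not in known_checkpoints
    return is_stale and is_unknown


known = {"0xold"}
print(should_ignore(Attestation(100, "0xnew"), 200, known))  # True
print(should_ignore(Attestation(190, "0xnew"), 200, known))  # False
print(should_ignore(Attestation(100, "0xold"), 200, known))  # False
```

Dropping such attestations before validation means the node never has to rebuild a Beacon Chain state for them.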
Ethereum Design Advantages
During the event, Ethereum ensured availability by continuously generating blocks and processing transactions while only delaying Epoch finalization. The key reasons for this were:
- The diversity of Ethereum clients.
- The design of the Gasper algorithm.
Ethereum Client Diversity
Although the consensus layer client implementations of Teku and Prysm had issues, the normal operation of other consensus layer clients was not affected. For example, the Lighthouse client was not affected by the event. Due to the differences in implementation design among various clients, some validators continued operating normally.
The diversity of Ethereum clients ensures that even if some clients encounter issues (even leading to an Epoch not being finalized), it will not affect the normal clients generating blocks and processing transactions, keeping Ethereum's availability intact.
Ethereum Gasper Consensus Algorithm for Availability
One of the starting points for the design of Gasper, the Ethereum consensus algorithm, is ensuring the availability of Ethereum. The design separates block production from finalization, so block production does not stop even if finalization is hindered. In most cases, finalization eventually recovers (produced blocks are ultimately finalized), so the impact on users is relatively low. In contrast, in many other Byzantine Fault Tolerance (BFT) consensus algorithms, if block finalization fails, consensus nodes stop producing the next blocks and the entire blockchain becomes unavailable, commonly referred to as "blockchain crashing."
The second event also triggered the Inactivity Leak mechanism, which is mainly designed to ensure that Ethereum can still finalize blocks under extreme conditions (large numbers of Validators offline for extended periods).
Experience and Lessons Learned
Challenges of Ethereum Multi-Client Diversity
Currently, the diversity of Ethereum clients is as follows:
Source: https://clientdiversity.org/#distribution
As seen, the Ethereum client diversity still needs to be promoted and publicized. If the client implementations were diverse enough so that the shares of Prysm and Teku were less than ⅓, the event might not have occurred (⅔ of clients operating normally would be sufficient to finalize the Epoch). In addition, the current execution layer clients are concentrated on Geth, with a share of up to 61%. This actually poses potential risks: if Geth does not operate properly, Ethereum will be affected significantly.
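The threshold argument above can be made concrete with a short sketch. The client names and shares below are invented for illustration and are not the figures from clientdiversity.org. The model also assumes that a client's share approximates its share of staked funds and that every validator on a faulty client is offline.

```python
from fractions import Fraction


def can_still_finalize(shares: dict, faulty: set) -> bool:
    """True if the clients that keep working hold at least 2/3 of the weight."""
    healthy = sum(share for client, share in shares.items()
                  if client not in faulty)
    return healthy >= Fraction(2, 3)


# Hypothetical distribution, not real measurements.
shares = {
    "client_a": Fraction(40, 100),
    "client_b": Fraction(35, 100),
    "client_c": Fraction(25, 100),
}

print(can_still_finalize(shares, faulty={"client_c"}))  # True: 75% remain
print(can_still_finalize(shares, faulty={"client_a"}))  # False: only 60% remain
```

Under these assumptions, finality survives the failure of any set of clients whose combined share stays at or below 1/3.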
Apart from the need for further efforts in Ethereum client diversity, the event also exposed a pain point in switching Ethereum clients: how validators can move to a properly functioning client implementation when one client encounters problems.
The process involves safely migrating the validator key from the problematic client to the normal client. Because of Ethereum's consensus slashing rules, the migration must eliminate inconsistencies between the old and new clients' behaviors, such as:
- Voting on different sides of a forked checkpoint, resulting in slashing
- Producing different blocks in the same slot, resulting in slashing
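The two slashable behaviors listed above can be sketched as simple checks. The attestation check is modeled on is_slashable_attestation_data from the consensus specification, reduced to the fields needed here. The Vote class, the function names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    source_epoch: int
    target_epoch: int
    target_root: str


def is_slashable_vote_pair(a: Vote, b: Vote) -> bool:
    """True if signing both votes with one key would be slashable."""
    # Double vote: two different votes for the same target Epoch.
    double_vote = a != b and a.target_epoch == b.target_epoch
    # Surround vote: vote a surrounds vote b.
    surround_vote = (a.source_epoch < b.source_epoch
                     and b.target_epoch < a.target_epoch)
    return double_vote or surround_vote


def is_double_proposal(slot_a: int, root_a: str,
                       slot_b: int, root_b: str) -> bool:
    """True if two different blocks were proposed for the same slot."""
    return slot_a == slot_b and root_a != root_b


# The old and new client vote for different forks in the same Epoch.
print(is_slashable_vote_pair(Vote(10, 11, "0xaa"), Vote(10, 11, "0xbb")))  # True
# The same vote repeated is not slashable.
print(is_slashable_vote_pair(Vote(10, 11, "0xaa"), Vote(10, 11, "0xaa")))  # False
print(is_double_proposal(320, "0x01", 320, "0x02"))  # True
```

A safe migration has to guarantee that the old and new clients never produce a pair of messages for which either check returns True.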
Monitoring Ethereum Consensus
Services similar to Safe Head are needed to continuously monitor the real-time status of the Ethereum PoS network and to detect and alert on such events in advance, rather than discovering abnormalities only after an Epoch fails to finalize as expected. The latest research on this can be found in this article:
Popularizing Ethereum Consensus Algorithm
This event exposed the necessity of popularizing the Ethereum PoS consensus mechanism. Many users mistakenly thought that "Ethereum crashed" during this event, causing unnecessary panic.
However, in reality, the Ethereum network continued to generate blocks and process transactions. The combination of Ethereum's consensus layer and execution layer provides double protection for Ethereum transaction confirmations.
The processing of blocks at the execution layer is not affected even if the consensus layer Epoch cannot be finalized. Abnormal Epoch finalization is also anticipated and handled by the Ethereum consensus algorithm. Popularizing blockchain knowledge for users is still a direction practitioners should continue to work on.
Implications for Ethereum Applications
Although the Ethereum network is robust enough, occasional instability can have some impact on applications. At the same time, applications need to handle these unstable scenarios correctly.
- The deposit time from L1 to L2 will be extended. When minting on L2, an important prerequisite is to ensure that the L1 deposit transaction will not be rolled back. Therefore, when the Ethereum network Epoch finalization is delayed, the deposit time from L1 to L2 will also be extended accordingly.
- Similarly, exchanges should also prevent on-chain deposit transactions from being rolled back, so their deposit times will also be extended accordingly.
- On-chain oracle quotes risk being rolled back. Therefore, high-value services that rely on them should be paused temporarily.
- During this event, Uniswap did not display balances and only allowed purchases, not sales, while dYdX suspended deposits.
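One way an application can guard deposits against rollback is to credit them only once their block is at or below the finalized block. The sketch below builds a standard eth_getBlockByNumber JSON-RPC request using the "finalized" block tag and applies that policy. The function names and block numbers are hypothetical, and sending the request to a node is left out.

```python
import json


def finalized_block_request(request_id: int = 1) -> str:
    """JSON-RPC body asking an execution client for the latest finalized block."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        "params": ["finalized", False],  # False: omit full transaction objects
    })


def can_credit_deposit(deposit_block: int, finalized_block_hex: str) -> bool:
    """Credit only deposits included at or below the finalized block."""
    return deposit_block <= int(finalized_block_hex, 16)


# Hypothetical numbers: the node reports finalized block 0x64 (100).
print(can_credit_deposit(100, "0x64"))  # True
print(can_credit_deposit(101, "0x64"))  # False
```

With this policy, a finalization delay automatically extends the deposit time, which matches the behavior described for L2 bridges and exchanges above.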
Conclusion
In this event, we can see the resilience and self-healing ability of the Ethereum PoS consensus algorithm and the quick response and error correction of clients after an incident occurs.
For the entire Ethereum ecosystem, continuous investment is needed in the following aspects: increasing client diversity, optimizing real-time monitoring and early warning of network status, in-depth user education (not only for ordinary users but also for practitioners), and emergency preparedness for ecosystem participants in case of network anomalies.
References
- Finality issue updates May 2023
- https://twitter.com/robplust/status/1657044364382846978
- https://twitter.com/superphiz/status/1656780594326405121
- https://twitter.com/terencechain/status/1657021042110631936