The Design and Implementation of the Conflux Consensus Layer
The consensus layer of Conflux processes all blocks received from the synchronization layer, produces a total order of blocks according to the Conflux GHAST consensus algorithm, and invokes the underlying transaction execution engine to run transactions in the determined order. It provides the necessary information to assist the block generator in preparing the skeleton of new blocks. It also notifies the transaction pool about processed transactions so that the pool can make better transaction selection decisions.
This document aims to provide a high-level overview for readers who want to understand the Rust implementation of the Conflux consensus layer (in the directory core/src/consensus). For more implementation details, please check the inline comments in the code. For more information about the Conflux consensus algorithm, refer to the Conflux protocol specification and the Conflux paper (https://arxiv.org/abs/1805.03870).
Design Goals
The consensus layer has the following design goals.
- Process new blocks in the background following the consensus algorithm.
- Minimize the memory usage of each block in the consensus graph. Even with the checkpoint mechanism, the graph contains 300K-500K blocks under normal circumstances and may contain more than 1M blocks under liveness attacks. This may put pressure on memory.
- Process each block quickly. Because a full/archive node has to process every block starting from the original genesis when synchronizing with the network from scratch, fast block processing is important to shorten the required time.
- Be robust against potential attacks. A malicious attacker may generate malicious blocks at arbitrary positions in the TreeGraph.
Structure and Components
ConsensusGraph
ConsensusGraph (core/src/consensus/mod.rs) is the main struct of the consensus layer. The synchronization layer constructs a ConsensusGraph with a BlockDataManager that stores all block metadata. ConsensusGraph::on_new_block() is the key function for sending new blocks to the ConsensusGraph struct for processing. The struct also provides a set of public functions to query the status of blocks/transactions. This should be the main interface through which other components interact with the consensus layer.
ConsensusGraphInner
ConsensusGraphInner (core/src/consensus/consensus_inner/mod.rs) is the inner struct of ConsensusGraph. ConsensusGraph::on_new_block() acquires the write lock of the inner struct at the start of the function. The remaining query functions only acquire the read lock.
The internal structure of ConsensusGraphInner is fairly complicated. Generally speaking, it maintains two kinds of information. The first kind is the state of the whole TreeGraph, i.e., the current pivot chain, timer chain, difficulty, and so on. The second kind is the state of each block (i.e., the ConsensusGraphNode struct of each block).
Each block corresponds to a ConsensusGraphNode struct that holds its information. When a block first enters ConsensusGraphInner, it is inserted into ConsensusGraphInner::arena : Slab<ConsensusGraphNode>. The index in the slab becomes the arena index of the block in ConsensusGraphInner. We use the arena index instead of H256 to represent a block internally because it is much cheaper. We will refer back to the fields in ConsensusGraphInner and ConsensusGraphNode when we talk about the algorithmic mechanisms and their implementations.
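As a small illustration of this pattern, the sketch below inserts blocks into a slab and uses the returned index as the arena index. It uses the slab crate (as the arena type above suggests); the ConsensusGraphNode fields shown here are illustrative, not the real layout.

```rust
// Arena-index pattern: store blocks in a slab and refer to them by the
// returned usize index instead of by their 32-byte hash.
use slab::Slab;

struct ConsensusGraphNode {
    hash: [u8; 32],        // stand-in for H256
    parent: Option<usize>, // arena index of the parent block
}

fn main() {
    let mut arena: Slab<ConsensusGraphNode> = Slab::new();
    let genesis = arena.insert(ConsensusGraphNode { hash: [0; 32], parent: None });
    // The slab index is the block's arena index: a usize is far cheaper to
    // store and compare than a full H256.
    let child = arena.insert(ConsensusGraphNode {
        hash: [1; 32],
        parent: Some(genesis),
    });
    assert_eq!(arena[child].parent, Some(genesis));
}
```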
ConsensusNewBlockHandler
ConsensusNewBlockHandler (core/src/consensus/consensus_inner/consensus_new_block_handler.rs) contains a set of routines for processing a new block. In theory, this code could be part of ConsensusGraphInner because it mostly manipulates the inner struct. However, these routines are all subroutines of on_new_block(), and consensus_inner/mod.rs is already very complicated. We therefore decided to put them into a separate file.
ConsensusExecutor
ConsensusExecutor (core/src/consensus/consensus_inner/consensus_executor.rs) is the interface struct for the standalone transaction execution thread. ConsensusExecutor::enqueue_epoch() allows other threads to asynchronously send a task to execute the epoch of a given pivot chain block. Once the computation finishes, the resulting state root is stored into BlockDataManager. Other threads can call ConsensusExecutor::wait_for_result() to wait for the execution of an epoch if desired. In the current implementation, ConsensusExecutor also contains the routines for calculating block rewards, including get_reward_execution_info() and its subroutines.
ConfirmationMeter
ConfirmationMeter (core/src/consensus/consensus_inner/confirmation_meter.rs) conservatively calculates the confirmation risk of each pivot chain block. Its result is useful for the storage layer to determine when it is safe to discard old snapshots. It can also serve RPC queries about block confirmation if we decide to provide such an RPC.
AnticoneCache and PastsetCache
AnticoneCache (core/src/consensus/anticone_cache.rs) and PastsetCache (core/src/consensus/pastset_cache.rs) are two structs that implement customized caches for data structures in ConsensusGraphInner. In the implementation of the inner struct, we need to calculate and store the anticone set and the past set of each block. However, it is not possible to store all of these sets in memory. We therefore implement cache-style data structures that store the sets of recently inserted/accessed blocks. If an anticone/past set is not found in the cache, the current inner struct implementation recalculates the set.
Important Algorithmic Mechanisms
There are several important algorithmic mechanisms in the Conflux consensus layer. Here we discuss them from the implementation perspective. See XXX for the algorithmic reasoning behind them.
Pivot Chain and Total Order
The basic idea of the Conflux consensus algorithm is to first make every node agree on a pivot chain. It then expands the total order from the pivot chain to cover all blocks via a topological sort. As long as the pivot chain does not change/reorg, the total order of blocks stays the same, and so does the derived order of transactions.
Compared with Bitcoin/Ethereum, the consensus in Conflux has two key differences:
- Almost every block enters the total order, not just the blocks on the agreed pivot chain.
- Transaction validity and block validity are independent. For example, a transaction is invalid if it was already included before or if it cannot be carried out due to insufficient balance. Such invalid transactions become no-ops during execution. However, unlike in Bitcoin and Ethereum, blocks containing such transactions do not become invalid.
In ConsensusGraphInner, the arena indices of the current pivot chain blocks are stored in order in the pivot_chain[] vector. To maintain it, we calculate the lowest common ancestor (LCA) between the newly inserted block and the current best block following the GHAST rule. If the fork corresponding to the newly inserted block at the LCA ends up being heavier, we update pivot_chain[] from the fork point.
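The sketch below illustrates this update rule under simplifying assumptions: the tree queries (ancestor lookup, subtree weight, heaviest child) are passed in as closures standing in for ConsensusGraphInner and its link-cut trees, and the GHAST tie-breaking rules are omitted.

```rust
// Pivot chain update on a new block: if the new block's fork becomes heavier
// at the fork point, truncate pivot_chain[] there and re-extend it by
// repeatedly following the heaviest child. Illustrative sketch only.
fn maybe_update_pivot_chain(
    pivot_chain: &mut Vec<usize>, // pivot_chain[h] = arena index at height h
    new_block: usize,
    lca_with_best: usize, // LCA of new_block and the current best block
    ancestor_at: &dyn Fn(usize, usize) -> usize, // ancestor at a given height
    subtree_weight: &dyn Fn(usize) -> u128,
    heaviest_child: &dyn Fn(usize) -> Option<usize>,
) {
    let fork_height = pivot_chain.iter().position(|&b| b == lca_with_best).unwrap();
    let new_fork = ancestor_at(new_block, fork_height + 1);
    let old_fork = pivot_chain[fork_height + 1];
    if subtree_weight(new_fork) > subtree_weight(old_fork) {
        pivot_chain.truncate(fork_height + 1);
        let mut cur = *pivot_chain.last().unwrap();
        while let Some(next) = heaviest_child(cur) {
            pivot_chain.push(next);
            cur = next;
        }
    }
}

fn main() {
    // Toy TreeGraph: genesis 0 has children 1 (current pivot) and 2; a new
    // block 3 arrives under 2 and makes that fork heavier.
    let mut pivot_chain = vec![0, 1];
    let parent = [usize::MAX, 0, 0, 2];
    let weights = [4u128, 1, 1, 2];
    let height = |v: usize| match v { 0 => 0, 3 => 2, _ => 1 };
    let ancestor_at = move |mut v: usize, h: usize| {
        let mut d = height(v);
        while d > h {
            v = parent[v];
            d -= 1;
        }
        v
    };
    let subtree_weight = move |v: usize| -> u128 {
        (0..4usize)
            .filter(|&u| {
                let mut x = u;
                while x != v && parent[x] != usize::MAX {
                    x = parent[x];
                }
                x == v
            })
            .map(|u| weights[u])
            .sum()
    };
    let heaviest_child = |v: usize| -> Option<usize> {
        match v { 0 => Some(2), 2 => Some(3), _ => None }
    };
    maybe_update_pivot_chain(&mut pivot_chain, 3, 0, &ancestor_at, &subtree_weight, &heaviest_child);
    assert_eq!(pivot_chain, vec![0, 2, 3]);
}
```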
Timer Chain
Blocks whose PoW quality is at least timer_chain_difficulty_ratio times the target difficulty are timer blocks. The is_timer field of such a block is set to True. The consensus algorithm then finds the longest chain of timer blocks (more accurately, the one with the greatest accumulated difficulty), similar to the Bitcoin consensus algorithm finding the longest chain. The arena indices of this longest timer chain are stored in timer_chain[].
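A minimal sketch of this timer-block test follows; the field name is_timer comes from the text, while the ratio value and the surrounding types are illustrative.

```rust
// Mark a block as a timer block when its PoW quality reaches
// timer_chain_difficulty_ratio times the target difficulty.
const TIMER_CHAIN_DIFFICULTY_RATIO: u64 = 180; // illustrative value

struct BlockInfo {
    pow_quality: u64,
    is_timer: bool,
}

fn mark_timer_block(block: &mut BlockInfo, target_difficulty: u64) {
    block.is_timer = block.pow_quality >= TIMER_CHAIN_DIFFICULTY_RATIO * target_difficulty;
}

fn main() {
    let mut block = BlockInfo { pow_quality: 200_000, is_timer: false };
    mark_timer_block(&mut block, 1_000);
    assert!(block.is_timer);
}
```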
The rationale of the timer chain is to provide a coarse-grained measurement of time that cannot be influenced by a malicious attacker. Because timer blocks are rare and generated slowly (if timer_chain_difficulty_ratio is sufficiently high), a malicious attacker cannot prevent the growth of the timer chain unless it has the majority of the computation power. Therefore, the number of timer chain blocks in the past set of a block is a good indication of the latest possible generation time of that block. We compute this value for each block and store it in the timer_chain_height field of the block.
Weight Maintenance with Link-Cut Tree
To efficiently maintain the pivot chain, we need to query the total weight of a subtree. Conflux uses a Link-Cut Tree data structure to maintain the subtree weights in O(log n). The Link-Cut Tree can also calculate the LCA of any two nodes in the TreeGraph in O(log n). The weight_tree field in ConsensusGraphInner is the link-cut tree that stores the subtree weight of every node. Note that the implementation of the Link-Cut Tree is in the utils/link-cut-tree directory.
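To make the interface concrete, the stand-in below supports the two queries the consensus layer relies on (subtree weight and LCA) with naive O(n) walks; the real link-cut tree in utils/link-cut-tree answers both in O(log n), and the method names here are illustrative.

```rust
// Naive stand-in for the weight tree interface: O(n) walks instead of the
// real O(log n) link-cut tree operations.
struct NaiveWeightTree {
    parent: Vec<Option<usize>>,
    weight: Vec<i128>,
}

impl NaiveWeightTree {
    fn new() -> Self { Self { parent: vec![], weight: vec![] } }
    fn make_tree(&mut self, w: i128) -> usize {
        self.parent.push(None);
        self.weight.push(w);
        self.parent.len() - 1
    }
    fn link(&mut self, child: usize, parent: usize) {
        self.parent[child] = Some(parent);
    }
    /// Total weight of the subtree rooted at `v`.
    fn subtree_weight(&self, v: usize) -> i128 {
        (0..self.parent.len())
            .filter(|&u| self.is_ancestor(v, u))
            .map(|u| self.weight[u])
            .sum()
    }
    fn is_ancestor(&self, anc: usize, mut node: usize) -> bool {
        loop {
            if node == anc { return true; }
            match self.parent[node] {
                Some(p) => node = p,
                None => return false,
            }
        }
    }
    /// LCA of two nodes by climbing toward the root.
    fn lca(&self, a: usize, b: usize) -> usize {
        let mut anc = a;
        while !self.is_ancestor(anc, b) {
            anc = self.parent[anc].expect("nodes must share a root");
        }
        anc
    }
}

fn main() {
    let mut t = NaiveWeightTree::new();
    let root = t.make_tree(1);
    let a = t.make_tree(1);
    let b = t.make_tree(1);
    t.link(a, root);
    t.link(b, a);
    assert_eq!(t.subtree_weight(a), 2);
    assert_eq!(t.lca(b, root), root);
}
```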
Adaptive Weight
If the TreeGraph is under a liveness attack, it may fail to converge on one pivot block for a while. To handle this situation, the idea of the GHAST algorithm is to start generating adaptive blocks, i.e., blocks whose weights are redistributed significantly so that there are many zero-weight blocks together with a rare set of very heavy blocks. Specifically, if the PoW quality of an adaptive block is at least adaptive_heavy_block_ratio times the target difficulty, the block has a weight of adaptive_heavy_block_ratio times the target difficulty; otherwise, the block has a weight of zero. This effectively slows down confirmation temporarily but ensures that consensus makes progress.
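Under that reading, the weight rule looks roughly like the sketch below; the ratio constant is illustrative and non-adaptive blocks are assumed to carry the target difficulty as their weight.

```rust
// GHAST weight rule sketch: non-adaptive blocks carry their normal weight;
// adaptive blocks are either very heavy or weightless.
const ADAPTIVE_HEAVY_BLOCK_RATIO: u64 = 1_000; // illustrative value

fn block_weight(pow_quality: u64, target_difficulty: u64, adaptive: bool) -> u64 {
    if !adaptive {
        target_difficulty // normal blocks: weight equals the target difficulty
    } else if pow_quality >= ADAPTIVE_HEAVY_BLOCK_RATIO * target_difficulty {
        ADAPTIVE_HEAVY_BLOCK_RATIO * target_difficulty // heavy block
    } else {
        0 // all other adaptive blocks carry no weight
    }
}

fn main() {
    assert_eq!(block_weight(500, 100, false), 100);
    assert_eq!(block_weight(150_000, 100, true), 100_000);
    assert_eq!(block_weight(500, 100, true), 0);
}
```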
Because adaptive weight is a mechanism to defend against rare liveness attacks, it should not be turned on in the normal scenario. A new block is adaptive only if: 1) the subtree of one of its ancestor blocks still does not dominate the subtrees of its siblings, and 2) a significantly long period of time has passed between the generation of that ancestor block and the new block (i.e., the difference in timer_chain_height is sufficiently large). ConsensusGraphInner::adaptive_weight() and its subroutines implement the algorithm that determines whether a block is adaptive or not. Note that the implementation uses another link-cut tree, adaptive_tree, as a helper. Please see the inlined comments for the implementation details.
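A condensed sketch of the two-condition test is given below; the helpers, the beta constant, and the ancestor representation are all illustrative stand-ins for the real adaptive_weight() logic.

```rust
// Adaptivity test sketch: a new block is adaptive if some ancestor's subtree
// is still contested AND enough timer chain height has passed since then.
const TIMER_CHAIN_BETA: u64 = 240; // illustrative value

fn is_adaptive(
    ancestors: &[usize],                 // chain of ancestor arena indices
    is_dominant: impl Fn(usize) -> bool, // does its subtree dominate its siblings?
    timer_chain_height: impl Fn(usize) -> u64,
    new_block_timer_height: u64,
) -> bool {
    ancestors.iter().any(|&a| {
        !is_dominant(a)
            && new_block_timer_height.saturating_sub(timer_chain_height(a))
                >= TIMER_CHAIN_BETA
    })
}

fn main() {
    let ancestors = [0usize, 1, 2];
    let is_dominant = |a: usize| a != 1; // only ancestor 1 is contested
    let timer_height = |a: usize| (a as u64) * 10;
    assert!(is_adaptive(&ancestors, is_dominant, timer_height, 250));
    assert!(!is_adaptive(&ancestors, is_dominant, timer_height, 50));
}
```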
Partial Invalid
Note that the past set of a new block denotes all the blocks that the generator of the new block observed at generation time. Therefore, from the past set of a new block, other full nodes can determine whether it chose the correct parent block and whether it should be adaptive or not.
The Conflux consensus algorithm defines blocks that choose incorrect parents or fill in an incorrect adaptive status as partial invalid blocks. For a partial invalid block, the partial_invalid field is set to True. The algorithm requires partial invalid blocks to be treated differently from normal blocks in three ways:
- Honest nodes will not reference a partial invalid block, directly or indirectly, until a significant period of time has passed. This time period is measured with timer_chain_height, and the difference has to be more than timer_chain_beta. Yes, this means that if another otherwise perfectly fine block references a partial invalid block, neither of the two blocks will be referenced for a while.
- Partial invalid blocks receive no block reward. They are extremely unlikely to get any reward anyway, because of the large anticone set caused by the first rule.
- Partial invalid blocks are excluded from the timer chain consideration.
To implement the first rule, the on_new_block() routine in ConsensusNewBlockHandler is separated into two subroutines, preactivate_block() and activate_block(). preactivate_block() computes and determines whether a block is partial invalid or not, while activate_block() fully integrates a block into the consensus graph inner data structures. For every new block, the field active_cnt tracks how many inactive blocks it references. A block is inactive if it references, directly or indirectly, a partial invalid block. activate_block() is called on a block only when the active_cnt of the block becomes zero. The field activated denotes whether a block is active or not. For partial invalid blocks, activation is delayed until the current timer chain height of the ledger is timer_chain_beta higher than that of the invalid block. Newly generated blocks will not reference any inactive blocks, i.e., these inactive blocks are treated as if they were not in the TreeGraph.
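The counting part of this scheme can be sketched as below; the Node layout is illustrative, and the extra timer-chain delay applied to partial invalid blocks themselves is omitted.

```rust
// active_cnt bookkeeping sketch: a block activates once all blocks it
// references are active; activation then propagates to its referrers.
struct Node {
    active_cnt: usize,     // number of inactive blocks this block references
    activated: bool,
    referrers: Vec<usize>, // arena indices of blocks referencing this block
}

fn activate_block(arena: &mut Vec<Node>, idx: usize) {
    arena[idx].activated = true;
    // ...full integration into the consensus inner data structures...
    let referrers = arena[idx].referrers.clone();
    for r in referrers {
        arena[r].active_cnt -= 1;
        if arena[r].active_cnt == 0 {
            activate_block(arena, r);
        }
    }
}

fn main() {
    // Block 1 references block 0; once block 0 activates, block 1 follows.
    let mut arena = vec![
        Node { active_cnt: 0, activated: false, referrers: vec![1] },
        Node { active_cnt: 1, activated: false, referrers: vec![] },
    ];
    activate_block(&mut arena, 0);
    assert!(arena[1].activated);
}
```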
Anticone, Past View, and Ledger View
In order to check the partial invalid status of each block, we need to operate under the past view of the block to determine its correct parent and its adaptivity. This is different from the current state of the TreeGraph, which we call the ledger view: in the past view, all blocks in the anticone and the future set of the block are excluded. Because we process blocks in topological order, the future set of a new block is empty, so we only need to eliminate the anticone blocks.
compute_and_update_anticone() in ConsensusNewBlockHandler computes the anticone set of a new block. Note that because the anticone set may be very large, we have two implementation-level optimizations. First, we represent the anticone set as a set of barrier nodes in the TreeGraph, i.e., a set of subtrees where each block in the subtrees is in the anticone set. Second, we maintain the anticone sets of recently accessed/inserted blocks only. When checking whether a block is valid in its past view (e.g., in adaptive_weight() and in check_correct_parent()), we first cut all barrier subtrees from the link-cut weight trees accordingly to get the state of the past view. After the computation, we restore these anticone subtrees.
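The cut/compute/restore control flow can be sketched as follows; the WeightTree trait is a stand-in for the real link-cut tree API, and the barrier representation is simplified to (subtree root, parent) pairs.

```rust
// Cut/compute/restore pattern: detach the anticone barrier subtrees, run a
// past-view computation, then reattach them to restore the ledger view.
trait WeightTree {
    fn cut(&mut self, v: usize);                 // detach subtree rooted at v
    fn link(&mut self, v: usize, parent: usize); // reattach it under parent
}

fn with_past_view<T>(
    tree: &mut dyn WeightTree,
    anticone_barrier: &[(usize, usize)], // (subtree root, its parent)
    compute: impl FnOnce(&mut dyn WeightTree) -> T,
) -> T {
    for &(root, _) in anticone_barrier {
        tree.cut(root);
    }
    let result = compute(&mut *tree);
    for &(root, parent) in anticone_barrier {
        tree.link(root, parent);
    }
    result
}

struct DummyTree;
impl WeightTree for DummyTree {
    fn cut(&mut self, _v: usize) {}
    fn link(&mut self, _v: usize, _parent: usize) {}
}

fn main() {
    let mut tree = DummyTree;
    let barrier = [(3, 1), (7, 2)];
    let past_view_answer = with_past_view(&mut tree, &barrier, |_t| 42);
    assert_eq!(past_view_answer, 42);
}
```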
Check Correct Parent
To check whether a new block chooses a correct parent block or not, we first compute the set of blocks inside the epoch of the new block, assuming that the new block is on the pivot chain. We store this set in the field blockset_in_own_view_of_epoch. We then iterate over every candidate block in this set to make sure that the chosen parent block is better than it. Specifically, we find the two fork blocks of the candidate block and the parent block from their LCA and make sure that the fork of the parent is heavier. This logic is implemented in check_correct_parent() in ConsensusNewBlockHandler.
Note that blockset_in_own_view_of_epoch may also become too large to hold in memory consistently, especially if a malicious attacker tries to generate invalid blocks to blow up this set. The current implementation therefore periodically clears the set and only keeps the sets for pivot chain blocks. Note that for pivot chain blocks, this set is also used during transaction execution.
Fallback Brute Force Methods
There are situations where the anticone barrier set becomes too large, e.g., if a malicious attacker launches a performance attack on Conflux. This makes the default strategy worse than O(n), because there is an O(log n) factor for each block in the barrier set when we do the link-cut tree chopping. To this end, we implemented a brute-force routine compute_subtree_weights() that computes the subtree weights of every block in a past view in O(n). We also implement check_correct_parent_brutal() and adaptive_weight_impl_brutal(), which do the checking using the brute-force computed subtree weights instead.
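A minimal O(n) version of such a subtree weight computation is sketched below; the array-based representation is illustrative, and it assumes blocks are indexed in topological (insertion) order so every child has a larger index than its parent.

```rust
// O(n) brute-force subtree weights: iterate in reverse topological order so
// every child is folded into its parent exactly once.
fn compute_subtree_weights(parent: &[Option<usize>], weight: &[i128]) -> Vec<i128> {
    let n = parent.len();
    let mut subtree = weight.to_vec();
    for v in (0..n).rev() {
        if let Some(p) = parent[v] {
            subtree[p] += subtree[v];
        }
    }
    subtree
}

fn main() {
    // Tree: 0 <- 1 <- 2, and 0 <- 3, each block with weight 1.
    let parent = [None, Some(0), Some(1), Some(0)];
    let weight = [1i128, 1, 1, 1];
    assert_eq!(compute_subtree_weights(&parent, &weight), vec![4, 2, 1, 1]);
}
```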
Force Confirmation
The Conflux consensus algorithm force-confirms a block if 1) there are timer_chain_beta consecutive timer chain blocks in the subtree of the block, and 2) afterwards there are at least timer_chain_beta timer chain blocks following (not required to be in the subtree, though). Force confirmation means that new blocks must follow this block as their ancestor no matter what, ignoring subtree weights, even in the extremely unlikely case that the force-confirmed block has less weight than its siblings.
The force confirmation mechanism exists to enable checkpoints, which we will describe later. It is based on the following rationale:
- Reverting a timer chain of length timer_chain_beta is impossible.
- Therefore, the force-confirmed block will always move along the pivot chain instead of drifting between siblings.
We compute the accumulative LCA of the last timer_chain_beta timer chain blocks and store it in the timer_chain_accumulative_lca[] field. This vector is timer_chain_beta shorter than timer_chain[] because force confirmation needs at least timer_chain_beta trailing timer chain blocks, so the LCAs for the last entries do not matter. check_correct_parent() and adaptive_weight() and their subroutines also respect this force confirm point during their checking. Specifically, any fork before the force confirm height is ignored.
Note that this force confirm rule is also defined based on the past view of each block. With the computed anticone information, compute_timer_chain_tuple() in ConsensusGraphInner computes the timer chain related information of each block under its past view. The results of this function include the differences in timer_chain[], timer_chain_accumulative_lca[], and timer_chain_height between the ledger view and the past view. We can combine the diff with the current ledger view values to obtain the past view values.
Era
In order to implement the checkpoint mechanism, the Conflux consensus algorithm splits the graph into eras. Every era contains era_epoch_count epochs. For example, if era_epoch_count is 50000, then a new era starts every 50000 epochs, and the pivot chain block at height 50000 becomes the genesis of the new era.
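The boundary arithmetic is simple enough to state in a couple of lines; the sketch below computes the era genesis height for any block height, with the constant taken from the example above.

```rust
// Era boundary arithmetic: the era genesis of a block is the pivot block at
// the nearest era boundary at or below the block's height.
const ERA_EPOCH_COUNT: u64 = 50_000; // example value from the text

fn era_genesis_height(height: u64) -> u64 {
    (height / ERA_EPOCH_COUNT) * ERA_EPOCH_COUNT
}

fn main() {
    assert_eq!(era_genesis_height(49_999), 0);
    assert_eq!(era_genesis_height(50_000), 50_000);
    assert_eq!(era_genesis_height(123_456), 100_000);
}
```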
At the era boundary, there are several differences from the normal case.
- A block enters the total order for execution only if 1) it is in the subtree of the previous era genesis and 2) it is inside the past set of the next era genesis on the pivot chain.
- The anticone penalty calculation for the block reward does not go across the era boundary.
Checkpoint
Inside ConsensusGraphInner, there are two key height pointers: the current checkpoint era genesis block height (cur_era_genesis_height) and the current stable era genesis block height (cur_era_stable_height). These two height pointers always point to some era genesis (i.e., a multiple of era_epoch_count). Initially, both pointers point to the true genesis (height 0).
A new era genesis block becomes stable (i.e., cur_era_stable_height moves) if the block is force confirmed in the current TreeGraph. A stable era genesis block becomes a new checkpoint (i.e., cur_era_genesis_height moves) if:
- The block is force confirmed in the past view of the stable era genesis block.
- In the anticone of this block, there is no timer chain block.
should_move_stable_height() and should_form_checkpoint_at() in ConsensusNewBlockHandler are invoked after every newly inserted block to test the above two conditions. Generally speaking, the stable era genesis block will never be reverted off the pivot chain. Any block in the past set of the checkpoint block is no longer required for future computation of the consensus layer. Therefore, after a new checkpoint is formed, make_checkpoint_at() in ConsensusNewBlockHandler is called to clean up those blocks that are not in the future set of the new checkpoint.
Note that the checkpoint mechanism also changes how we handle a new block. For a new block:
- If the new block is outside the subtree of the current checkpoint, we only need to insert a stub into our data structure (because a block inside the subtree may be indirectly referenced via this stub block). We do not need to care about such a block, because it will not change the timer chain and it will not be executed.
- If the past set of the new block does not contain the stable era genesis block, we do not need to check the partial invalid status of this block. This is because this block will not change the timer chain (recall our assumption that the timer chain will not reorg for more than timer_chain_beta blocks) and future blocks can reference this block directly (since the timer chain difference is already more than timer_chain_beta).
Deferred Execution
Because the TreeGraph pivot chain may oscillate temporarily, we defer transaction execution for DEFERRED_STATE_EPOCH_COUNT epochs (default 5). After a pivot chain update, the activate_block() routine enqueues the execution tasks of the new pivot chain except for the last five epochs. It calls enqueue_epoch() in ConsensusExecutor to enqueue each task.
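A hedged sketch of this rule follows: every pivot epoch except the most recent DEFERRED_STATE_EPOCH_COUNT ones is handed to an enqueue callback, which stands in for ConsensusExecutor::enqueue_epoch(). The bookkeeping that skips already-executed epochs is omitted here.

```rust
// Deferred execution sketch: only pivot epochs that are at least
// DEFERRED_STATE_EPOCH_COUNT epochs below the chain tip get executed.
const DEFERRED_STATE_EPOCH_COUNT: usize = 5;

fn enqueue_deferred_epochs(pivot_chain: &[u32], enqueue_epoch: &mut impl FnMut(u32)) {
    if pivot_chain.len() <= DEFERRED_STATE_EPOCH_COUNT {
        return; // nothing is old enough to execute yet
    }
    for &pivot_block in &pivot_chain[..pivot_chain.len() - DEFERRED_STATE_EPOCH_COUNT] {
        enqueue_epoch(pivot_block);
    }
}

fn main() {
    let pivot: Vec<u32> = (0..8).collect();
    let mut enqueued = Vec::new();
    enqueue_deferred_epochs(&pivot, &mut |epoch| enqueued.push(epoch));
    assert_eq!(enqueued, vec![0, 1, 2]); // the last 5 epochs stay deferred
}
```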
Block Reward Calculation
Because there is no explicit coinbase transaction in Conflux, all block rewards are computed implicitly during transaction execution. In Conflux, the block reward is determined by the base reward and a penalty ratio based on the total weight of the block's anticone divided by the target difficulty of its epoch's pivot block. This anticone set only considers blocks appearing no later than REWARD_EPOCH_COUNT epochs afterwards. Additionally, if there is a new era, the anticone set does not count across the era boundary. get_pivot_reward_index() in ConsensusExecutor computes this reward anticone threshold. get_reward_execution_info_from_index() in ConsensusExecutor and its subroutines compute the anticone set given the threshold point in the pivot chain.
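To make the shape of this computation concrete, here is a sketch of a base reward scaled down by an anticone penalty; the quadratic penalty and the cutoff constant are assumptions for illustration only, and the exact formula lives in ConsensusExecutor.

```rust
// Reward penalty sketch: the heavier a block's anticone relative to the
// epoch pivot block's target difficulty, the smaller the reward. The
// quadratic shape and the cutoff constant below are illustrative, not the
// real formula.
const ANTICONE_PENALTY_RATIO: u128 = 100; // illustrative value

fn block_reward(base_reward: u128, anticone_weight: u128, target_difficulty: u128) -> u128 {
    let cutoff = ANTICONE_PENALTY_RATIO * target_difficulty;
    if anticone_weight >= cutoff {
        return 0; // anticone too heavy: no reward
    }
    let penalty = base_reward * anticone_weight * anticone_weight / (cutoff * cutoff);
    base_reward - penalty
}

fn main() {
    assert_eq!(block_reward(1_000_000, 0, 100), 1_000_000);
    assert!(block_reward(1_000_000, 5_000, 100) < 1_000_000);
    assert_eq!(block_reward(1_000_000, 10_000, 100), 0);
}
```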
Blaming Mechanism
It is infeasible to validate the filled state root of a block, because we would need to execute all transactions in a different order under the past view of that block. Instead, we only ask full nodes to validate the state root results on the current pivot chain. A node then fills in a blame number indicating how many levels of ancestors, starting from the parent, do not have correct state roots. When this number is greater than zero, the filled deferred state root becomes a Merkle H256 vector that contains the corrected state roots of those ancestors along with the correct one. get_blame_and_deferred_state_for_generation() in ConsensusGraph computes the blame information for block generation. first_trusted_header_starting_from() in ConsensusGraph is a useful helper function to compute the first trustworthy header based on the subtree blame information.
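One plausible reading of the blame number is sketched below: counting, starting from the parent, how many consecutive ancestors filled in incorrect state roots. The boolean-slice representation is purely illustrative.

```rust
// Blame number sketch: ancestor_roots_correct[0] is the parent,
// [1] the grandparent, and so on.
fn compute_blame(ancestor_roots_correct: &[bool]) -> u32 {
    ancestor_roots_correct
        .iter()
        .take_while(|&&correct| !correct)
        .count() as u32
}

fn main() {
    assert_eq!(compute_blame(&[true, true]), 0); // the parent is trustworthy
    assert_eq!(compute_blame(&[false, false, true]), 2); // blame two levels
}
```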
Multi-Thread Design
The consensus layer has one thread dedicated to processing new blocks from the synchronization layer and one thread dedicated to executing transactions. It also exposes a set of interface APIs that RPC threads and synchronization threads may call.
Consensus Worker
Consensus Worker is a thread created by the synchronization layer. During the normal running phase, every new block is sent to a channel connecting the synchronization thread and the consensus worker thread. The consensus worker thread consumes the blocks one by one and invokes consensus::on_new_block() to process each of them. Note that the synchronization layer ensures that a new block is header-ready when it is delivered to Consensus Worker, i.e., all of its ancestor/past blocks have already been delivered to the consensus layer before it. This enables the consensus layer to always deal with a well-defined directed acyclic graph without holes.
One advantage of dedicating a single thread to the consensus protocol is that it simplifies the protocol implementation a lot. Because the details of the consensus protocol are complicated and the implementation involves many sophisticated data structure manipulations, the single-thread design makes sure that we do not need to worry about deadlocks or races. Upon entering consensus::on_new_block(), the thread acquires the write lock of the inner struct of the consensus layer (i.e., ConsensusGraphInner). During the normal phase, this thread should be the only one modifying the inner struct of the consensus layer.
Consensus Execution Worker
Consensus Execution Worker is a thread created at the start of the consensus layer. It is dedicated to transaction execution. There is a channel connecting Consensus Worker with Consensus Execution Worker. Once the consensus protocol determines the order of the pivot chain, it sends an ExecutionTask for each epoch in the pivot chain to the channel. These tasks are picked up by the Consensus Execution Worker thread one by one. The thread loads the state preceding the executed epoch from the storage layer as the input, runs all transactions in the executed epoch (see ConsensusExecutor::process_epoch_transactions()), and produces the resulting state as the output.
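The worker/channel shape of this pipeline can be sketched with std::sync::mpsc as below; ExecutionTask here is a stand-in for the real task type, and the execution body is elided.

```rust
// Two-thread pipeline sketch: a producer (the consensus side) enqueues
// execution tasks; a dedicated worker consumes them one by one.
use std::sync::mpsc;
use std::thread;

struct ExecutionTask {
    epoch: u64, // stand-in for the pivot block identifying the epoch
}

fn main() {
    let (sender, receiver) = mpsc::channel::<ExecutionTask>();

    // The "Consensus Execution Worker": drains the channel in order.
    let worker = thread::spawn(move || {
        for task in receiver {
            // Load the pre-state, run the epoch's transactions, store the
            // resulting state root...
            println!("executing epoch {}", task.epoch);
        }
    });

    // The "Consensus Worker" side: enqueue epochs as the pivot chain grows.
    for epoch in 0..3 {
        sender.send(ExecutionTask { epoch }).unwrap();
    }
    drop(sender); // closing the channel lets the worker thread exit
    worker.join().unwrap();
}
```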
The rationale for separating transaction execution from the consensus protocol implementation is performance. With our blaming mechanism, the execution result state is completely separated from the consensus protocol implementation. The deferred execution mechanism gives us extra room to pipeline the consensus protocol and the transaction execution. It is therefore unwise to block the Consensus Worker thread waiting for the execution results to come back.
Key Assumptions, Invariants, and Rules
If you want to write code that interacts with the Conflux consensus layer, it is very important to understand the following assumptions and rules.
- The consensus layer assumes that the passed BlockDataManager is in a consistent state. This means that the BlockDataManager contains the correct current checkpoint/stable heights, and that blocks before the checkpoint and the stable height were properly checked during previous executions and persisted into the BlockDataManager properly. The consensus layer does not check the results it fetches from the block data manager. If the manager is inconsistent, the consensus layer will execute incorrectly or crash!
- Except for the subroutines of on_new_block(), no one should hold the write lock of the inner struct! Right now the only exception to this rule is assemble_new_block_impl(), because it computes the adaptive field; this is not good and we plan to change it. Acquiring the write lock of the inner struct is very likely to cause deadlocks given the complexity of the consensus layer and its dependencies on many other components. Always try to avoid this!