- 23.08.22: Initial unfinished draft
- 14.09.22: The first finished version
- 02.12.22: Fixing missed gaps
- @Wondertan
I'm starting to like writing ADRs step-by-step. Also, there is a trick that helps: imagine you are talking to a dev who has just joined the team, and you are onboarding them.
- LN - Light Node
- FN - Full Node
- BN - Bridge Node
- EDS (Extended Data Square) - plain Block data omitting headers and other block metadata
- ODS - Original Data Square or the first quadrant of the EDS. Contains real user data and padding
- NMT - Namespaced Merkle Tree
- DataHash - Hash commitment over DAHeader
Current block data synchronization is done over Bitswap, traversing NMT trees of rows and columns of data square quadrants. We know from empirical evidence that it takes more than 200 seconds (~65,000 network requests) to download a 4MB block of 256-byte shares, which is unacceptable: sync must take much less than the block time (15-30 seconds).
DASing, on the other hand, shows acceptable metrics for the block sizes we are aiming for initially. For the same block, a DAS operation takes 8 (technically 9) blocking requests at ~50ms each, which is ~400ms in the ideal scenario (excluding disk IO).
Getting data by namespace also needs to be improved. Its current latency lies between BlockSync and DASing: more data means more requests and more time to fulfill them.
To facilitate and speed up the resolution of the problem, we decided to gather the team in Berlin for four days. With the preliminary preparation by @Wondertan and guest @willscott, we were able to find a solution in two days that matches the following requirements:
- Sync time less than block time (ideally sub-second)
- Data by namespace less than block time (ideally sub-second)
- Pragmatic timeframe
  - We need this done before incentivized testnet
  - We don't have time to redesign the protocol from scratch
- Keep Bitswap, as it suffices for DAS and solves the data withholding attack
  - Existing Bitswap logic kept as a fallback mechanism for the case of reconstruction from light nodes
  - Keeping random hash-addressed access to shares for Bitswap to work
This ADR intends to outline the design decisions for block data storage. In a nutshell, the decision is to use the CAR format and DAGStore for extended block storage, and a custom p2p Req/Resp protocol for block data syncing (whole blocks and data by namespace ID) in the happy path. The p2p portion will come in the subsequent Part II document.
- FNs/BNs store EDSes serialized as CAR files. The CAR format provides an efficient way to store Merkle DAG data, like an EDS with NMT. It packs such DAG data into a single blob which can be read sequentially in one read and transferred over the wire. Additionally, CARv2 introduces pluggable indexes over the blob, allowing efficient random access to shares and NMT proofs in one read (if the index is cached in memory).
- FNs/BNs manage a top-level index for hash-to-CAR-file mapping. Current DASing for LNs requires FNs/BNs to serve simple hash-to-data requests. The top-level index maps any Share/NMT Proof hash to the block CARv1 file containing it, so that FNs/BNs can quickly serve DA requests.
- FNs/BNs address EDSes by `DataHash`. The only alternative is addressing by height; however, that would not allow deduplicating block data when EDSes are equal and would couple the Data layer/pkg with the Header layer/pkg.
- FNs/BNs run a single instance of `DAGStore` to manage CAR block files. DAGStore provides both the top-level indexing and the CARv2-based indexing per CAR file. In essence, it's an engine for managing arbitrary CAR files, with indexing, convenient abstractions, tools, recovery mechanisms, etc.
- EDSes are stored as CARv1 files rather than CARv2. CARv2 encodes its indexes into the file itself, while DAGStore maintains CARv2-based indexing externally. Using CARv1 keeps only one copy of the index, stores/transfers less metadata per EDS, and simplifies reading an EDS from a file.
- LNs' DASing remains untouched. The networking protocol and storage for LNs remain intact, as they fulfill the requirements. Bitswap usage as the backbone protocol for requesting samples and the global Badger KVStore remain unaffected.
All the comments on the API definitions should be preserved and potentially improved by implementations.
The new block storage design is solely additive. All the existing storage-related components and functionality are kept with additional components introduced. Altogether, existing and new components will be recomposed to serve as the foundation of our improved block storage subsystem.
The central data structure representing Celestia block data is the EDS (`rsmt2d.ExtendedDataSquare`), and the new storage design is focused on storing entire EDSes as a whole rather than as sets of individual shares, s.t. the storage subsystem can handle storing and streaming/serving blocks of 4MB and more.
Storing EDS as a whole requires EDS (de)serialization. For this, the CAR format is chosen.
To write an EDS into a stream/file, `WriteEDS` is introduced. Internally, it:

- Re-imports the EDS similarly to `ipld.ImportShares`
  - Using `Blockservice` with offline exchange and an in-memory `Blockstore`
  - With `NodeVisitor`, which saves only NMT Merkle proofs (no shares) to the `Blockstore`. NOTE: `len(node.Links()) == 2`
    - Actual shares are written separately, in the particular way explained below
- Creates and writes the `CARv1Header`
  - Fills up the `Roots` field with `EDS.RowRoots`/`EDS.ColRoots` converted into CIDs
- Iterates over shares in quadrant-by-quadrant order via `EDS.GetCell`
  - Writes the shares in row-by-row order
- Iterates over the in-memory `Blockstore` and writes the NMT Merkle proofs stored in it
NOTES:

- CAR provides a utility to serialize any DAG into a file, and there is a way to serialize an EDS into a DAG (`share/ipld.ImportShares`). This approach is the simplest and traverses shares and Merkle proofs in a depth-first manner, packing them into a CAR file. However, it is incompatible with the requirement of being able to truncate the CAR file, reading out only the first quadrant without NMT proofs, so the serialization must differ from that utility to support this.
- Alternatively to `WriteEDS`, an `EDSReader` could be introduced to make EDS-to-stream handling more idiomatic and efficient in some cases, at the cost of a more complex implementation.
```go
// WriteEDS writes the whole EDS into the given io.Writer as a CARv1 file,
// with all its shares and recomputed NMT proofs.
func WriteEDS(context.Context, *rsmt2d.ExtendedDataSquare, io.Writer) error
```
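For illustration, a minimal usage sketch persisting an EDS to disk under its `DataHash`, mirroring the file layout the `eds.Store` section below describes; the helper name and error handling here are assumptions, not part of the API:

```go
// writeEDSFile is a hypothetical helper persisting an EDS as a CARv1 file
// named after its DataHash, the same layout eds.Store.Put uses internally.
func writeEDSFile(ctx context.Context, square *rsmt2d.ExtendedDataSquare, hash share.DataHash, basepath string) error {
	f, err := os.Create(filepath.Join(basepath, hash.String()))
	if err != nil {
		return err
	}
	defer f.Close()
	// Streams all shares and recomputed NMT proofs into the file.
	return share.WriteEDS(ctx, square, f)
}
```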
To read an EDS out of a stream/file, `ReadEDS` is introduced. Internally, it:

- Imports an EDS with an empty, pre-allocated slice. NOTE: The size can be taken from the `CARHeader`
- Wraps the given `io.Reader` with `BlockReader`
- Reads out blocks one by one and fills up the EDS quadrant via `EDS.SetCell`
  - In total, there should be a shares-in-quadrant number of reads
- Recomputes and validates the EDS via `EDS.Repair`
```go
// ReadEDS reads an EDS quadrant (1/4) from an io.Reader CAR file.
// It expects strictly the first EDS quadrant (top left).
// The returned EDS is guaranteed to be full and valid against the DataHash; otherwise, ReadEDS errors.
func ReadEDS(context.Context, io.Reader, DataHash) (*rsmt2d.ExtendedDataSquare, error)
```
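A round-trip sketch pairing `WriteEDS` with `ReadEDS`; it relies on the fact, stated above, that `WriteEDS` emits the ODS quadrant first, so `ReadEDS` can consume the leading quadrant directly:

```go
// roundTripEDS is a hypothetical sanity check: serialize an EDS and read it
// back, letting ReadEDS verify integrity against the given DataHash.
func roundTripEDS(ctx context.Context, square *rsmt2d.ExtendedDataSquare, hash share.DataHash) (*rsmt2d.ExtendedDataSquare, error) {
	var buf bytes.Buffer
	if err := share.WriteEDS(ctx, square, &buf); err != nil {
		return nil, err
	}
	// ReadEDS stops after the first quadrant and repairs the rest,
	// erroring out if the recomputed roots do not match hash.
	return share.ReadEDS(ctx, &buf, hash)
}
```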
To read only a quadrant/ODS out of a full EDS, `ODSReader` is introduced. Its constructor wraps any `io.Reader` containing an EDS generated by `WriteEDS` and produces an `io.Reader` which reads exactly the ODS out of it, similar to `io.LimitReader`. The size of the EDS and ODS can be determined by the number of CIDs in the `CARHeader`.
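That size arithmetic is mechanical; a sketch, assuming the CARv1 header `Roots` carry all row and column roots (`2 * width` CIDs in total), as `WriteEDS` specifies:

```go
// odsShareCount is a hypothetical helper deriving how many shares belong to
// the ODS from the header of a CARv1 file produced by WriteEDS.
func odsShareCount(hdr *car.CarHeader) int {
	// Roots hold row roots followed by column roots: 2 * edsWidth CIDs.
	edsWidth := len(hdr.Roots) / 2
	// The ODS is the top-left quadrant of the EDS: half the width, squared.
	odsWidth := edsWidth / 2
	return odsWidth * odsWidth
}
```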
FNs/BNs keep an `eds.Store` to manage every EDS on the disk. The `eds.Store` type is introduced in the `eds` pkg.
Each EDS, together with its Merkle proofs, is serialized into a CARv1 file. All the serialized CARv1 file blobs are mounted on DAGStore via local FS mounts and registered as Shards.
The introduced `eds.Store` also maintains (via DAGStore) a top-level index enabling granular and efficient random access to every share and/or Merkle proof of every registered CARv1 file. The `eds.Store` provides a custom `Blockstore` interface implementation to achieve this access. The main use-cases are randomized sampling over the whole chain of EDS block data and getting data by namespace.
```go
type Store struct {
	basepath string
	dgstr    dagstore.DAGStore
	topIdx   index.Inverted
	carIdx   index.FullIndexRepo
	mounts   *mount.Registry
	...
}

// NewStore constructs a Store over an OS directory to store CARv1 files of EDSes and indices for them.
// The Datastore is used to keep the inverted/top-level index.
func NewStore(basepath string, ds datastore.Batching) *Store {
	topIdx := index.NewInverted(ds)
	carIdx := index.NewFSRepo(basepath + "/index")
	mounts := mount.NewRegistry()
	// Registration is a must: DAGStore uses the registry to recreate mounts from their persisted URLs.
	err := mounts.Register("fs", &mount.FSMount{FS: os.DirFS(basepath + "/eds/")})
	if err != nil {
		panic(err)
	}

	return &Store{
		basepath: basepath,
		dgstr:    dagstore.New(dagstore.Config{...}),
		topIdx:   topIdx,
		carIdx:   carIdx,
		mounts:   mounts,
	}
}
```
NOTE: The EDSStore should have lifecycle methods (Start/Stop).
To write an entire EDS, the `Put` method is introduced. Internally, it:

- Opens a file under the `Store.Path/DataHash` path
- Serializes the EDS into the file via `share.WriteEDS`
- Wraps it with DAGStore's `FileMount`
- Converts `DataHash` to the `shard.Key`
- Registers the `Mount` as a `Shard` on the DAGStore

NOTE: Registering on the DAGStore populates the top-level index with the shares/proofs accessible from the stored EDS, which is out of the scope of this document.
```go
// Put stores the given data square with DataHash as a key.
//
// The square is verified on the Exchange level, and Put only stores the square, trusting it.
// The resulting file stores all the shares and NMT Merkle Proofs of the EDS.
// Additionally, the file gets indexed s.t. Store.Blockstore can access them.
func (s *Store) Put(context.Context, DataHash, *rsmt2d.ExtendedDataSquare) error
```
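A sketch of those internals, assuming the upstream dagstore APIs (`mount.FileMount`, `shard.KeyFromString`, the asynchronous `RegisterShard`) roughly as the library defines them:

```go
// Hypothetical sketch of Put: serialize the EDS to a CARv1 file, then register
// the file as a Shard so DAGStore indexes its shares/proofs.
func (s *Store) Put(ctx context.Context, root DataHash, square *rsmt2d.ExtendedDataSquare) error {
	path := filepath.Join(s.basepath, root.String())
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Write all shares and recomputed NMT proofs into the file.
	if err := share.WriteEDS(ctx, square, f); err != nil {
		return err
	}

	// Registration is asynchronous; wait for the result before returning.
	ch := make(chan dagstore.ShardResult, 1)
	key := shard.KeyFromString(root.String())
	err = s.dgstr.RegisterShard(ctx, key, &mount.FileMount{Path: path}, ch, dagstore.RegisterOpts{})
	if err != nil {
		return err
	}
	res := <-ch
	return res.Error
}
```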
To read an EDS as a byte stream, the `GetCAR` method is introduced. Internally, it:

- Converts `DataHash` to the `shard.Key`
- Acquires a `ShardAccessor` and returns an `io.ReadCloser` from it

NOTES:

- DAGStore's `ShardAccessor` has to be extended to return an `io.ReadCloser`. Currently, it only returns a `Blockstore` of the CAR
- The returned `io.ReadCloser` represents the full EDS. To get only a quadrant, an `ODSReader` should be used instead
```go
// GetCAR takes a DataHash and returns a buffered reader to the respective EDS serialized as a CARv1 file.
//
// The Reader strictly reads the full EDS, and its integrity is not verified.
//
// Caller must Close the returned reader after reading.
func (s *Store) GetCAR(context.Context, DataHash) (io.ReadCloser, error)
```
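Combining `GetCAR` with the `ODSReader` described earlier yields quadrant-only streaming; the constructor's exact name and signature below are illustrative, as this document doesn't fix them:

```go
// streamODS is a hypothetical helper streaming only the first quadrant
// of a stored EDS to the given writer.
func streamODS(ctx context.Context, s *Store, root DataHash, w io.Writer) error {
	r, err := s.GetCAR(ctx, root)
	if err != nil {
		return err
	}
	defer r.Close() // the caller must Close the reader

	// ODSReader limits the full-EDS stream to exactly the ODS, akin to io.LimitReader.
	ods, err := ODSReader(r)
	if err != nil {
		return err
	}
	_, err = io.Copy(w, ods)
	return err
}
```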
The `Blockstore` method returns a `Blockstore` interface implementation instance, providing random access to every share and NMT Merkle proof in every stored EDS. It is required for FNs/BNs to serve DAS requests over Bitswap. There is an existing `Blockstore` implementation over `DAGStore` and `CARv2` indexes.
NOTES:

- We can either use DAGStore's implementation or implement a custom one optimized for our needs.
- The Blockstore does not store whole Celestia Blocks, but IPFS blocks: we represent Merkle proofs and shares as IPFS blocks.
- EDIT: We went with the custom implementation.
```go
// Blockstore returns an IPFS Blockstore providing access to individual shares/nodes of all EDS
// registered on the Store. NOTE: The Blockstore does not store whole Celestia Blocks but IPFS blocks.
// We represent `shares` and NMT Merkle proofs as IPFS blocks and IPLD nodes so Bitswap can access those.
func (s *Store) Blockstore() blockstore.Blockstore
```
The `CARBlockstore` method returns a read-only `Blockstore` interface implementation instance providing random access to the shares and NMT Merkle proofs of a specific EDS identified by DataHash, along with its corresponding DAH. It is required for FNs/BNs to enable reading data by namespace.
NOTES:

- The returned Blockstore does not store whole Celestia Blocks, but IPFS blocks: we represent Merkle proofs and shares as IPFS blocks.
```go
// CARBlockstore returns an IPFS Blockstore providing access to individual shares/nodes of a specific EDS identified by
// DataHash and registered on the Store. NOTE: The Blockstore does not store whole Celestia Blocks but IPFS blocks.
// We represent `shares` and NMT Merkle proofs as IPFS blocks and IPLD nodes so Bitswap can access those.
func (s *Store) CARBlockstore(context.Context, DataHash) (dagstore.ReadBlockstore, error)
```
The `GetDAH` method returns the DAH (`share.Root`) of the EDS identified by `DataHash`. Internally, it:

- Acquires a `ShardAccessor` for the corresponding shard
- Reads the CAR Header from the accessor
- Converts the header's root CIDs into a `share.Root`
- Verifies the integrity of the `share.Root` by comparing it with the `DataHash`
```go
// GetDAH returns the DataAvailabilityHeader for the EDS identified by DataHash.
func (s *Store) GetDAH(context.Context, share.DataHash) (*share.Root, error)
```
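A sketch of those steps; `rootFromCIDs` is a hypothetical conversion helper, and the header read assumes go-car's `car.ReadHeader`:

```go
// Hypothetical sketch of GetDAH: read the CARv1 header of the stored file and
// verify the reconstructed DAH against the DataHash before returning it.
func (s *Store) GetDAH(ctx context.Context, root share.DataHash) (*share.Root, error) {
	r, err := s.GetCAR(ctx, root)
	if err != nil {
		return nil, err
	}
	defer r.Close()

	// The CARv1 header Roots carry the EDS row and column roots as CIDs.
	hdr, err := car.ReadHeader(bufio.NewReader(r))
	if err != nil {
		return nil, err
	}

	dah, err := rootFromCIDs(hdr.Roots) // hypothetical CIDs-to-share.Root conversion
	if err != nil {
		return nil, err
	}
	// Integrity check: the reconstructed DAH must hash back to the DataHash.
	if !bytes.Equal(dah.Hash(), root) {
		return nil, fmt.Errorf("eds: DAH does not match DataHash")
	}
	return dah, nil
}
```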
To read an entire EDS, the `Get` method is introduced. Internally, it:

- Gets a serialized EDS `io.Reader` via `Store.GetCAR`
- Deserializes the EDS and validates it via `share.ReadEDS`

NOTE: It's unnecessary, but an API ergonomics/symmetry nice-to-have.
```go
// Get reads EDS out of the Store by the given DataHash.
//
// It reads only one quadrant (1/4) of the EDS and verifies the integrity of the stored data by recomputing it.
func (s *Store) Get(context.Context, DataHash) (*rsmt2d.ExtendedDataSquare, error)
```
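Since `Get` is pure composition of the pieces above, a sketch is short; the `ODSReader` wrapping is included because `ReadEDS` expects strictly the first quadrant:

```go
// Hypothetical sketch of Get: stream the stored CARv1 file, trim it to the
// ODS, and let ReadEDS repair and validate the full square.
func (s *Store) Get(ctx context.Context, root DataHash) (*rsmt2d.ExtendedDataSquare, error) {
	r, err := s.GetCAR(ctx, root)
	if err != nil {
		return nil, err
	}
	defer r.Close()

	ods, err := ODSReader(r) // illustrative constructor, see the ODSReader section
	if err != nil {
		return nil, err
	}
	return share.ReadEDS(ctx, ods, root)
}
```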
To check if the EDSStore keeps an EDS, the `Has` method is introduced. Internally, it:

- Converts `DataHash` to the `shard.Key`
- Checks that `GetShardInfo` does not return `ErrShardUnknown`

NOTE: It's unnecessary, but an API ergonomics/symmetry nice-to-have.
```go
// Has checks if EDS exists by the given DataHash.
func (s *Store) Has(context.Context, DataHash) (bool, error)
```
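A sketch of the check, assuming dagstore's `GetShardInfo` and `ErrShardUnknown` as the library defines them:

```go
// Hypothetical sketch of Has: probe the DAGStore for the shard keyed by DataHash.
func (s *Store) Has(ctx context.Context, root DataHash) (bool, error) {
	_, err := s.dgstr.GetShardInfo(shard.KeyFromString(root.String()))
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, dagstore.ErrShardUnknown):
		return false, nil
	default:
		return false, err
	}
}
```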
To remove a stored EDS, the `Remove` method is introduced. Internally, it:

- Converts `DataHash` to the `shard.Key`
- Destroys the `Shard` via DAGStore
  - Internally, this removes its `Mount` as well
- Removes the CARv1 file from disk under the `Store.Path/DataHash` path
- Drops indices

NOTES:

- It's unnecessary, but an API ergonomics/symmetry nice-to-have
- GC logic on the DAGStore has to be investigated so that removal is implemented correctly
```go
// Remove removes EDS from Store by the given DataHash and cleans up all the indexing.
func (s *Store) Remove(context.Context, DataHash) error
```
Generally, this stays unchanged, with minor edits:

- `share/ipld.GetByNamespace` is kept, but loads data from disk only and not from the network anymore
  - Using `Blockservice` with offline exchange
  - Using the `Blockstore` provided by `eds.Store`
- `share/ipld.GetByNamespace` is extended to return NMT Merkle proofs
  - Similar to `share/ipld.GetProofsForShares`
  - Ensure Merkle proofs are not duplicated!
As an extension, `share/ipld.GetByNamespace` can be modified to `share.CARByNamespace`, returning a CARv1 Reader with encoded shares and NMT Merkle Proofs.
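The offline wiring mentioned above is small; a sketch of it, using the `eds.Store.Blockstore` accessor from this document:

```go
// newOfflineBlockservice is a hypothetical constructor for the Blockservice
// GetByNamespace would use: reads hit the local EDS store, never the network.
func newOfflineBlockservice(store *eds.Store) blockservice.BlockService {
	bs := store.Blockstore()
	// The offline exchange never fetches remotely; missing blocks are errors.
	return blockservice.New(bs, offline.Exchange(bs))
}
```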
Addressing EDSes by DataHash allows us to deduplicate equal EDSes. EDS equality is very unlikely in practice, besides the empty Block case, which always produces the same EDS.
The empty block is a valid and small EDS. It can occur in the early stages of the network. Its body is constant, so to avoid transferring it over the wire, the `eds.Store` should be pre-initialized with the empty EDS value.
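A sketch of that pre-initialization, run on Store construction/start; the empty-square constructor below is a placeholder name, not an existing API:

```go
// ensureEmptyEDS is a hypothetical hook pre-populating the Store with the
// constant empty-Block EDS, so it never has to be fetched over the wire.
func ensureEmptyEDS(ctx context.Context, s *Store) error {
	square, hash := emptyEDSWithHash() // placeholder: builds the canonical empty EDS and its DataHash
	if ok, err := s.Has(ctx, hash); err != nil || ok {
		return err
	}
	return s.Put(ctx, hash, square)
}
```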
On construction, the EDSStore expects a directory to store CAR files and indices in. The path should be derived from `node.Store.Path`.
- Extended blocks as a set of share blobs and Merkle proofs in global Store (current approach with KVStore)
- Extended block as a single blob only (computing Merkle proofs)
- Extended block as a single blob and Merkle proofs
- Extended block as a set of DAG/CAR blobs
- Extended block as a single DAG/CAR blob
- EDS to/from CARv2 conversion performance. The current sync design assumes two conversions: from CAR to EDS on the protocol layer, and back from EDS to CAR when storing the EDS. Rsmt2d allocates on most operations with individual shares, and for larger blocks during sync these allocations put significant pressure on the GC. One way to substantially alleviate this is to integrate a byte buffer pool into rsmt2d.
- Disk usage increases due to the top-level index. This is a temporary solution; the index will eventually have to be removed. LNs know which block they sample and can provide the `DataHash` together with the sample request over Bitswap, removing the need for the hash-to-EDS-file mapping. This requires us to either facilitate the implementation of Bitswap's auth extension or propose a custom Bitswap message extension. Subsequently, the Blockstore implementation provided via `eds.Store` would have to be changed to expect the DataHash to be passed through the `context.Context`.