BlockSource
https://github.com/datafast-network/datafast-runtime/blob/main/src/components/block_source/mod.rs
BlockSource polls the block store for both archived blockchain data and newly ingested data.
There can be multiple variants of BlockSource. Currently the only implemented variant is DeltaLake, which consumes the output of Delta-Ingestor.
The data in the DeltaLake block source, produced by Delta-Ingestor, comes in Delta Lake format, meaning it is typically stored in a lakehouse data store.
Delta Lake is an open table format developed by Databricks, built on top of the Parquet file format. This file-based data store helps reduce storage cost compared to Postgres, and at the same time eliminates repetitive JSON-RPC calls for the same data.
The data format combines Delta Lake and Protobuf, with the following schema definition:

| Column | Type | Description |
| --- | --- | --- |
| `block_number` | u64 / long | Number of the block. |
| `block_partition` | u64 / long | Partition key for the Delta Lake table, calculated by `block_number % A_PARTITION_MOD_VALUE`. |
| `hash` | string | The block's hash. |
| `parent_hash` | string | The block's parent hash. `hash` and `parent_hash` are used when handling duplicate or reorged data. |
| `block_data` | bytes | Protobuf byte format. For Ethereum, its definition can be found at https://github.com/datafast-network/delta-ingestor/blob/main/ingestor/src/proto/ethereum.proto |
| `created_at` | u64 / long | Timestamp of when the block was written to the Delta Lake, not the actual block-mining timestamp. This value helps with handling reorged and duplicate data while the Ingestor is running. |
The `block_data` column holds Protobuf-encoded bytes. Using Protobuf rather than a human-readable format ensures fast serialization and deserialization when ingesting or processing data in the delta lake.
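For illustration, a minimal sketch of decoding `block_data` with prost. The message type and its fields here are assumptions, not the actual definitions from ethereum.proto; in practice the type would be generated from that file with prost-build.

```rust
use prost::Message;

// Hypothetical message, standing in for the type generated from
// delta-ingestor's ethereum.proto; field names/tags are illustrative only.
#[derive(Clone, PartialEq, Message)]
pub struct Block {
    #[prost(uint64, tag = "1")]
    pub number: u64,
    #[prost(string, tag = "2")]
    pub hash: String,
}

// `block_data` is the raw bytes column read from the Delta table row.
fn decode_block(block_data: &[u8]) -> Result<Block, prost::DecodeError> {
    Block::decode(block_data)
}
```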
The DeltaLake block source uses the DataFusion SQL API to query blocks from the Delta Lake table for processing, in an infinite polling loop that can be described with the pseudo-code below.
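A minimal Rust sketch of that loop, not the actual implementation: it assumes a DataFusion `SessionContext` with the Delta table already registered as `blocks`, and a hypothetical `parse_and_forward` stand-in for the parse and forward steps described next.

```rust
use std::time::Duration;

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::SessionContext;

// Hypothetical stand-in for the parse & forward steps described below.
fn parse_and_forward(_batches: Vec<RecordBatch>) {}

async fn polling_loop(ctx: SessionContext, query_step: u64) -> datafusion::error::Result<()> {
    let mut start = 0u64;
    loop {
        // Fetch the next `query_step` blocks from the registered Delta table.
        let sql = format!(
            "SELECT * FROM blocks \
             WHERE block_number >= {start} AND block_number < {end} \
             ORDER BY block_number",
            end = start + query_step
        );
        let batches = ctx.sql(&sql).await?.collect().await?;
        if batches.is_empty() {
            // Caught up with the head of the table: wait before polling again.
            tokio::time::sleep(Duration::from_secs(1)).await;
            continue;
        }
        parse_and_forward(batches);
        start += query_step;
    }
}
```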
The downloaded data is read and parsed into the runtime's internal data format. This work is done in parallel using Rayon.
Once deserialization is finished, the data is wrapped in a Message struct and forwarded to the main flow of the runtime through a kanal channel, as sketched below.
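A sketch of that parse-and-forward stage; `RawRow`, `Block`, and `Message` are hypothetical stand-ins for the runtime's own types.

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for the runtime's internal types.
struct RawRow { block_number: u64, block_data: Vec<u8> }
struct Block { number: u64, payload: Vec<u8> }
struct Message { blocks: Vec<Block> }

fn parse_and_forward(
    rows: Vec<RawRow>,
    sender: &kanal::Sender<Message>,
) -> Result<(), kanal::SendError> {
    // Deserialize every row in parallel with Rayon.
    let blocks: Vec<Block> = rows
        .into_par_iter()
        .map(|r| Block { number: r.block_number, payload: r.block_data })
        .collect();
    // Wrap the batch in a Message and forward it to the runtime's main flow.
    sender.send(Message { blocks })
}
```

The receiving side of the channel would be created elsewhere in the runtime, e.g. with `let (tx, rx) = kanal::bounded(64);`.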
The DeltaLake block source's configuration requires two values: the table path and the query step.
The table path is the S3 path to the data directory in the object storage in use.
The query step is the number of blocks the source downloads in each round of querying.
The configuration is defined in config.toml in the following format:
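A hypothetical config.toml snippet for illustration; the section and key names here are assumptions, not the runtime's confirmed schema.

```toml
# Hypothetical section and key names; consult the runtime's
# configuration reference for the exact ones.
[block_source]
table_path = "s3://my-bucket/ethereum/blocks"
query_step = 1000
```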