
Geth v1.13 comes fairly close on the heels of the 1.12 release family, which is funky, considering its main feature has been in development for a cool 6 years now. 🤯

This post will go into a number of technical and historical details, but if you just want the gist of it, Geth v1.13.0 ships a new database model for storing the Ethereum state, which is both faster than the previous scheme, and also has proper pruning implemented. No more junk accumulating on disk and no more guerilla (offline) pruning!

[Benchmark table: database sizes for snap and full sync under the hash and path schemes.]

  • ¹Excluding ~589GB ancient data, the same across all configurations.
  • ²Hash scheme full sync exceeded our 1.8TB SSD at block ~15.43M.
  • ³Size difference vs snap sync attributed to compaction overhead.

Before going ahead though, a shoutout goes to Gary Rong who has been working on the crux of this rework for the better part of 2 years now! Amazing work and amazing endurance to get this huge chunk of work in!

Gory tech details

Ok, so what's up with this new data model and why was it needed in the first place?

In short, our old way of storing the Ethereum state did not allow us to efficiently prune it. We had a variety of hacks and tricks to accumulate junk slower in the database, but we nonetheless kept accumulating it indefinitely. Users could stop their node and prune it offline; or resync the state to get rid of the junk. But it was a very non-ideal solution.

In order to implement and ship real pruning, one that does not leave any junk behind, we needed to break a lot of eggs within Geth's codebase. Effort-wise, we'd compare it to the Merge, only restricted to Geth's internal level:

  • Storing state trie nodes by hashes introduces an implicit deduplication (i.e. if two branches of the trie share the same content (more probable for contract storages), they get stored only once). This implicit deduplication means that we can never know how many parents (i.e. different trie paths, different contracts) reference some node; and as such, we can never know what is safe and what is unsafe to delete from disk.
  • Any form of deduplication across different paths in the trie had to go before pruning could be implemented. Our new data model stores state trie nodes keyed by their path, not their hash. This slight change means that if previously two branches had the same hash and were stored only once; now they will have different paths leading to them, so even though they have the same content, they will be stored separately, twice.
  • Storing multiple state tries in the database introduces a different form of deduplication. For our old data model, where we stored trie nodes keyed by hash, the vast majority of trie nodes stay the same between consecutive blocks. This results in the same issue: we have no idea how many blocks reference the same state, preventing a pruner from operating effectively. Changing the data model to path-based keys makes storing multiple tries impossible altogether: the same path-key (e.g. empty path for the root node) will need to store different things for each block.
  • The second invariant we needed to break was the capability to store arbitrarily many states on disk. The only way to have effective pruning, as well as the only way to represent trie nodes keyed by path, was to restrict the database to contain exactly 1 state trie at any point in time. Originally this trie is the genesis state, after which it needs to follow the chain state as the head is progressing.
  • The simplest solution with storing 1 state trie on disk is to make it that of the head block. Unfortunately, that is overly simplistic and introduces two issues. Mutating the trie on disk block-by-block entails a lot of writes. Whilst it may not be that noticeable when in sync, importing many blocks (e.g. full sync or catchup) makes it unwieldy. The second issue is that before finality, the chain head might wiggle a bit across mini-reorgs. They are not common, but since they can happen, Geth needs to handle them gracefully. Having the persistent state locked to the head makes it very hard to switch to a different side-chain.
  • The solution is analogous to how Geth's snapshots work. The persistent state does not track the chain head, rather it is a number of blocks behind. Geth will always maintain the trie changes done in the last 128 blocks in memory. If there are multiple competing branches, all of them are tracked in memory in a tree shape. As the chain moves forward, the oldest (HEAD-128) diff layer is flattened down. This permits Geth to do blazing fast reorgs within the top 128 blocks, side-chain switches essentially being free.
  • The diff layers however do not solve the issue that the persistent state needs to move forward on every block (it would just be delayed). To avoid disk writes block-by-block, Geth also has a dirty cache in between the persistent state and the diff layers, which accumulates writes. The advantage is that since consecutive blocks tend to change the same storage slots a lot, and the top of the trie is overwritten all the time; the dirty buffer short-circuits these writes, which will never need to hit disk. When the buffer gets full however, everything is flushed to disk (see the sketch after this list for how these layers stack up).
  • With the diff layers in place, Geth can do 128 block-deep reorgs instantly. Sometimes however, it can be desirable to do a deeper reorg. Perhaps the beacon chain is not finalizing; or perhaps there was a consensus bug in Geth and an upgrade needs to "undo" a larger portion of the chain. Previously Geth could just roll back to an old state it had on disk and reprocess blocks on top. With the new model of having only ever 1 state on disk, there's nothing to roll back to.
  • Our solution to this issue is the introduction of a notion called reverse diffs. Every time a new block is imported, a diff is created which can be used to convert the post-state of the block back to its pre-state. The last 90K of these reverse diffs are stored on disk. Whenever a very deep reorg is requested, Geth can take the persistent state on disk and start applying diffs on top until the state is mutated back to some very old version. Then it can switch to a different side-chain and process blocks on top of that.
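
To make the layering a bit more tangible, here is a highly simplified Go sketch of the design described above. The type and function names are made up for this post and do not mirror Geth's actual internals; real database access, layer flattening and flushing are omitted.

    package main

    // diffLayer holds the trie-node changes made by one block, kept in memory.
    // Up to 128 of these are stacked (forking into a tree when side-chains
    // compete); the oldest one gets flattened downwards as the chain grows.
    type diffLayer struct {
        parent *diffLayer        // nil means this layer sits on the disk layer
        nodes  map[string][]byte // trie path -> new node content in this block
    }

    // diskLayer models the single persistent, path-keyed state trie plus the
    // dirty buffer that absorbs flattened writes until it is flushed in bulk.
    type diskLayer struct {
        dirty map[string][]byte // path -> node content awaiting a flush
    }

    // reverseDiff records what a block overwrote, so the persistent state can
    // be rolled back step by step for reorgs deeper than the diff layers.
    type reverseDiff struct {
        prev map[string][]byte // path -> node content before the block
    }

    // lookup resolves a read: newest diff layer first, then older ones, and
    // finally the disk layer (the actual on-disk lookup is omitted here).
    func lookup(top *diffLayer, disk *diskLayer, path string) ([]byte, bool) {
        for layer := top; layer != nil; layer = layer.parent {
            if node, ok := layer.nodes[path]; ok {
                return node, true
            }
        }
        node, ok := disk.dirty[path]
        return node, ok
    }

    func main() {
        disk := &diskLayer{dirty: map[string][]byte{"00": []byte("old root")}}
        head := &diffLayer{nodes: map[string][]byte{"00": []byte("new root")}}
        if node, ok := lookup(head, disk, "00"); ok {
            println(string(node)) // prints "new root": the newest layer wins
        }
    }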

The above is a condensed summary of what we needed to modify in Geth's internals to introduce our new pruner. As you can see, many invariants changed, so much so, that Geth essentially operates in a completely different way compared to how the old Geth worked. There is no way to simply switch from one model to the other.

We of course recognize that we can't just "stop working" because Geth has a new data model, so Geth v1.13.0 has two modes of operation (talk about OSS maintenance burden). Geth will keep supporting the old data model (furthermore it will stay the default for now), so your node will not do anything "funny" just because you updated Geth. You can even force Geth to stick to the old mode of operation longer term via --state.scheme=hash.

If you wish to switch to our new mode of operation however, you will need to resync the state (you can keep the ancients FWIW). You can do it manually or via geth removedb (when asked, delete the state database, but keep the ancient database). Afterwards, start Geth with --state.scheme=path. For now, the path-model is not the default one, but if a previous database already exists, and no state scheme is explicitly requested on the CLI, Geth will use whatever is inside the database. Our suggestion is to always specify --state.scheme=path just to be on the safe side. If no serious issues are surfaced in our path scheme implementation, Geth v1.14.x will probably switch over to it as the default format.
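
In command form, the switch boils down to roughly the following (add your usual flags; the removedb prompts may be worded slightly differently on your version):

    geth removedb                      # confirm removing the state database,
                                       # but keep the ancient/freezer data
    geth --state.scheme=path <your usual flags>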

A couple notes to keep in mind:

  • If you are running private Geth networks using geth init, you will need to specify --state.scheme for the init step too, otherwise you will end up with an old-style database.
  • For archive node operators, the new data model will be compatible with archive nodes (and will bring the same amazing database sizes as Erigon or Reth), but needs a bit more work before it can be enabled.

Also, a word of warning: Geth's new path-based storage is considered stable and production ready, but was obviously not battle tested yet outside of the team. Everyone is welcome to use it, but if a node crash or a consensus fault would carry significant risk for you, you might want to wait a bit and see if anyone with a lower risk profile hits any issues.

Now onto some side-effect surprises...

Semi-instant shutdowns

Head state missing, repairing chain... 😱

...the startup log message we're all dreading, knowing our node will be offline for hours... is going away!!! But before saying goodbye to it, let's quickly recap what it was, why it happened, and why it's becoming irrelevant.

Prior to Geth v1.13.0, the Merkle Patricia trie of the Ethereum state was stored on disk as a hash-to-node mapping. Meaning, each node in the trie was hashed, and the value of the node (whether leaf or internal node) was inserted in a key-value store, keyed by the computed hash. This was both very elegant from a mathematical perspective, and had a cute optimization that if different parts of the state had the same subtrie, those would get deduplicated on disk. Cute... and fatal.
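
For illustration, here's a tiny, hypothetical Go sketch contrasting the two keying approaches. The function names are made up for this post, and SHA-256 stands in for Ethereum's Keccak-256 purely to keep the example dependency-free:

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // hashKey mimics the old scheme: a node is stored under the hash of its
    // content, so identical subtries collapse onto one database entry and
    // nothing tracks how many tries still reference a given node.
    func hashKey(encodedNode []byte) []byte {
        sum := sha256.Sum256(encodedNode)
        return sum[:]
    }

    // pathKey mimics the new scheme: a node is stored under the trie path
    // that leads to it, so every position maps to exactly one key and
    // overwriting a path implicitly discards the previous node there.
    func pathKey(path []byte) []byte {
        return append([]byte("trienode-"), path...)
    }

    func main() {
        node := []byte("identical subtrie content")
        // Two occurrences of the same content share one hash key...
        fmt.Printf("hash key: %x\n", hashKey(node))
        // ...but land under two distinct path keys in the new layout.
        fmt.Printf("path key A: %x\n", pathKey([]byte{0x01, 0x02}))
        fmt.Printf("path key B: %x\n", pathKey([]byte{0x0a, 0x0b}))
    }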

When Ethereum launched, there was only archive mode. Every state trie of every block was persisted to disk. Simple and elegant. Of course, it soon became clear that the storage requirement of having all the historical state saved forever is prohibitive. Fast sync did help. By periodically resyncing, you could get a node with only the latest state persisted and then pile only subsequent tries on top. Still, the growth rate required more frequent resyncs than tolerable in production.

What we needed was a way to prune historical state that is not relevant anymore for operating a full node. There were a number of proposals, even 3-5 implementations in Geth, but each had such a huge overhead that we discarded them.

Geth ended up having a very complex ref-counting in-memory pruner. Instead of writing new states to disk immediately, we kept them in memory. As the blocks progressed, we piled new trie nodes on top and deleted old ones that weren't referenced by the last 128 blocks. As this memory area got full, we dripped the oldest, still-referenced nodes to disk. Whilst far from perfect, this solution was an enormous gain: disk growth got drastically cut, and the more memory given, the better the pruning performance.
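
For a rough intuition of that ref-counting, here's a toy Go sketch; it is heavily simplified and is not Geth's actual pruner code:

    package main

    // cachedNode is a trie node held in memory together with a count of how
    // many live parents (i.e. recent tries) still point at it.
    type cachedNode struct {
        blob []byte
        refs int
    }

    // pruner is a toy stand-in for the old in-memory, reference-counted pool.
    type pruner struct {
        nodes map[string]*cachedNode // node hash -> cached node
    }

    // reference is called when a newly imported block links to an existing node.
    func (p *pruner) reference(hash string) {
        if n, ok := p.nodes[hash]; ok {
            n.refs++
        }
    }

    // dereference is called when a block drops out of the 128-block window; a
    // node whose count hits zero is junk and is deleted from memory, so it
    // never needs to be written to disk at all.
    func (p *pruner) dereference(hash string) {
        n, ok := p.nodes[hash]
        if !ok {
            return
        }
        n.refs--
        if n.refs <= 0 {
            delete(p.nodes, hash)
        }
    }

    func main() {
        p := &pruner{nodes: map[string]*cachedNode{"abc": {blob: []byte{1}, refs: 1}}}
        p.reference("abc")    // a second block reuses the node
        p.dereference("abc")  // the first block ages out of the 128-block window
        p.dereference("abc")  // so does the second: the node is dropped, never written
        println(len(p.nodes)) // prints 0
    }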

The in-memory pruner however had a caveat: it only ever persisted very old, still live nodes; keeping anything remotely recent in RAM. When the user wanted to shut Geth down, the recent tries - all kept in memory - needed to be flushed to disk. But due to the data layout of the state (hash-to-node mapping), inserting hundreds of thousands of trie nodes into the database took many, many minutes (random insertion order due to hash keying). If Geth was killed sooner by the user or a service monitor (systemd, docker, etc.), the state stored in memory was lost.

On the next startup, Geth would detect that the state associated with the latest block never got persisted. The only resolution was to start rewinding the chain until a block was found with the entire state available. Since the pruner only ever drips nodes to disk, this rewind would usually undo everything until the last successful shutdown. Geth did occasionally flush an entire dirty trie to disk to dampen this rewind, but that still required hours of processing after a crash.

We dug ourselves a very deep hole:

  • The pruner needed as much memory as it could to be effective. But the more memory it had, the higher probability of a timeout on shutdown, resulting in data loss and chain rewind. Giving it less memory causes more junk to end up on disk.
  • State was stored on disk keyed by hash, so it implicitly deduplicated trie nodes. But deduplication makes it impossible to prune from disk, being prohibitively expensive to ensure nothing references a node anymore across all tries.
  • Reduplicating trie nodes could be done by using a different database layout. But changing the database layout would have made fast sync inoperable, as the protocol was designed specifically to be served by this data model.
  • Fast sync could be replaced by a different sync algorithm that does not rely on the hash mapping. But dropping fast sync in favor of another algorithm requires all clients to implement it first, otherwise the network splinters.
  • A new sync algorithm, one based on state snapshots, instead of tries is very effective, but it requires someone maintaining and serving the snapshots. It is essentially a second consensus critical version of the state.

It took us quite a while to get out of the above hole (yes, these were the laid out steps all along):

  • 2018: Snap sync's initial designs are made, the necessary supporting data structures are devised.
  • 2019: Geth starts generating and maintaining the snapshot acceleration structures.
  • 2020: Geth prototypes snap sync and defines the final protocol specification.
  • 2021: Geth ships snap sync and switches over to it from fast sync.
  • 2022: Other clients implement consuming snap sync.
  • 2023: Geth switches from hash to path keying.
      ◦ Geth becomes incapable of serving the old fast sync.
      ◦ Geth reduplicates persisted trie nodes to permit disk pruning.
      ◦ Geth drops in-memory pruning in favor of proper persistent disk pruning.

One request to other clients at this point is to please implement serving snap sync, not just consuming it. Currently Geth is the only participant of the network that maintains the snapshot acceleration structure that all other clients use to sync.

Where does this very long detour land us? With Geth's very core data representation swapped out from hash-keys to path-keys, we could finally drop our beloved in-memory pruner in exchange for a shiny new, on-disk pruner, which always keeps the state on disk fresh/recent. Of course, our new pruner also uses an in-memory component to make it a bit more optimal, but it primarily operates on disk, and its effectiveness is 100%, independent of how much memory it has to operate in.

With the new disk data model and reimplemented pruning mechanism, the data kept in memory is small enough to be flushed to disk in a few seconds on shutdown. But even so, in case of a crash or user/process-manager insta-kill, Geth will only ever need to rewind and reexecute a couple hundred blocks to catch up with its prior state.

Say goodbye to the long startup times, Geth v1.13.0 opens a brave new world (with --state.scheme=path, mind you).

Drop the --cache flag

No, we didn't drop the --cache flag, but chances are, you should!

Geth's --cache flag has a bit of a murky past, going from a simple (and ineffective) parameter to a very complex beast, whose behavior is fairly hard to convey and also to properly account for.

Back in the Frontier days, Geth didn't have many parameters to tweak to try and make it go faster. The only optimization we had was a memory allowance for LevelDB to keep more of the recently touched data in RAM. Interestingly, allocating RAM to LevelDB vs. letting the OS cache disk pages in RAM is not that different. The only time explicitly assigning memory to the database is beneficial is if you have multiple OS processes shuffling lots of data, thrashing each other's OS caches.

Back then, letting users allocate memory for the database seemed like a good shoot-in-the-dark attempt to make things go a bit faster. Turned out it was also a good shoot-yourself-in-the-foot mechanism, as Go's garbage collector really, really dislikes large idle memory chunks: the GC runs when it piles up as much junk as it had useful data left after the previous run (i.e. it will double the RAM requirement). Thus began the saga of Killed and OOM crashes...
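
As a rough illustration (assuming Go's default GOGC=100 behavior, where the heap is allowed to grow to about twice the live data before a collection runs): if the --cache allowance pins ~4GB of live objects inside Go's heap, the process can balloon to ~8GB before the GC kicks in, roughly doubling the memory you thought you had budgeted.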

Fast-forward half a decade and the --cache flag, for better or worse, evolved:

  • Depending on whether you're on mainnet or testnet, --cache defaults to 4GB or 512MB.
  • 50% of the cache allowance is allocated to the database to use as dumb disk cache.
  • 25% of the cache allowance is allocated to in-memory pruning, 0% for archive nodes.
  • 10% of the cache allowance is allocated to snapshot caching, 20% for archive nodes.
  • 15% of the cache allowance is allocated to trie node caching, 30% for archive nodes.

The overall size and each percentage could be individually configured via flags, but let's be honest, nobody understands how to do that or what the effect will be. Most users bumped the --cache up because it led to less junk accumulating over time (that 25% part), but it also led to potential OOM issues.
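
To put rough numbers on the mainnet default: with --cache=4096 (MB), that split works out to about 2048MB of database cache, 1024MB of pruning allowance, 410MB of snapshot cache and 614MB of trie node cache for a non-archive node.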

Over the past two years we've been working on a variety of changes, to soften the insanity:

  • Geth's default database was switched to Pebble, which uses caching layers outside of the Go runtime.
  • Geth's snapshot and trie node cache started using fastcache, also allocating outside of the Go runtime.
  • The new path schema prunes state on the fly, so the old pruning allowance was reassigned to the trie cache.

The net effect of all these changes is that using Geth's new path database scheme should result in 100% of the cache being allocated outside of Go's GC arena. As such, users raising or lowering it should not have any adverse effects on how the GC works or how much memory is used by the rest of Geth.

That said, the --cache flag also has no influence whatsoever anymore on pruning or database size, so users who previously tweaked it for this purpose can drop the flag. Users who just set it high because they had the available RAM should also consider dropping the flag and seeing how Geth behaves without it. The OS will still use any free memory for disk caching, so leaving it unset (i.e. lower) will possibly result in a more robust system.

Epilogue

As with all our previous releases, you can find the:

  • Source code, git tags and whatnot on our GitHub release page.
  • Pre-built binaries for all platforms on our downloads page.
  • Docker images published under ethereum/client-go.
  • Ubuntu packages in our Launchpad PPA repository.
  • OSX packages in our Homebrew Tap repository.