[RFC] Consume the Givens: Atomic Cross-Shard Counter Updates

Situation

I’d consider the “effectiveness” of this system to be the minimum number of market orders we can process per market, per second. The slowest market (or a stopped market) determines this number.

The risk engine is valuable because it inherently increases this number: it moves us closer to certainty that future throughput will be at least non-zero. If transactions halt due to a capital problem in even one market, the number we’re optimizing drops to zero.

Bad news.

Task

One problem: we don’t want to hold all account balances in memory at once on any one machine, and we don’t want to use any database other than our persistent message queues! (!!!)

This means we have to shard our counters, probably by hashing the counter’s name so that each counter deterministically lives on exactly one shard, however many shards there are. The tricky part is updating many counters atomically, on the condition that none go negative, when those counters may live on different shards. In a sentence, we solve this by performing a set of debits followed by a set of credits rather than trying to do both at once.
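
To make the sharding concrete, here’s a minimal sketch of the name-to-shard mapping I have in mind (a stable hash plus modulo; the function name is mine, and a consistent-hashing scheme could replace the modulo if we want assignments to survive shard-count changes):

    import hashlib

    def shard_for(counter_name: str, num_shards: int) -> int:
        # Stable hash of the counter name so every producer and consumer
        # agrees on which single shard owns a given counter.
        digest = hashlib.sha256(counter_name.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_shards

    # e.g. shard_for("alice:USD", 16) always maps to the same shard index.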

Action

Let’s say each “Count” message in the “SecureCounter” queue contains a mapping of counter names (strings) to integer deltas; every delta must be a positive, non-zero integer. Non-existent counters are initialized to zero before the delta is applied.
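
As a sketch of how a shard might apply a Count message (the function shape and the validation are my assumptions):

    def apply_count(counters: dict[str, int], deltas: dict[str, int]) -> None:
        """Apply a "Count" message: credit each named counter by its delta."""
        for name, delta in deltas.items():
            if not isinstance(delta, int) or delta <= 0:
                raise ValueError(f"Count deltas must be positive non-zero integers: {name}={delta}")
            # Non-existent counters are initialized to zero before the delta is applied.
            counters[name] = counters.get(name, 0) + delta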

The SecureCounter queue also supports “Lock” messages containing a random ID plus the counters and positive amounts to lock from each. We then need a “Given” message, which contains the requesting Lock ID and the counters/amounts locked. Multiple Given messages appear for one Lock because the counters being locked live in memory on different shards. When some amount is “Given” from a counter, that counter has already been (!) decreased by that positive amount. The “Given” message tokenizes a debit so that the amount can be consumed by application logic.
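
Here’s roughly how I picture a Counter shard handling its slice of a Lock and emitting the Given (or a Fail). The emit_given and emit_fail callables stand in for whatever the queue API ends up being, and the all-or-nothing check is the zero-timeout behaviour described under Questions:

    def handle_lock_slice(counters: dict[str, int], lock_id: str,
                          requested: dict[str, int],
                          emit_given, emit_fail) -> None:
        """Debit the locally-owned counters named in a Lock and emit a Given."""
        # Check every locally-owned counter first so the debit is all-or-nothing
        # on this shard; no counter is ever allowed to go negative.
        if any(counters.get(name, 0) < amount for name, amount in requested.items()):
            emit_fail(lock_id)
            return
        for name, amount in requested.items():
            counters[name] -= amount            # the actual debit
        emit_given(lock_id, dict(requested))    # tokenize the debit for the requester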

The Lock is obtained on the request side once every requested amount has a matching Given message. So now we can request to Lock multiple amounts from multiple counters across shards, atomically, prior to trade execution. The trade is executed with the “given” liquidity and “given” input amount; the result is a mapping of counters to positive deltas, sent as a “Count” message to the SecureCounter queue. This write-set always goes through because it only makes credits (“consume the Givens”).
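
On the request side, the whole flow might look something like the sketch below; the queue object and its send/wait_for calls are placeholders (not a real API), and rolling back partial Givens after a Fail is left out because it isn’t specified here:

    import uuid

    def locked_trade(queue, wanted: dict[str, int], execute_trade) -> dict[str, int]:
        """Lock inputs across shards, execute the trade, then credit the results."""
        lock_id = str(uuid.uuid4())
        queue.send("SecureCounter", {"type": "Lock", "id": lock_id, "amounts": wanted})

        # The Lock is held once every requested amount has a matching Given.
        given: dict[str, int] = {}
        while any(given.get(n, 0) < amt for n, amt in wanted.items()):
            msg = queue.wait_for(lock_id)           # next Given or Fail for this Lock ID
            if msg["type"] == "Fail":
                raise RuntimeError("Lock failed")   # handling of partial Givens omitted
            for name, amount in msg["amounts"].items():
                given[name] = given.get(name, 0) + amount

        # "Consume the Givens": the trade's outputs are pure credits, so this
        # Count message always goes through.
        credits = execute_trade(given)
        queue.send("SecureCounter", {"type": "Count", "deltas": credits})
        return credits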

Results

- This allows us to split execution horizontally across many machines without balance memory concerns.

  • This allows us to create a “market” for liquidity where traders request to execute against some percent of the pool and must wait for that liquidity to be available before execution. In most cases every trade will get all of the liquidity, but in competitive markets we may see voluntary liquidity splitting. Requested liquidity is filled in priority of (request size, request time); see the sketch after this list.

  • This further reduces the exchange’s risk, since it drives trades in the most competitive markets towards using less liquidity for the same inputs. It’s important to note that this is not arbitrary: it’s a natural consequence of the horizontal execution that competitive markets must undergo to be fully secured.
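
For the fill priority mentioned in the first bullet, here’s the ordering I have in mind as a sketch; the text only says (request size, request time), so the choice of larger requests first, with earlier arrival breaking ties, is my assumption:

    from dataclasses import dataclass

    @dataclass
    class LiquidityRequest:
        size: int      # requested share of the pool
        time: float    # arrival timestamp
        trader: str

    def fill_order(requests: list[LiquidityRequest]) -> list[LiquidityRequest]:
        # Priority of (request size, request time): larger requests first
        # (assumed), earlier arrival breaking ties.
        return sorted(requests, key=lambda r: (-r.size, r.time))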

Questions

Q: What do we do if one of the counters in the Lock fails to debit because its balance is too low?

A: Configure a timeout on the Lock. If it’s zero, all amounts must be acquired immediately, or the shard passes a Fail message with the request Lock ID. If it’s non-zero, wait up to that many seconds for funds, then pass a Fail message with the request Lock ID.
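
A sketch of how a shard could apply that timeout to its slice of a Lock; the message shapes and emit_* callbacks carry over from the earlier sketches, and the polling loop is just an assumption to keep the example short (in practice this would be event-driven):

    import time

    def debit_with_timeout(counters: dict[str, int], lock_id: str,
                           requested: dict[str, int], timeout_s: float,
                           emit_given, emit_fail, poll_s: float = 0.01) -> None:
        """Wait up to timeout_s for the requested funds, else Fail the Lock."""
        deadline = time.monotonic() + timeout_s
        while True:
            if all(counters.get(n, 0) >= amt for n, amt in requested.items()):
                for n, amt in requested.items():
                    counters[n] -= amt
                emit_given(lock_id, dict(requested))
                return
            # A zero timeout means "acquire immediately or Fail".
            if timeout_s == 0 or time.monotonic() >= deadline:
                emit_fail(lock_id)
                return
            time.sleep(poll_s)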

So I went and tried to implement this. I found a few challenges:

  • We want to somehow hold all exchange liquidity balances in memory on AMM shards.
    This can be accomplished by having each AMM shard maintain read-only copies of the subset of counters it cares about. To lock liquidity for a market, the AMM still has to send a Lock request to be handled by the Counter shards. We accept that these read-only balances will not be perfectly synchronized, but this is probably OK for risk purposes.
  • We don’t need to split liquidity at all.
    On an AMM shard, we can pull from the message queue (for trades) until it’s empty and execute these trades one-by-one against the locked liquidity in memory. We execute in batch like this because it lets us secure the balances (and calculate risk) all at once, meaning we pay for only one Lock of latency. Then (after making the trades in memory) we can send one big message updating balances; this always goes through because it only credits amounts. We can get through hundreds or even thousands (tens of thousands?) of trades per second, per market this way. See the sketch below.
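
A sketch of that batch loop on the AMM shard; queue.poll, execute, and send_count are placeholders for the real queue client and matching logic:

    def run_batch(queue, locked_liquidity: dict[str, int], execute, send_count) -> None:
        """Drain pending trades, execute them in memory, then credit results once."""
        # Pull trades until the queue is empty so the whole batch shares a single
        # Lock of latency for securing balances and calculating risk.
        trades = []
        while (trade := queue.poll("Trades")) is not None:
            trades.append(trade)

        credits: dict[str, int] = {}
        for trade in trades:
            # Executes against the locked, in-memory liquidity; returns output deltas.
            for name, delta in execute(trade, locked_liquidity).items():
                credits[name] = credits.get(name, 0) + delta

        # One big balance update at the end; it always goes through because it
        # only credits amounts.
        if credits:
            send_count(queue, credits)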