TLDR
We formalize choosing funding rates as an adversarial bandit problem and run Monte Carlo simulations using an agent-based model to find the bandit algorithm with the best performance. See the full write-up and appendix here.
Motivation
In our previous post New Framework for Funding Rates, we investigated the current landscape of funding rates for perpetual futures and explored two new approaches: risk-based and adaptive. In this write-up, we look to unify both approaches by inheriting a profit maximization objective and formalizing the setting of funding rates as an adversarial multi-armed bandit problem.
Problem Formalization
Since AMM-based exchanges receive funding payments due to a long-short imbalance, we frame the problem of setting variable funding rates similarly to that of a market maker setting bid-ask spreads to maximize profit. Let L_t, S_t \in \mathbb{R}^+ be the long and short open interest (in number of contracts) at time t, respectively. For terminal time T, define cumulative realized PnL as RPnL_T := F^t_T + F^f_T + X_T, where F^t_T is cumulative trading fees received, F^f_T is cumulative excess funding received (funding paid from the overweight side to the exchange), and X_T is realized PnL against traders due to the AMM-based exchange acting as the implicit counterparty to traders. Define total unrealized PnL as UPnL_T := -P^{liq}_T(L_T - S_T), where P^{liq}_T is the liquidation value of the imbalance liability resulting from the long-short imbalance. Hence, our objective function is

\max \mathbb{E}[RPnL_T + UPnL_T]
We choose a reinforcement learning (RL) approach to reduce model risk and apply a multi-armed bandit (MAB) framework due to the computational constraints of blockchains. In particular, we use adversarial bandits to handle our highly dynamic context.
Arms
Consider each round as a time t. Define realized PnL per round as

\Delta RPnL_t := f^t_t V_t + f^f_t |L_t - S_t| P_t + \Delta X_t
where f^t_t is the trading fee (%), V_t is trading volume, f^f_t is the funding rate (%), P_t is the index price, and \Delta X_t := X_t - X_{t-1} is realized PnL against traders in round t. Define unrealized PnL per round as

\Delta UPnL_t := -\Delta P_{t+1}(L_t - S_t)
where \Delta P_{t+1} := P_{t+1}-P_{t}. Note that \sum_{t=1}^{T}(\Delta RPnL_t + \Delta UPnL_t) = RPnL_T + UPnL_T. To reduce model risk, we utilise observable metrics and choose the parametrization for the desired funding received by the exchange per round to be
where \alpha \in (0, 1]. Intuitively, we assume that (1-\alpha) of the realized PnL is set aside for external LPs and set \lambda_t as our arm for the bandit algorithms, essentially choosing the rate at which any potential loss is covered with \alpha of funding payments (see the full write-up for the derivation). During each round, the choice of an arm is constrained to the subset of arms that do not result in a funding rate f^f_t \geq 1, which would charge traders more funding than their total collateral. Note that we refer to this arm parametrization as Risk Decay (\alpha) in our simulation code.
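As a minimal sketch of the per-round accounting and the arm constraint above (the `implied_funding_rate` mapping from an arm \lambda_t to its funding rate is a hypothetical placeholder; the actual Risk Decay mapping follows the derivation in the full write-up):

```python
import numpy as np

def round_pnl(fee_rate, volume, funding_rate, price, d_price,
              long_oi, short_oi, d_x):
    """Per-round realized and unrealized PnL for the AMM.

    fee_rate     -- trading fee f^t_t (fraction of volume)
    volume       -- trading volume V_t
    funding_rate -- funding rate f^f_t (fraction of notional)
    price        -- index price P_t
    d_price      -- price change P_{t+1} - P_t
    long_oi, short_oi -- open interest L_t, S_t (in contracts)
    d_x          -- realized PnL against traders this round
    """
    imbalance = long_oi - short_oi
    # Realized: trading fees + excess funding from the overweight side + trader PnL
    d_rpnl = fee_rate * volume + funding_rate * abs(imbalance) * price + d_x
    # Unrealized: the AMM is the implicit counterparty to the net imbalance
    d_upnl = -d_price * imbalance
    return d_rpnl, d_upnl

def feasible_arms(arms, state, implied_funding_rate):
    """Indices of arms whose implied funding rate stays below 1, so that
    traders are never charged more funding than their total collateral."""
    rates = np.array([implied_funding_rate(lam, state) for lam in arms])
    return np.flatnonzero(rates < 1.0)
```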
Reward Function
We consider asymmetrically dampened PnL as our reward function, which has been used in a number of previous works on market making with reinforcement learning (Gašperov et al., 2021). More precisely, we define our reward function as the sum of realized PnL and unrealized PnL with an additional penalty term on positive unrealized PnL:

r_t := \Delta RPnL_t + \Delta UPnL_t - \eta \max(\Delta UPnL_t, 0)
where \eta \in [0, 1]. As \eta increases, we add a larger penalty for positive unrealized PnL while keeping negative unrealized PnL intact, thereby discouraging the AMM from profiting through speculation by holding a long-short imbalance. In practice, the reward must be normalized to \tilde{r}_t \in [0,1] (see the full write-up for more details). Note that we refer to this reward function as RwDU (\eta), realized PnL with dampened unrealized PnL, in our simulation code.
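A minimal sketch of RwDU; the min-max normalization bounds are an assumption here, since the exact normalization is given in the full write-up:

```python
def rwdu_reward(d_rpnl, d_upnl, eta):
    """Asymmetrically dampened PnL: penalize only positive unrealized PnL."""
    return d_rpnl + d_upnl - eta * max(d_upnl, 0.0)

def normalize_reward(r, r_min, r_max):
    """Clip and rescale to [0, 1]; r_min/r_max are assumed per-round
    worst- and best-case reward bounds."""
    return min(max((r - r_min) / (r_max - r_min), 0.0), 1.0)
```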
Policy
We employ multiple adversarial bandit algorithms from the literature: \varepsilon-greedy (Vermorel and Mohri, 2005), Follow-the-Perturbed-Leader (Abernethy and Kale, 2013), Exp3 (Auer et al., 2002), and Boltzmann-Gumbel (Cesa-Bianchi et al., 2017). Given the normalized reward \tilde{r}_t, each algorithm updates the weight assigned to each arm i from p^i_{t} to p^i_{t+1}, with which it selects the next arm. While Exp3 is commonly used for adversarial bandit problems due to its sublinear regret bound (performance loss against the best arm in hindsight), we explore other algorithms as well since practical performance tends to depend on the context. In particular, multiple studies have shown that the \varepsilon-greedy algorithm, despite having a linear regret bound with a constant \varepsilon, produces comparable empirical results and is worth simulating (Vermorel and Mohri, 2005).
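For concreteness, a sketch of the Exp3 update (Auer et al., 2002) with exploration rate \gamma; this is the textbook form rather than our exact implementation:

```python
import numpy as np

class Exp3:
    """Exp3 adversarial bandit over k arms; rewards must lie in [0, 1]."""

    def __init__(self, n_arms, gamma, seed=0):
        self.gamma = gamma
        self.weights = np.ones(n_arms)
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        # Mix the normalized weights with uniform exploration
        w = self.weights / self.weights.sum()
        return (1 - self.gamma) * w + self.gamma / len(self.weights)

    def select_arm(self):
        return self.rng.choice(len(self.weights), p=self.probabilities())

    def update(self, arm, reward):
        # Importance-weighted reward estimate for the chosen arm only
        x_hat = reward / self.probabilities()[arm]
        self.weights[arm] *= np.exp(self.gamma * x_hat / len(self.weights))
```

Each round, `select_arm` draws from the mixed distribution and `update` reweights only the arm that was actually pulled.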
Simulation
We construct an agent-based model (ABM) and perform Monte Carlo simulations to examine the PnL profiles of different bandit parametrizations. Inspired by the Lux-Marchesi multi-agent model, we include three agent types in our ABM: chartists (c), fundamentalists (f), and funding rate arbitrageurs (a). Each agent i of type \in \{c, f, a\} is endowed with wealth w^{type}_i \sim \text{Pareto}(1, \alpha) and assigned a time horizon h^{type}_i \sim U(1, h^{type}_{max}) with granularity g^{type}_i, which it uses as the lookback window length to calculate moving averages.
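A sketch of the endowment sampling under these distributions (numpy's `pareto` draws from a Lomax distribution, so adding 1 yields a Pareto with minimum 1; horizons are treated as integer lookback windows here, and the parameter values are purely illustrative):

```python
import numpy as np

def init_agents(n, alpha, h_max, rng):
    """Sample wealth w ~ Pareto(1, alpha) and horizon h ~ U(1, h_max)."""
    wealth = 1.0 + rng.pareto(alpha, size=n)       # Pareto with x_min = 1
    horizons = rng.integers(1, h_max + 1, size=n)  # lookback window lengths
    return wealth, horizons

rng = np.random.default_rng(42)
chartist_wealth, chartist_horizons = init_agents(100, 1.5, 50, rng)
```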
Agents
- Chartists are speculative agents who trade on price trends. Chartists buy if P_t > P^{ma}_t and sell otherwise, where P^{ma}_t = \frac{1}{h^c} \sum_{s=t - h^c}^{t} P_s is the moving average price at time t. Chartists size their orders \propto \frac{w^c|P_t - P^{ma}_t|}{P_t} in number of contracts.
- Fundamentalists are stabilizing agents who expect the price to revert to its fundamental value. We capture the fundamental value by a moving average price with larger granularity g^f > g^c (hence longer time horizons) than that of chartists. Fundamentalists buy if P_{t} < P^{ma}_t and sell otherwise, sizing their orders similarly to chartists.
- Funding rate arbitrageurs (henceforth arbitrageurs) receive funding by helping to correct the long-short imbalance while remaining delta-neutral. Arbitrageurs observe a moving average of the funding rate f^{ma}_t and are assigned a market friction fr \sim U(0, fr_{max}). If the expected funding received over their horizon is greater than the market friction, i.e. f^{ma}_t \cdot h^{a} > fr, they open a position in the direction of receiving funding (buy if f^{ma}_t < 0, sell otherwise). They close their position once f^{ma}_t \cdot h^{a} \leq fr. Arbitrageurs size their orders \propto f^{ma}_t \cdot h^{a} - fr.

During each step of the simulation, the AMM selects a funding rate via the bandit algorithm after observing the reward from the previous step. The AMM charges funding to traders who have open positions and then proceeds to accept new orders. Agents with negative or zero wealth are removed from the simulation. See below for a flowchart of the simulation.
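The same step order as a simplified sketch; the `amm` and `agent` methods are hypothetical placeholders for the corresponding model components:

```python
def simulation_step(amm, bandit, agents, prev_arm):
    """One simulation round, in the order described above."""
    # 1. Update the bandit with the reward observed for the previous arm,
    #    then select and apply this round's arm (funding rate)
    if prev_arm is not None:
        bandit.update(prev_arm, amm.last_round_reward())
    arm = bandit.select_arm()
    amm.set_funding_rate(arm)        # Risk Decay mapping from arm to f^f_t

    # 2. Charge funding to agents with open positions, then accept new orders
    amm.charge_funding(agents)
    for agent in agents:
        agent.trade(amm)             # chartist / fundamentalist / arbitrageur rules

    # 3. Remove agents with non-positive wealth
    agents[:] = [a for a in agents if a.wealth > 0]
    return arm
```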
Importantly, chartists and fundamentalists do not make trading decisions based on the funding rate, but are nonetheless charged funding. If the funding rate remains excessively high for a long period of time, these agents lose wealth, thereby reducing trading volume and hence the exchange's fee revenue.
Results
To keep our calibration procedure simple, we assume an exogenous price time series and look to replicate trading activity (see details in the full write-up). We use grid search and conduct Monte Carlo simulations to obtain 95% confidence intervals for the mean terminal PnL under each parametrization, finding the following optimal parametrization:
(See our simulation code here)
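As a sketch, the interval for each parametrization can be computed from the Monte Carlo runs with a normal approximation:

```python
import numpy as np

def mean_terminal_pnl_ci95(terminal_pnls):
    """95% confidence interval for mean terminal PnL across Monte Carlo runs."""
    x = np.asarray(terminal_pnls, dtype=float)
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half_width, x.mean() + half_width
```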
Discussion
There are several limitations that can be addressed in future work. First, a profit maximization objective is better suited to a market maker and does not fully capture our objective of designing a "good" exchange. This is evident from the high simulated trading volume that persists despite high funding rates, driven by funding rate arbitrageurs and reminiscent of The Perpetual PvP Ponzi. Instead, a welfare maximization perspective, with an objective of minimizing both the cost of funding and insolvency risk, could be the next framework to explore.
High funding rates also motivated our approach of constraining the choice of arms, which should be investigated further in terms of theoretical regret bounds and empirical performance. Importantly, there remains a problematic edge case when funding charged exceeds the overweight side's total collateral and unrealized PnL, resulting in insolvency. It follows that additional guard rails, such as open interest caps, may have to be implemented to complement a MAB framework.
The calibration and validation of our ABM can also be improved in various ways. We observed that if the exchange happened to hold a favorable imbalance liability, the resulting PnL would be tremendously positive. To combat this, chartists and fundamentalists could be made more intelligent by refusing to trade when the funding rate is unreasonably high. Alternatively, we could simulate using price paths from other dates. As for choosing the best bandit algorithm, the optimal algorithm appears to be rather context-dependent.
While our current approach is to have a single optimization algorithm choosing an appropriate arm (funding rate), it is of interest to explore how this can be extended to a more decentralized setting (similar to how Aera aggregates portfolio submissions).
Questions
- What objectives, other than profit maximization, are worth exploring?
- Is constraining the arms optimal or are there solutions that can be implemented to prevent the aforementioned edge case?
- How can the ABM be improved to reflect more realistic trading activity?