Bandit Funding Rates


We formalize choosing funding rates as an adversarial bandit problem and run Monte Carlo simulations using an agent-based model to find the bandit algorithm with the best performance. See the full write-up and appendix here.


In our previous post New Framework for Funding Rates, we investigated the current landscape of funding rates for perpetual futures and explored two new approaches: risk-based and adaptive. For this write-up, we look to unify both approaches by inheriting a profit maximization objective and formalizing setting funding rates as an adversarial multi-armed bandit problem.

Problem Formalization

Since AMM-based exchanges receive funding payments due to a long-short imbalance, we frame the problem of setting variable funding rates similarly to that of a market maker setting bid-ask spreads to maximize profit. Let L_t, S_t \in \mathbb{R}^+ be the long and short open interest (in number of contracts) at time t, respectively. For terminal time T, define cumulative realized PnL as RPnL_T := F^t_T + F^f_T + X_T, where F^t_T is cumulative trading fees received, F^f_T is cumulative excess funding received (funding paid from the overweight side to the exchange), and X_T is realized PnL against traders, which arises because the AMM-based exchange acts as their implicit counterparty. Define total unrealized PnL as UPnL_T := -P^{liq}_T(L_T - S_T), where P^{liq}_T is the liquidation value of the imbalance liability resulting from the long-short imbalance. Hence, our objective function is

\text{maximize} \quad RPnL_T + UPnL_T

We choose a reinforcement learning (RL) approach to reduce model risk and apply a multi-armed bandit (MAB) framework due to the computational constraints of blockchain. In particular, we use adversarial bandits for our highly dynamic context.


Consider each round as a time t. Define realized PnL per round as

\Delta RPnL_{t+1} := P_tf^t_tV_t + P_tf^f_t(L_t - S_t)

where f^t_t is trading fee (%), V_t is trading volume, f^f_t is the funding rate (%), and P_t is the index price. Define unrealized PnL per round as

\Delta UPnL_{t+1} := -(\Delta P_{t+1})(L_t - S_t)

where \Delta P_{t+1} := P_{t+1} - P_{t}. Note that \sum_{t=1}^{T}(\Delta RPnL_t + \Delta UPnL_t) = RPnL_T + UPnL_T. To reduce model risk, we utilize observable metrics and choose the parametrization for desired funding received by the exchange per round to be

F^f_t = -\frac{1}{\alpha\lambda_t}\cdot \min(0, \sum_{s=1}^t(\alpha \Delta RPnL_s + \Delta UPnL_s))

where \alpha \in (0, 1]. Intuitively, we assume that (1-\alpha) of the realized PnL is set aside for external LPs and set \lambda_t as our arm for bandit algorithms, essentially choosing the rate at which any potential loss is covered with \alpha of funding payments (see the full write-up for the derivation). During each round, the choice of an arm is constrained to the subset of arms that do not result in a funding rate f_t^f \geq 1, which would charge traders more funding than their total collateral. Note that we refer to this arm parametrization as Risk Decay (\alpha) in our simulation code.
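To make the per-round quantities and the arm parametrization concrete, here is a minimal Python sketch of the formulas above. It is not the simulation code linked in this post. The function names are hypothetical, and the mapping from desired funding to a funding rate inside `feasible_arms` (dividing by the notional value of the imbalance, P_t |L_t - S_t|) is an assumption made for illustration; the actual derivation is in the full write-up.

```python
def delta_rpnl(price, fee_rate, volume, funding_rate, long_oi, short_oi):
    """Per-round realized PnL: trading fees plus funding on the imbalance."""
    imbalance = long_oi - short_oi
    return price * fee_rate * volume + price * funding_rate * imbalance


def delta_upnl(price, next_price, long_oi, short_oi):
    """Per-round unrealized PnL: the AMM is short the net imbalance L - S."""
    return -(next_price - price) * (long_oi - short_oi)


def desired_funding(alpha, lam, d_rpnl, d_upnl):
    """F^f_t = -(1 / (alpha * lam)) * min(0, sum_s(alpha * dRPnL_s + dUPnL_s))."""
    cumulative = sum(alpha * r + u for r, u in zip(d_rpnl, d_upnl))
    # max(0, -x) equals -min(0, x) and avoids returning a negative zero.
    return max(0.0, -cumulative) / (alpha * lam)


def feasible_arms(arms, alpha, d_rpnl, d_upnl, price, long_oi, short_oi):
    """Keep the arms (lambda values) whose implied funding rate is below 100%.

    ASSUMPTION: the implied rate is desired funding divided by the notional
    imbalance P * |L - S|. This mapping is illustrative, not from the post.
    """
    notional = price * abs(long_oi - short_oi)
    if notional == 0:
        # ASSUMPTION: with no imbalance nothing is charged, so keep every arm.
        return list(arms)
    return [
        lam for lam in arms
        if desired_funding(alpha, lam, d_rpnl, d_upnl) / notional < 1.0
    ]
```

A larger \lambda_t spreads the recovery of a cumulative loss over more rounds, so it implies a smaller charge per round; this is why only the small-\lambda arms get filtered out when the loss is large relative to the imbalance.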

Reward Function

We consider asymmetrically dampened PnL as our reward function, which is used in a number of previous works on market making with reinforcement learning (Gašperov et al., 2021). More precisely, we define our reward function as the sum of realized PnL and unrealized PnL with an additional penalty term on positive unrealized PnL:

r_t = \Delta RPnL_t + \Delta UPnL_t - \eta \cdot \max(0, \Delta UPnL_t)

where \eta \in [0, 1]. As \eta increases, we add more penalty for positive unrealized PnL while keeping negative unrealized PnL intact, thereby discouraging the AMM from gaining profits through speculation by holding a long-short imbalance. In practice, the reward must be normalized to \tilde{r}_t such that \tilde{r}_t \in [0,1] (see the full write-up for more details). Note that we refer to this reward function as RwDU (\eta), realized PnL with dampened unrealized PnL, in our simulation code.
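As a small illustration of the reward above, the sketch below computes r_t and then rescales it to [0, 1]. The function names are hypothetical. The normalization shown (min-max scaling with clipping against assumed bounds `r_min` and `r_max`) is only one possible choice and is an assumption here; the normalization actually used is described in the full write-up.

```python
def rwdu_reward(d_rpnl, d_upnl, eta):
    """r_t = dRPnL_t + dUPnL_t - eta * max(0, dUPnL_t).

    Positive unrealized PnL is dampened by eta; negative unrealized PnL
    passes through unchanged.
    """
    return d_rpnl + d_upnl - eta * max(0.0, d_upnl)


def normalize_reward(reward, r_min, r_max):
    """Rescale a raw reward to [0, 1].

    ASSUMPTION: min-max scaling with clipping against caller-supplied bounds.
    """
    if r_max <= r_min:
        raise ValueError("r_max must be greater than r_min")
    scaled = (reward - r_min) / (r_max - r_min)
    return min(1.0, max(0.0, scaled))
```

For example, with \eta = 0.5 a round with \Delta RPnL_t = 2 and \Delta UPnL_t = 4 yields a reward of 4 rather than 6, while the same round with \Delta UPnL_t = -4 yields the full -2.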


We employ multiple adversarial bandit algorithms used in the literature: \varepsilon-greedy (Vermorel and Mohri, 2005), Follow-the-Perturbed-Leader (Abernethy and Kale, 2013), Exp3 (Auer et al., 2002), and Boltzmann-Gumbel (Cesa-Bianchi et al., 2017). Given the normalized reward \tilde{r}_t, each algorithm updates the weight assigned to each arm i from p^i_{t} to p^i_{t+1} and uses the updated weights to select the next arm. While Exp3 is commonly used for adversarial bandit problems due to its sublinear regret bound (regret being the performance loss against the best arm in hindsight), we explore other algorithms as well, since practical performance tends to depend on the context. In particular, multiple studies have shown that the \varepsilon-greedy algorithm, despite having a linear regret bound with a constant \varepsilon, produces comparable empirical results and is worth simulating (Vermorel and Mohri, 2005).
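For readers unfamiliar with Exp3, the following is a minimal sketch of the textbook algorithm (Auer et al., 2002) with an exploration parameter \gamma. It is not the simulation code linked in this post; the class and method names are hypothetical, and the per-round restriction to a feasible subset of arms is left out for brevity.

```python
import math
import random


class Exp3:
    """Textbook Exp3 for rewards in [0, 1]."""

    def __init__(self, n_arms, gamma, seed=None):
        self.n_arms = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the weight-proportional distribution with uniform exploration."""
        total = sum(self.weights)
        return [
            (1 - self.gamma) * w / total + self.gamma / self.n_arms
            for w in self.weights
        ]

    def select_arm(self):
        """Sample an arm index according to the current probabilities."""
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs, k=1)[0]

    def update(self, arm, reward):
        """Update the played arm with an importance-weighted reward estimate."""
        prob = self.probabilities()[arm]
        estimate = reward / prob  # unbiased estimate of the arm's reward
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale to avoid overflow; probabilities are invariant to scaling.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]
```

In use, each round calls `select_arm()`, maps the returned index to a candidate \lambda_t, observes the normalized reward \tilde{r}_t, and passes it to `update()`. Only the played arm is updated, which is what distinguishes the bandit setting from full-information learning.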


We construct an agent-based model (ABM) and perform Monte Carlo simulations to examine the PnL profiles of different bandit parametrizations. Inspired by the Lux-Marchesi multi-agent model, we include three agent types in our ABM: chartists (c), fundamentalists (f), and funding rate arbitrageurs (a). Each agent i with type \in \{c, f, a\} is endowed with wealth w^{type}_i \sim \text{Pareto}(1, \alpha) and assigned a time horizon h^{type}_i \sim U(1, h^{type}_{max}) with granularity g^{type}_i, which the agent uses as the lookback window length to calculate moving averages.


  • Chartists are speculative agents who trade on price trends. Chartists will buy if P_t > P^{ma}_t and sell otherwise, where P^{ma}_t = \frac{1}{h^c} \sum_{s=t - h^c + 1}^{t} P_s is the moving average price over the last h^c observations at time t. Chartists size their orders \propto \frac{w^c|P_t - P^{ma}_t|}{P_t} in number of contracts.
  • Fundamentalists are stabilizing agents who expect the price to revert to its fundamental value. We capture the fundamental value by a moving average price with larger granularity g^f > g^c (hence longer time horizons) than chartists. Fundamentalists will buy if P_{t} < P^{ma}_t and sell otherwise, sizing their orders similarly to chartists.
  • Funding rate arbitrageurs (henceforth arbitrageurs) receive funding by helping to correct the long-short imbalance while remaining delta-neutral. Arbitrageurs observe a moving average of the funding rate f^{ma}_t and are assigned a market friction fr \sim U(0, fr_{max}). If the expected funding received is greater than market friction, i.e. f^{ma}_t \cdot h^{f} > fr, then they will open a position in the direction of receiving funding (buy if f^{ma}_t < 0, sell otherwise). They will close their position if f^{ma}_t \cdot h^{f} \leq fr. Arbitrageurs size their orders \propto f^{ma}_t \cdot h^{f} - fr.

During each step of the simulation, the AMM selects a funding rate via bandit algorithm after observing a reward from the previous step. The AMM charges funding from traders who have positions open and then proceeds to accept new orders. Agents with negative or zero wealth are removed from the simulation. See below for a flowchart of the simulation.

Importantly, neither chartists nor fundamentalists make trading decisions based on the funding rate, but both are nonetheless charged funding. If the funding rate remains overly high for a long period of time, these agents will have less wealth, thereby reducing trading volume and hence the exchange’s fee revenue.
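The three decision rules described above can be sketched in Python as follows. This is an illustration, not the simulation code linked in this post. The function names and the proportionality constant `scale` are hypothetical, orders are signed (positive means buy), and the moving average is taken as a plain mean over the last `horizon` prices. For arbitrageurs the sketch compares the magnitude of the average funding rate with the friction so that both directions are reachable; that reading of the rule is an assumption.

```python
def moving_average(values, horizon):
    """Plain mean over the last `horizon` observations."""
    window = values[-horizon:]
    return sum(window) / len(window)


def chartist_order(prices, horizon, wealth, scale=1.0):
    """Trend follower: buy above the moving average, sell otherwise."""
    price = prices[-1]
    average = moving_average(prices, horizon)
    size = scale * wealth * abs(price - average) / price
    return size if price > average else -size


def fundamentalist_order(prices, horizon, wealth, scale=1.0):
    """Mean reverter: buy below the moving average, sell otherwise."""
    price = prices[-1]
    average = moving_average(prices, horizon)
    size = scale * wealth * abs(price - average) / price
    return size if price < average else -size


def arbitrageur_target(funding_ma, horizon, friction, scale=1.0):
    """Target position of a funding rate arbitrageur.

    ASSUMPTION: expected funding is |funding_ma| * horizon. A target of 0.0
    means no position is opened, or an existing position is closed.
    """
    edge = abs(funding_ma) * horizon - friction
    if edge <= 0:
        return 0.0
    # Negative funding means shorts pay longs, so the arbitrageur buys.
    direction = 1.0 if funding_ma < 0 else -1.0
    return direction * scale * edge
```

Note that only `arbitrageur_target` takes the funding rate as an input, which mirrors the point above: chartists and fundamentalists pay funding but do not react to it.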


To keep our calibration procedure simple, we assume an exogenous price time series and look to replicate trading activity (see details in the full write-up). We use grid search and conduct Monte Carlo simulations to obtain the 95% confidence intervals for mean terminal PnL under each parametrization to find the following optimal parametrization:

Arm             Reward          Policy
\alpha = 0.1    \eta = 0.5      Exp3
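For reference, a 95% confidence interval for mean terminal PnL can be computed from repeated runs as in the sketch below. It uses the normal approximation (z = 1.96), which is an assumption about the method rather than a description of the linked simulation code. `toy_terminal_pnl` is a hypothetical stand-in for a single ABM run and does not model anything from this post.

```python
import math
import random
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation confidence interval for the sample mean."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def toy_terminal_pnl(seed):
    """HYPOTHETICAL stand-in for one simulation run: a Gaussian random walk."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.1, 1.0) for _ in range(100))


if __name__ == "__main__":
    # One interval per parametrization; grid search would repeat this
    # for every (arm, reward, policy) combination and compare intervals.
    runs = [toy_terminal_pnl(seed) for seed in range(200)]
    low, high = mean_confidence_interval(runs)
    print(f"mean terminal PnL 95% CI: [{low:.2f}, {high:.2f}]")
```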

Fig 1. PnL heatmap of different bandit algorithms

Fig 2. PnL under each strategy in a single run

Fig 3. PnL distribution under each strategy

(See our simulation code here)


Several limitations remain to be addressed in future work. First, a profit maximization objective seems better suited to a market maker and does not fully capture our objective of designing a “good” exchange. This is clear from the high simulated trading volume despite high funding rates, which is driven by funding rate arbitrageurs and is reminiscent of The Perpetual PvP Ponzi. Instead, a welfare maximization perspective with an objective of minimizing both cost of funding and insolvency risk can be the next framework to explore.

High funding rates also motivated our approach of constraining the choice of arms, which should be investigated further in terms of theoretical regret bounds and empirical performance. Importantly, there remains a problematic edge case when the funding charged exceeds the overweight side’s total collateral and unrealized PnL, resulting in insolvency. It follows that additional guard rails, such as open interest caps, may have to be implemented to complement a MAB framework.

The calibration and validation of our ABM can also be improved in various ways. We observed that if the exchange happened to hold a favorable imbalance liability, the resulting PnL would be tremendously positive. To combat this, chartists and fundamentalists could be more intelligent by refusing to trade when the funding rate is unreasonably high. Alternatively, we can also simulate using price paths on other dates. In terms of choosing the best bandit algorithm, it seems that the optimal algorithm is rather context dependent.

While our current approach is to have a single optimization algorithm choose an appropriate arm (funding rate), it is of interest to explore how this can be extended to a more decentralized setting (similar to how Aera aggregates portfolio submissions).


  1. What objectives, other than profit maximization, are worth exploring?
  2. Is constraining the arms optimal or are there solutions that can be implemented to prevent the aforementioned edge case?
  3. How can the ABM be improved to reflect more realistic trading activity?

Using a bandit approach is a fresh perspective to DeFi mechanism design!

  • How long does a round last? In the sims it looks like you are just using a generic “step count”. In your mind, what is this timescale indicative of - minute, hourly, or daily?

  • One approach to reflect realistic trading activity is to test extreme events - what’s the minimum amount of directional trading activity (in both directions) required to get “good enough” performance?

  • Maximizing profit sounds extremely risky, since the function would favor profit over risk, so I think the objective of minimizing risk would be more important than maximizing profit by itself. Or, similarly, keeping the “probability for systemic defaults” as low as possible.


Since we’re using 1-min frequency sampling for price time series, the simulation step represents 1 minute. But we are currently thinking of rounds as blocks.

Not sure if I fully understand, could you expand on this? One issue we did find was that agent behavior was not heterogeneous enough when spikes in the price happened.

Agreed, we’re considering a new objective from the perspective of optimal risk allocation, i.e. minimize \rho(X) + F^f_t, the sum of insolvency risk and total funding charged over some horizon.

Another aspect to explore is how to prevent the funding rate from being highly variable (which is what we observed while the bandits were exploring), since that could lead to unnecessary liquidations.

Oh, that is what I was wondering about, where “the agent behavior isn’t heterogeneous enough when price spikes occur”. I am not convinced that you can get a theoretical optimum, so I was thinking about just getting “good enough” performance during price spikes instead.

If you are considering total funding over some horizon, you could add a rolling window to smooth the funding rate spike over the funding horizon period.


Right, and another way could be to take an EWMA of the imbalance and parametrize the arms to be something simple like arm \cdot (L - S)_{EWMA}. I wonder if these implicit ways to account for funding rate variability could hinder the bandit performance; maybe have the reward function include a penalty for changes in the funding rate instead?
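Roughly, the two options could look like the sketch below. Everything here is hypothetical and untested in our sims: `beta` is an assumed EWMA smoothing factor and `kappa` is an assumed penalty weight on the absolute change in funding rate.

```python
def ewma_update(previous, value, beta):
    """One EWMA step: weight beta on the new value, 1 - beta on history."""
    return beta * value + (1.0 - beta) * previous


def funding_from_arm(arm, ewma_imbalance):
    """Option 1: funding proportional to the smoothed imbalance (L - S)_EWMA."""
    return arm * ewma_imbalance


def penalized_reward(reward, funding_rate, previous_rate, kappa):
    """Option 2: keep the arms as-is, but penalize funding rate changes."""
    return reward - kappa * abs(funding_rate - previous_rate)
```

The first option smooths the input to the funding rate, while the second leaves the arms untouched and lets the bandit learn to avoid jumps through the reward.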


yep it sounds like you’re on the right track!