I'm a lecturer in the Department of Computer Science at the University of Oxford. I work primarily in computational game theory, multi-agent systems, and AI. My research focuses on understanding and shaping interactions among intelligent agents in complex real-world scenarios. We explore the power of computing to address these challenges and to solve the underlying algorithmic "puzzles". Through this, we develop algorithmic tools to guide agents toward broader organisational and societal goals. My work bridges theory and practice, addressing issues such as fairness, sustainability, and safety in multi-agent interactions.
I'm looking for motivated PhD students. Feel free to get in touch if you're interested in my research areas! (See more about PhD study at Oxford here.) For any application-related inquiries, please contact me directly (jiarui.gan@cs.ox.ac.uk) or the Department of Computer Science.
Lecturer @Oxford ◄ Postdoc @MPI-SWS (host: Rupak Majumdar) ◄ PhD @Oxford (supervisors: Edith Elkind & Michael Wooldridge)
Insights from well-established frameworks, such as principal-agency and dynamic mechanism design, provide blueprints for architecting and optimising multi-agent systems. By coupling these frameworks with computational models, we create automated solutions that ensure advanced AI agents, whether in transportation, digital platforms, or broader societal ecosystems, remain reliably aligned with organisational objectives. Our work focuses on establishing the computational foundations necessary for these missions, developing systematic approaches to guide multi-agent systems toward beneficial and equitable outcomes.
We study how to design incentive mechanisms that effectively coordinate autonomous agents towards organisational objectives, drawing on principal-agency ideas from economics. By framing these design tasks as computational problems, we aim to develop algorithms capable of computing optimal mechanisms. This approach allows advanced computing machinery to be leveraged to generate tangible solutions, especially in complex or large-scale settings where manually crafted solutions are infeasible or insufficient. Our recent work focuses on two key directions: 1) developing universal computational frameworks for principal-agency to illuminate the shared algorithmic structures underlying diverse principal-agent scenarios, and 2) extending these frameworks to address the more intricate and dynamic settings where interactions between agents unfold over multiple stages. Our overarching goal is to discover fundamental computational theories and algorithmic tools that are broadly applicable to principal-agent problems across a wide range of domains.
- J. Gan, R. Majumdar.
Value-set iteration: computing optimal correlated equilibria in infinite-horizon multi-player stochastic games. [arXiv]We study the problem of computing optimal correlated equilibria (CEs) in infinite-horizon multi-player stochastic games, where correlation signals are provided over time. In this setting, optimal CEs require history-dependent policies; this poses new representational and algorithmic challenges as the number of possible histories grows exponentially with the number of time steps. We focus on computing $(\epsilon, \delta)$-optimal CEs---solutions that achieve a value within $\epsilon$ of an optimal CE, while allowing the agents' incentive constraints to be violated by at most $\delta$. Our main result is an algorithm that computes an $(\epsilon,\delta)$-optimal CE in time polynomial in $1/(\epsilon\delta(1 - \gamma))^{n+1}$, where $\gamma$ is the discount factor, and $n$ is the number of agents. For (a slightly more general variant of) turn-based games, we further reduce the complexity to a polynomial in $n$. We also establish that the bi-criterion approximation is necessary by proving matching inapproximability bounds.
Our technical core is a novel approach based on inducible value sets, which leverages a compact representation of history-dependent CEs through the values they induce to overcome the representational challenge. We develop the value-set iteration algorithm---which operates by iteratively updating estimates of inducible value sets---and characterize CEs as the greatest fixed point of the update map. Our algorithm provides a groundwork for computing optimal CEs in general multi-player stochastic settings.
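To give a flavour of the value-set approach, here is a schematic form of the update map in simplified notation of my own (it omits the precise form of the obedience constraints and is not the paper's exact definition). Starting from an over-approximation, the estimated set of inducible value vectors at each state $s$ is refined by

$$ \mathcal{V}_{k+1}(s) \;=\; \Big\{ v \in \mathbb{R}^n \;:\; \exists\, \mu \in \Delta(A),\; w(a, s') \in \mathcal{V}_k(s') \text{ s.t. } v = \mathbb{E}_{a \sim \mu}\big[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \, w(a, s') \big] \text{ and each player's obedience constraint holds under } (\mu, w) \Big\}, $$

and the inducible value sets arise as the greatest fixed point of this map; an optimal CE then corresponds to a value vector in the fixed point that is best for the chosen objective.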
- Generalized principal-agency: contracts, information, games and beyond. WINE '24.
J. Gan, M. Han, J. Wu, H. Xu. [arXiv]In the principal-agent problem formulated by Myerson'82, agents have private information (type) and make private decisions (action), both of which are unobservable to the principal. Myerson pointed out an elegant linear programming solution that relies on the revelation principle. This paper extends Myerson's results to a more general setting where the principal's action space can be infinite and subject to additional design constraints. Our generalized principal-agent model unifies several important design problems including contract design, information design, and Bayesian Stackelberg games, and encompasses them as special cases. We first extend the revelation principle to this general model, based on which a polynomial-time algorithm is then derived for computing the optimal mechanism for the principal. This algorithm not only implies new efficient solutions simultaneously for all the aforementioned special cases but also significantly simplifies previously known algorithms designed for special cases. Inspired by the recent interest in the algorithmic design of a single contract and menu of contracts, we study such constrained design problems to our general principal-agent model. In contrast to the above unification, our results here illustrate the other facet of diversity among different principal-agent design problems and demonstrate how their different structures can lead to different complexities: some are tractable whereas others are APX-hard. Finally, we reveal an interesting connection of our model to the problem of information acquisition for decision making and study its algorithmic properties in general.
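To illustrate the kind of LP-based computation this line of work builds on, here is a minimal sketch of the simplest special case: computing an optimal commitment against a single follower type by solving one small LP per induced best response (the classic multiple-LPs approach). The matrices, function name, and use of scipy are illustrative only and are not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_commitment(A, B):
    """Optimal mixed commitment for the leader (payoffs A) against a single
    follower type (payoffs B): one LP per candidate follower best response."""
    m, n = A.shape
    best_val, best_x = -np.inf, None
    for j in range(n):                       # induce follower action j as a best response
        c = -A[:, j]                         # linprog minimises, so negate the leader's payoff
        # incentive constraints: (B[:, k] - B[:, j]) @ x <= 0 for every alternative k
        A_ub = np.array([B[:, k] - B[:, j] for k in range(n) if k != j])
        b_ub = np.zeros(n - 1)
        A_eq, b_eq = np.ones((1, m)), np.array([1.0])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * m)
        if res.success and -res.fun > best_val:
            best_val, best_x = -res.fun, res.x
    return best_x, best_val

# tiny example with hypothetical payoffs
A = np.array([[2.0, 4.0], [1.0, 3.0]])   # leader
B = np.array([[1.0, 0.0], [0.0, 2.0]])   # follower
print(optimal_commitment(A, B))
```

Each LP fixes the follower action to be induced, so the leader's problem becomes linear; taking the best of the n solutions gives the optimal commitment in this special case.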
- Stochastic principal-agent problems: efficient computation and learning.
J. Gan, R. Majumdar, D. Mandal, G. Radanovic. [arXiv]We introduce a stochastic principal-agent model. A principal and an agent interact in a stochastic environment, each privy to observations about the state not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The players communicate with each other and then select actions independently. Each of them receives a payoff based on the state and their joint action, and the environment transitions to a new state. The interaction continues over a finite time horizon. Both players are far-sighted, aiming to maximize their total payoffs over the time horizon. The model encompasses as special cases extensive-form games (EFGs) and stochastic games of incomplete information, partially observable Markov decision processes (POMDPs), as well as other forms of sequential principal-agent interactions, including Bayesian persuasion and automated mechanism design problems.
We consider both the computation and learning of the principal's optimal policy. Since the general problem, which subsumes POMDPs, is intractable, we explore algorithmic solutions under hindsight observability, where the state and the interaction history are revealed at the end of each step. Though the problem becomes more amenable under this condition, the number of possible histories remains exponential in the length of the time horizon, making approaches for EFG-based models infeasible. We present an efficient algorithm based on the inducible value sets. The algorithm computes an $\epsilon$-approximate optimal policy in time polynomial in $1/\epsilon$. Additionally, we show an efficient learning algorithm for an episodic reinforcement learning setting where the transition probabilities are unknown. The algorithm guarantees sublinear regret $\tilde{O}(T^{2/3})$ for both players over $T$ episodes.
- Persuading a credible agent.
J. Gan, A. Ghosh, N. Teh.
[arXiv]
How to optimally persuade an agent who has a private type? When elicitation is feasible, this amounts to a fairly standard principal-agent-style mechanism design problem, where the persuader employs a mechanism to first elicit the agent's type and then plays the corresponding persuasion strategy based on the agent's report. The optimal mechanism design problem in this setting is relatively well-understood in the literature, with incentive compatible (IC) mechanisms known to be optimal and computationally tractable. In this paper, we study this problem given a credible agent, i.e., if the agent claims they are of a certain type in response to the mechanism's elicitation, then they will act optimally with respect to the claimed type, even if they are actually not of that type.
We present several interesting findings in this new setting that differ significantly from results in the non-credible setting. In terms of the structure of optimal mechanisms, we show that not only may IC mechanisms fail to be optimal, but all mechanisms following the standard 'eliciting-then-persuading' mechanism design structure may be suboptimal. To achieve optimality requires two additional instruments, pre-signaling and non-binding elicitation, which naturally result in multi-stage mechanisms. We characterize optimal mechanisms under these design choices. Based on our characterization, we provide a polynomial-time algorithm for computing optimal multi-stage mechanisms. We also discover that in scenarios that allow for it, partial information elicitation can be employed to improve the principal's payoff even further. Surprisingly, though, an unbounded number of rounds of information exchange between the principal and the agent may be necessary to achieve optimality.
- Sequential decision making with information asymmetry. CONCUR '22.
J. Gan, R. Majumdar, G. Radanovic, A. Singla. [link]
We survey some recent results in sequential decision making under uncertainty, where there is an information asymmetry among the decision-makers. We consider two versions of the problem: persuasion and mechanism design. In persuasion, a more-informed principal influences the actions of a less-informed agent by signaling information. In mechanism design, a less-informed principal incentivizes a more-informed agent to reveal information by committing to a mechanism, so that the principal can make more informed decisions. We define Markov persuasion processes and Markov mechanism processes, which cast persuasion and mechanism design as dynamic models. Then we survey results on optimal persuasion and optimal mechanism design against myopic and far-sighted agents. These problems are solvable in polynomial time for myopic agents but hard for far-sighted agents.
- Bayesian persuasion in sequential decision-making. AAAI '22.
J. Gan, R. Majumdar, G. Radanovic, A. Singla. [pdf | arXiv | link]
We study a dynamic model of Bayesian persuasion in sequential decision-making settings. An informed principal observes an external parameter of the world and advises an uninformed agent about actions to take over time. The agent takes actions in each time step based on the current state, the principal's advice/signal, and beliefs about the external parameter. The action of the agent updates the state according to a stochastic process. The model arises naturally in many applications, e.g., an app (the principal) can advise the user (the agent) on possible choices between actions based on additional real-time information the app has. We study the problem of designing a signaling strategy from the principal's point of view. We show that the principal has an optimal strategy against a myopic agent, who only optimizes their rewards locally, and the optimal strategy can be computed in polynomial time. In contrast, it is NP-hard to approximate an optimal policy against a far-sighted agent. Further, we show that if the principal has the power to threaten the agent by not providing future signals, then we can efficiently design a threat-based strategy. This strategy guarantees the principal's payoff as if playing against an agent who is far-sighted but myopic to future signals.
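Against a myopic agent, the per-step signaling problem is essentially a one-shot Bayesian persuasion program: choose a joint distribution over (external parameter, recommended action) subject to obedience constraints. Below is a minimal sketch of that one-shot LP; the variable names, payoff matrices, and use of scipy are mine and purely illustrative, not the algorithm from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def one_shot_persuasion(prior, U_p, U_a):
    """One-shot persuasion LP: choose x[th, a] = Pr[parameter th, recommend a]
    maximising the principal's utility U_p subject to the agent's obedience
    constraints under agent utility U_a."""
    T, K = U_p.shape                          # number of parameter values, number of actions
    idx = lambda th, a: th * K + a            # flatten variables x[th, a]
    c = -U_p.flatten()                        # maximise the principal's expected utility
    # consistency: for each parameter th, sum_a x[th, a] = prior[th]
    A_eq = np.zeros((T, T * K))
    b_eq = prior.astype(float)
    for th in range(T):
        for a in range(K):
            A_eq[th, idx(th, a)] = 1.0
    # obedience: a recommended action a must beat every deviation a_dev:
    #   sum_th x[th, a] * (U_a[th, a_dev] - U_a[th, a]) <= 0
    rows, b_ub = [], []
    for a in range(K):
        for a_dev in range(K):
            if a_dev == a:
                continue
            row = np.zeros(T * K)
            for th in range(T):
                row[idx(th, a)] = U_a[th, a_dev] - U_a[th, a]
            rows.append(row)
            b_ub.append(0.0)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (T * K))
    return res.x.reshape(T, K), -res.fun      # signaling scheme and principal's value

# hypothetical 2-parameter, 2-action instance
prior = np.array([0.3, 0.7])
U_p = np.array([[1.0, 0.0], [1.0, 0.0]])      # principal always prefers action 0
U_a = np.array([[1.0, 0.0], [0.0, 1.0]])      # agent prefers matching the parameter
print(one_shot_persuasion(prior, U_p, U_a))
```

In this toy instance the principal always prefers action 0, and the LP recovers the familiar persuasion effect: action 0 is recommended more often than full disclosure would allow, but only up to the point where obedience still holds.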
Also see: our tutorial at WINE '24.
As autonomous agents grow increasingly capable, we face the spectre of manipulative behaviours, ranging from hidden collusion to deception, that lead to unintended outcomes. We investigate how information misreporting can evade basic auditing by synchronising behaviour with fabricated claims, a form of deception we term imitative deception. Such deceptive strategies pose a significant challenge to learning-based mechanism design, undermining its reliability, particularly in high-stakes contexts like security. If agents consistently mask their motives in this way, what implications does this have for the safety and efficacy of mechanisms built on behavioural data? We initiate algorithmic studies around this question, investigating the vulnerabilities as well as countermeasures associated with deceptive tactics.
- Persuading a credible agent.
J. Gan, A. Ghosh, N. Teh.
[arXiv]
How to optimally persuade an agent who has a private type? When elicitation is feasible, this amounts to a fairly standard principal-agent-style mechanism design problem, where the persuader employs a mechanism to first elicit the agent's type and then plays the corresponding persuasion strategy based on the agent's report. The optimal mechanism design problem in this setting is relatively well-understood in the literature, with incentive compatible (IC) mechanisms known to be optimal and computationally tractable. In this paper, we study this problem given a credible agent, i.e., if the agent claims they are of a certain type in response to the mechanism's elicitation, then they will act optimally with respect to the claimed type, even if they are actually not of that type.
We present several interesting findings in this new setting that differ significantly from results in the non-credible setting. In terms of the structure of optimal mechanisms, we show that not only may IC mechanisms fail to be optimal, but all mechanisms following the standard 'eliciting-then-persuading' mechanism design structure may be suboptimal. To achieve optimality requires two additional instruments, pre-signaling and non-binding elicitation, which naturally result in multi-stage mechanisms. We characterize optimal mechanisms under these design choices. Based on our characterization, we provide a polynomial-time algorithm for computing optimal multi-stage mechanisms. We also discover that in scenarios that allow for it, partial information elicitation can be employed to improve the principal's payoff even further. Surprisingly, though, an unbounded number of rounds of information exchange between the principal and the agent may be necessary to achieve optimality.
- Learning a Stackelberg leader's incentives from optimal commitments. EC '25.
Y. Chen, X. Deng, J. Gan, Y. Li.
[arXiv]
Stackelberg equilibria, as functions of the players' payoffs, can inversely reveal information about the players' incentives. In this paper, we study to what extent one can learn about the leader's incentives by actively querying the leader's optimal commitments against strategically designed followers. We show that, by using polynomially many queries and operations, one can learn a payoff function that is strategically equivalent to the leader's, in the sense that: 1) it preserves the leader's preference over almost all strategy profiles; and 2) it preserves the set of all possible (strong) Stackelberg equilibria the leader may engage in, considering all possible follower types. As an application, we show that the information acquired by our algorithm is sufficient for a follower to induce the best possible Stackelberg equilibrium by imitating a different follower type. To the best of our knowledge, we are the first to demonstrate that this is possible without knowing the leader's payoffs beforehand.
- Optimally deceiving a learning leader in Stackelberg games.
JAIR & NeurIPS '20.
G. Birmpas, J. Gan, A. Hollender, F. Marmolejo-Cossío, N. Rajgopal, A. Voudouris. [pdf]Recent results have shown that algorithms for learning the optimal commitment in a Stackelberg game are susceptible to manipulation by the follower. These learning algorithms operate by querying the best responses of the follower, who consequently can deceive the algorithm by using fake best responses, typically by responding according to fake payoffs that are different from the actual ones. For this strategic behavior to be successful, the main challenge faced by the follower is to pinpoint the fake payoffs that would make the learning algorithm output a commitment that benefits them the most. While this problem has been considered before, the related literature has only focused on a simple setting where the follower can only choose from a finite set of payoff matrices, thus leaving the general version of the problem unanswered. In this paper, we fill this gap by showing that it is always possible for the follower to efficiently compute (near-)optimal fake payoffs, for various scenarios of learning interaction between the leader and the follower. Our results also establish an interesting connection between the follower's deception and the leader's maximin utility: through deception, the follower can induce almost any (fake) Stackelberg equilibrium if and only if the leader obtains at least their maximin utility in this equilibrium.
- Manipulating a learning defender and ways to counteract. NeurIPS '19.
J. Gan, Q. Guo, L. Tran-Thanh, B. An, M. Wooldridge. [pdf | arXiv | poster]In Stackelberg security games when information about the attacker's payoffs is uncertain, algorithms have been proposed to learn the optimal defender commitment by interacting with the attacker and observing their best responses. In this paper, we show that, however, these algorithms can be easily manipulated if the attacker responds untruthfully. As a key finding, attacker manipulation normally leads to the defender learning a maximin strategy, which effectively renders the learning attempt meaningless as to compute a maximin strategy requires no additional information about the other player at all. We then apply a game-theoretic framework at a higher level to counteract such manipulation, in which the defender commits to a policy that specifies her strategy commitment according to the learned information. We provide a polynomial-time algorithm to compute the optimal such policy, and in addition, a heuristic approach that applies even when the attacker's payoff space is infinite or completely unknown. Empirical evaluation shows that our approaches can improve the defender's utility significantly as compared to the situation when attacker manipulation is ignored.
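The maximin benchmark that manipulation drives the learner towards is itself easy to compute without any interaction; a small illustrative sketch (notation and the use of scipy are mine):

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(A):
    """Defender's maximin mixed strategy: maximise the worst-case payoff v
    over the attacker's pure responses, i.e. max_x min_j (A^T x)_j."""
    m, n = A.shape
    # variables: (x_1, ..., x_m, v); linprog minimises, so use objective -v
    c = np.concatenate([np.zeros(m), [-1.0]])
    # constraints: v - A[:, j] @ x <= 0 for every attacker action j
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]   # maximin strategy and its guaranteed value

# hypothetical defender payoff matrix (rows: defender strategies, columns: attacker targets)
print(maximin_strategy(np.array([[3.0, 0.0], [1.0, 2.0]])))
```

This is exactly why successful manipulation makes the learning attempt pointless: the strategy it converges to could have been computed without observing the attacker at all.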
- Imitative follower deception in Stackelberg games. EC '19.
J. Gan, H. Xu, Q. Guo, L. Tran-Thanh, Z. Rabinovich, M. Wooldridge. [pdf | arXiv | poster]Information uncertainty is one of the major challenges facing applications of game theory. In the context of Stackelberg games, various approaches have been proposed to deal with the leader's incomplete knowledge about the follower's payoffs, typically by gathering information from the leader's interaction with the follower. Unfortunately, these approaches rely crucially on the assumption that the follower will not strategically exploit this information asymmetry, i.e., the follower behaves truthfully during the interaction according to their actual payoffs. As we show in this paper, the follower may have strong incentives to deceitfully imitate the behavior of a different follower type and, in doing this, benefit significantly from inducing the leader into choosing a highly suboptimal strategy. This raises a fundamental question: how to design a leader strategy in the presence of a deceitful follower? To answer this question, we put forward a basic model of Stackelberg games with (imitative) follower deception and show that the leader is indeed able to reduce the loss due to follower deception with carefully designed policies. We then provide a systematic study of the problem of computing the optimal leader policy and draw a relatively complete picture of the complexity landscape; essentially matching positive and negative complexity results are provided for natural variants of the model. Our intractability results are in sharp contrast to the situation with no deception, where the leader's optimal strategy can be computed in polynomial time, and thus illustrate the intrinsic difficulty of handling follower deception. Through simulations we also examine the benefit of considering follower deception in randomly generated games.
Standard models often assume neat and predictable agent behaviour to simplify theoretical analysis. While these assumptions have led to foundational insights, they frequently fail to account for the complexities and nuances of real-world scenarios. For example, agents may ignore, or be unable to discern, small payoff differences, thereby making suboptimal decisions. Similarly, in sequential decision-making, while standard models assume that agents value future rewards at a constant discount rate, empirical evidence suggests that this rate may vary over time. Strategies produced by such idealised models can be surprisingly unreliable once put into practice. We identify these gaps and propose refined solution concepts that more accurately reflect realistic agent behaviours. More importantly, we investigate the computational challenges introduced by these refinements, discovering effective computational approaches to more robust and practical decision-making when interacting with autonomous agents.
- Robust Stackelberg equilibrium. EC '23.
J. Gan, M. Han, J. Wu, H. Xu.
[link | arXiv]
This paper provides a systematic study of the robust Stackelberg equilibrium (RSE), which naturally generalizes the widely adopted solution concept of the strong Stackelberg equilibrium (SSE). The RSE accounts for any possible up-to-δ suboptimal follower responses in Stackelberg games and is adopted to improve the robustness of the leader's strategy. While a few variants of robust Stackelberg equilibrium have been considered in previous literature, the RSE solution concept we consider is importantly different: in some sense, it relaxes previously studied robust Stackelberg strategies and is applicable to much broader sources of uncertainties.
We provide a thorough investigation of several fundamental properties of RSE, including its utility guarantees, algorithmics, and learnability. We first show that the RSE we defined always exists and thus is well-defined. Then we characterize how the leader's utility in RSE changes with the robustness level considered. On the algorithmic side, we show that, in sharp contrast to the tractability of computing an SSE, it is NP-hard to obtain a fully polynomial approximation scheme (FPTAS) for any constant robustness level. Nevertheless, we develop a quasi-polynomial approximation scheme (QPTAS) for RSE. Finally, we examine the learnability of the RSE in a natural learning scenario, where both players' utilities are not known in advance, and provide almost tight sample complexity results on learning the RSE. As a corollary of this result, we also obtain an algorithm for learning SSE, which strictly improves a key result of Bai et al. in terms of both utility guarantee and computational efficiency.
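As a small illustration of the solution concept (not of the QPTAS), the robust value of a fixed leader strategy can be evaluated directly: collect every follower response that is within δ of the follower's best attainable utility and assume the worst of them for the leader. A toy sketch with hypothetical payoffs:

```python
import numpy as np

def robust_leader_value(x, A, B, delta):
    """Worst-case leader utility of mixed strategy x when the follower may play
    any response that is at most delta worse (for the follower) than optimal."""
    follower_vals = B.T @ x                   # follower's utility for each pure response
    leader_vals = A.T @ x                     # leader's utility for each pure response
    near_best = follower_vals >= follower_vals.max() - delta
    return leader_vals[near_best].min()

# hypothetical 2x2 game: with delta = 0.1 both responses are delta-plausible for the follower
A = np.array([[5.0, 1.0], [4.0, 2.0]])
B = np.array([[1.0, 0.95], [0.5, 0.6]])
x = np.array([0.5, 0.5])
print(robust_leader_value(x, A, B, 0.1))
```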
- Markov decision processes with time-varying geometric discounting. AAAI '23.
J. Gan, A. Hennes, R. Majumdar, D. Mandal, G. Radanovic. [link | pdf]Canonical models of Markov decision processes (MDPs) usually consider geometric discounting based on a constant discount factor. While this standard modeling approach has led to many elegant results, some recent studies indicate the necessity of modeling time-varying discounting in certain applications. This paper studies a model of infinite-horizon MDPs with time-varying discount factors. We take a game-theoretic perspective, whereby each time step is treated as an independent decision maker with their own (fixed) discount factor, and we study the subgame perfect equilibrium (SPE) of the resulting game as well as the related algorithmic problems. We present a constructive proof of the existence of an SPE and demonstrate the EXPTIME-hardness of computing an SPE. We also turn to the approximate notion of ε-SPE and show that an ε-SPE exists under milder assumptions. An algorithm is presented to compute an ε-SPE, of which an upper bound of the time complexity, as a function of the convergence property of the time-varying discount factor, is provided.
We explore how to embed fairness, safety, and sustainability into multi-agent systems to improve social good. Our work spans developing fair algorithms for resource allocation and decision-making, analysing societal dynamics like demographic segregation, and leveraging game theory to optimise security, transportation, and energy systems. Together, our efforts highlight how computational game theory can address societal challenges while driving innovation in algorithmic research and multi-agent systems.
- Socially fair reinforcement learning.
D. Mandal, J. Gan.
[arXiv]
We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to different reward functions. Prior works have proposed different objectives that a fair policy must optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives fail to satisfy all four axioms. We then consider the learning version of the problem where the underlying model, i.e. the Markov decision process, is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound in regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O(H)$ for a weaker notion of regret.
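The objectives compared in the paper are simple to state on a vector of per-stakeholder values; the following sketch is only meant to fix ideas (the Gini weights here are an arbitrary illustrative choice):

```python
import numpy as np

def min_welfare(v):
    """Egalitarian objective: the worst-off stakeholder's value."""
    return float(np.min(v))

def generalized_gini_welfare(v, weights=None):
    """Weighted sum of the sorted values, with non-increasing weights putting
    more emphasis on worse-off stakeholders; the default weights are just an example."""
    v = np.sort(v)                                   # ascending: worst-off first
    if weights is None:
        weights = 1.0 / (np.arange(len(v)) + 1.0)    # 1, 1/2, 1/3, ...
    return float(np.dot(weights, v))

def nash_social_welfare(v):
    """Geometric mean of the stakeholders' values (assumed non-negative)."""
    return float(np.prod(v) ** (1.0 / len(v)))

v = np.array([2.0, 4.0, 8.0])
print(min_welfare(v), generalized_gini_welfare(v), nash_social_welfare(v))
```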
- Envy-free policy teaching to multiple agents. NeurIPS '22.
J. Gan, R. Majumdar, G. Radanovic, A. Singla. [pdf | link]We study envy-free policy teaching. A number of agents independently explore a common Markov decision process (MDP), but each with their own reward function and discounting rate. A teacher wants to teach a target policy to this diverse group of agents, by means of modifying the agents' reward functions: providing additional bonuses to certain actions, or penalizing them. When personalized reward modification programs are used, an important question is how to design the programs so that the agents think they are treated fairly. We adopt the notion of envy-freeness (EF) from the literature on fair division to formalize this problem and investigate several fundamental questions about the existence of EF solutions in our setting, the computation of cost-minimizing solutions, as well as the price of fairness (PoF), which measures the increase of cost due to the consideration of fairness. We show that 1) an EF solution may not exist if penalties are not allowed in the modifications, but otherwise always exists. 2) Computing a cost-minimizing EF solution can be formulated as convex optimization and hence solved efficiently. 3) The PoF increases but at most quadratically with the geometric sum of the discount factor, and at most linearly with the size of the MDP and the number of agents involved; we present tight asymptotic bounds on the PoF. These results indicate that fairness can be incorporated in multi-agent teaching without significant computational or PoF burdens.
- Defense coordination in security games: equilibrium analysis and mechanism design. Artificial Intelligence.
J. Gan, E. Elkind, S. Kraus, M. Wooldridge. [link]Real-world security scenarios sometimes involve multiple defenders: security agencies of two or more countries might patrol the same border areas, and domestic security agencies might also operate in the same locations when their areas of jurisdiction overlap. Motivated by these scenarios and the observation that uncoordinated movements of the defenders may lead to an inefficient defense, we introduce a model of multi-defender security games and explore the possibility of improving efficiency by coordinating the defenders, specifically, by pooling the defenders' resources and allocating them jointly. The model generalizes the standard model of Stackelberg security games, where a defender (now a group of defenders) allocates security resources to protect a set of targets, and an attacker picks the best target to attack. In particular, we are interested in the situation with heterogeneous defenders, who may value the same target differently. Our task is twofold. First, we need to develop a good understanding of the uncoordinated situation, as the baseline to be improved. To this end we formulate a new equilibrium concept, and prove that an equilibrium under this concept always exists and can be computed efficiently. Second, to coordinate the heterogeneous defenders we take a mechanism design perspective and aim to find a mechanism to generate joint resource allocation strategies. We seek a mechanism that improves the defenders' utilities upon the uncoordinated baseline, achieves Pareto efficiency, and incentivizes the defenders to report their true incentives and execute the recommended strategies. Our analysis establishes several impossibility results, indicating the intrinsic difficulties of defense coordination. Specifically, we show that even the basic properties listed above are in conflict with each other: no mechanism can simultaneously satisfy them all, or even some proper subsets of them. In terms of positive results, we present mechanisms that satisfy all combinations of the properties that are not ruled out by our impossibility results, thereby providing a comprehensive profile of the mechanism design problem with respect to the properties considered.
(Preliminary versions appeared in AAMAS '18 and '20.)
- Schelling games on graphs. Artificial Intelligence.
A. Agarwal, E. Elkind, J. Gan, A. Igarashi, W. Suksompong, A. Voudouris. [link]We study strategic games inspired by Schelling's seminal model of residential segregation. These games are played on undirected graphs, with the set of agents partitioned into multiple types; each agent either aims to maximize the fraction of her neighbors who are of her own type, or occupies a node of the graph and never moves away. We consider two natural variants of this model: in jump games agents can jump to empty nodes of the graph to increase their utility, while in swap games they can swap positions with other agents. We investigate the existence, computational complexity, and quality of equilibrium assignments in these games, both from a social welfare perspective and from a diversity perspective. Some of our results extend to a more general setting where the preferences of the agents over their neighbors are defined by a social network rather than a partition into types.
(Preliminary versions appeared in IJCAI '19 and AAAI '20.)
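A toy sketch of the jump dynamics studied in this model, on a plain adjacency-list graph; the utility convention for agents with no occupied neighbours and the tie-breaking are simplifications of mine:

```python
import random

def utility(graph, placement, node):
    """Fraction of occupied neighbours sharing the agent's type (1.0 if none)."""
    agent_type = placement[node]
    neighbours = [placement[v] for v in graph[node] if placement[v] is not None]
    if not neighbours:
        return 1.0
    return sum(t == agent_type for t in neighbours) / len(neighbours)

def best_response_jumps(graph, placement, max_rounds=100):
    """Repeatedly let agents jump to empty nodes whenever that strictly improves
    their utility; stop at an equilibrium or after max_rounds."""
    for _ in range(max_rounds):
        improved = False
        occupied = [v for v, t in placement.items() if t is not None]
        random.shuffle(occupied)
        for v in occupied:
            empties = [u for u, t in placement.items() if t is None]
            current = utility(graph, placement, v)
            for u in empties:
                trial = dict(placement)
                trial[u], trial[v] = trial[v], None   # move the agent from v to u
                if utility(graph, trial, u) > current:
                    placement = trial
                    improved = True
                    break
        if not improved:
            break
    return placement

# path graph with 5 nodes, two types 'a'/'b' and one empty node (hypothetical instance)
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
placement = {0: 'a', 1: 'b', 2: 'a', 3: None, 4: 'b'}
print(best_response_jumps(graph, placement))
```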
(See more in full list.)
[αβ] denotes alphabetical ordering of authors. (Why?)
- J. Gan, R. Majumdar.
Value-set iteration: computing optimal correlated equilibria in infinite-horizon multi-player stochastic games. [arXiv]We study the problem of computing optimal correlated equilibria (CEs) in infinite-horizon multi-player stochastic games, where correlation signals are provided over time. In this setting, optimal CEs require history-dependent policies; this poses new representational and algorithmic challenges as the number of possible histories grows exponentially with the number of time steps. We focus on computing $(\epsilon, \delta)$-optimal CEs---solutions that achieve a value within $\epsilon$ of an optimal CE, while allowing the agents' incentive constraints to be violated by at most $\delta$. Our main result is an algorithm that computes an $(\epsilon,\delta)$-optimal CE in time polynomial in $1/(\epsilon\delta(1 - \gamma))^{n+1}$, where $\gamma$ is the discount factor, and $n$ is the number of agents. For (a slightly more general variant of) turn-based games, we further reduce the complexity to a polynomial in $n$. We also establish that the bi-criterion approximation is necessary by proving matching inapproximability bounds.
Our technical core is a novel approach based on inducible value sets, which leverages a compact representation of history-dependent CEs through the values they induce to overcome the representational challenge. We develop the value-set iteration algorithm---which operates by iteratively updating estimates of inducible value sets---and characterize CEs as the greatest fixed point of the update map. Our algorithm provides a groundwork for computing optimal CEs in general multi-player stochastic settings.
- T. K. Buening, J. Gan, D. Mandal, M. Kwiatkowska.
Strategyproof reinforcement learning from human feedback. [arXiv]We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof, which can result in learning a substantially misaligned policy even when only one out of $k$ individuals reports their preferences strategically. In turn, we also find that any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.
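As a toy rendering of the median idea (only the aggregation step, not the paper's pessimistic algorithm or its coverage assumptions): scoring candidate policies by the median of the individuals' estimated values blunts the influence of a single strategic report.

```python
import numpy as np

def median_vote_policy(reward_estimates):
    """Score each candidate policy by the median of the individuals' estimated
    values and return the index of the best-scoring candidate.
    reward_estimates: array of shape (num_individuals, num_candidate_policies)."""
    scores = np.median(reward_estimates, axis=0)
    return int(np.argmax(scores)), scores

# three individuals, two candidate policies; the third report is strategic/outlying
estimates = np.array([[1.0, 0.2],
                      [0.9, 0.3],
                      [0.0, 5.0]])
print(median_vote_policy(estimates))   # the exaggerated report does not flip the choice
```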
- J. Gan, R. Majumdar, D. Mandal, G. Radanovic.
Stochastic principal-agent problems: efficient computation and learning. [arXiv]We introduce a stochastic principal-agent model. A principal and an agent interact in a stochastic environment, each privy to observations about the state not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The players communicate with each other and then select actions independently. Each of them receives a payoff based on the state and their joint action, and the environment transitions to a new state. The interaction continues over a finite time horizon. Both players are far-sighted, aiming to maximize their total payoffs over the time horizon. The model encompasses as special cases extensive-form games (EFGs) and stochastic games of incomplete information, partially observable Markov decision processes (POMDPs), as well as other forms of sequential principal-agent interactions, including Bayesian persuasion and automated mechanism design problems.
We consider both the computation and learning of the principal's optimal policy. Since the general problem, which subsumes POMDPs, is intractable, we explore algorithmic solutions under hindsight observability, where the state and the interaction history are revealed at the end of each step. Though the problem becomes more amenable under this condition, the number of possible histories remains exponential in the length of the time horizon, making approaches for EFG-based models infeasible. We present an efficient algorithm based on the inducible value sets. The algorithm computes an $\epsilon$-approximate optimal policy in time polynomial in $1/\epsilon$. Additionally, we show an efficient learning algorithm for an episodic reinforcement learning setting where the transition probabilities are unknown. The algorithm guarantees sublinear regret $\tilde{O}(T^{2/3})$ for both players over $T$ episodes.
- J. Gan, A. Ghosh, N. Teh.
Persuading a credible agent. [arXiv]How to optimally persuade an agent who has a private type? When elicitation is feasible, this amounts to a fairly standard principal-agent-style mechanism design problem, where the persuader employs a mechanism to first elicit the agent's type and then plays the corresponding persuasion strategy based on the agent's report. The optimal mechanism design problem in this setting is relatively well-understood in the literature, with incentive compatible (IC) mechanisms known to be optimal and computationally tractable. In this paper, we study this problem given a credible agent, i.e., if the agent claims they are of a certain type in response to the mechanism's elicitation, then they will act optimally with respect to the claimed type, even if they are actually not of that type.
We present several interesting findings in this new setting that differ significantly from results in the non-credible setting. In terms of the structure of optimal mechanisms, we show that not only may IC mechanisms fail to be optimal, but all mechanisms following the standard 'eliciting-then-persuading' mechanism design structure may be suboptimal. To achieve optimality requires two additional instruments, pre-signaling and non-binding elicitation, which naturally result in multi-stage mechanisms. We characterize optimal mechanisms under these design choices. Based on our characterization, we provide a polynomial-time algorithm for computing optimal multi-stage mechanisms. We also discover that in scenarios that allow for it, partial information elicitation can be employed to improve the principal's payoff even further. Surprisingly, though, an unbounded number of rounds of information exchange between the principal and the agent may be necessary to achieve optimality.
- Y. Chen, X. Deng, J. Gan, Y. Li.
Learning a Stackelberg leader's incentives from optimal commitments. EC '25. [arxiv]Stackelberg equilibria, as functions of the players' payoffs, can inversely reveal information about the players' incentives. In this paper, we study to what extent one can learn about the leader's incentives by actively querying the leader's optimal commitments against strategically designed followers. We show that, by using polynomially many queries and operations, one can learn a payoff function that is strategically equivalent to the leader's, in the sense that: 1) it preserves the leader's preference over almost all strategy profiles; and 2) it preserves the set of all possible (strong) Stackelberg equilibria the leader may engage in, considering all possible follower types. As an application, we show that the information acquired by our algorithm is sufficient for a follower to induce the best possible Stackelberg equilibrium by imitating a different follower type. To the best of our knowledge, we are the first to demonstrate that this is possible without knowing the leader's payoffs beforehand.
- F. Bacchiocchi, J. Gan, M. Castiglioni, A. Marchesi, N. Gatti.
Contract design under approximate best responses. ICML '25. [arxiv]Principal-agent problems model scenarios where a principal incentivizes an agent to take costly, unobservable actions through the provision of payments. Such problems are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too much suboptimal given the principal's payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This positive result is perhaps surprising, since, in Stackelberg games, computing an optimal commitment under approximate best responses is computationally intractable. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal has no prior knowledge about the environment.
- J. Shaki, J. Gan, S. Kraus.
Bayesian persuasion with externalities: exploiting agent types. AAAI '25. [pdf]We study a Bayesian persuasion problem with externalities. In this model, a principal sends signals to inform multiple agents about the state of the world. Simultaneously, due to the existence of externalities in the agents' utilities, the principal also acts as a correlation device to correlate the agents' actions. We consider the setting where the agents are categorized into a small number of types. Agents of the same type share identical utility functions and are treated equitably in the utility functions of both other agents and the principal. We study the problem of computing optimal signaling strategies for the principal, under three different types of signaling channels: public, private, and semi-private. Our results include revelation-principle-style characterizations of optimal signaling strategies, linear programming formulations, and analysis of in/tractability of the optimization problems. It is demonstrated that when the maximum number of deviating agents is bounded by a constant, our LP-based formulations compute optimal signaling strategies in polynomial time. Otherwise, the problems are NP-hard.
- X. Wu, B. Li, J. Gan.
Approximate envy-freeness in indivisible resource allocation with budget constraints. Information and Computation. [pdf]We study the fair allocation of indivisible resources under knapsack constraints, where a set of items with varied costs and values are to be allocated among a group of agents. Each agent has a budget constraint on the total cost of items she can receive. The goal is to compute a budget-feasible allocation that is envy-free (EF), in which the agents do not envy each other for the items they receive, nor do they envy a charity, which is endowed with all the unallocated items. Since EF allocations barely exist (even without the budget constraints), we are interested in the relaxed notion of envy-freeness up to one item (EF1). Our results are twofold. Firstly, for the general setting where agents have heterogeneous valuations and budgets, we show that a budget-feasible allocation that maximizes the Nash social welfare (NSW) achieves a 1/4-approximation of EF1. This approximation ratio carries to the general case of arbitrary monotone subadditive valuations. The approximation ratio improves gracefully when the items have small cost compared with the agents' budgets; it converges to 1/2 when the budget-cost ratio approaches infinity, and to 1 if the agents further have identical valuations. Secondly, when agents have identical valuations, we design a polynomial-time algorithm that computes a 1/2-approximate EF1 allocation for an arbitrary number of agents. For the case of identical agents and the case of two agents, we propose polynomial-time algorithms for computing EF1 allocations.
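For additive valuations, the EF1 condition, including envy towards the charity holding the unallocated items, is straightforward to check; a small sketch under that additive assumption (the instance data are hypothetical):

```python
def is_ef1(bundles, charity, valuations):
    """Check envy-freeness up to one item with additive valuations.
    bundles[i]: set of items held by agent i; charity: unallocated items;
    valuations[i][g]: agent i's value for item g."""
    def value(i, items):
        return sum(valuations[i][g] for g in items)

    others = list(bundles) + [charity]          # agents must not envy the charity either
    for i in range(len(bundles)):
        for other in others:
            if other is bundles[i]:
                continue
            drop = max((valuations[i][g] for g in other), default=0)
            if value(i, bundles[i]) < value(i, other) - drop:
                return False
    return True

bundles = [{"a"}, {"b", "c"}]
charity = {"d"}
valuations = [{"a": 3, "b": 2, "c": 2, "d": 1},
              {"a": 1, "b": 2, "c": 3, "d": 4}]
print(is_ef1(bundles, charity, valuations))
```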
- J. Gan, M. Han, J. Wu, H. Xu.
Robust Stackelberg equilibrium. EC '23. [link | arXiv]This paper provides a systematic study of the robust Stackelberg equilibrium (RSE), which naturally generalizes the widely adopted solution concept of the strong Stackelberg equilibrium (SSE). The RSE accounts for any possible up-to-δ suboptimal follower responses in Stackelberg games and is adopted to improve the robustness of the leader's strategy. While a few variants of robust Stackelberg equilibrium have been considered in previous literature, the RSE solution concept we consider is importantly different: in some sense, it relaxes previously studied robust Stackelberg strategies and is applicable to much broader sources of uncertainties.
We provide a thorough investigation of several fundamental properties of RSE, including its utility guarantees, algorithmics, and learnability. We first show that the RSE we defined always exists and thus is well-defined. Then we characterize how the leader's utility in RSE changes with the robustness level considered. On the algorithmic side, we show that, in sharp contrast to the tractability of computing an SSE, it is NP-hard to obtain a fully polynomial approximation scheme (FPTAS) for any constant robustness level. Nevertheless, we develop a quasi-polynomial approximation scheme (QPTAS) for RSE. Finally, we examine the learnability of the RSE in a natural learning scenario, where both players' utilities are not known in advance, and provide almost tight sample complexity results on learning the RSE. As a corollary of this result, we also obtain an algorithm for learning SSE, which strictly improves a key result of Bai et al. in terms of both utility guarantee and computational efficiency.
- J. Gan, A. Hennes, R. Majumdar, D. Mandal, G. Radanovic.
Markov decision processes with time-varying geometric discounting. AAAI '23. [link | pdf]Canonical models of Markov decision processes (MDPs) usually consider geometric discounting based on a constant discount factor. While this standard modeling approach has led to many elegant results, some recent studies indicate the necessity of modeling time-varying discounting in certain applications. This paper studies a model of infinite-horizon MDPs with time-varying discount factors. We take a game-theoretic perspective, whereby each time step is treated as an independent decision maker with their own (fixed) discount factor, and we study the subgame perfect equilibrium (SPE) of the resulting game as well as the related algorithmic problems. We present a constructive proof of the existence of an SPE and demonstrate the EXPTIME-hardness of computing an SPE. We also turn to the approximate notion of ε-SPE and show that an ε-SPE exists under milder assumptions. An algorithm is presented to compute an ε-SPE, of which an upper bound of the time complexity, as a function of the convergence property of the time-varying discount factor, is provided.
- J. Gan, R. Majumdar, G. Radanovic, A. Singla.
Envy-free policy teaching to multiple agents. NeurIPS '22. [pdf | link]We study envy-free policy teaching. A number of agents independently explore a common Markov decision process (MDP), but each with their own reward function and discounting rate. A teacher wants to teach a target policy to this diverse group of agents, by means of modifying the agents' reward functions: providing additional bonuses to certain actions, or penalizing them. When personalized reward modification programs are used, an important question is how to design the programs so that the agents think they are treated fairly. We adopt the notion of envy-freeness (EF) from the literature on fair division to formalize this problem and investigate several fundamental questions about the existence of EF solutions in our setting, the computation of cost-minimizing solutions, as well as the price of fairness (PoF), which measures the increase of cost due to the consideration of fairness. We show that 1) an EF solution may not exist if penalties are not allowed in the modifications, but otherwise always exists. 2) Computing a cost-minimizing EF solution can be formulated as convex optimization and hence solved efficiently. 3) The PoF increases but at most quadratically with the geometric sum of the discount factor, and at most linearly with the size of the MDP and the number of agents involved; we present tight asymptotic bounds on the PoF. These results indicate that fairness can be incorporated in multi-agent teaching without significant computational or PoF burdens.
- J. Gan, E. Elkind, S. Kraus, M. Wooldridge.
Defense coordination in security games: equilibrium analysis and mechanism design. Artificial Intelligence. [link]Real-world security scenarios sometimes involve multiple defenders: security agencies of two or more countries might patrol the same border areas, and domestic security agencies might also operate in the same locations when their areas of jurisdiction overlap. Motivated by these scenarios and the observation that uncoordinated movements of the defenders may lead to an inefficient defense, we introduce a model of multi-defender security games and explore the possibility of improving efficiency by coordinating the defenders, specifically, by pooling the defenders' resources and allocating them jointly. The model generalizes the standard model of Stackelberg security games, where a defender (now a group of defenders) allocates security resources to protect a set of targets, and an attacker picks the best target to attack. In particular, we are interested in the situation with heterogeneous defenders, who may value the same target differently. Our task is twofold. First, we need to develop a good understanding of the uncoordinated situation, as the baseline to be improved. To this end we formulate a new equilibrium concept, and prove that an equilibrium under this concept always exists and can be computed efficiently. Second, to coordinate the heterogeneous defenders we take a mechanism design perspective and aim to find a mechanism to generate joint resource allocation strategies. We seek a mechanism that improves the defenders' utilities upon the uncoordinated baseline, achieves Pareto efficiency, and incentivizes the defenders to report their true incentives and execute the recommended strategies. Our analysis establishes several impossibility results, indicating the intrinsic difficulties of defense coordination. Specifically, we show that even the basic properties listed above are in conflict with each other: no mechanism can simultaneously satisfy them all, or even some proper subsets of them. In terms of positive results, we present mechanisms that satisfy all combinations of the properties that are not ruled out by our impossibility results, thereby providing a comprehensive profile of the mechanism design problem with respect to the properties considered.
(Preliminary versions appeared in AAMAS '18 and '20.)
- J. Gan, R. Majumdar, G. Radanovic, A. Singla.
Bayesian persuasion in sequential decision-making. AAAI '22. [pdf | arXiv | link]
(Outstanding Paper Honorable Mention @ AAAI '22)We study a dynamic model of Bayesian persuasion in sequential decision-making settings. An informed principal observes an external parameter of the world and advises an uninformed agent about actions to take over time. The agent takes actions in each time step based on the current state, the principal's advice/signal, and beliefs about the external parameter. The action of the agent updates the state according to a stochastic process. The model arises naturally in many applications, e.g., an app (the principal) can advise the user (the agent) on possible choices between actions based on additional real-time information the app has. We study the problem of designing a signaling strategy from the principal's point of view. We show that the principal has an optimal strategy against a myopic agent, who only optimizes their rewards locally, and the optimal strategy can be computed in polynomial time. In contrast, it is NP-hard to approximate an optimal policy against a far-sighted agent. Further, we show that if the principal has the power to threaten the agent by not providing future signals, then we can efficiently design a threat-based strategy. This strategy guarantees the principal's payoff as if playing against an agent who is far-sighted but myopic to future signals.
- A. Agarwal, E. Elkind, J. Gan, A. Igarashi, W. Suksompong, A. Voudouris.
Schelling games on graphs. Artificial Intelligence. [link]We study strategic games inspired by Schelling's seminal model of residential segregation. These games are played on undirected graphs, with the set of agents partitioned into multiple types; each agent either aims to maximize the fraction of her neighbors who are of her own type, or occupies a node of the graph and never moves away. We consider two natural variants of this model: in jump games agents can jump to empty nodes of the graph to increase their utility, while in swap games they can swap positions with other agents. We investigate the existence, computational complexity, and quality of equilibrium assignments in these games, both from a social welfare perspective and from a diversity perspective. Some of our results extend to a more general setting where the preferences of the agents over their neighbors are defined by a social network rather than a partition into types.
(Preliminary versions appeared in IJCAI '19 and AAAI '20.)
- G. Birmpas, J. Gan, A. Hollender, F. Marmolejo-Cossío, N. Rajgopal, A. Voudouris.
Optimally deceiving a learning leader in Stackelberg games. Journal of AI Research (JAIR). [pdf]Recent results have shown that algorithms for learning the optimal commitment in a Stackelberg game are susceptible to manipulation by the follower. These learning algorithms operate by querying the best responses of the follower, who consequently can deceive the algorithm by using fake best responses, typically by responding according to fake payoffs that are different from the actual ones. For this strategic behavior to be successful, the main challenge faced by the follower is to pinpoint the fake payoffs that would make the learning algorithm output a commitment that benefits them the most. While this problem has been considered before, the related literature has only focused on a simple setting where the follower can only choose from a finite set of payoff matrices, thus leaving the general version of the problem unanswered. In this paper, we fill this gap by showing that it is always possible for the follower to efficiently compute (near-)optimal fake payoffs, for various scenarios of learning interaction between the leader and the follower. Our results also establish an interesting connection between the follower's deception and the leader's maximin utility: through deception, the follower can induce almost any (fake) Stackelberg equilibrium if and only if the leader obtains at least their maximin utility in this equilibrium.
(A preliminary version appeared in NeurIPS '20.)
- E. Elkind, J. Gan, S. Obraztsova, Z. Rabinovich, A. Voudouris.
Protecting elections by recounting ballots. Artificial Intelligence. [link]Complexity of voting manipulation is a prominent topic in computational social choice. In this work, we consider a two-stage voting manipulation scenario. First, a malicious party (an attacker) attempts to manipulate the election outcome in favor of a preferred candidate by changing the vote counts in some of the voting districts. Afterwards, another party (a defender), which cares about the voters' wishes, demands a recount in a subset of the manipulated districts, restoring their vote counts to their original values. We investigate the resulting Stackelberg game for the case where votes are aggregated using two variants of the Plurality rule, and obtain an almost complete picture of the complexity landscape, both from the attacker's and from the defender's perspective.
(A preliminary version appeared in IJCAI '19.)
- J. Gan, Q. Guo, L. Tran-Thanh, B. An, M. Wooldridge.
Manipulating a learning defender and ways to counteract. NeurIPS '19. [pdf | arXiv | poster]In Stackelberg security games when information about the attacker's payoffs is uncertain, algorithms have been proposed to learn the optimal defender commitment by interacting with the attacker and observing their best responses. In this paper, we show that, however, these algorithms can be easily manipulated if the attacker responds untruthfully. As a key finding, attacker manipulation normally leads to the defender learning a maximin strategy, which effectively renders the learning attempt meaningless as to compute a maximin strategy requires no additional information about the other player at all. We then apply a game-theoretic framework at a higher level to counteract such manipulation, in which the defender commits to a policy that specifies her strategy commitment according to the learned information. We provide a polynomial-time algorithm to compute the optimal such policy, and in addition, a heuristic approach that applies even when the attacker's payoff space is infinite or completely unknown. Empirical evaluation shows that our approaches can improve the defender's utility significantly as compared to the situation when attacker manipulation is ignored.
- J. Gan, H. Xu, Q. Guo, L. Tran-Thanh, Z. Rabinovich, M. Wooldridge.
Imitative follower deception in Stackelberg games. EC '19. [pdf | arXiv | poster]Information uncertainty is one of the major challenges facing applications of game theory. In the context of Stackelberg games, various approaches have been proposed to deal with the leader's incomplete knowledge about the follower's payoffs, typically by gathering information from the leader's interaction with the follower. Unfortunately, these approaches rely crucially on the assumption that the follower will not strategically exploit this information asymmetry, i.e., the follower behaves truthfully during the interaction according to their actual payoffs. As we show in this paper, the follower may have strong incentives to deceitfully imitate the behavior of a different follower type and, in doing this, benefit significantly from inducing the leader into choosing a highly suboptimal strategy. This raises a fundamental question: how to design a leader strategy in the presence of a deceitful follower? To answer this question, we put forward a basic model of Stackelberg games with (imitative) follower deception and show that the leader is indeed able to reduce the loss due to follower deception with carefully designed policies. We then provide a systematic study of the problem of computing the optimal leader policy and draw a relatively complete picture of the complexity landscape; essentially matching positive and negative complexity results are provided for natural variants of the model. Our intractability results are in sharp contrast to the situation with no deception, where the leader's optimal strategy can be computed in polynomial time, and thus illustrate the intrinsic difficulty of handling follower deception. Through simulations we also examine the benefit of considering follower deception in randomly generated games.
- AAAI 19-25, ICLR 22-25, SAGT 22, ICML 21-24, IJCAI 19-24, AAMAS 20-24, NeurIPS 20-25, EC 22, ECAI 20&23, EAMO 21, MFCS 19, GameSec 19.
- Artificial Intelligence (AIJ), Journal of Artificial Intelligence Research (JAIR), Theoretical Computer Science (TCS), IEEE Trans. on AI (TAI), IEEE Trans. on Information Forensics and Security (TIFS), Autonomous Agents and Multi-agent Systems (JAAMAS).
- Artificial Intelligence (2022/23/24), lecturer, Oxford
- Advanced Topics in Machine Learning (2022), lecturer, Oxford
https://www.cs.ox.ac.uk/people/jiarui.gan/
Department of Computer Science
University of Oxford
Parks Rd, Oxford OX1 3QD, UK
© Jiarui Gan 2025