Abstract
Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. Many applications have focused on two-player zero-sum games, employing standard reinforcement learning to compute ``oracle'' or ``exploiter'' response policies via approximate best response. In this paper, we introduce Monte Carlo tree search with generative world state sampling to augment the best-response steps. We show empirical convergence to Nash equilibria and the effects of different meta-solver choices across a suite of general-sum and n-player sequential games. We then present case studies on negotiation games, including Colored Trails and the multi-issue bargaining game ``Deal or No Deal''. We propose two new meta-solvers based on the Nash Bargaining Solution (NBS), along with simple gradient ascent algorithms to compute them. The NBS meta-solvers produce agents that achieve higher social welfare than purely Nash-inspired ones and come closest to the Pareto frontier in Colored Trails. Finally, we report on the generalization capabilities of agents trained via this regime by evaluating them against human participants. Overall, we find that search and generative modeling help find stronger policies during training, enable online Bayesian co-player prediction, and train fair agents that, when negotiating with humans, achieve social welfare comparable to that of humans trading among themselves.