Nov 06

BenchMARL – Benchmarking Multi-Agent Reinforcement Learning



  • We introduce BenchMARL, a training library for benchmarking MARL algorithms, tasks, and models backed by TorchRL.
  • BenchMARL already contains a variety of SOTA algorithms and tasks.
  • BenchMARL is grounded in its core tenets: standardization and reproducibility.

What is BenchMARL 🧐?

BenchMARL is a Multi-Agent Reinforcement Learning (MARL) training library created to enable reproducibility
and benchmarking across different MARL algorithms and environments.
Its mission is to present a standardized interface that allows easy integration of new algorithms and environments to
provide a fair comparison with existing solutions.
BenchMARL uses TorchRL as its backend, which grants it high performance
and state-of-the-art implementations.
It also uses hydra for flexible and modular configuration,
and its data reporting is compatible with marl-eval
for standardised and statistically strong evaluations.

BenchMARL's core design tenets are:

  • Reproducibility through systematic grounding and standardization of configuration
  • Standardised and statistically strong plotting and reporting
  • Experiments that are independent of the algorithm, environment, and model choices
  • Breadth over the MARL ecosystem
  • Easy implementation of new algorithms, environments, and models
  • Leveraging the know-how and infrastructure of TorchRL, without reinventing the wheel

Why would I BenchMARL 🤔?

Why would you BenchMARL, I hear you ask.
Well, you can BenchMARL to compare different algorithms, environments, and models,
to check how your new research compares to existing work, or simply to approach
the domain and get an easy picture of the landscape.

Why does it exist?

We created it because, compared to other ML domains, RL has always been more fragmented
in terms of shared community standards, tools, and interfaces.
In MARL, this problem is even more evident, with new libraries being frequently
introduced that focus on specific algorithms, environments, or models. Furthermore,
these libraries often implement components from scratch, without leveraging the know-how of
the single-agent RL community. In fact, the great majority of components used in MARL are shared
with single-agent RL (e.g., losses such as the PPO loss that MAPPO builds on, models, probability distributions, replay buffers, and much more).

This fragmentation of the domain has led to a reproducibility crisis, recently highlighted in
a NeurIPS paper [^1]. While the authors of [^1] propose a set of tools for statistically strong reporting of results,
there is still a need for a standardized library to run such benchmarks.
This is where BenchMARL comes in. Its mission is to provide a benchmarking tool for MARL,
leveraging the components of TorchRL for a solid RL backend.

[^1]: Gorsane, Rihab, et al. "Towards a standardised performance evaluation protocol
for cooperative marl." Advances in Neural Information Processing Systems 35 (2022): 5510-5521.

How do I use it?

Command line

Simple: to run an experiment from the command line, do:

python benchmarl/ algorithm=mappo task=vmas/balance

To run multiple experiments (a benchmark), you can do:

python benchmarl/ -m algorithm=mappo,qmix,masac task=vmas/balance,vmas/sampling seed=0,1

Hydra multirun supports many launchers in the backend.
The default launcher runs the experiments sequentially, but parallel and
slurm launchers are also available.
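To make the size of such a multirun concrete, here is a minimal Python sketch that enumerates the combinations swept by the multirun command above. It is not part of BenchMARL or Hydra: the `expand_sweep` helper is hypothetical, and it assumes the sweep is a plain Cartesian product of the comma-separated values.

```python
from itertools import product


def expand_sweep(sweep):
    """Return one dict of overrides per combination of swept values.

    `sweep` maps an override key to the list of values given on the
    command line. Hypothetical helper, for illustration only.
    """
    keys = list(sweep)
    value_lists = [sweep[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Values taken from the multirun command above.
sweep = {
    "algorithm": ["mappo", "qmix", "masac"],
    "task": ["vmas/balance", "vmas/sampling"],
    "seed": [0, 1],
}

runs = expand_sweep(sweep)
for run in runs:
    print(" ".join(f"{key}={value}" for key, value in run.items()))
print(len(runs), "experiments in total")
```

With three algorithms, two tasks, and two seeds, the sweep amounts to 3 × 2 × 2 = 12 experiments, which is why the choice of launcher matters once the benchmark grows.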


Run an experiment:

experiment = Experiment(

Run a benchmark:

benchmark = Benchmark(
    seeds={0, 1},


The goal of BenchMARL is to bring different MARL environments and algorithms
under the same interfaces to enable fair and reproducible comparison and benchmarking.
BenchMARL is a full-pipeline unified training library with the goal of enabling users to run
any comparison they want across our algorithms and tasks in just one line of code.
To achieve this, BenchMARL interconnects components from TorchRL,
which provides an efficient and reliable backend.

The library has a default configuration for each of its components.
While parts of this configuration are meant to be changed (for example, experiment configurations),
other parts (such as tasks) should not be changed, to allow for reproducibility.
To aid in this, each version of BenchMARL is paired with a default configuration.

Let’s now introduce each component in the library.

Experiment. An experiment is a training run in which an algorithm, a task, and a model are fixed.
Experiments are configured by passing these values alongside a seed and the experiment hyperparameters.
The experiment hyperparameters cover both
on-policy and off-policy algorithms, discrete and continuous actions, and probabilistic and deterministic policies
(as they are agnostic of the algorithm or task used).
An experiment can be launched from the command line or from a script.

Benchmark. In the library, we call a benchmark a collection of experiments that can vary in task, algorithm, or model.
A benchmark shares the same experiment configuration across all of its experiments.
Benchmarks allow you to compare different MARL components in a standardized way.
A benchmark can be launched from the command line or from a script.

Algorithms. Algorithms are an ensemble of components (e.g., losses, replay buffers) which
determine the training strategy. Here is a table with the currently implemented algorithms in BenchMARL.

| Name | On/Off policy | Actor-critic | Full-observability in critic | Action compatibility | Probabilistic actor |
|---|---|---|---|---|---|
| MAPPO | On | Yes | Yes | Continuous + Discrete | Yes |
| IPPO | On | Yes | No | Continuous + Discrete | Yes |
| MADDPG | Off | Yes | Yes | Continuous | No |
| IDDPG | Off | Yes | No | Continuous | No |
| MASAC | Off | Yes | Yes | Continuous + Discrete | Yes |
| ISAC | Off | Yes | No | Continuous + Discrete | Yes |
| QMIX | Off | No | NA | Discrete | No |
| VDN | Off | No | NA | Discrete | No |
| IQL | Off | No | NA | Discrete | No |
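The action-compatibility column decides which algorithms can be paired with which tasks. As a rough illustration, the sketch below encodes a subset of the table's columns as plain Python data and filters it by action type. The `AlgorithmInfo` class and the `compatible` function are hypothetical names for this example and are not part of BenchMARL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlgorithmInfo:
    """One row of the algorithms table (subset of its columns)."""

    name: str
    on_policy: bool
    continuous: bool
    discrete: bool


# Rows copied from the table above.
ALGORITHMS = [
    AlgorithmInfo("MAPPO", True, True, True),
    AlgorithmInfo("IPPO", True, True, True),
    AlgorithmInfo("MADDPG", False, True, False),
    AlgorithmInfo("IDDPG", False, True, False),
    AlgorithmInfo("MASAC", False, True, True),
    AlgorithmInfo("ISAC", False, True, True),
    AlgorithmInfo("QMIX", False, False, True),
    AlgorithmInfo("VDN", False, False, True),
    AlgorithmInfo("IQL", False, False, True),
]


def compatible(continuous_actions):
    """Names of the algorithms that support the requested action type."""
    if continuous_actions:
        return [algo.name for algo in ALGORITHMS if algo.continuous]
    return [algo.name for algo in ALGORITHMS if algo.discrete]


print("Discrete actions:", compatible(continuous_actions=False))
print("Continuous actions:", compatible(continuous_actions=True))
```

For a discrete-only environment, this filter leaves out MADDPG and IDDPG; for a continuous-only one, it leaves out QMIX, VDN, and IQL.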

Tasks. Tasks are scenarios from a specific environment which constitute the MARL
challenge to solve.
They differ in many aspects. Here is a table with the current environments in BenchMARL:

| Environment | Tasks | Cooperation | Global state | Reward function | Action space | Vectorized |
|---|---|---|---|---|---|---|
| VMAS | 5 | Cooperative + Competitive | No | Shared + Independent + Global | Continuous + Discrete | Yes |
| SMACv2 | 15 | Cooperative | Yes | Global | Discrete | No |
| MPE | 8 | Cooperative + Competitive | Yes | Shared + Independent | Continuous + Discrete | No |
| SISL | 2 | Cooperative | No | Shared | Continuous | No |

BenchMARL uses the TorchRL MARL API for grouping agents.
In competitive environments like MPE, for example, teams will be in different groups. Each group has its own loss,
models, buffers, and so on. Parameter sharing options refer to sharing within the group. See the example on creating
a custom algorithm
for more info.
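To picture what grouping means, here is a minimal sketch in plain Python. It assumes that a grouping can be described as a mapping from a group name to the names of the agents in that group, and that agent names follow a `team_index` pattern. The agent names and the `group_by_team` helper are illustrative assumptions, not the TorchRL or BenchMARL API.

```python
from collections import defaultdict


def group_by_team(agent_names):
    """Map each team name to its agents, using the name prefix as the team."""
    groups = defaultdict(list)
    for name in agent_names:
        # "adversary_0" -> team "adversary"; the trailing index is discarded.
        team, _, _index = name.rpartition("_")
        groups[team].append(name)
    return dict(groups)


# Hypothetical agent names for a competitive scenario with two teams.
agents = ["adversary_0", "adversary_1", "adversary_2", "agent_0"]
group_map = group_by_team(agents)
print(group_map)

# Each group gets its own independent resources (here just a placeholder buffer).
per_group = {
    group: {"agents": members, "buffer": []}
    for group, members in group_map.items()
}
print(sorted(per_group))
```

Anything shared "within the group" in this picture would be shared among the entries of one list, and never across two different keys of the mapping.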

Models. Models are neural networks used to process data. They can be used as actors (policies) or,
when requested, as critics. We provide a set of base models (layers) and a SequenceModel to concatenate
different layers. All the models can be used with or without parameter sharing within an
agent group. Here is a table of the models implemented in BenchMARL:

| Name | Decentralized | Centralized with local inputs | Centralized with global input |
|---|---|---|---|
| MLP | Yes | Yes | Yes |

And the ones that are work in progress:

| Name | Decentralized | Centralized with local inputs | Centralized with global input |
|---|---|---|---|
| GNN | Yes | Yes | No |
| CNN | Yes | Yes | Yes |


BenchMARL has many features. In this section we will dive deep
into the features that correspond to our core design tenets, but there are many more cool
nuggets here and there, such as:

  • A test CI with integration and training test routines that are run for all simulators and algorithms
  • Integration in the official TorchRL ecosystem for dedicated support
  • Experiment checkpointing and restoring using torch
  • Experiment logging compatible with many loggers (wandb, csv, mlflow, tensorboard).
    The wandb logger is fully compatible with experiment restoring and will automatically resume the run of the loaded experiment.

In the following we illustrate the features which are core to our tenets.

Fine-tuned public benchmarks

In the fine_tuned folder
we are collecting some tested hyperparameters for
specific environments to enable users to bootstrap their benchmarking.
You can just run the scripts in this folder to automatically use the proposed hyperparameters.

We will tune benchmarks for you and publicly share the configuration and benchmarking plots on
Wandb.

Currently available ones are:


In the following, we report a table of the results:


[Figures: Sample efficiency curves (all tasks); Performance profile; Aggregate scores]


Reporting and plotting

Reporting and plotting are compatible with marl-eval.
If experiment.create_json=True (this is the default in the experiment config)
a file named {experiment_name}.json will be created in the experiment output folder with the format of marl-eval.
You can load and merge these files using the utils in eval_results to create beautiful plots of
your benchmarks. No more struggling with matplotlib and latex!
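As a rough picture of what merging several experiment reports involves, the sketch below recursively merges nested JSON dictionaries. It is not the code in eval_results: the `merge_nested` and `load_and_merge` helpers are hypothetical, and the nesting used in the demo (environment, task, algorithm, seed) as well as the numeric values are made-up placeholders rather than the actual marl-eval schema.

```python
import json
import tempfile
from pathlib import Path


def merge_nested(target, source):
    """Recursively merge `source` into `target` and return `target`."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            merge_nested(target[key], value)
        else:
            target[key] = value
    return target


def load_and_merge(paths):
    """Load every JSON file in `paths` and merge them into one dictionary."""
    merged = {}
    for path in paths:
        with open(path) as f:
            merge_nested(merged, json.load(f))
    return merged


# Demo with two placeholder reports written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    reports = {
        "mappo_seed0.json": {"vmas": {"balance": {"mappo": {"seed_0": {"return": 1.0}}}}},
        "qmix_seed0.json": {"vmas": {"balance": {"qmix": {"seed_0": {"return": 2.0}}}}},
    }
    paths = []
    for name, content in reports.items():
        path = Path(tmp) / name
        path.write_text(json.dumps(content))
        paths.append(path)
    merged = load_and_merge(paths)

print(sorted(merged["vmas"]["balance"]))
```

The point of the recursion is that reports sharing an environment and a task end up side by side under the same keys, which is the shape a plotting tool needs in order to compare algorithms.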



The project can be configured either in the script itself or via hydra.
Each component in the project has a corresponding yaml configuration in the BenchMARL
conf tree.
Components’ configurations are loaded from these files into python dataclasses that act as schemas for
validation of parameter names and types. That way we keep the best of both worlds: separation of all
configuration from code and strong typing for validation! You can also directly load and validate
configuration yaml files without using hydra from a script by calling ComponentConfig.get_from_yaml().
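The idea of using a dataclass as a schema can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `MlpSketchConfig`, its fields, and the `from_dict` helper are hypothetical and do not reproduce BenchMARL's classes, and a dictionary literal stands in for the parsed yaml file.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class MlpSketchConfig:
    """Hypothetical schema; the field names are illustrative only."""

    hidden_size: int
    activation: str
    layer_norm: bool


def from_dict(cls, raw):
    """Build `cls` from `raw`, rejecting unknown names and wrong types."""
    schema = get_type_hints(cls)
    unknown = set(raw) - set(schema)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    missing = set(schema) - set(raw)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    for name, expected in schema.items():
        # Strict check, so that a bool is not silently accepted as an int.
        if type(raw[name]) is not expected:
            raise TypeError(f"{name} should be {expected.__name__}")
    return cls(**raw)


# Stand-in for the content of a parsed yaml file.
raw_config = {"hidden_size": 256, "activation": "tanh", "layer_norm": False}
config = from_dict(MlpSketchConfig, raw_config)
print(config)
```

A misspelled parameter name or a value of the wrong type fails at load time in this sketch, before any training starts, which is the benefit the paragraph above describes.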

Here are some examples of how you can override configurations.

Override experiment hyperparameters:

python benchmarl/ task=vmas/balance algorithm=mappo experiment.evaluation=true experiment.train_device="cpu"

Override algorithm hyperparameters:

python benchmarl/ task=vmas/balance algorithm=masac algorithm.num_qvalue_nets=3 algorithm.target_entropy=auto algorithm.share_param_critic=true

Override task parameters. Be careful: for benchmarking stability this is not suggested.

python benchmarl/ task=vmas/balance algorithm=mappo task.n_agents=4

Override the model:

python benchmarl/ task=vmas/balance algorithm=mappo model=sequence "model.intermediate_sizes=[256]" "model/layers@model.layers.l1=mlp" "model/layers@model.layers.l2=mlp" "+model/layers@model.layers.l3=mlp" "model.layers.l3.num_cells=[3]"

Check out the section on how to configure BenchMARL and
our examples.


One of the core tenets of BenchMARL is allowing users to leverage the existing algorithm
and tasks implementations to benchmark their newly proposed solution.

For this reason we expose standard interfaces with simple abstract methods
for algorithms, tasks, and models.
To introduce your solution in the library, you just need to implement the abstract methods
exposed by these base classes, which use objects from the TorchRL library.
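The general pattern of extending a library through abstract methods can be sketched as follows. The class and method names here (`AlgorithmSketch`, `get_loss`, `describe`) are hypothetical and chosen only for illustration; BenchMARL's real base classes define their own abstract methods and operate on TorchRL objects rather than strings.

```python
from abc import ABC, abstractmethod


class AlgorithmSketch(ABC):
    """Hypothetical base class; BenchMARL's real interface differs."""

    @abstractmethod
    def get_loss(self, group):
        """Return the loss used to train the given agent group."""

    def describe(self, groups):
        # Shared logic lives in the base class and calls the abstract method.
        return {group: self.get_loss(group) for group in groups}


class MyAlgorithm(AlgorithmSketch):
    """A custom solution only has to fill in the abstract method."""

    def get_loss(self, group):
        return f"my_loss_for_{group}"


print(MyAlgorithm().describe(["agents", "adversaries"]))
```

The base class cannot be instantiated on its own, so a subclass that forgets an abstract method fails immediately when it is constructed instead of partway through a training run.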

Here is an example on how you can create a custom algorithm.

Here is an example on how you can create a custom task.

Here is an example on how you can create a custom model.

Next steps

BenchMARL has just been born and is constantly looking for collaborators to extend and improve its capabilities.
If you are interested in joining the project, please reach out!

The next steps will include extending the library as well as fine-tuning sets of benchmark hyperparameters
to make them available to the community.