Algorithms for Federated Reinforcement Learning

research
reinforcement learning
federated learning
Author

Ziang Liu

Published

February 15, 2026

Federated Reinforcement Learning with Environment Heterogeneity

H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated Reinforcement Learning with Environment Heterogeneity,” arXiv [cs.LG], 2022. code

Problem setting:

  • \(n\) agents located in \(n\) different environments.
  • Each agent \(i\) has the same state space \(\mathcal{S}\), action space \(\mathcal{A}\), reward function \(r\), but different transition dynamics \(P_i\).

Algorithm:

Learning a single policy that is uniformly good across all environments: QAvg and PAvg

Personalization: an embedding-based method, applied to DQNAvg and DDPGAvg

QAvg

Each agent \(i\) in iteration \(t\) maintains a local Q-function \(Q_t^i\) and performs several local updates using its own data. After the local updates, agents communicate their Q-functions to compute the average Q-function:

\[ \bar{Q}_t(s, a) \leftarrow \frac{1}{n} \sum_{i=1}^n Q_t^i(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A} \]

Then,

\[ Q_{t}^i(s, a) \leftarrow \bar{Q}_t(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}, i = 1, \ldots, n. \]
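The loop above can be sketched in tabular form. This is a minimal illustration, not the paper's implementation: the `ChainEnv` toy MDP, its per-agent slip probability (standing in for the heterogeneous dynamics \(P_i\)), and all hyperparameters are assumptions for the sake of a runnable example.

```python
import numpy as np

class ChainEnv:
    """Toy 5-state chain MDP; a per-agent slip probability stands in
    for the heterogeneous transition dynamics P_i (illustrative)."""
    def __init__(self, slip, seed=0):
        self.slip = slip
        self.rng = np.random.default_rng(seed)
        self.n_states, self.n_actions = 5, 2
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        if self.rng.random() < self.slip:  # environment-specific slip
            a = 1 - a
        self.s = min(self.s + 1, 4) if a == 1 else max(self.s - 1, 0)
        r = 1.0 if self.s == 4 else 0.0
        return self.s, r, r > 0.0

def qavg(envs, n_states, n_actions, rounds=20, local_steps=200,
         alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(envs)
    Q = [np.zeros((n_states, n_actions)) for _ in range(n)]
    for _ in range(rounds):
        # Local epsilon-greedy Q-learning in each agent's own environment.
        for i, env in enumerate(envs):
            s = env.reset()
            for _ in range(local_steps):
                a = int(rng.integers(n_actions)) if rng.random() < eps \
                    else int(np.argmax(Q[i][s]))
                s2, r, done = env.step(a)
                target = r + gamma * (0.0 if done else np.max(Q[i][s2]))
                Q[i][s, a] += alpha * (target - Q[i][s, a])
                s = env.reset() if done else s2
        # Communication round: average the Q-tables, broadcast back.
        Q_bar = sum(Q) / n
        Q = [Q_bar.copy() for _ in range(n)]
    return Q[0]

# Two agents in environments with different dynamics (slip rates).
Q = qavg([ChainEnv(0.0), ChainEnv(0.3, seed=1)], n_states=5, n_actions=2)
```

Note that only the Q-tables cross the network; each agent's trajectories stay local, which is the federated aspect of the scheme.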

PAvg

Each agent \(i\) performs several local update iterations to obtain a local policy \(\pi_{t}^i(\cdot|s)\). Then, agents communicate their policies to compute the average policy:

\[ \bar{\pi}_t(a|s) \leftarrow \frac{1}{n} \sum_{i=1}^n \pi_t^i(a|s), \quad \forall s \in \mathcal{S}, a \in \mathcal{A} \]

Then,

\[ \pi_{t}^i(a|s) \leftarrow \bar{\pi}_t(a|s), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}, i = 1, \ldots, n. \]
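The communication round of PAvg reduces to averaging row-stochastic policy tables and broadcasting the result back. A minimal sketch, assuming tabular softmax-style policies of shape `(n_states, n_actions)` (the concrete numbers below are illustrative):

```python
import numpy as np

def pavg_round(policies):
    """One PAvg communication round: average the local policy tables
    pi_i(.|s) and broadcast the average back to every agent."""
    pi_bar = np.mean(policies, axis=0)
    return [pi_bar.copy() for _ in policies]

# Two agents, 2 states, 2 actions; rows are distributions over actions.
pi_1 = np.array([[0.9, 0.1], [0.2, 0.8]])
pi_2 = np.array([[0.5, 0.5], [0.6, 0.4]])
pi_1, pi_2 = pavg_round([pi_1, pi_2])
```

Since each row of every local policy sums to 1, each row of the average also sums to 1, so \(\bar{\pi}_t\) is itself a valid policy.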

Personalization

Each environment is associated with a unique ID, and the ID is embedded into a vector. The embedding vector is concatenated with the state representation and fed into the Q-network or policy network.

For environments not seen during training, the embedding vector is initialized as the average of the training environments' embedding vectors.
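The lookup-and-concatenate step can be sketched as follows. This is a schematic with numpy in place of a deep-learning framework; the class and function names, the embedding dimension, and the random initialization are all assumptions, and in practice the embedding table would be trained jointly with the Q- or policy network.

```python
import numpy as np

class EnvEmbedding:
    """Per-environment embedding table (hypothetical sketch): each
    training environment ID maps to a learnable vector."""
    def __init__(self, n_train_envs, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(size=(n_train_envs, dim))
    def lookup(self, env_id):
        if env_id < len(self.table):       # seen during training
            return self.table[env_id]
        # Unseen environment: initialize with the average embedding.
        return self.table.mean(axis=0)

def network_input(state_features, emb, env_id):
    """Concatenate the state representation with the environment
    embedding before feeding the Q-network or policy network."""
    return np.concatenate([state_features, emb.lookup(env_id)])

# 3 training environments, embedding dimension 4, 2 state features.
emb = EnvEmbedding(n_train_envs=3, dim=4)
x_seen = network_input(np.ones(2), emb, env_id=0)   # known env
x_new = network_input(np.ones(2), emb, env_id=7)    # unseen env
```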