Federated Reinforcement Learning with Environment Heterogeneity
H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated Reinforcement Learning with environment heterogeneity,” arXiv [cs.LG], 2022.
Problem setting:
- \(n\) agents located in \(n\) different environments.
- Each agent \(i\) has the same state space \(\mathcal{S}\), action space \(\mathcal{A}\), reward function \(r\), but different transition dynamics \(P_i\).
Algorithm:
Learning a single policy that is uniformly good across all environments: QAvg and PAvg
Personalization: an embedding-based method, applied to the deep variants DQNAvg and DDPGAvg
QAvg
In iteration \(t\), each agent \(i\) maintains a local Q-function \(Q_t^i\) and performs local updates using data from its own environment. After the local updates, agents communicate their Q-functions to compute the average Q-function:
\[ \bar{Q}_t(s, a) \leftarrow \frac{1}{n} \sum_{i=1}^n Q_t^i(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A} \]
Every agent then replaces its local Q-function with the average:
\[ Q_{t}^i(s, a) \leftarrow \bar{Q}_t(s, a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}, i = 1, \ldots, n. \]
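The QAvg loop above (local Q-learning updates followed by averaging and broadcast) can be sketched in tabular form. The `ChainEnv` toy environment, its `slip` parameter (which models the heterogeneous dynamics \(P_i\)), and all hyperparameters are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

class ChainEnv:
    """Toy 2-state MDP: action 1 attempts to reach the rewarding state 1
    but slips back to state 0 with probability `slip`; action 0 stays in
    state 0 with no reward. The per-agent `slip` stands in for the
    heterogeneous transition dynamics P_i. (Hypothetical environment.)"""
    def __init__(self, slip):
        self.slip = slip
    def reset(self):
        return 0
    def step(self, s, a, rng):
        if a == 1 and rng.random() >= self.slip:
            return 1, 1.0   # reached state 1, reward 1
        return 0, 0.0       # stayed in / slipped back to state 0

def qavg(envs, n_states, n_actions, rounds=20, local_steps=50,
         gamma=0.9, alpha=0.1, eps=0.2, seed=0):
    """QAvg sketch: each agent runs epsilon-greedy tabular Q-learning in
    its own environment; after `local_steps` updates, all agents average
    their Q-tables and overwrite their local copies with the average."""
    rng = np.random.default_rng(seed)
    n = len(envs)
    Q = [np.zeros((n_states, n_actions)) for _ in range(n)]
    states = [env.reset() for env in envs]
    for _ in range(rounds):
        # local Q-learning updates in each agent's own environment
        for i, env in enumerate(envs):
            s = states[i]
            for _ in range(local_steps):
                if rng.random() < eps:
                    a = int(rng.integers(n_actions))
                else:
                    a = int(np.argmax(Q[i][s]))
                s2, r = env.step(s, a, rng)
                Q[i][s, a] += alpha * (r + gamma * Q[i][s2].max() - Q[i][s, a])
                s = s2
            states[i] = s
        # communication round: average the Q-tables and broadcast
        Q_bar = np.mean(Q, axis=0)
        Q = [Q_bar.copy() for _ in range(n)]
    return Q[0]
```

With three agents whose environments differ only in `slip`, the averaged Q-function learns that action 1 is preferable, i.e. a policy that is good across all three environments at once.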
PAvg
Each agent \(i\) repeats local policy updates for several iterations to obtain a local policy \(\pi_t^i(\cdot \mid s)\). Agents then communicate their policies to compute the average policy:
\[ \bar{\pi}_t(a|s) \leftarrow \frac{1}{n} \sum_{i=1}^n \pi_t^i(a|s), \quad \forall s \in \mathcal{S}, a \in \mathcal{A} \]
Every agent then replaces its local policy with the average:
\[ \pi_{t}^i(a|s) \leftarrow \bar{\pi}_t(a|s), \quad \forall s \in \mathcal{S}, a \in \mathcal{A}, i = 1, \ldots, n. \]
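The PAvg communication step above can be sketched assuming a direct tabular policy parameterization, where each \(\pi(\cdot \mid s)\) is a row of a stochastic matrix; averaging rows of valid distributions again yields valid distributions. The `local_policy_update` below is an illustrative mirror-ascent (exponentiated-gradient) stand-in for whatever local policy update each agent actually runs, against an assumed local Q-estimate:

```python
import numpy as np

def local_policy_update(pi, Q, eta=1.0):
    """Hypothetical local step: exponentiated-gradient improvement of a
    tabular policy `pi` (n_states x n_actions, row-stochastic) against
    the agent's own Q-estimate for its environment."""
    new = pi * np.exp(eta * Q)
    return new / new.sum(axis=1, keepdims=True)

def pavg_round(policies):
    """One PAvg communication round: average the agents' policy tables
    entrywise and broadcast the result back to every agent. The average
    of row-stochastic matrices is itself row-stochastic."""
    pi_bar = np.mean(policies, axis=0)
    return [pi_bar.copy() for _ in policies]
```

Where the agents' local Q-estimates agree on the best action, the averaged policy keeps that preference; where they disagree symmetrically, the average falls back toward uniform, reflecting the "uniformly good" compromise across environments.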