Papers – Ziang Liu

A record of the research papers I have read, each with a brief summary and my thoughts. I update this page regularly as I read more papers.

Reinforcement learning
Federated learning

Reinforcement learning

RL for inventory management

B. Rolf, I. Jackson, M. Müller, S. Lang, T. Reggelin, and D. Ivanov, “A review on reinforcement learning algorithms and applications in supply chain management,” Int. J. Prod. Res., vol. 61, no. 20, pp. 7151–7179, Oct. 2023.
- Review paper
F. Stranieri, F. Stella, and C. Kouki, “Performance of deep reinforcement learning algorithms in two-echelon inventory control systems,” Int. J. Prod. Res., vol. 62, no. 17, pp. 6211–6226, Sept. 2024. Code
- Problem: two-echelon inventory control systems, seasonal demand, multi-products
- Methods:
  - MDP formulation; DRL algorithms; balance allocation rule
  - Bayesian optimization (BO) for heuristic policies
T. Temizöz, C. Imdahl, R. Dijkman, D. Lamghari-Idrissi, and W. van Jaarsveld, “Deep controlled learning for inventory control,” Eur. J. Oper. Res., vol. 324, no. 1, pp. 104–117, July 2025. Code written in C++
- Problem: lost sales, perishable inventory, and random lead times
- Methods:
  - New algorithm, Deep controlled learning, for Input-Driven MDPs
  - RL as classification problem
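To make the "lost sales" setting concrete, here is a minimal sketch of one period of a single-item, periodic-review system. It is a simplification and not the model from the paper: it assumes a fixed lead time equal to the pipeline length and omits perishability and random lead times. The function name `lost_sales_step` and the cost parameters are hypothetical.

```python
from collections import deque


def lost_sales_step(on_hand, pipeline, order_qty, demand,
                    holding_cost=1.0, penalty_cost=4.0):
    """Advance a lost-sales inventory system by one period.

    Assumption (for illustration only): the lead time is fixed and equals
    len(pipeline); pipeline[0] is the order that arrives this period.
    Returns (new_on_hand, new_pipeline, period_cost).
    """
    pipeline = deque(pipeline)
    pipeline.append(order_qty)       # the new order joins the back of the pipeline
    on_hand += pipeline.popleft()    # the oldest outstanding order arrives
    sales = min(on_hand, demand)
    lost = demand - sales            # unmet demand is lost, not backordered
    on_hand -= sales
    cost = holding_cost * on_hand + penalty_cost * lost
    return on_hand, list(pipeline), cost
```

The state here is the pair (on-hand stock, pipeline of outstanding orders), the action is `order_qty`, and the negative cost would serve as the reward in an MDP formulation.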
H. Dehaybe, D. Catanzaro, and P. Chevalier, “Deep Reinforcement Learning for inventory optimization with non-stationary uncertain demand,” Eur. J. Oper. Res., vol. 314, no. 2, pp. 433–445, Apr. 2024. Code
- Problem:
  - Single-Item Stochastic Lot-Sizing Problem (SISLSP) with non-stationary uncertain demand
- Methods:
  - State Embedding of Forecast Windows
I. Kaynov, M. van Knippenberg, V. Menkovski, A. van Breemen, and W. van Jaarsveld, “Deep Reinforcement Learning for One-Warehouse Multi-Retailer inventory management,” Int. J. Prod. Econ., vol. 267, no. 109088, p. 109088, Jan. 2024.
- Problem: One-Warehouse Multi-Retailer (OWMR) inventory management
- Methods:
  - Sequential allocation rule
- Experiments:
  - Shows the proportional allocation rule does not work well and the sequential allocation rule performs better

RL for operations research problems

A. Ramanujam et al., “SafeOR-Gym: A benchmark suite for safe reinforcement learning algorithms on practical operations research problems,” arXiv [cs.LG], 02-June-2025. Code
- Problems: 9 OR environments
- Methods: Safe RL algorithms
  - Constrained Markov Decision Process (CMDP)
  - Constraint-handling methods
  - Constrained RL algorithms
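For reference, the standard textbook form of a discounted CMDP is sketched below. The notation is generic and assumed here rather than taken from the paper: \(r\) is the reward, \(c_i\) are the cost functions, \(d_i\) are the cost budgets, \(m\) is the number of constraints, and \(\gamma\) is the discount factor.

```latex
\[
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\right] \le d_i,
\qquad i = 1, \dots, m.
\]
```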

Federated learning

Federated reinforcement learning

H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated Reinforcement Learning with environment heterogeneity,” arXiv [cs.LG], 2022. Code
- Problem setting: \(n\) agents located in \(n\) different environments, with the same state space, action space, reward function, but different transition dynamics.
- Algorithms: QAvg and PAvg learn a single policy that performs uniformly well across all environments; an embedding-based method adds personalization and is applied to DQNAvg and DDPGAvg.
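The idea behind QAvg (local updates in each environment, followed by periodic averaging of the Q-tables) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the transition kernels are known and uses exact Bellman backups in place of sampled Q-learning updates. All names (`make_env`, `local_q_update`, `qavg`) are hypothetical.

```python
import numpy as np


def make_env(n_states, n_actions, rng):
    """Random transition kernel P with P[s, a] a distribution over next states."""
    return rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))


def local_q_update(Q, P, R, gamma, n_steps):
    """Run n_steps Bellman optimality backups under one agent's own dynamics P."""
    for _ in range(n_steps):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q


def qavg(envs, R, gamma=0.9, rounds=50, local_steps=5):
    """Alternate local updates in each environment with server-side averaging."""
    Q = np.zeros(R.shape)
    for _ in range(rounds):
        # Every agent starts the round from the shared table.
        local_tables = [local_q_update(Q.copy(), P, R, gamma, local_steps)
                        for P in envs]
        # The server averages the local Q-tables entrywise.
        Q = np.mean(local_tables, axis=0)
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R = rng.uniform(0.0, 1.0, size=(4, 2))              # shared reward function
    envs = [make_env(4, 2, rng) for _ in range(3)]      # different dynamics per agent
    print(qavg(envs, R).argmax(axis=1))                 # greedy action per state
```

All agents share the state space, action space, and reward `R`, and differ only in `P`, which matches the problem setting described above.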

Federated learning for supply chain management

H. Wang, F. Xie, Q. Duan, and J. Li, “Federated learning for supply chain demand forecasting,” Math. Probl. Eng., vol. 2022, pp. 1–8, Nov. 2022. Code
- Vertical federated LSTM model

E2E

M. Qi et al., “A practical end-to-end inventory management model with deep learning,” Manage. Sci., vol. 69, no. 2, pp. 759–773, Feb. 2023. Code
X. Liu, C. Alexopoulos, and Y. Peng, “A simulation-driven machine learning framework for large-scale inventory management,” Ann. Oper. Res., pp. 1–27, Oct. 2025. Code
- Imitation learning with target heuristic policies
- Real data from JD.com (not public)
- Problem: Multi-product, single and multi-echelon
- Computational complexity; optimality proofs

Qi et al. (2023) propose an end-to-end model that maps input features directly to replenishment decisions. They consider a multiperiod, infinite-horizon inventory management problem in which demand and vendor lead time (VLT) are stochastic. They derive the closed-form optimal order quantity given the demand and lead time in each period, and then train a neural network to learn the mapping from input features (e.g., demand- and VLT-related features, general item-level features, the review period, and the initial stock level) to three outputs: the order quantity, the demand forecast, and the VLT forecast. Their model uses a multiquantile RNN (MQRNN). The experiments are conducted on real data from JD.com.
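Multi-quantile forecasters such as MQRNN are typically trained with the quantile (pinball) loss. The sketch below shows that loss for a single quantile level; it is a generic illustration, and how the paper combines it with its other outputs is not covered here. The function name `pinball_loss` is hypothetical.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for a quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q),
    so a high q pushes the forecast toward the upper tail of demand.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

With `q = 0.9`, forecasting 8 units against a realized demand of 10 is penalized nine times as heavily as forecasting 12, which mirrors the asymmetry between stockout and holding costs in inventory settings.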

Liu et al. (2025) extend the end-to-end model by integrating deep reinforcement learning and deep learning.
