Papers on Reinforcement Learning for Operations Research

A record of recent research papers on reinforcement learning for operations research problems.

RL for inventory management

F. Stranieri, F. Stella, and C. Kouki, “Performance of deep reinforcement learning algorithms in two-echelon inventory control systems,” Int. J. Prod. Res., vol. 62, no. 17, pp. 6211–6226, Sept. 2024. Code
- Problem: two-echelon inventory control systems, seasonal demand, multi-products
- Methods:
  - MDP formulation; DRL algorithms; Balance allocation rule
  - Bayesian optimization (BO) for heuristic policies
X. Liu, C. Alexopoulos, and Y. Peng, “A simulation-driven machine learning framework for large-scale inventory management,” Ann. Oper. Res., pp. 1–27, Oct. 2025. Code
- Imitation learning with target heuristic policies
- Real data from JD.com (not public)
- Problem: Multi-product, single and multi-echelon
- Computational complexity; optimality proofs
T. Temizöz, C. Imdahl, R. Dijkman, D. Lamghari-Idrissi, and W. van Jaarsveld, “Deep controlled learning for inventory control,” Eur. J. Oper. Res., vol. 324, no. 1, pp. 104–117, July 2025. Code written in C++
- Problem: lost sales, perishable inventory, and random lead times
- Methods:
  - New algorithm, Deep controlled learning, for Input-Driven MDPs
  - RL as classification problem
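As a rough illustration of the lost-sales setting named above, here is a minimal sketch of a one-period state transition. It is a textbook simplification, not the paper's model: it assumes a fixed lead time equal to the pipeline length and ignores perishability and random lead times. The function name `lost_sales_step` and its arguments are hypothetical.

```python
from collections import deque


def lost_sales_step(on_hand, pipeline, order_qty, demand):
    """Advance a lost-sales inventory system by one period.

    Assumptions (illustrative only): fixed lead time of len(pipeline)
    periods, no perishability, and unmet demand is lost, not backordered.
    """
    pipeline = deque(pipeline)
    pipeline.append(order_qty)        # the new order joins the end of the pipeline
    on_hand += pipeline.popleft()     # the oldest outstanding order arrives
    sales = min(on_hand, demand)      # cannot sell more than is on hand
    lost = demand - sales             # unmet demand is lost
    on_hand -= sales
    return on_hand, list(pipeline), lost


# Example: 2 units on hand, 3 units arriving, order 5, demand 7.
state = lost_sales_step(on_hand=2, pipeline=[3], order_qty=5, demand=7)
```

The returned tuple (on-hand stock, outstanding orders, lost sales) is the kind of state and cost signal an MDP formulation of this problem would typically expose.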
H. Dehaybe, D. Catanzaro, and P. Chevalier, “Deep Reinforcement Learning for inventory optimization with non-stationary uncertain demand,” Eur. J. Oper. Res., vol. 314, no. 2, pp. 433–445, Apr. 2024. Code
- Problem:
  - Single-Item Stochastic Lot-Sizing Problem (SISLSP) with non-stationary uncertain demand
- Methods:
  - State Embedding of Forecast Windows
I. Kaynov, M. van Knippenberg, V. Menkovski, A. van Breemen, and W. van Jaarsveld, “Deep Reinforcement Learning for One-Warehouse Multi-Retailer inventory management,” Int. J. Prod. Econ., vol. 267, Art. no. 109088, Jan. 2024.
- Problem: One-Warehouse Multi-Retailer (OWMR) inventory management
- Methods:
  - Sequential allocation rule
- Experiments:
  - Shows that the proportional allocation rule performs poorly and that the sequential allocation rule performs better
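For context, a proportional allocation rule in its generic form can be sketched as below: when the warehouse cannot fill all retailer orders, each order is scaled down by the same ratio. This is an assumed textbook version, and the exact rule evaluated in the paper may differ. The function name and the largest-remainder rounding are illustrative choices.

```python
def proportional_allocation(orders, warehouse_stock):
    """Split scarce warehouse stock across retailers in proportion to their orders.

    Illustrative sketch: integer quantities, with leftover units assigned
    by largest remainder so the full stock is shipped.
    """
    total = sum(orders)
    if total <= warehouse_stock:
        return list(orders)           # enough stock: fill every order in full

    # Floor of each retailer's proportional share, using integer arithmetic.
    alloc = [q * warehouse_stock // total for q in orders]
    remainders = [q * warehouse_stock % total for q in orders]

    # Hand out the units lost to rounding, largest remainder first.
    leftover = warehouse_stock - sum(alloc)
    ranked = sorted(range(len(orders)), key=lambda i: remainders[i], reverse=True)
    for i in ranked[:leftover]:
        alloc[i] += 1
    return alloc


shipments = proportional_allocation([10, 20, 30], warehouse_stock=30)
```

Under this rule every retailer is rationed by the same fraction regardless of its current inventory position.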

RL for operations research problems

A. Ramanujam et al., “SafeOR-Gym: A benchmark suite for safe reinforcement learning algorithms on practical operations research problems,” arXiv [cs.LG], 02-June-2025. Code
- Problems: 9 OR environments
- Methods: Safe RL algorithms
  - Constrained Markov Decision Process (CMDP)
  - Constraint-handling methods
  - Constrained RL algorithms
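For reference, the standard CMDP objective that safe RL methods commonly target can be written as follows. The notation (policy $\pi$, reward $r$, cost functions $c_i$, cost budgets $d_i$, discount $\gamma$) is generic and assumed here, not taken from the paper.

```latex
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_{c_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i,
\qquad i = 1, \dots, m

% One common constraint-handling approach is Lagrangian relaxation:
\min_{\lambda \ge 0} \; \max_{\pi} \; L(\pi, \lambda)
= J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big( J_{c_i}(\pi) - d_i \big)
```

Lagrangian relaxation is only one of several constraint-handling approaches. It is shown here because it makes the trade-off between reward and constraint violation explicit.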