Markov Decision Processes#
Markov decision processes (MDPs) model sequential decision-making problems in which the action taken in the current period affects not only the immediate reward but also future states and, through them, future rewards. MDPs are idealized mathematical models of reinforcement learning problems. Throughout this section we assume complete knowledge of the environment, i.e., the dynamics of the MDP are given.
Formulation of MDP#
A Markov decision process (MDP) is a tuple \((\mathcal{S}, \mathcal{A}, p, r, \gamma)\) where:
\(\mathcal{S}\): Set of states
\(\mathcal{A}\): Set of actions
\(p(s', r | s, a)\): Dynamics function \(p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]\)
\(r(s, a)\): Reward function \(r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\)
\(\gamma\): Discount factor \(\gamma \in [0, 1]\)
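For a finite MDP, these objects can be stored as plain arrays. The sketch below (an illustration, not taken from the text) uses one possible layout, which is also the layout the linear-programming code later in this section expects: a reward matrix r of shape (n_state, n_action) and a transition tensor p of shape (n_state, n_action, n_state).
import numpy as np

# a hypothetical 2-state, 2-action MDP in the layout r[s, a] and p[s, a, s']
n_state, n_action = 2, 2
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])             # r[s, a]: expected reward
p = np.zeros((n_state, n_action, n_state))
p[0, 0] = [0.9, 0.1]                   # p[s, a, :]: distribution over next states
p[0, 1] = [0.2, 0.8]
p[1, 0] = [1.0, 0.0]
p[1, 1] = [0.3, 0.7]
gamma = 0.9                            # discount factor

# every p[s, a] must sum to 1
assert np.allclose(p.sum(axis=2), 1.0)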
Agent-Environment Interface#
MDPs are used to formulate sequential decision-making problems. The agent interacts with the environment to achieve a goal.
Agent: The learner and decision-maker
Environment: Everything outside the agent
The agent interacts with the environment in discrete time \(t = 0, 1, 2, \ldots\). At each time step \(t\), the agent receives a representation of the environment’s state \(S_t \in \mathcal{S}\), selects an action \(A_t \in \mathcal{A}(S_t)\), and receives a reward \(R_{t+1} \in \mathcal{R} \subset \mathbb{R}\) and the next state \(S_{t+1}\).
The interaction between the agent and the environment can be summarized as a trajectory:
\[
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots
\]
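In code, this interface is exactly the reset/step loop of Gymnasium, the library used later in this section. The snippet below is a minimal illustration with a randomly acting agent (it is not part of the original example); the environment name anticipates the cliff walking problem treated at the end of the section.
import gymnasium as gym

env = gym.make("CliffWalking-v0")
state, info = env.reset(seed=0)              # S_0
for t in range(20):
    action = env.action_space.sample()       # A_t, chosen at random here
    state, reward, terminated, truncated, info = env.step(action)  # S_{t+1}, R_{t+1}
    if terminated or truncated:
        state, info = env.reset()
env.close()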
Dynamics function#
The dynamics function \(p\) defines the probability of transitioning to state \(s'\) and receiving reward \(r\), given the current state \(s\) and action \(a\):
\[
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
\]
The state \(S_t\) is Markov if and only if:
\[
\Pr\{S_{t+1}, R_{t+1} \mid S_t, A_t\} = \Pr\{S_{t+1}, R_{t+1} \mid S_0, A_0, R_1, \ldots, S_t, A_t\}
\]
Note that \(p\) captures the environment’s dynamics completely: the probability of each possible value of \(S_t\) and \(R_t\) depends only on the immediately preceding state \(S_{t-1}\) and action \(A_{t-1}\), not on the earlier history.
Reward function#
The reward function \(r(s, a)\) gives the expected reward for taking action \(a\) in state \(s\). It can be derived from the dynamics function:
\[
r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
\]
Return#
The agent’s objective is to maximize the expected value of the cumulative sum of the scalar reward signal it receives.
To formalize the objective, we define the return \(G_t\) as a function of the rewards received after time \(t\). The simplest form of return is the sum of rewards:
\[
G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T
\]
where \(T\) is the final time step. Note that this definition of return is suitable for tasks that will eventually end. In such cases, each episode ends in a terminal state.
Some MDPs do not have terminal states. We call such MDPs continuing tasks. In continuing tasks, the return defined above could be infinite. To handle such cases, we introduce the concept of discounted return:
\[
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
\]
where \(\gamma\) is the discount factor that satisfies \(0 \leq \gamma \leq 1\).
The discounted return can also be written in the following recursive form:
\[
G_t = R_{t+1} + \gamma G_{t+1}
\]
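As a quick sanity check (an illustration, not part of the original text), the snippet below computes the discounted returns of an arbitrary reward sequence by applying the recursion backwards and verifies that \(G_0\) matches the explicit sum.
import numpy as np

rewards = [-1.0, -1.0, -1.0, 10.0]    # R_1, ..., R_T of a short episode
gamma = 0.9

# backward pass: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0
returns = np.zeros(len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

# explicit sum for G_0: sum_k gamma^k * R_{k+1}
g0 = sum(gamma ** k * reward for k, reward in enumerate(rewards))
assert np.isclose(returns[0], g0)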
Policy and Value function#
Policy#
The agent interacts with the environment by selecting actions, and we introduce the concept of a policy to describe the agent’s behavior. A policy, denoted by \(\pi\), is a mapping from states to probabilities of selecting each possible action: \(\pi(a \mid s)\) is the probability of taking action \(a\) in state \(s\).
Value function#
The state-value function for policy \(\pi\), denoted by \(v_{\pi}(s)\), is the expected return when starting in state \(s\) and following policy \(\pi\) thereafter:
\[
v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s]
\]
Similarly, the action-value function for policy \(\pi\), denoted by \(q_{\pi}(s, a)\), is the expected return when starting in state \(s\), taking action \(a\), and following policy \(\pi\) thereafter:
\[
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]
\]
An essential property of value functions is that they satisfy recursive relationships. The value of a state can be expressed in terms of the values of its possible successor states:
\[
\begin{aligned}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_{\pi}(s') \big]
\end{aligned}
\]
The last equation is known as the Bellman equation.
It is written in a recursive form that indicates the relationship between \(v_{\pi}(s)\) and all the possible successor states’ values \(v_{\pi}(s')\).
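Because the Bellman equation is linear in \(v_{\pi}\), it can be solved directly for a finite MDP. The sketch below (an illustration, not part of the original text) evaluates an arbitrary policy by solving the linear system \((I - \gamma P_{\pi}) v_{\pi} = r_{\pi}\) with NumPy, assuming \(\gamma < 1\) and the array layout used elsewhere in this section: r of shape (n_state, n_action), p of shape (n_state, n_action, n_state), and pi of shape (n_state, n_action).
import numpy as np

def evaluate_policy(r, p, pi, gamma):
    # expected one-step reward under pi: r_pi[s] = sum_a pi(a|s) * r(s, a)
    r_pi = (pi * r).sum(axis=1)
    # state-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
    p_pi = np.einsum("sa,sat->st", pi, p)
    n_state = r.shape[0]
    # Bellman equation in matrix form: v = r_pi + gamma * P_pi v
    return np.linalg.solve(np.eye(n_state) - gamma * p_pi, r_pi)
For example, a uniform random policy corresponds to pi = np.full((n_state, n_action), 1 / n_action).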
Optimal Policies and Optimal Value Functions#
For all \(s \in \mathcal{S}\), if \(v_{\pi}(s) \geq v_{\pi'}(s)\), then \(\pi\) is better than or equal to \(\pi'\), denoted by \(\pi \geq \pi'\). There is always at least one policy that is better than or equal to all other policies; such a policy is called an optimal policy and is denoted by \(\pi_*\).
Using the concept of an optimal policy, we can define the optimal state-value function and the optimal action-value function as follows:
\[
v_*(s) \doteq \max_{\pi} v_{\pi}(s), \qquad q_*(s, a) \doteq \max_{\pi} q_{\pi}(s, a)
\]
Bellman Optimality Equation#
Because \(v_*\) is the value function of an optimal policy, it must satisfy the following self-consistency condition:
\[
v_*(s) = \max_{a} \mathbb{E}\big[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\big] = \max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]
\]
The last equation is known as the Bellman optimality equation for the state-value function.
The Bellman optimality equation is a system of nonlinear equations. The solution to the system of equations is the optimal value function. For a finite MDP with \(n\) states, the system of equations has \(n\) equations and \(n\) unknowns.
In addition, the Bellman optimality equation for the action-value function is:
\[
q_*(s, a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\big] = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big]
\]
Note that if we know the optimal value function \(v_*(s)\) for all \(s \in \mathcal{S}\), we can easily recover an optimal policy \(\pi_*\) by selecting, in each state, the action that maximizes the right-hand side of the Bellman optimality equation.
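This greedy extraction step can be written in a few lines of NumPy. The function below is a small illustration (not part of the original text) using the same (r, p) array layout as before; the cliff walking implementation at the end of this section performs the same computation using the Gymnasium transition dictionary.
import numpy as np

def greedy_policy(r, p, v_star, gamma):
    # q_star[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) * v_star(s')
    q_star = r + gamma * np.einsum("sat,t->sa", p, v_star)
    # one greedy action per state
    return q_star.argmax(axis=1)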
Linear Programming#
The Bellman optimality equation can also be solved using linear programming. This is a less frequently used method for solving MDPs. The idea is as follows:
If \(v(s) \geq r(s, a) + \gamma \sum_{s'} p(s' | s, a) v(s')\) for all \(s \in S\) and \(a \in A\), then \(v(s)\) is an upper bound on \(v_*(s)\).
Among all such upper bounds, \(v_*\) is the smallest one, and it is exactly the solution of the Bellman optimality equation; therefore, minimizing \(\sum_{s} v(s)\) subject to the inequalities above recovers \(v_*\).
The linear programming formulation of the Bellman optimality equation is as follows:
\[
\begin{aligned}
\min_{v} \quad & \sum_{s \in \mathcal{S}} v(s) \\
\text{s.t.} \quad & v(s) \geq r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v(s'), \quad \forall\, s \in \mathcal{S},\ a \in \mathcal{A}
\end{aligned}
\]
Notes:
Linear programming is an exact alternative to dynamic programming (DP) for solving MDPs.
However, linear programming methods become impractical at a much smaller number of states than DP methods do (by a factor of about 100).
Python Implementation#
Linear Programming for Cliff Walking Problem#
The agent is placed in a 4x12 grid world and can move in four directions: up, down, left, and right. The agent receives a reward of -1 for each step taken and -100 for stepping into the cliff, which also sends it back to the start; the episode terminates when the goal is reached. The figure below shows the cliff walking problem as implemented in Gymnasium (formerly OpenAI Gym).
The agent starts at the bottom-left corner [3, 0], and the goal is at the bottom-right corner [3, 11]. The cliff occupies the rest of the bottom row, from [3, 1] to [3, 10]. For simplicity, the state is represented as a single integer from 0 to 47, computed as current_row * n_col + current_col.
The goal is to find the optimal policy for moving an agent from a starting position to a goal position as quickly as possible while avoiding falling off a cliff.
# r: reward matrix, n_state * n_action
# p: transition probability matrix, n_state * n_action * n_state
# gamma: discount factor
from gurobipy import GRB, Model, quicksum


def lp_solver(r, p, gamma):
    action_set = set(range(r.shape[1]))
    state_set = set(range(r.shape[0]))
    n_state = len(state_set)
    # create a model instance
    model = Model()
    # create variables
    for s in range(n_state):
        model.addVar(name=f"v_{s}", lb=-GRB.INFINITY)
    # update the model
    model.update()
    # create constraints
    for state in state_set:
        for action in action_set:
            model.addConstr(
                model.getVarByName(f"v_{state}")
                >= quicksum(
                    gamma
                    * p[state, action, next_state]
                    * model.getVarByName(f"v_{next_state}")
                    for next_state in state_set
                )
                + r[state, action]
            )
    # set objective
    model.setObjective(
        quicksum(model.getVarByName(f"v_{state}") for state in state_set),
        GRB.MINIMIZE,
    )
    # optimize
    model.optimize()
    return model
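The driver script below assumes the function above is saved in a module named lp_solver.py, matching the from lp_solver import lp_solver line; it builds the reward and transition arrays from the Gymnasium environment and then solves the resulting linear program.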
from gurobipy import GRB, Model, quicksum
import gymnasium as gym
import numpy as np

from lp_solver import lp_solver

# create an environment
env = gym.make("CliffWalking-v0")
n_state = env.unwrapped.nS
n_action = env.unwrapped.nA
state_set = set(range(n_state))
action_set = set(range(n_action))

# The player cannot be at the cliff, nor at the goal
terminal_state_set = [47]
unreachable_state_set = list(range(37, 47))
# the reachable state set is the set of all states except the cliff and the goal;
# only the states in the reachable state set are considered
reachable_state_set = state_set - set(terminal_state_set) - set(unreachable_state_set)

# set parameters
gamma = 1

# initialize reward and transition probability
r = np.zeros((n_state, n_action))
p = np.zeros((n_state, n_action, n_state))
for state in reachable_state_set:
    for action in action_set:
        for prob, next_state, reward, terminated in env.unwrapped.P[state][action]:
            r[state, action] += prob * reward
            p[state, action, next_state] += prob

# solve the mdp problem using linear programming
model = lp_solver(r, p, gamma)
# state value
value_function = {}
for state in reachable_state_set:
    value_function[state] = model.getVarByName(f"v_{state}").x
# the terminal (goal) state has value 0 by definition
for state in terminal_state_set:
    value_function[state] = 0

# greedy policy: pick the action that maximizes the one-step lookahead
policy = {}
for state in reachable_state_set:
    q_max_value = -np.inf
    for action in action_set:
        q_value_temp = sum(
            prob * (reward + gamma * value_function[next_state])
            for prob, next_state, reward, terminated in env.unwrapped.P[state][action]
        )
        if q_value_temp > q_max_value:
            q_max_value = q_value_temp
            policy[state] = action
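# actions in CliffWalking-v0 are encoded as 0 = move up, 1 = move right,
# 2 = move down, 3 = move left, which is how the printed policy below should be read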
# print the value function as a 4 x 12 grid, one digit after the decimal point
print("value function = ")
for i in range(4):
    for j in range(12):
        if i * 12 + j in value_function:
            print("{:.1f}".format(value_function[i * 12 + j]), end="\t")
        else:
            print("x", end="\t")
    print()

print("optimal policy = ")
for i in range(4):
    for j in range(12):
        if i * 12 + j in policy:
            print(policy[i * 12 + j], end="\t")
        else:
            print("x", end="\t")
    print()

# save the LP model for inspection
model.write("model.lp")
Summary#
Markov decision processes (MDPs) are used to formulate sequential decision-making problems.
An MDP is specified by the tuple \((\mathcal{S}, \mathcal{A}, p, r, \gamma)\).
The Bellman optimality equation is a system of nonlinear equations that can be solved to find the optimal value function.
Linear programming can be used to find the optimal value function.