7 Exercises
Exercise 7.1 In \(\epsilon\)-greedy action selection, for the case of two actions and \(\epsilon = 0.5\), what is the probability that the greedy action is selected?
Solution 7.1. Let \(k\) be the number of actions. With probability \(1 - \epsilon\) we exploit and select the greedy action; with probability \(\epsilon\) we select uniformly at random among all \(k\) actions, which lands on the greedy action with probability \(\frac{\epsilon}{k}\). The total probability that the greedy action is selected is therefore \(1 - \epsilon + \frac{\epsilon}{k}\). In this case, \(k = 2\) and \(\epsilon = 0.5\), so the probability is \(1 - 0.5 + \frac{0.5}{2} = 0.75\).
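As a quick sanity check, here is a minimal Monte Carlo sketch (action 0 standing in for the greedy action; the function name and trial count are arbitrary illustrative choices) that estimates this probability:

```python
import random

def greedy_selection_rate(epsilon: float, k: int, trials: int = 1_000_000) -> float:
    """Estimate the probability that epsilon-greedy selects the greedy action.

    Assumes action 0 is the unique greedy action.
    """
    hits = 0
    for _ in range(trials):
        if random.random() < epsilon:
            action = random.randrange(k)  # explore: uniform over all k actions
        else:
            action = 0                    # exploit: take the greedy action
        hits += action == 0
    return hits / trials

print(greedy_selection_rate(epsilon=0.5, k=2))  # ~0.75 = 1 - 0.5 + 0.5/2
```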
Exercise 7.2 (Bandit example) Consider a \(k\)-armed bandit problem with \(k = 4\) actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using \(\epsilon\)-greedy action selection, sample-average action-value estimates, and initial estimates of \(Q_1(a) = 0\), for all \(a\). Suppose the initial sequence of actions and rewards is \(A_1 = 1\), \(R_1 = -1\), \(A_2 = 2\), \(R_2 = 1\), \(A_3 = 2\), \(R_3 = -2\), \(A_4 = 2\), \(R_4 = 2\), \(A_5 = 3\), \(R_5 = 0\). On some of these time steps the \(\epsilon\) case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
Solution 7.2. The table below tracks the sample-average estimates (shown after each step's update) together with the set of greedy actions just before each selection.
| Time step | Greedy actions before selection | Action | Reward | \(Q(1)\) | \(Q(2)\) | \(Q(3)\) | \(Q(4)\) |
|---|---|---|---|---|---|---|---|
| 1 | 1, 2, 3, 4 | 1 | -1 | -1 | 0 | 0 | 0 |
| 2 | 2, 3, 4 | 2 | 1 | -1 | 1 | 0 | 0 |
| 3 | 2 | 2 | -2 | -1 | -0.5 | 0 | 0 |
| 4 | 3, 4 | 2 | 2 | -1 | 0.33 | 0 | 0 |
| 5 | 2 | 3 | 0 | -1 | 0.33 | 0 | 0 |
At time step 4 the greedy actions were 3 and 4, yet action 2 was selected; at time step 5 the greedy action was 2, yet action 3 was selected. The \(\epsilon\) case therefore definitely occurred on time steps 4 and 5.
On time steps 1, 2, and 3 the selected action was in the greedy set, so either case is consistent with the data. Since a random selection can also land on a greedy action, the \(\epsilon\) case could possibly have occurred on every time step.
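A minimal sketch (with the episode hard-coded) that replays this sequence under incremental sample averages and flags the steps where the chosen action falls outside the greedy set:

```python
# Replay the given action-reward sequence with sample-average estimates
# and flag the steps where the chosen action was definitely not greedy.
episode = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]  # (A_t, R_t), actions 1..4

q = {a: 0.0 for a in range(1, 5)}  # initial estimates Q_1(a) = 0
n = {a: 0 for a in range(1, 5)}    # action counts

for t, (action, reward) in enumerate(episode, start=1):
    best = max(q.values())
    greedy = {a for a, v in q.items() if v == best}
    verdict = "definitely random" if action not in greedy else "possibly random"
    print(f"t={t}: greedy set {sorted(greedy)}, chose {action} -> {verdict}")
    n[action] += 1
    q[action] += (reward - q[action]) / n[action]  # incremental sample average
```

Running this prints "definitely random" exactly at time steps 4 and 5, matching the table.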
Exercise 7.3 In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
Solution 7.3. In the long run, the method with \(\epsilon = 0.01\) will perform best in terms of both cumulative reward and probability of selecting the best action; the \(\epsilon = 0.1\) method improves faster early on, but the \(\epsilon = 0.01\) method eventually overtakes it.
Quantitatively, as \(t \to \infty\), both \(\epsilon\)-greedy methods sample every action infinitely often, so \(Q_t(a)\) converges to \(q_*(a)\) for all \(a\). The probability of selecting the best action then converges to \(1 - \epsilon + \frac{\epsilon}{10}\), which is \(0.991\) for \(\epsilon = 0.01\) and \(0.91\) for \(\epsilon = 0.1\).
For cumulative reward, once the estimates have converged the expected reward per step is \((1 - \epsilon)\, q_*(a^*) + \frac{\epsilon}{10} \sum_a q_*(a)\), where \(a^*\) is the optimal action. On the 10-armed testbed the \(q_*(a)\) are drawn from a standard normal distribution, so the average over all ten actions is 0 in expectation, while the expected value of the best action is roughly 1.54 (the expected maximum of ten standard normal draws). That gives roughly \(0.99 \times 1.54 \approx 1.52\) reward per step for \(\epsilon = 0.01\) versus \(0.90 \times 1.54 \approx 1.39\) for \(\epsilon = 0.1\), so the cumulative-reward advantage of \(\epsilon = 0.01\) grows linearly with \(t\).
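As an empirical check, here is a minimal simulation sketch of the 10-armed testbed (assuming the standard setup, \(q_*(a) \sim N(0, 1)\) with rewards \(\sim N(q_*(a), 1)\)); the run length, seed, and function name are illustrative choices, not taken from the figure:

```python
import numpy as np

def run_bandit(epsilon: float, steps: int = 100_000, k: int = 10, seed: int = 0):
    """One long epsilon-greedy run with sample averages on the 10-armed testbed.

    Returns (fraction of optimal selections, average reward per step).
    """
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, k)      # true action values, one per arm
    best = int(np.argmax(q_star))
    q = np.zeros(k)                        # sample-average estimates
    n = np.zeros(k, dtype=int)             # action counts
    optimal, total_reward = 0, 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))       # explore: uniform over arms
        else:
            a = int(np.argmax(q))          # exploit (ties broken by lowest index)
        r = rng.normal(q_star[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]          # incremental sample average
        optimal += a == best
        total_reward += r
    return optimal / steps, total_reward / steps

for eps in (0.1, 0.01):
    frac, avg = run_bandit(eps)
    print(f"epsilon={eps}: optimal-action rate {frac:.3f}, avg reward {avg:.3f}")
```

Over a long enough run the optimal-action rates approach the 0.91 and 0.991 figures derived above.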