18 Iteration
Value Iteration¶
Considering finite horizons: the time-limited value of a state s with a time limit of k timesteps, denoted \(U_k(s)\), is the maximum expected utility attainable from s if the episode ends after k timesteps.
Value iteration is a dynamic programming algorithm that computes time-limited values for iteratively longer time limits until convergence1, operating as follows:
- \(\forall s \in S\), initialize \(U_{0}(s)=0\), since no actions can be taken to acquire rewards.
- Repeat the following update rule until convergence (B is called the Bellman operator):
$$ \forall s\in S,\quad U_{k+1}(s)\leftarrow\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U_{k}(s^{\prime})\right] \quad (\text{or } U_{k+1} \leftarrow B U_{k}) $$
When convergence is reached, the Bellman equation will hold for every state: \(\forall s \in S, U_{k+1}(s)=U_{k}(s) = U^*(s)\).
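A minimal Python sketch of this loop may help; it is illustrative, not part of the original notes. The containers are assumptions: transitions are taken to be stored as `T[(s, a)] = [(s2, p), ...]` and rewards as `R[(s, a, s2)]`, with a tolerance `tol` standing in for exact convergence.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """Sketch of value iteration.

    Assumed (hypothetical) containers:
      T[(s, a)]     -> list of (s2, p) successor/probability pairs
      R[(s, a, s2)] -> immediate reward
    """
    U = {s: 0.0 for s in states}  # U_0(s) = 0: no time left, no reward
    while True:
        # Bellman update: best action's expected one-step reward
        # plus discounted value-to-go.
        U_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new  # converged: U_{k+1} = U_k = U*
        U = U_new
```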
Q-value Iteration¶
Q-value iteration is a dynamic programming algorithm that computes time-limited Q-values. Starting from \(Q_0(s,a)=0\), it repeats the following update until convergence:
$$ Q_{k+1}(s,a)\leftarrow\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma \max_{a^{\prime}} Q_{k}(s^{\prime},a^{\prime})\right] $$
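The same kind of sketch for Q-value iteration, under the identical assumed `T`/`R` containers (again illustrative, not the notes' own code):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """Sketch of Q-value iteration; same assumed T/R layout as above."""
    Q = {(s, a): 0.0 for s in states for a in actions}  # Q_0 = 0
    while True:
        Q_new = {
            (s, a): sum(
                # expected reward plus discounted max over the *next* action
                p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)]
            )
            for s in states
            for a in actions
        }
        if max(abs(Q_new[k] - Q[k]) for k in Q) < tol:
            return Q_new  # converged Q-values; a policy follows by argmax over a
        Q = Q_new
```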
Policy Iteration¶
policy extraction¶
Given the optimal values \(U^*\), policy extraction recovers an optimal policy by a one-step lookahead, choosing the action that maximizes expected utility:
$$ \pi^{*}(s)=\arg\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U^{*}(s^{\prime})\right] $$
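A compact sketch of this lookahead (same assumed containers as above; `U` is a converged value table):

```python
def extract_policy(states, actions, T, R, U, gamma=0.9):
    """Greedy one-step lookahead on converged values U (sketch)."""
    return {
        s: max(
            actions,
            key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in T[(s, a)]
            ),
        )
        for s in states
    }
```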
policy evaluation¶
For a policy \(\pi\), policy evaluation means computing \(U^{\pi}(s)\) for all states s, where \(U^{\pi}(s)\) is the expected utility of starting in state s and following \(\pi\):
$$ U^{\pi}(s)=\sum_{s^{\prime}}T(s,\pi(s),s^{\prime})\left[R(s,\pi(s),s^{\prime})+\gamma U^{\pi}(s^{\prime})\right] $$
Because the action is fixed by \(\pi\), there is no max: these equations are linear in the \(U^{\pi}(s)\) and can be solved exactly, or approximated by repeated updates as in value iteration.
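The fixed-point iteration is the simpler of the two to sketch (same assumed containers; `pi` is a dict mapping each state to its action):

```python
def policy_evaluation(states, pi, T, R, gamma=0.9, tol=1e-8):
    """Iteratively evaluate a fixed policy pi (sketch)."""
    U = {s: 0.0 for s in states}
    while True:
        # Like the Bellman update, but the action is fixed to pi(s): no max.
        U_new = {
            s: sum(
                p * (R[(s, pi[s], s2)] + gamma * U[s2])
                for s2, p in T[(s, pi[s])]
            )
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new
```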
policy improvement¶
Once we’ve evaluated the current policy, use policy improvement to generate a better policy.
Define the policy at iteration i of policy iteration as \(\pi_{i}\). First, evaluate it by computing \(U^{\pi_i}\):
$$ U^{\pi_{i}}(s)=\sum_{s^{\prime}}T(s,\pi_{i}(s),s^{\prime})\left[R(s,\pi_{i}(s),s^{\prime})+\gamma U^{\pi_{i}}(s^{\prime})\right] $$
Finally, improve the policy with one-step lookahead on \(U^{\pi_i}\):
$$ \pi_{i+1}(s)=\arg\max_{a}\sum_{s^{\prime}}T(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma U^{\pi_{i}}(s^{\prime})\right] $$
Policy iteration terminates when \(\pi_{i+1}=\pi_{i}\), at which point the policy is optimal.
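Putting evaluation and improvement together, a sketch of the full loop, reusing the `policy_evaluation` and `extract_policy` sketches above:

```python
def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable (sketch)."""
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        U = policy_evaluation(states, pi, T, R, gamma)             # evaluate pi_i
        pi_new = extract_policy(states, actions, T, R, U, gamma)   # improve
        if pi_new == pi:  # pi_{i+1} == pi_i: converged, policy is optimal
            return pi, U
        pi = pi_new
```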
summary¶
Value iteration performs a max over actions at every state on each sweep, while policy iteration alternates a cheaper, max-free evaluation step with policy improvement and typically needs fewer iterations; both converge to the optimal values and policy.
1. convergence means that \(\forall s \in S, U_{k+1}(s) = U_{k}(s)\).
