
What is the difference between value iteration and policy iteration?

itboxs 2020. 11. 20. 08:45



What is the difference between policy iteration and value iteration in reinforcement learning?

As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.

My doubt is: if in PI you are selecting a random policy π, how is the optimal policy guaranteed, even if we are choosing several random policies?


Let's look at them side by side. The key parts for comparison are highlighted. The figures are from Sutton and Barto's book, Reinforcement Learning: An Introduction.

[Figure: the policy iteration and value iteration algorithm boxes from Sutton and Barto, shown side by side]

Key points:

  1. Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until policy converges.
  2. Value iteration includes: finding optimal value function + one policy extraction. There is no repeat of the two because once the value function is optimal, then the policy out of it should also be optimal (i.e. converged).
  3. Finding optimal value function can also be seen as a combination of policy improvement (due to max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
  4. The algorithms for policy evaluation and finding optimal value function are highly similar except for a max operation (as highlighted); the two update rules are written out below the list.
  5. Similarly, the key step to policy improvement and policy extraction are identical except the former involves a stability check.
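
To make points 3 and 4 concrete, here are the two sweep updates written out in Sutton and Barto's notation (a sketch; v is the current value estimate and p(s', r | s, a) the model). The only structural difference is the max over actions:

```latex
% Policy evaluation sweep (inside policy iteration), for a fixed policy \pi:
v(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr]

% Value iteration sweep (truncated evaluation combined with improvement via max):
v(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr]
```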

In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.

I guess the confusion mainly came from all these somewhat similar terms, which also confused me before.


In policy iteration algorithms, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.
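
As a rough sketch of that loop (not code from the answer), here is tabular policy iteration on a small made-up MDP; the transition tensor P, reward matrix R, discount gamma, and the state/action counts are all illustrative placeholders:

```python
import numpy as np

# Hypothetical tabular MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # each P[s, a, :] sums to 1
R = rng.normal(size=(n_states, n_actions))

def policy_evaluation(policy, tol=1e-8):
    """Repeatedly apply the Bellman operator for a fixed deterministic policy."""
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v                        # q[s, a] = R[s, a] + gamma * E[v(s')]
        v_new = q[np.arange(n_states), policy]       # follow the given policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_iteration():
    policy = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        v = policy_evaluation(policy)                # policy evaluation
        new_policy = np.argmax(R + gamma * P @ v, axis=1)  # policy improvement (greedy)
        if np.array_equal(new_policy, policy):       # stable policy -> optimal
            return policy, v
        policy = new_policy

print(policy_iteration())
```

Each outer iteration strictly improves the policy until the greedy policy stops changing.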

In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function. This process is based on the optimality Bellman operator.
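
And a matching value-iteration sketch under the same illustrative setup (redeclared so the snippet stands alone): the loop applies the optimality Bellman operator, and the policy is extracted once at the end:

```python
import numpy as np

# Same kind of hypothetical MDP as in the policy-iteration sketch.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def value_iteration(tol=1e-8):
    v = np.zeros(n_states)                           # start from an arbitrary value function
    while True:
        q = R + gamma * P @ v                        # one-step lookahead for every (s, a)
        v_new = q.max(axis=1)                        # optimality Bellman operator (note the max)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = np.argmax(R + gamma * P @ v_new, axis=1)  # one policy extraction at the end
    return policy, v_new

print(value_iteration())
```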

In some sense, both algorithms share the same working principle, and they can be seen as two cases of generalized policy iteration. However, the optimality Bellman operator contains a max operator, which is nonlinear and therefore has different properties. In addition, it's possible to use hybrid methods between pure value iteration and pure policy iteration, as sketched below.
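
One common hybrid of this kind is modified policy iteration, where the evaluation step is truncated to a fixed number of sweeps instead of being run to convergence. A hedged sketch, reusing the same illustrative MDP setup (k, the sweep count, is a made-up knob):

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def modified_policy_iteration(k=5, tol=1e-8):
    """Small k behaves roughly like value iteration; large k approaches policy iteration."""
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        for _ in range(k):                           # truncated policy evaluation: only k sweeps
            v = (R + gamma * P @ v)[np.arange(n_states), policy]
        q = R + gamma * P @ v
        new_policy = np.argmax(q, axis=1)            # greedy policy improvement
        if np.array_equal(new_policy, policy) and np.max(np.abs(q.max(axis=1) - v)) < tol:
            return policy, v
        policy = new_policy

print(modified_policy_iteration())
```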


The basic difference is:

In Policy Iteration, you randomly select a policy and find the value function corresponding to it, then find a new (improved) policy based on the previous value function, and so on; this leads to the optimal policy.

In Value Iteration, you randomly select a value function, then find a new (improved) value function in an iterative process until you reach the optimal value function, and then derive the optimal policy from that optimal value function.

Policy iteration works on the principle of "policy evaluation → policy improvement".

Value iteration works on the principle of "optimal value function → optimal policy".


As far as I am concerned, contrary to @zyxue's idea, VI is generally much faster than PI.

The reason is straightforward: as you already know, the Bellman equation is used to solve the value function for a given policy. Since we can solve for the optimal value function directly, solving the value function for the current policy is obviously a waste of time.
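
To illustrate why evaluating the current policy is extra work: for a fixed policy the Bellman equation is linear, so the value function can even be solved in closed form, and policy iteration has to do this (or an iterative approximation of it) once per improvement step. A small sketch with the same illustrative MDP (the fixed `policy` below is arbitrary):

```python
import numpy as np

# For a fixed policy pi, the Bellman equation v = r_pi + gamma * P_pi v is linear in v.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))
policy = np.zeros(n_states, dtype=int)               # some arbitrary fixed deterministic policy

P_pi = P[np.arange(n_states), policy]                # (S, S) transition matrix under the policy
r_pi = R[np.arange(n_states), policy]                # (S,) expected rewards under the policy
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)   # v_pi = (I - gamma P_pi)^-1 r_pi
print(v_pi)
```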

As for your question about the convergence of PI, I think you might be overlooking the fact that if you improve the strategy for each information state, then you improve the strategy for the whole game. This is also easy to prove if you are familiar with Counterfactual Regret Minimization: the sum of the regrets over the information states forms an upper bound on the overall regret, so minimizing the regret for each state minimizes the overall regret, which leads to the optimal policy.

Source URL: https://stackoverflow.com/questions/37370015/what-is-the-difference-between-value-iteration-and-policy-iteration
