Hi,
I have a question regarding the implementation of the advantage calculation in `dreamerv2/dreamerv2/agent.py` (lines 252 to 274 at commit 07d906e). The code snippet is as follows:
```python
def actor_loss(self, seq, target):
  # Actions:      0   [a1]  [a2]   a3
  #                  ^  |  ^  |  ^  |
  #                 /   v /   v /   v
  # States:     [z0]->[z1]-> z2 -> z3
  # Targets:     t0   [t1]  [t2]
  # Baselines:  [v0]  [v1]   v2    v3
  # Entropies:        [e1]  [e2]
  # Weights:    [ 1]  [w1]   w2    w3
  # Loss:              l1    l2
  metrics = {}
  # Two states are lost at the end of the trajectory, one for the boostrap
  # value prediction and one because the corresponding action does not lead
  # anywhere anymore. One target is lost at the start of the trajectory
  # because the initial state comes from the replay buffer.
  policy = self.actor(tf.stop_gradient(seq['feat'][:-2]))
  if self.config.actor_grad == 'dynamics':
    objective = target[1:]
  elif self.config.actor_grad == 'reinforce':
    baseline = self._target_critic(seq['feat'][:-2]).mode()
    advantage = tf.stop_gradient(target[1:] - baseline)
    action = tf.stop_gradient(seq['action'][1:-1])
    objective = policy.log_prob(action) * advantage
```
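For context, this is how I understand the λ-return in `target` is produced. The sketch below is my own minimal illustration of the standard recursion, with made-up names (`lambda_return_sketch`, `rewards`, `values`), not the repository's actual `lambda_return` code:

```python
import numpy as np

# Minimal sketch of the standard lambda-return recursion, only to show why
# `target` ends up one step shorter than `seq['feat']` (my illustration, not the repo code).
def lambda_return_sketch(rewards, values, gamma=0.99, lam=0.95):
    # rewards: r_0 ... r_{H-1}     (H entries)
    # values:  v(z_0) ... v(z_H)   (H + 1 entries; v(z_H) is only used as the bootstrap)
    H = len(rewards)
    returns = np.zeros(H)
    last = values[-1]  # bootstrap from the final value prediction
    for t in reversed(range(H)):
        # V^lambda_t = r_t + gamma * ((1 - lam) * v(z_{t+1}) + lam * V^lambda_{t+1})
        last = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * last)
        returns[t] = last
    return returns  # V^lambda_0 ... V^lambda_{H-1}  (H entries, steps 0 to horizon-1)
```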
Based on my understanding:

- `seq['feat']` contains time steps from 0 to horizon.
- `target` contains time steps from 0 to horizon-1, since the value at the last step is only used as the bootstrap for `lambda_return`.
- Therefore, `baseline` in Line 271 covers time steps 0 to horizon-2, while `target[1:]` covers time steps 1 to horizon-1 (the index sketch below spells this out).

If I understand correctly, the code computes the advantage as $V_{t+1}^{\lambda} - v_\xi\left(\hat{z}_t\right)$, not $V_t^{\lambda} - v_{\xi}\left(\hat{z}_t\right)$ as stated in the paper?
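To make the pairing concrete, here is a hypothetical index-bookkeeping example (plain Python with an arbitrary horizon `H = 5`, not the repo code) showing which time steps end up subtracted from each other in the `reinforce` branch:

```python
H = 5                                    # arbitrary imagination horizon for illustration
feat_steps = list(range(H + 1))          # seq['feat'] covers states z_0 ... z_H
target_steps = list(range(H))            # target covers V^lambda_0 ... V^lambda_{H-1}

baseline_steps = feat_steps[:-2]         # feat[:-2]  -> z_0 ... z_{H-2}
return_steps = target_steps[1:]          # target[1:] -> V^lambda_1 ... V^lambda_{H-1}

# Element-wise pairing: each baseline v(z_t) is subtracted from V^lambda_{t+1}.
for ret_t, base_t in zip(return_steps, baseline_steps):
    print(f'advantage at step {base_t}: V^lambda_{ret_t} - v(z_{base_t})')
```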
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!