TTD

class TTD(trace_decay: float, **kwargs)
Bases: pandemonium.demons.offline_td.OfflineTD
Truncated \(\TD(\lambda)\)
Notes
Generalizes \(n\)-step \(\TD\) by allowing arbitrary mixing of \(n\)-step returns via the \(\lambda\) parameter.
Depending on the algorithm, the vector v contains different bootstrapped value estimates:
- \(\TD(\lambda)\) (forward view): state value estimates \(V_t(s)\)
- \(\Q(\lambda)\): action value estimates \(\max\limits_{a}Q_t(s_t, a)\)
- \(\SARSA(\lambda)\): action value estimates \(Q_t(s_t, a_t)\)
- \(\text{CategoricalQ}\): atom values of the distribution
The resulting vector u contains the target return for each state along the trajectory: \(V(S_i)\), for \(i \in \{0, 1, \dots, n-1\}\), is updated towards the \([n, n-1, \dots, 1]\)-step \(\lambda\)-return, respectively (see the sketch below).
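A minimal sketch of this backward recursion, in the spirit of the rlax reference listed under References; the function name, signature, and the alignment of r, gamma and v with the trajectory are assumptions for illustration, not the pandemonium API:

import torch

def truncated_lambda_returns(r: torch.Tensor,
                             gamma: torch.Tensor,
                             v: torch.Tensor,
                             trace_decay: float) -> torch.Tensor:
    # Hypothetical helper (not the library's API). r, gamma and v are
    # 1-D tensors of length n holding, for each step t of the trajectory,
    # the reward, the discount, and the bootstrapped estimate used for the
    # one-step backup at that step (V, max_a Q, Q(s, a), ... as listed above).
    u = torch.empty_like(v)
    g = v[-1]  # initialise the recursion with the final bootstrap estimate
    for t in reversed(range(v.shape[0])):
        # u_t = r_t + gamma_t * ((1 - lambda) * v_t + lambda * u_{t+1})
        g = r[t] + gamma[t] * ((1 - trace_decay) * v[t] + trace_decay * g)
        u[t] = g
    return u

With this recursion the last state in the trajectory receives a plain one-step target, and each earlier state mixes progressively longer \(n\)-step returns, matching the \([n, n-1, \dots, 1]\)-step description above.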
References
- Sutton and Barto (2018) ch. 12.3, 12.8, equation (12.18)
- van Seijen (2016) Appendix B, https://arxiv.org/pdf/1608.05151v1.pdf
- https://github.com/deepmind/rlax/blob/master/rlax/_src/multistep.py#L33
Methods Summary
target(self, trajectory, v)
Computes discounted returns for each step in the trajectory.
Methods Documentation
target(self, trajectory: pandemonium.experience.experience.Trajectory, v: torch.Tensor)
Computes discounted returns for each step in the trajectory.
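As a hedged illustration of the per-algorithm bootstrap vectors listed in the Notes (all tensor names and shapes here are hypothetical stand-ins, and the exact alignment with the trajectory depends on the implementation):

import torch

n, num_actions = 5, 3
q = torch.randn(n, num_actions)      # stand-in for Q_t(s_t, .) along the trajectory
state_values = torch.randn(n)        # stand-in for V_t(s)
actions = torch.randint(num_actions, (n,))

v_td = state_values                                      # TD(lambda): V_t(s)
v_q = q.max(dim=1).values                                # Q(lambda): max_a Q_t(s_t, a)
v_sarsa = q.gather(1, actions.unsqueeze(1)).squeeze(1)   # SARSA(lambda): Q_t(s_t, a_t)

Any one of these could then serve as the v argument to target, with the trajectory supplying the rewards and discounts.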