TTD

class TTD(trace_decay: float, **kwargs)

Bases: pandemonium.demons.offline_td.OfflineTD

Truncated \(\TD(\lambda)\)

Notes

Generalizes \(n\)-step \(\TD\) by allowing arbitrary mixing of \(n\)-step returns via the \(\lambda\) parameter.
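
Concretely, writing \(G_{t:t+n}\) for the \(n\)-step return, the \(\lambda\)-return truncated at horizon \(h\) mixes all returns available within the truncation window (see Sutton and Barto, ch. 12.3):

\[
G_{t:h}^{\lambda} = (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}.
\]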

Depending on the algorithm, the vector v contains different bootstrapped value estimates:

  • \(\TD(\lambda)\) (forward view): state value estimates \(V_t(s)\)

  • \(\Q(\lambda)\): action value estimates \(\max\limits_{a}Q_t(s_t, a)\)

  • \(\SARSA(\lambda)\): action value estimates \(Q_t(s_t, a_t)\)

  • \(\text{CategoricalQ}\): atom values of the distribution

The resulting vector u contains the target return for each state along the trajectory, with \(V(S_i)\) for \(i \in \{0, 1, \dots, n-1\}\) being updated towards the \([n, n-1, \dots, 1]\)-step \(\lambda\)-returns, respectively.
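
The sketch below illustrates one way such targets can be computed with a backward recursion over the trajectory, in the spirit of the rlax reference linked under References. It is a minimal illustration, not the actual implementation of this class; the function name and argument layout are hypothetical.

    import torch


    def truncated_lambda_returns(r: torch.Tensor,
                                 gamma: torch.Tensor,
                                 v: torch.Tensor,
                                 trace_decay: float) -> torch.Tensor:
        """Hypothetical sketch: backward recursion for truncated lambda-returns.

        r, gamma and v are 1-D tensors of length n holding, for step t, the
        reward r_{t+1}, the discount gamma_{t+1} and a bootstrapped value of
        the next state (e.g. V(s_{t+1}) for TD(lambda), max_a Q(s_{t+1}, a)
        for Q(lambda)).  Returns the vector u of lambda-return targets.
        """
        u = torch.empty_like(v)
        g = v[-1]  # the final step bootstraps fully on its value estimate
        for t in reversed(range(r.shape[0])):
            # mix the one-step bootstrap with the longer return built so far
            g = r[t] + gamma[t] * ((1 - trace_decay) * v[t] + trace_decay * g)
            u[t] = g
        return u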

References

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, ch. 12.3, 12.8, eq. (12.18). http://incompleteideas.net/book/the-book.html

van Seijen (2016), Appendix B. https://arxiv.org/pdf/1608.05151v1.pdf

rlax reference implementation: https://github.com/deepmind/rlax/blob/master/rlax/_src/multistep.py#L33

Methods Summary

target(self, trajectory, v)

Computes the truncated \(\lambda\)-return target for each step in the trajectory.

Methods Documentation

target(self, trajectory: pandemonium.experience.experience.Trajectory, v: torch.Tensor)

Computes the truncated \(\lambda\)-return target for each step in the trajectory.
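
For intuition, applying the hypothetical truncated_lambda_returns sketch from the Notes above to a toy three-step trajectory (the reward, discount and value tensors below are made up for illustration) gives:

    r = torch.tensor([1.0, 0.0, 0.0])      # rewards r_1, r_2, r_3
    gamma = torch.tensor([0.9, 0.9, 0.9])  # per-step discounts
    v = torch.tensor([0.5, 0.5, 0.5])      # bootstrapped next-state values
    u = truncated_lambda_returns(r, gamma, v, trace_decay=0.8)
    # u ≈ tensor([1.3881, 0.4140, 0.4500]); the first state is updated
    # towards a 3-step lambda-return, the last towards a 1-step return.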