TDControl

class TDControl(**kwargs)

Bases: pandemonium.demons.demon.ParametricDemon, pandemonium.demons.demon.ControlDemon

Methods Summary

behavior_policy(self, x)

Specifies the behavior of the agent.

q_t(self, exp: Union['Transition', 'Trajectory'])

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

q_tm1(self, x, a)

Computes the action values associated with the batch of actions a.

target(self, exp: Union['Transition', 'Trajectory'])

Methods Documentation

behavior_policy(self, x: torch.Tensor)

Specifies the behavior of the agent.

\[\mu: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]

The distribution across all possible motor commands of the agent could be specified in this way.
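
For intuition, a common concrete choice of behavior policy in TD control is epsilon-greedy over the demon's action values. The following sketch is illustrative only and not the library's implementation; the helper name epsilon_greedy and the assumption that x has already been mapped to a (batch, n_actions) tensor of action values are hypothetical:

    import torch

    def epsilon_greedy(q_values: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
        # q_values: (batch, n_actions) action values for the current states
        batch, n_actions = q_values.shape
        greedy = q_values.argmax(dim=1)                # exploit: greedy action
        uniform = torch.randint(n_actions, (batch,))   # explore: uniform random action
        explore = torch.rand(batch) < epsilon          # Bernoulli(epsilon) exploration mask
        return torch.where(explore, uniform, greedy)   # batch of sampled actions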

q_t(self, exp: Union['Transition', 'Trajectory'])

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

Algorithms differ in the way \(\hat{a}\) is chosen.

\[\begin{split}\begin{align*} \text{Q-learning} &: \hat{a} = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q(s_{t+1}, a) \\ \text{SARSA} &: \hat{a} = \mu(s_{t+1}) \end{align*}\end{split}\]
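
As a minimal sketch of the two choices of \(\hat{a}\) (not the library's code), assume q_next is a (batch, n_actions) tensor of bootstrap values \(Q(s_{t+1}, \cdot)\); the variable names below are hypothetical:

    import torch

    q_next = torch.randn(4, 3)                      # Q(s_{t+1}, .), shape (batch, n_actions)

    # Q-learning: evaluate the greedy action, i.e. the argmax over the action dimension
    q_t_qlearning = q_next.max(dim=1).values

    # SARSA: evaluate the action actually selected by the behavior policy mu(s_{t+1})
    a_hat = torch.randint(3, (4,))                  # stand-in for mu(s_{t+1})
    q_t_sarsa = q_next.gather(1, a_hat.unsqueeze(1)).squeeze(1)
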
q_tm1(self, x, a)

Computes the action values associated with the batch of actions a.

Overridden by distributional agents, which use azf to evaluate actions instead of aqf.

Returns

A batch of action values \(Q(s_{t-1}, a_{t-1})\).
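
A plausible realisation with aqf outputs is a gather along the action dimension. The sketch below assumes a (batch, n_actions) tensor of action values and a (batch,) tensor of action indices; it illustrates the selection step only and is not the actual method body:

    import torch

    def select_q_tm1(q_all: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # q_all: (batch, n_actions) action values from aqf for the states x
        # a:     (batch,) indices of the actions taken at time t-1
        return q_all.gather(1, a.long().unsqueeze(1)).squeeze(1)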

target(self, exp: Union['Transition', 'Trajectory'])
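
The body of target is not documented here. For one-step TD control the target is typically the reward plus the discounted bootstrap value from q_t; the sketch below is a generic illustration under that assumption, with hypothetical names r, gamma and q_next:

    import torch

    def one_step_td_target(r: torch.Tensor, gamma: torch.Tensor,
                           q_next: torch.Tensor) -> torch.Tensor:
        # r:      (batch,) rewards r_{t+1}
        # gamma:  (batch,) discount factors, zero at terminal transitions
        # q_next: (batch,) bootstrap values Q(s_{t+1}, a_hat) computed by q_t
        return r + gamma * q_next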