TDControl
class TDControl(**kwargs)

    Bases: pandemonium.demons.demon.ParametricDemon, pandemonium.demons.demon.ControlDemon
Methods Summary
behavior_policy(self, x)
    Specifies the behavior of the agent.
q_t(self, exp)
    Computes action-value targets \(Q(s_{t+1}, \hat{a})\).
q_tm1(self, x, a)
    Computes the values associated with an action batch a.
target(self, exp)

Methods Documentation
behavior_policy(self, x: torch.Tensor)

    Specifies the behavior of the agent:

    \[\mu: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]

    The distribution over all possible motor commands of the agent can be specified in this way.
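    A minimal sketch of one such policy, an ε-greedy distribution over action values; the helper epsilon_greedy and the stand-in tensor of Q-values are illustrative assumptions, not part of the pandemonium API:

        import torch

        def epsilon_greedy(q: torch.Tensor, epsilon: float = 0.1) -> torch.distributions.Categorical:
            """Uniform over actions with probability epsilon, greedy otherwise."""
            n_actions = q.shape[-1]
            # Give every action epsilon / |A| probability mass, then put the
            # remaining 1 - epsilon onto the greedy action of each row.
            probs = torch.full_like(q, epsilon / n_actions)
            greedy = q.argmax(dim=-1, keepdim=True)
            probs.scatter_add_(-1, greedy, torch.full_like(greedy, 1 - epsilon, dtype=q.dtype))
            return torch.distributions.Categorical(probs=probs)

        mu = epsilon_greedy(torch.randn(4, 3))  # hypothetical Q-values, batch of 4
        actions = mu.sample()                   # one motor command per batch row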
q_t(self, exp: Union[Transition, Trajectory])

    Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

    Algorithms differ in the way \(\hat{a}\) is chosen:

    \[\begin{align*} \text{Q-learning} &: \hat{a} = \operatorname{argmax}_{a \in \mathcal{A}} Q(s_{t+1}, a) \\ \text{SARSA} &: \hat{a} \sim \mu(\cdot \mid s_{t+1}) \end{align*}\]
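    In plain torch (a sketch under assumed shapes, not the library's implementation), the two choices differ only in how the next-state value matrix is reduced:

        import torch

        q_next = torch.randn(4, 3)  # hypothetical Q(s_{t+1}, ·), batch of 4

        # Q-learning: bootstrap from the greedy action argmax_a Q(s_{t+1}, a).
        q_t_qlearning = q_next.max(dim=-1).values

        # SARSA: bootstrap from an action sampled from the behavior policy μ.
        mu = torch.distributions.Categorical(logits=q_next)  # stand-in for μ
        a_hat = mu.sample()
        q_t_sarsa = q_next.gather(-1, a_hat.unsqueeze(-1)).squeeze(-1)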
q_tm1(self, x, a)

    Computes the values associated with an action batch a.

    Overridden by distributional agents, which use azf instead of aqf to evaluate actions.

    Returns:
        A batch of action values \(Q(s_{t-1}, a_{t-1})\).
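    Illustratively (assumed shapes only, not the method's actual body), selecting a batch of \(Q(s_{t-1}, a_{t-1})\) from an aqf-style value head reduces to per-row indexing:

        import torch

        q = torch.randn(4, 3)           # hypothetical Q(s_{t-1}, ·), batch of 4
        a = torch.tensor([0, 2, 1, 2])  # actions a_{t-1} taken in each transition
        q_tm1 = q.gather(-1, a.unsqueeze(-1)).squeeze(-1)  # shape (4,)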
target(self, exp: Union[Transition, Trajectory])
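    The source provides no docstring for this method. For orientation only, the generic one-step TD control target combines the reward with the bootstrapped value from q_t; treating target as computing exactly this is an assumption:

        import torch

        r = torch.tensor([1.0, 0.0, 0.5, 0.0])     # hypothetical rewards r_{t+1}
        done = torch.tensor([0.0, 0.0, 0.0, 1.0])  # episode-termination mask
        gamma = 0.99                               # discount factor
        q_t_hat = torch.randn(4)                   # Q(s_{t+1}, â) from q_t
        td_target = r + gamma * (1 - done) * q_t_hat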