ControlDemon

class ControlDemon(aqf: Callable, avf: Callable = None, **kwargs)

Bases: pandemonium.demons.demon.Demon
Learns the optimal policy while learning to predict.
Can be thought of as an accumulator of procedural knowledge.
In addition to the approximate value function (avf), it has an approximate Q-value function (aqf) that produces value estimates for state-action pairs.
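As a rough illustration of what the aqf and avf callables are expected to produce, the sketch below defines them as small torch modules that map a batch of feature vectors to action-values and a state-value respectively. The feature dimension and action count are placeholder assumptions, not part of this API.

    import torch
    import torch.nn as nn

    FEATURE_DIM, N_ACTIONS = 64, 4            # hypothetical sizes

    # aqf: produces one value estimate per state-action pair.
    aqf = nn.Linear(FEATURE_DIM, N_ACTIONS)
    # avf: produces a single state-value estimate.
    avf = nn.Linear(FEATURE_DIM, 1)

    x = torch.randn(32, FEATURE_DIM)          # batch of feature vectors
    q_values = aqf(x)                          # shape (32, N_ACTIONS)
    state_values = avf(x)                      # shape (32, 1)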
Methods Summary
behavior_policy(self, x)         Specifies the behavior of the agent.
implied_avf(self, x)             State-value function in terms of the action-value function.
predict_adv(self, x)             Computes the advantage in a given state.
predict_q(self, x)               Computes action-values in a given state.
predict_target_adv(self, x)      Computes the target advantage in a given state.
predict_target_q(self, x)        Computes target action-values in a given state.
Methods Documentation
behavior_policy(self, x: torch.Tensor)
Specifies the behavior of the agent:

\[\mu: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]

The distribution over all possible motor commands of the agent can be specified in this way.
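For intuition only: one common way to derive such a behavior policy from action-values is a softmax (Boltzmann) distribution. The sketch below assumes aqf returns a batch of Q-values; it is not the actual implementation of this method.

    import torch
    import torch.nn.functional as F

    def softmax_behavior_policy(x, aqf, temperature=1.0):
        q = aqf(x)                                   # (batch, n_actions) action-values
        probs = F.softmax(q / temperature, dim=-1)   # mu(a|s) in [0, 1], rows sum to 1
        return torch.distributions.Categorical(probs=probs)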
implied_avf(self, x)
State-value function in terms of the action-value function:

\[V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, Q^{\pi}(s, a)\]

Overridden in the duelling architecture by an independent estimator.

TODO: does not apply to continuous action spaces.
TODO: handle predictions made to compute targets via target_aqf.
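The identity above can be evaluated directly whenever the policy probabilities and the Q-values are available as tensors; a minimal sketch, assuming both have shape (batch, n_actions):

    import torch

    def implied_state_value(pi_probs: torch.Tensor, q_values: torch.Tensor) -> torch.Tensor:
        # V(s) = sum_a pi(a|s) * Q(s, a)
        return (pi_probs * q_values).sum(dim=-1)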
predict_adv(self, x)
Computes the advantage in a given state.
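Conventionally the advantage is A(s, a) = Q(s, a) - V(s); the sketch below illustrates that relationship under the assumption that action-values and a state-value estimate have already been computed.

    import torch

    def advantage(q_values: torch.Tensor, state_value: torch.Tensor) -> torch.Tensor:
        # q_values: (batch, n_actions), state_value: (batch, 1)
        # Positive entries mark actions that look better than the state average.
        return q_values - state_value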
predict_q(self, x)
Computes action-values in a given state.
predict_target_adv(self, x)
Computes the target advantage in a given state.
predict_target_q(self, x)
Computes target action-values in a given state.
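Target estimates are typically produced by a periodically synchronised copy of the online network (the target_aqf mentioned in the TODO above). The sketch below shows that common pattern and is an assumption, not this class's actual implementation.

    import copy
    import torch
    import torch.nn as nn

    aqf = nn.Linear(64, 4)                # hypothetical online Q-network
    target_aqf = copy.deepcopy(aqf)       # frozen copy used for bootstrapping targets

    def predict_target_q_sketch(x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():             # targets are not differentiated through
            return target_aqf(x)

    def sync_target():
        # Refresh the copy every few updates to track the online network.
        target_aqf.load_state_dict(aqf.state_dict())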