ControlDemon

class ControlDemon(aqf: Callable, avf: Callable = None, **kwargs)

Bases: pandemonium.demons.demon.Demon

Learns the optimal policy while learning to predict its value.

Can be thought of as an accumulator of procedural knowledge.

In addition to the approximate value function (avf), it has an approximate Q-value function (aqf) that produces value estimates for state-action pairs.
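For illustration, here is a minimal sketch of what an aqf callable might look like for a discrete action space. The network architecture, the feature size (64), and the number of actions (4) are assumptions made purely for this example; any remaining Demon keyword arguments are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a feature vector to one value per action.
# The layer sizes are illustrative only.
aqf = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 4),
)

x = torch.randn(1, 64)   # a batch with a single feature vector
q_values = aqf(x)        # shape (1, 4): one estimate per action
```

A callable like this could be passed as the aqf argument of the constructor above.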

Methods Summary

behavior_policy(self, x)

Specifies the behavior of the agent

implied_avf(self, x)

State-value function in terms of action-value function

predict_adv(self, x)

Computes the advantage in a given state.

predict_q(self, x)

Computes action-values in a given state.

predict_target_adv(self, x)

Computes the target advantage in a given state.

predict_target_q(self, x)

Computes target action-values in a given state.

Methods Documentation

behavior_policy(self, x: torch.Tensor)

Specifies the behavior of the agent

\[\mu: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]

The distribution over all possible motor commands of the agent can be specified in this way.
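As a concrete example, a subclass might implement an ε-greedy behavior policy on top of predict_q. This is a hypothetical sketch, not the library's default behavior; the EpsilonGreedyDemon name, the epsilon attribute, the import path, and the return type (a tensor of action probabilities) are all assumptions.

```python
import torch

from pandemonium.demons.control import ControlDemon  # import path assumed


class EpsilonGreedyDemon(ControlDemon):
    epsilon = 0.1  # exploration rate (assumed attribute)

    def behavior_policy(self, x: torch.Tensor) -> torch.Tensor:
        q = self.predict_q(x)                       # (batch, num_actions)
        num_actions = q.shape[-1]
        # Spread ε of the probability mass uniformly over all actions ...
        probs = torch.full_like(q, self.epsilon / num_actions)
        # ... and put the remaining 1 - ε on the greedy action.
        greedy = q.argmax(dim=-1, keepdim=True)
        probs.scatter_(-1, greedy, probs.gather(-1, greedy) + 1 - self.epsilon)
        return probs                                # rows sum to 1: μ(a|s)
```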

implied_avf(self, x)

State-value function in terms of action-value function

\[V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, Q^{\pi}(s, a)\]

It is overridden in the dueling architecture by an independent estimator.

TODO: does not apply to continuous action spaces.

TODO: handle predictions made to compute targets via target_aqf.
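A minimal sketch of the expectation above for a discrete action space, assuming the policy probabilities and the Q-values are available as tensors of shape (batch, num_actions). The free-standing implied_avf below is illustrative, not the library's implementation, and whether the policy here is the demon's target or behavior policy is not specified above.

```python
import torch

def implied_avf(pi: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Expected action-value under the policy: V(s) = sum_a pi(a|s) * Q(s, a)."""
    # pi: (batch, num_actions) action probabilities, rows summing to 1
    # q:  (batch, num_actions) action-value estimates
    return (pi * q).sum(dim=-1)  # (batch,)

# Example with the hypothetical shapes above:
pi = torch.tensor([[0.25, 0.25, 0.25, 0.25]])
q = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
v = implied_avf(pi, q)  # tensor([2.5000])
```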

predict_adv(self, x)

Computes the advantage in a given state.
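Presumably this follows the standard definition of the advantage, which relates the action-value and state-value estimates above:

\[A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\]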

predict_q(self, x)

Computes action-values in a given state.

predict_target_adv(self, x)

Computes the target advantage in a given state.

predict_target_q(self, x)

Computes target action-values in a given state.
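The target variants are typically evaluated with a separate, periodically synced copy of the estimator (the target_aqf mentioned in the TODO above). Below is a hedged sketch of how predict_target_q might be combined with a reward to form a one-step Q-learning target; the q_learning_target helper, the tensor shapes, and the discount value are assumptions, not part of the library's API.

```python
import torch

def q_learning_target(demon, x_next: torch.Tensor, reward: torch.Tensor,
                      gamma: float = 0.99) -> torch.Tensor:
    # demon: a ControlDemon instance (as documented above).
    # Bootstrap from the target estimator to stabilize learning:
    #   y = r + γ * max_a Q_target(s', a)
    with torch.no_grad():
        q_next = demon.predict_target_q(x_next)      # (batch, num_actions)
        return reward + gamma * q_next.max(dim=-1).values
```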