ControlDemon

class ControlDemon(aqf: Callable, avf: Callable = None, **kwargs)

Bases: pandemonium.demons.demon.Demon

Learns the optimal policy while learning to predict its value.

Can be thought of as an accumulator of procedural knowledge.

In addition to the approximate value function (avf), it has an approximate Q-value function (aqf) that produces value estimates for state-action pairs.
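For illustration, here is a minimal sketch of what an aqf callable might look like for a discrete action space. The network architecture, the feature size (64), and the number of actions (4) are assumptions made purely for this example; any remaining Demon keyword arguments are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a feature vector to one value per action.
# The layer sizes are illustrative only.
aqf = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 4),
)

x = torch.randn(1, 64)   # a batch with a single feature vector
q_values = aqf(x)        # shape (1, 4): one estimate per action
```

A callable like this could be passed as the aqf argument of the constructor above.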

Methods Summary

behavior_policy(self, x)

Specifies the behavior of the agent

implied_avf(self, x)

State-value function in terms of action-value function

predict_adv(self, x)

Computes the advantage in a given state.

predict_q(self, x)

Computes action-values in a given state.

predict_target_adv(self, x)

Computes the target advantage in a given state.

predict_target_q(self, x)

Computes target action-values in a given state.

Methods Documentation

behavior_policy(self, x: torch.Tensor)

Specifies the behavior of the agent

\[\mu: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]

The distribution over all possible motor commands of the agent can be specified in this way.
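As a concrete example, a subclass might implement an ε-greedy behavior policy on top of predict_q. This is a hypothetical sketch, not the library's default behavior; the EpsilonGreedyDemon name, the epsilon attribute, the import path, and the return type (a tensor of action probabilities) are all assumptions.

```python
import torch

from pandemonium.demons.control import ControlDemon  # import path assumed


class EpsilonGreedyDemon(ControlDemon):
    epsilon = 0.1  # exploration rate (assumed attribute)

    def behavior_policy(self, x: torch.Tensor) -> torch.Tensor:
        q = self.predict_q(x)                       # (batch, num_actions)
        num_actions = q.shape[-1]
        # Spread ε of the probability mass uniformly over all actions ...
        probs = torch.full_like(q, self.epsilon / num_actions)
        # ... and put the remaining 1 - ε on the greedy action.
        greedy = q.argmax(dim=-1, keepdim=True)
        probs.scatter_(-1, greedy, probs.gather(-1, greedy) + 1 - self.epsilon)
        return probs                                # rows sum to 1: μ(a|s)
```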

implied_avf(self, x)

State-value function in terms of action-value function

\[V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, Q^{\pi}(s, a)\]

It is overridden in the dueling architecture by an independent estimator.

TODO: does not apply to continuous action spaces.

TODO: handle predictions made to compute targets via target_aqf.
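A minimal sketch of the expectation above for a discrete action space, assuming the policy probabilities and the Q-values are available as tensors of shape (batch, num_actions). The free-standing implied_avf below is illustrative, not the library's implementation, and whether the policy here is the demon's target or behavior policy is not specified above.

```python
import torch

def implied_avf(pi: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Expected action-value under the policy: V(s) = sum_a pi(a|s) * Q(s, a)."""
    # pi: (batch, num_actions) action probabilities, rows summing to 1
    # q:  (batch, num_actions) action-value estimates
    return (pi * q).sum(dim=-1)  # (batch,)

# Example with the hypothetical shapes above:
pi = torch.tensor([[0.25, 0.25, 0.25, 0.25]])
q = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
v = implied_avf(pi, q)  # tensor([2.5000])
```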

predict_adv(self, x)

Computes the advantage in a given state.
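Presumably this follows the standard definition of the advantage, which relates the action-value and state-value estimates above:

\[A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\]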

predict_q(self, x)

Computes action-values in a given state.

predict_target_adv(self, x)

Computes the target advantage in a given state.

predict_target_q(self, x)

Computes target action-values in a given state.
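The target variants are typically evaluated with a separate, periodically synced copy of the estimator (the target_aqf mentioned in the TODO above). Below is a hedged sketch of how predict_target_q might be combined with a reward to form a one-step Q-learning target; the q_learning_target helper, the tensor shapes, and the discount value are assumptions, not part of the library's API.

```python
import torch

def q_learning_target(demon, x_next: torch.Tensor, reward: torch.Tensor,
                      gamma: float = 0.99) -> torch.Tensor:
    # demon: a ControlDemon instance (as documented above).
    # Bootstrap from the target estimator to stabilize learning:
    #   y = r + γ * max_a Q_target(s', a)
    with torch.no_grad():
        q_next = demon.predict_target_q(x_next)      # (batch, num_actions)
        return reward + gamma * q_next.max(dim=-1).values
```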