Demon

class Demon(gvf: pandemonium.gvf.GVF, avf: Callable, feature, behavior_policy: pandemonium.policies.policy.Policy, eligibility: Optional[pandemonium.traces.EligibilityTrace])

Bases: object

General Value Function Approximator

Each demon is an independent reinforcement learning agent responsible for learning one piece of predictive knowledge about the main agent’s interaction with its environment.

The demon learns an approximate value function \(\tilde{V}\) (avf) to the general value function (gvf) that corresponds to a particular setting of the three “question” functions: \(\pi\), \(\gamma\), and \(z\). The tools the demon uses to learn this approximation are called “answer” functions and consist of \(\mu\), \(\phi\), and \(\lambda\).
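
A minimal sketch of this split, using plain Python callables in place of the library's GVF, Policy, and EligibilityTrace objects (all names and values below are illustrative assumptions, not the library's API):

    import torch

    # "Question" functions -- define *what* to predict:
    pi = lambda s: torch.tensor([0.5, 0.5])      # target policy over two actions
    gamma = lambda s: 0.9                        # continuation (discount) function
    z = lambda s: float(s.sum())                 # cumulant: the signal to accumulate

    # "Answer" functions -- define *how* to learn the prediction:
    mu = lambda s: torch.tensor([0.5, 0.5])      # behavior policy collecting experience
    phi = lambda s: torch.as_tensor(s, dtype=torch.float32)  # feature map
    lam = lambda s: 0.8                          # eligibility trace-decay rate

    # Approximate value function V~ -- here, linear in the features:
    w = torch.zeros(4, requires_grad=True)
    avf = lambda x: w @ x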

gvf

General Value Function to be estimated by the demon

avf

Approximate Value Function learned by the demon to estimate the gvf

φ

Feature generator learning useful state representations

μ

Behavior policy that collects experience

λ

Eligibility trace assigning credit to experiences

Methods Summary

behavior_policy(self, s)

Specifies behavior of the agent

delta(self, experience: Union['Transition', 'Trajectory'])

Specifies the update rule for approximate value function (avf)

eligibility(self, s)

Specifies eligibility trace-decay rate

feature(self, *args, **kwargs)

A mapping from MDP states to features

learn(self, experience: Union['Transition', 'Trajectory'])

predict(self, x)

Predict the value (or value distribution) of the state

Methods Documentation

behavior_policy(self, s)

Specifies behavior of the agent

\[\mu: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]

In this way, a distribution over all of the agent's possible motor commands can be specified.
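
For example, a hypothetical behavior policy over a small discrete action set might return a softmax distribution over action preferences (a sketch only, not the library's Policy class):

    import torch

    def behavior_policy(s: torch.Tensor) -> torch.Tensor:
        """Return a probability for each of three motor commands in state s."""
        preferences = torch.tensor([1.0, 0.0, -1.0])  # state-independent for brevity
        return torch.softmax(preferences, dim=0)       # sums to 1 over the action set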

delta(self, experience: Union['Transition', 'Trajectory']) → Tuple[Optional[torch.Tensor], dict]

Specifies the update rule for approximate value function (avf)

Depending on whether the algorithm is online or offline, the demon learns from either a single Transition or a Trajectory of experiences.
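
As a rough illustration, an online TD(0)-style update rule for a linear avf could compute a squared TD error together with a diagnostics dict from a single transition (a sketch under assumed argument names; the actual delta depends on the concrete demon):

    import torch
    from typing import Optional, Tuple

    def delta(w: torch.Tensor,
              x: torch.Tensor, x_next: torch.Tensor,
              cumulant: float, continuation: float) -> Tuple[Optional[torch.Tensor], dict]:
        v, v_next = w @ x, w @ x_next                 # V~(s) and V~(s')
        td_error = cumulant + continuation * v_next.detach() - v
        loss = 0.5 * td_error ** 2                    # differentiable w.r.t. w
        return loss, {'td_error': float(td_error)}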

eligibility(self, s)

Specifies eligibility trace-decay rate

\[\lambda: \mathcal{S} \mapsto \mathbb{R}\]
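
In the simplest (hypothetical) case, the trace-decay rate is constant across states:

    def eligibility(s) -> float:
        return 0.9  # a fixed lambda; a state-dependent schedule is also valid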

feature(self, *args, **kwargs)

A mapping from MDP states to features

\[\phi: \mathcal{S} \mapsto \mathbb{R}^n\]

The feature tensor could be constructed from the robot’s external sensor readings (not just the ones corresponding to light).

We can use any representation learning module here.
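
A hypothetical feature generator, sketched as a small torch module that maps an 8-dimensional sensor reading to a 16-dimensional feature vector (the dimensions are illustrative assumptions):

    import torch.nn as nn

    feature = nn.Sequential(
        nn.Linear(8, 32),   # 8 raw sensor readings -> 32 hidden units
        nn.ReLU(),
        nn.Linear(32, 16),  # phi(s): a 16-dimensional feature vector
    )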

learn(self, experience: Union['Transition', 'Trajectory'])

predict(self, x)

Predict the value (or value distribution) of the state
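
Putting the pieces together, a single step of a hypothetical agent loop might look as follows (the demon, feature, obs, and transition objects are assumed to have been constructed elsewhere; this is a usage sketch, not library code):

    def demon_step(demon, feature, obs, transition):
        x = feature(obs)            # phi(s): featurize the raw observation
        value = demon.predict(x)    # V~(x): predicted return of the cumulant
        demon.learn(transition)     # update the avf from new experience
        return value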