GVF

class GVF(target_policy: pandemonium.policies.policy.Policy, continuation: pandemonium.continuations.ContinuationFunction, cumulant: pandemonium.cumulants.Cumulant)

Bases: object

General Value Function

Consider a stream of data \(\{ (x_t, A_t) \}^{\infty}_{t=0}\), produced by agent-environment interaction. Here, \(x\) is a tensor of experience (see pandemonium.experience.Transition) and \(A\) is an action from a finite action space \(\mathcal{A}\).

The target \(G\) is a summary of the future value of the cumulant \(Z\), discounted according to the termination function \(\gamma\):

\[G_t = Z_{t+1} + \sum_{k=t+1}^{\infty} \Big( \prod_{\tau=t+1}^{k} \gamma_{\tau} \Big) Z_{k+1}\]
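
As a sanity check, this target can be computed directly for a finite trajectory. Below is a minimal sketch (the helper gvf_return is hypothetical, not part of pandemonium), where cumulants[k] holds \(Z_{t+1+k}\) and continuations[k] holds \(\gamma_{t+1+k}\):

    import torch

    def gvf_return(cumulants: torch.Tensor,
                   continuations: torch.Tensor) -> torch.Tensor:
        """Hypothetical helper: the target G_t for a finite trajectory."""
        g = torch.tensor(0.)
        discount = torch.tensor(1.)  # running product of per-step gammas
        for z, gamma in zip(cumulants, continuations):
            g = g + discount * z         # add the discounted cumulant
            discount = discount * gamma  # extend the product of gammas
        return g

    # Z = (1, 2, 3) with constant gamma = 0.5:
    # G_t = 1 + 0.5 * 2 + 0.5 * 0.5 * 3 = 2.75
    print(gvf_return(torch.tensor([1., 2., 3.]), torch.tensor([.5, .5, .5])))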

A GVF estimates the expected value of the target, given that actions are generated according to the target policy:

\[\mathbb{E}_{\pi} \left[ G_t \mid S_t = s \right]\]

To make things more concrete, keep in mind the example of predicting a robot’s light sensor as it drives around a room. We will refer to this example throughout the definitions in this abstract class.
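
For the light-sensor example, the three questions that define the GVF could look roughly as follows. This is a sketch with hypothetical stand-ins; the actual Policy, ContinuationFunction and Cumulant interfaces in pandemonium may differ:

    import torch
    from torch.distributions import Categorical, Distribution

    # Hypothetical stand-ins for the three GVF questions.

    def cumulant(s: torch.Tensor) -> torch.Tensor:
        # z(s): the current light-sensor reading, assumed to be feature 0
        return s[0]

    def continuation(s: torch.Tensor) -> torch.Tensor:
        # gamma(s): keep accumulating with probability 0.9 in every state
        return torch.tensor(0.9)

    def target_policy(s: torch.Tensor) -> Distribution:
        # pi(.|s): drive around uniformly at random over 4 actions
        return Categorical(probs=torch.full((4,), 0.25))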

Note

The value produced is not necessarily scalar; e.g., when estimating an action-value function \(Q\), we get a vector with one entry per possible action.

Methods Summary

continuation(self, s)

Outputs the continuation signal based on the agent’s observation.

cumulant(self, s)

The signal whose discounted future values are accumulated into the target.

target_policy(self, s)

The policy whose value we would like to learn.

Methods Documentation

continuation(self, s)

Outputs the continuation signal based on the agent’s observation.

\[\gamma: \mathcal{S} \mapsto [0, 1]\]

Notice that this differs from the constant discount factor \(\gamma\) of a classic MDP: here the termination is allowed to be state-dependent.
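
For instance, a state-dependent termination might cut the return off when the robot docks at its charging station. A sketch, assuming (hypothetically) that the last feature of the state flags docking:

    import torch

    def continuation(s: torch.Tensor) -> torch.Tensor:
        # gamma(s) in [0, 1]: terminate (gamma = 0) once the hypothetical
        # "docked" flag is set; otherwise continue with gamma = 0.95
        return torch.where(s[-1] > 0, torch.tensor(0.), torch.tensor(0.95))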

cumulant(self, s)

The signal whose discounted future values are accumulated into the target.

\[z: \mathcal{S} \mapsto \mathbb{R}\]

For example, this could be the current light-sensor reading of the robot.
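
A sketch of such a cumulant, assuming (hypothetically) that the light-sensor reading sits at index 0 of the state vector:

    import torch

    LIGHT_SENSOR_IDX = 0  # hypothetical position of the reading in the state

    def cumulant(s: torch.Tensor) -> torch.Tensor:
        # z(s): the instantaneous light-sensor reading in state s
        return s[LIGHT_SENSOR_IDX]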

target_policy(self, s) → torch.distributions.distribution.Distribution

The policy whose value we would like to learn.

\[\pi: \mathcal{S} \times \mathcal{A} \mapsto [0, 1]\]
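
Since the annotated return type is a torch Distribution, a concrete implementation could return, e.g., a Categorical over the action space. A sketch (the uniform probabilities and the action count are assumptions, not pandemonium defaults):

    import torch
    from torch.distributions import Categorical, Distribution

    N_ACTIONS = 4  # hypothetical size of the action space

    def target_policy(s: torch.Tensor) -> Distribution:
        # pi(.|s): uniform over actions, purely for illustration;
        # pi(a|s) is then available via .probs or .log_prob(a)
        return Categorical(probs=torch.full((N_ACTIONS,), 1 / N_ACTIONS))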