RewardPrediction

class RewardPrediction(replay_buffer: pandemonium.experience.buffers.SkewedER, feature, output_dim: int = 3, sequence_size: int = 3, **kwargs)

Bases: pandemonium.demons.demon.ParametricDemon

A demon that maximizes un-discounted \(n\)-step return.

Learns the sign (+, 0, -) of the reward at the end of a state sequence. Used as an auxiliary task in the UNREAL architecture.

References

Reinforcement learning with unsupervised auxiliary tasks (Jaderberg et al., 2016)
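
For intuition, here is a minimal, self-contained PyTorch sketch of the underlying idea: a small head classifies the sign of the reward that follows a fixed-length sequence of feature vectors. The names (RewardSignHead, feature_dim) are illustrative assumptions and are not part of the pandemonium API.

import torch
import torch.nn as nn

class RewardSignHead(nn.Module):
    """Illustrative head: classify the sign (-, 0, +) of the upcoming reward
    from a concatenation of `sequence_size` feature vectors."""

    def __init__(self, feature_dim: int, sequence_size: int = 3, output_dim: int = 3):
        super().__init__()
        self.classifier = nn.Linear(feature_dim * sequence_size, output_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, sequence_size, feature_dim) -> flatten the sequence dimension
        return self.classifier(features.flatten(start_dim=1))

head = RewardSignHead(feature_dim=64, sequence_size=3)
logits = head(torch.randn(8, 3, 64))  # (8, 3) class scores for -, 0, +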

Methods Summary

delta(self, trajectory)

Specifies the update rule for the approximate value function (avf)

learn(self, transitions)

target(trajectory)

Ternary classification target for -, 0, + rewards

Methods Documentation

delta(self, trajectory: pandemonium.experience.experience.Trajectory) → Tuple[Optional[torch.Tensor], dict]

Specifies the update rule for the approximate value function (avf)

Depending on whether the algorithm is online or offline, the demon learns either from a single Transition or from a Trajectory of experiences.
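
A minimal sketch of what such an update could compute for this demon, assuming the prediction is scored with a cross-entropy loss against the ternary sign target; the helper below is hypothetical and mirrors only the (loss, info-dict) return shape of delta.

import torch
import torch.nn.functional as F

def delta_sketch(logits: torch.Tensor, target: torch.Tensor):
    # logits: (batch, 3) class scores; target: (batch,) class indices in {0, 1, 2}
    loss = F.cross_entropy(logits, target)
    return loss, {'loss': loss.item()}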

learn(self, transitions: List[Transition])

static target(trajectory: pandemonium.experience.experience.Trajectory)

Ternary classification target for -, 0, + rewards
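
As an illustration only, one way such a ternary target could be derived from the last reward of each sequence; the exact class ordering used by the library is not specified here, so the mapping below (0 for zero, 1 for positive, 2 for negative) is an assumption.

import torch

def ternary_target_sketch(last_reward: torch.Tensor) -> torch.Tensor:
    # Map the final reward of each sequence to a class index
    # (assumed ordering: 0 -> zero, 1 -> positive, 2 -> negative).
    target = torch.zeros_like(last_reward, dtype=torch.long)
    target[last_reward > 0] = 1
    target[last_reward < 0] = 2
    return target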