RewardPrediction

class RewardPrediction(replay_buffer: pandemonium.experience.buffers.SkewedER, feature, output_dim: int = 3, sequence_size: int = 3, **kwargs)

Bases: pandemonium.demons.demon.ParametricDemon

A demon that maximizes un-discounted \(n\)-step return.

Learns the sign (+, 0, -) of the reward at the end of a state sequence. Used as an auxiliary task in the UNREAL architecture.

References

Reinforcement learning with unsupervised auxiliary tasks (Jaderberg et al., 2016)
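
For intuition, here is a minimal, self-contained PyTorch sketch of the underlying idea: a small head classifies the sign of the reward that follows a fixed-length sequence of feature vectors. The names (RewardSignHead, feature_dim) are illustrative assumptions and are not part of the pandemonium API.

import torch
import torch.nn as nn

class RewardSignHead(nn.Module):
    """Illustrative head: classify the sign (-, 0, +) of the upcoming reward
    from a concatenation of `sequence_size` feature vectors."""

    def __init__(self, feature_dim: int, sequence_size: int = 3, output_dim: int = 3):
        super().__init__()
        self.classifier = nn.Linear(feature_dim * sequence_size, output_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, sequence_size, feature_dim) -> flatten the sequence dimension
        return self.classifier(features.flatten(start_dim=1))

head = RewardSignHead(feature_dim=64, sequence_size=3)
logits = head(torch.randn(8, 3, 64))  # (8, 3) class scores for -, 0, +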

Methods Summary

delta(self, trajectory)

Specifies the update rule for the approximate value function (avf)

learn(self, transitions)

target(trajectory)

Ternary classification target for -, 0, + rewards

Methods Documentation

delta(self, trajectory: pandemonium.experience.experience.Trajectory) → Tuple[Optional[torch.Tensor], dict]

Specifies the update rule for the approximate value function (avf)

Depending on whether the algorithm is online or offline, the demon learns either from a single Transition or from a Trajectory of experiences.
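
A minimal sketch of what such an update could compute for this demon, assuming the prediction is scored with a cross-entropy loss against the ternary sign target; the helper below is hypothetical and mirrors only the (loss, info-dict) return shape of delta.

import torch
import torch.nn.functional as F

def delta_sketch(logits: torch.Tensor, target: torch.Tensor):
    # logits: (batch, 3) class scores; target: (batch,) class indices in {0, 1, 2}
    loss = F.cross_entropy(logits, target)
    return loss, {'loss': loss.item()}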

learn(self, transitions: List[Transition])

static target(trajectory: pandemonium.experience.experience.Trajectory)

Ternary classification target for -, 0, + rewards
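
As an illustration only, one way such a ternary target could be derived from the last reward of each sequence; the exact class ordering used by the library is not specified here, so the mapping below (0 for zero, 1 for positive, 2 for negative) is an assumption.

import torch

def ternary_target_sketch(last_reward: torch.Tensor) -> torch.Tensor:
    # Map the final reward of each sequence to a class index
    # (assumed ordering: 0 -> zero, 1 -> positive, 2 -> negative).
    target = torch.zeros_like(last_reward, dtype=torch.long)
    target[last_reward > 0] = 1
    target[last_reward < 0] = 2
    return target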