RewardPrediction

class RewardPrediction(replay_buffer: pandemonium.experience.buffers.SkewedER, feature, output_dim: int = 3, sequence_size: int = 3, **kwargs)

Bases: pandemonium.demons.demon.ParametricDemon
A demon that maximizes the undiscounted \(n\)-step return.

Learns the sign (+, 0, -) of the reward at the end of a state sequence. Used as an auxiliary task in the UNREAL architecture.
References

Reinforcement Learning with Unsupervised Auxiliary Tasks (Jaderberg et al., 2016)
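The task reduces to three-way classification over short state sequences. Below is a minimal architectural sketch, assuming a stand-in encoder and linear head; feature_net, head, and obs_dim are illustrative names, not pandemonium internals:

    import torch
    import torch.nn as nn

    obs_dim, feature_dim = 16, 64
    sequence_size, output_dim = 3, 3

    # Stand-in for the shared feature extractor passed in as `feature`.
    feature_net = nn.Linear(obs_dim, feature_dim)
    # 3-way classifier over the concatenated features of the sequence.
    head = nn.Linear(sequence_size * feature_dim, output_dim)

    # Encode a sequence of `sequence_size` states, concatenate the features,
    # and produce logits that are trained against the ternary target (see target() below).
    states = torch.randn(sequence_size, obs_dim)
    logits = head(feature_net(states).flatten())  # shape: (output_dim,)

The sequences themselves are drawn from the SkewedER replay buffer which, following the UNREAL paper, oversamples rewarding sequences so that the classifier is not dominated by the zero-reward class.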
Methods Summary
delta(self, trajectory)
    Specifies the update rule for the approximate value function (avf).

learn(self, transitions)

target(trajectory)
    Ternary classification target for -, 0, + rewards.
Methods Documentation
delta(self, trajectory: pandemonium.experience.experience.Trajectory) → Tuple[Optional[torch.Tensor], dict]
    Specifies the update rule for the approximate value function (avf).

    Depending on whether the algorithm is online or offline, the demon learns from a single Transition or from a Trajectory of experiences.
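    Since this demon learns offline from replayed sequences, a call might look like the following sketch; the construction of `demon` and `trajectory`, and what ends up in the returned info dict, are assumptions for illustration:

        # `demon` is an instantiated RewardPrediction and `trajectory` is a
        # Trajectory of `sequence_size` transitions sampled from the replay buffer.
        loss, info = demon.delta(trajectory)
        if loss is not None:
            loss.backward()  # gradients flow back into the shared feature extractor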
learn(self, transitions: List[Transition])
static target(trajectory: pandemonium.experience.experience.Trajectory)
    Ternary classification target for -, 0, + rewards.
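    A plausible sketch of how such a target can be formed from the final reward of a sequence; the mapping of sign to class index (here 0 for zero, 1 for positive, 2 for negative) is an assumption and may differ from the library's actual ordering:

        import torch

        def ternary_reward_target(rewards: torch.Tensor) -> torch.Tensor:
            """Map the last reward of a sequence to a class index in {0, 1, 2}."""
            r = rewards[-1].item()
            label = 1 if r > 0 else 2 if r < 0 else 0  # assumed ordering: zero, +, -
            return torch.tensor([label])

    The returned index can then be compared against the 3-way logits of the prediction head with torch.nn.functional.cross_entropy inside delta.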