DQN

class DQN(feature: callable, behavior_policy: pandemonium.policies.policy.Policy, replay_buffer: pandemonium.experience.buffers.ReplayBuffer, aqf: callable = None, avf: callable = None, target_update_freq: int = 0, warm_up_period: int = None, num_atoms: int = 1, v_min: float = None, v_max: float = None, duelling: bool = False, double: bool = False, **kwargs)

Bases: pandemonium.demons.control.OfflineTDControl, pandemonium.demons.control.QLearning, pandemonium.demons.offline_td.TTD, pandemonium.demons.demon.ParametricDemon, pandemonium.experience.buffers.ReplayBufferMixin, pandemonium.networks.target_network.TargetNetMixin, pandemonium.demons.control.CategoricalQ, pandemonium.demons.control.DuellingMixin

Deep Q-Network with all the bells and whistles mixed in.

References

“Rainbow: Combining Improvements in Deep Reinforcement Learning” by Hessel et al.

https://arxiv.org/pdf/1710.02298.pdf
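
A minimal construction sketch follows. Only the keyword names come from the signature above; the feature extractor, behavior policy, replay buffer, and Q-head are placeholders assumed to be built with the library's own constructors, and the numeric values are illustrative.

    # Hypothetical wiring; assumes DQN has been imported from its module in pandemonium.
    def build_dqn_demon(feature_extractor, egreedy_policy, buffer, q_head):
        return DQN(
            feature=feature_extractor,       # callable: observations -> feature vectors
            behavior_policy=egreedy_policy,  # pandemonium.policies.policy.Policy instance
            replay_buffer=buffer,            # pandemonium.experience.buffers.ReplayBuffer
            aqf=q_head,                      # callable: features -> action values
            target_update_freq=200,          # how often to sync the target network (units assumed to be update steps)
            double=True,                     # use Double Q-learning targets
            duelling=True,                   # use the duelling value / advantage decomposition
        )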

Methods Summary

learn(self, transitions)

q_t(self, trajectory)

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

q_tm1(self, x, a)

Computes the action values associated with the action batch a.

Methods Documentation

learn(self, transitions: List[Transition])
q_t(self, trajectory: pandemonium.experience.experience.Trajectory)

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

Algorithms differ in the way \(\hat{a}\) is chosen.

\[\begin{split}\begin{align*} \text{Q-learning} &: \hat{a} = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q(s_{t+1}, a) \\ \text{SARSA} &: \hat{a} = \mu(s_{t+1}) \end{align*}\end{split}\]
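
A standalone sketch of the difference, written in plain PyTorch rather than the library's internals; the tensor shapes are assumptions, and the constructor's double flag presumably switches to the Double Q-learning form shown last.

    import torch

    def q_learning_target(q_next: torch.Tensor) -> torch.Tensor:
        """Q-learning: evaluate the greedy action, max_a Q(s_{t+1}, a).
        q_next holds (batch, |A|) action values for s_{t+1}."""
        return q_next.max(dim=1).values

    def sarsa_target(q_next: torch.Tensor, a_next: torch.Tensor) -> torch.Tensor:
        """SARSA: use the action actually chosen by the policy, a_hat = mu(s_{t+1})."""
        return q_next.gather(1, a_next.unsqueeze(1)).squeeze(1)

    def double_q_target(q_next_online: torch.Tensor,
                        q_next_target: torch.Tensor) -> torch.Tensor:
        """Double Q-learning: select a_hat with the online network,
        then evaluate it with the target network."""
        a_hat = q_next_online.argmax(dim=1, keepdim=True)
        return q_next_target.gather(1, a_hat).squeeze(1)
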
q_tm1(self, x, a)

Computes the action values associated with the action batch a.

Overridden by distributional agents, which use azf instead of aqf to evaluate actions.

Returns

A batch of action values \(Q(s_{t-1}, a_{t-1})\).
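
A minimal sketch of the non-distributional case, assuming aqf maps a feature batch to a (batch, |A|) tensor of action values; the distributional override via azf is not shown.

    import torch

    def q_tm1_sketch(aqf, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        """Pick Q(s_{t-1}, a_{t-1}) for each transition in the batch.
        x: (batch, feature_dim) features for s_{t-1}
        a: (batch,) integer indices of the actions taken"""
        q_all = aqf(x)                                     # (batch, |A|)
        return q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # (batch,)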