DQN

class DQN(feature: callable, behavior_policy: pandemonium.policies.policy.Policy, replay_buffer: pandemonium.experience.buffers.ReplayBuffer, aqf: callable = None, avf: callable = None, target_update_freq: int = 0, warm_up_period: int = None, num_atoms: int = 1, v_min: float = None, v_max: float = None, duelling: bool = False, double: bool = False, **kwargs)

Bases: pandemonium.demons.control.OfflineTDControl, pandemonium.demons.control.QLearning, pandemonium.demons.offline_td.TTD, pandemonium.demons.demon.ParametricDemon, pandemonium.experience.buffers.ReplayBufferMixin, pandemonium.networks.target_network.TargetNetMixin, pandemonium.demons.control.CategoricalQ, pandemonium.demons.control.DuellingMixin

Deep Q-Network with all the bells and whistles mixed in.

References

“Rainbow: Combining Improvements in Deep Reinforcement Learning” by Hessel et al.

https://arxiv.org/pdf/1710.02298.pdf
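
A minimal construction sketch follows. Only the keyword names come from the signature above; the feature extractor, behavior policy, replay buffer, and Q-head are placeholders assumed to be built with the library's own constructors, and the numeric values are illustrative.

    # Hypothetical wiring; assumes DQN has been imported from its module in pandemonium.
    def build_dqn_demon(feature_extractor, egreedy_policy, buffer, q_head):
        return DQN(
            feature=feature_extractor,       # callable: observations -> feature vectors
            behavior_policy=egreedy_policy,  # pandemonium.policies.policy.Policy instance
            replay_buffer=buffer,            # pandemonium.experience.buffers.ReplayBuffer
            aqf=q_head,                      # callable: features -> action values
            target_update_freq=200,          # how often to sync the target network (units assumed to be update steps)
            double=True,                     # use Double Q-learning targets
            duelling=True,                   # use the duelling value / advantage decomposition
        )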

Methods Summary

learn(self, transitions)

q_t(self, trajectory)

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

q_tm1(self, x, a)

Computes the action values associated with the action batch a.

Methods Documentation

learn(self, transitions: List[Transition])
q_t(self, trajectory: pandemonium.experience.experience.Trajectory)

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

Algorithms differ in the way \(\hat{a}\) is chosen.

\[\begin{split}\begin{align*} \text{Q-learning} &: \hat{a} = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q(s_{t+1}, a) \\ \text{SARSA} &: \hat{a} = \mu(s_{t+1}) \end{align*}\end{split}\]
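
A standalone sketch of the difference, written in plain PyTorch rather than the library's internals; the tensor shapes are assumptions, and the constructor's double flag presumably switches to the Double Q-learning form shown last.

    import torch

    def q_learning_target(q_next: torch.Tensor) -> torch.Tensor:
        """Q-learning: evaluate the greedy action, max_a Q(s_{t+1}, a).
        q_next holds (batch, |A|) action values for s_{t+1}."""
        return q_next.max(dim=1).values

    def sarsa_target(q_next: torch.Tensor, a_next: torch.Tensor) -> torch.Tensor:
        """SARSA: use the action actually chosen by the policy, a_hat = mu(s_{t+1})."""
        return q_next.gather(1, a_next.unsqueeze(1)).squeeze(1)

    def double_q_target(q_next_online: torch.Tensor,
                        q_next_target: torch.Tensor) -> torch.Tensor:
        """Double Q-learning: select a_hat with the online network,
        then evaluate it with the target network."""
        a_hat = q_next_online.argmax(dim=1, keepdim=True)
        return q_next_target.gather(1, a_hat).squeeze(1)
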
q_tm1(self, x, a)

Computes the action values associated with the action batch a.

Overridden by distributional agents, which use azf instead of aqf to evaluate actions.

Returns

A batch of action values \(Q(s_{t-1}, a_{t-1})\).
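
A minimal sketch of the non-distributional case, assuming aqf maps a feature batch to a (batch, |A|) tensor of action values; the distributional override via azf is not shown.

    import torch

    def q_tm1_sketch(aqf, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        """Pick Q(s_{t-1}, a_{t-1}) for each transition in the batch.
        x: (batch, feature_dim) features for s_{t-1}
        a: (batch,) integer indices of the actions taken"""
        q_all = aqf(x)                                     # (batch, |A|)
        return q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # (batch,)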