DQN
class DQN(feature: callable, behavior_policy: pandemonium.policies.policy.Policy, replay_buffer: pandemonium.experience.buffers.ReplayBuffer, aqf: callable = None, avf: callable = None, target_update_freq: int = 0, warm_up_period: int = None, num_atoms: int = 1, v_min: float = None, v_max: float = None, duelling: bool = False, double: bool = False, **kwargs)

Bases: pandemonium.demons.control.OfflineTDControl, pandemonium.demons.control.QLearning, pandemonium.demons.offline_td.TTD, pandemonium.demons.demon.ParametricDemon, pandemonium.experience.buffers.ReplayBufferMixin, pandemonium.networks.target_network.TargetNetMixin, pandemonium.demons.control.CategoricalQ, pandemonium.demons.control.DuellingMixin
Deep Q-Network with all the bells and whistles mixed in.
References
- “Rainbow: Combining Improvements in Deep Reinforcement Learning” by Hessel et al.
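The mixed-in bases supply the familiar DQN ingredients: experience replay (ReplayBufferMixin), a periodically synced target network (TargetNetMixin), the bootstrapped Q-learning target (QLearning), and the categorical and duelling extensions (CategoricalQ, DuellingMixin). For orientation only, the following is a plain PyTorch sketch of the update these pieces collectively realise; it is not pandemonium's implementation, and online_net, target_net and buffer.sample are hypothetical stand-ins:

    # Illustrative only: a generic DQN update with replay sampling, a
    # bootstrapped Q-learning target and a target network. All objects here
    # are hypothetical stand-ins, not pandemonium's API.
    import torch
    import torch.nn.functional as F

    def dqn_update(online_net, target_net, buffer, optimizer,
                   batch_size: int = 32, gamma: float = 0.99) -> float:
        s, a, r, s_next, done = buffer.sample(batch_size)  # hypothetical buffer API

        # Q(s_{t-1}, a_{t-1}) for the actions that were actually taken (cf. q_tm1)
        q_tm1 = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

        # Q-learning target: bootstrap from the greedy next-state value (cf. q_t)
        with torch.no_grad():
            q_t = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done.float()) * q_t

        loss = F.smooth_l1_loss(q_tm1, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()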
Methods Summary

- learn(self, transitions)
- q_t(self, trajectory): Computes action-value targets \(Q(s_{t+1}, \hat{a})\).
- q_tm1(self, x, a): Computes the values associated with an action batch a.
Methods Documentation
- learn(self, transitions: List[Transition])
- q_t(self, trajectory: pandemonium.experience.experience.Trajectory)

  Computes action-value targets \(Q(s_{t+1}, \hat{a})\). Algorithms differ in the way \(\hat{a}\) is chosen.

  \[\begin{aligned} \text{Q-learning} &: \hat{a} = \arg\max_{a \in \mathcal{A}} Q(s_{t+1}, a) \\ \text{SARSA} &: \hat{a} = \mu(s_{t+1}) \end{aligned}\]
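The double flag in the constructor corresponds to Double Q-learning, where the greedy action is selected with the online network but evaluated with the target network. A minimal sketch of the three selection rules, assuming plain tensors of per-action values rather than pandemonium objects:

    # Illustrative only: how the choice of a_hat differs across algorithms when
    # forming the target Q(s_{t+1}, a_hat). Inputs are hypothetical tensors.
    import torch

    def q_learning_target(q_next: torch.Tensor) -> torch.Tensor:
        # a_hat = argmax_a Q(s_{t+1}, a): evaluate the greedy action
        return q_next.max(dim=1).values

    def sarsa_target(q_next: torch.Tensor, a_next: torch.Tensor) -> torch.Tensor:
        # a_hat = mu(s_{t+1}): evaluate the action the behaviour policy actually took
        return q_next.gather(1, a_next.long().unsqueeze(1)).squeeze(1)

    def double_q_target(q_next_online: torch.Tensor,
                        q_next_target: torch.Tensor) -> torch.Tensor:
        # double=True: select a_hat with the online network, evaluate it with
        # the target network to reduce overestimation bias
        a_hat = q_next_online.argmax(dim=1, keepdim=True)
        return q_next_target.gather(1, a_hat).squeeze(1)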
- q_tm1(self, x, a)

  Computes the values associated with an action batch a. Overridden by distributional agents, which use azf instead of aqf to evaluate actions.

  Returns: A batch of action values \(Q(s_{t-1}, a_{t-1})\).
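A minimal sketch of this evaluation, assuming aqf is any callable that maps a feature batch x to a tensor of per-action values; the distributional case (num_atoms > 1) would instead return a distribution over atoms per action:

    # Minimal sketch: evaluate a batch of stored actions with the action-value
    # function. aqf is a stand-in callable producing a [batch, num_actions]
    # tensor, not necessarily pandemonium's exact interface.
    import torch

    def q_tm1(aqf, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        q = aqf(x)                                            # [batch, num_actions]
        return q.gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s_{t-1}, a_{t-1})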