QLearning¶
class QLearning(**kwargs)¶
Bases: pandemonium.demons.control.TDControl
Classic Q-learning update rule.
Notes
Can be interpreted as an off-policy version of \(\SARSA\). Since the target policy \(\pi\) in canonical Q-learning is greedy with respect to the GVF, we have the following equality:
\[\max\limits_{a \in \mathcal{A}} Q(S_{t+1}, a) = \sum_{a \in \mathcal{A}} \pi(a|S_{t+1}) Q(S_{t+1}, a)\]
In this case the target Q-value estimator would be:
@torch.no_grad()
def q_t(self, exp: Experience):
    # Action-values at the next state, from the target network.
    q = self.target_aqf(exp.x1)
    # Distribution of the (greedy) target policy π over actions at the next state.
    dist = self.gvf.π.dist(exp.x1, q_fn=self.aqf)
    # Expected action-value under π: batched dot product over the action dimension.
    return torch.einsum('ba,ba->b', q, dist.probs)
We do not actually use this update here, since taking a max is more efficient than computing the policy weights and taking a dot product.
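For comparison, a minimal sketch of the max-based target alluded to above, reusing the target_aqf and exp.x1 names from the previous snippet (the exact form in the library may differ):

@torch.no_grad()
def q_t(self, exp: Experience):
    # Action-values at the next state, from the target network: shape (batch, |A|).
    q = self.target_aqf(exp.x1)
    # Greedy bootstrap: max over the action dimension, equal to the
    # expectation above whenever π is greedy with respect to Q.
    return q.max(dim=1).values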
- TODO: integrate
  - online: duelling
  - offline: duelling traces
Methods Summary
q_t(self, exp)
Computes action-value targets \(Q(s_{t+1}, \hat{a})\).
Methods Documentation
q_t(self, exp: Union['Transition', 'Trajectory'])¶
Computes action-value targets \(Q(s_{t+1}, \hat{a})\).
Algorithms differ in the way \(\hat{a}\) is chosen.
\[\begin{split}\begin{align*} \text{Q-learning} &: \hat{a} = \argmax_{a \in \mathcal{A}}Q(s_{t+1}, a) \\ \SARSA &: \hat{a} = \mu(s_{t+1}) \end{align*}\end{split}\]
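As an illustration, a small self-contained sketch of both target choices in PyTorch; the helper name td_target_action_value and the assumption that the behaviour action is available as an index tensor are illustrative, not part of the library API:

import torch

def td_target_action_value(q_next: torch.Tensor,
                           behaviour_action: torch.Tensor,
                           method: str = 'q-learning') -> torch.Tensor:
    # q_next: (batch, |A|) action-values Q(s_{t+1}, ·)
    # behaviour_action: (batch,) indices of the action a_{t+1} = μ(s_{t+1})
    if method == 'q-learning':
        # â = argmax over actions: bootstrap on the greedy action's value.
        return q_next.max(dim=1).values
    # SARSA: â = μ(s_{t+1}): bootstrap on the behaviour policy's action.
    return q_next.gather(1, behaviour_action.unsqueeze(1)).squeeze(1)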