QLearning

class QLearning(**kwargs)

Bases: pandemonium.demons.control.TDControl

Classic Q-learning update rule.

Notes

Q-learning can be interpreted as an off-policy version of \(\SARSA\). Since the target policy \(\pi\) in canonical Q-learning is greedy with respect to the GVF, we have the following equality:

\[\max\limits_{a \in \mathcal{A}} Q(S_{t+1}, a) = \sum_{a \in \mathcal{A}} \pi(a|S_{t+1})\,Q(S_{t+1}, a)\]
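The equality holds because a greedy \(\pi\) assigns all probability mass to a maximizing action (assuming ties are broken in favour of a single action \(a^*\)):

\[\begin{split}\begin{align*}
\pi(a|S_{t+1}) &= \mathbb{1}\{a = a^*\}, \qquad a^* = \argmax_{a \in \mathcal{A}} Q(S_{t+1}, a) \\
\sum_{a \in \mathcal{A}} \pi(a|S_{t+1})\,Q(S_{t+1}, a) &= Q(S_{t+1}, a^*) = \max\limits_{a \in \mathcal{A}} Q(S_{t+1}, a)
\end{align*}\end{split}\]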

In this case the target Q-value estimator would be:

@torch.no_grad()
def q_t(self, exp: Experience):
    # Action-values at the next state from the target network
    q = self.target_aqf(exp.x1)
    # Target policy distribution over actions at the next state
    dist = self.gvf.π.dist(exp.x1, q_fn=self.aqf)
    # Expectation of Q under π: batched dot product over the action dimension
    return torch.einsum('ba,ba->b', q, dist.probs)

This update is not actually used here, since taking a max directly is more efficient than computing the policy weights and taking a dot product.
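For reference, a minimal sketch of the max-based target, assuming the same target_aqf attribute and Experience interface as in the snippet above (an illustration, not the library's exact implementation):

@torch.no_grad()
def q_t(self, exp: Experience):
    # Greedy form of the same target: since π is greedy wrt Q,
    # the expectation over π collapses to a max over actions.
    q = self.target_aqf(exp.x1)   # (batch, |A|) action-values at s_{t+1}
    return q.max(dim=1).values    # (batch,) max_a Q(s_{t+1}, a)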

TODO: integrate

- online: duelling
- offline: duelling traces

Methods Summary

q_t(self, exp: Union['Transition', 'Trajectory'])

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

Methods Documentation

q_t(self, exp: Union['Transition', 'Trajectory'])

Computes action-value targets \(Q(s_{t+1}, \hat{a})\).

Algorithms differ in the way \(\hat{a}\) is chosen.

\[\begin{split}\begin{align*}
\text{Q-learning} &: \hat{a} = \argmax_{a \in \mathcal{A}} Q(s_{t+1}, a) \\
\SARSA &: \hat{a} = \mu(s_{t+1})
\end{align*}\end{split}\]
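For illustration, the two choices of \(\hat{a}\) translate into the following target computations, assuming q1 holds the action-values \(Q(s_{t+1}, \cdot)\) with shape (batch, |A|) and a1 holds the integer actions chosen by the behaviour policy \(\mu\) at \(s_{t+1}\) (both names are hypothetical, not part of the library API):

import torch

def q_learning_target(q1: torch.Tensor) -> torch.Tensor:
    # â = argmax_a Q(s_{t+1}, a): bootstrap from the maximal action-value
    return q1.max(dim=1).values

def sarsa_target(q1: torch.Tensor, a1: torch.Tensor) -> torch.Tensor:
    # â = μ(s_{t+1}): bootstrap from the value of the action actually taken
    # (a1 must be an integer tensor of shape (batch,))
    return q1.gather(1, a1.unsqueeze(1)).squeeze(1)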