TD2Q is a Q-learning type of reinforcement learning algorithm that corresponds to the basal ganglia, with two Q matrices: one representing direct pathway neurons (G) and another representing indirect pathway neurons (N). Unlike previous two-Q architectures, a novel and critical aspect of TD2Q is that it updates both the G and N matrices using the temporal difference (TD) reward prediction error. A candidate action is selected for each of G and N using a softmax with a reward-dependent adaptive exploration parameter, and disagreements between the two are resolved by a second selection step applied to the two action probabilities. The model is tested on a range of multi-step tasks, including extinction, renewal, and discrimination; reward-probability switching; and sequence learning. Simulations show that TD2Q produces behaviors similar to those of rodents in choice and sequence learning tasks, and that the temporal difference reward prediction error is required to learn multi-step tasks. Performance in the sequence learning task is dramatically improved by using two matrices.
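The two-stage selection and shared TD update described above can be illustrated with a minimal Python sketch. This is a hypothetical simplification, not the published TD2Q equations: the class name, parameter values, the opposite-sign update of N, and the product-of-probabilities resolution step are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta):
    """Softmax over action values with inverse temperature beta."""
    z = beta * (q - q.max())
    p = np.exp(z)
    return p / p.sum()

class TD2QSketch:
    """Illustrative two-matrix Q learner (names and update rules are assumptions)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, beta=2.0):
        self.G = np.zeros((n_states, n_actions))  # direct-pathway ("Go") values
        self.N = np.zeros((n_states, n_actions))  # indirect-pathway ("NoGo") values
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def choose(self, s):
        # Stage 1: softmax action probabilities from each matrix separately;
        # the indirect pathway is assumed to suppress actions (negated values).
        p_go = softmax(self.G[s], self.beta)
        p_nogo = softmax(-self.N[s], self.beta)
        # Stage 2: resolve differences by combining the two probability vectors
        # (a product rule here, purely for illustration).
        p = p_go * p_nogo
        p /= p.sum()
        return rng.choice(len(p), p=p)

    def update(self, s, a, r, s_next):
        # A single shared TD reward prediction error drives both matrices;
        # the opposite-sign N update is an assumption of this sketch.
        delta = r + self.gamma * self.G[s_next].max() - self.G[s, a]
        self.G[s, a] += self.alpha * delta
        self.N[s, a] -= self.alpha * delta
```

In this sketch, a rewarded action raises its G value and lowers its N value, so both the direct-pathway promotion and the indirect-pathway release of suppression increase the probability of repeating that action.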
Simulation Environment: Python
Blackwell K, Doya K. (). Enhancing reinforcement learning models by including direct and indirect pathways improves performance on striatal dependent tasks. PLoS Comput Biol.