====
TD2Q
====
Python3 code for a new reinforcement learning model called TD2Q
A Q-learning model with Q matrices representing dSPNs and iSPNs, state splitting, and an adaptive exploration-exploitation parameter
**A. Several reinforcement learning tasks have been implemented**
- Discrim2stateProbT_twoQtwoS.py:
* Variations on discrimination and extinction.
* Parameters, including the reward and transition matrices, are in DiscriminationTaskparam2.py.
* Five learning phases are implemented (sketched below):
* acquisition of a one-choice task (tone A, left)
* extinction of the task (in a different context)
* renewal - retesting, after extinction, in the original (acquisition) context
* discrimination - adding a second choice (tone B, right)
* reversal - switching which direction is rewarded
* Can also test:
* savings after extinction
* acquisition in a new context
* AAB, ABA, and ABB extinction and renewal, where A and B are contexts
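A compact way to picture these phases is as a sequence of blocks that differ in context, available tones, and rewarded actions. The sketch below is illustrative only; the phase names, field names, and context labels are assumptions, not the variables used in Discrim2stateProbT_twoQtwoS.py or DiscriminationTaskparam2.py.

```python
# Illustrative summary of the five learning phases (field names and contexts are hypothetical).
phases = [
    {"phase": "acquisition",    "context": "A", "tones": ["toneA"],          "rewarded": {"toneA": "left"}},
    {"phase": "extinction",     "context": "B", "tones": ["toneA"],          "rewarded": {}},   # no reward, new context
    {"phase": "renewal",        "context": "A", "tones": ["toneA"],          "rewarded": {}},   # retest in the acquisition context
    {"phase": "discrimination", "context": "A", "tones": ["toneA", "toneB"], "rewarded": {"toneA": "left", "toneB": "right"}},
    {"phase": "reversal",       "context": "A", "tones": ["toneA", "toneB"], "rewarded": {"toneA": "right", "toneB": "left"}},
]
```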
- BanditTask.py
* Also known as the probabilistic serial reversal task.
* Parameters, including the reward and transition matrices, are in BanditTaskparam.py.
* In the 2-arm bandit task, the agent starts from (start location, tone blip) and must go to the center poke port.
* At the poke port, the agent hears a single tone (go cue), which contains no information about which port is rewarded.
* To receive a reward, the agent has to select either the left port or the right port.
* Both left and right choices are rewarded with probabilities that are assigned independently and change periodically (see the sketch below).
* This task can be run as a 1-step task by setting step1=True.
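The reward contingency amounts to an independent Bernoulli draw for each port, with the probability pair switching between blocks. A minimal sketch under assumed probabilities and block structure (the actual values are set in BanditTaskparam.py):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block schedule: (P(reward | left), P(reward | right)) for each block.
blocks = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]

def bandit_reward(choice, block_idx):
    """Independent Bernoulli reward for the chosen port in the current block."""
    p_left, p_right = blocks[block_idx]
    p = p_left if choice == "left" else p_right
    return int(rng.random() < p)

print(bandit_reward("left", 0))   # rewarded ~90% of the time in block 0
```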
- SequenceTask.py
* Parameters, including the reward and transition matrices, are in SequenceTaskparam.py.
* The task is that reported in Geddes et al., Cell, 2018.
* The agent must press the left lever twice and then the right lever twice to obtain a reward (see the sketch below).
* There are no external cues indicating whether the left or the right lever needs to be pressed.
* The minimum number of steps per trial is 7.
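Stated as code, the reward rule is simply that the last four presses must be left, left, right, right. The toy check below is for illustration only; the actual task is defined by the transition and reward matrices in SequenceTaskparam.py.

```python
TARGET = ("left", "left", "right", "right")

def sequence_rewarded(press_history):
    """True when the most recent four presses match the required LLRR sequence."""
    return tuple(press_history[-4:]) == TARGET

print(sequence_rewarded(["left", "left", "right", "right"]))   # True
print(sequence_rewarded(["left", "right", "left", "right"]))   # False
```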
- Any of the above programs can be run with the default parameters, or you can adjust the following values (see the example after this list):
* runs: number of agents, i.e., how many times the task will be simulated; equivalent to biological replicates
* trials: number of events (actions) the agent will take in a single run. For a 3-step task (BanditTask and Discrim), the number of completed trials is approximately trials/3
* save_data: set to False if you do not want to save the output to disk
* plot_Qhx: a higher number produces more graphs
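As an example of the kind of edit involved, values like these can be set near the top of a task script before running it (the names follow this README; check each script for the exact variable names and locations, which may differ):

```python
# Illustrative settings for a task script such as Discrim2stateProbT_twoQtwoS.py
runs = 10          # number of agents (biological replicates)
trials = 600       # number of events; ~200 completed trials for a 3-step task
save_data = True   # write output to disk
plot_Qhx = 2       # higher values produce more graphs
```

Each task script can then be run directly, e.g. `python3 BanditTask.py`.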
**B. Other files**
- agent_twoQtwoSsplit.py
* Agent with one or two Q matrices.
* The agent's states can include a context cue, i.e., one that does not influence rewards or transitions.
* Agent states (and Q matrix rows) are added as the agent encounters new states (see the sketch below).
- completeT_env.py
* Environment in which every state and transition is explicitly specified.
- sequence_env.py
* Environment used for large numbers of states, in which one type of state variable (e.g., press history) is independent of another (e.g., location).
* I.e., an agent's action alters either the press history or the location, but not both.
* This simplifies specification of the transition matrix (see the sketch below).
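The factorization can be illustrated with a toy transition rule in which each action changes either the location or the press history, never both. The action and location names below are made up for this sketch; sequence_env.py implements the idea through its own state and transition specification.

```python
def step(state, action):
    """Toy factored transition: movement changes location, pressing changes press history."""
    location, presses = state
    if action in ("goL", "goR", "go_mag"):     # location-changing actions
        location = {"goL": "left_lever", "goR": "right_lever", "go_mag": "magazine"}[action]
    elif action == "press":                    # press-history-changing action
        presses = (presses + (location,))[-4:] # remember the last four presses
    return (location, presses)

# Seven actions, matching the minimum trial length noted for the sequence task above.
state = ("start", ())
for a in ("goL", "press", "press", "goR", "press", "press", "go_mag"):
    state = step(state, a)
print(state)   # ('magazine', ('left_lever', 'left_lever', 'right_lever', 'right_lever'))
```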
- RL_class.py
* Base classes for the environment and the agent.
- RL_utils.py
* Additional functions used to create graphs and to accumulate output data.
- Qlearn_multifile_param_anal.py
* Analyzes a large set of parameter-sweep simulations that were run on the Mason cluster.
- TD2Q_manuscript_graphs.py and TD2Q_Qhx_graphs.py
* Used to create publication-quality figures (or panels to be combined into figures using Photoshop).
* The files to analyze are read in from banditFiles.py, discrimFiles.py, or sequenceFiles.py.
- persever.py
* Counts how many times the agent makes only one response (L or R) during the 50:50 block of the probabilistic serial reversal task.
* Also analyzes how often the best response in the prior block was the same as the perseverative response.
- multisim_Discrim.py, multisim_Sequence.py, multisim_Bandit.py
* Used to run parameter sweeps over beta and gamma for the three tasks.
* Summary results are saved in .npy files.
**C. Parameters**
- runs: number of agents, i.e., how many times the task will be simulated; equivalent to biological replicates
- trials: number of events (actions) the agent will take in a single run. For a 3-step task (BanditTask and Discrim), the number of completed trials is approximately trials/3
- save_data: set to False if you do not want to save the output to disk
- plot_Qhx: a higher number produces more graphs
- params['numQ']=1 # number of Q matrices. numQ=2 improves performance on the 2-arm bandit and sequence tasks; it has no effect on discrimination/extinction
- params['alpha']=[0.3,0.06] # learning rates for the Q1 and (optionally) Q2 matrices. Task dependent
- params['beta']=1.5 # maximum value of the inverse temperature, which controls exploration-exploitation
- params['beta_min']=0.5 # minimum value of the inverse temperature, which controls exploration
- params['gamma']=0.9 # discount factor
- params['hist_len']=40 # update the covariance matrix and ideal states of the agent as an average over this many events
- params['state_thresh']=[0.12,0.2] # threshold distance from an input state to an ideal state. Task and distance-measure dependent
- params['sigma']=0.25 # standard deviation used in the Mahalanobis distance
- params['moving_avg_window']=3 # in units of trials; the actual window is this value times the number of events per trial. Used to calculate reward probability
- params['decision_rule']
* None: choose an action based on Q1 and Q2, then resolve any difference
* 'delta': choose the action based on the difference between the Q1 and Q2 matrices
- params['Q2other']=0.0 # fractional learning rate (multiplied by alpha) for Q2 values of NON-selected actions, i.e., heterosynaptic plasticity
- params['distance']='Euclidean' # determine the best-matching state using Euclidean distance; alternative: 'Gaussian' (Mahalanobis distance)
- params['initQ']=-1
* -1 means do state splitting (initialize a new row of the Q matrix with the values of the best-matching state).
* initQ=0, 1, or 10 means initialize new Q values to that number and do not split.
* params['initQ']=-1 is the same as params['split']=True from the earlier version.
* params['initQ']=0 is the same as params['split']=False from the earlier version.
- params['D2_rule']=None # 'Opal': use the Opal update rule without a critic; 'Ndelta': calculate delta for the N matrix from the N values
- params['use_Opal']=False # use the Opal algorithm: implement a critic, use the Opal update rule, and use the delta decision rule
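Putting the above together, a parameter dictionary with the listed default values might look like the following. This is a sketch for orientation only; each task's param file constructs its own dictionary, so treat it as a summary rather than working configuration code.

```python
params = {
    'numQ': 1,                       # 1 or 2 Q matrices; 2 helps the bandit and sequence tasks
    'alpha': [0.3, 0.06],            # learning rates for Q1 and Q2
    'beta': 1.5,                     # maximum inverse temperature
    'beta_min': 0.5,                 # minimum inverse temperature
    'gamma': 0.9,                    # discount factor
    'hist_len': 40,                  # events averaged for covariance and ideal states
    'state_thresh': [0.12, 0.2],     # state-matching distance thresholds
    'sigma': 0.25,                   # std used in the Mahalanobis distance
    'moving_avg_window': 3,          # trials used to estimate reward probability
    'decision_rule': None,           # or 'delta'
    'Q2other': 0.0,                  # learning-rate fraction for non-selected Q2 actions
    'distance': 'Euclidean',         # or 'Gaussian' (Mahalanobis)
    'initQ': -1,                     # -1 enables state splitting
    'D2_rule': None,                 # or 'Opal', 'Ndelta'
    'use_Opal': False,
}
```

For intuition about beta and beta_min: beta acts as an inverse temperature in softmax action selection, with larger values favoring exploitation. A minimal softmax sketch is shown below (the adaptive adjustment of beta between beta_min and beta is not shown and is part of the TD2Q agent itself).

```python
import numpy as np

def softmax_choice(q_row, beta, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(beta * Q)."""
    q_row = np.asarray(q_row, dtype=float)
    p = np.exp(beta * (q_row - q_row.max()))   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q_row), p=p))
```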