1Kavli Institute for Theoretical Physics, University of California, Santa Barbara, Santa Barbara; 2Center for Theoretical Biological Physics, University of California, San Diego, La Jolla, California; and 3McGovern Institute for Brain Research, 4Howard Hughes Medical Institute, and 5Brain and Cognitive Sciences Department, Massachusetts Institute of Technology, Cambridge, Massachusetts
Submitted 14 December 2006; accepted in final form 13 July 2007
ABSTRACT
Plasticity at HVC→RA synapses is driven by hypothetical "rules" depending on three signals: activation of HVC→RA synapses, activation of LMAN→RA synapses, and reinforcement from an internal critic that compares the bird's own song with a memorized template of an adult tutor's song. Fluctuating glutamatergic input to RA from LMAN generates behavioral variability for trial-and-error learning. The plasticity rules perform gradient-based reinforcement learning in a spiking neural network model of song production. Although the reinforcement signal is delayed, temporally imprecise, and binarized, the model learns in a reasonable amount of time in numerical simulations. Varying the number of neurons in HVC and RA has little effect on learning time. The model makes specific predictions for the induction of bidirectional long-term plasticity at HVC→RA synapses.
INTRODUCTION
A number of song-related avian brain areas have been discovered (Fig. 1A). Song production areas (Fig. 1A, open blue) include HVC (high vocal center) and RA (robust nucleus of the arcopallium), which generate sequences of neural activity patterns and through motoneurons control the muscles of the vocal apparatus during song (Hahnloser et al. 2002; Suthers and Margoliash 2002; Wild 1993, 2004; Yu and Margoliash 1986). Lesion of HVC or RA causes immediate loss of song (Nottebohm et al. 1976; Simpson and Vicario 1990). Other areas in the anterior forebrain pathway (AFP) appear to be important for song learning but not production (Fig. 1A, filled green), at least in adults. The AFP is regarded as an avian homologue of the mammalian basal ganglia thalamocortical loop (Farries 2004; Perkel 2004; Reiner et al. 2004). In particular, lesion of area LMAN (lateral magnocellular nucleus of the nidopallium) has little immediate effect on song production in adults, but arrests song learning in juveniles (Bottjer et al. 1984; Doupe 1993; Scharff and Nottebohm 1991). These facts suggest that LMAN plays a role in driving song learning, but the locus of plasticity is in brain areas related to song production, such as HVC and RA.
Nearly a decade ago, Doya and Sejnowski (1998) attempted to place such observations in a schema borrowed from mathematical theories of reinforcement learning. In this schema, learning is based on interactions between an actor and a critic (Fig. 1B). The critic evaluates the performance of the actor at a desired task. The actor uses this evaluation to change in a way that improves its performance. To learn by trial and error, the actor performs the task differently each time. It generates both good and bad variations, and the critic's evaluation is used to reinforce the good ones.1 Ordinarily it is assumed that the actor generates variations by itself. However, Doya and Sejnowski considered a schema in which the source of variation is external to the actor. We will call this source the experimenter.
Doya and Sejnowski proceeded to identify the three parts of their schema with specific areas of the avian brain. The actor was identified with HVC, RA, and the motor neurons that control vocalization. They hypothesized that the actor learns through plasticity at the synapses from HVC to RA (Fig. 1C). Based on evidence of structural changes like axonal growth and retraction that take place in the HVC to RA projection during song learning (Herrmann and Arnold 1991; Kittelberger and Mooney 1999; Mooney 1992; Sakaguchi and Saito 1996; Stark and Scheich 1997), this view is widely regarded as plausible. Curiously, no reliable protocols for the induction of activity-dependent plasticity at these synapses in vitro have yet been found (R Mooney, private communication), possibly for good reasons, which we consider in the DISCUSSION. For the experimenter and critic, Doya and Sejnowski turned to the anterior forebrain pathway, hypothesizing that the critic is area X and the experimenter is LMAN.
What is the current status of the Doya–Sejnowski tripartite schema? The actor part of their model was on firm ground, but their ideas about the critic and the experimenter were more speculative. Unfortunately, the location of the critic is still unknown, although it is widely believed to exist. Because the critic has not been found, the nature of its feedback is still unknown. One could imagine a powerful critic, which gives the actor specific instructions about how to improve song. This would place more of the computational burden of the learning problem on the critic. Or one could imagine a weak critic, which simply tells the actor whether performance is good or bad. This would place more of the burden of learning on the actor.
On the other hand, there is increasing support for their general idea of LMAN as an experimenter. First, we review evidence in support of LMAN as an experimenter. Then we argue that recent experiments show important departures from the assumptions of Doya and Sejnowski about the structure and dynamics of LMAN's input to RA, which call for a different formulation of learning with LMAN experimentation in the songbird system.
During song, LMAN neural spiking is quite variable from trial to trial and more irregular than activity in RA (Hessler and Doupe 1999b; Leonardo 2004). Moreover, mean activity in LMAN correlates with the overall song variability: In adult birds, LMAN activity is low during song directed at females, which tends to be extremely stable and stereotyped, and much higher during the more variable undirected song (Hessler and Doupe 1999b; Kao et al. 2005). Although, as noted earlier, LMAN lesions have little effect on adult song, especially during directed bouts, closer inspection reveals that LMAN lesions reduce the slight variability present in adult undirected song (Kao et al. 2005). In juveniles, there is much greater trial-to-trial song variability compared with that of adults; this is dramatically reduced after LMAN lesions (Scharff and Nottebohm 1991). Recently it was shown that reversible pharmacological inactivation of juvenile LMAN with tetrodotoxin (TTX) or muscimol leads to immediate reduction in song variability (Kao et al. 2005; Olveczky et al. 2005). All of this evidence suggests that LMAN generates song variability through its projection to RA.2
But how, mechanistically and functionally, does LMAN drive song variability and learning? Doya and Sejnowski proposed that the role of LMAN input to RA is to produce a fluctuation that is static over the duration of a song bout, directly in the synaptic strengths from premotor nucleus HVC to RA. From a functional perspective, the model of Doya and Sejnowski is akin to "weight perturbation" (Dembo and Kailath 1990; Seung 2003; Williams 1992) and relatively easy to implement: a temporary but static HVC–RA weight change that lasts the duration of one song causes some change in song performance. If performance is good, the critic sends a reinforcement signal that makes the temporary static perturbation permanent. From a neurobiological perspective their model requires machinery whereby N-methyl-D-aspartate (NMDA)–mediated synaptic transmission from LMAN to RA can drive synaptic weight changes that remain static over the 1- to 2-s duration of song, in the heterosynaptic HVC–RA connections. However, LMAN activity in the songbird is dynamic and variable throughout song, evolving on a 10- to 100-ms timescale (Hessler and Doupe 1999a,b; Leonardo 2004), at odds with the assumption that at the beginning of song LMAN triggers an instantaneous perturbation in the HVC–RA weights, which is then held constant throughout the song.
Next, in recent experiments, transient stimulation in LMAN leads to transient, subsyllable-long changes in either song pitch or amplitude (Kao et al. 2005). Presumably, local stimulation excites local myotopic ensembles of LMAN neurons; if this LMAN activity led to static perturbations of a set of HVC synapses projecting to a myotopic RA group, it would have produced changes in pitch or amplitude that were not transient, but lasted to produce consistent biases in pitch or amplitude throughout one song iteration. In Olveczky et al. (2005), blocking NMDA receptor currents in RA causes the same reduction in song variability as does LMAN inactivation,3 indicating that the effects of LMAN activity in RA are through ordinary glutamatergic synaptic transmission into RA neurons. In short, LMAN appears to drive fast, transient song fluctuations on a subsyllable level, effected by ordinary excitatory transmission that drives dynamic postsynaptic membrane conductance fluctuations in the postsynaptic RA neurons. This picture of rapidly fluctuating glutamatergic input from LMAN driving fast conductance perturbations in RA is quite different, in its neurobiological mechanism and mathematical implications for reinforcement learning, from the Doya and Sejnowski model based on slow modulatory influences on HVC→RA weights.
Finally, for song learning, synapses from different HVC neurons to the same postsynaptic RA neuron must have the flexibility to change in opposite directions. Within the weight-perturbation model of Doya and Sejnowski, this requires that each synapse from HVC onto a single RA neuron receive independent perturbations in different directions, relative to other synapses from different HVC neurons onto the same RA neuron. In neurobiological terms, this could be possible if, for each synapse from a distinct HVC neuron onto an RA neuron, there were a separate LMAN input. However, this seems unlikely considering that each RA neuron receives only about 50 synapses from LMAN (Canady et al. 1988; Herrmann and Arnold 1991) compared with about 1,000 synapses from approximately 200 different HVC neurons (Kittelberger and Mooney 1999).
Next, we describe a learning rule that, like the weight-perturbation scheme used by Doya and Sejnowski, also belongs in the broad category of actor–critic reinforcement learning rules. However, the rule is distinct functionally and in its neurobiological implications from weight-perturbation–like schemes. Applied to the song system, the rule is fully consistent with the physiological and anatomical findings on LMAN input to RA and with the phenomenology of song learning.
Learning with empiric synapses
The goal of this work is to relate the high-level concept of reinforcement learning by the tripartite schema to a biologically realistic lower level of description in terms of microscopic events at synapses and neurons in the birdsong system, to demonstrate song learning in a network of realistic spiking neurons, and to examine the plausibility of reinforcement algorithms in explaining biological fine motor skill learning with respect to learning time in the birdsong network.
The present model is based on many of the same general assumptions that were made by Doya and Sejnowski. We assume a tripartite actor–critic–experimenter schema. The critic is weak, providing only a scalar evaluation signal. The HVC sequence is fixed, and only the map from HVC to the motor neurons is learned, through plasticity at the HVC→RA synapses.4 LMAN perturbs song through its inputs to the song premotor pathway. However, the structure and dynamics of LMAN inputs, and their influence on learning, are different, with distinct neurobiological implications. In keeping with our hypothesis that the function of LMAN drive to RA is to perform "experiments" for trial-and-error learning, the connections from LMAN to RA will be called "empiric" synapses (Fig. 1C).
We make a specific theoretical proposal for synaptic reinforcement learning in the case of birdsong, illustrated in Fig. 2. Functionally, our scheme is similar to "node perturbation" (Fiete and Seung 2006; Werfel et al. 2005; Xie and Seung 2004) because it relies on independent perturbations delivered to neurons (rather than to individual plastic synapses, as in weight perturbation). From a neurobiological perspective, this scheme is more realistic, for two reasons. First, it is in better agreement with the microanatomy of LMAN–RA synapses because it only requires one independent LMAN input per RA neuron, rather than per HVC–RA synapse. Second, the perturbation to each neuron in our model is temporally varying on a rapid timescale, not static, during song. This is consistent with activity in LMAN during song production and song learning.
In our proposal, the conductance of the plastic synapse from neuron j in HVC to neuron i in RA is given by $W_{ij} s_{ij}^{\mathrm{HVC}}(t)$, where the synaptic activation $s_{ij}^{\mathrm{HVC}}(t)$ determines the time course of conductance changes, and the plastic parameter $W_{ij}$ determines their amplitude. Changes in $W_{ij}$ are governed by the plasticity rule
$$\frac{dW_{ij}}{dt} = \eta\, R(t)\, e_{ij}(t) \tag{1}$$
Here $R(t)$ is the reinforcement signal, and the parameter $\eta$, called the learning rate, controls the overall amplitude of synaptic changes. The eligibility trace $e_{ij}(t)$ is a hypothetical quantity present at every plastic synapse. It signifies whether the synapse is "eligible" for modification by reinforcement and is based on the recent activation of the plastic synapse and the empiric synapse onto the same RA neuron
$$e_{ij}(t) = \int_0^t dt'\, G(t - t') \left[ s_i^{\mathrm{LMAN}}(t') - \langle s_i^{\mathrm{LMAN}} \rangle \right] s_{ij}^{\mathrm{HVC}}(t') \tag{2}$$
Here $s_i^{\mathrm{LMAN}}(t)$ is the conductance of the empiric (LMAN→RA) synapse onto the ith RA neuron. The temporal filter G(t) is assumed to be nonnegative and its shape determines how far back in time the eligibility trace can "remember" the past.
An important aspect of Eq. 2 is that the instantaneous activation of the empiric synapse is measured relative to its own expected activity $\langle s_i^{\mathrm{LMAN}}(t) \rangle$. This subtraction of average activation in the empiric synapse enables bidirectional synaptic changes, even if the reinforcement signal R(t) is constrained to be nonnegative.6 In our model, each empiric synapse is driven by a Poisson spike train from an LMAN neuron with constant firing rate, so $\langle s_i^{\mathrm{LMAN}} \rangle$ is a fixed constant throughout song and throughout learning (and thus easy to estimate by a simple time average) for every RA neuron.
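The two plasticity equations can be illustrated with a minimal discrete-time sketch for a single plastic synapse. It assumes that Eq. 1 has the form dW/dt = η R(t) e(t) and that Eq. 2 is a convolution of G with the mean-subtracted empiric activation multiplied by the plastic activation. The function names, the 1 ms time step, and the unit-peak normalization of G are illustrative assumptions, not part of the model as published.

```python
import math

def eligibility_filter(t_ms, n=5, tau_e=10.0):
    """Filter G(t) = t^n exp(-t / tau_e), scaled to unit peak (assumed normalization)."""
    if t_ms < 0:
        return 0.0
    peak = (n * tau_e) ** n * math.exp(-n)
    return (t_ms ** n) * math.exp(-t_ms / tau_e) / peak

def eligibility_trace(s_hvc, s_lman, dt=1.0, n=5, tau_e=10.0):
    """Discretized Eq. 2 for one synapse: filter (s_lman - <s_lman>) * s_hvc with G."""
    mean_lman = sum(s_lman) / len(s_lman)  # time average over the trial
    instantaneous = [(sl - mean_lman) * sh for sl, sh in zip(s_lman, s_hvc)]
    trace = []
    for k in range(len(instantaneous)):
        acc = 0.0
        for m in range(k + 1):
            acc += eligibility_filter((k - m) * dt, n, tau_e) * instantaneous[m]
        trace.append(acc * dt)
    return trace

def weight_change(reinforcement, trace, eta=0.0002, dt=1.0):
    """Discretized Eq. 1: total change in W over one trial."""
    return eta * sum(r * e for r, e in zip(reinforcement, trace)) * dt
```

With binary reinforcement arriving tens of milliseconds after HVC activation, this sketch yields a positive weight change when the empiric input coincided with the HVC input and a negative one when the empiric input was absent at that time.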
The preceding equations have the advantage of mathematical precision, but it is helpful to have verbal formulations of the conditions for synaptic strengthening and weakening, illustrated in Fig. 2B. Suppose an empiric synapse and a plastic synapse onto the same RA neuron are activated at the same time. By Eq. 2, the eligibility trace tends to be positive for some time after. If positive reinforcement arrives during this time interval, then Wij is increased by Eq. 1. Therefore the condition for synaptic strengthening can be summarized by the following rule.
R1: If coincident activation of a plastic (HVC→RA) synapse and empiric (LMAN→RA) synapse onto the same RA neuron is followed by positive reinforcement, then the plastic synapse is strengthened.
R2: If activation of a plastic (HVC→RA) synapse without activation of the empiric (LMAN→RA) synapse onto the same RA neuron is followed by positive reinforcement, then the plastic synapse is weakened.
To understand rule R2, note that each empiric synapse has a nonzero average level of activation, which is determined by the firing rate of the presynaptic LMAN neuron. If the empiric synapse is not active at a particular time, it means that the RA neuron is receiving less input than usual for that moment in time. Subsequent positive reinforcement suggests that this deficit of input is better. This LMAN-driven chance deficit is consolidated for future trials within the HVC–RA pathway by weakening the plastic synapses that were active at that time.
R1 and R2 describe how the presence or absence of chance LMAN input to RA, if followed by positive reinforcement, causes HVC–RA synapses to undergo either long-term potentiation (LTP, R1) or long-term depression (LTD, R2). Because the presence or absence of empiric (LMAN) input determines the sign of synaptic change when reinforcement is present, LMAN's role in the preceding rules might be mistaken as supervisory. We note, however, that in our theoretical formulation and birdsong model, output performance does not affect patterns of activity in the empiric (LMAN) input, which would be a requirement if LMAN were sending supervisory signals to RA based on output performance. Furthermore, if reinforcement is held constant, or if it varies independently of eligibility, then rules R1 and R2 produce no net (average) change in synaptic weights: over many trials, synaptic strengthening and weakening due to R1 and R2 cancel, even when LMAN is active. This can readily be seen from Eq. 2, where the average of synaptic eligibility alone is always zero. It is only when reinforcement actually covaries with fluctuations of the synaptic eligibility that there is a net nonzero change in synaptic weight.
Let us more closely examine how the demands of the desired trajectory—reflected in the reinforcement signal—set the balance between R1 and R2 to determine the actual direction of net synaptic change. Consider a scenario where overall performance would improve with an increase in the activity of RA neuron A at time t in the trajectory, a decrease in its activity at time t' in the trajectory, and be unaffected by changes in its activity at time t''. How do plasticity rules R1 and R2 combine to produce these changes? In this hypothetical scenario the network will tend to receive positive reinforcement in song trials where neuron A happens to be more active at time t than usual for that time, due to chance input from an empiric LMAN synapse. In trials where the empiric synapse to neuron A is quiescent at t, the network will tend to get less or no positive reinforcement because the neuron is less active than usual for that time. In short, for this scenario reinforcement is greater after empiric input to neuron A at time t than without, causing R1 to dominate over R2 and resulting in a net LTP of those regular inputs to neuron A that were active at time t. Conversely, because the trajectory would be better with less-than-usual activity in neuron A at time t', reinforcement will be larger in trials where LMAN inputs to A are quiescent at t', meaning that R2 will dominate and produce a net LTD of HVC synapses to A that were active at t'. Finally, because reinforcement does not depend on the activity of neuron A at t'', then reinforcement will arrive with equal likelihood after quiescence or activity in the LMAN input at t'', and the effects of R1 and R2 will cancel, resulting in zero average synaptic change for inputs to A that were active at t''.
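The balance between R1 and R2 in the three cases just described (times t, t', and t'') can be illustrated with a toy Monte Carlo sketch. It reduces the model to one plastic synapse and one moment in time, with the empiric input either present or absent on each trial. The reward probabilities, the 50% empiric activation probability, and the function name are hypothetical choices made only for illustration.

```python
import random

def net_weight_change(p_reward_given_input, p_reward_given_no_input,
                      trials=20000, p_input=0.5, eta=0.0002, seed=0):
    """Accumulate eta * R * (empiric activation - mean activation) over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        empiric_active = rng.random() < p_input
        p_reward = p_reward_given_input if empiric_active else p_reward_given_no_input
        reward = 1.0 if rng.random() < p_reward else 0.0
        # Mean-subtracted empiric activation; the HVC synapse is assumed active.
        eligibility = (1.0 if empiric_active else 0.0) - p_input
        total += eta * reward * eligibility
    return total

more_is_better = net_weight_change(0.8, 0.2)   # case t: R1 dominates
less_is_better = net_weight_change(0.2, 0.8)   # case t': R2 dominates
irrelevant = net_weight_change(0.5, 0.5)       # case t'': effects cancel on average
```

In this sketch the net change is positive when reward is more likely after empiric input, negative when it is more likely after quiescence, and close to zero when reward is independent of the empiric input.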
Gradient learning
In the preceding text our synaptic plasticity rules were justified with intuitive arguments. They can also be understood using a formal mathematical theory developed elsewhere (Fiete and Seung 2006). Under reasonable assumptions, the rules, based on dynamic conductance perturbations of the actor neurons, perform stochastic gradient ascent on the expected value of the reinforcement signal.8 The antagonism between plasticity rules R1 and R2 ensures that they compute the subtraction that is the essence of the definition of a gradient. This means that song performance as evaluated by the critic is guaranteed to improve on average. The guarantee holds even if the synapses are embedded in a network that is very complex: for example, the network may be recurrent and consist of conductance-based spiking neurons with synapses that display short-term plasticity. The guarantee is also broadly independent of model details or parameter choices.
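The idea that correlating a zero-mean perturbation with the resulting reinforcement estimates a gradient can be checked numerically in a deliberately simple setting. The sketch below assumes a single scalar activity, a smooth quadratic reward, and Gaussian perturbations. It also subtracts a reward baseline to reduce variance, which is a convenience of this illustration and not the rule used in the model. All names and values are hypothetical.

```python
import random

def reward(activity, target=1.0):
    """Toy smooth reward with its maximum at the target activity."""
    return -(activity - target) ** 2

def perturbation_gradient_estimate(activity, sigma=0.1, trials=20000, seed=0):
    """Estimate dR/d(activity) as E[xi * (R(a + xi) - R(a))] / sigma^2."""
    rng = random.Random(seed)
    baseline = reward(activity)
    total = 0.0
    for _ in range(trials):
        xi = rng.gauss(0.0, sigma)  # zero-mean perturbation
        total += xi * (reward(activity + xi) - baseline)
    return total / (trials * sigma ** 2)

estimate = perturbation_gradient_estimate(0.0)
analytic = 2.0  # d/da of -(a - 1)^2 evaluated at a = 0
```

For this toy reward the estimate comes out close to the analytic derivative, which is the sense in which perturbation-based rules follow the gradient on average.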
Gradient learning can be regarded as a method for (approximately) solving a computational problem: finding a configuration of synaptic strengths that optimizes the performance of a network as evaluated by a critic. In general, this optimization problem is nontrivial. The performance of the network is determined by the collective effects of a large number of synapses and neurons. The role of any given synapse in performance may not be obvious, given that its effect may be exerted through multiple polysynaptic or even recurrent pathways involving both excitation and inhibition. Furthermore, this role may shift over time as the network changes during learning.
Is the principle of gradient learning also used by the brain? One might be skeptical that such a formal principle is relevant for neurobiology. However, gradient learning has a property that is important for brains: it is very robust. Even when properties of the actor and critic are varied, the plasticity rules are still guaranteed to improve average network performance.
The role of numerical simulations
This paper contains the results of many numerical simulations, which might seem irrelevant given that the principle of gradient learning guarantees that the plasticity rules will improve performance. Why are the simulations important? Although there are mathematical guarantees that gradient learning will improve performance, there is no assurance about how fast these improvements will be. If learning turns out to take longer than the lifetime of a zebra finch, then our model of learning, based on the general principle of random single-neuron experimentation and global reinforcement, could be rejected. Thus learning speed is the main issue explored in our numerical simulations. We explore how learning time scales with the number of neurons (to obtain an estimate of learning speed in a realistically sized song network) and with the precision and delay of the reinforcement signal.
Reinforcement learning in its essence is a parallel blind local search in the space of plastic parameters to climb a hill (the reinforcement function, which reflects overall performance on the desired task). The number of search dimensions equals the number of independently perturbed parameters. In algorithms based on synaptic weight perturbation (Dembo and Kailath 1990; Seung 2003), the search dimension is the number of weights, whereas in algorithms based on node perturbation (Fiete and Seung 2006; Xie and Seung 2004), like the one proposed here, the search dimension is the number of perturbed neurons multiplied by the number of independent time steps in the trajectory. Because optimization by blind multiparameter local search is slow, reinforcement learning might similarly be too slow. Indeed, previous theoretical work on reinforcement learning algorithms shows that in certain feedforward networks, learning time scales proportionally with the number of plastic parameters or with the dimensionality of the input perturbations (Cauwenberghs 1993; Werfel et al. 2005).
Existing models of song learning are far from biologically realistic in network size, output degrees of freedom, neural dynamics, and characteristics of the reinforcement signal (temporal delay or broadening), and do not explore how convergence speed and final error would be affected if these properties were made to approach those found in the actual songbird. In fact, even in a small, simplified neural network model with small numbers of output degrees of freedom, Doya and Sejnowski (2000) reported that learning with independent random perturbations from LMAN resulted in relatively poor convergence to the tutor song. To remedy this situation, they assumed that LMAN computes and carries an instructive gradient signal for HVC–RA synaptic change, in addition to a random component. In addition, learning with a weight-perturbation scheme can be significantly slower and scale more poorly with network size than node-perturbation–like rules such as ours, as demonstrated in a network similar to the birdsong network (Werfel et al. 2005). Thus existing work provides few results on the possibility or accuracy of song learning based on uncorrelated random perturbations from LMAN in full-scale, realistic network models of birdsong acquisition.
In the bird there are as many as 8,000 RA neurons (and therefore as many potentially independent exploratory perturbations) and 20,000 × 8,000 ∼ 10⁸ plastic HVC–RA weights. We show that even in such large networks, it is possible at least in principle for independent random neural perturbation to produce biologically realistic learning.
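The counts in this paragraph can be worked through explicitly. The short sketch below reads the product as 20,000 HVC neurons times 8,000 RA neurons, which is an interpretation of the figures quoted in the text. The variable names are illustrative.

```python
n_hvc = 20_000  # assumed number of HVC neurons, from the product quoted in the text
n_ra = 8_000    # upper estimate of RA neurons quoted in the text

# Potential plastic HVC-to-RA weights if every HVC neuron could contact every RA neuron.
potential_weights = n_hvc * n_ra

# Each independent perturbation source (one per RA neuron) is shared by this many weights.
weights_per_perturbed_neuron = potential_weights // n_ra
```

The product is 1.6 × 10⁸, on the order of 10⁸, so there are far fewer independent perturbation sources than plastic weights.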
To challenge our plasticity rules, we have made our model of song production quite complex. Unlike any existing models of sensorimotor learning in the song pathway, the model neurons in HVC and RA are biophysically realistic, generating spikes and interacting through synaptic conductances. The spiking activity of the network is converted into an acoustic signal by a simple model of the vocal organ. To further challenge our plasticity rules, we have intentionally "crippled" the critic's reinforcement signal, to make it more difficult to learn from. The critic is modeled as a template matcher that compares the acoustic signal with a template drawn from real zebra finch song. The critic's signal reaches the actor only after a temporal delay, is temporally imprecise, and is binary rather than analog. These features could be realistic if the critic's signal is broadcast by secretion of a neuromodulator. The question is whether the plasticity rules will still be able to learn in a reasonable amount of time.
Although our models of song production and evaluation are highly complex, one should not forget that the underlying model of synaptic plasticity is extremely simple: it consists of the two equations (Eqs. 1 and 2). It is this simple model that is being tested herein. The complexities are there to make the test challenging.
METHODS
ACTOR. In our model of song production, a model neural network controls a source–filter model of the avian vocal organ. Neurons interact through synaptic conductances and generate spikes, unlike past models based on nonspiking neurons (Doya and Sejnowski 1998; Troyer and Doupe 2000).
The network is composed of layers that represent HVC, RA, and motor neurons (Fig. 1). The connectivity of the network is feedforward, except for weak global inhibition in RA. Two output units represent motor neuron pools. They low-pass filter and sum the synaptic currents from RA, to produce a pair of time-varying control signals for the vocal organ.
In zebra finches, each RA-projecting HVC neuron generates a single burst of spikes at a stereotyped time during a song motif (Hahnloser et al. 2002). The burst onset times of the population of neurons are distributed throughout the song motif. To simulate these short bursts, we stimulate each HVC neuron in our model with a single current pulse during the song. This pattern of activity remains unchanged during learning.
Our source–filter model of the syrinx, the avian vocal organ, is mathematically similar to digital models of speech production (Rabiner and Schafer 1978). Oscillatory motions of the syrinx are driven by air flow, yielding an acoustic output of a set of harmonically related frequencies. The pitch or fundamental frequency of the harmonics is adjusted by muscles that control the tension of the syringeal fold (Goller and Larsen 1997; Suthers et al. 1999; Warner 1972; Wild 1997), whereas amplitude is partially controlled by air flow. The source in our source–filter model is a pulse train, yielding an acoustic output of a set of harmonically related frequencies, with pitch and amplitude controlled by the two time-varying outputs of the motor network. In the bird, the vocal tract and beak filter the broad spectral content of the syringeal output, and may also directly affect the syringeal oscillations (Beckers et al. 2003; Nowicki 1987; Suthers et al. 1999). The filter in our source–filter model is based on ten linear predictive coefficients, which are generated from zebra finch song recordings to produce a broad spectral envelope similar to that of real songs. For simplicity, the filter is static over the duration of the simulated song and does not change with learning.
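A source–filter arrangement of this kind can be sketched in a few lines: a pulse train sets pitch and amplitude, and a static all-pole (linear predictive) filter shapes the spectrum. The sketch below holds pitch and amplitude constant for simplicity, whereas in the model they vary in time. The sample rate, function names, and any coefficient values passed in are hypothetical and are not the ten coefficients fitted to zebra finch recordings.

```python
def pulse_train(pitch_hz, amplitude, duration_s, sample_rate=8000):
    """Source: one pulse per pitch period, scaled by amplitude."""
    period = int(round(sample_rate / pitch_hz))
    n_samples = int(round(duration_s * sample_rate))
    return [amplitude if k % period == 0 else 0.0 for k in range(n_samples)]

def all_pole_filter(source, lpc):
    """Filter: y[k] = x[k] - sum_j lpc[j-1] * y[k-j], a static all-pole filter."""
    out = []
    for k, x in enumerate(source):
        y = x
        for j, a in enumerate(lpc, start=1):
            if k - j >= 0:
                y -= a * out[k - j]
        out.append(y)
    return out

# Illustrative use with a single made-up coefficient.
song = all_pole_filter(pulse_train(500.0, 1.0, 0.01), [-0.5])
```

Changing `pitch_hz` moves the harmonic spacing and changing `amplitude` scales the output, while the filter coefficients fix the spectral envelope.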
Our use of the source–filter model is a compromise between simplicity and realism. More realistic models of the syrinx have relied on physics-based simulations (Fletcher 1988; Titze 1988), and display both quasiperiodic and chaotic behaviors. Their quasiperiodic behaviors are similar to those of our source–filter model, but the physics-based models are much more time consuming to simulate.
CRITIC. The critic compares the pitch and amplitude of the generated song against those of the template, which is a recording of real zebra finch song, and sends a delayed comparison of the two back to the song network. At every instant in time, the error of the model song with respect to the template is computed as the sum of the squares of the pitch and amplitude differences. The critic's signal is "crippled" in several ways to make learning more difficult and thus to test the capabilities of our model: First, the critic's signal is binarized, rather than analog. Whenever the error is below a similarity threshold, then the critic provides a reinforcement of strength one; otherwise its signal is zero. Second, the signal is temporally delayed by 50 ms. Third, the signal is temporally broadened in some simulations.
There is a similarity threshold for each moment of song, set by the average performance at that moment in the last few trials. This adaptive threshold ensures that the critic gives positive reinforcement roughly 50% of the time. If the threshold were set improperly, then the critic would be hypercritical (never reinforcing anything) or uncritical (reinforcing everything). Our use of an adaptive threshold is similar to baseline comparison in reinforcement learning, which can result in faster learning and lower final error (Dayan 1990).
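The binarized critic with an adaptive threshold can be sketched as follows. The sketch assumes the threshold at each time step is the mean error at that step over a fixed number of recent trials, and that everything is reinforced on the very first trial when no history exists. The class name, the history length of five trials, and the first-trial behavior are assumptions. The 50 ms delay and temporal broadening are left out.

```python
from collections import deque

class AdaptiveCritic:
    """Binary critic: reinforce when the error beats the recent average at that moment."""

    def __init__(self, n_steps, history=5):
        # One short error history per time step of the song.
        self.recent_errors = [deque(maxlen=history) for _ in range(n_steps)]

    def evaluate(self, pitch, amplitude, template_pitch, template_amplitude):
        reinforcement = []
        for t in range(len(self.recent_errors)):
            error = ((pitch[t] - template_pitch[t]) ** 2
                     + (amplitude[t] - template_amplitude[t]) ** 2)
            past = self.recent_errors[t]
            threshold = sum(past) / len(past) if past else float("inf")
            reinforcement.append(1 if error < threshold else 0)
            past.append(error)  # update the history after scoring this trial
        return reinforcement
```

Because the threshold tracks recent performance, the fraction of reinforced moments stays roughly stable as the song improves, rather than drifting toward all ones or all zeros.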
In our model, the critic's signal reaches HVC→RA synapses after a delay of $T_{\text{delay}}$ = 50 ms relative to the RA neural activities that gave rise to it. This number was inferred as follows. First, the delay from RA activity to acoustic output is estimated to lie in the range from 20 ms (Fee et al. 2004) to 45 ms (Troyer and Doupe 2000). The lower of these two numbers, when added to an estimated auditory processing delay of 30 ms (Troyer and Doupe 2000), yields $T_{\text{delay}}$ = 50 ms.
In some simulations, the critic's signal is temporally broadened in addition to being delayed. This is done by low-pass filtering with a 50 ms time constant (see Numerical details).
EXPERIMENTER. In each time interval [t, t + dt] during song, LMAN neurons fire a spike with probability $p = \lambda\, dt$, with firing rate $\lambda$ = 80 Hz chosen to be consistent with the averaged spiking rate of putative RA-projecting single LMAN units recorded in the singing bird (Leonardo 2004). This underlying firing rate is taken to be constant throughout song and over learning. LMAN spike trains are regenerated, and thus vary, from iteration to iteration.
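The spike-generation procedure described above amounts to a Bernoulli draw in each small time bin. A minimal sketch follows, assuming a bin width of 1 ms (the simulation time step is not stated here) and a hypothetical function name.

```python
import random

def lman_spike_train(duration_ms, rate_hz=80.0, dt_ms=1.0, rng=None):
    """Return a list of 0/1 spike indicators, one per bin, with P(spike) = rate * dt."""
    rng = rng or random.Random()
    p_spike = rate_hz * dt_ms / 1000.0  # convert ms to s so rate * dt is a probability
    n_bins = int(round(duration_ms / dt_ms))
    return [1 if rng.random() < p_spike else 0 for _ in range(n_bins)]

# A fresh train is drawn for every song iteration, so trains differ across trials.
trial_a = lman_spike_train(1000.0, rng=random.Random(1))
trial_b = lman_spike_train(1000.0, rng=random.Random(2))
```

The approximation is only valid when `rate_hz * dt_ms / 1000` is much smaller than one, which holds for 80 Hz and millisecond bins.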
Synaptic plasticity
As described earlier, the reinforcement signal R(t) is delayed by 50 ms after the neural events that gave rise to the song that it evaluates. Therefore reinforcement starts 50 ms after the song has begun and ends 50 ms after the song has ended. Equation 1 is applied during this period. The learning rate is $\eta$ = 0.0002.
The temporal filter G(t) = t^n exp(–t/τe) was used in Eq. 2, with τe = 10 ms and n = 5. The peak of this filter is at Tdelay = nτe, so the eligibility trace can be regarded as a version of the instantaneous eligibility that is delayed by Tdelay = 50 ms to match the time delay in the reinforcement signal. However, to be realistic we assume that delaying the eligibility trace comes at the cost of introducing temporal imprecision. The width of the filter, defined as the time between the two inflection points flanking the delta-function response peak, is 2√n τe, so a temporal imprecision of 45 ms is introduced by filtering to produce a 50 ms delay. In the simulations, the time average ⟨sLMAN⟩ is computed by averaging the LMAN spike train of the current trial. It could be implemented instead by a low-pass filter at every LMAN→RA synapse.
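The quoted peak and width of the eligibility filter can be checked numerically. The sketch below assumes the filter has the form G(t) = t^n exp(–t/τe) with n = 5 and τe = 10 ms; the grid spacing and variable names are illustrative choices.

```python
import math

def G(t, n=5, tau_e=10.0):
    """Eligibility filter G(t) = t**n * exp(-t / tau_e), with t in ms."""
    return t ** n * math.exp(-t / tau_e)

dt = 0.05                                    # grid spacing in ms (illustrative)
ts = [k * dt for k in range(1, 4001)]        # 0.05 ms ... 200 ms
vals = [G(t) for t in ts]

# Peak location: the grid point where G is largest.
peak_t = ts[max(range(len(vals)), key=vals.__getitem__)]

# Discrete second difference; curv[j] is centred on ts[j + 1].
curv = [vals[k - 1] - 2.0 * vals[k] + vals[k + 1] for k in range(1, len(vals) - 1)]

# Inflection points: where the curvature changes sign.
inflections = [0.5 * (ts[j + 1] + ts[j + 2])
               for j in range(len(curv) - 1)
               if curv[j] * curv[j + 1] < 0]
width = inflections[1] - inflections[0]
```

Analytically, setting G''(t) = 0 gives t = nτe ± √n τe, so the peak sits at 50 ms and the two inflection points are separated by 2√5 × 10 ms ≈ 44.7 ms, in line with the 45 ms quoted above.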
There is no clear experimental evidence for plasticity in the RA→motor output connections, although it is possible these weights are also learned. In addition, the rules described in R1 and R2 could be used in the recurrent RA synapses at the same time as in the HVC→RA synapses, and would drive gradient learning on the whole network. We have focused our attention on the HVC→RA synapses because they are widely expected to be involved in song learning (Herrmann and Arnold 1991; Kittelberger and Mooney 1999; Mooney 1992; Sakaguchi and Saito 1996; Stark and Scheich 1997).
Numerical details
VOLTAGE AND CONDUCTANCE DYNAMICS.
The membrane potentials V of all neurons in HVC and RA are governed by

$$C_m \frac{dV_i}{dt} = -g_L (V_i - V_L) - g_{E,i}(t)(V_i - V_E) - g_{I,i}(t)(V_i - V_I) \qquad (3)$$

The membrane potential is reset to Vreset when Vi crosses the threshold voltage Vθ; this threshold-reset event represents a voltage spike followed by repolarization. Following a spike in the ith neuron in HVC or RA, the synaptic activation ski(t) in the synapse from neuron i to neuron k is incremented by one. Between spikes it decays with time constant τs

$$\tau_s \frac{ds_{ki}}{dt} = -s_{ki} \qquad (4)$$

The synaptic activations driven by HVC, LMAN, and RA neurons are denoted sHVC(t), sLMAN(t), and sRA(t), respectively. Note that although we have used integrate-and-fire neurons and relatively simple time courses for synaptic dynamics, the learning rule is guaranteed to perform stochastic gradient ascent on the reinforcement R even for more complicated neuron models (e.g., Hodgkin–Huxley) and synaptic time courses (Fiete and Seung 2006). RA neurons receive excitatory synaptic inputs from HVC and LMAN, and global (recurrent) inhibitory inputs due to activity in RA.
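As an illustration of the threshold-reset dynamics, the following sketch integrates a single conductance-based integrate-and-fire neuron with forward Euler steps. It assumes the standard form Cm dV/dt = –gL(V – VL) – gE(V – VE) – gI(V – VI) and uses the RA parameter values listed under Premotor network parameters; the function name and the choice of Euler integration are assumptions, not details taken from the paper.

```python
def simulate_lif(g_exc, g_inh=None, g_leak=0.44, dt=0.2, c_m=1.0,
                 v_leak=-60.0, v_exc=0.0, v_inh=-70.0,
                 v_thresh=-50.0, v_reset=-55.0):
    """Integrate one neuron; g_exc and g_inh hold one conductance per time step.

    Units: mV, ms, mS/cm^2, uF/cm^2. Returns (voltage trace, spike times in ms).
    """
    v = v_leak
    trace, spikes = [], []
    for step, g_e in enumerate(g_exc):
        g_i = g_inh[step] if g_inh is not None else 0.0
        current = (-g_leak * (v - v_leak)
                   - g_e * (v - v_exc)
                   - g_i * (v - v_inh))
        v += dt * current / c_m
        if v >= v_thresh:                 # threshold crossing counts as a spike
            spikes.append(step * dt)
            v = v_reset                   # repolarization
        trace.append(v)
    return trace, spikes

# 100 ms of constant excitatory conductance, strong versus weak.
_, strong_spikes = simulate_lif([0.2] * 500)
_, weak_spikes = simulate_lif([0.05] * 500)
```

With gE = 0.2 mS/cm2 the steady-state voltage (gL VL + gE VE)/(gL + gE) is about –41 mV, above threshold, so the neuron fires repeatedly; with gE = 0.05 mS/cm2 it settles near –54 mV and stays silent.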
Two nonspiking motor output units with time constant τm and tonic activations bi sum the synaptic activations from RA, through a fixed set of RA–output weights A

$$\tau_m \frac{dm_i}{dt} = -m_i + b_i + \sum_j A_{ij}\, s_j^{RA}(t)$$
PREMOTOR NETWORK PARAMETERS.
For all HVC and RA neurons, Cm = 1 µF/cm2, VL = –60 mV, VE = 0 mV, and VI = –70 mV. The leak conductance is gL = 0.3 mS/cm2 for HVC neurons and gL = 0.44 mS/cm2 for RA neurons. The threshold membrane potential is Vθ = –50 mV, and Vreset = –55 mV. The synaptic time constant is τs = 5 ms for HVC→RA, LMAN→RA, and RA→motor output connections. We also assume τm = 5 ms. In all simulations, the time grain is dt = 0.2 ms, so Eqs. 3 and 4 are discretized, and
the delta function δ(t – t′) for a spike at time t′ is replaced by its discrete-time counterpart.
There are NHVC HVC neurons, NRA RA neurons, and NLMAN LMAN neurons in our simulations. In all cases, NLMAN = NRA. The synaptic conductances in HVC are gI,i(t) = 0 for all neurons at all times; gE,i(t) = 0 for all neurons at most times in the motif, except for one brief excitatory pulse of duration 6 ms and magnitude 0.13 mS/cm2 per neuron per motif. The onset times for the pulses for different HVC neurons are distributed evenly across the simulated motif, and this pattern of HVC inputs stays fixed throughout learning. In RA, the synaptic conductances are gE,i(t) = 0.0024[Σj Wij sjHVC(t) + siLMAN(t)], and gI,i(t) = (0.2/NRA) Σj sjRA(t) for all i. With these numerical values, the average excitatory drive to each RA neuron is approximately eightfold stronger than the average inhibitory drive from global inhibition. However, results reported here do not depend on the existence of global inhibition in RA; we have performed simulations with no inhibition in RA, and the results remain qualitatively unchanged. The HVC→RA synaptic weights W are initialized randomly with uniform probability on the interval [0, 1.5] in all the simulations shown herein.
RA–output weight matrix A: half of all RA neurons, randomly chosen, project to m1; the other half project to m2. Of the set projecting to m1, half the weights are of uniform strength 440/NRA and half are –440/NRA. Similarly, of the set projecting to m2, half the weights are uniformly 640/NRA and the other half are uniformly –640/NRA. These values were chosen to be large enough so that the maximum range of the network outputs could span the amplitudes and pitches present in the recorded tutor song. The opposing signs of the weights A to the output pools are meant to represent bidirectional muscle control from some resting position (Suthers et al. 1999), rather than literal excitatory or inhibitory synapses. The strengths scale inversely with NRA to keep the mean output drive the same when NRA is varied. The baseline or "resting" values of the outputs in the absence of any drive from RA are b1 = 60 and b2 = 40.
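The construction of the RA–output weight matrix A described above can be sketched as follows. The function name `build_output_weights`, the row ordering (row 0 for m1, row 1 for m2), and the way signs are assigned within each pool are illustrative assumptions; only the pool sizes and weight magnitudes come from the text.

```python
import random

def build_output_weights(n_ra, strengths=(440.0, 640.0), rng=None):
    """Return a 2 x n_ra matrix A; row 0 drives m1, row 1 drives m2."""
    rng = rng or random.Random()
    neurons = list(range(n_ra))
    rng.shuffle(neurons)                       # random assignment to output pools
    half = n_ra // 2
    pools = (neurons[:half], neurons[half:])
    A = [[0.0] * n_ra for _ in range(2)]
    for row, (pool, strength) in enumerate(zip(pools, strengths)):
        w = strength / n_ra                    # magnitude scales as 1 / N_RA
        for rank, j in enumerate(pool):
            # Half of each pool gets +w, the other half -w.
            A[row][j] = w if rank < len(pool) // 2 else -w
    return A

A_small = build_output_weights(200, rng=random.Random(0))
A_large = build_output_weights(800, rng=random.Random(0))
```

Because each magnitude is divided by NRA, the summed absolute weight onto m1 is 220 for both network sizes, which is one way to see why the output range is preserved when RA is enlarged.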
All microscopic parameters such as individual neural leak conductances, time constants, and so forth are kept fixed, while scaling the size of the network and generating learning curves for the scaled network. To do this correctly, we have to scale some other macroscopic parameters together with network size. For example, if the RA layer is scaled up by a factor of 4 in size, then all weights from RA to the motor outputs are globally scaled downward by the same factor of 4 to keep the maximum summed drive to the output units, and thus the range of allowed vocal pitch and amplitude, fixed. Such scaling is described in both the preceding and subsequent text.
The total length of the simulated song motif is T = 300 ms in Fig. 4. In Fig. 6, we study the effects of song length and HVC size on learning time. To make the comparison reasonable, we change song length and HVC size while keeping total HVC drive per song-moment constant, so we scale NHVC with song length. Both are reduced fourfold, so T = 75 ms and NHVC = 180; all other parameters are kept unchanged. In Fig. 7, we study the effects of scaling RA size on learning time. Because of the result that song learning does not depend on song length and HVC size, and because it is currently infeasible to run simulations with larger networks, both curves are trained with the short-duration song (T = 75 ms) with small HVC (NHVC = 180). In one curve, NRA = 200; in the other, RA size is increased fourfold, to NRA = 800. NLMAN and the weights A rescale automatically as described earlier. To keep the total variance of the output motor pools fixed as NRA is scaled, we rescale the size of the experimental pulses from LMAN to be larger by a factor of
√4 = 2. The learning rate η is empirically adjusted in each case to give the fastest possible stable (monotonically nonincreasing on a coarse scale) learning curve.
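The size of the required rescaling can be motivated by a short variance calculation. This is a sketch under simplifying assumptions that the text does not state explicitly: the LMAN-driven fluctuations ξj of different RA neurons are treated as independent with a common variance σ², the motor output is treated as a linear sum, and each of the NRA/2 neurons projecting to a given output has weight magnitude a/NRA.

```latex
\begin{align}
\delta m &= \sum_{j \in \text{pool}} A_j \, \xi_j ,
  \qquad |A_j| = \frac{a}{N_{RA}}, \quad |\text{pool}| = \frac{N_{RA}}{2}, \\
\operatorname{Var}(\delta m) &= \sum_{j \in \text{pool}} A_j^2 \, \sigma^2
  = \frac{N_{RA}}{2} \cdot \frac{a^2}{N_{RA}^2} \, \sigma^2
  = \frac{a^2 \sigma^2}{2 N_{RA}}, \\
N_{RA} \to 4 N_{RA} &\;\Rightarrow\; \sigma \to \sqrt{4}\,\sigma = 2\sigma
  \quad \text{keeps } \operatorname{Var}(\delta m) \text{ unchanged.}
\end{align}
```

Under these assumptions the output variance falls as 1/NRA, so a fourfold increase in RA size is compensated by doubling the amplitude of the LMAN-driven fluctuations.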
The sound generator is driven by the two motor outputs m1(t) and m2(t), sampled at 44 kHz. m1(t) specifies the delta-pulse spacing (pitch period); for period-to-pulse conversion, a counter sums 1/m1(t) until it crosses 1, which triggers a pulse of duration (1/44,000) s, and the counter is reset to 0. The height of each pulse is specified by the value of m2(t) × 10⁻³ at the time of the pulse. We use a fixed 10-parameter linear predictive coding (LPC) filter derived from a concatenated sample of three arbitrarily selected zebra finch song recordings. The filter parameters are static and do not change over the course of the song or over the course of song learning. The real part of the filtered pulse train is the student song.
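The period-to-pulse conversion can be sketched as below. The function name `pulse_train` and the argument names are hypothetical, the pitch-period control is assumed to be expressed in samples, and the subsequent linear predictive filtering step is omitted.

```python
def pulse_train(pitch_period, amplitude):
    """Convert per-sample pitch period (in samples) and amplitude into pulses.

    A counter accumulates 1 / period; when it reaches 1, a one-sample pulse of
    height amplitude * 1e-3 is emitted and the counter is reset to 0.
    """
    out = []
    counter = 0.0
    for period, amp in zip(pitch_period, amplitude):
        counter += 1.0 / period
        if counter >= 1.0:
            out.append(amp * 1e-3)
            counter = 0.0
        else:
            out.append(0.0)
    return out

# Constant pitch period of 32 samples and constant amplitude control of 60.
pulses = pulse_train([32.0] * 320, [60.0] * 320)
pulse_positions = [k for k, v in enumerate(pulses) if v != 0.0]
```

With a constant period of 32 samples the pulses fall every 32 samples, and a slowly varying period stretches or compresses the spacing sample by sample.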
CRITIC.
Pitch extraction: The songs are windowed into overlapping segments by multiplication with a 300-sample (6.8-ms) Hanning window that shifts by 10 samples (0.23 ms) at a time until the entire length of simulated song is covered. To obtain a value for the pitch from each windowed segment, we compute the autocorrelation of that segment; the pitch period is assigned to be the number of samples between the highest peak (at zero time lag) and the second-highest peak, so long as this value is between 12 and 80; if outside this range, the distance to the next-highest peak is computed, until a value is found that falls in the allowed range. The middle 10 samples of the current windowed segment are assigned this value of estimated pitch. This procedure is repeated for each segment. The beginning of the first windowed segment and the end of the last windowed segment of the song are assigned the same pitch values as their closest assigned neighbors.
Amplitude extraction: The songs are windowed into 100-sample (2.3-ms) disjoint segments. All 100 samples of each disjoint segment are assigned an amplitude of 0.3 × max |song segment|.
Let p(t), a(t) represent the student song pitch and amplitude, and let p̄(t), ā(t) represent the tutor song pitch and amplitude.
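The two feature extractors can be sketched for a single segment as follows. Function names are hypothetical, the peak-picking rule is one plausible reading of the description (local maxima of the autocorrelation ranked by height), and the sliding of the window across the whole song is left out for brevity.

```python
import math

def hann(n):
    """Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def autocorrelation(x):
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]

def pitch_period(segment, lo=12, hi=80):
    """Lag of the highest nonzero-lag autocorrelation peak inside [lo, hi]."""
    r = autocorrelation(segment)
    peaks = [lag for lag in range(1, len(r) - 1)
             if r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]]
    for lag in sorted(peaks, key=lambda k: r[k], reverse=True):
        if lo <= lag <= hi:
            return lag
    return None                                # no admissible peak found

def amplitude_track(song, seg_len=100, scale=0.3):
    """Assign scale * max|segment| to every sample of each disjoint segment."""
    out = []
    for start in range(0, len(song), seg_len):
        seg = song[start:start + seg_len]
        out.extend([scale * max(abs(v) for v in seg)] * len(seg))
    return out

# One 300-sample windowed segment of a tone with a 40-sample period.
window = hann(300)
segment = [w * math.sin(2.0 * math.pi * k / 40.0) for k, w in enumerate(window)]
estimated_period = pitch_period(segment)
```

The allowed range of 12 to 80 samples corresponds to pitches of roughly 550 Hz to 3.7 kHz at a 44 kHz sampling rate.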
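The reinforcement signal R is computed by thresholding the delayed estimate of performance

$$R(t) = \Theta[D(t) - \bar{D}(t)] \qquad (5)$$

or, when the reinforcement is temporally broadened, by low-pass filtering the thresholded signal

$$\tau_R \frac{dR}{dt} = -R(t) + \Theta[D(t) - \bar{D}(t)] \qquad (6)$$

with τR = 50 ms.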
In the preceding expressions, Θ[D(t) – D̄(t)] is 0 when the performance D(t) is worse than a threshold D̄(t), and is 1 when it is better. To mimic delays inherent in the transformation of network activity into vocal output and auditory processing, we assume that D(t) is itself a delayed measure of network performance: at time t, it reflects the performance of the network outputs at t – Tdelay. It is given by D(t + Tdelay) = –{[p̄(t) – p(t)]²/cp² + [ā(t) – a(t)]²/ca²} when the tutor song is nonsilent, and by D(t + Tdelay) = –2[ā(t) – a(t)]²/ca² during silent intervals in the tutor song. The parameters cp and ca equalize the importance given by the critic to pitch and amplitude; cp = 60, ca = 80 × 10⁻³, and Tdelay = 50 ms. The critic threshold D̄(t) adapts as the model birdsong network learns song, and is time-varying within the song. For each time t0 in the motif, D̄(t0) is obtained by linearly low-pass filtering D(t0) over the past five motif iterations. In all the simulations except Fig. 6, τR = 0 ms: in other words, the reinforcement is delayed while eligibility is correspondingly delayed and broadened (temporally imprecise), although the reinforcement signal is not itself broadened.
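The critic's computation at a single time point can be sketched as below. All function names are hypothetical, and the threshold update is implemented as a plain average over the last five trials, which is only one possible reading of "linearly low-pass filtering over the past five motif iterations". The 50 ms delay is not modeled here.

```python
def performance(p, a, p_tutor, a_tutor, tutor_silent, c_p=60.0, c_a=80e-3):
    """Performance D at one time point (0 is a perfect match, lower is worse)."""
    amp_term = (a_tutor - a) ** 2 / c_a ** 2
    if tutor_silent:
        return -2.0 * amp_term                 # only amplitude matters in silence
    return -((p_tutor - p) ** 2 / c_p ** 2 + amp_term)

def reinforcement(d, threshold):
    """Binary reinforcement: 1 if performance beats the threshold, else 0."""
    return 1 if d > threshold else 0

def update_threshold(history, d_new, window=5):
    """Append d_new and return the mean of the last `window` values (assumed)."""
    history.append(d_new)
    del history[:-window]
    return sum(history) / len(history)

# One time point of the motif followed over successive trials.
history = []
threshold = update_threshold(history, -1.0)    # first trial sets the baseline
r = reinforcement(performance(70.0, 0.05, 40.0, 0.05, False), threshold)
```

Here the pitch error of 30 samples gives D = –0.25, which beats the running threshold of –1.0, so the trial is reinforced (R = 1). As later trials improve, the averaged threshold rises and the same performance would no longer be rewarded.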
| RESULTS |
|---|
The songs of the model network before and after learning are compared in Fig. 4A. The network learned to approximate the song template shown in Fig. 4A (left), which was a 300-ms segment of song recorded from a real zebra finch. Before learning, the simulated song looks nothing like the template. After learning, the simulated song is a good approximation to the template (sound files included in Supplemental Materials).9
Before learning, the strengths of the synapses from HVC to RA were initialized randomly. During the learning process, the strengths of these synapses were changed according to Eqs. 1 and 2. The spatiotemporal pattern of HVC neural activity was assumed to remain constant. Changes in the synapses from HVC to RA caused the formation of a "premotor map" that translates HVC spiking into a sequence of vocal commands appropriate for generating song.
Dynamics of learning
The start and end of the learning process are depicted in Fig. 4A. The process did not occur suddenly, but rather happened incrementally. The network generated simulated songs for thousands of trials. During each trial, it received reinforcement signals from the critic, which compared the simulated song with the song template. Whenever the match between the two was good, the critic sent a positive reinforcement signal. This happened many times per song because the critic continuously evaluated the song throughout each trial. Because the threshold for good performance was set by the average over recent trials, the threshold became higher as performance improved.
The "learning curve" of Fig. 4B is a graph of song error versus the number of trials. This error is the mismatch in pitch and amplitude between the simulated song and the real song. It starts high and then converges to a low value within about 2,000 iterations. Is this convergence time fast or slow? It has been estimated that a juvenile zebra finch may practice its song up to 100,000 times over the course of learning (Johnson et al. 2002). Therefore the model learns relatively quickly, compared with a real zebra finch. As will be seen later, the learning time of the model may change if the properties of the reinforcement signal are changed.
After convergence there is a residual error that does not vanish. The residual could arise from several sources. First, the network may have converged to the vicinity of a local minimum of the error, rather than a global minimum. Second, even a global minimum might have nonzero error. Third, even if the network converged to a global minimum, such convergence would be probabilistic. As long as the synaptic strengths are governed by the learning rules, they would continue to fluctuate around their optimal values. Fourth, even if the synaptic strengths were frozen at their optimal values, the simulated song would fluctuate randomly because the network continues to be perturbed by random synaptic input from LMAN from trial to trial.
RA size
If many (N) neurons collectively drive the output of a network, the share of any one neuron's activity in the total output and reinforcement is small (∼1/N). If all neurons fluctuate independently and simultaneously, any one neuron's contribution to the overall output fluctuations is swamped by all other neural contributions. A neuron would have to correlate its own activity with the output for many trials to determine the sign of its effect on the output. Therefore when learning is based on the correlation of individual neural fluctuations with a global reinforcement signal in large networks, learning may be expected to be quite slow.
In the simulations of Fig. 4, our model learned song substantially faster than a real zebra finch. However, the model network was composed of just 720 HVC neurons and 200 RA neurons. The HVC and RA of a real zebra finch are estimated to contain about 20,000 HVC neurons and 8,000 RA neurons, or 10–100 times more neurons and 500–5,000 times more synapses than in the model. Each RA neuron receives parallel, independent, time-varying perturbations from LMAN. RA neural activities sum to drive the motor pools; thus correlations between conductance fluctuations in a single RA neuron with the reinforcement signal diminish with increasing RA size. What is the learning time in a realistically large birdsong network? Unfortunately, numerical simulations of a model network of this size are currently impractical. Instead, we have taken the approach of varying the size of HVC and RA in our model to empirically determine how learning time scales with network size. This allows us to extrapolate learning time for network sizes larger than we can simulate.
We performed numerical simulations to investigate the dependence of learning time on RA size. Figure 5B shows that the learning curve changes little even if RA size is increased by a factor of 4.
These results may be surprising, when compared with theoretical studies indicating that the learning time for a feedforward network can scale linearly with its size, if trained by a reinforcement learning algorithm (Cauwenberghs 1993; Werfel et al. 2003).
Why is it that learning does not slow down with increasing RA size? In the birdsong network, individual RA–output (and thus RA–reinforcement) correlations do diminish with RA size. If the learning problem depended on each HVC–RA synapse attaining a specific desired value, learning would indeed have slowed down considerably. However, what matters for song production is the summed output from several RA neurons to each motor pool, not the individual contribution of each RA neuron. Consequently there are many configurations of synaptic strengths that will lead to good performance. In other words, the model network is a degenerate or redundant representation. Because it is so large, it has more neurons than necessary to perform the task. Thus although there are more synaptic strengths to learn in a large network, each can be learned more sloppily. These two effects compensate for each other, so that learning time is unchanged.
HVC size and song duration
In Fig. 5A, the dependence of learning time on HVC size is addressed. In our model, HVC size is equivalent to song duration. This is because each HVC neuron bursts only once during song [in accord with experimental findings (Hahnloser et al. 2002)], and a fixed number of HVC neurons is assumed to be active at any given moment. Therefore we have scaled song duration in tandem with HVC size.
Learning curves for two model networks are shown. The first network has 720 HVC neurons and is trained on 300 ms of song. The second network has 180 HVC neurons and is trained on 75 ms of song. The learning curves look about the same. This suggests that learning time is independent of HVC size/song duration.
What is the reason for this independence? Because each HVC neuron bursts only once during song, moments of song separated by more than about 10 ms are driven by completely separate sets of HVC neurons. Further, the critic evaluates each moment of song, delivering its evaluation continuously in time. This means that the learning of each moment of song occurs independently and in parallel. As a result, when measured in number of trials, learning time has no dependence on HVC size/song duration.10 If the critic delivered a single evaluation for the whole song rather than separate evaluations for each moment,11 then we expect that learning time would become dependent on song length. However, we find it plausible that the critic compares song output with the template continuously throughout time.
Analytical and numerical results in a reduced model of the birdsong network (APPENDIX B) are consistent with the full spiking model network results. In the reduced model as in the spiking network, increasing song length/HVC size has no effect on learning time. This is true only if reinforcement is delivered on-line, and if HVC activity is unary, with each neuron firing exactly once per motif. We find that if the encoding of different time steps in HVC is statistically orthogonal but not unary, learning time will grow linearly with the number of HVC neurons.
Number of muscle groups or output degrees of freedom
It is difficult to systematically vary the complexity or dimensionality of the model sound generator, which uses two network-driven control variables (pitch and amplitude) to produce output sounds that can resemble a recorded finch song. We would encounter the same difficulty if the sound generator were constructed from physics-based parameterized models of the songbird syrinx (Elemans et al. 2004; Fletcher 1988; Titze 1988). Instead, in a reduced model of the song network (APPENDIX B), we can systematically vary the number of output units that independently contribute to performance and thus to the reinforcement, and analytically compute the dependence of learning time on the number of output degrees of freedom.
In this complementary approach (APPENDIX B), we find that learning