{"title": "Feedforward Learning of Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2564, "page_last": 2572, "abstract": "We develop a biologically-plausible learning rule that provably converges to the class means of general mixture models. This rule generalizes the classical BCM neural rule within a tensor framework, substantially increasing the generality of the learning problem it solves. It achieves this by incorporating triplets of samples from the mixtures, which provides a novel information processing interpretation to spike-timing-dependent plasticity. We provide both proofs of convergence, and a close fit to experimental data on STDP.", "full_text": "Feedforward Learning of Mixture Models\n\nMatthew Lawlor\u2217\nApplied Math\nYale University\n\nNew Haven, CT 06520\n\nmflawlor@gmail.com\n\nSteven W. Zucker\nComputer Science\nYale University\n\nNew Haven, CT 06520\n\nzucker@cs.yale.edu\n\nAbstract\n\nWe develop a biologically-plausible learning rule that provably converges to the\nclass means of general mixture models. This rule generalizes the classical BCM\nneural rule within a tensor framework, substantially increasing the generality of\nthe learning problem it solves. It achieves this by incorporating triplets of samples\nfrom the mixtures, which provides a novel information processing interpretation\nto spike-timing-dependent plasticity. We provide both proofs of convergence, and\na close \ufb01t to experimental data on STDP.\n\n1\n\nIntroduction\n\nSpectral tensor methods and tensor decomposition are emerging themes in machine learning, but\nthey remain global rather than \u201conline.\u201d While incremental (online) learning can be useful for ap-\nplications, it is essential for neurobiology. Error back propagation does operate incrementally, but\nits neurobiological relevance remains a question for debate. 
We introduce a triplet learning rule\nfor mixture distributions based on a tensor formulation of the BCM biological learning rule. It is\nimplemented in a feedforward fashion, removing the need for backpropagation of error signals.\nThe triplet requirement is natural biologically. Informally imagine your eyes microsaccading during\na \ufb01xation, so that a tiny image fragment is \u201csampled\u201d repeatedly until the next \ufb01xation. Viewed\nfrom visual cortex, edge selective neurons will \ufb01re repeatedly.\nImportantly, they exhibit strong\nstatistical dependencies due to the geometry of objects and their relationships in the world. \u201cHidden\u201d\ninformation such as edge curvatures, the presence of textures, and lighting discontinuities all affect\nthe probability distribution of \ufb01ring rates among orientation selective neurons, leading to complex\nstatistical interdependencies between neurons.\nLatent variable models are powerful tools in this context. They formalize the idea that highly coupled\nrandom variables can be simply explained by a small number of hidden causes. Conditioned on these\ncauses, the input distribution should be simple. For example, while the joint distribution of edges in\na small patch of a scene might be quite complex, the distribution conditioned on the presence of a\ncurved object at a particular location might be comparatively simple [14]. The speci\ufb01c question is\nwhether brains can learn these mixture models, and how.\nExample: Imagine a stimulus space of K inputs. These could be images of edges at particular\norientations, or audio tones at K frequencies. These stimuli are fed into a network of n Linear-\nNonlinear Poisson (LNP) spiking neurons. 
Let rij denote the firing rate of neuron i to stimulus j. Assuming the stimuli are drawn independently with probability \u03b1k, the number of spikes d in an interval where a single stimulus is shown is distributed according to a mixture model\n\nP(d) = \u2211k \u03b1k Pk(d)\n\nwhere Pk(d) is a vector of independent Poisson distributions, and the rate parameter of the ith component is rik. We seek a filter that responds (in expectation) to one and only one stimulus. To do this, we must learn a set of weights that are orthogonal to all but one of the vectors of rates r\u00b7j. Each rate vector corresponds to the mean of one of the mixtures. Our problem is thus to learn the means of mixtures. We will demonstrate that this can be done non-parametrically over a broad class of firing patterns, not just Poisson spiking neurons.\n(\u2217Now at Google Inc.)\nAlthough fitting mixture models can be exponentially hard, under a certain multiview assumption, non-parametric estimation of mixture means can be done by tensor decomposition [2][1]. This multiview assumption requires access to at least 3 independent copies of the samples; i.e., multiple samples drawn from the same mixture component. For the LNP example above, this multiview assumption requires only that we have access to the number of spikes in three disjoint intervals, while the stimulus remains constant. After these intervals, the stimulus is free to change \u2013 in vision, say, after a saccade \u2013 after which point another sample triple is taken.\nOur main result is that, with a slight modification of the classical Bienenstock-Cooper-Munro [5] synaptic update rule, a neuron can perform a tensor decomposition of the input data. 
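To make the multiview requirement concrete, here is a minimal sketch (in Python, with arbitrary illustrative rates and probabilities, not values from the paper) of sampling spike-count triplets from such an LNP mixture: a hidden stimulus k is drawn once, and three conditionally independent Poisson count vectors share its rate vector.

```python
import numpy as np

rng = np.random.default_rng(0)

n, K = 5, 5                            # neurons, stimuli (illustrative sizes)
R = rng.uniform(1.0, 10.0, (n, K))     # R[i, j]: rate of neuron i to stimulus j
alpha = np.ones(K) / K                 # stimulus probabilities

def sample_triplet():
    """Draw one stimulus k, then three conditionally independent
    spike-count vectors from the same mixture component P_k."""
    k = rng.choice(K, p=alpha)
    return k, rng.poisson(R[:, k]), rng.poisson(R[:, k]), rng.poisson(R[:, k])

# Conditioned on k, counts average to the rate vector R[:, k] (the mixture mean).
k0 = 2
samples = np.array([rng.poisson(R[:, k0]) for _ in range(20000)])
```

Averaging many conditional samples recovers the class mean R[:, k0], which is exactly the quantity the learning rule is asked to find.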
By incorporating the\ninteractions between input triplets, our online learning rule can provably learn the mixture means\nunder an extremely broad class of mixture distributions and noise models.\n(The classical BCM\nlearning rule will not converge properly in the presence of noise.) Speci\ufb01cally we show how the\nclassical BCM neuron performs gradient ascent in a tensor objective function, when the data con-\nsists of discrete input vectors, and how our modi\ufb01ed rule converges when the data are drawn from a\ngeneral mixture model.\nThe multiview requirement has an intriguing implication for neuroscience. Since spikes arrive in\nwaves, and spike trains matter for learning [9], our model suggests that the waves of spikes arriv-\ning during adjacent epochs in time provide multiple samples of a given stimulus. This provides an\nunusual information processing interpretation for the functional role of spike trains. To realize it\nfully, we point out that classical BCM can be implemented via spike timing dependent plasticity\n[17][10][6][18]. However, most of these approaches require much stronger distributional assump-\ntions on the input data (generally Poisson), or learn a much simpler decomposition of the data than\nour algorithm. Other, Bayesian methods [16], require the computation of a posterior distribution\nwhich requires an implausible normalization step. Our learning rule successfully avoids these is-\nsues, and has provable guarantees of convergence to the true mixture means. At the end of this\npaper we show how our rule predicts pair and triple spike timing dependent plasticity data.\n\n2 Tensor Notation\n\nLet \u2297 denote the tensor product. We denote application of a k-tensor to k vectors by T (w1, ..., wk),\nso in the simple case where T = v1 \u2297 ... 
\u2297 vk, we have\n\nT(w1, ..., wk) = \u220fj \u27e8vj, wj\u27e9\n\nWe further denote the application of a k-tensor to k matrices by T(M1, ..., Mk), where\n\nT(M1, ..., Mk)i1,...,ik = \u2211j1,...,jk Tj1,...,jk [M1]j1,i1 \u00b7\u00b7\u00b7 [Mk]jk,ik\n\nThus if T is a symmetric 2-tensor, T(M1, M2) = M1T T M2 with ordinary matrix multiplication. Similarly, T(v1, v2) = v1T T v2.\nWe say that T has an orthogonal tensor decomposition if\n\nT = \u2211k \u03b1k vk \u2297 vk \u2297 ... \u2297 vk and \u27e8vi, vj\u27e9 = \u03b4ij\n\n3 Connection Between BCM Neuron and Tensor Decompositions\n\nThe BCM learning rule was introduced in 1982 in part to correct failings of the classical Hebbian learning rule [5]. The Hebbian learning rule [11] is one of the simplest and oldest learning rules. It posits that the selectivity of a neuron to input i, mt(i), is increased in proportion to the post-synaptic activity of that neuron, ct = \u27e8mt\u22121, dt\u27e9, where m is a vector of synaptic weights:\n\nmt \u2212 mt\u22121 = \u03b3tctdt\n\nThis learning rule will become increasingly correlated with its input. As formulated, this rule does not converge for most input, as \u2016m\u2016 \u2192 \u221e. In addition, in the presence of multiple inputs the Hebbian learning rule will always converge to an \u201caverage\u201d of the inputs, rather than becoming selective to one particular input. It is possible to choose a normalization of m such that m will converge to the first eigenvector of the input data. The BCM rule tries to correct for the lack of selectivity, and for the stabilization problems. 
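The divergence and lack of selectivity of the plain Hebbian rule are easy to reproduce numerically; the following sketch (step size, patterns, and iteration count are arbitrary choices) shows \u2016m\u2016 growing without bound while the weight stays correlated with every input rather than one.

```python
import numpy as np

rng = np.random.default_rng(1)

D = np.array([[1.0, 0.0],
              [0.8, 0.6]])             # two input patterns, presented at random
m = np.array([0.1, 0.1])
norms = []
for t in range(2000):
    d = D[rng.integers(2)]
    c = m @ d                          # post-synaptic activity
    m = m + 0.01 * c * d               # Hebbian update: no threshold, no decay
    norms.append(np.linalg.norm(m))
```

The norm grows roughly like exp(γλt) for the top eigenvalue λ of the input correlation matrix, and the final m responds positively to both patterns, i.e. it is an "average", not a selective filter.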
Like the Hebbian learning rule, it always updates its weights in the direction of the input; however, it also has a sliding threshold that controls the magnitude and sign of this weight update.\nThe original formulation of the BCM rule is as follows: Let c be the post-synaptic firing rate, d \u2208 RN be the vector of presynaptic firing rates, and m be the vector of synaptic weights. Then the BCM synaptic modification rule is\n\nc = \u27e8m, d\u27e9\n\u02d9m = \u03c6(c, \u03b8)d\n\n\u03c6 is a non-linear function of the firing rate, and \u03b8 is a sliding threshold that increases as a superlinear function of the average firing rate.\nThere are many different formulations of the BCM rule. The primary features that are required are: \u03c6(c, \u03b8) is convex in c, \u03c6(0, \u03b8) = 0, \u03c6(\u03b8, \u03b8) = 0, and \u03b8 is a super-linear function of E[c]. These properties guarantee that the BCM learning rule will not grow without bound. There have been many variants of this rule. One of the most theoretically well analyzed variants is the Intrator and Cooper model [12], which has the following form for \u03c6 and \u03b8:\n\n\u03c6(c, \u03b8) = c(c \u2212 \u03b8) with \u03b8 = E[c2]\n\nDefinition 3.1 (BCM Update Rule). With the Intrator and Cooper definition, the BCM rule is defined as\n\nmt = mt\u22121 + \u03b3tct(ct \u2212 \u03b8t\u22121)dt  (1)\n\nwhere ct = \u27e8mt\u22121, dt\u27e9 and \u03b8 = E[c2]. \u03b3t is a sequence of positive step sizes with the property that \u2211t \u03b3t \u2192 \u221e and \u2211t \u03b3t2 < \u221e.\nThe traditional application of this rule is a system where the input d is drawn from linearly independent vectors {d1, ..., dK} with probabilities \u03b11, ..., \u03b1K, with K = N, the dimension of the space.\nThese choices are quite convenient because they lead to the following objective function formulation of the synaptic update rule:\n\nR(m) = (1/3) E[\u27e8m, d\u27e93] \u2212 (1/4) E[\u27e8m, d\u27e92]2\n\nThus,\n\n\u2207R = E[\u27e8m, d\u27e92 d \u2212 E[\u27e8m, d\u27e92]\u27e8m, d\u27e9 d] = E[c(c \u2212 \u03b8)d] = E[\u03c6(c, \u03b8)d]\n\nSo in expectation, the BCM rule performs gradient ascent in R(m). For random, discrete input this rule would then be a form of stochastic gradient ascent.\nWith this model, we observe that the objective function can be rewritten in tensor notation. Note that this input model can be seen as a kind of degenerate mixture model. This objective function can be written as a tensor objective function, by noting the following:\n\nT = \u2211k \u03b1k dk \u2297 dk \u2297 dk\nM = \u2211k \u03b1k dk \u2297 dk\nR(m) = (1/3) T(m, m, m) \u2212 (1/4) M(m, m)2  (2)\n\nFor completeness, we present a proof that the stable points of the expected BCM update are selective for only one of the data vectors.\nThe stable points of the expected update occur when E[\u02d9m] = 0. Let ck = \u27e8m, dk\u27e9 and \u03c6k = \u03c6(ck, \u03b8). Let c = [c1, . . . , cK]T, \u03a6 = [\u03c61, . . . , \u03c6K]T, DT = [d1|\u00b7\u00b7\u00b7|dK], and P = diag(\u03b1).\nTheorem 3.2 (Intrator 1992). Let K = N, let each dk be linearly independent, and let the \u03b1k > 0 be distinct. Then stable points (in the sense of Lyapunov) of the expected update \u02d9m = \u2207R occur when c = \u03b1k\u22121ek or m = \u03b1k\u22121D\u22121ek. ek is the unit basis vector, so there is activity for only one stimulus.\nProof. E[\u02d9m] = DT P\u03a6, which is 0 only when \u03a6 = 0. Note \u03b8 = \u2211k \u03b1kck2. \u03c6k = 0 if ck = 0 or ck = \u03b8. Let S+ = {k : ck \u2260 0} and S\u2212 = {k : ck = 0}. Then for all k \u2208 S+, ck = \u03b2S+, where\n\n\u03b2S+ \u2212 \u03b2S+2 \u2211k\u2208S+ \u03b1k = 0, i.e. \u03b2S+ = (\u2211k\u2208S+ \u03b1k)\u22121\n\nTherefore the solutions of the BCM learning rule are c = 1S+\u03b2S+, for all subsets S+ \u2282 {1, . . . , K}. We now need to check which solutions are stable. The stable points (in the sense of Lyapunov) are points where the matrix\n\nH = \u2202E[\u02d9m]/\u2202m = DT P (\u2202\u03a6/\u2202c)(\u2202c/\u2202m) = DT P (\u2202\u03a6/\u2202c) D\n\nis negative semidefinite.\nLet S be an index set S \u2282 {1, . . . , n}. We will use the following notation for the diagonal matrix IS:\n\n(IS)ii = 1 if i \u2208 S, 0 otherwise  (3)\n\nSo IS + ISc = I, and eieiT = I{i}. A quick calculation shows\n\n(\u2202\u03c6i/\u2202cj) = \u03b2S+IS+ \u2212 \u03b2S+IS\u2212 \u2212 2\u03b2S+2 diag(\u03b1) 1S+1S+T\n\nThis is negative semidefinite iff A = IS+ \u2212 2\u03b2S+ diag(\u03b1) 1S+1S+T is negative semidefinite. Assume a non-degeneracy of the probabilities \u03b1, and assume |S+| > 1. Let j = arg mink\u2208S+ \u03b1k. Then \u03b2S+\u03b1j < 1/2, so A is not negative semi-definite. However, if |S+| = 1 then A = \u2212IS+, so the stable points occur when c = (1/\u03b1i)ei.\nThe triplet version of BCM can be viewed as a modification of the classical BCM rule which allows it to converge in the presence of zero-mean noise. 
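The dynamics of Definition 3.1 can be simulated directly. The sketch below is a simplification, not the paper's experiment: it uses a small constant step size instead of the decaying \u03b3t, an exponential moving average of c\u00b2 as a running stand-in for \u03b8 = E[c\u00b2], and arbitrary input vectors and probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)

# K = N linearly independent inputs with distinct probabilities (the Theorem 3.2 setting).
D = np.array([[2.0, 0.3, 0.1],
              [0.2, 1.5, 0.4],
              [0.1, 0.2, 1.8]])        # rows are the input vectors d_k
alpha = np.array([0.5, 0.3, 0.2])

m = 0.1 * rng.standard_normal(3)
theta = 0.0
for t in range(150000):
    d = D[rng.choice(3, p=alpha)]
    c = m @ d
    theta += 0.05 * (c * c - theta)            # running estimate of E[c^2]
    m = m + 0.005 * c * (c - theta) * d        # BCM update (Definition 3.1)

responses = D @ m                              # response to each input vector
```

At a selective fixed point one response sits near 1/\u03b1_k and the rest near zero, matching the stable points computed above.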
This indicates that the stable solutions of this learning rule are selective for only one data vector, dk.\nBuilding off of the work of [2], we will use this characterization of the objective function to build a triplet BCM update rule which will converge for general mixtures, not just discrete data points.\n\nFigure 1: (a) Geometry of stable solutions. Each stable solution is selective in expectation for a single mixture. Note that the classical BCM rule will not converge to these values in the presence of noise. (b) Noise response of triplet BCM update rule vs BCM update. Input data was a mixture of Gaussians with standard deviation \u03c3. The selectivity of the triplet BCM rule remains unchanged in the presence of noise.\n\n4 Triplet BCM Rule\n\nWe now show that by modifying the update rule to incorporate information from triplets of input vectors, the generality of the input data can be dramatically increased. Our new BCM rule will learn selectivity for arbitrary mixture distributions, and learn weights which in expectation are selective for only one mixture component. Assume that\n\nP(d) = \u2211k \u03b1k Pk(d)\n\nwhere EPk[d] = dk. For example, the data could be a mixture of axis-aligned Gaussians, a mixture of independent Poisson variables, or mixtures of independent Bernoulli random variables, to name a few. We also require EPk[\u2016d\u20162] < \u221e. We emphasize that we do not require our data to come from any parametric distribution.\nWe interpret k to be a latent variable that signals the hidden cause of the underlying input distribution, with distribution Pk. Critically, we assume that the hidden variable k changes slowly compared to the inter-spike period of the neuron. In particular, we need at least 3 samples from each Pk. This corresponds to the multi-view assumption of [2]. 
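Numerically, the multi-view assumption is what lets simple moment averages over triplets recover tensor products of the mixture means, with the conditional noise averaging out. A small sketch (hypothetical Poisson components; the means and weights are made up for illustration): the empirical third moment over triplets approaches \u2211k \u03b1k dk\u2297dk\u2297dk.

```python
import numpy as np

rng = np.random.default_rng(3)

means = np.array([[5.0, 1.0, 0.5],
                  [0.5, 4.0, 1.0],
                  [1.0, 0.5, 6.0]])    # rows: mixture means d_k (illustrative)
alpha = np.array([0.3, 0.3, 0.4])
N = 60000

ks = rng.choice(3, size=N, p=alpha)    # hidden component, fixed within each triplet
lam = means[ks]
D1, D2, D3 = (rng.poisson(lam).astype(float) for _ in range(3))  # three views

# Raw moments across triplets of conditionally independent views.
T_hat = np.einsum('ni,nj,nk->ijk', D1, D2, D3) / N
M_hat = np.einsum('ni,nj->ij', D1, D2) / N

# Targets built from the class means alone: the per-sample noise has averaged out.
T_true = np.einsum('k,ki,kj,kl->ijl', alpha, means, means, means)
M_true = np.einsum('k,ki,kj->ij', alpha, means, means)
```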
A particularly relevant model meeting this assumption is that of spike counts in disjoint intervals under a Poisson process, with a discrete, time varying rate parameter. For the purpose of this paper, we assume the number of mixed distributions, k, is equal to the number of dimensions, n; however, it is possible to relax this to k < n.\nLet {d1, d2, d3} be a triplet of independent copies from some Pk(d), i.e. each is drawn from the same mixture. It is critical to note that if {d1, d2, d3} are not drawn from the same class, this update will not converge to the global maximum. Numerical experiments show this assumption can be violated somewhat without severe changes to the fixed points of the algorithm. Our sample is then a sequence of triplets, each triplet drawn from the same latent distribution. Let ci = \u27e8di, m\u27e9. With these independent triples, we note that the tensors T and M from equation (2) can be written as moments of the independent triplets\n\nT = E[d1 \u2297 d2 \u2297 d3]\nM = E[d1 \u2297 d2]\nR(m) = (1/3) T(m, m, m) \u2212 (1/4) M(m, m)2\n\nThis is precisely the same objective function optimized by the classical BCM update, with the conditional means of the mixture distributions taking the place of discrete data points. With access to independent triplets, selectivity for significantly richer input distributions can be learned.\nAs with classical BCM, we can perform gradient ascent in this objective function, which leads to the expected update\n\nE[\u2207R] = E[c1c2d3 + (c1d2 + c2d1)(c3 \u2212 2\u03b8)]\n\nwhere \u03b8 = E[c1c2]. This update is rather complicated, and couples pre and post synaptic firing rates across multiple time intervals. 
Since each ci and di are identically distributed, this expectation is equal to\n\nE[c2(c3 \u2212 \u03b8)d1]\n\nwhich suggests a much simpler update. This ordering was chosen to match the spike timing dependency of synaptic modification. This update depends on the presynaptic input, and the postsynaptic excitation in two disjoint time periods.\nDefinition 4.1 (Full-rank Triplet BCM). The full-rank Triplet BCM update rule is:\n\nmt = \u03c0(mt\u22121 + \u03b3t\u03c6(c2, c3, \u03b8t\u22121)d1)  (4)\n\nwhere \u03c6(c2, c3, \u03b8) = c2(c3 \u2212 \u03b8), and the step size \u03b3t obeys \u2211t \u03b3t \u2192 \u221e and \u2211t \u03b3t2 < \u221e. \u03c0 is a projection into an arbitrarily large compact ball, which is needed for technical reasons to guarantee convergence.\n\n5 Stochastic Approximation\n\nHaving found the stable points of the expected update for BCM and triplet BCM, we now turn to a proof of convergence for the stochastic update generated by mixture models. For this, we turn to results from the theory of stochastic approximation.\nWe will decompose our update into two parts: the expected update, and the (random) deviation. This deviation will be an L2 bounded martingale, while the expected update will be an ODE with the previously calculated stable points. Since the expected update is the gradient of an objective function R, the Lyapunov functions required for the stability analysis are simply this objective function.\nThe decomposition of the triplet BCM stochastic process is as follows:\n\nmt \u2212 mt\u22121 = \u03b3t\u03c6(c2, c3, \u03b8t\u22121)d1 = \u03b3tE[\u03c6(c2, c3, \u03b8t\u22121)d1] + \u03b3t(\u03c6(c2, c3, \u03b8t\u22121)d1 \u2212 E[\u03c6(c2, c3, \u03b8t\u22121)d1]) = \u03b3th(mt) \u2212 \u03b3t\u03b7t\n\nHere, h(mt) is the deterministic expected update, and \u03b7t is a martingale. All our expectations are taken with respect to triplets of input data. 
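A minimal simulation of the full-rank Triplet BCM rule of Definition 4.1, with simplifying assumptions: a small constant step size rather than the decaying \u03b3t, an exponential moving average of c2c3 standing in for \u03b8 = E[c1c2], and made-up mixture parameters. Unlike classical BCM, the triplet rule keeps its selectivity under zero-mean noise on the inputs.

```python
import numpy as np

rng = np.random.default_rng(4)

means = np.array([[2.0, 0.3, 0.1],
                  [0.2, 1.5, 0.4],
                  [0.1, 0.2, 1.8]])    # mixture means d_k (rows)
alpha = np.array([0.5, 0.3, 0.2])
sigma = 0.3                            # zero-mean Gaussian noise on every sample

def project(m, radius=100.0):
    """Projection onto a large compact ball (the technical device pi)."""
    nrm = np.linalg.norm(m)
    return m if nrm <= radius else m * (radius / nrm)

m = 0.1 * rng.standard_normal(3)
theta = 0.0
for t in range(150000):
    k = rng.choice(3, p=alpha)
    # three independent views of the same hidden component
    d1, d2, d3 = (means[k] + sigma * rng.standard_normal(3) for _ in range(3))
    c2, c3 = m @ d2, m @ d3
    theta += 0.05 * (c2 * c3 - theta)                  # estimate of E[c1 c2]
    m = project(m + 0.005 * c2 * (c3 - theta) * d1)    # Definition 4.1

responses = means @ m      # expected response to each mixture component
```

In expectation one response approaches 1/\u03b1_k and the others approach zero, i.e. the weight becomes selective for a single mixture mean despite the noise.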
The decomposition for classical BCM is similar. This is the Doob decomposition [8] of the sequence. Using a theorem of Delyon [7], we will show that our triplet BCM algorithm will converge with probability 1 to the stable points of the expected update. As was shown previously, these stable points are selective for one and only one mixture component in expectation.\nTheorem 5.1. For the full rank case, the projected update converges w.p. 1 to the zeros of \u2207R.\nProof. See supplementary material, or an extended discussion in a forthcoming arXiv preprint [13].\n\n6 Triplet BCM Explains STDP Up to Spike Triplets\n\nBiophysically, synaptic efficacy in the brain is more closely modeled by spike timing dependent plasticity (STDP): it depends precisely on the interval between pre- and post-synaptic spikes. Initial research on spike pairs [15, 3] showed that a presynaptic spike followed in close succession by a postsynaptic spike tended to strengthen a synapse, while the reverse timing weakened it. Later work on natural spike chains [9], triplets of spikes [4, 19], and quadruplets has shown interaction effects beyond pairs. Closest to ours, recent work by Pfister and Gerstner [17] suggested that a synaptic modification function depending only on spike triplets is sufficient to explain all current experimental data. Furthermore, their rule resembles a BCM learning rule when the pre- and post-synaptic firing distributions are independent Poisson.\nWe now demonstrate that our learning rule can model both the pairwise and triplet results from Pfister and Gerstner using a smaller number of free parameters and without the introduction of hidden leaky timing variables. Instead, we work directly with the pre- and post-synaptic voltages, and model the natural voltage decay during the falling phase of an action potential. 
Our (four) free variables are the voltage decay, which we set within reasonable biological limits; a bin width, controlling the distance between spiking triplet periods; \u03b8, our sliding voltage threshold; and an overall multiplicative constant. We emphasize that our model was not designed to fit these data; it was designed to learn selectivity for the multi-view mixture task. Spike timing dependence falls out as a natural consequence of our multi-view assumption.\n\nFigure 2: Fit of triplet BCM learning rule to synaptic strength STDP curve from [3]. Data points were recreated from [3]. Spike timing measures the time between postsynaptic and presynaptic spikes, tpost \u2212 tpre. A positive time means the presynaptic spike was followed by a postsynaptic spike.\n\nWe first model hippocampus data from Mu-ming Poo [3], who applied repeated electrical stimulation to the pre- and post-synaptic neurons in a pairing protocol within which the relative timing of the two spike chains was varied. After repeated stimulation at a fixed timing offset, the change in synaptic strength (postsynaptic current) was measured.\nWe take the average voltage in triplet intervals to be the measure of pre- and post-synaptic activity, and consider a one-dimensional version of our synaptic update:\n\n\u03b4m = Ac2(c3 \u2212 \u03b8)d1  (5)\n\nwhere c2 and c3 are the postsynaptic voltage averaged over the second and third time bins, and d1 is the presynaptic voltage averaged over the first time bin. We assume our pre and post synaptic voltages are governed by the differential equation\n\ndV/dt = \u2212\u03c4V  (6)\n\nsuch that, if t = sk where sk is the kth spike, V(t) \u2192 1. That is, the voltage is set to 1 at each spike time before decaying again.\nLet Vpre be the presynaptic voltage trace, and Vpost be the postsynaptic voltage trace. They are determined by the timing of pre- and post-synaptic spikes, which occur at r1, r2, . . . 
, rn for the presynaptic spikes, and o1, o2, . . . , om for the postsynaptic spikes.\nTo model the pairwise experiments, we let ri = r0 + iT where T = 1000 ms, a large time constant. Then oi = ri + \u03b4t, where \u03b4t is the spike timing. Let \u03b4b be the size of the bins. That is to say,\n\nc2(t) = \u222b[t\u2212\u03b4b/2, t+\u03b4b/2] Vpost(t\u2032)dt\u2032\nd1(t) = \u222b[t\u2212\u03b4b/2, t+\u03b4b/2] Vpre(t\u2032 + \u03b4b)dt\u2032\nc3(t) = \u222b[t\u2212\u03b4b/2, t+\u03b4b/2] Vpost(t\u2032 \u2212 \u03b4b)dt\u2032\nVpost(t) = Vpre(t \u2212 \u03b4t)\n\nThen the overall synaptic modification is given by\n\n\u222b A c2(t)(c3(t) \u2212 \u03b8)d1(t)dt\n\nWe fit A, \u03c4, \u03b8, and the bin size of integration. Recall that the sliding threshold \u03b8 is a function of the expected firing rate of the neuron. Therefore we would not expect it to be a fixed constant. Instead, it should vary slowly over a time period much longer than the data sampling period. For the purpose of these experiments it would be at an unknown level that depends on the history of neural activity. See figure 2 for the fit for Mu-ming Poo\u2019s synaptic modification data.\nFroemke and Dan also investigated higher order spike chains, and found that two spikes in short succession did not simply multiply in their effects. This would be the expected result if the spike timing dependence treated each pair in a triplet as an independent event. Instead, they found that a presynaptic spike followed by two postsynaptic spikes resulted in significantly less excitation than expected if the two pairs were treated as independent events. They posited that repeated spikes interacted suppressively, and fit a model based on that hypothesis. 
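The pairing computation above can be sketched numerically. Everything below is an assumption-laden illustration, not the paper's fitted model: the bin offsets are implemented with array shifts, and \u03b8, \u03b4b, \u03c4, and the time grid are placeholder values.

```python
import numpy as np

def voltage(ts, spikes, tau=8.0):
    """V jumps to 1 at each spike time and decays as exp(-(t - s)/tau)."""
    v = np.zeros_like(ts)
    for s in spikes:
        v = np.maximum(v, np.where(ts >= s, np.exp(-(ts - s) / tau), 0.0))
    return v

def delta_m(dt_spike, theta=0.5, A=1.0, db=2.0, tau=8.0):
    """Pairwise protocol: one presynaptic spike at 0, one postsynaptic at dt_spike.
    Bin-shifted voltages enter the one-dimensional update A*c2*(c3 - theta)*d1."""
    ts = np.linspace(-60.0, 60.0, 2401)
    h = ts[1] - ts[0]
    vpre = voltage(ts, [0.0], tau)
    vpost = voltage(ts, [dt_spike], tau)
    shift = int(round(db / h))
    d1 = np.roll(vpre, -shift)         # presynaptic voltage, offset by one bin
    c2 = vpost
    c3 = np.roll(vpost, shift)         # postsynaptic voltage, one bin earlier
    return float(A * np.sum(c2 * (c3 - theta) * d1) * h)

dm_prepost = delta_m(+10.0)   # pre followed by post
dm_postpre = delta_m(-10.0)   # post followed by pre
```

Sweeping dt_spike over a range of timings traces out a modification curve whose shape depends on \u03b8 and the bin convention; fitting those values is what produces the curve in figure 2.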
They performed two triplet experiments, with pre-pre-post triplets and pre-post-post triplets. Results of their experiment along with the predictions based on our model are presented in figure 3.\n\nFigure 3: Measured excitation and inhibition for spike triplets from Froemke and Dan are demarcated in circles and triangles. A red circle or triangle indicates excitation, while a blue circle or triangle indicates inhibition. The predicted results from our model are indicated by the background color. Numerical results for our model, with boundaries for the Froemke and Dan model, are reproduced. The left figure is pairs of presynaptic spikes and a single post-synaptic spike. The right figure is pairs of postsynaptic spikes and a presynaptic spike. For each figure, t1 measures the time between the first paired spike and the singleton spike, with the convention that each t is positive if the presynaptic spike happens before the postsynaptic spike. See paired STDP experiments for our spiking model. For the left figure, \u03b8 = .65, our bin width was 2 ms, and our spike voltage decay rate \u03c4 = 8 ms. For the right figure, \u03b8 = .45. Red is excitatory, blue is inhibitory, white is no modification. A positive t indicates a presynaptic spike occurred before a postsynaptic spike.\n\n7 Conclusion\n\nWe introduced a modified formulation of the classical BCM neural update rule. This update rule drives the synaptic weights toward the components of a tensor decomposition of the input data. By further modifying the update to incorporate information from triplets of input data, this tensor decomposition can learn the mixture means for a broad class of mixture distributions. Unlike other methods to fit mixture models, we incorporate a multiview assumption that allows us to learn asymptotically exact mixture means, rather than local maxima of a similarity measure. 
This is in stark contrast to EM and other gradient ascent based methods, which have limited guarantees about the quality of their results. Conceptually, our model suggests a different view of spike waves during adjacent time epochs: they provide multiple independent samples of the presynaptic \u201cimage.\u201d\nDue to size constraints, this paper has skipped some details, particularly in the experimental sections. More detailed explanations will be provided in future publications.\nResearch supported by NSF, NIH, The Paul Allen Foundation, and The Simons Foundation.\n\nReferences\n[1] Animashree Anandkumar, Dean P Foster, Daniel Hsu, Sham M Kakade, and Yi-Kai Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. CoRR, abs/1204.6703, 2012.\n[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.\n[3] Guo-qiang Bi and Mu-ming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of Neuroscience, 18(24):10464\u201310472, 1998.\n[4] Guo-Qiang Bi and Huai-Xing Wang. Temporal asymmetry in spike timing-dependent synaptic plasticity. Physiology & Behavior, 77(4):551\u2013555, 2002.\n[5] Elie L Bienenstock, Leon N Cooper, and Paul W Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience, 2(1):32\u201348, 1982.\n[6] Natalia Caporale and Yang Dan. Spike timing-dependent plasticity: a Hebbian learning rule. Annual Review of Neuroscience, 31:25\u201346, 2008.\n[7] Bernard Delyon. General results on the convergence of stochastic algorithms. Automatic Control, IEEE Transactions on, 41(9):1245\u20131255, 1996.\n[8] Joseph L Doob. 
Stochastic processes, volume 101. New York: Wiley, 1953.\n[9] Robert C Froemke and Yang Dan. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416(6879):433\u2013438, 2002.\n[10] Julijana Gjorgjieva, Claudia Clopath, Juliette Audet, and Jean-Pascal Pfister. A triplet spike-timing\u2013dependent plasticity model generalizes the Bienenstock\u2013Cooper\u2013Munro rule to higher-order spatiotemporal correlations. Proceedings of the National Academy of Sciences, 108(48):19383\u201319388, 2011.\n[11] DO Hebb. The Organization of Behavior; A Neuropsychological Theory. 1949.\n[12] Nathan Intrator and Leon N Cooper. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5(1):3\u201317, 1992.\n[13] Matthew Lawlor and Steven W Zucker. An online algorithm for learning selectivity to mixture means. arXiv preprint, 2014.\n[14] Matthew Lawlor and Steven W Zucker. Third-order edge statistics: Contour continuation, curvature, and cortical connections. In Advances in Neural Information Processing Systems, pages 1763\u20131771, 2013.\n[15] WB Levy and O Steward. Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8(4):791\u2013797, 1983.\n[16] Bernhard Nessler, Michael Pfeiffer, and Wolfgang Maass. STDP enables spiking neurons to detect hidden causes of their inputs. In Advances in Neural Information Processing Systems, pages 1357\u20131365, 2009.\n[17] Jean-Pascal Pfister and Wulfram Gerstner. Triplets of spikes in a model of spike timing-dependent plasticity. The Journal of Neuroscience, 26(38):9673\u20139682, 2006.\n[18] Sen Song, Kenneth D Miller, and Larry F Abbott. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. 
Nature Neuroscience, 3(9):919\u2013926, 2000.\n[19] Huai-Xing Wang, Richard C Gerkin, David W Nauen, and Guo-Qiang Bi. Coactivation and timing-dependent integration of synaptic potentiation and depression. Nature Neuroscience, 8(2):187\u2013193, 2005.\n", "award": [], "sourceid": 1331, "authors": [{"given_name": "Matthew", "family_name": "Lawlor", "institution": "Yale University"}, {"given_name": "Steven", "family_name": "Zucker", "institution": "Yale University"}]}