
Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, Washington D.C., June 18-22, 1989, vol. 1, pp. 401-405 (IEEE Press, New York, 1989)
ADAPTIVE NETWORK FOR OPTIMAL LINEAR FEATURE EXTRACTION

Peter Földiák
Physiological Laboratory, Downing Street, Cambridge, CB2 3EG, England

Abstract: A network of highly interconnected linear neuron-like processing units and a simple, local, unsupervised rule for the modification of connection strengths between these units are proposed. After training the network on a high (m) dimensional distribution of input vectors, the lower (n) dimensional output will be a projection onto the subspace of the n largest principal components (the subspace spanned by the n eigenvectors with the largest eigenvalues of the input covariance matrix) and will maximize the mutual information between the input and the output in the same way as principal component analysis does. The purely local nature of the synaptic modification rule (simple Hebbian and anti-Hebbian) makes the implementation of the network easier, faster and biologically more plausible than rules depending on error propagation.

The task of any recognition system is to divide a set of high dimensional pattern vectors, like images or sounds, into a finite number of classes. Each of these classes corresponds to a region of this high dimensional space, and input vectors are classified according to which of these regions they lie in. The number of inputs necessary in real-world applications, like the number of pixels in a television image or the number of photoreceptor cells in the retina, is so large that the computation becomes extremely slow. Massively parallel, nonlinear, connectionist-type networks that have been proposed to solve such categorization problems also have training times that depend strongly on size, giving unacceptable times for networks capable of handling inputs of such high dimensionality. One way of reducing the number of variables could be simply to decrease the resolution of the input, but obviously a lot of information would be lost this way. There are well known methods for data compression and feature extraction which are better, and many of these rely on the details of a particular application. The features most useful in a given problem may depend on the desired output of the categorizer. In designing a more general-purpose classifier, however, the correct categories may not be known in advance, or information about the desired output may not be available at the physical location of the feature extractor. In this case the selection of features must be based on the criterion that they be most useful in most situations, independent of what exactly the desired input-output relationship will be. Selection of features can then rely only on regularities in the input data set. The quality of a set of features can be determined by information-theoretic measures: good features will reduce dimensionality with only a minimal loss of information [8, 10]. Among linear methods Principal Component Analysis (PCA) has such optimal properties.

PCA is a statistical method for extracting features from high dimensional data distributions [1]. It is a linear, orthogonal transformation (rotation) of a distribution into a coordinate system in which the coordinates are uncorrelated and the maximal amount of variance of the original distribution is concentrated on only a small number of coordinates. In this transformed space we can reduce the number of variables by taking only the coordinates on which the variance is concentrated, and minimize the loss of variance by leaving out the coordinates with small variances. The basis vectors of this new coordinate system are the eigenvectors of the covariance matrix, and the variances on these coordinates are the corresponding eigenvalues. The optimal projection from m to n dimensions given by PCA is therefore onto the subspace of the n eigenvectors with the largest eigenvalues. The information content of a normally distributed variable depends on its variance, so by maximizing variances PCA also maximizes the amount of information carried by the n variables.

Oja [2] analyzed a model consisting of a single linear unit with a local, Hebbian-type modification rule and showed that the unit extracts the largest principal component (the single eigenvector of the covariance matrix with the largest eigenvalue) of a stationary input vector sequence.

Figure 1. The output of Oja's linear unit trained on a stationary sequence of input vectors converges to the largest principal component.

The output of the unit, y, is the sum of the inputs weighted by the connection strengths q_j (Fig. 1):

y = Σ_{j=1}^{m} q_j x_j.

The unit is trained on vectors from an input distribution, and the rule for the modification of the connections during each training step is:

Δq_j = μ (x_j y - q_j y²),

where x_j y is the Hebbian term that makes the connection stronger when the input and the output are correlated, i.e. when they are active simultaneously. The second, weakening term, -q_j y², is necessary to prevent instability; it makes Σ_j q_j² approach 1. After training, the unit maximizes the variance of its output subject to the constraint that Σ_j q_j² = 1.
This, however, is not a full principal component analysis, because the unit finds only one component, the one with the largest variance. If there is more than one unit available for signalling, and they all follow the same rule with no noise added to the outputs, then their output values will of course be identical, which is no more useful than the value of a single unit. The transmitted information will be significantly less than what could be achieved by PCA. Several alternative algorithms have been proposed to change connection strengths in linear connectionist networks so as to extract more than one principal component of a distribution [3, 4, 11], but these are non-local rules: they rely on the calculation of errors and the backward propagation of values between layers, which makes their operation and implementation more complicated. In this paper a combination of two local rules, Oja's rule and decorrelation, will be shown to achieve the same goal.
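To make Oja's single-unit rule concrete, here is a minimal numerical sketch (not taken from the paper; the toy data, learning rate and sample count are arbitrary choices for illustration). It checks that the learned weight vector q converges, up to sign, to the leading eigenvector of the input covariance matrix, with its norm approaching 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy zero-mean Gaussian input distribution with a random covariance matrix.
m = 5                                        # input dimensionality (illustrative)
A = rng.normal(size=(m, m))
Cx = A @ A.T / m                             # random symmetric positive-definite covariance
X = rng.multivariate_normal(np.zeros(m), Cx, size=20000)

q = rng.normal(scale=0.1, size=m)            # connection strengths, small random start
mu = 0.01                                    # learning rate (assumed value)

for x in X:
    y = q @ x                                # unit output: weighted sum of the inputs
    q += mu * (x * y - q * y**2)             # Oja's rule: Hebbian term minus decay term

# Compare the learned weights with the principal eigenvector of the covariance.
eigvals, eigvecs = np.linalg.eigh(Cx)
pc1 = eigvecs[:, -1]                         # eigenvector with the largest eigenvalue
print("norm of q:", np.linalg.norm(q))                           # should be close to 1
print("|cos| with first PC:", abs(q @ pc1) / np.linalg.norm(q))  # should be close to 1
```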

A mechanism has been proposed [5] that removes correlations between units receiving correlated inputs by modifiable anti-Hebbian connections. Here the output of a unit is the sum of the input to that unit and the feedback it receives from the other units weighted by the connection strengths (Fig. 2):

y_i = x_i + Σ_{j=1}^{n} w_ij y_j.

Initially the connections are ineffective, w_ij = 0, and the modification rule for the feedback connections (as in Kohonen's novelty filter [6]) is:

Δw_ij = -α y_i y_j   if i ≠ j.

According to this rule, the connection between any two positively correlated units becomes more negative (more inhibitory or less excitatory). This forces the outputs to be less correlated until all correlations are removed, and the network acts as a whitening filter.

Figure 2. Decorrelating network with anti-Hebbian connections. During training the cross-correlations between the outputs become zero.

The proposed network combines properties of the previous two (Fig. 3). A large number (m) of inputs connect to a smaller number (n) of output units by Hebbian connections (q), and anti-Hebbian feedback connections (w) between the output units keep the outputs uncorrelated. When an input is presented to the network, the units settle to a stable state for which

y_i = Σ_{j=1}^{m} q_ij x_j + Σ_{j=1}^{n} w_ij y_j,   or   y = Qx + Wy,   so   y = (I - W)^{-1} Q x.

Figure 3. The proposed network. White circles indicate Hebbian, black circles anti-Hebbian connections.

Initially the q_ij's are set to random values and w_ij = 0. The modification rules for the connection strengths are:

Δw_ij = -α y_i y_j   if i ≠ j,
Δq_ij = β (x_j y_i - q_ij y_i²).

The training is unsupervised, i.e. there is no need for a teaching or error signal. The modification rules are purely local: all the information necessary for the modification of a connection strength is available locally at the site of the connection, and there is no need for the propagation of values from other units.

The network's operation was simulated on a slow time scale, i.e. neither the convergence of the outputs nor the stable outputs for individual input vectors were calculated. The input was assumed to have a normal distribution with zero mean, characterized by its covariance matrix C_x, and because of the linearity of the network the output distribution could be calculated directly from this covariance matrix. If the matrix of the transformation performed by the network is T, then the distribution of the output is also normal, with covariance matrix

C_y = <y y^T> = <T x x^T T^T> = T <x x^T> T^T = T C_x T^T,

where < > denotes the expected value operation over the input distribution.

For each simulation run the input covariance matrix was chosen randomly, as described in the Appendix. The proposed modification rules were approximated by taking their expected values over the set of input patterns and keeping α and β small (α = β = 0.02). The following matrices were calculated in each cycle t for different network sizes:

T(t) = (I - W(t))^{-1} Q(t),
C_y(t) = T(t) C_x T(t)^T,
W(t+1) = W(t) - α offdiag(C_y(t)),
Q(t+1) = Q(t) + β (T(t) C_x - diag(C_y(t)) Q(t)),

where offdiag() is an operator which sets the diagonal elements of a matrix to zero, while diag() sets the off-diagonal elements to zero.
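The expected-value simulation described above can be written compactly in matrix form. The sketch below follows the update equations quoted in the text; the network size, the number of cycles and the way the random covariance is generated here are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 20, 4              # input and output dimensions (illustrative)
alpha = beta = 0.02       # learning rates, as stated in the text

# A random input covariance matrix (a simplified stand-in for the Appendix recipe).
A = rng.normal(size=(m, m))
Cx = A @ A.T / m

Q = rng.uniform(-0.5, 0.5, size=(n, m))   # feedforward (Hebbian) connections
W = np.zeros((n, n))                      # feedback (anti-Hebbian) connections

for t in range(500):
    T = np.linalg.inv(np.eye(n) - W) @ Q  # effective transformation: y = T x
    Cy = T @ Cx @ T.T                     # output covariance
    W = W - alpha * (Cy - np.diag(np.diag(Cy)))          # offdiag(Cy): decorrelation step
    Q = Q + beta * (T @ Cx - np.diag(np.diag(Cy)) @ Q)   # Oja-style feedforward step

T = np.linalg.inv(np.eye(n) - W) @ Q
print("output covariance after training (off-diagonal terms should be near zero):")
print(np.round(T @ Cx @ T.T, 3))
```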

Figure 4. The convergence of the subspace of network outputs to the PCA subspace. The quantity l_p / l indicates the overlap of the two subspaces as described in Fig. 5; m is the number of inputs, n the number of outputs. The curve is an average of 10 runs with different random input covariance matrices. (Panels shown include m = 100, n = 10; m = 200, n = 20; m = 400, n = 40; horizontal axes show training cycles.)
For each input distribution the n-dimensional PCA subspace was calculated, and the amount of overlap between this and the subspace of the network outputs was plotted (Figs. 4, 5). The basis vectors of the transformation of the network, i.e. the row vectors of T, need not be identical to the principal components, but they span the same subspace.

Figure 5. Illustration of the quantity l_p / l used in Fig. 4. For an arbitrary vector of length l and a given subspace (here a plane), l_p / l approaches 1 as the vector converges to the subspace, where l_p is the length of the projection of the vector onto the given subspace. In Fig. 4 the vectors were the rows of T, and the subspace was that defined by PCA.
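The overlap measure l_p / l is straightforward to compute. Below is a small, self-contained sketch (the matrices are placeholders, not the paper's data): given the rows of a transformation T and an orthonormal basis P of the PCA subspace, it returns l_p / l for each row.

```python
import numpy as np

def subspace_overlap(T, P):
    """For each row of T, return l_p / l, where l_p is the length of the row's
    projection onto the subspace spanned by the orthonormal rows of P."""
    proj = P.T @ P                          # projection matrix onto the PCA subspace
    lp = np.linalg.norm(T @ proj, axis=1)   # lengths of the projected rows
    l = np.linalg.norm(T, axis=1)           # lengths of the original rows
    return lp / l

# Example with placeholder data: a random covariance and an untrained 'network' T.
rng = np.random.default_rng(2)
m, n = 20, 4
A = rng.normal(size=(m, m))
Cx = A @ A.T / m
eigvals, eigvecs = np.linalg.eigh(Cx)
P = eigvecs[:, -n:].T                   # n leading principal components as rows
T = rng.normal(size=(n, m))             # random T, for comparison with a trained network
print(subspace_overlap(T, P))           # values approach 1 as T converges to the PCA subspace
```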

The amount of information transmitted by the network was calculated by treating it as a noisy communication channel. A communication channel [e.g. 7] is characterized by the mutual information I(X; Y) between its input, X, and its output, Y.

Figure 6. The mutual information of the network reaches the maximum set by PCA. The curve is an average of 10 runs with different random input covariance matrices. (Panels: m = 50, n = 5; m = 100, n = 10; m = 200, n = 20; m = 400, n = 40; horizontal axes show training cycles.)

If the m-dimensional continuous input has probability density function p_X(x), then I(X; Y) is defined as

I(X; Y) = H(X) - H(X | Y),   where   H(X) = -∫ p_X(x) log p_X(x) dx

is the entropy [see 7], a measure of uncertainty about the input, and H(X | Y) is the conditional entropy, the amount of uncertainty about the input that remains after the output of the channel has been observed. The latter may be non-zero either because of noise in the channel or because the output is of lower dimension than the input. In the case of the network we would like the observation of its output to maximize the decrease in our uncertainty about the input, i.e. it should maximize the mutual information. Oja's algorithm reaches the maximum of mutual information set by PCA for the single-unit case if the inputs contain uncorrelated noise of equal variance [8], and similar results hold for algorithms that yield the PCA subspace [9].

It can be shown that H(X) - H(X | Y) = H(Y) - H(Y | X), where H(Y) is the entropy of the output and H(Y | X), the conditional entropy of the output, represents the effect of noise in the input. The right-hand side of this equation was used to calculate the mutual information at each iteration step, by adding uncorrelated noise to the network input. The conditional entropy of the output is non-zero because of the noise transformed from the input to the output, and [7] H(Y | X) = H(Y_N), where Y_N is the noise transformed to the output. The noise is also assumed to have a normal distribution with zero mean, and the noise on each input is assumed to be independent and of unit variance, so the covariance matrix of the noise on the input is I (the identity matrix). The covariance of the noise on the output is C_N = T T^T.

The entropy of an n-dimensional normal distribution with covariance matrix C is [7]

H = (1/2) log( (2πe)^n det(C) ),

so

I(X; Y) = H(Y) - H(Y_N) = (1/2) log( (2πe)^n det(C_y) ) - (1/2) log( (2πe)^n det(C_N) ).

This quantity should be compared with the mutual information given by PCA:

I_PCA = (1/2) log( (2πe)^n det(C_P) ) - (1/2) log( (2πe)^n det(P P^T) ),

where P is the matrix containing the n largest normalized principal components as rows, and C_P = P C_x P^T. Further, we know that det(P P^T) = 1, because PCA is an orthogonal transformation, so

I_PCA = (1/2) log( det(C_P) ) = (1/2) log( λ_1 λ_2 ··· λ_n ),

where λ_1, λ_2, …, λ_n are the n largest eigenvalues of C_x. Fig. 6 shows the mutual information of the proposed network as a function of training cycles as it approaches the maximal value set by PCA.
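A short sketch of this calculation is given below. Whether the output covariance in H(Y) should include the transformed input noise is my reading of the text and is flagged as an assumption in the comments; the example matrices are placeholders rather than the paper's data, and in the paper T would come from the trained network.

```python
import numpy as np

def gaussian_entropy(C):
    """Entropy of an n-dimensional normal distribution with covariance C:
    (1/2) log((2*pi*e)^n det(C))."""
    n = C.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(C))

def network_information(T, Cx):
    """I(X; Y) = H(Y) - H(Y_N) for a transformation T.
    Assumption: the output covariance includes the transformed unit-variance input
    noise, i.e. Cy = T (Cx + I) T^T, while the noise alone has covariance Cn = T T^T."""
    m = Cx.shape[0]
    Cy = T @ (Cx + np.eye(m)) @ T.T
    Cn = T @ T.T
    return gaussian_entropy(Cy) - gaussian_entropy(Cn)

def pca_information(Cx, n):
    """The PCA value quoted in the text: (1/2) log(lambda_1 * ... * lambda_n)."""
    eigvals = np.linalg.eigvalsh(Cx)        # ascending order
    return 0.5 * np.sum(np.log(eigvals[-n:]))

# Placeholder example.
rng = np.random.default_rng(3)
m, n = 20, 4
A = rng.normal(size=(m, m))
Cx = A @ A.T / m
T = rng.normal(size=(n, m))
print("network I(X;Y):", network_information(T, Cx))
print("PCA value:     ", pca_information(Cx, n))
```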

The results of the simulations indicate that the proposed simple rules can yield feature detectors that reduce dimensionality in a linear network with minimal loss of information for Gaussian distributions. The convergence of the network is fast, and does not appear to depend strongly on the size of the network within the simulated range. For non-Gaussian distributions the proposed algorithm may not be optimal in an information-theoretic sense, but by removing correlations, i.e. second-order dependencies between variables, it may make it easier for further, nonlinear processing stages to detect higher-order structure, frequently occurring combinations or patterns in the input.

Acknowledgements. I would like to thank Prof. H. B. Barlow and Dr G. J. Mitchison and others in Cambridge for their assistance and helpful discussions. This work was supported by research grants from the Royal Society, MRC, SERC and Churchill College, Cambridge.

References

[1] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, 1982.
[2] E. Oja, "A Simplified Neuron Model as a Principal Component Analyzer," Journal of Mathematical Biology, vol. 15, pp. 267-273, 1982.
[3] R. J. Williams, "Feature discovery through error-correction learning," ICS Report 8501, University of California, San Diego, 1985.
[4] T. D. Sanger, "Optimal Unsupervised Learning in Feedforward Neural Networks," MSc Thesis, Department of Electrical Engineering and Computer Science, MIT, 1988.
[5] H. B. Barlow and P. Földiák, "Adaptation and decorrelation in the cortex," to appear in The Computing Neuron, ed. by C. Miall, R. M. Durbin and G. J. Mitchison, Addison-Wesley, 1989.
[6] T. Kohonen, Self-Organization and Associative Memory, New York: Springer-Verlag, 1984.
[7] D. S. Jones, Elementary Information Theory, Oxford: Clarendon Press, 1979.
[8] R. Linsker, "Self-organization in a perceptual network," IEEE Computer, vol. 21, pp. 105-117, March 1988.
[9] M. D. Plumbley and F. Fallside, "An Information-Theoretic Approach to Unsupervised Connectionist Models," submitted to the Proceedings of the 1988 Connectionist Models Summer School, San Mateo, CA: Morgan Kaufmann.
[10] B. A. Pearlmutter and G. E. Hinton, "G-maximization: An unsupervised learning procedure for discovering regularities," in Proceedings of the Conference on Neural Networks for Computing, American Institute of Physics, 1986.
[11] E. Oja and J. Karhunen, "On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix," Journal of Mathematical Analysis and Applications, vol. 106, pp. 69-84, 1985.

Appendix

For each run the q_ij(0)'s were chosen to be random numbers from an even distribution over the interval [-5, 5]. Each random input covariance matrix, C_x, was generated by randomly rotating a random diagonal matrix, Λ. The elements of this diagonal matrix will be the eigenvalues of the covariance matrix. These eigenvalues were chosen randomly from an exponential distribution with expected value 1. Let us denote the chosen eigenvalues arranged in descending order by λ_1 > λ_2 > … > λ_m. The rotation matrix L was generated by Gram-Schmidt orthogonalization of a matrix with elements chosen from a uniform distribution on the interval [-5, 5], and C_x = L Λ L^T. The columns of L will be the eigenvectors of C_x, the principal components.
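The Appendix recipe for the random covariance matrices can be sketched as follows (the matrix size is an arbitrary example; numpy's QR factorization is used here as a convenient stand-in for explicit Gram-Schmidt orthogonalization).

```python
import numpy as np

def random_covariance(m, rng):
    """Generate a random covariance matrix as in the Appendix: exponentially
    distributed eigenvalues (mean 1) and a random rotation obtained by
    orthogonalizing a matrix of uniform [-5, 5] entries."""
    eigvals = np.sort(rng.exponential(scale=1.0, size=m))[::-1]   # descending eigenvalues
    M = rng.uniform(-5, 5, size=(m, m))
    L, _ = np.linalg.qr(M)              # orthonormal columns (QR in place of Gram-Schmidt)
    return L @ np.diag(eigvals) @ L.T   # Cx = L * Lambda * L^T

rng = np.random.default_rng(4)
Cx = random_covariance(10, rng)
print(np.round(np.linalg.eigvalsh(Cx), 3))   # recovers the chosen eigenvalues
```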
