
Perspectives on Learning Symbolic Data with Connectionistic Systems

Barbara Hammer
University of Osnabrück, Department of Mathematics/Computer Science, D-49069 Osnabrück, Germany, e-mail: hammer@informatik.uni-osnabrueck.de.

Abstract. This paper deals with the connection of symbolic and subsymbolic systems. It focuses on connectionistic systems processing symbolic data. We examine the capability of learning symbolic data with various neural architectures which constitute partially dynamic approaches: discrete time partially recurrent neural networks as a simple and well established model for processing sequences, and advanced generalizations like holographic reduced representation, recursive autoassociative memory, and folding networks for processing tree structured data. The methods share the basic dynamics, but they differ in the specific training methods. We consider the following questions: Which are the representational capabilities of the architectures from an algorithmic point of view? Which are the representational capabilities from a statistical point of view? Are the architectures learnable in an appropriate sense? Are they efficiently learnable?

1 Introduction
Symbolic methods and connectionistic or subsymbolic systems constitute complementary approaches for automatically processing data appropriately. Various learning algorithms for learning an unknown regularity based on training examples exist in both domains: decision trees, rule induction, inductive logic programming, version spaces, . . . on the one side and Bayesian reasoning, vector quantization, clustering algorithms, neural networks, . . . on the other side [24]. The specific properties of the learning algorithms are complementary as well. Symbolic methods deal with high level information formulated via logical formulas, for example; data processing is human-understandable; hence it is often easy to involve prior knowledge, to adapt the training outputs to specific domains, or to retrain the system on additional data; at the same time, training is often complex, inefficient, and sensitive to noise. In comparison, connectionistic systems deal with low level information. Since they perform pattern recognition, their behavior is not human-understandable and often, adaptation to specific situations or additional data requires complete retraining. At the same time, training is efficient, noise tolerant, and robust. Common data structures for symbolic methods are formulas or terms, i.e., high level data with little redundant information and a priori unlimited structure where much of the information lies in the interaction of the single data components. As an example, the meaning of each of the symbols in the term father(John,Bill) is essentially connected to its respective position in the term. No symbol can be omitted without losing important information. Assuming Bill was the friend of Mary's brother, the above term could be substituted by father(John,friend(brother(Mary))), a term with a different length and structure. Connectionistic methods process patterns, i.e., real vectors of a fixed dimension, which commonly comprise low level, noisy, and redundant information of a fixed


Fig. 1. Example for subsymbolic data: hand-written digit .

and determined form. The precise value and location of the single components is often unimportant; information comes from the sum of local features. As an example, Fig. 1 depicts various representations of the same digit; each picture can be represented by a vector of gray-levels; the various pictures differ considerably in detail while preserving important features such as the two curved lines of the digit. Often, data possess both symbolic and subsymbolic aspects: As an example, database entries may combine the picture of a person, his income, and his occupation; web sites consist of text, pictures, formulas, and links; arithmetical formulas may contain variables and symbols as well as real numbers. Hence appropriate machine learning methods have to process hybrid data. Moreover, people are capable of dealing with both aspects at the same time. It would be interesting to see which mechanisms allow artificial learning systems to handle both aspects simultaneously. We will focus on connectionistic systems capable of dealing with symbolic and hybrid data. Our main interests are twofold: On the one hand, we would like to obtain an efficient learning system which can be used for practical applications involving hybrid data. On the other hand, we would like to gain insight into the questions of how symbolic data can be processed with connectionistic systems in principle; do there exist basic limitations; does this point of view allow further insight into the black-box dynamics of connectionistic systems? Due to the nature of symbolic and hybrid data, there exist two ways of asking questions about the theoretical properties of those mechanisms: the algorithmic point of view and the statistical point of view. One can, for example, consider the question whether symbolic mechanisms can be learned with hybrid systems exactly; alternatively, the focus can lie on the property that the probability of poor performance on input data can be limited. Generally speaking, one can focus on the symbolic data; alternatively, one can focus on the connectionistic systems. It will turn out that this freedom leads both to further insight into the systems and to additional problems which are to be solved. Various mechanisms extend connectionistic systems with symbolic aspects; a major problem of networks dealing with symbolic or hybrid data lies in the necessity of processing structures with a priori unlimited size. Mainly three different approaches can be found in the literature: Symbolic data may be represented by a fixed number of features and further processed with standard neural networks. Time series, as an example, may be represented by a local time window of fixed length and additional global features such as the overall trend [23]. Formulas may be represented by the involved symbols and a measure of their complexity. This approach is explicitly static: Data are encoded in a finite dimensional vector space via problem specific features before further processing with a connectionistic system. Obviously, the representation of data is not fitted to the specific learning task since learning is independent of encoding. Moreover, it may be difficult or in general impossible to find a representation in a finite dimensional


vector space such that all relevant information is preserved. As an example, the terms equal(a,a), equal(f(a),f(a)), equal(f(f(a)),f(f(a))), . . . could be represented by the number of occurrences of the symbol f at the first and second position in the respective term. The terms equal(g(a,g(a,a)),g(a,g(a,a))) and equal(g(g(a,a),a),g(g(a,a),a)) can no longer be represented in the same way without loss of information; we have to add an additional part encoding the order of the symbols. Alternatively, the a priori unlimited structure of the inputs can be mapped to a priori unlimited processing time of the connectionistic system. Standard neural networks are equipped with additional recurrent connections for this purpose. Data are processed in a dynamic way involving the additional dimension of time. This can either be fully dynamic, i.e., symbolic input and output data are processed over time, the precise dynamics and number of recurrent computation steps being unlimited and correlated to the respective computation; or the model can be partially dynamic and implicitly static, i.e., the precise dynamics are correlated to the structure of the respective symbolic data only. In the first case, complex data may be represented via a limiting trajectory of the system, via the location of neurons with highest activities in the neural system, or via synchronous spike trains, for example. Processing may be based on Hebbian or competitive activation such as in LISA or SHRUTI [15,39] or on an underlying potential which is minimized such as in Hopfield networks [14]. There exist advanced approaches which enable complex reasoning or language processing with fully dynamic systems; however, these models are adapted to the specific area of application and require a detailed theoretical investigation for each specific approach. In the second case, the recurrent dynamics directly correspond to the data structure and can be determined precisely provided the input or output structure, respectively, is known. One can think of the processing as an inherently static approach: The recurrence enables the systems to encode or decode data appropriately. After encoding, a standard connectionistic representation is available for the system. The difference to a feature based approach consists in the fact that the encoding is adapted to the specific learning task and need not be separated from the processing part; coding and processing constitute one connected system. A simple example of these dynamics are discrete time recurrent neural networks or Elman networks which can handle sequences of real vectors [6,9]. Knowledge of the respective structure, i.e., the length of the sequence, allows one to substitute the recurrent dynamics by an equivalent standard feedforward network. Input sequences are processed step by step such that the computation for each entry is based on the context of the already computed coding of the previous entries of the sequence. A natural generalization of this mechanism allows neural encoding and decoding of tree structured data as well. Instead of linear sequences, one has to deal with branchings. Concrete implementations of this approach are the recursive autoassociative memory (RAAM) [30] and labeled RAAM (LRAAM) [40], holographic reduced representations (HRR) [29], and recurrent and folding networks [7]. They differ in the method of how they are trained and in the question as to whether the inputs, the outputs, or both may be structured or real valued, respectively.
The basic recurrent dynamics are the same for all approaches. The possibility to deal with symbolic data, i.e. tree structures, relies on some either fixed or trainable recursive encoding and decoding of data with simple mappings computed by standard networks. Hence the approaches are uniform


and a general theory can be developed, in contrast to the often very specific fully dynamic systems. However, the idea is limited to data structures whose dynamics can be mapped to an appropriate recursive network. It includes recursive data like sequences or tree structures; possibly cyclic graphs are not yet covered. We will start with the investigation of standard recurrent networks because they are a well established and successful method and, at the same time, demonstrate a typical behavior. Their in-principle capacity as well as their learnability can be investigated from an algorithmic as well as a statistical point of view. From an algorithmic point of view, the connection to classical approaches like finite automata and Turing machines is interesting. Moreover, this connection allows partial insight into the way in which the networks perform their tasks. There are only few results concerning the learnability of these dynamics from an algorithmic point of view. Afterwards, we will study the statistical learnability and approximation ability of recurrent networks. These results are transferred to various more general approaches for tree structured data.

2 Network Dynamics
First, the basic recurrent dynamics are defined. As usual, a feedforward network consists of a weighted directed acyclic graph of neurons such that a global processing rule is obtained via successive local computations of the neurons. Commonly, the neurons iteratively compute their activation a_i = σ_i(Σ_{j ∈ pred(i)} w_{ji} a_j + θ_i), pred(i) denoting the predecessors of neuron i, w_{ji} denoting some real-valued weight assigned to the connection (j, i), θ_i denoting the bias of neuron i, and σ_i its activation function. Starting with the neurons without predecessors, the so-called input neurons, which obtain their activation from outside, the neurons successively compute their activation until the output of the network can be found at some specified output neurons. Hence feedforward networks compute functions from a finite dimensional real-vector space into a finite dimensional real-vector space. A network architecture only specifies the directed graph and the activation functions, but not the weights and biases. Often, so-called multilayer networks or multilayer architectures are used, meaning that the graph decomposes into subsets, so-called layers, such that connections can only be found between consecutive layers. It is well known that feedforward neural networks are universal approximators in an appropriate sense: Every continuous or measurable function, respectively, can be approximated by some network with appropriate activation function on any compact input domain or for inputs of arbitrarily high probability, respectively. Moreover, such mappings can be learned from a finite set of examples. This, in more detail, means that two requirements are met. First, neural networks yield valid generalization: The empirical error, i.e., the error on the training data, is representative for the real error of the architecture, i.e., the error for unknown inputs, if a sufficiently large training set has been taken into account. Concrete bounds on the required training set size can be derived. Second, effective training algorithms for minimizing the empirical error on concrete training data can be found. Usually, training is performed with some modification of backpropagation like the very robust and fast method RProp [32]. Sequences of real vectors constitute simple symbolic structures. They are difficult for standard connectionistic methods due to their unlimited length. We denote the set of sequences with elements in an alphabet Σ


by Σ*. A common way of processing sequences with standard networks consists in truncating, i.e., a sequence with initially unknown length is substituted by only a part with a priori fixed time horizon. Obviously, truncation usually leads to information loss. Alternatively, one can equip feedforward networks with recurrent connections and use the further dimension of time. Here, we introduce the general concept of recurrent coding functions. Every mapping with appropriate domain and codomain induces a mapping on sequences or into sequences, respectively, via recursive application as follows:

Definition 1. Assume Σ is some set. Any function f : ℝⁿ × Σ → ℝⁿ and initial context y ∈ ℝⁿ induce a recursive encoding

  enc_f : Σ* → ℝⁿ,   enc_f(s) = y                                         if s is the empty sequence,
                      enc_f([a_1, …, a_t]) = f(enc_f([a_1, …, a_{t−1}]), a_t)   otherwise.

Any function g : ℝⁿ → ℝⁿ × Σ and final set Y ⊆ ℝⁿ induce a recursive decoding

  dec_g : ℝⁿ → Σ*,   dec_g(x) = the empty sequence                        if x ∈ Y,
                      dec_g(x) = [a] followed by dec_g(x')                 otherwise, where g(x) = (x', a).
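As an illustration of these dynamics, the following minimal sketch implements the recursive encoding and decoding in Python; the toy transition network, its dimensions, and the termination test are hypothetical choices made for the sake of the example and are not taken from the text.

```python
import numpy as np

def enc(f, y0, seq):
    """Recursive encoding: fold the transition f over the sequence,
    starting from the initial context y0."""
    y = y0
    for a in seq:          # each entry is processed in the context of the previous code
        y = f(y, a)
    return y

def dec(g, in_final, x, max_len=100):
    """Recursive decoding: apply g repeatedly until the code lies in the
    final set; dec may be partial, hence the explicit length bound."""
    seq = []
    while not in_final(x) and len(seq) < max_len:
        x, a = g(x)        # g yields the new code and the next label
        seq.append(a)
    return seq

# toy transition computed by a one-layer network with illustrative random weights
rng = np.random.default_rng(0)
n, d = 4, 2                # context dimension, label dimension
W = rng.normal(size=(n, n + d))
b = rng.normal(size=n)
f = lambda y, a: np.tanh(W @ np.concatenate([y, a]) + b)

code = enc(f, np.zeros(n), [rng.normal(size=d) for _ in range(5)])
```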

Note that dec_g may not be defined if the decoding never leads to values in the final set Y. Therefore one often restricts decoding to decoding of sequences up to a fixed finite length in practice. Recurrent neural networks compute the composition of up to three functions dec_g ∘ h ∘ enc_f, depending on their respective domain and codomain, where f, h, and g are computed by standard feedforward networks. Note that this notation is somewhat unusual in the literature. Mostly, recurrent networks are defined via their transition function, referring to the standard dynamics of discrete dynamical systems. However, the above definition has the advantage that the role of the single network parts can be made explicit: Symbolic data are first encoded into a connectionistic representation, this connectionistic representation is further processed with a standard network, and finally the implicit representation is decoded to symbolic data. In practice, these three parts are not well separated and one can indeed show that the transformation part can be included in either encoding or decoding. Encoding and decoding need not compute a precise encoding or decoding such that data can be restored perfectly. Encoding and decoding are part of a system which as a whole should approximate some function. Hence only those parts of the data have to be taken into account which contribute to the specific learning task. Recurrent networks are mostly used for time series prediction, i.e., the decoding part dec_g is dropped. Long term prediction of time series, where the decoding part is necessary, is a particularly difficult task and can rarely be found in applications. A second advantage of the above formalism is the possibility to generalize the dynamics to tree structured data. Note that terms and formulas possess a natural representation via a tree structure: The single symbols, i.e., the variables, constants, function symbols, predicates, and logical symbols, are encoded in some real-vector space via unique values, e.g., natural numbers or unary vectors; these values correspond to the labels of the nodes in a tree. The tree structure directly corresponds to the structure of the term or formula; i.e., subterms of a single term correspond to subtrees of a node


Fig. 2. Example for a tree representation of symbolic data: the symbols of two terms are represented by unique unary vectors such as (1,0,0), (0,1,0), and (0,0,1), which label the nodes of the respective trees (left and right).

equipped with the label encoding the function symbol. See Fig. 2 for an example. In the following, we restrict the maximum arity of functions and predicates to some fixed value k. Hence the data we are interested in are trees where each node has at most k successors. Expanding the tree by empty nodes if necessary, we can restrict ourselves to the case of trees with fan-out exactly k. Hence we will deal with tree structures with fan-out k as inputs or outputs of network architectures in the following.

Definition 2. A k-tree with labels in some set Σ is either the empty tree, which we denote by ξ, or it consists of a root labeled with some a ∈ Σ and k subtrees t_1, …, t_k, some of which may be empty. In the latter case we denote the tree by a(t_1, …, t_k). Denote the set of k-trees with labels in Σ by Σ*_k.

The recursive nature of trees induces natural dynamics for recursively encoding or decoding trees to real vectors. We can define an induced encoding or decoding, respectively, for each mapping with appropriate arity in the following way:

Definition 3. Denote by Σ a set. Any mapping f : Σ × (ℝⁿ)^k → ℝⁿ and initial context y ∈ ℝⁿ induce a recursive encoding

  enc_f : Σ*_k → ℝⁿ,   enc_f(ξ) = y,
                        enc_f(a(t_1, …, t_k)) = f(a, enc_f(t_1), …, enc_f(t_k)).

Any mapping g : ℝⁿ → Σ × (ℝⁿ)^k and final set Y ⊆ ℝⁿ induce a recursive decoding

  dec_g : ℝⁿ → Σ*_k,   dec_g(x) = ξ                                      if x ∈ Y,
                        dec_g(x) = a(dec_g(x_1), …, dec_g(x_k))            otherwise, where g(x) = (a, x_1, …, x_k).

Again, dec_g might be a partial function. Therefore decoding is often restricted to decoding of trees up to a fixed height in practice. The encoding recursively applies a mapping in order to obtain a code for a tree in a real-vector space. One starts at the leaves and recursively encodes the single subtrees. At each level, the already computed codes of the respective subtrees are used as context. The recursive decoding is defined in a similar manner: Recursively applying some decoding function to a real vector yields the label of the root and codes for the subtrees. In the connectionistic setting, the two mappings used for encoding or decoding, respectively, can be computed by standard feedforward neural networks. As in the linear case, i.e., the case of simple recurrent networks, one can combine the mappings enc_f, dec_g, and a standard network h depending on the specific learning task.
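For illustration, a minimal sketch of the induced tree encoding and decoding for fan-out k = 2; the tree representation, the toy network, and its dimensions are hypothetical choices and not part of the original text.

```python
import numpy as np

EMPTY = None                                   # stands for the empty tree

def enc_tree(f, y_nil, t):
    """Bottom-up encoding of a k-tree t = (label, [subtrees])."""
    if t is EMPTY:
        return y_nil                           # code of the empty tree
    label, subtrees = t
    codes = [enc_tree(f, y_nil, s) for s in subtrees]
    return f(label, codes)                     # codes of the subtrees act as context

def dec_tree(g, in_final, x, height=3):
    """Top-down decoding, restricted to trees up to a fixed height."""
    if height == 0 or in_final(x):
        return EMPTY
    label, subcodes = g(x)                     # label of the root and subtree codes
    return (label, [dec_tree(g, in_final, c, height - 1) for c in subcodes])

# toy encoding network for fan-out k = 2 with illustrative random weights
rng = np.random.default_rng(1)
n, d, k = 8, 3, 2                              # code dim., label dim., fan-out
W = rng.normal(size=(n, d + k * n))
b = rng.normal(size=n)
f = lambda label, codes: np.tanh(W @ np.concatenate([label, *codes]) + b)

# a small 2-tree whose labels are unary vectors
tree = (np.eye(d)[0], [(np.eye(d)[1], [EMPTY, EMPTY]),
                       (np.eye(d)[2], [EMPTY, EMPTY])])
code = enc_tree(f, np.zeros(n), tree)
```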


Note that this definition constitutes a natural generalization of standard recurrent networks and hence allows for successful practical applications as well as general investigations concerning concrete learning algorithms, the connection to classical mechanisms like tree automata, and the theoretical properties of approximation ability and learnability. However, it is not biologically motivated compared to standard recurrent networks, and though this approach can shed some light on the possibility of dealing with structured data in connectionistic systems, it does not necessarily enlighten the way in which humans solve these tasks. We will start with a thorough investigation of simple recurrent networks since they are biologically plausible and, moreover, significant theoretical difficulties and benefits can already be found at this level.

3 Recurrent Neural Networks


Recurrent networks are a natural tool in any domain where time plays a role, such as speech recognition, control, or time series prediction, to mention just a few [8,9,25,41]. They are also used for the classication of symbolic data such as DNA sequences [31]. Turing Capabilities The fact that their inputs and outputs may be sequences suggests the comparison to other mechanisms operating on sequences, such as classical Turing machines. One can consider the internal states of the network as a memory or tape of the Turing machine. Note that the internal states of the network may consist of real values, hence an innite memory is available in the network. In Turing machines, operations on the tape are performed. Each operation can be simulated in a network by a recursive computation step of the transition function. In a Turing machine, the end of a computation is indicated by a specic nal state. In a network, this behavior can be mimicked by the activation of some specic neuron which indicates whether the computation is nished or still continues. The output of the computation can be found at the same time step at some other specied neuron of the network. Note that computations of a Turing machine which do not halt correspond to recursive computations of the network such that the value of the specied halting neuron is different from some specied value. A schematic view of such a computation is depicted in Fig. 3. A possible formalization is as follows: can be computed by a Denition 4. A (possibly partial) function recurrent neural network if feedforward networks , , enc and exist such that for all sequences , where denotes the smallest number of iterations such that the activation of some specied output neuron of the part is contained in a specied set encoding the end of the computation after iteratively applying to enc . Note that simulations of Turing machines are merely of theoretical interest; such computation mechanisms will not be used in practice. However, the results shed some light on the power of recurrent networks. The networks capacity naturally depends on the choice of the activation functions. Common activation functions in the literature are piecewise polynomial or S-shaped functions such as:

4 p&

4 Y f

4 p&

$ @ d h Qge sx $ @ e f 4 b C w Y UDGl8 P C w & 7 C w Y DFWtDGll $ @ ys h f

4 &

4 Y P x'

Fig. 3. Turing computation with a recurrent network: the input is recursively encoded, a recursive computation is iterated as long as a designated neuron signals "not yet", and the output is then read off at a specified neuron.

Obviously, recurrent networks with a finite number of neurons and the perceptron activation function have at most the power of finite automata, since their internal state space is finite. In [38] it is shown that recurrent networks with the semilinear activation function are Turing universal, i.e., there exists for every Turing machine a finite size recurrent network which computes the same function. The proof consists essentially in a simulation of the stacks corresponding to the left and right half of the Turing tape via the activation of two neurons. Additionally, it is shown that standard tape operations like push and pop and Boolean operations can be computed with a semilinear network. The situation is more complicated for the standard sigmoidal activation function since exact classical computations which require precise activations 0 or 1, as an example, can only be approximated within a sigmoidal network. Hence the approximation errors which add up in recurrent computations must be controlled. [16] shows the Turing universality of sigmoidal recurrent networks via simulating so-called clock machines, a Turing-universal formalism which, unfortunately, leads to an exponential delay. However, people believe that standard sigmoidal recurrent networks are Turing universal with polynomial resources, too, although the formal proof is still missing. In [37] the converse direction, the simulation of neural network computations with classical mechanisms, is investigated. The authors relate semilinear recurrent networks to so-called non-uniform Boolean circuits. This is particularly interesting due to the fact that non-uniform circuits are super-Turing universal, i.e., they can compute every function, even non-computable ones, possibly requiring exponential time. Speaking in terms of neural networks: In addition to the standard operations, networks can use the unlimited storage capacity of the single digits in their real weights as an oracle; a linear number of such digits is available in linear time. Again, the situation is more difficult for the sigmoidal activation function. The super-Turing capability is demonstrated in [36], for example. [11] shows the super-Turing universality in possibly exponential time and, as is necessary in all demonstrations of super-Turing capabilities of recurrent networks, with at least one irrational weight. Note that the latter results rely on an additional severe assumption: The operations on the real numbers are performed with infinite precision. Hence further investigation could naturally be put in line with the theory of computation on the real numbers [3].
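To give a flavour of such constructions, the sketch below encodes a binary stack in a single number via a base-4 fractional encoding, so that push, top, and pop become affine operations combined with the semilinear activation. This is a common textbook variant in the spirit of [38]; the precise encoding used there may differ.

```python
# A binary stack b1 b2 ... (b1 = top) is encoded as q = sum_i (2*b_i + 1) / 4**i.
# Then q = 0 for the empty stack, q in [1/4, 1/2) if the top bit is 0, and
# q in [3/4, 1) if it is 1, so all stack operations are semilinear.

def lin(x):                        # semilinear activation
    return min(max(x, 0.0), 1.0)

def push(q, bit):
    return q / 4.0 + (2 * bit + 1) / 4.0

def top(q):
    return lin(4.0 * q - 2.0)      # 1 if the top bit is 1, 0 if it is 0

def nonempty(q):
    return lin(4.0 * q)            # 1 if the stack is nonempty, 0 otherwise

def pop(q):
    return lin(4.0 * q - 2.0 * top(q) - 1.0)

# push 1, 0, 1 and read the stack back from the top
q = 0.0
for b in (1, 0, 1):
    q = push(q, b)
while nonempty(q) > 0.5:
    print(int(top(q)), end=" ")    # prints: 1 0 1
    q = pop(q)
```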


Finite Automata and Languages The transition dynamics of recurrent networks directly correspond to finite automata, hence the comparison to finite automata is a very natural question. For a formal definition, a finite automaton with n states computes a function of the form g ∘ enc_δ : Σ* → {0, 1}, where Σ is a finite alphabet, δ is a transition function mapping an input letter and a context state to a new state, the initial context is the initial state of the automaton, and g is a projection of the states to {0, 1}. A language L ⊆ Σ* is accepted by an automaton if some automaton computing a function f exists such that L = f⁻¹(1). Since neural networks are far more powerful (they are super-Turing universal), it is not surprising that finite automata, and some context sensitive languages, too, can be simulated by recurrent networks. However, automata simulations have practical consequences: The constructions lead to effective techniques of automata rule insertion and extraction; moreover, the automaton behavior is even learnable from data as demonstrated in computer simulations. It has been shown in [27], for example, that finite automata can be simulated by recurrent networks of the form g ∘ enc_f, g being a simple projection and f being a standard feedforward network. The number of neurons which are sufficient in f is upper bounded by a linear term in n, the number of states of the automaton. Moreover, the perceptron activation function or the sigmoidal activation function or any other function with similar properties will do. One could ask whether fewer neurons are sufficient, since one could encode n states in the activation of only log n binary valued neurons. However, an abstract argumentation shows that, at least for perceptron networks, a number of neurons growing with n is necessary. Since this argumentation can be used at several places, we shortly outline the main steps: The set of finite automata with n states and binary inputs defines a class of functions computable with such a finite automaton, say F_n. Assume a network with at most N neurons could implement every n-state finite automaton. Then the class of functions computable with N-neuron architectures, say F_N, would be at least as powerful as F_n. Consider the sequences of length n with an entry 1 precisely at the ith position, i = 1, …, n. Assume some arbitrary binary function is fixed on these sequences. Then there exists an automaton with O(n) states which implements this function on the sequences: We can use the states for counting the position of the respective input entry, and we map to a specified final accepting state whenever the corresponding function value is 1. As a consequence, we would have to find, for those sequences and every dichotomy, some recurrent network with at most N neurons which maps the sequences accordingly, too. However, the number of input sequences which can be mapped to arbitrary values is upper bounded by the so-called pseudodimension, a quantity measuring the richness of function classes as we will see later. In particular, this quantity can be upper bounded for perceptron networks by a term which is polynomial in the number of weights and the length of the input sequences. Hence the lower bound on the number of neurons follows. However, various researchers have demonstrated in theory as well as in practice that sigmoidal recurrent networks can recognize some context sensitive languages as well: It is proved in [13] that they can perform counting, i.e., recognize languages of the form aⁿbⁿ or, generally spoken, languages where the multiplicities of various symbols have to match. Approaches like [20] demonstrate that a finite approximation of these languages can be learned from a finite set of examples. This capacity is of particular interest due to its importance for the capability of understanding natural languages with nested structures.
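To illustrate the flavour of such simulations, the following sketch implements one recursive step of an automaton with one-hot state and letter codings and perceptron units; it is a generic textbook-style construction, not the specific architecture analyzed in [27].

```python
import numpy as np

H = lambda x: (x >= 0).astype(float)         # perceptron activation

# example automaton with 2 states over {a, b}: state 1 iff the number of a's is odd
states, letters = [0, 1], ["a", "b"]
delta = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}

def transition(state_vec, letter_vec):
    """One recursive step: one-hot state and letter in, one-hot new state out."""
    pairs = [(q, a) for q in states for a in letters]
    # first layer: one unit per pair (q, a), firing iff state == q and letter == a
    conj = H(np.array([state_vec[q] + letter_vec[letters.index(a)] - 1.5
                       for q, a in pairs]))
    # second layer: one unit per state q', firing iff an active pair maps to q'
    return H(np.array([sum(conj[i] for i, (q, a) in enumerate(pairs)
                           if delta[(q, a)] == qp) - 0.5
                       for qp in states]))

state = np.array([1.0, 0.0])                 # one-hot code of the initial state 0
for letter in "abaa":
    state = transition(state, np.eye(len(letters))[letters.index(letter)])
print(int(state.argmax()))                   # 1, since "abaa" contains an odd number of a's
```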

The learning algorithms are usually standard algorithms for recurrent networks which we will explain later. Commonly, they do not guarantee the correct long-term behavior of the networks, i.e., they lead only sometimes to the correct behavior for long input sequences, although they perform surprisingly well on short training samples. Learnability, for example in the sense of identification in the limit as introduced by Gold, is not granted. Approaches which explicitly tackle the long term behavior and which, moreover, allow for a symbolic interpretation of the connectionistic processing are automata rule insertion and extraction: The possibly partial explicit knowledge of the automaton's behavior can be directly encoded in a recurrent network used for connectionistic processing, if necessary with further retraining of the network. Conversely, automata rules can be extracted from a trained network which describe the behavior approximately and generalize to arbitrarily long sequences [5,26]. However, all these approaches are naturally limited due to the fact that common connectionistic data are subject to noise. Adequate recursive processing relies to some extent on the accuracy of the computation and the input data. The capacity is different if noise is present: At most finite state automata can be simulated, assuming the support of the noise is limited. If the support of the noise is not limited, e.g. if the noise is Gaussian, then the capacity reduces to the capacity of simple feedforward dynamics with a finite time window [21,22]. Hence recurrent networks can algorithmically process symbolic data in a finite approximation, but the presence of noise limits their capacities.

Learning Algorithms Naturally, an alternative point of view is the classical statistical scenario, i.e., possibly noisy data allow one to learn an unknown regularity with high accuracy and confidence for data of high probability. In particular, the behavior need not be correct for every input; the learning algorithms are only guaranteed to work well in typical cases, and in unlikely situations the system may fail. The classical PAC setting as introduced by Valiant formalizes this approach to learnability [42] as follows: Some unknown regularity, for which only a finite set of examples is available, is to be learned. A learning algorithm chooses a function from a specified class of functions, e.g. given by a neural architecture, based on the training examples. There are two demands: The output of the algorithm should nearly coincide with the unknown regularity; mathematically, the probability that the algorithm outputs a function which differs considerably from the function to be learned should be small. Moreover, the algorithm should run in polynomial time, the parameters being the desired accuracy and confidence of the algorithm. Usually, learning separates into two steps as depicted in Fig. 4: First, a function class with limited capacity is chosen, e.g. the number of neurons and weights is fixed, such that the function class is large enough to approximate the regularity to be learned and, at the same time, allows identification of an approximation based on the available training set, i.e., guarantees valid generalization to unseen samples. This is commonly addressed by the term structural risk minimization and obtained via a control of the so-called pseudodimension of the function class. We will address this topic later.
In a second step, a concrete regularity is actually searched for in the specified function class, commonly via so-called empirical risk minimization, i.e., a function is chosen



Fig. 4. Structural and empirical risk minimization: within nested function classes of increasing complexity, a function class is chosen in a first step; in a second step the empirical error is minimized within this class. The function f is to be learned, f̂ is its empirical approximation, ĝ is the output of the algorithm, and the deviation of ĝ from f is the generalization error.

which nearly coincides with the regularity to be learned on the training examples. According to these two steps, the generalization error divides into two parts: The structural error, i.e., the deviation of the empirical error on a finite set of data from the overall error for functions in the specified class, and the empirical error, i.e., the deviation of the output function from the regularity on the training set.

We will shortly summarize various empirical risk minimization techniques for recurrent neural networks: Assume (x_i, y_i), i = 1, …, m, are the training data and some neural architecture computing a function f_w which is parameterized by the weights w is chosen. Often, training algorithms choose appropriate weights by means of minimizing the quadratic error Σ_i d(f_w(x_i), y_i), d being some appropriate distance, e.g. the Euclidean distance. Since in popular cases the above term is differentiable with respect to the weights, a simple gradient descent can be used. The derivative with respect to one weight decomposes into various terms according to the sequential structure of the inputs and outputs, i.e., the number of recursive applications of the transition functions. A direct recursive computation of the single terms has the complexity O(|w|² · t), |w| being the number of weights and t being the number of recurrent steps. In so-called real time recurrent learning, these weight updates are performed immediately after the computation such that initially unlimited time series can be processed. This method can be applied in online learning in robotics, for example. In analogy to standard backpropagation, the most popular learning algorithm for feedforward networks, one can speed up the computation and obtain the derivatives in time O(|w| · t) via first propagating the signals forward through the entire network and all recursive steps and afterwards propagating the error signals backwards through the network and all recursive steps. However, the possibility of online adaptation while a sequence is still processed is lost in this so-called backpropagation through time [28,44]. There exist combinations of both methods and variations for training continuous systems [33]. The true gradient is sometimes substituted by a truncated gradient in earlier approaches [6]. Since theoretical investigation suggests that pure gradient descent techniques will likely suffer from numerical instabilities (the gradients will either blow up or vanish at propagation through the recursive steps), alternative methods propose random guessing, statistical approaches like the EM algorithm, or an explicit normalization of the error like LSTM [1,12]. Practice shows that training recurrent networks is harder than training feedforward networks due to numerically ill-behaved gradients as shown in [2]. Hence the complexity of training recurrent networks is a very interesting topic; moreover, the fact that


the empirical error can be minimized efficiently is one ingredient of PAC learnability. Unfortunately, precise theoretical investigations can be found only for very limited situations: It has been proved that fixed recurrent architectures with the perceptron activation function can be trained in polynomial time [11]. Things change if architectural parameters are allowed to vary. This means that the number of input neurons, for example, may change from one training problem to the next since most learning algorithms are uniform with respect to the architectural size. In this case, almost every realistic situation is NP-hard already for feedforward networks, although this has not yet been proved for a sufficiently general scenario. One recent result reads as follows: Assume there is given a multilayer perceptron architecture where the number of input neurons is allowed to vary from one instance to the next instance, the input biases are dropped, and no solution without errors exists. Then it is NP-hard to find a network such that the number of misclassified points of the network compared to the optimum achievable number is limited by a term which may even be exponential in the network size [4]. People are working on adequate generalizations to more general or typical situations.

Approximation Ability The ability of recurrent neural networks to simulate Turing machines manifests their enormous capacity. From a statistical point of view, we are interested in a slightly different question: Given some finite set of examples (x_i, y_i), where the inputs or outputs may be sequences, does there exist a network which maps each x_i approximately onto the corresponding y_i? Which are the required resources? If there is an underlying mapping, can it be approximated in an appropriate sense, too? The difference to the previous argumentation consists in the fact that there need not be a recursive underlying regularity producing y_i from x_i. At the same time we do not require to interpolate or simulate the underlying, possibly non-recursive, behavior precisely in the long term limit. One way to attack the above questions consists in a division of the problem into three parts: It is to be shown that sequences can be encoded or decoded, respectively, with a neural network, and that the induced mapping on the connectionistic representation can be approximated with a standard feedforward network. There exist two natural ways of encoding sequences in a finite dimensional vector space: Sequences of length at most t can be written in a vector space of dimension proportional to t, filling the empty spaces, if any, with dummy entries; we refer to this coding as vector-coding. Alternatively, the single entries in a sequence can be cut to a fixed precision and concatenated in a single real number; we refer to this method as real-value-coding. Hence the sequence is turned into a single real number whose digits enumerate the truncated entries, as an example. One can show that both codings can be computed with a recurrent network. Vector-encoding and decoding can be performed with a network whose number of neurons is linear in the maximum input length and which possesses an appropriate activation function. Real-value-encoding is possible with only a fixed number of neurons for purely symbolic data, i.e., inputs from Σ* for a finite alphabet Σ. Sequences of arbitrary real vectors require additional neurons which compute the discretization of the real values. Naturally, precise decoding of the discretization is not possible since this information is lost in the coding. Encoding such that unique codes result can be performed with a number of neurons which grows with the number of sequences to be encoded.
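A minimal sketch of the two codings for purely symbolic data; the padding value, base, and precision are illustrative choices and not the conventions of [11].

```python
import numpy as np

def vector_code(seq, max_len, pad=0.0):
    """Vector-coding: write the sequence into a vector of fixed dimension,
    filling the empty places with a dummy entry."""
    return np.array(list(seq) + [pad] * (max_len - len(seq)))

def real_value_code(seq, base=10):
    """Real-value-coding for symbols from {1, ..., base-1}: concatenate the
    symbols as digits of a single real number, e.g. [3, 1, 2] -> 0.312."""
    return sum(s * base ** -(i + 1) for i, s in enumerate(seq))

def real_value_decode(code, base=10, eps=1e-9):
    """Read the digits back; exact only for purely symbolic data."""
    seq = []
    while code > eps:
        code *= base
        digit = int(code + eps)
        seq.append(digit)
        code -= digit
    return seq

print(vector_code([3, 1, 2], max_len=5))               # [3. 1. 2. 0. 0.]
print(real_value_code([3, 1, 2]))                      # 0.312 (up to rounding)
print(real_value_decode(real_value_code([3, 1, 2])))   # [3, 1, 2]
```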
Decoding real-value codes is possible, too. However, a standard activation function requires a number of neurons increasing with the maximum length even for symbolic data [11]. It is well known that feedforward networks with one hidden layer and appropriate activation function are universal approximators. Hence one can conclude that approximation of general functions is possible if the above encoding or decoding networks are combined with a standard feedforward network which approximates the induced mappings on the connectionistic codes. To be more precise, approximating measurable functions on inputs of arbitrarily high probability is possible through real-value encoding. Each continuous function can be approximated for inputs from a compact set through vector-encoding. In the latter case, the dimension used for the connectionistic representation necessarily increases for increasing length of the sequences [11].

Learnability Having settled the universal approximation ability, we should make sure that the structural risk can be controlled within a fixed neural architecture. I.e., we have to show that a finite number of training examples is sufficient in order to nearly specify the unknown underlying regularity. Assume there is fixed some probability measure P on the inputs. For the moment assume that we deal with real-valued outputs only. Then one standard way to guarantee the above property for a function class F is via the so-called uniform convergence of empirical distances (UCED) property, i.e.,

  P^m ( sup_{f ∈ F} | ê_m(f) − e(f) | > ε ) → 0 for m → ∞ and every ε > 0,

where e(f) is the real error and ê_m(f) is the empirical error on the m training examples. The UCED property guarantees that the empirical error of any learning algorithm is representative for the real generalization error. We refer to the above distance as the risk. A standard way to prove the UCED property consists in an estimation of a combinatorial quantity, the pseudodimension.

Definition 5. The pseudodimension of a function class F, VC(F), is the largest cardinality (possibly infinite) of a set of points which can be shattered. A set of points x_1, …, x_d is shattered if reference points r_1, …, r_d exist such that for every binary vector (b_1, …, b_d) some function f ∈ F exists with f(x_i) ≥ r_i ⇔ b_i = 1.

The pseudodimension measures the richness of a function class. It is the size of the largest set of points such that every possible binary mapping can be realized on these points. No generalization can be expected if a training set can be shattered. It is well known that the UCED property holds if the pseudodimension of a function class is finite [43]. Moreover, the number of examples required for valid generalization can be explicitly limited by a term of roughly the order d/ε², d being the pseudodimension and ε the required accuracy. Assume F is given by a recurrent architecture with a fixed number of weights. Denote by F_t the restriction to inputs of length at most t. Then one can limit VC(F_t) by a polynomial in t and the number of weights. However, lower bounds exist which show that the pseudodimension necessarily depends on t in most interesting cases [17]. Hence VC(F) is infinite for unrestricted


sequences. As a consequence, the above argumentation proves learnability only for restricted inputs. Moreover, since a finite pseudodimension (more precisely, a finite so-called fat-shattering dimension) is necessary for distribution independent learnability under realistic conditions, distribution independent bounds for the risk cannot exist in principle [43]. Hence one has to add special considerations to the standard argumentation for recurrent architectures. Mainly two possibilities can be found in the literature: One can either take specific knowledge about the underlying probability into consideration, or one can derive posterior bounds which depend on the specific training set. The results are as follows [11]: Assume that for every ε one can find a length t such that the probability of sequences of length larger than t is bounded from above by ε. Then the risk is limited accordingly, provided that the number of examples is roughly of the order of d_t/ε², d_t being the (finite) pseudodimension of the architecture restricted to input sequences of length at most t. Assume training on a set of size m and maximum length t has been performed. Then the risk can be bounded by a term which vanishes with increasing m and involves the (finite) pseudodimension of the architecture restricted to input sequences of length at most t. A more detailed analysis even allows one to drop the long sequences before measuring the maximum length [10]. Hence one can guarantee valid generalization, although only with additional considerations compared to the feedforward case. Moreover, there may exist particularly ugly situations for recurrent networks where training is possible only with an exponentially increasing number of training examples [11]. This is the price one has to pay for the possibility of dealing with structured data, in particular data with a priori unlimited length. Note that the above argumentation holds only for architectures with real values as outputs. The case of structured outputs requires a more advanced analysis via so-called loss functions and yields similar results [10].

4 Advanced Architectures
The next step is to go from sequences to tree structured data. Since trees cover terms and formulas, this is a fairly general approach. The network dynamics and theoretical investigations are direct generalizations of simple recurrent networks. One can obtain a recursive neural encoding enc_f and a recursive neural decoding dec_g of trees if f and g are computed by standard networks. These codings can be composed with standard networks for the approximation of general functions. Depending on whether the inputs, the outputs, or both may be structured and depending on which part is trainable, we obtain different connectionistic mechanisms. A sketch of the first two mechanisms which are described in the following can be found in Fig. 5.

Recursive Autoassociative Memory The recursive autoassociative memory (RAAM) as introduced by Pollack and generalized by Sperduti and Starita [30,40] consists of a recursive encoding enc_f, a recursive decoding dec_g, f and g being standard feedforward networks, and a standard feedforward network h. An appropriate composition of these parts can approximate mappings where the inputs or the outputs may be k-trees or vectors, respectively. Training proceeds in


Fig. 5. Processing tree structures with connectionistic methods. Top (RAAM): a tree with labels a, b, c, d, e, f is recursively encoded and afterwards decoded again. Bottom (folding networks): the tree is only encoded, and the resulting code is processed further.

two steps: first, the composition dec_g ∘ enc_f is trained on the identity on a given training set with truncated gradient descent such that the two parts constitute a proper encoding or decoding, respectively. Afterwards, a standard feedforward network is combined with either the encoding or the decoding and trained via standard backpropagation where the weights in the recursive coding are fixed. Hence arbitrary mappings on structured data can be approximated. Note that the encoding is fitted to the specific training set. It is not fitted to the specific approximation task. In all cases encoding and decoding must be learned even if only the inputs or only the outputs are structured. In analogy to simple recurrent networks the following questions arise: Can any mapping be approximated in principle? Do the respective parts show valid generalization? Is training efficient? We will not consider the efficiency of training in the following since the question is not yet satisfactorily answered for feedforward networks. The other questions are to be answered for both the coding parts and the feedforward approximation on the encoded values. Note that the latter task only deals with standard feedforward networks whose approximation and generalization properties are well established. Concerning the approximation capability of the coding parts we can borrow ideas from recurrent networks: A natural encoding of tree structured data consists in the prefix representation of a tree. For example, a k-tree can uniquely be represented by the sequence of its labels in prefix order, including explicit symbols for the empty tree. Depending on whether real-labeled trees are to be encoded precisely, or only symbolic data, i.e., labels from a finite set Σ, are dealt with, or a finite approximation of the real values is sufficient, the above sequence can be encoded in


a real-value code with a fixed dimension or a vector code whose dimension depends on the maximum height of the trees. The respective encoding or decoding can be computed with recursive architectures induced by a standard feedforward network [11]. The required resources are as follows: Vector-coding requires a number of neurons which increases exponentially with the maximum height of the trees. Real-value-encoding requires only a fixed number of neurons for symbolic data and a number of neurons which is quadratic in the number of patterns for real-valued labels. Real-value-decoding requires a number of neurons which increases exponentially with increasing height of the trees; the argument consists in a lower bound on the pseudodimension of function classes which perform proper decoding. This number increases more than exponentially in the height. Learning the coding yields valid generalization provided prior information about the input distribution is available. Alternatively, one can derive posterior bounds on the generalization error which depend on the concrete training set. These results follow in the same way as for standard recurrent networks. Hence the RAAM constitutes a promising and in principle applicable mechanism. Due to the difficulty of proper decoding, applications can be found for small training examples only [40].

Folding Networks Folding networks use ideas of the LRAAM [19]. They focus on clustering symbolic data, i.e., the outputs are not structured, but real vectors. This limitation makes decoding superfluous. For training, the encoding part and the feedforward network are composed and simultaneously trained on the respective task via a gradient descent method, so-called backpropagation through structure, a generalization of backpropagation through time. Hence the encoding is fitted to the data and the respective learning task. It follows immediately from the above discussion that folding networks can approximate every measurable function in probability using real-value codes, and they can approximate every continuous function on compact input domains with vector codes. Additionally, valid generalization can be guaranteed with a similar argumentation as above, with bounds depending on the input distribution or the concrete training set. Due to the fact that the difficult part, proper decoding, is dropped, several applications of folding networks for large data sets can be found in the literature: classification of terms and formulas, logo recognition, drug design, support of automatic theorem provers, . . . [19,34,35]. Moreover, they can be related to finite tree automata in analogy to the correlation of recurrent networks and finite automata [18].

Holographic Reduced Representation Holographic reduced representation (HRR) is identical to a RAAM with a fixed encoding and decoding: a priori chosen functions given by so-called circular correlation and convolution, respectively [29]. Correlation (denoted by ⊛) and convolution (denoted by ⊗) constitute a specific way to relate two vectors to a vector of the same dimension such that correlation and convolution are approximately inverse to each other, i.e., a ⊛ (a ⊗ b) ≈ b. Hence one can encode a tree with components a, b, c via computing the convolution of each entry with a specific vector indicating the role of the component and adding these three vectors: r_1 ⊗ a + r_2 ⊗ b + r_3 ⊗ c, r_1, r_2, r_3 being the roles. The single entries can be approximately restored via correlation: r_1 ⊛ (r_1 ⊗ a + r_2 ⊗ b + r_3 ⊗ c) ≈ a.


One can compute the deviation in the above equation under statistical assumptions. Commonly, the restored values are accurate provided the dimension of the vectors is sufficiently high, the height of the trees is limited, and the vectors are additionally cleaned up in an associative memory. It follows immediately from our above argumentation that these three conditions are necessary: Decoding is a difficult task which requires, for standard computations, exponentially increasing resources. HRR is used in the literature for storing and recognizing language [29]. Since encoding and decoding are fixed, no further investigation of the approximation or generalization ability is necessary.
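The following sketch shows circular convolution and correlation in plain numpy and the approximate restoration of one bound entry; the dimension and the random role vectors are illustrative, and a clean-up associative memory, as discussed above, would be needed to remove the remaining noise.

```python
import numpy as np

n = 1024
rng = np.random.default_rng(0)
rand_vec = lambda: rng.normal(0.0, 1.0 / np.sqrt(n), size=n)   # HRR-style random vectors

def cconv(a, b):    # circular convolution, used for binding a role to a filler
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):    # circular correlation, the approximate inverse of binding
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

# encode a node with entries a, b, c bound to role vectors r1, r2, r3
a, b, c = rand_vec(), rand_vec(), rand_vec()
r1, r2, r3 = rand_vec(), rand_vec(), rand_vec()
code = cconv(r1, a) + cconv(r2, b) + cconv(r3, c)

# approximately restore the first entry and compare it with all candidates
restored = ccorr(r1, code)
cosine = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print([round(cosine(restored, v), 2) for v in (a, b, c)])   # highest similarity for a
```

The restored vector is noisy but clearly closer to a than to the other fillers, which is why a clean-up memory holding the known symbol vectors suffices to recover the exact entry.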

5 Conclusions
Combinations of symbolic and connectionistic systems, more precisely, connectionistic systems processing symbolic data, have been investigated. A particular difficulty consists in the fact that the informational content of symbolic data is not limited a priori. Hence a priori unlimited length is to be mapped to a connectionistic vector representation. We have focused on recurrent systems which map the unlimited length to a priori unlimited processing time. Simple recurrent neural networks constitute a well established model. Apart from the simplicity of the data they process, sequences, the main theoretical properties are the same as for advanced mechanisms. One can investigate algorithmic or statistical aspects of learning, the first ones being induced by the nature of the data, the second ones by the nature of the connectionistic system. We covered algorithmic aspects mainly in comparison to standard mechanisms. Although of merely theoretical interest, the enormous capacity of recurrent networks has become apparent. Concerning statistical learning theory, satisfactory results for the universal approximation capability and the generalization ability have been established, although generalization can only be guaranteed if specifics of the data are taken into account. The idea of coding leads to an immediate generalization to tree structured data. Well established approaches like RAAM, HRR, and folding networks fall within this general definition. The theory established for recurrent networks can be generalized to these advanced approaches immediately. The in-principle statistical learnability of these mechanisms follows. However, some specific situations might be extremely difficult: Decoding requires an increasing amount of resources. Hence the RAAM is applicable for small data only, decoding in HRR requires an additional clean-up, whereas folding networks can be found in real world applications. Nevertheless, the results are encouraging since they prove the possibility to process symbolic data with neural networks and constitute a theoretical foundation for the success of some of the above mentioned methods. Unfortunately, the general approaches neither generalize to cyclic structures like graphs, nor do they provide biological plausibility that could help explain human recognition of these data. For both aspects, fully dynamic approaches would be more promising, although it would be more difficult to find effective training algorithms for practical applications.


References
1. Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropagation. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, Volume 5. Morgan Kaufmann, 1994. 2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 1994. 3. L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer, 1998. 4. B. DasGupta and B. Hammer. On approximate learning by multi-layered feedforward circuits. In: H. Arimura, S. Jain, A. Sharma (eds.), Algorithmic Learning Theory 2000, Springer, 2000. 5. M. W. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained neural networks. In: Proceedings of the Eleventh International Conference on Machine Learning, Morgan Kaufmann, 1994. 6. J. L. Elman. Finding structure in time. Cognitive Science, 14, 1990. 7. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data sequences. IEEE Transactions on Neural Networks, 9(5), 1997. 8. C. L. Giles, G. M. Kuhn, and R. J. Williams. Special issue on dynamic recurrent neural networks. IEEE Transactions on Neural Networks, 5(2), 1994. 9. M. Gori, M. Mozer, A. C. Tsoi, and R. L. Watrous. Special issue on recurrent neural networks for sequence processing. Neurocomputing, 15(3-4), 1997. 10. B. Hammer. Approximation and generalization issues of recurrent networks dealing with structured data. In: P. Frasconi, M. Gori, F. Kurfess, and A. Sperduti (eds.), Proceedings of the ECAI workshop on Foundations of connectionist-symbolic integration: representation, paradigms, and algorithms, 2000. 11. B. Hammer. Learning with recurrent neural networks. Lecture Notes in Control and Information Sciences 254, Springer, 2000. 12. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997. 13. S. Hölldobler, Y. Kalinke, and H. Lehmann. Designing a Counter: Another Case Study of Dynamics and Activation Landscapes in Recurrent Networks. In G. Brewka, C. Habel, and B. Nebel (eds.): KI-97: Advances in Artificial Intelligence, Proceedings of the 21st German Conference on Artificial Intelligence, LNAI 1303, Springer, 1997. 14. J. J. Hopfield and D. W. Tank. Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 1985. 15. J. E. Hummel and K. L. Holyoak. Distributed representation of structure: a theory of analogical access and mapping. Psychological Review, 104, 1997. 16. J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks. Information and Computation, 128, 1996. 17. P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54, 1997. 18. A. Küchler. On the correspondence between neural folding architectures and tree automata. Technical report, University of Ulm, 1998. 19. A. Küchler and C. Goller. Inductive learning in symbolic domains using structure-driven recurrent neural networks. In G. Görz and S. Hölldobler, editors, KI-96: Advances in Artificial Intelligence. Springer, 1996. 20. S. Lawrence, C. L. Giles, and S. Fong. Can recurrent neural networks learn natural language grammars? In: International Conference on Neural Networks, IEEE Press, 1996.


21. W. Maass and P. Orponen. On the effect of analog noise in discrete-time analog computation. Neural Computation, 10(5), 1998. 22. W. Maass and E. D. Sontag. Analog neural nets with Gaussian or other common noise distributions cannot recognize arbitrary regular languages. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, Volume 9. The MIT Press, 1998. 23. T. Masters. Neural, Novel, & Hybrid Algorithms for Time Series Prediction. Wiley, 1995. 24. T. Mitchell. Machine Learning. McGraw-Hill, 1997. 25. M. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and N. Gershenfeld, editors, Predicting the future and understanding the past. Addison-Wesley, 1993. 26. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 1996. 27. C. Omlin and C. Giles. Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM, 43(2), 1996. 28. B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1995. 29. T. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 1995. 30. J. Pollack. Recursive distributed representation. Artificial Intelligence, 46, 1990. 31. M. Reczko. Protein secondary structure prediction with partially recurrent neural networks. SAR and QSAR in environmental research, 1, 1993. 32. M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation: The RPROP algorithm. In Proceedings of the Sixth International Conference on Neural Networks. IEEE, 1993. 33. J. Schmidhuber. A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 1992. 34. T. Schmitt and C. Goller. Relating chemical structure to activity with the structure processing neural folding architecture. In Engineering Applications of Neural Networks, 1998. 35. S. Schulz, A. Küchler, and C. Goller. Some experiments on the applicability of folding architectures to guide theorem proving. In Proceedings of the 10th International FLAIRS Conference, 1997. 36. H. T. Siegelmann. The simple dynamics of super Turing theories. Theoretical Computer Science, 168, 1996. 37. H. T. Siegelmann and E. D. Sontag. Analog computation, neural networks, and circuits. Theoretical Computer Science, 131, 1994. 38. H. T. Siegelmann and E. D. Sontag. On the computational power of neural networks. Journal of Computer and System Sciences, 50, 1995. 39. L. Shastri. Advances in SHRUTI: A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence, 11, 1999. 40. A. Sperduti. Labeling RAAM. Connection Science, 6(4), 1994. 41. J. Suykens, B. De Moor, and J. Vandewalle. Static and dynamic stabilizing neural controllers, applicable to transition between equilibrium points. Neural Networks, 7(5), 1994. 42. L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984. 43. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997. 44. R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications. Erlbaum, 1992.
