
Frontiers in the Convergence of Bioscience and Information Technologies 2007

Improving training speed of Support Vector Machines by creating exploitable trends of Lagrangian variables: an application to DNA splice site detection
Jason Li, Saman K. Halgamuge
Dynamic Systems & Control Group, DMME, The University of Melbourne, VIC 3010, Australia
j.li5@pgrad.unimelb.edu.au

Abstract
Support Vector Machines are state-of-the-art machine learning algorithms that can be used for classification problems such as DNA splice site identification. However, the large number of samples in biological data sets often leads to slow training. Training speed can be improved by removing non-support vectors prior to training. This paper proposes a method to predict non-support vectors with high accuracy through strict-constrained gradient ascent optimisation. Unlike other data pre-selection methods, the proposed gradient-based method is itself a training algorithm for the SVM, and it is also very simple to implement. Comparative experiments are conducted on a DNA splice-site detection problem, and the results show significant speed improvements over other algorithms. The relationship between the speed improvement and the cache memory size is also explored. The generalisation capability of the proposed algorithm is shown to be better than that of some other reformulated SVMs.

1. Introduction
Support vector machines (SVMs) [1, 2] are powerful machine learning algorithms that have been reported as successful in a variety of biological data classification problems, including disease diagnosis and gene expression analysis [3, 4]. Although the performance of the SVM is superior in terms of classification accuracy, its training methodology and speed still have significant room for improvement and remain the focus of much research. Such research is especially important for biomedical data sets, as their high dimensionality and large number of samples often hinder the speed of SVM training.

The SVM classifier can be described as a quadratic programming (QP) problem. Traditional methods for solving this QP problem, such as Newton or quasi-Newton methods, are incapable of handling large datasets due to their O(l²) memory requirement [1]. To tackle this, a decomposition framework has been developed to divide the large problem into smaller subproblems [5, 6]. The well-known Sequential Minimal Optimisation (SMO) [7, 8] and Kernel-AdaTron (KA) [9, 10] training algorithms were also developed to address this issue, aiming to keep the memory requirement at a minimum. However, modern computers possess ample memory, so insisting on minimal memory usage is inefficient and hinders computational speed.

Most biomedical and pattern recognition data sets are extremely high dimensional, meaning that the computation of a kernel entry can be very expensive. To address this problem, the idea of caching has emerged [11]. Caching refers to the process of storing the values of kernel entries in a computer's physical memory to avoid repeated computation; the physical memory used for this purpose is called the cache. Caching allows practitioners to strike a balance between memory usage and the time required for training. The effect of caching on training time is demonstrated in Fig. 1. Note the tremendous time saved when the whole kernel matrix can fit into memory (100%).

Fig. 1: Training time of SMO (in seconds) on a splice-site detection dataset versus the amount of cache memory available for storing kernel entries
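As a generic illustration of the caching idea described above (not the caching policy of SMO or of any particular SVM package), a kernel cache can be as simple as a bounded memoisation of kernel entries. The RBF kernel, the cache capacity, and the function name below are placeholder assumptions.

```python
import numpy as np
from functools import lru_cache

def make_cached_kernel(X, gamma=0.1, max_entries=200_000):
    """Return a kernel function k(i, j) whose values are cached so that
    repeated requests during training avoid recomputing expensive entries."""
    @lru_cache(maxsize=max_entries)      # bounded cache of kernel values
    def k(i, j):
        if i > j:                        # exploit symmetry: K(i, j) = K(j, i)
            i, j = j, i
        d2 = float(np.sum((X[i] - X[j]) ** 2))
        return float(np.exp(-gamma * d2))
    return k
```

A larger `max_entries` trades memory for fewer recomputations, which is exactly the balance that Fig. 1 illustrates.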




The work presented in this paper has been motivated by the need to reduce training time, especially for cases where the available cache memory can store only a fraction of the SVM kernel matrix. The proposed method integrates a specially tailored version of constrained gradient ascent (CGA) [12] with Keerthi's modified version of SMO [7] (with caching) to provide a two-stage training scheme: the proposed extended CGA serves as a fast preliminary training step that identifies potential support vectors, while SMO fine-tunes the solution values. We will show that the proposed method, although it involves data removal, can achieve better classification accuracy than LS-SVM [13] and RSVM [14], two of the more popular algorithms.

2. Data reduction by CGA

2.1. The proposed constrained gradient ascent (CGA) algorithm

The CGA algorithm we propose comprises the first stage of training. Different types of constrained gradient methods have been reported in the literature and applied to a variety of optimisation problems [12, 15, 16]. In this work, we utilise the simplest form, strict gradient ascent, further develop it to incorporate the constraints imposed by the SVM, and provide a simple and fast implementation. More specifically, the CGA algorithm has been developed as follows:

A. The simplest case without inequality constraints. A simple form of the constrained gradient method is first considered, ignoring all inequalities of the SVM. This sets the framework for the further derivation of our algorithm and highlights the computational simplicity of CGA.

B. Formulation with equalities. A mathematical model is then developed that describes how the Lagrangian variables of the SVM optimisation problem are updated.

C. Implementation. Pseudocode is developed describing the computational procedure that efficiently implements the associated mathematical model.

D. Optimal learning rate. One unavoidable parameter of CGA is the learning rate. We have developed a method to approximate the theoretically optimal learning rate with computational time taken into consideration.

The details of these steps are available in the accompanying publication in the Journal of Biomedicine and Biotechnology.
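As a rough illustration of steps A-C, the sketch below performs one batch epoch of gradient ascent on the soft-margin SVM dual, with the box constraints enforced by clipping and the equality constraint handled by a simple projection. The RBF kernel, the fixed learning rate eta, and the projection step are assumptions made for this example; the paper's own equality-constraint formulation and optimal learning rate (step D) are given in the accompanying journal publication and are not reproduced here.

```python
import numpy as np

def rbf_kernel(X, gamma=0.1):
    # Full kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def cga_epoch(alpha, K, y, C, eta):
    """One batch update of gradient ascent on the SVM dual
    W(alpha) = sum(alpha) - 0.5 * alpha' Q alpha, with Q_ij = y_i y_j K_ij."""
    Q = (y[:, None] * y[None, :]) * K
    grad = 1.0 - Q @ alpha                 # gradient of the dual objective
    alpha = alpha + eta * grad             # ascent step on all alphas at once
    alpha = np.clip(alpha, 0.0, C)         # box constraints 0 <= alpha_i <= C
    alpha -= y * (alpha @ y) / len(y)      # crude projection towards sum(alpha_i * y_i) = 0
    return np.clip(alpha, 0.0, C)          # re-clip after the projection
```

Because each epoch updates every alpha simultaneously using only matrix-vector products, the cost per iteration is low, which is what makes CGA attractive as a preliminary pass.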

2.2. CGA as the pre-training step

The proposed two-stage training approach, with CGA as the pre-training step to SMO, aims to exploit the strengths of both algorithms to provide an overall faster training method. The proposed CGA trains the data in batch and each of its iterations is very fast. This property allows it to quickly identify potential support vectors, making it well suited as a preliminary training step. The fine-tuning ability of CGA, however, is limited by numeric precision and by possibly ill-conditioned problems. SMO does not face the same problem in this regard, since its training rests on a completely different basis: heuristics and analytical solutions. The disadvantage of SMO lies in its scalability to large datasets: it has time complexity O(nL), where n is the number of training samples and L the number of candidate support vectors during training [11]. Pre-determining candidate support vectors and discarding the rest using CGA helps to reduce both L and n, and thus improves the training speed. These complementary strengths and weaknesses suggest that a combined approach is desirable.

The methodology of the two-stage training is largely based on the behaviour of the SVM Lagrangian variables (α) under CGA training. As our results indicate, the α values follow the behaviours illustrated in Fig. 2 and Fig. 3 below, for non-support vectors and support vectors respectively. These patterns of behaviour are a result of strict gradient ascent; they will not appear if non-strict gradient methods are used. The graphs show that all α values initially increase, regardless of whether they later become support vectors (i.e., α > 0) or not. This initial increase is an intrinsic property of the SVM objective function. After a period of time, however, the α values of non-support vectors drop back to zero.

Fig. 2: The plot of alpha values of a non-support vector against the training epochs in CGA
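The trends in Fig. 2 and Fig. 3 suggest a simple pruning rule: run CGA for a number of epochs, flag samples whose alpha rose and has since dropped back to zero as likely non-support vectors, and pass only the remaining samples to SMO. The sketch below, reusing rbf_kernel and cga_epoch from the previous example, is only an illustration of that rule; the epoch budget, the numerical thresholds, and the exact transition criterion (which in the proposed method also depends on the available cache size) are assumptions.

```python
import numpy as np

def prune_with_cga(X, y, C=1.0, eta=0.01, epochs=50, gamma=0.1, tol=1e-8):
    """Illustrative first stage: CGA epochs, then removal of samples whose
    alpha rose above zero but has fallen back, i.e. likely non-support vectors."""
    K = rbf_kernel(X, gamma)
    alpha = np.zeros(len(y))
    ever_positive = np.zeros(len(y), dtype=bool)
    for _ in range(epochs):
        alpha = cga_epoch(alpha, K, y, C, eta)
        ever_positive |= alpha > tol          # alpha has risen at some point
    dropped_back = ever_positive & (alpha <= tol)
    keep = ~dropped_back                      # retained samples go on to SMO
    return X[keep], y[keep], alpha[keep]
```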


Fig. 3: The plot of alpha values of a support vector against the training epochs in CGA

Unlike other data extraction techniques described previously, CGA also takes into account the available cache memory size. This allows the subsequent SMO training to be more effective.

3. Results

The proposed method possesses simplicity and an analytical foundation, two crucial characteristics for algorithmic success, as demonstrated by SMO [17]. The use of CGA as a pre-training step helps to work around a poor caching situation by allowing a large data set to be reduced according to the size of the cache memory.

A splice-site detection dataset from StatLog [18] has been used to evaluate the proposed method. For comparability, both CGA and SMO are implemented under the same settings. Results of other SVM algorithms are obtained from their respective publications.

Table 1 shows that the speed improvement is most significant when the cache memory size is 94% of the size required for storing the full kernel matrix. This indicates the point of best balance between the two stages of training. Note that 94% of the memory corresponds to a coverage of 75% of the data points, since only half of the kernel needs to be stored in memory due to symmetry. This means that a 25% data reduction with CGA is the most effective for this splice-site detection problem. Nevertheless, there is an overall improvement in speed regardless of the cache size.

Since the α values might not follow the behaviours in Fig. 2 and Fig. 3 in circumstances where we have extreme kernel values and precision restrictions, it is possible for some alphas to be incorrectly removed during the first-stage training with CGA. Consequently, classification accuracy could be affected. Table 2 shows that the classification accuracy of a CGA-reduced problem is slightly lower for donor-site detection. However, we have also compared the accuracy with Least-Squares SVM and Reduced SVM (Table 3), and the comparison shows that the proposed two-stage method does not degrade the performance as much as those reformulations do.

4. Conclusion

Both CGA and SMO have the merit of simplicity in implementation. We propose a method that combines CGA with SMO to provide faster training for SVM classifiers. In terms of training speed, the two-stage training scheme brings significant improvement on the splice-site data set, although the amount of improvement is not steady across different cache sizes. Experiments also indicate that the classification accuracy of the two-stage SVM is sometimes slightly worse than that of the standard SVM, because practical data sets can be ill-conditioned and practical learning rates are finite. Future work includes developing a better criterion for the transition from CGA to SMO such that real support vectors are preserved. The possibility of prefixing CGA to algorithms other than SMO will also be explored.

5. References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge, England: Cambridge University Press, 2000.
[2] V. Vapnik, Statistical Learning Theory. NY: Wiley, 1998.
[3] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data using support vector machines," in Proc. National Academy of Sciences, vol. 97, 2000, pp. 262-267.
[4] S. Liu, Q. Song, W. Hu, and A. Cao, "Diseases classification using support vector machine (SVM)," in Proc. 9th Intl. Conf. Neural Information Processing, vol. 2, 2002, pp. 760-763.
[5] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1998.
[6] C. J. Lin, "On the convergence of the decomposition method for support vector machines," IEEE Trans. Neural Networks, vol. 12, 2001.
[7] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, pp. 637-649, 2001.
[8] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1998.
[9] C. Campbell and N. Cristianini, "Simple learning algorithms for training support vector machines," Technical Report, University of Bristol, 1998.
[10] T. Friess, N. Cristianini, and C. Campbell, "The kernel-Adatron algorithm: a fast and simple learning procedure for support vector machines," in Machine Learning: Proc. of the 15th International Conf., J. Shavlik, Ed. San Francisco: Morgan Kaufmann Publishers, 1998.
[11] J. X. Dong, A. Krzyzak, and C. Y. Suen, "A fast SVM training algorithm," Intl. J. Pattern Recognition and Artificial Intelligence, vol. 17, pp. 367-384, 2003.
[12] A. A. Hasan and M. A. Hasan, "Constrained gradient descent and line search for solving optimization problem with elliptic constraints," in Proc. Intl. Conf. Acoustics, Speech, and Signal Processing, vol. 2, 2003, pp. 763-796.
[13] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, pp. 293-300, 1999.
[14] K. M. Lin and C. J. Lin, "A study on reduced support vector machines," IEEE Trans. Neural Networks, vol. 14, pp. 1449-1459, 2003.
[15] Z. Wang and E. P. Simoncelli, "Stimulus synthesis for efficient evaluation and refinement of perceptual image quality metrics," in Proc. Human Vision and Electronic Imaging IX, vol. 5292, 2004.
[16] H. K. Zhao, B. Merriman, S. Osher, and L. Wang, "Capturing the behaviour of bubbles and drops using the variational level set approach," J. Computational Physics, vol. 143, pp. 495-518, 1998.
[17] V. Kecman, M. Vogt, and T. M. Huang, "On the equality of kernel AdaTron and sequential minimal optimization in classification and regression tasks and alike algorithms for kernel machines," in Proc. 11th European Symposium on Artificial Neural Networks, Bruges, Belgium, 2003.
[18] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Machine Learning, Neural and Statistical Classification. Englewood Cliffs, NJ: Prentice Hall, 1994.

Table 1: Speed improvement with the two-stage training approach, classifying "Acceptor Site or Not" and "Donor Site or Not"

Kernel cache size (% of memory required to store the full kernel matrix):
                                     100%    94%     75%     44%     0% (no cache)
Speed improvement, Acceptor Site:    +86%    +453%   +367%   +297%   +153%
Speed improvement, Donor Site:       +29%    +143%   +121%   +114%   +74%

Table 2: Classification accuracy on the Statlog splice-site data set, showing the effect of data reduction due to CGA

                               Acceptor Site or Not           Donor Site or Not
                               Train (%)     Test (%)         Train (%)     Test (%)
SVM (SMO)                      100           97.302           100           96.46
Two-staged SVM (CGA+SMO)       100           97.302           99.8          95.6

Table 3: Comparison of classification accuracy with different versions of SVM. Data for LS-SVM and RSVM obtained from [14].

                               Accuracy on test set (average, %)
Original SVM (with SMO)        96.881
Two-staged SVM (CGA+SMO)       96.451
LS-SVM                         93.086
RSVM                           93.002

