TI Journals
ISSN: 2222-2510
www.tijournals.com
pp. 68-72

Sina Darjazi
Faculty of Mathematical Sciences, Department of Statistics, University of Guilan, Rasht, Iran
*Corresponding author: Behrouz.fathi@gmail.com
Abstract
The use of common random numbers (CRN) is known as a good way to reduce the variability of the gradient estimates of gradient-free algorithms such as simultaneous perturbation stochastic approximation (SPSA). It has been proven that common random numbers are the optimal choice when the inverse transform method is used to generate the random variables. In this paper, we show that using the Newton-Raphson method to generate random variables within CRN also provides an appropriate solution for SPSA.
1. Introduction
Finite difference stochastic approximation (FDSA) and simultaneous perturbation stochastic approximation (SPSA) are efficient gradient-free methods for minimizing a loss function L(θ) in the presence of random noise. SPSA needs only 2 loss-function evaluations per gradient estimate, against 2p for FDSA, so SPSA is superior to FDSA in terms of speed and memory. SPSA is applicable in a variety of fields such as passive filter planning, wireless sensor networks, and neural networks [4, 5, 6]. An important component of SPSA is the gradient estimate, which has the form of a difference, and by using CRN the variance of this difference can be reduced. The CRN method, independently of optimization, was formulated in the simulation context to reduce the variability of the difference of two random vectors [2, 3, 8, 9]. The main idea comes from minimizing var(X − Y), where X, Y are random variables [9]. Given the distributions of X and Y, the minimum occurs when cov(X, Y) is maximized, that is, when X and Y behave as similarly as possible. The CRN method achieves this goal by taking the same path in the generation of X and Y. CRN is therefore proposed as a simulation-based variance reduction method for SPSA. Reducing the variance of the gradient estimate reduces the variability of the SPSA iterates and ultimately leads to faster convergence. Theoretically, CRN is the optimal choice when the inverse transform method is used to generate the random variables [8, 9]. This requirement seems a bit restrictive, because for many variables no simple form of the inverse distribution function can be found, and we need other ways to generate them. In this paper we use the Newton-Raphson method to solve the equation F(X) = U, instead of the explicit inverse transform, within the theory of CRN.
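As an illustration of the CRN idea (our own sketch, not from the paper), the following code generates two exponential variates by the inverse transform method, once from independent uniforms and once from a common uniform, and compares the variance of the difference; the rates 1 and 2 are arbitrary choices:

```python
import math
import random
import statistics

def sample_diff(n, crn, seed=0):
    """Sample X - Y with X ~ Exp(1), Y ~ Exp(2), via inverse transform.
    With CRN, both variates are driven by the same uniform draw."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        u1 = rng.random()
        u2 = u1 if crn else rng.random()
        x = -math.log(1.0 - u1)          # F^(-1) for Exp(1)
        y = -math.log(1.0 - u2) / 2.0    # F^(-1) for Exp(2)
        diffs.append(x - y)
    return diffs

var_indep = statistics.variance(sample_diff(20000, crn=False))
var_crn = statistics.variance(sample_diff(20000, crn=True))
print(var_indep, var_crn)   # CRN gives a markedly smaller variance
```

Here the theoretical variances are 1.25 for independent generation and 0.25 under CRN, since the common uniform makes X and Y perfectly positively correlated.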
2. The SPSA algorithm

Suppose θ is the vector of parameters of interest, V represents the noise term, and Θ is the domain of allowable values for θ. Our problem is as follows:

min_{θ∈Θ} L(θ).   (1)

The stochastic optimization algorithm for solving (1) is given by the following iterative form:

θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k),  k = 0, 1, 2, …   (2)

where θ̂_k is the estimate of θ at the kth iteration, ĝ_k(·) is an estimate of the gradient of L at θ̂_k, and {a_k}_{k≥0} is a sequence of positive gains.
In the finite difference (FD) method, the lth element of the gradient estimate is calculated as follows:

ĝ_{kl}(θ̂_k) = [y_{kl}^(+) − y_{kl}^(−)] / (2c_k),  l = 1, …, p,   (3)

where y_{kl}^(±) = y(θ̂_k ± c_k e_l), e_l is the lth standard basis vector, and {c_k} is a sequence of positive numbers converging to zero with the condition Σ_{k≥0} a_k²/c_k² < ∞. Therefore, we need 2p loss-function evaluations for one gradient estimate in the FD method.
Application of Newton-Raphson Algorithm in Common Random Numbers for Finding the Optimal Solution in Simultaneous Perturbation Stochastic Approximatio...
World Applied Programming Vol(5), No (4), April, 2015.
In the simultaneous perturbation (SP) method, the lth element of the gradient estimate is calculated as follows:

ĝ_{kl}(θ̂_k) = [y_k^(+) − y_k^(−)] / (2c_k Δ_{kl}),  l = 1, …, p,   (4)

where y_k^(±) = y(θ̂_k ± c_k Δ_k) and Δ_k = (Δ_{k1}, …, Δ_{kp})^T is a vector of mutually independent random variables for which the conditions of [1] hold. Sadegh and Spall have proven that the symmetric Bernoulli distribution for the elements of Δ_k is asymptotically optimal [12]. For small samples, Cao has introduced an effective distribution [14].
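A minimal sketch of the SP estimate (4), assuming symmetric Bernoulli ±1 perturbations (the asymptotically optimal choice [12]) and an illustrative noise-free quadratic loss; all names here are ours, not the paper's:

```python
import random

def sp_gradient(y, theta, c_k, rng):
    """Simultaneous-perturbation gradient estimate, eq. (4):
    all p components from only two evaluations of y."""
    p = len(theta)
    delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]  # symmetric Bernoulli
    y_plus = y([t + c_k * d for t, d in zip(theta, delta)])
    y_minus = y([t - c_k * d for t, d in zip(theta, delta)])
    return [(y_plus - y_minus) / (2.0 * c_k * d) for d in delta]

# Usage: quadratic loss, so the true gradient at theta is 2*theta.
rng = random.Random(1)
theta = [1.0, -2.0, 0.5]
g = sp_gradient(lambda t: sum(x * x for x in t), theta, c_k=0.01, rng=rng)
print(g)
```

A single draw is a noisy estimate (the off-diagonal terms Δ_i/Δ_l do not vanish pointwise), but it is unbiased to first order: averaging many draws recovers the true gradient 2θ.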
3. Common random numbers
Proposition 1: Consider X = (X^(1), …, X^(n)) and Y = (Y^(1), …, Y^(n)), random vectors with independent components from given distributions. Our problem is to find the 2n-dimensional distribution function F_XY such that var(g(X) − h(Y)) is minimal, where the real functions g and h are monotonic in the same direction with respect to the ith component, i = 1, …, n. Suppose U_1 = (u_1^(1), …, u_1^(n)) and U_2 = (u_2^(1), …, u_2^(n)) are vectors with independent components uniformly distributed on [0,1]. Then, by rewriting g(U_1) = g(F_{X^1}^(−1)(u_1^(1)), …, F_{X^n}^(−1)(u_1^(n))) and h(U_2) = h(F_{Y^1}^(−1)(u_2^(1)), …, F_{Y^n}^(−1)(u_2^(n))), the problem is equivalent to finding the minimum of var(g(U_1) − h(U_2)). The VCRN (vector of CRN) choice U_1 = U_2 = U is then an optimal choice for our problem.

Proof: Clearly, the problem of finding the minimum variance is equivalent to the problem of finding the maximum of E(g(U_1)h(U_2)). We want to prove

E(g(U_1)h(U_2)) ≤ E(g(U)h(U)),   (5)

where U = U_1 = U_2. Proceeding component by component, suppose the first m components of U_1 and U_2 are already common; maximizing over the joint distribution of the remaining components shows the bound is attained only if the (m+1)th components are common as well. Thus we conclude that, for the given index m+1, the elements must be common. The last statement is true for any other index, which means that U_1 = U_2 = U is optimal.
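Proposition 1 can be checked numerically. In the hypothetical sketch below, g and h are monotone increasing functions of two inverse-transformed exponential components (the rates are arbitrary), and the variance of g(U_1) − h(U_2) is estimated with and without VCRN:

```python
import math
import random
import statistics

def inv_exp(u, rate):
    """Inverse transform for Exp(rate): F^(-1)(u) = -ln(1-u)/rate."""
    return -math.log(1.0 - u) / rate

def g(u):   # monotone increasing in each u_i
    return inv_exp(u[0], 1.0) + inv_exp(u[1], 3.0)

def h(u):   # a different function, monotone in the same direction
    return 2.0 * inv_exp(u[0], 2.0) + inv_exp(u[1], 1.0)

def var_diff(vcrn, n=20000, seed=7):
    """Estimate var(g(U1) - h(U2)), with U1 = U2 when vcrn is True."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u1 = [rng.random(), rng.random()]
        u2 = u1 if vcrn else [rng.random(), rng.random()]
        out.append(g(u1) - h(u2))
    return statistics.variance(out)

print(var_diff(vcrn=False), var_diff(vcrn=True))
```

With independent uniforms the variance is about 3.1; with VCRN the first components of g and h cancel exactly and the variance drops below 0.5.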
4. Numerical generation of random variables

The inverse transform method is exact when an explicit form of F^(−1) is known, but sometimes F(X) = U must be solved numerically, which requires more computation time. There are many factors in selecting an appropriate algorithm, such as:
1) speed of convergence,
2) guarantee of convergence,
3) knowledge of the density function,
4) prior knowledge.
4.1. Newton-Raphson method
This method converges when F is convex or concave. If f has no explicit form, the method should not be used: the approximation of f(x) by (1/δ)(F(x + δ) − F(x)) is relatively inaccurate, due to the elimination of error terms [10]. The algorithm iterates

X_{n+1} = X_n − (F(X_n) − U) / f(X_n),

with the following stopping criteria:
1) |X_{n+1} − X_n| ≤ ε,
2) |F(X_{n+1}) − U| ≤ ε,
3) |X_{n+1} − X*| ≤ ε,
where X* is the exact solution of F(X) = U. Since X* is not known in practice, the second criterion is the appropriate choice [10].
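The Newton-Raphson inversion and the stopping rule on |F(X) − U| can be sketched as follows; the Rayleigh target and the starting point x0 are our own illustrative choices:

```python
import math

def newton_raphson_inverse(F, f, u, x0, tol=1e-10, max_iter=100):
    """Solve F(x) = u by Newton-Raphson.  Stopping rule: |F(x) - u| <= tol
    (criterion 2, usable because the exact root X* is unknown)."""
    x = x0
    for _ in range(max_iter):
        err = F(x) - u
        if abs(err) <= tol:
            break
        x = x - err / f(x)
    return x

# Rayleigh(sigma): F(x) = 1 - exp(-x^2 / (2 sigma^2)), f = F'.
sigma = 1.5
F = lambda x: 1.0 - math.exp(-x * x / (2.0 * sigma ** 2))
f = lambda x: (x / sigma ** 2) * math.exp(-x * x / (2.0 * sigma ** 2))
u = 0.3
x = newton_raphson_inverse(F, f, u, x0=sigma)
# Cross-check against the explicit inverse transform:
x_exact = sigma * math.sqrt(-2.0 * math.log(1.0 - u))
print(x, x_exact)
```

Starting at the mode x0 = sigma keeps f(x) away from zero; since F is convex below the mode and concave above it, the iteration converges monotonically toward the root from either side.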
5. CRN in the SPSA algorithm

With CRN, the gradient estimate takes the form

ĝ_{kl}(θ̂_k) = [Q(θ̂_k + c_kΔ_k, V_k^(+)) − Q(θ̂_k − c_kΔ_k, V_k^(−))] / (2c_kΔ_{kl}),  l = 1, …, p.   (6)

Proposition 2: Suppose V_k^(+), V_k^(−) are vectors of random effects with independent elements; V_{kl}^(+) and V_{kl}^(−) may be dependent for each l, but V_{kl} and V_{km} with l ≠ m must be independent. Suppose V_k^(+) and V_k^(−) are generated by the inverse transform method from the vectors U_k^(+) and U_k^(−) respectively (where U_k^(+), U_k^(−) are vectors with independent components uniformly distributed on [0,1]). Suppose y_k^(+) and y_k^(−) are monotonic in the same direction with respect to the lth element of V_k^(+) and V_k^(−), for almost all values of θ̂_k. Then var[ĝ_{kl}(θ̂_k) | θ̂_k] is minimized for each l when U_k^(+) = U_k^(−) = U_k.
Proof: According to Proposition 1, var[(y_k^(+) − y_k^(−)) | θ̂_k, Δ_k], and hence var[(y_k^(+) − y_k^(−))/(2c_kΔ_{kl}) | θ̂_k, Δ_k], is minimized when U_k^(+) = U_k^(−) = U_k. Since

E[ ((y_k^(+) − y_k^(−))/(2c_kΔ_{kl}))² | θ̂_k ] = E[ E( ((y_k^(+) − y_k^(−))/(2c_kΔ_{kl}))² | θ̂_k, Δ_k ) | θ̂_k ],

and the conditional mean of ĝ_{kl}(θ̂_k) is unaffected by the joint distribution of (U_k^(+), U_k^(−)), minimizing the inner conditional variance for every Δ_k minimizes var[ĝ_{kl}(θ̂_k) | θ̂_k] as well.
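Proposition 2 can be illustrated with a toy noisy loss of our own choosing (not the paper's example): a quadratic plus noise that is monotone increasing in every v_i, with the noise vectors generated by inverse transform from Exp(1) uniforms. Under CRN the noise cancels in the difference, and the conditional variance of the estimate collapses:

```python
import math
import random
import statistics

def q(theta, v):
    """Illustrative noisy loss: quadratic term plus additive noise,
    monotonically increasing in every v_i."""
    return sum(t * t for t in theta) + sum(v)

def sp_grad_first(theta, c, rng, crn):
    """First component of the SP gradient estimate (6); V^(+), V^(-)
    are Exp(1) vectors produced by inverse transform from uniforms."""
    p = len(theta)
    delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
    u_plus = [rng.random() for _ in range(p)]
    u_minus = u_plus if crn else [rng.random() for _ in range(p)]
    v_plus = [-math.log(1.0 - u) for u in u_plus]
    v_minus = [-math.log(1.0 - u) for u in u_minus]
    y_plus = q([t + c * d for t, d in zip(theta, delta)], v_plus)
    y_minus = q([t - c * d for t, d in zip(theta, delta)], v_minus)
    return (y_plus - y_minus) / (2.0 * c * delta[0])

rng = random.Random(3)
g_indep = [sp_grad_first([1.0, -0.5], 0.1, rng, crn=False) for _ in range(4000)]
g_crn = [sp_grad_first([1.0, -0.5], 0.1, rng, crn=True) for _ in range(4000)]
print(statistics.variance(g_indep), statistics.variance(g_crn))
```

Without CRN the noise difference is divided by 2c_kΔ_{kl}, so its variance scales like 1/c², which is exactly the O(1) effect that CRN removes.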
Under Proposition 2 of [1], for large k, θ̂_k − θ* moves toward zero at a rate proportional to k^(−1/3). Given the constraints on the gain sequences a_k = a/(k+1)^α and c_k = c/(k+1)^γ (a, c positive constants), the fastest possible stochastic rate is proportional to k^(−1/3) for large k. Thus the maximum rate of stochastic convergence of θ̂_k to θ* in the SPSA algorithm without using CRN is O(k^(−1/3)). Under Theorem 2.1 of [7] and, similarly, Proposition 2 of [1], for large k, with 0 < α ≤ 1 and suitable conditions on γ, and using CRN in the gradient estimate of the loss function L(θ), the maximum rate of stochastic convergence of θ̂_k to θ* is proportional to k^(−1/2). Note that the progress made by using CRN comes from the elimination of the O(1) terms in the Taylor expansion of Q(θ̂_k + c_kΔ_k, V_k) − Q(θ̂_k − c_kΔ_k, V_k) in the proof of Theorem 2.1 of [7]; if CRN is not used, with V_k^(+) and V_k^(−) replacing V_k, the convergence rate will not be increased.
6. Algorithm
i. Set k = 0.
ii. Guess θ̂_0.
iii. Choose a, c, α, γ in the sequences a_k = a/(k+1)^α and c_k = c/(k+1)^γ.
iv. Compute a_k = a/(k+1)^α and c_k = c/(k+1)^γ.
v. Generate the perturbation vector Δ_k.
vi. Generate the common random numbers U_k and, from them, the noise vector V_k.
vii. Evaluate y_k^(±) = Q(θ̂_k ± c_kΔ_k, V_k).
viii. Compute ĝ_{kl}(θ̂_k) = (y_k^(+) − y_k^(−))/(2c_kΔ_{kl}), l = 1, …, p.
ix. Update θ̂_{k+1} = θ̂_k − a_k ĝ_k(θ̂_k).
x. Stop if k ≥ N, or if there is little change in the last few iterations; otherwise put k = k + 1 and go to step iv.
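The loop above can be sketched end to end. This is our own minimal illustration, assuming a toy quadratic loss with additive Exp(1) noise (so CRN cancels the noise exactly in the difference), symmetric Bernoulli perturbations, and the gain forms a_k = a/(k+1)^α, c_k = c/(k+1)^γ:

```python
import math
import random

def spsa_crn(q, p, theta0, a, c, alpha, gamma, n_iter, seed=0):
    """SPSA with common random numbers: V_k^(+) = V_k^(-) = V_k,
    generated by inverse transform from one vector of uniforms U_k."""
    rng = random.Random(seed)
    theta = list(theta0)
    for k in range(n_iter):
        a_k = a / (k + 1) ** alpha                 # gains for this iteration
        c_k = c / (k + 1) ** gamma
        delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]  # perturbation
        u = [rng.random() for _ in range(p)]                  # common uniforms
        v = [-math.log(1.0 - ui) for ui in u]   # inverse transform, Exp(1)
        y_plus = q([t + c_k * d for t, d in zip(theta, delta)], v)
        y_minus = q([t - c_k * d for t, d in zip(theta, delta)], v)
        g = [(y_plus - y_minus) / (2.0 * c_k * d) for d in delta]
        theta = [t - a_k * gi for t, gi in zip(theta, g)]
    return theta

# Toy problem: L(theta) = sum_i (theta_i - 1)^2, noise enters additively.
def q(theta, v):
    return sum((t - 1.0) ** 2 for t in theta) + sum(v)

theta_hat = spsa_crn(q, p=3, theta0=[0.0, 0.0, 0.0],
                     a=0.3, c=0.5, alpha=1.0, gamma=0.49, n_iter=5000)
print(theta_hat)   # each component should end up near the minimizer 1.0
```

Because V_k is shared between the two evaluations, the additive noise term cancels in y_plus − y_minus, which is precisely the CRN effect exploited throughout the paper.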
7. Numerical example
Consider the loss function

L(θ) = Σ_{i=1}^{10} t_i² + E( Σ_{i=1}^{10} 1/(v_i + t_i) ),

so that L(θ) = E[Q(θ, V)] with Q(θ, V) = Σ_{i=1}^{10} t_i² + Σ_{i=1}^{10} 1/(v_i + t_i), where θ = (t_1, …, t_10)^T, t_i ≥ 0 (i = 1, …, 10), and the v_i are independent random variables from the Rayleigh distribution with parameters μ_i. We generate the μ_i according to the uniform distribution on (0.2, 2). Q(θ, V) is monotonically non-increasing for an increasing v_i for any value of t_i ≥ 0. The parameters μ_i and the corresponding elements of θ* are given in Table 1. According to Proposition 2, CRN provides the minimum variance for the elements of ĝ_k(θ̂_k). Here n is the total number of iterations, and the sequences a_k, c_k are defined as a_k = 0.7/(k+1) and c_k = 0.5/(k+1)^0.49. In all runs, we used the initial guess θ̂_0 = (1.2, 1.2, …, 1.2)^T. The difference between the CRN method and the non-CRN method in the sense of convergence is shown in Figure 1. We define the following two states:

I: v_i generated by the inverse transform method.
II: v_i generated by the Newton-Raphson method.

The results of 50 independent runs are given in Tables 2 and 3. We tested the following statistical hypothesis: there is no significant difference between I and II in the sense of their estimated losses and rate of convergence.
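For the Rayleigh noise of this example, the two states can be sketched directly. We assume the parameterization F(x) = 1 − exp(−x²/(2μ²)) with scale parameter μ (Rayleigh parameterizations vary), and the function names are ours:

```python
import math
import random

def rayleigh_inverse(u, mu):
    """State I: explicit inverse transform for Rayleigh(mu)."""
    return mu * math.sqrt(-2.0 * math.log(1.0 - u))

def rayleigh_newton(u, mu, tol=1e-12, max_iter=100):
    """State II: solve F(x) = u by Newton-Raphson, starting at x0 = mu."""
    F = lambda x: 1.0 - math.exp(-x * x / (2.0 * mu * mu))
    f = lambda x: (x / (mu * mu)) * math.exp(-x * x / (2.0 * mu * mu))
    x = mu
    for _ in range(max_iter):
        err = F(x) - u
        if abs(err) <= tol:
            break
        x -= err / f(x)
    return x

# With common random numbers, both states map the SAME uniforms to
# (numerically) the same variates, so CRN behaves alike under I and II.
rng = random.Random(42)
for _ in range(5):
    u = rng.random()
    print(rayleigh_inverse(u, 1.2), rayleigh_newton(u, 1.2))
```

This makes the paper's conclusion plausible before any hypothesis test: the two states differ only by a tiny numerical tolerance, not in distribution.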
Table 1. Parameters μ_i ~ U(0.2, 2) and the corresponding elements of θ*.

i       1        2        3        4        5        6        7        8        9        10
μ_i     1.8509   0.7145   1.563    1.5567   0.8848   1.2221   0.3365   0.2971   1.1554   1.6025
θ*_i    0.8167   0.6752   0.7896   0.7889   0.7046   0.7517   0.5808   0.5665   0.7433   0.7935
Table 2. Mean estimated loss over 50 runs in states I and II.

Total iterations n   Mean loss, state I   Mean loss, state II   Sig (level of significance)
100                  28.60587177          28.22759923           0.7918535057
1000                 26.572366593         26.48061284           0.7001811655
10000                26.380245345         26.37294471           0.53092978
100000               26.355868184         26.354989426          0.21001431
Table 3. Mean n^(1/2)‖θ̂_n − θ*‖ over 50 runs in states I and II.

Total iterations n   Mean n^(1/2)‖θ̂_n − θ*‖, state I   Mean n^(1/2)‖θ̂_n − θ*‖, state II   Sig (level of significance)
100                  10.500910173437314                 8.802750217276586                   0.569360983918349
1000                 7.164886153732708                  6.785387557160711                   0.630234680928987
10000                6.858212185078264                  7.676914389347618                   0.535493477324933
100000               7.848472662713310                  7.704537753341972                   0.769379323778296
It is clear that in all cases of Tables 2 and 3 the Sig values are high, so there is no significant difference; therefore we do not reject the stated statistical hypothesis.
8. Conclusion
The results of Tables 2 and 3 indicate that there is no significant difference between the two methods in the sense of their estimated losses and rate of convergence. So, in the CRN method for the SPSA algorithm, there is no significant difference between the Newton-Raphson method and the inverse transform method. Of course, the run time of Newton-Raphson is longer. Some algorithms, such as the bisection and secant methods, have higher execution speed because they search only in a selected interval [a, b], but they may converge to a false value. These algorithms can also be used when an explicit form of f does not exist.
References
[1] J. C. Spall, Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation, IEEE Transactions on Automatic Control, vol. 37, pp. 332-341, 1992.
[2] S. Ehrlichman and S. G. Henderson, Comparing two systems: Beyond common random numbers, Proceedings of the 2008 Winter Simulation Conference (WSC), pp. 245-251, Dec. 2008.
[3] Xi Chen, B. Ankenman, and B. L. Nelson, Common random numbers and stochastic kriging, Proceedings of the 2010 Winter Simulation Conference (WSC), pp. 947-956, Dec. 2010.
[4] Ying-Yi Hong and Ching-Sheng Chiu, Passive Filter Planning Using Simultaneous Perturbation Stochastic Approximation, IEEE Transactions on Power Delivery, vol. 25, no. 2, pp. 939-946, April 2010.
[5] M. A. Azim, Z. Aung, Weidong Xiao, V. Khadkikar, and A. Jamalipour, Localization in wireless sensor networks by constrained simultaneous perturbation stochastic approximation technique, 6th International Conference on Signal Processing and Communication Systems (ICSPCS), pp. 1-9, Dec. 2012.
[6] J. C. Spall, Qing Song, Yeng Chai Soh, and Jie Ni, Robust Neural Network Tracking Controller Using Simultaneous Perturbation Stochastic Approximation, IEEE Transactions on Neural Networks, vol. 19, no. 5, pp. 817-835, May 2008.
[7] N. L. Kleinman, J. C. Spall, and D. Q. Naiman, Simulation-Based Optimization with Stochastic Approximation Using Common Random Numbers, Management Science, vol. 45, pp. 1570-1578, 1999.
[8] R. Y. Rubinstein, G. Samorodnitsky, and M. Shaked, Antithetic Variates, Multivariate Dependence, and Simulation of Stochastic Systems, Management Science, vol. 31, pp. 66-77, 1985.
[9] R. Y. Rubinstein and G. Samorodnitsky, Variance reduction by the use of common and antithetic random variables, Journal of Statistical Computation and Simulation, vol. 22, pp. 161-180, 1985.
[10] L. Devroye, Non-Uniform Random Variate Generation, Springer-Verlag, New York, 1986.
[11] J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, Hoboken, NJ, 2003.
[12] P. Sadegh and J. C. Spall, Optimal Random Perturbations for Stochastic Approximation with a Simultaneous Perturbation Gradient Approximation, IEEE Transactions on Automatic Control, vol. 43, pp. 1480-1484, 1998.
[13] P. Bratley, B. L. Fox, and L. E. Schrage, A Guide to Simulation, Springer-Verlag, New York, 1983.
[14] Xumeng Cao, Effective perturbation distributions for small samples in simultaneous perturbation stochastic approximation, 45th Annual Conference on Information Sciences and Systems (CISS), pp. 1-5, March 2011.