
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 14, NO. 1, MARCH 2013 419

A Study on the Method for Cleaning and Repairing the Probe Vehicle Data
Zhaosheng Zhang, Diange Yang, Tao Zhang, Qiaochu He, and Xiaomin Lian

Abstract—Probe vehicle data are being increasingly applied in urban dynamic traffic data collection. However, the mobility and scale limit of probe vehicles may lead to incomplete or inaccurate data and thus influence the measurement of the state of traffic. At present, probe vehicle data are usually repaired by linear interpolation or a historical average method, but the repair accuracy is relatively low. To address the given problems, the multithreshold control repair method (MTCRM) was proposed to clean and repair the probe vehicle data. The MTCRM adopts threshold control and a rule based on the approximate normalization transform to clean abnormal traffic data and to fill in the missing data by a weighted average method and an exponential smoothing method. In this approach, we combine topological road network characteristics to fill in the missing data from data for neighboring road sections and repair noisy data by reconstructing the principal components. This paper mainly focuses on analyzing the component of the recurring pattern of probe vehicle data, which can provide guidelines for the subsequent traffic forecasts. The findings of data repair for different grades of road in Beijing, China, demonstrate that the mean repair error may meet the requirements of traffic-state measurement, demonstrating that MTCRM can effectively clean probe vehicle data.

Index Terms—Data cleaning, data repair, multithreshold control repair method (MTCRM), normalization transform, probe vehicle data, reconstruction of principal components.

Manuscript received June 18, 2012; revised August 19, 2012; accepted August 26, 2012. Date of publication September 20, 2012; date of current version February 25, 2013. This work was supported by the National High-tech R&D Program (863 Program) under project (2012AA111901). The Associate Editor for this paper was J. A. Miller.
Z. Zhang, D. Yang (Corresponding author), T. Zhang, and X. Lian (Corresponding author) are with Department of Automotive Engineering, Tsinghua University, Beijing 100084, China (e-mail: zzs08@mails.tsinghua.edu.cn; ydg@mail.tsinghua.edu.cn; zhang-t@mails.tsinghua.edu.cn; lianxm@tsinghua.edu.cn).
Q. He is with Department of Industrial Engineering and Operation Research, University of California, Berkeley, CA 94709 USA (e-mail: heqc0425@berkeley.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2012.2217378
1524-9050/$31.00 © 2012 IEEE

I. INTRODUCTION

WITH THE rapid development of vehicle navigation systems, dynamic traffic data are being more widely applied [1]–[3]. Probe vehicles are widely used to collect dynamic traffic data because of their wide coverage, high precision, and excellent real-time performance [4], [5]. However, not all road sections can be covered by enough probe vehicles because of their high mobility and limited number [6]. Wireless communication may also cause loss of data; therefore, probe vehicle data may be incorrect or incomplete and thus affect the accuracy of traffic-state measurement. Reference [7] points out that 50% of the collected traffic data have problems such as data error and data loss, which may increase the risk of instability when these data are used in transportation systems. Therefore, data cleaning and repair are very important tasks for obtaining accurate dynamic traffic information [8].

II. RELATED WORK

Probe vehicle data cleaning involves identifying errors in the raw data and then filtering or removing them. The removed data are regarded as missing information and are completed by data repair. At present, although methods for traffic flow data have been well developed, few studies have investigated the cleaning and repair of probe vehicle data. Jarema et al. [9] proposed treating missing data by a historical average method that replaces the incorrect or missing data with historical data from the same period. Then, Jiang et al. [10] improved the method by considering historical traffic data, average values of adjacent time periods, and data of the adjacent road sections to correct the missing data. Jacobson et al. [11] suggested the threshold control algorithm, which identifies abnormal data based on the idea that the values of traffic flow parameters for a certain time interval should be within a reasonable range. According to this principle, traffic parameters outside this range are recognized as incorrect data. However, in this method, it is difficult to accurately determine the threshold values for the data range. Coifman [12] developed a method for determining whether the data are reasonable using three parameters (vehicle speed, flow, and occupancy). However, the application of their method is limited because this method requires traffic flow conditions to be determined and a valid range of average vehicle length given. Some improved methods, such as analyses of traffic flow data by clustering methods, have been suggested in [13] and [14].

All the given studies are based on traffic flow data, which include several types of information such as vehicle speed, flow, and occupancy. In cases of missing single-property data, the traffic data can be completed by the other data. However, probe vehicle data only include vehicle speed information and cannot be cleaned by methods developed based on traffic flow. With the wide use of probe vehicle data [15]–[23], processing these data is becoming increasingly important. Yu et al. [24] filled the missing data (less than four data points) by linear interpolation while deleting the source data for days missing more than four data points. Lv et al. [25] developed a nonparametric regression method for repairing missing data, although their approach requires a large amount of historical data under various traffic conditions and is thus unsatisfactory when sufficient data are unavailable. In summary, methods based on traffic flow cannot be applied to repairing probe vehicle data; therefore, probe vehicle data are currently repaired by simple linear interpolation or historical average methods. As a result, the repair precision is low, which may influence the accuracy of traffic-state measurement.
In this paper, based on the periodicity and spatial relationships of probe vehicle data, we develop a new method for cleaning incorrect data by thresholding and a 3σ rule method based on a pseudonormalization transformation. Missing data are completed by numerical analysis taking into account the topological characteristics of the road network. Additionally, noisy data are filtered by principal component reconstruction to further improve the accuracy of the data repair. It should be noted that this paper is restricted to processing single-property (speed) data because, in general, probe vehicle data that combine speed, occupancy, and flow cannot be directly provided by the traffic data provider.

Fig. 1. Changes of vehicle speed (z-axis, in km/h) with time (x-axis) and date (y-axis).

III. TRAFFIC DATA CLEANING

As a result of the regularity of urban traveling, vehicle speed shows periodicities and trends. The data collected on a road section over multiple days can be expressed as a matrix $X$ in

$$
X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
x_{M,1} & x_{M,2} & \cdots & x_{M,N}
\end{bmatrix} \tag{1}
$$

where $x_{i,j}$ represents the vehicle speed at moment $j$ on day $i$, $M$ is the number of days of data collection, and $N$ is the number of data collected per day. The row vector $X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,N}]$ of $X$ represents the vehicle speeds recorded at different moments on day $i$ and is called the date vector; analogously, the column vector $X_j = [x_{1,j}, x_{2,j}, \ldots, x_{M,j}]^T$ of $X$ represents the vehicle speeds recorded at the same time point but on different days and is called the moment vector.

If the collection interval is shorter, the vehicle speed variation and the noise will be greater, which adds to the difficulty of data processing. Conversely, a longer acquisition interval offers a smoother curve but does not reflect the real-time conditions of traffic because of the lower acquisition frequency. The Highway Capacity Manual of the USA recommends an acquisition interval of 5 min. Therefore, in our probe vehicle system, a 5-min collection interval is used, which gives 288 data points per day.

Fig. 1 shows the changes of vehicle speed with time and date. The speed varies slightly from day to day, but the general trend remains similar. In this paper, missing or incorrect probe vehicle data are repaired based on the characteristics of time series and similarity analysis of adjacent roads.

A. Raw Data Screening

Before further careful screening, raw data are rough-screened by the following three steps to remove invalid data.
1) Filtering negative data. Negative vehicle speed data are obviously errors; therefore, these data are removed and replaced with 0.
2) Conditions of excessive missing data. When large amounts of data are missing, the characteristics of traffic operation cannot be accurately described by the established traffic model; therefore, these data should be deleted. If the missing data volume for one day reaches 10% of the total data or the continuously missing data exceeds 5%, the data for the entire day are discarded.
3) Identification and filtering of abnormal data. The collected vehicle speed data may deviate from the normal speed data owing to factors such as traffic accidents and weather. Although such deviations may be reasonable, given that these events are unavoidable in road traffic, they are occasional and uncertain when seen against a long historical trend. This paper mainly focuses on analyzing the component of the recurring pattern of the probe vehicle data, which can provide guidance for the subsequent traffic forecasts; these uncertainties may interfere with the study of the characteristics of vehicle speed and with traffic planning. Thus, these data are removed in this paper.

Fig. 2 shows the calculated mean value $\bar{X}_j$ for each moment vector $X_j$, where the standard deviation of $X_j$ is $\sigma_j$. The 85% confidence interval of the moment vector $X_j$ is $[\bar{X}_j - 1.44\sigma_j, \bar{X}_j + 1.44\sigma_j]$. For the date vector $X_i$, if 5% of the data are consecutively outside of the confidence interval, then the vector contains abnormal data.

Fig. 2. Removal of abnormal traffic data. The symbol + shows historical average data, and × shows measurement data; separate markers show the upper threshold and the lower threshold. The ellipse indicates detected abnormal data.

Fig. 3. Q–Q plots showing normality of the data before (left) and after (right) pseudonormalization transformation.
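Fig. 3 compares the data before and after the pseudonormalization transformation of (2)–(4) in Section III-B. As a minimal sketch, and not the authors' implementation, that transformation and the subsequent 3σ screening might look as follows in plain Python. The grid search over γ, the function names, and the use of the population standard deviation are assumptions made for illustration; the speeds must be strictly positive and not all equal.

```python
from math import log, sqrt


def boxcox(x, g):
    """Power transform of (2): (x**g - 1) / g, or ln(x) when g == 0."""
    return log(x) if g == 0 else (x ** g - 1.0) / g


def log_likelihood(data, g):
    """Objective of (3) for a candidate exponent g."""
    m = len(data)
    t = [boxcox(x, g) for x in data]
    centre = sum(t) / m
    var = sum((v - centre) ** 2 for v in t) / m
    return -m / 2.0 * log(var) + (g - 1.0) * sum(log(x) for x in data)


def best_gamma(data, lo=0.0, hi=5.0, steps=100):
    """Grid search over the open interval (lo, hi); the grid is an assumption."""
    grid = [lo + (hi - lo) * k / steps for k in range(1, steps)]
    return max(grid, key=lambda g: log_likelihood(data, g))


def three_sigma_outliers(data):
    """Indices whose transformed value lies outside mean +/- 3 sigma."""
    g = best_gamma(data)
    t = [boxcox(x, g) for x in data]
    centre = sum(t) / len(t)
    sigma = sqrt(sum((v - centre) ** 2 for v in t) / len(t))
    return [i for i, v in enumerate(t) if abs(v - centre) > 3.0 * sigma]
```

The returned indices would then be treated as missing samples and passed to the repair step.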

B. Abnormal Data Cleaning

Instantaneous data are considered abnormal if they greatly deviate from the center of the distribution. In this paper, a probability model is established for the moment vector $X_j$, and the data are then cleaned based on a 3σ rule.

The 3σ rule is applicable to data that follow a normal distribution. Therefore, the vector $X_j$ is first checked for normality with a quantile–quantile (Q–Q) plot. If it does not satisfy normality, the data in the vector should be transformed to a pseudonormal distribution using a modified power function, as shown in

$$
x^{(\gamma)} = \begin{cases}
\dfrac{x^{\gamma} - 1}{\gamma}, & \gamma \neq 0 \\
\ln(x), & \gamma = 0.
\end{cases} \tag{2}
$$

For the measured data values $x_1, x_2, \ldots, x_M$, Box and Cox [26] give a method for calculating the optimal exponent $\gamma$, which is the value that maximizes the expression in (3), as follows:

$$
l(\gamma) = \max\left\{ -\frac{M}{2} \ln\left[ \frac{1}{M} \sum_{i=1}^{M} \left( x_i^{(\gamma)} - \bar{x}^{(\gamma)} \right)^2 \right] + (\gamma - 1) \sum_{i=1}^{M} \ln(x_i) \right\} \tag{3}
$$

where

$$
\bar{x}^{(\gamma)} = \frac{1}{M} \sum_{i=1}^{M} \frac{x_i^{\gamma} - 1}{\gamma}. \tag{4}
$$

To reduce the computational complexity, $\gamma$ should be within the interval (0, 5). Fig. 3 shows data before and after this transformation. The closer the sample data approach the straight line, the more the samples comply with a normal distribution. It is shown in Fig. 3 that the transformed data ($x^{(\gamma)}$) follow a pseudonormal distribution. Thus, the confidence interval of $x^{(\gamma)}$ can be determined using the 3σ rule: $[\bar{x}^{(\gamma)} - 3\sigma^{(\gamma)}, \bar{x}^{(\gamma)} + 3\sigma^{(\gamma)}]$, where $\sigma^{(\gamma)}$ represents the standard deviation of $x^{(\gamma)}$, and abnormal data are deleted based on the confidence interval. After the data identified as abnormal are removed, the missing data are repaired by a weighted average method or an exponential smoothing method.

IV. TRAFFIC DATA REPAIR

A. Missing Data Repair

Missing data are very common in data collection and may be either isolated or consecutive. In this paper, the missing data are processed as explained in the following.

For isolated missing data $x_t$, a weighted average method is used for data repair. Contrary to the mean value repair method, using a weighted average takes advantage of trends over time, thus reducing the influence of variations in the adjacent data, as shown in the following:

$$
\hat{x}_t = \frac{1}{W} \sum_{k=-T}^{T} w_k \cdot x_{t+k} \quad (k \neq 0). \tag{5}
$$

In (5), $\hat{x}_t$ is the repaired missing data, $w_k$ is the weighting coefficient, $W$ is the sum of all weighting coefficients, and $T$ is the maximum interval for repairing data. Note that $w_k$ decreases with distance from the missing measurement point. The maximum interval $T$ of the neighboring data is set to 3, and the corresponding weight coefficients $w_k$ are set to 0.7, 0.2, and 0.1, respectively.

Considering that vehicle speed–time curves usually display clear trends of increase or decrease, continuous missing data are repaired by a secondary exponential smoothing method as follows:

$$
\hat{x}_{t+r} = a_t + b_t r \quad (r = 1, 2, \ldots) \tag{6}
$$

where $a_t$ and $b_t$ are intermediate variables, which are given in the following:

$$
\begin{cases}
a_t = 2Q_t^{(1)} - Q_t^{(2)} \\
b_t = \dfrac{\alpha}{1 - \alpha} \left( Q_t^{(1)} - Q_t^{(2)} \right).
\end{cases} \tag{7}
$$

In (7), $Q_t^{(1)}$ is the primary exponential smoothing value, $Q_t^{(2)}$ is the secondary exponential smoothing value, and $\alpha \in (0, 1)$ is the smoothing coefficient, i.e.,

$$
\begin{aligned}
Q_t^{(1)} &= \alpha x_t + (1 - \alpha) Q_{t-1}^{(1)} \\
Q_t^{(2)} &= \alpha Q_t^{(1)} + (1 - \alpha) Q_{t-1}^{(2)}.
\end{aligned} \tag{8}
$$

Additionally, the quality of data repair can be improved by taking into account spatial similarities in traffic information, such as traffic data for the upstream and downstream road sections and nearby road sections. The experiment described in Fig. 4 provides the correlation between neighboring road sections at some time.

Fig. 4. Schematic showing spatial distribution of road sections in a local area. Black circles represent road junctions. Lines represent roads. Numbers represent IDs of roads. Arrows point in the direction of traffic.

TABLE I
CORRELATION COEFFICIENTS OF NEIGHBORING ROAD SECTIONS

Consider segment 23010 in Table I. The correlation coefficient with the neighboring road sections is relatively large but decreases as the distance increases. Specifically, the correlation between the upstream and downstream road sections and segment 23010 is relatively large, whereas the correlation between segment 23010 and the parallel segments is relatively small; therefore, the correlation coefficient can be calculated by the adjacent road section.

The speed on road section 23010 can be calculated based on the similarities with adjacent road sections, as expressed in the following:

$$
\hat{x}_t^{\omega} = \sum_{h=1}^{H^{\omega}} \beta_h^{\omega} \cdot x_h^{\omega}(t) \tag{9}
$$

where $\omega$ is the ID number of the road that is missing data, $H^{\omega}$ is the total number of adjacent segments of road $\omega$, $h$ is the sequence number of the adjacent road, $x_h^{\omega}(t)$ is the vehicle speed on adjacent road $h$ at time $t$, and $\beta_h^{\omega}$ is a weight factor given in

$$
\beta_h^{\omega} = \frac{r_h^{\omega}}{\sum_{\tau=1}^{H^{\omega}} r_{\tau}^{\omega}}. \tag{10}
$$

In (10), $r_h^{\omega}$ is the coefficient of correlation between the road section that is missing data and its neighboring road section. In general, the correlation coefficients differ between segments and from day to day; thus, for data without a correlation coefficient, $r_h^{\omega}$ was set to the average value of all the correlation coefficients on the days earlier than the appointed day.

B. Repair Method for Noisy Data

In traffic flow analyses, dimension reduction is frequently used to isolate the important information in the data. Principal component analysis (PCA) is a main tool for dimensionality reduction. Earlier research has suggested that traffic flow can be classified into an eigenflow plus noise [27]. These noise data refer to those data that cannot reflect the traffic characteristic. In this paper, reconstruction of principal components is used to remove high-frequency noisy data. Compared with other noise eliminating methods, reconstruction of principal components is able to process the data for several days, and the regularity and trends of time series are also used, thereby reducing the volume of data processing and improving the precision at the same time.

For a sampling data matrix $X$, each date vector corresponds to a variable, and each moment vector corresponds to a sample. The covariance matrix $S$ for recording samples is

$$
S = \frac{1}{N - 1} \sum_{j=1}^{N} (X_j - \bar{X}_j)(X_j - \bar{X}_j)^T \tag{11}
$$

where $T$ is the transpose of the matrix, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$ are the eigenvalues of $S$, and $U = \{u_j \mid j = 1, 2, \ldots, M\}$ indicates the corresponding orthogonal unit eigenvector matrix. The principal component matrix $Y$ of the data matrix $X$ is given as

$$
Y = X^T U = \begin{bmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,M} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,M} \\
\vdots & \vdots & \ddots & \vdots \\
y_{N,1} & y_{N,2} & \cdots & y_{N,M}
\end{bmatrix}. \tag{12}
$$

The $m$th main component contribution rate $Z_m$ of the principal component matrix $Y$ is shown as

$$
Z_m = \frac{\lambda_m}{\sum_{i=1}^{M} \lambda_i} \quad (m = 1, 2, \ldots, M). \tag{13}
$$

Then, the contribution rate of the first $m$ principal components, $Z_m^s$, is

$$
Z_m^s = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{M} \lambda_i}. \tag{14}
$$

The principal components of matrix $Y$ are used to reconstruct the original data matrix $X$. Taking the first $p$ principal components with contribution rates $Z_m^s \geq 95\%$, the reconstructed matrix $\hat{X}$ is given as

$$
\hat{X} = \begin{bmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,p} & 0 & \cdots & 0 \\
y_{2,1} & y_{2,2} & \cdots & y_{2,p} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
y_{N,1} & y_{N,2} & \cdots & y_{N,p} & 0 & \cdots & 0
\end{bmatrix} U^{-1} \tag{15}
$$

where $U^{-1}$ represents the inverse of $U$.

The effect of noise reduction using this PCA-based method is shown in Fig. 5. After the high-frequency noise in the data is reduced, the curve is smoother, and the fluctuations are smaller, assuming that the transient characteristics of speed data remain normal.
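The principal component reconstruction of (11)–(15) can be illustrated with a small stdlib-only sketch. It is one possible reading of the method, not the authors' code. The assumptions are: each day is centered on its own mean before the covariance is formed and the mean is added back afterward; the leading eigenvectors are found by power iteration with deflation rather than a full eigendecomposition; and the names `top_eigenvector` and `pca_denoise` are hypothetical.

```python
from math import sqrt


def top_eigenvector(S, iters=500):
    """Power iteration: unit eigenvector of the largest eigenvalue of S."""
    m = len(S)
    u = [1.0 + 0.01 * a for a in range(m)]  # arbitrary starting vector
    for _ in range(iters):
        v = [sum(S[a][b] * u[b] for b in range(m)) for a in range(m)]
        norm = sqrt(sum(c * c for c in v))
        if norm == 0.0:
            return None  # nothing left to extract
        u = [c / norm for c in v]
    return u


def pca_denoise(X, share=0.95):
    """X holds M date vectors of N samples; returns the denoised M x N data."""
    M, N = len(X), len(X[0])
    means = [sum(row) / N for row in X]
    C = [[v - mu for v in row] for row, mu in zip(X, means)]
    # Covariance of the M day-variables over the N moment-samples, as in (11).
    S = [[sum(C[a][j] * C[b][j] for j in range(N)) / (N - 1)
          for b in range(M)] for a in range(M)]
    total = sum(S[a][a] for a in range(M))  # trace = sum of eigenvalues
    recon = [[mu] * N for mu in means]
    kept = 0.0
    for _ in range(M):
        if kept >= share * total:  # cumulative contribution rate of (14)
            break
        u = top_eigenvector(S)
        if u is None:
            break
        lam = sum(u[a] * S[a][b] * u[b] for a in range(M) for b in range(M))
        scores = [sum(u[a] * C[a][j] for a in range(M)) for j in range(N)]
        for a in range(M):
            for j in range(N):
                recon[a][j] += u[a] * scores[j]  # add this component back
            for b in range(M):
                S[a][b] -= lam * u[a] * u[b]  # deflate the found component
        kept += lam
    return recon
```

Because the eigenvector matrix is orthogonal, multiplying by its transpose plays the role of $U^{-1}$ in (15); the loop simply accumulates the first $p$ components one at a time.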

Fig. 5. Comparison of traffic data before and after PCA noise reduction. The symbol ∗ shows raw data before noise reduction. + shows PCA data after noise reduction. − shows noise.

Fig. 6. Standardized residual and autocorrelation function of traffic data after PCA noise-reduction treatment.

The standard deviations of the residual and autocorrelation function are shown in Fig. 6. Almost all (96.5%) of the standard residual deviations fall in the range (−2, 2), and the autocorrelation with time lag approaches zero, indicating that the residual is a normally distributed white noise.

V. EXPERIMENTAL VALIDATION

Different types of roads are associated with different densities of probe vehicles and, therefore, different data qualities. We validated the above data cleaning/repair methods with data collected on four types of roads in Beijing, China: freeways, express roads, arterial roads, and minor arterial roads. Probe vehicle data are provided by traffic service providers; they were collected from 23 000 vehicles, and these data covered 85% of the minor arterial roads and superior roads in Beijing with an interval of 5 min between January 1, 2011 and March 30, 2011. As the missing and incorrect data were removed, the vehicle speed data within 81 days were left. Missing and incorrect data were artificially introduced to information recorded on March 31, 2011 and are computationally repaired using the methods described earlier.

The experimental results were evaluated using the following indicators.

relative error (rerr):

$$
\mathrm{rerr} = \frac{\hat{x} - x}{x}
$$

mean absolute relative error (mrerr):

$$
\mathrm{mrerr} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{x}_i - x_i}{x_i} \right|
$$

max absolute relative error (marerr):

$$
\mathrm{marerr} = \max \left| \frac{\hat{x}_i - x_i}{x_i} \right|
$$

coefficient of equivalence (EC):

$$
\mathrm{EC} = 1 - \frac{\sqrt{\sum_{i=1}^{n} (\hat{x}_i - x_i)^2}}{\sqrt{\sum_{i=1}^{n} (\hat{x}_i)^2} + \sqrt{\sum_{i=1}^{n} (x_i)^2}}.
$$

EC reflects the match error between the actual data and the repair data. A value of EC > 0.9 was considered to indicate an excellent repair.

When repairing the data with the weighted average method, the maximum interval $T$ of the neighboring data is set to 3, and the corresponding weight coefficients $w_k$ were set to 0.7, 0.2, and 0.1, respectively. For consecutive missing data, the exponential smoothing coefficient ($\alpha$) was empirically determined as 0.5, and the number of neighboring segments used in speed estimation by spatial similarity was $H = 5$. Expressway data repair by the given method is shown in the following.

In Fig. 7, the data marked with stars are raw data, the data marked with round solid points are modified data, and the data marked with crosses are man-made abnormal traffic data. The right side of Fig. 7 shows a magnified portion of the left side of Fig. 7. Noise is removed from the modified data by reconstruction of principal components. Compared with the raw data, the processed vehicle speed not only maintains the transient characteristics but also eliminates the high-frequency noise in the data, thus smoothing the speed curve and laying a foundation for subsequent traffic-state data identification. Abnormal traffic data were artificially and randomly introduced at different times of one day and were processed by MTCRM, conventional linear interpolation, and historical average methods, respectively. The results are compared and shown in Fig. 8.

The errors of the repaired data are shown in Table II. As shown in Table II, compared with the conventional historical average and linear interpolation methods, MTCRM shows significant advantages in precision and optimality. For the expressway, 89% of the relative errors are within 3%, and the average absolute relative error is 2.3%. The repair error statistics for different types of abnormal data are shown in Table III, where more than two consecutive abnormal data points are called consecutive abnormal data.
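The evaluation indicators of this section (mrerr, marerr, and EC) can be computed with a few lines of plain Python. The sketch below is illustrative only. It assumes the square-root form of EC, in which the root of the summed squared differences is divided by the sum of the roots of the summed squares, and the function name `repair_errors` is hypothetical. The actual values must be nonzero.

```python
from math import sqrt


def repair_errors(repaired, actual):
    """Return mrerr, marerr, and EC for paired repaired/actual speeds."""
    rel = [(r - a) / a for r, a in zip(repaired, actual)]  # per-sample rerr
    mrerr = sum(abs(e) for e in rel) / len(rel)
    marerr = max(abs(e) for e in rel)
    num = sqrt(sum((r - a) ** 2 for r, a in zip(repaired, actual)))
    den = sqrt(sum(r * r for r in repaired)) + sqrt(sum(a * a for a in actual))
    return {"mrerr": mrerr, "marerr": marerr, "ec": 1.0 - num / den}
```

For example, `repair_errors([44.0, 50.0], [40.0, 50.0])` gives an mrerr of 0.05 and a marerr of 0.1, with an EC slightly below 0.97.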

Fig. 7. Expressway data repair.
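The repair shown in Fig. 7 relies on the settings reported in this section ($T = 3$, weights 0.7, 0.2, and 0.1, and $\alpha = 0.5$). As a rough sketch of the three repair rules in (5)–(10), and not the authors' implementation, they might be written as below. Several choices are assumptions: missing samples are marked with `None`, the weighted average renormalizes over the neighbors that are actually available, the smoothing values are initialized with the first observation, and all function names are hypothetical.

```python
def weighted_average_fill(x, t, weights=(0.7, 0.2, 0.1)):
    """Isolated gap at index t: weighted mean of neighbors, as in (5)."""
    num = den = 0.0
    for dist, w in enumerate(weights, start=1):
        for idx in (t - dist, t + dist):
            if 0 <= idx < len(x) and x[idx] is not None:
                num += w * x[idx]
                den += w  # W counts only the neighbors actually used
    return num / den if den else None


def brown_forecast(history, steps, alpha=0.5):
    """Consecutive gap: secondary exponential smoothing, as in (6)-(8)."""
    q1 = q2 = history[0]  # initialization with the first value: an assumption
    for x in history[1:]:
        q1 = alpha * x + (1 - alpha) * q1
        q2 = alpha * q1 + (1 - alpha) * q2
    a = 2 * q1 - q2
    b = alpha / (1 - alpha) * (q1 - q2)
    return [a + b * r for r in range(1, steps + 1)]


def spatial_estimate(neighbor_speeds, correlations):
    """Correlation-weighted mean of adjacent-segment speeds, as in (9)-(10)."""
    total = sum(correlations)
    return sum(r / total * v for r, v in zip(correlations, neighbor_speeds))
```

For a gap in the middle of `[10, 20, 30, None, 50, 60, 70]`, the weighted average returns 40, which matches the linear trend of the neighbors.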

Fig. 8. Repaired data comparison between MTCRM and conventional methods.


TABLE II
ERROR COMPARISON FOR DIFFERENT REPAIR METHODS

TABLE III
STATISTICS OF DIFFERENT TYPES OF ABNORMAL DATA ERRORS

As shown in Table III, discrete abnormal data were repaired relatively accurately, with a maximum relative error of 2.5%. In cases of multiple consecutive neighboring abnormal data, the repair precision of the algorithm is slightly lower, and the maximum error of consecutive abnormal data is 4.1%. The repair results for minor arterial roads, arterial roads, and freeways are shown in Fig. 9.

Fig. 9. Repair of vehicle speed data recorded on different types of roads. (a) Minor arterial road data repair. (b) Minor arterial road repair error. (c) Arterial road
data repair. (d) Arterial road repair error. (e) Freeway data repair. (f) Freeway repair error.

As shown in Fig. 9, the relative error of minor arterial road data repair is relatively large, whereas the error associated with the freeway is smaller, with a relative error within ±1.5% for 95% of the time, and within ±3% even for time periods with obvious speed fluctuation. The difference can be mainly attributed to different vehicle speeds and the number of probe vehicles on the two types of roads. Compared with freeways, there are fewer probe vehicles, and the vehicle speed is lower on minor arterial roads.

Figs. 8 and 9 also clearly indicate morning and evening rush hours on the four types of roads. The vehicle speeds are substantially lower during rush hours and higher during the night. Additionally, the rush hour patterns were consistent on the four types of roads, and such periodicity and consistency can facilitate the identification and repair of traffic data. Table IV lists indicators of repair error other than errors resulting from historical averaging.

TABLE IV
ERROR INDEXES FOR DIFFERENT ROAD TYPES

The errors calculated for different road types are shown in Table IV. For the repair of data from express roads, for which there is smaller data fluctuation, the mean absolute error was only 0.76%, and the EC was 0.96. The quality of probe vehicle data is significantly improved without requiring more probe vehicles or additional data processing equipment, indicating that MTCRM can effectively clean probe vehicle data.

VI. CONCLUSION

The precision of the current repair and cleaning method for probe vehicle data is low, which may influence the accuracy of traffic-state prediction. In this paper, we have developed new numerical analysis methods for cleaning probe vehicle data. A new method is presented for data repair and cleaning using principal component reconstruction based on the traffic information and topological characteristics of the road network.

Experimental validation shows that abnormal data can be effectively identified by a 3σ rule based on an approximate normalization transform. Principal component reconstruction can effectively exploit the periodicity and trends in traffic data to reduce the influence of noise on probe vehicle data and improve the data accuracy. This paper has focused on the single-property data (speed) process in probe vehicle data as those combined with speed, occupancy, and flow together cannot be directly provided by the traffic data provider; nevertheless, we believe our newly developed method has great potential for future transportation research.

REFERENCES

[1] T. Zhang, D. G. Yang, T. Li, K. Q. Li, and X. M. Lian, “An improved virtual intersection model for vehicle navigation at intersections,” Transp. Res. C, Emerging Technol., vol. 19, no. 3, pp. 413–423, Jun. 2011.
[2] J. W. Ding, C. F. Wang, F. H. Meng, and T. Y. Wu, “Real-time vehicle route guidance using vehicle-to-vehicle communication,” IET Commun., vol. 4, no. 7, pp. 870–883, Apr. 2010.
[3] F. Dion, J. S. Oh, and R. Robinson, “Virtual testbed for assessing probe vehicle data in IntelliDrive systems,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 3, pp. 635–644, Sep. 2011.
[4] J. E. Naranjo, F. Jiménez, F. J. Serradilla, and J. G. Zato, “Comparison between floating car data and infrastructure sensors for traffic speed estimation,” in Proc. 13th Int. IEEE Conf. Intell. Transp. Syst., Workshop Emergent Coop. Technol. Intell. Transp. Syst., 2010.
[5] Z. Yang, “Research and implementation of large-scale FCD processing,” M.S. thesis, Dept. Pattern Recogn. Intell. Syst., Univ. Sci. Technol. China, He Fei, China, 2010.
[6] J. E. Naranjo, F. Jiménez, F. J. Serradilla, and J. G. Zato, “Floating car data augmentation based on infrastructure sensors and neural networks,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 107–114, Mar. 2012.
[7] M. Zhong, P. Lingras, and S. Sharma, “Estimation of missing traffic counts using factor, genetic, neural, and regression techniques,” Transp. Res. C, Emerging Technol., vol. 12, no. 2, pp. 139–166, Apr. 2004.
[8] X. Y. Wang, J. L. Zhang, and X. Y. Yang, The Theoretical Approaches of Traffic Flow Data Cleaning and State Identify as well as Optimize Control. Beijing, China: Science, 2011, pp. 13–18.
[9] F. Jarema, C. Dahlin, and R. Gillmann, “FHWA study tour for european traffic monitoring programs and technologies,” Federal Highway Admin., U.S. Dept. Transp., Washington, DC, 1997.
[10] G. Y. Jiang, L. H. Gang, X. D. Zhang, and J. F. Wang, “Malfunction identifying and modifying of dynamic traffic data,” J. Traffic Transp. Eng., vol. 4, no. 1, pp. 121–125, Jan. 2004.
[11] L. N. Jacobson, N. L. Nihan, and J. D. Bender, “Detecting erroneous loop detector data in a freeway traffic management system,” Transp. Res. Rec., vol. 1287, pp. 151–166, Mar. 1990.
[12] B. Coifman, “Improved velocity estimation using single loop detectors,” Transp. Res. A, Policy Pract., vol. 35, no. 10, pp. 863–880, Dec. 2001.
[13] X. Y. Gong, “Traffic flow data filtering algorithms based on data mining,” in Proc. Nat. ITS Syst. Traffic Inf. Collect. Integr. Tech., Hangzhou, China, 2003, pp. 163–173.
[14] L. Sun and J. Zhou, “Development of multi regime speed-density relationships by cluster analysis,” Transp. Res. Rec., vol. 1934, pp. 64–71, Feb. 2005.
[15] A. Simroth and H. Zähle, “Travel time prediction using floating car data applied to logistics planning,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 1, pp. 243–253, Mar. 2011.
[16] J. F. Ehmke, S. Meisel, and D. C. Mattfeld, “Floating car based travel times for city logistics,” Transp. Res. C, Emerging Technol., vol. 21, no. 1, pp. 228–352, Apr. 2012.
[17] B. Mehran, M. Kuwahara, and F. Naznin, “Implementing kinematic wave theory to reconstruct vehicle trajectories from fixed and probe sensor data,” Transp. Res. C, Emerging Technol., vol. 20, no. 1, pp. 144–163, Feb. 2012.
[18] Q. Ou, R. L. Bertini, J. W. C. van Lint, and S. P. Hoogendoorn, “A theoretical framework for traffic speed estimation by fusing low-resolution probe vehicle data,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 3, pp. 747–756, Sep. 2011.
[19] J. J. V. Díaz, D. F. Llorca, A. B. R. González, R. Q. Mínguez, Á. L. Llamazares, and M. Á. Sotelo, “Extended floating car data system: Experimental results and application for a hybrid route level of service,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 25–35, Mar. 2012.
[20] S. Breitenberger, K. Bogenberger, M. Hauschild, and K. Laffkas, “Extended floating car data—An overview,” in Proc. World Congr. Intell. Transp. Syst., Madrid, Spain, Nov. 2003.
[21] L. Lin, T. Osafune, and M. Lenardi, “Floating car data system enforcement through vehicle to vehicle communications,” in Proc. 6th Int. Conf. ITS Telecommun., Jun. 2006, pp. 122–126.
[22] S. Maerivoet and S. Logghe, “Validation of travel times based on cellular floating vehicle data,” in Proc. Eur. Congr. Intell. Transp. Syst., Aalborg, Denmark, Jun. 2007.
[23] S. Messelodi, M. Modena, M. Zanin, F. G. B. De Natale, F. Granelli, E. Betterle, and A. Guarise, “Intelligent extended floating car data collection,” Expert Syst. Appl. Int. J. Arch., vol. 36, no. 3, pp. 4213–4227, Apr. 2009.
[24] L. Yu, L. Yu, Y. Qi, and H. Wen, “Traffic incident detection algorithm for urban expressways based on probe vehicle data,” J. Trans. Syst. Eng. Inf. Technol., vol. 8, no. 4, pp. 36–41, Aug. 2008.
[25] W. F. Lv, Y. Liang, T. Y. Zhu, and D. D. Wu, “An FCD compensation model based on traffic condition trends matching,” in Proc. 4th ICCIT, 2009, pp. 1201–1206.
[26] G. E. P. Box and D. R. Cox, “An analysis of transformations,” J. R. Stat. Soc. B, Methodol., vol. 26, no. 2, pp. 211–252, Apr. 1964.
[27] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft, “Structural analysis of network traffic flows,” in Proc. SIGMETRICS, 2004, pp. 61–72.

Zhaosheng Zhang was born in Hezhe, China, in 1984. He received the B.S. degree in automotive engineering from Hunan University, Changsha, China, in 2008. He is currently working toward the Ph.D. degree in automotive engineering with the Department of Automotive Engineering, Tsinghua University, Beijing, China.
His research interests include data processing, vehicle navigation, and path planning.

Diange Yang received the B.S. and Ph.D. degrees in automotive engineering from Tsinghua University, Beijing, China, in 1996 and 2001, respectively.
He is currently an Associate Professor with the Department of Automotive Engineering, Tsinghua University. His research interests include intelligent transport systems, vehicle electronics, and vehicle noise measurement.
Dr. Yang received the Second Prize from the National Technology Invention Rewards of China in 2010 and the Award for Distinguished Young Science and Technology Talent of the China Automobile Industry in 2011.

Tao Zhang received the B.S. and Ph.D. degrees in automotive engineering from Tsinghua University, Beijing, China, in 2005 and 2010, respectively.
He is currently a Postdoctoral Researcher with the Department of Automotive Engineering, Tsinghua University. His research interests include vehicle navigation and electronic control.

Qiaochu He received the B.S. degree in automotive engineering from Tsinghua University, Beijing, China, in 2011. He is currently working toward the Ph.D. degree in operation research with the Department of Industrial Engineering and Operation Research, University of California, Berkeley.
His research interests include convex optimization, stochastic processes, and their applications in service operation management.

Xiaomin Lian received the B.S., M.S., and Ph.D. degrees in automotive engineering from Tsinghua University, Beijing, China, in 1982, 1986, and 1997, respectively.
He is currently a Professor with the Department of Automotive Engineering, Tsinghua University. His research interests include vehicle Global Positioning System navigation, vehicle electronics, and vibration control.