I. INTRODUCTION
devices on the IoT provides the basis for new business and
government applications in areas such as public safety, transport logistics and environmental management. A key challenge
in the development of such applications is how to model
and interpret the large volumes of complex data streams that
will be generated by the IoT. Examples of such large-scale
sensor deployments include (1) SmartSantander [1], [3],
in the Spanish city of Santander, where around 12,000 sensors
are installed in places such as lamp posts to sense temperature,
CO, noise and light, and buried in the asphalt to sense parking; and
(2) the city of San Francisco (SF), USA, where around
8,200 wireless parking sensors are installed in on-street spaces
in neighborhoods across the city. These sensors enable real-time
monitoring and inform drivers of the available vacant
parking lots and the rates in real time.
While there has been much discussion of the potential for
smart cities based on the IoT, there have been few systematic
studies of how data analytics can provide practical insights
from IoT data. The collection of such data is intended to be
used for improving traffic management, energy management,
environment protection, public health and safety. However,
urban authorities are not equipped to make use of this type
of Big Data. Without suitable data analytics to detect and
correlate relevant events in the urban environment, this sensing
infrastructure will not be effectively utilised and these public
services will remain manual tasks.
In this paper, we use the parking data collected from one
of the cities, namely the city of San Francisco (SF), USA,
and apply data analytics to infer interesting events buried
in the data. Although the SF parking data provides real-time
parking availability data to the public, a meaningful analysis
of the data is lacking for interpretation by the authorities. In
particular, we perform data clustering and anomaly detection
on the collected parking data, and present several interesting
practical insights from the data, which are impossible to infer
without performing such machine learning tasks. To the best
of our knowledge, this is the first time such an analysis has
been performed in terms of clustering and anomaly detection
on the SF parking dataset, which has been made available to
the public by the city, and can be accessed from [2].
The rest of the paper is organised as follows. Section II
provides the existing related work in this domain, and Section
III introduces the SF data set, the challenges in the analytics
and our approach. Section IV describes the clustering and
the anomaly detection algorithm, and Section V discusses the
outcomes. In Section VI, we discuss the results and
conclude by highlighting directions for further research.
978-1-4799-2843-9/14/$31.00 ©2014 IEEE
II. RELATED WORK
computed the average OCC rate over the two-week period
for every 15-minute interval. The resulting data set consists
of 570 parking locations and 96 fifteen-minute time instances
per location over a day. We perform clustering and anomaly
detection on this 96-dimensional data in this paper.
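The 15-minute averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(timestamp, occ_rate)` record layout and the function names are assumptions made for the example, and slots with no readings are simply returned as `None`.

```python
from datetime import datetime

SLOTS_PER_DAY = 96  # 24 hours x 4 fifteen-minute slots


def slot_index(ts: datetime) -> int:
    """Map a timestamp to its 15-minute slot of the day (0..95)."""
    return ts.hour * 4 + ts.minute // 15


def daily_profile(readings):
    """Average the OCC rate per 15-minute slot across all observed days.

    `readings` is an iterable of (timestamp, occ_rate) pairs for ONE
    parking location (a hypothetical record layout).
    """
    totals = [0.0] * SLOTS_PER_DAY
    counts = [0] * SLOTS_PER_DAY
    for ts, occ in readings:
        k = slot_index(ts)
        totals[k] += occ
        counts[k] += 1
    # Slots that were never observed are reported as None
    return [totals[k] / counts[k] if counts[k] else None
            for k in range(SLOTS_PER_DAY)]
```

Applying `daily_profile` to each location would give one 96-dimensional vector per location, which is the shape of data the clustering and anomaly detection operate on.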
Fig. 1. Geometry of SVDD: Data vectors are mapped from the input space
to a higher dimensional space and a hypersphere (with center c and radius R)
is fit to the majority of the data. Data vectors that fall outside the hypersphere are
anomalous.
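$$\min_{R \in \mathbb{R}^{+},\ \xi \in \mathbb{R}^{n}} \; R^{2} + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i \quad \text{subject to:} \quad \|\phi(x_i) - c\|^{2} \le R^{2} + \xi_i,\ \ \xi_i \ge 0,\ \ \forall i \tag{1}$$

$$\min_{\alpha \in \mathbb{R}^{n}} \; \sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) - \sum_{i=1}^{n} \alpha_i k(x_i, x_i) \quad \text{subject to:} \quad \sum_{i=1}^{n} \alpha_i = 1,\ \ 0 \le \alpha_i \le \frac{1}{\nu n},\ \ i = 1 \ldots n. \tag{2}$$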
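Once the dual problem (2) has been solved, a data vector can be tested against the hypersphere using kernel evaluations only. The sketch below relies on the standard SVDD result that the center is a weighted sum of the mapped data vectors, with the dual coefficients as weights. It assumes the coefficients (`alphas`) and the squared radius (`radius_sq`) have already been obtained from a solver. The function names and the RBF parameterisation are illustrative assumptions, not taken from the paper.

```python
import math


def linear_kernel(a, b):
    """Plain dot product; used here only to make the geometry easy to check."""
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf_kernel(a, b, sigma=1.0):
    """One common RBF parameterisation (an assumption, not from the paper)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def distance_sq_to_center(x, data, alphas, kernel):
    """Squared feature-space distance from x to c = sum_i alpha_i * phi(x_i)."""
    cross = sum(a * kernel(xi, x) for a, xi in zip(alphas, data))
    const = sum(ai * aj * kernel(xi, xj)
                for ai, xi in zip(alphas, data)
                for aj, xj in zip(alphas, data))
    return kernel(x, x) - 2.0 * cross + const


def is_anomalous(x, data, alphas, kernel, radius_sq):
    """A vector is flagged when it falls outside the hypersphere."""
    return distance_sq_to_center(x, data, alphas, kernel) > radius_sq
```

With a linear kernel the center is simply the weighted mean of the data, which makes the function easy to sanity-check by hand before switching to a nonlinear kernel.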
Fig. 2.
[Plots for Fig. 3(b) and (c): OCC rate versus time (hours.minutes); legend: cluster0, cluster1.]
Fig. 3. Farthest First clustering with two clusters. (a) Spatial locations of the parking lots in each cluster. (b) Median and median absolute deviation of the data
vectors in each of the two clusters. (c) Mean and standard deviation of the data vectors in each of the two clusters.
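For readers unfamiliar with Farthest First clustering, the following is a minimal sketch of the farthest-first traversal heuristic (see [11], [14]). It is an illustration under stated assumptions rather than the implementation used to produce the figures: the first center is fixed by the hypothetical `first` argument instead of being drawn at random, and Euclidean distance is assumed.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def farthest_first(points, k, first=0):
    """Pick k centers by farthest-first traversal, then label every point
    with the index (0..k-1) of its nearest chosen center."""
    centers = [first]
    # dist[i] = distance from point i to its nearest center chosen so far
    dist = [euclidean(p, points[first]) for p in points]
    labels = [0] * len(points)
    while len(centers) < k:
        # The next center is the point farthest from all current centers
        nxt = max(range(len(points)), key=lambda i: dist[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            d = euclidean(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
                labels[i] = len(centers) - 1
    return centers, labels
```

For the parking data, `points` would hold the 96-dimensional occupancy vectors, and `k` would be 2 or 4 to match the cluster counts shown in the figures.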
[Plots for Fig. 4(b) and (c): OCC rate versus time (hours.minutes); legend: Anomalies, Normal.]
Fig. 4. One-class classification (SVDD) with parameters ν = 0.1 and σ = 100. (a) Spatial locations of normal and anomalous parking lots. (b) Median and
median absolute deviation of the normal and anomalous data vectors. (c) Mean and standard deviation of the normal and anomalous data vectors.
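The per-time-slot summaries plotted in panels (b) can be computed as in the sketch below. It is a minimal example assuming each group (for instance, the normal or the anomalous locations) is given as a list of equal-length occupancy vectors. The median absolute deviation is left unscaled, which is an assumption, since the paper does not state whether a consistency factor was applied.

```python
from statistics import median


def median_and_mad(vectors):
    """Per-time-slot median and (unscaled) median absolute deviation
    over a group of equal-length occupancy vectors."""
    n_slots = len(vectors[0])
    medians, mads = [], []
    for t in range(n_slots):
        column = [v[t] for v in vectors]
        m = median(column)
        medians.append(m)
        mads.append(median([abs(x - m) for x in column]))
    return medians, mads
```

Compared with the mean and standard deviation in panels (c), these statistics are less sensitive to a few extreme vectors within a group.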
[Plots for Fig. 5(b) and (c): OCC rate versus time (hours.minutes); legend: cluster0, cluster1, cluster2, cluster3.]
Fig. 5. Farthest First clustering using four clusters. (a) Spatial locations of the parking lots in each cluster. (b) Median and median absolute deviation of the
data in each of the four clusters. (c) Mean and standard deviation of the data in each of the four clusters.
Fig. 7. Spatial locations of the parking lots in selected clusters (using EM
clustering). (a) Cluster 2, whose data vectors have a mean/median value of zero
(green lines). (b) Cluster 4 (red lines).
[Plots for Fig. 6(b) and (c): OCC rate versus time (hours.minutes); legend: cluster0 to cluster15.]
Fig. 6. EM clustering. (a) Spatial locations of the parking lots in each cluster. (b) Median values of the data vectors from each cluster. (c) Mean values of the
data vectors from each cluster.
e.g., crime statistics, other modes of transport, construction activity. This type of detailed correlation with other data sources
can be impractical if it needs to be applied to all parking
locations. However, the cluster analysis can limit the scope of
such an analysis by providing a focus in terms of potentially
interesting locations. In particular, we have demonstrated that
it is possible to find clusters that are indicative of potential
sensor faults. In the future, we aim to perform clustering based
on both spatial and temporal similarity in the SF data as well
as from other platforms such as Smart Santander.
ACKNOWLEDGMENT
We acknowledge the support of the Australian Research Council
through grants LP120100529 and LE120100129.
REFERENCES
[1] IoT, http://issnip.unimelb.edu.au/research program/Internet of Things, 2013.
[2] San Francisco parking data, http://sfpark.org, 2013.
[3] Smart Santander, http://www.smartsantander.eu/, 2013.
[4] V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd ed. John Wiley and Sons, 1994.
[5] J. Belissent, "Getting clever about smart cities: New opportunities require new business models," http://www.forrester.com/rb/Research/getting clever about smart cities new opportunities/q/id/56701/t/2, 2013.
[6] J. C. Bezdek, T. Havens, J. Keller, C. Leckie, L. Park, M. Palaniswami, and S. Rajasegarar, "Clustering elliptical anomalies in sensor networks," in IEEE WCCI, 2010.
[7] J. C. Bezdek, S. Rajasegarar, M. Moshtaghi, C. Leckie, M. Palaniswami, and T. Havens, "Anomaly detection in environmental monitoring networks," IEEE Comp. Int. Mag., vol. 6, no. 2, pp. 52-58, 2011.
[8] F. Caicedo, C. Blazquez, and P. Miranda, "Prediction of parking space availability in real time," Expert Systems with Apps., vol. 39, no. 8, pp. 7281-7290, 2012.
[9] M. Caliskan, A. Barthels, B. Scheuermann, and M. Mauve, "Predicting parking lot occupancy in vehicular ad hoc networks," in IEEE VTC, 2007.
[10] S. Dasgupta and P. M. Long, "Performance guarantees for hierarchical clustering," Jnl. of Comp. and Sys. Sci., vol. 70, no. 4, pp. 555-569, 2005.
[11] T. F. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theoretical Comp. Sci., vol. 38, no. 0, pp. 293-306, 1985.
[12] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, "Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions," Accepted for publ. in Future Generation Computer Systems, Jan 2013.
[13] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, 2009.
[14] D. S. Hochbaum and D. B. Shmoys, "A best possible heuristic for the k-center problem," Maths. of Oper. Res., vol. 10, no. 2, pp. 180-184, 1985.