The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should
not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of
any information presented after the date of publication.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.
The descriptions of other companies' products in this document, if any, are provided only as a convenience to you.
Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee
their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid
understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult
their respective manufacturers.
© 2014 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express
authorization of Microsoft Corp. is strictly prohibited.
Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States
and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Table of contents
Executive summary
IoT and predictive maintenance
The Internet of Things
Business value
Megatrends
Technology enablers
Standardization efforts
Predictive maintenance
Predictive maintenance scenarios
Healthcare
Automotive
Manufacturing
Architectural considerations
Connectivity
Interaction patterns
Connectivity pathways
Connectivity network types
Protocol choices
Transport-layer protocol choices
Transport-layer protocol security
Application-layer protocol choices
Security
Virtual Private Networks
Compliance
Device communication patterns
NAT-based device network
IPv6 direct-addressing device network
NAT-based, PAN device network
Generic concerns with direct addressing
Service-assisted communication
Designing for scale
Communication and ingestion
Data storage scalability
Device registration
Acquiring data
Message size and format
Message types
Message priority
Conditional messaging
Contextual messaging
Message batching
Bandwidth and scale
Storing information
Storing data on the device
Transforming data
Location
Longevity, format, and cost
Processing information
Alarm processing
Complex-event processing
Big Data analysis
Machine learning
Data enhancement
Publishing insights
Audience
Publishing format
Cost modeling and estimation
Common architecture overview
Capacity modeling
Cost estimation
Ingress path cost
Egress path cost
Management cost
System processing cost
Cost estimate calculation
Strategic choices
Buy, build, or hybrid
Important topics not yet covered
Networks with automatic handover and fallbacks
The need for the commoditization of devices
The creation and use of information marketplaces
Management solutions
The redefinition of SLAs
Integration simplicity
Conclusions
How Microsoft can help you succeed
Executive summary
For decades, technology experts have anticipated the Internet of Things (IoT): the
proliferation of tens of billions of connected devices that contain embedded microchips, and
the rise of machine-to-machine and service-to-service communications. IoT will make
inanimate objects, networks, and processes smart: everything from tiny components,
appliances, machines, homes, buildings, and factories to energy grids, transportation
networks, and logistics systems. It's a game-changing opportunity in IT. By analyzing the
vast new streams of data, and by harnessing the precise control that IoT provides, your
organization can reduce costs, create new revenue streams, increase customer satisfaction
and retention, spot trends faster, gain from opportunities more easily, and innovate with
agility. IoT will be especially beneficial in predictive maintenance: performing maintenance
at the right time to predict and prevent failures.
To take full advantage of IoT opportunities in predictive maintenance, you need to think
strategically about the many elements of IoT. For example, you should consider connectivity
pathways and types, transport-layer and application-layer protocol choices, device
interaction and communication patterns, and how to design for the vast scale of IoT. It is
especially critical to understand the complex issues of data security and regulatory
compliance, which can expose the enterprise to legal difficulties if they are not handled
properly. You also should think about how the enterprise's communications systems will
ingest data, including message types, sizes, formats, and priorities, conditional and
contextual messaging, message batching, bandwidth, and how to scale a messaging system.
Another pivotal set of questions relates to the data: where will data be stored, how will it
be distributed or potentially sold, how long must it be kept, in what format, and at what
cost? What is the most efficient way to analyze Big Data, and how can you best take
advantage of possibilities such as alarm processing, complex-event processing, Big Data
analysis, machine learning, and data enhancement? Because data that seems at first
uninteresting can be very valuable to the right audience, how do you find that audience to
monetize the insights gained from processing it?
The elements that are needed for security, communication, and scale in an IoT solution
make it very challenging to build one from scratch. Succeeding with any IoT solution will
very likely require implementing a reference architecture that can help accelerate
the use of massive data from millions or even billions of devices. Modeling the system's
capacity to scale, and calculating the costs of doing so for related aspects, such as ingress
(device to cloud) and egress (cloud to device, cloud to system) paths and system
processing, is paramount. Depending on your company's background, you should make a
classic buy vs. build vs. hybrid decision, based on what you are already using, what is
available, and what will be available in the near future at a price that is acceptable to
your business.
This white paper introduces and describes all of these considerations and provides you with
the tools necessary to estimate the operational cost of an implemented reference
architecture in production.
With the Microsoft Azure platform, Microsoft offers a broad set of building blocks to help you
get an IoT solution up and running quickly.
1 Microsoft, What Our Customers Are Saying: Top Enterprise Trends of 2014, Susan Houser
2 Wikipedia, Internet of Things
3 Gartner, IT Glossary, Internet of Things
Figure 1. Foundational activities, composable within and between devices and systems
Acquiring data. Using sensors to record information about the physical world.
Examples include measuring location, humidity, temperature, light, heart rate, blood
pressure, brain waves, current, and gas detection.
Publishing insights. When embedded sensor data is combined with both internal and
external data from other systems, analysis can yield additional insight that can be
acted upon. Exposing that insight can also drive additional value for other
stakeholders outside the immediate needs of the current system, allowing for the
monetization of this knowledge.
On top of familiar devices, such as phones for input and presentation, a set of core
components to support those activities is needed, though business goals and technical
constraints will drive those that are required. Core components may include:
Sensors: the components that translate a value from the physical world into bits.
Examples include sensors that measure pressure, humidity, heart rate, gas levels, and
acceleration.
Devices: networked, physical, special-purpose systems that emit telemetry data, accept
external information, request external information, and execute remotely-issued
commands. Examples include factory floor equipment, environmental pollution sensors,
and control modules in vehicles.
Bridges: systems that act as communication brokers between a device and a gateway,
typically by translating data traffic between different link protocols or methods, for
instance between short-range and long-range wireless protocols. A bridge can also be a
connectivity infrastructure that manages a nationwide or world-wide wireless network on
one side, and a bridge to a cloud system on the other. A bridge might also perform
intelligent preprocessing of data, or act as an autonomous local communications hub in
addition to its bridging function relative to a cloud system.4 Bridges are often also
referred to as gateways, but we reserve the term gateway for a network-based service
with which a bridge communicates.
Machine learning: computational algorithms that analyze large amounts of data and
extract patterns from it, helping a system learn from that data and drive more
intelligent system responses in the physical world.
Interconnections: different systems sharing learnings and data that in turn form
composite systems.
We have read thought-provoking papers about IoT. Two that we found especially valuable in
providing context to the concepts and opportunities of IoT are:
Final report of the Industrie 4.0 Working Group, from Deutsche Akademie der Technikwissenschaften.5
Industrial Internet: Pushing the Boundaries of Minds and Machines, a European Perspective, from General Electric.6
IoT enables you to build, enhance or extend a business model based on data-driven insights
from pervasive sensors that help you optimize resource use and reduce cost and
environmental impact. IoT also helps you maintain a closer relationship with customers
beyond the point of sale of physical products by enabling contextual, remote actions
automatically and intelligently. Examples include remote servicing, proactive sales,
best-practices guidance, and more.
4 Microsoft, How Microsoft tech is helping affordable housing tenants save money (section
on Captain)
5 Deutsche Akademie der Technikwissenschaften, Final report of the Industrie 4.0 Working
Group
6 General Electric, Industrial Internet: Pushing the Boundaries of Minds and Machines, a
European Perspective
Business value
Gartner projects that 26 billion devices will be connected to the Internet by 2020, and organizations in
every sector will use them.7 Billions of connected devices will help businesses to:
Reduce cost. Businesses can use the increased insight into manufacturing and delivery
processes to optimize those processes and reduce cost. For example, a business can reduce
the number of visits a technician must make by scheduling service visits based on
duty cycles and expected product lifespans informed by actual usage.
Create new revenue streams. The ability to sense from and actuate in the physical
world is giving rise to new business models. Businesses can capitalize on these new
opportunities and create innovative new revenue streams. Examples include
monetizing newly collected datasets, offering APIs to create new business partnerships,
increasing service revenue by notifying and offering improved convenience to customers,
offering differentiating SKUs based on usage patterns, supplying optimized configuration
services, and so on.
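The usage-based scheduling mentioned under "Reduce cost" can be sketched in a few lines. This is an illustrative sketch only, not a method described in this paper: it assumes each unit has a rated number of duty cycles, that a visit is due once usage reaches a fixed share of that rating, and that recent usage continues at the same rate. The function name, the 10 percent safety margin, and the sample figures are hypothetical.

```python
def days_until_service(rated_cycles: int, cycles_used: int,
                       cycles_per_day: float, safety_margin: float = 0.1) -> float:
    """Estimate days until a service visit is due, based on observed usage.

    A visit is due once usage reaches (1 - safety_margin) of the rated
    duty cycles, so heavily used units are serviced sooner.
    """
    if cycles_per_day <= 0:
        raise ValueError("cycles_per_day must be positive")
    service_at = rated_cycles * (1.0 - safety_margin)
    remaining = max(service_at - cycles_used, 0.0)
    return remaining / cycles_per_day


# Two hypothetical units with the same rating but very different usage
light_use = days_until_service(100_000, 20_000, cycles_per_day=50)
heavy_use = days_until_service(100_000, 60_000, cycles_per_day=400)
print(f"light use: {light_use:.0f} days, heavy use: {heavy_use:.0f} days")
```

A fixed calendar interval would treat both units the same; driving the schedule from telemetry lets the lightly used unit skip visits it does not need.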
In the blog post 10 reasons businesses need a strategy for the Internet of Things now,8 the
author identified a concise set of benefits that a company can realize by adopting an IoT
strategy.
Megatrends
The world faces many challenges, such as changes in wealth distribution, resource scarcity,
and an aging population in developed countries. The authors of the book From
Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence analyzed these
megatrends and capabilities in detail.9 They found that these megatrends are driving a
proliferation of embedded devices with sensors, which in turn require new capabilities for
new market scenarios, as the graphic below shows.
7 Gartner, Gartner says the Internet of Things Installed Base Will Grow to 26 billion units by
2020, December 2013
8 Microsoft, 10 reasons businesses need a strategy for the Internet of Things now
9 From Machine-to-Machine to the Internet of Things: Introduction to a New Age of
Intelligence, ISBN 978-0124076846
Natural resource constraints. The world population is growing at a high rate, with a
projected peak population of 9.22 billion in 2075.10 Given this growth and the impact it
has on the growth of the worldwide economy, the world will increasingly have to do more
with less, and optimize the way that we produce. IoT can support the optimization of
production, loss reduction, and the efficiency of the necessary supply chain.
Economic shifts. Much like the shift in IT, going from packaged products to as-a-service
solutions, the global economy is moving from a product-oriented to a service-oriented
perspective.11 For a viable service-oriented economy to come into existence, it needs to
be supported by a large set of devices that provide context to the customer environment
for the system in order to offer the right service, at the right price, and at the right time.
Climate change. The impact of human activities on the environment, although debated
at length, is detrimental to the sustainability of the world. In recent years, there has been
a growing movement of green technologies and services, ranging from electric cars to
corporate and government policy changes. IoT can be a supporting factor both in
providing insight into environmental footprints and in reducing them.
Technology enablers
The ever-decreasing cost and size of components, such as accelerometers, Wi-Fi radios,12
GPS, microcontrollers, and Bluetooth radios is also enabling the Internet of Things (IoT). It
allows components and devices to be used in new settings, such as wearables, on-person
devices, and even smaller equipment.
As shown in Figure 1, IoT depends on several other major technologies and trends. Some of
these technology enablers as well as others warrant clarification:
Cloud computing. For systems that connect hundreds of millions of devices, cloud
computing is the technology that allows for vast scale and acceptable costs, providing
the ability to store large amounts of machine generated data at low cost and perform Big
Data analytics and machine learning.
Power supply and storage technologies. Given the tiny size of many new devices,
their deployment location, and the vast number of them that will be deployed, changing
batteries is often impractical or impossible. Besides optimizing hardware design for these
scenarios,15 limiting the quiescent current (Iq) of the circuitry will further
improve battery life. Also, with energy harvesting techniques, such as solar power
supplies, devices can recharge their built-in batteries as long as there is a minimal
charge left.
12 For example, a network chip for less than $10 for 1,000 units. Texas Instruments,
SimpleLink Wi-Fi Module CC3000
13 maxEmbedded, What is a microcontroller? And how does it differ from a microprocessor?
14 Texas Instruments, MSP430 documentation
15 Texas Instruments, Using power solutions to extend battery life in MSP430 applications
Embedded operating system platforms. With the vast number of devices that will be
installed, cost and energy consumption per device become decisive. Engineers will
create devices that cost less and that are more energy-efficient, even if they have limited
processing capabilities and memory. CPU cycles spent and memory allocated will
become important factors for choosing operating system platforms, installed
components, and security configurations. There is a plethora of good general-purpose
operating systems, ranging from Windows Embedded and Embedded Linux to real-time
operating systems, such as FreeRTOS, ThreadX, Integrity, Nucleus, QNX, Atomthreads,
AVIX-RT, ChibiOS/RT, ERIKA Enterprise, TinyOS, Thingsquare Mist/Contiki, and others.16
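The effect of quiescent current on battery life, noted under power supply and storage technologies, can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the cited sources: it assumes an idealized battery with no self-discharge, a device that alternates between one active state and one sleep state, and current values chosen only for illustration.

```python
def average_current_ma(active_ma: float, quiescent_ma: float, duty_cycle: float) -> float:
    """Time-weighted average current for a device that is active for
    `duty_cycle` (0..1) of the time and asleep otherwise."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return active_ma * duty_cycle + quiescent_ma * (1.0 - duty_cycle)


def battery_life_hours(capacity_mah: float, active_ma: float,
                       quiescent_ma: float, duty_cycle: float) -> float:
    """Idealized battery life: capacity divided by average current."""
    return capacity_mah / average_current_ma(active_ma, quiescent_ma, duty_cycle)


# Hypothetical device: 1,000 mAh battery, 20 mA when active, active 1% of the time
high_iq = battery_life_hours(1000, 20.0, 0.10, 0.01)   # 0.10 mA quiescent current
low_iq = battery_life_hours(1000, 20.0, 0.01, 0.01)    # 0.01 mA quiescent current
print(f"{high_iq / 24:.0f} days vs {low_iq / 24:.0f} days")
```

Because such a device spends almost all of its time asleep, the sleep-state current makes up a large share of the average draw, which is why reducing it pays off even when the active current is unchanged.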
In sum, IoT is gaining momentum because of growing customer and enterprise needs
meeting technology enablers at the right cost.
Standardization efforts
Throughout the world, many organizations are working on the standardization of IoT, based
on specific technology or holistically on reference architectures. Examples of this work
include:
In addition to these efforts, there is a lot of work going on in depth in many different
technology areas, such as the standardization of protocols. Protocol choices, both at the
transport as well as the application layer, are discussed later in this document.
Predictive maintenance
This white paper focuses on a common scenario that IoT enables, which we call predictive
maintenance: performing maintenance with a focus on timeliness, acting exactly when
needed instead of at regular intervals, and predicting and preventing failures before they
happen, based on learning from historical data. Predictive maintenance, or just-in-time
maintenance, will massively transform how organizations and consumers manage
equipment as well as people. Predictive maintenance also informs more traditional
preventative maintenance patterns, optimizing routine maintenance activities.
Predictive maintenance scenarios
The potential for useful applications in the Internet of Things (IoT) is endless. This section
focuses on scenarios that illustrate concrete benefits based on predictive maintenance,
where maintenance can be performed on both inanimate and living things. The following
scenarios provide examples of the enormous potential that IoT holds for enterprises.
Healthcare
With the previously described change in world demographics, there is an
increasing need for remote patient management, which allows elderly
citizens to visit the doctor or the hospital only when the need arises,
based on telemetry captured by smart devices. Some early innovation in
this space, geared more toward health self-management and consumer
devices, can be seen in watches with sensors that collect a variety of data,
such as blood pressure and heart rate. When body temperature, oxygen levels, and CO2
levels are combined with the ability to display this data to the patient and physician in real
time, this alleviates the stress of full waiting rooms and reduces the cost per patient.19
Another example is an in-home glucose monitor that uploads a patient's vital signs to a
cloud-based health platform, where the data is analyzed and presented back to the patient
in an easy-to-understand format on a mobile device, and in a more complex format on a
touchscreen to the doctor. The doctor can review the patient's information and then use the
touchscreen to send feedback to the patient and write a prescription.20
Powerful, specialized, cloud-connected devices like these, which enable doctors and patients to
work together to remotely monitor vital signs, exchange information, communicate, and
alert relatives, all in real time, are either becoming available or in development. By actively
monitoring patients at home21,22 or while they are mobile, healthcare professionals can
provide a higher level of care, reduce in-hospital waiting time and costs, and reduce stress
for everyone involved, which leads to better patient outcomes. Using technology to
accurately predict and signal medical staff about conditions that need attention enables
healthcare professionals to anticipate patient issues instead of reacting to them, and remedy
them before they become critical, all while maintaining the security and privacy of the data
collected from such technologies.23 As a positive side effect, the collected evidence of
provided care could also help alleviate the issue where doctors in the U.S. are sometimes
reluctant to provide prescriptions or diagnoses over the phone because of billing
restrictions,24 which forces patients to visit the office of the healthcare provider and, as a
result, wastes a lot of everyone's time on the treatment of common or recurring ailments.
19 Samsung Simband aims to take a big step in wearable health,
www.cnet.com/products/samsung-simband/
20 Microsoft HealthVault Medical Intelligent System,
www.youtube.com/watch?v=j8Y4ukdNM60
21 Medical Design Technology Magazine, The Internet of Things and Medical Device Product
Development: Practical Strategy Suggestions, March, 2014
22 YouTube, Medical Intelligent System, Proof of Concept
Automotive
Vehicles generate telemetry about their operation, and about the service
activities and faults that occur on them. They travel through different
locations, different weather conditions, and different usage scenarios: a
four-wheel drive vehicle climbing trails, a sports car in the mountains, or a
family van loaded with children. Each of these factors can have an effect
on how the vehicle operates, as well as its reliability, comfort, safety, and
performance. If the vehicle manufacturer or a vendor-agnostic data aggregator/analyst can
collect this data and analyze it over time, trends can be identified to find new, timelier,
more cost-effective, and more impactful actions to take. These can include performing
maintenance on the vehicle or reconfiguring it, which in turn can help to prevent recalls, or
conversely triggering recalls. Such actions keep the vehicle safe, and make it more fun,
useful, and cost effective for everyone involved, including the owner, the operator, and the
passengers.
Manufacturing
A service technician is dispatched to analyze an elevator after someone
reports that its doors will not close. The building owner is hearing from
people who are unhappy that they have to walk up the stairs. It takes the
technician an hour to drive to the building and find the elevator. After
arriving, he works through a standard checklist for another hour, only to
conclude that the elevator works as expected. As so often happens, a
fleeting obstruction, such as a coffee cup between the doors of the elevator or accumulated
dust and dirt in the sliding rail might have caused the problem.
The service technician drives back to his office, having spent a total of three hours on a
phantom problem. At $150 USD per hour and with more than one million elevators in
service, incidents such as this one, in which equipment is found to be operating normally
upon inspection, can have a big impact on the profitability of an elevator company,
depending on the type of maintenance contract.
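The cost of such phantom dispatches can be estimated with simple arithmetic. The sketch below uses the three hours and the $150 hourly rate from the scenario; the fleet size and the rate of phantom calls per elevator per year are hypothetical inputs, not figures from this paper.

```python
HOURLY_RATE_USD = 150      # from the scenario
HOURS_PER_DISPATCH = 3     # drive, checklist, and return, as in the scenario


def dispatch_cost(hours: float = HOURS_PER_DISPATCH,
                  rate: float = HOURLY_RATE_USD) -> float:
    """Labor cost of a single service dispatch."""
    return hours * rate


def annual_phantom_cost(elevators: int, phantom_calls_per_elevator: float) -> float:
    """Yearly labor cost of dispatches that find no fault.

    `phantom_calls_per_elevator` is an assumed average number of
    phantom calls per elevator per year.
    """
    return elevators * phantom_calls_per_elevator * dispatch_cost()


per_incident = dispatch_cost()                 # 3 hours x $150 = $450
# Hypothetical fleet: 10,000 elevators, 0.2 phantom calls per elevator per year
fleet_total = annual_phantom_cost(10_000, 0.2)
print(f"${per_incident:,.0f} per incident, ${fleet_total:,.0f} per year for the fleet")
```

Under a fixed-price maintenance contract this cost falls on the elevator company, which is why the impact depends on the type of contract.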
Moving beyond this reactive maintenance illustration, capturing telemetry about the motors
that operate the elevator or the speed at which the elevator doors close allows the
service technician to take a more predictive approach. For example, an increase in the
consumption of energy or a decrease in the door closing speed might signal a service request,
and trigger a maintenance crew to provide the service before the elevator breaks down and
customers call support, thus saving money, reducing downtime, and increasing customer
satisfaction.
23 Deloitte, Networked medical device cybersecurity and patient safety: Perspectives of
health care information cybersecurity executives
24 Texas Medical Association, Coding for Telephone Consultations
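A minimal sketch of this kind of rule compares recent telemetry against a historical baseline. Everything here is illustrative rather than taken from a real elevator system: the field names, the 15 percent tolerance, and the sample readings are assumptions, and a production system would more likely learn its thresholds from historical data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    energy_kwh: float        # energy consumed per trip (hypothetical unit)
    door_close_secs: float   # time the doors take to close


def needs_service(baseline: list[Reading], recent: list[Reading],
                  tolerance: float = 0.15) -> bool:
    """Flag the elevator when recent averages drift past the baseline.

    More energy per trip, or slower doors (a longer closing time), beyond
    `tolerance` is treated as a signal to schedule maintenance.
    """
    base_energy = mean(r.energy_kwh for r in baseline)
    base_door = mean(r.door_close_secs for r in baseline)
    energy_up = mean(r.energy_kwh for r in recent) > base_energy * (1 + tolerance)
    doors_slow = mean(r.door_close_secs for r in recent) > base_door * (1 + tolerance)
    return energy_up or doors_slow


baseline = [Reading(0.50, 3.0), Reading(0.52, 3.1), Reading(0.49, 2.9)]
healthy = [Reading(0.51, 3.0), Reading(0.50, 3.1)]
degrading = [Reading(0.53, 3.8), Reading(0.55, 4.0)]

print(needs_service(baseline, healthy))     # False
print(needs_service(baseline, degrading))   # True
```

In the second case the energy use is still within tolerance, but the doors take noticeably longer to close, which is enough to raise a service request before a breakdown occurs.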
Architectural considerations
Designing any system reveals concerns that transcend the individual components of the
system. In this section, we discuss various considerations and architectural approaches that
we have encountered while helping our customers design solutions in the realm of predictive
maintenance.
Connectivity
Interaction patterns
Special-purpose devices differ not only in the depth of their relationship with back-end
services, but in the interaction patterns of these services when compared to information-centric devices because of their role as peripherals. They are not the origin of command-and-control gestures; instead, they typically contribute information to decisions, and receive
commands as a result of decisions. The decision-maker does not interface with them locally,
and the device acts as an immediate proxy; the decision-maker is remotely connected and
might be a machine. We usually classify interaction patterns for special-purpose devices into
the four categories indicated in the following figure.
Inquiries occur when a device solicits information about the state of the world beyond
its own reach based on its current needs; an inquiry can be a singular request, but it
might also ask a service to supply ongoing updates about a particular information scope.
For example, a vehicle might supply a set of geo-coordinates for a route, and then ask
for continuous traffic alert updates about a particular route until it arrives at the
destination.
Telemetry and inquiries are device-initiated, and their counterparts, commands and
notifications, are service-initiated. This means that there must be a network path for
messages to flow from the service to the device, which raises a set of important
technical questions. How do you:
Ensure that the device only accepts legitimate commands and trustworthy notifications?
Ensure that the device is not easily susceptible to denial-of-service (DoS) attacks that
render it inoperable?
Connectivity pathways
In the architectures that we have worked on, there are four
common connectivity pathways:
From our project work, we have learned that for predictive maintenance
implementations, a bi-directional communication pattern is key to a manageable solution.
The reason for this bi-directional communication ability is to ensure that the system can tell
devices to change the way that they capture telemetry, for example, the rate at which it is
captured or the fidelity of the readings. We have not come across a case where the
requirements were simply to capture data from devices in a one-way communication flow.
Because most systems will need a method of telling devices to capture data at differing
Wide area network (WAN). A good example of a WAN is a cellular network. This
network type is a wireless network that is distributed over land areas called cells, each
served by at least one fixed-location transceiver, known as a cell site or base station. In a
cellular network, each cell uses a different set of frequencies from neighboring cells, to
avoid interference and provide guaranteed bandwidth within each cell. When joined
together, these cells provide radio coverage over a wide geographic area. This enables a
large number of portable transceivers (for example, mobile phones and pagers)
to communicate with each other and with fixed transceivers and telephones anywhere in
the network, via base stations, even if some of the transceivers are moving through
more than one cell during transmission.28 The most common cellular network is the type
that cellphones use. Cellphones and many integrated components for devices support
network technologies, such as Global System for Mobile Communications (GSM),
Universal Mobile Telecommunications System (UMTS), and, as an evolution of these,
Long Term Evolution (LTE), as well as others. As with many technologies in the ecosystem
of IoT, there is still a large opportunity for optimization of resource usage and cost for
these technologies.29
Protocol choices
After you have chosen a connectivity type, you need to determine which protocols suit the
purpose of your IoT solution. As you can see in the overview of logical protocols, the term
protocol applies to different layers in the stack, and there are different protocols to choose
from for each layer.
support either one of these protocols or both of them. The following figure and table provide
an overview of the packet structure and a lightweight comparison of TCP and UDP, as well as
factors to consider before choosing to use either protocol.
TCP | Connection-oriented | Full | +++ | ++ | no | active | 20 bytes | Yes, retransmit | Yes | -
UDP | Connectionless | None | + | + | yes | none | 8 bytes | Yes, no recovery possible | No | Broadcasts, Multicast
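To make the header sizes in the comparison concrete (20 bytes for TCP, 8 bytes for UDP), the following sketch computes how much of each transport segment is header for a few payload sizes. The payload sizes are arbitrary examples. The calculation ignores IP and link-layer headers, TCP options, and the extra packets TCP needs for connection setup and acknowledgements.

```python
TCP_HEADER_BYTES = 20  # minimum TCP header, without options
UDP_HEADER_BYTES = 8


def header_overhead(payload_bytes, header_bytes):
    """Fraction of a transport segment that is header rather than payload."""
    return header_bytes / (header_bytes + payload_bytes)


for payload in (16, 64, 512):  # example telemetry payload sizes in bytes
    tcp = header_overhead(payload, TCP_HEADER_BYTES)
    udp = header_overhead(payload, UDP_HEADER_BYTES)
    print(f"{payload:>4} B payload: TCP {tcp:.0%} header, UDP {udp:.0%} header")
```

The smaller the payload, the more the fixed header size dominates, which is one reason why small, frequent sensor readings are sensitive to the choice of transport.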
UDP is a good candidate to transmit data from constrained devices over constrained
networks in close proximity, such as LANs or PANs, where congestion and packet loss are
typically low. The following factors contribute to this:
UDP transactions require only two datagrams, which reduces network pressure.
On networks with a higher probability of packet loss, TCP, being more reliable and secure, is
a viable candidate. The following factors contribute to this:
Because TCP is connection oriented, a device that uses TCP can better defend itself
because it can ignore communications unrelated to current connections, whereas a
device that uses UDP must accept every packet it receives on the listening port.
In scenarios that use streaming video or audio, where high throughput is more important
than guaranteed packet delivery, and in telemetry solutions that can tolerate missing segments,
architects often choose UDP because packet loss is often a better tradeoff than experiencing
delays caused by TCP retransmission.38
There are also scenarios where the occasional lost packet on the telemetry channel would
be acceptable, but guaranteed delivery of commands would be required, which makes
the case for a composite model that uses both UDP and TCP.
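A composite model of this kind might look like the following sketch, which sends telemetry as a single UDP datagram and sends a command over TCP, waiting for an application-level acknowledgement. Everything here is illustrative: the function names, the newline framing, the `ACK` reply, and the use of the loopback address are assumptions made so that the example runs on one machine. For simplicity, the receiving side listens for the TCP connection, which a real device in untrusted space would usually avoid.

```python
import socket
import threading

LOOPBACK = "127.0.0.1"  # demo only; real endpoints would be remote


def send_telemetry(payload: bytes, address) -> None:
    """UDP: one datagram, no handshake, no acknowledgement, no retransmission."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)


def send_command(command: bytes, address, timeout: float = 2.0) -> bytes:
    """TCP: reliable byte stream, plus a newline-framed application-level ACK."""
    with socket.create_connection(address, timeout=timeout) as sock:
        sock.sendall(command + b"\n")
        with sock.makefile("rb") as reader:
            return reader.readline().rstrip(b"\n")


def demo():
    # Telemetry channel: the datagram is queued by the OS until it is read.
    telemetry_rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    telemetry_rx.bind((LOOPBACK, 0))  # port 0 = any free port
    telemetry_rx.settimeout(2.0)
    send_telemetry(b"door_close_secs=3.1", telemetry_rx.getsockname())
    reading, _ = telemetry_rx.recvfrom(1024)
    telemetry_rx.close()

    # Command channel: a helper thread plays the receiving side.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((LOOPBACK, 0))
    listener.listen(1)

    def receive_one_command():
        conn, _ = listener.accept()
        with conn, conn.makefile("rb") as reader:
            conn.sendall(b"ACK " + reader.readline())

    worker = threading.Thread(target=receive_one_command)
    worker.start()
    ack = send_command(b"set-rate=5s", listener.getsockname())
    worker.join()
    listener.close()
    return reading, ack


if __name__ == "__main__":
    print(demo())
```

If the telemetry datagram were lost, the sender would never know, whereas the command sender either receives the acknowledgement or sees a timeout and can retry.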
The first implementation of AMQP was iMatix OpenAMQ,42 but others have emerged as
well, notably Apache Qpid,43 Microsoft Azure Service Bus,44 and RabbitMQ.45
AMQP is a binary wire protocol that supports programming languages such as C#, C,
Java, Perl, Python, Ruby, PHP, and Lisp.
Where many traditional queuing mechanisms have failed, AMQP seems to be
thriving and is currently used in many systems, such as:46
Aadhaar,47 a large-scale identity system with 1.2 billion identities and about 100 million
authentications per day.48
The National Science Foundation's Ocean Observatories Initiative, processing 3
petabytes per year49
For more information, see the AMQP site to read the specifications on the protocol
or try a free implementation.
Figure 8. Multicast
In our greenfield implementation projects,
AMQP has been the best fit because in addition to it being efficient, reliable, flexible,
and broker independent, AMQP is native to Microsoft Azure Service Bus, the key
technology component for all these customer projects.
Security
With devices communicating sensitive information and acting on our behalf, we clearly need
to ensure that the system and the information it captures, processes, and stores, is secure.
With any system, security is a tradeoff with other requirements, such as user friendliness,
performance, cost, and so on. In this section, we cover some important security aspects we
have come across while working in this field.
"This is the weather forecast for the week of June 16, 2024, for Texas," the
weatherman says. "Last week was hot, but this week will be sizzling, with
temperatures reaching in excess of 110 degrees, with no rain expected." In
hot weather, irrigation is the key to crop and cattle survival. Because most of
the state's farmers are using a new irrigation system that depends on
thousands of sensors to determine the best time to irrigate, few of them
worry. What they don't know is that the system is sending faulty telemetry
information that indicates that it rained every day last week. This keeps the
system from irrigating, and now, crops and cattle start to die.
When distributed systems directly influence the physical world by turning valves, controlling
servos, and much more, there is a clear need to ensure that compromised systems do not
kill crops, cattle, and people, burn buildings, or crash cars. The security bar for commands
and data that make things move must be much higher than in e-commerce or finance.
Let's start with a short list of questions about security for the kinds of systems that we have
come across in our work on predictive maintenance: a list of factors to think about as you
architect an IoT system. On top of normal security precautions, you also need to know how
to:
Securely onboard new devices. You must ensure that only devices that the system
can register are allowed into the system.
Ensure that device data can be trusted. As devices communicate, you need to
ensure that the data that they transmit is received unaltered and from verified sources;
that is, the data logged in the service by the device must be trustworthy, representing a
point-in-time observation. This requires integrity and authenticity of data in information-security terms.56
Prevent devices from denying service. In modern software architecture, the level of
interdependencies is high and increasing. Dependencies within the system, such as
devices measuring data potentially critical to effective decision-making, need to be
available and accessible.
Accept only authorized commands on devices. In any system that acts on external
commands and especially one that interacts with the physical world, it is imperative to
ensure that those commands are only acted on if they are properly authenticated and
authorized.
Remove rogue devices from the system. If you find a bad actor such as a
compromised device in the system, you must be able to remove it quickly.
Ensure that devices are always connected to a particular service. A powerful feature
of modern communication is the use of hyperlinks that let clients dynamically
reroute traffic. Devices will blindly follow these hyperlink redirects without thinking twice
(or once, for that matter). Besides offering flexibility, redirects pose a substantial risk if
someone redirects the dataflow into an intermediate system to alter system behavior,
copy the data, or modify the data stream.
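As one illustration of the integrity and authenticity requirement in the list above, the following sketch signs each telemetry reading with a per-device key and rejects readings that are altered, come from an unknown device, or are too old. HMAC-SHA256 with a shared key is only one possible technique, chosen here for brevity. The field names, the `rainfall_mm` measurement, and the five-minute freshness window are assumptions made for the example.

```python
import hashlib
import hmac
import json


def sign_reading(device_id, key, rainfall_mm, timestamp):
    """Device side: serialize the reading and attach an HMAC-SHA256 tag."""
    body = json.dumps(
        {"device": device_id, "ts": timestamp, "rainfall_mm": rainfall_mm},
        sort_keys=True,
    )
    tag = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_reading(message, device_keys, now, max_age_secs=300):
    """Service side: return the reading if it is trustworthy, else None."""
    reading = json.loads(message["body"])
    key = device_keys.get(reading.get("device"))
    if key is None:
        return None  # unknown or unregistered device
    expected = hmac.new(key, message["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["tag"]):
        return None  # altered in transit, or signed with the wrong key
    if abs(now - reading["ts"]) > max_age_secs:
        return None  # stale or replayed, so not a point-in-time observation
    return reading


device_keys = {"sensor-17": b"demo-key-not-for-production"}
message = sign_reading("sensor-17", device_keys["sensor-17"], 12.5, timestamp=1000)
print(verify_reading(message, device_keys, now=1010))
```

Note that a signature only proves which key produced the message. A compromised device that still holds its key can sign faulty readings, which is why the ability to remove rogue devices remains a separate requirement.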
A VPN does not help establish any notion of authentication and authorization beyond its
immediate scope. A network application that sits on the other end of a TCP socket, where a
portion of the route is facilitated by the VPN, is oblivious to the VPN's existence because it acts on
the transport and application layers of the network model. What matters for the
trustworthiness of the information that travels from the logic on the device to a remote
control system that does not reside on the same network, as well as for commands that
travel back up to the device, is solely a fully protected end-to-end communication path
spanning networks, where the identity of the parties is established at the application layer.
Protecting the route at the transport layer by signature and encryption is done as a service
for the application layer either after the application has given its permission (for example,
via certificate validation hooks) or just before the application layer performs an authorization
handshake, before entering into any conversations. Establishing end-to-end trust is the job
of application infrastructure and services, not of networks.
Compliance
For vertical sectors such as government and healthcare, compliance is a key consideration
as you architect an IoT solution. National and local governments and industry groups have
mandates that affect what a company can share and with whom. Conversely, some
regulations require the sharing of data among government entities or businesses that work
on government programs. The EU has model clause regulations that dictate the storage and
exposure of personal data.58 The U.S. has similar regulations, such as the Health Insurance
Portability and Accountability Act (HIPAA)59 and the Privacy Act.60 Other countries and
entities also have privacy mandates that consider the location of stored data, its origin, the
location and nationality of the users, and the location, nationality, and use of the data
consumers.
If ingested, processed, or published data offers no way to discern details about specific
people, it is less likely to be affected by regulation. But all data that is made available to the
public or even a controlled set of partners must be reviewed to adhere to all applicable
mandates because violations present high legal61 and reputational risks.62
Healthcare
The HIPAA and HITECH laws in the U.S. apply to healthcare and partner organizations that
have access to sensitive patient information, called electronic protected health information
(ePHI). Service providers that work with these entities usually must agree in writing to
adhere to security and privacy provisions set forth in HIPAA and the HITECH Act. If an IoT
system that supports applications such as the one we described in the Healthcare scenario
captures ePHI, it must adhere to these laws. Microsoft provides a Business Associate
Agreement as a contract addendum to its cloud platform, Microsoft Azure.63 We also provide
information on some of the best practices for HIPAA-compliant applications, and we detail
Microsoft Azure provisions for handling security breaches.64
58 European Commission, Protection of Personal Data
59 US HHS, Health Information Privacy
60 US HHS, Privacy Act
61 TechRepublic, Data security laws and penalties: Pay IT now or pay out later
62 Experian, Reputation Impact of a Data Breach
63 U.S. Department of Health & Human Services, Health Information Privacy, Business Associates
64 Microsoft, Azure HIPAA Implementation Guidance
to act like a server (receiving, parsing, and answering arbitrary requests from clients), which
increases its attack surface for malicious abuse.
Service-assisted communication
Another approach to connecting a large number of devices to the central service within a
system is to have the device connect to a well-known service (called a gateway) and then
use that service to tunnel commands to the device. The goal of this approach is to establish
trustworthy and bi-directional communication paths between control systems and special-purpose
devices that are deployed in untrusted physical space. To that end, the following
principles are established:
Devices do not accept unsolicited network information. All connections and routes
are established in an outbound-only fashion.
Devices are peered with a gateway to only connect or establish routes to well-known services. If devices need to feed information to or receive commands from a
multitude of services, they are peered with a gateway that takes care of routing
information downstream. This ensures that commands are only accepted from authorized
parties before routing them to the devices.
The communication path between device and service or device and gateway is
secured at the application protocol layer. This mutually authenticates the device to
the service or gateway and vice versa. Because the application does not normally
concern itself with lower-level layers in the network stack as we discussed earlier in
Connectivity, device applications do not trust the link-layer below.
Device. The device acts like a client; it connects to the gateway and does not listen for
unsolicited traffic. The device connects to an external gateway by creating and
maintaining an outbound TCP socket across a NAT boundary or by establishing a bi-directional UDP route, potentially using mechanisms such as Session Traversal Utilities
for NAT (STUN) or, with larger NATs, Traversal Using Relays around NAT (TURN). These
facilitate the detection of a NAT and the discovery of the public IP address of the network
for binding.
Connection. The connection is routed through the edge device, usually a router.
Because the connection is outbound, the port mapping is performed automatically. By
only relying on outbound connectivity, the NAT/Firewall device at the edge of the local
network will never have to be opened up for any unsolicited inbound traffic.
The outbound connection or route is maintained by either client or gateway in a fashion
that intermediaries such as NATs will not drop due to inactivity. That means that either
side might send some form of a keep-alive packet periodically, or send a payload packet
periodically that then doubles as a keep-alive packet. Under most circumstances it will
be preferable for the device to send keep-alive traffic as it is the originator of the
connection or route, and it can and should react to a failure by establishing a new one.
As TCP connections are endpoint concepts, a connection will only be declared dead if the
route is considered collapsed and the detection of this fact requires packet flow. A device
and its gateway may therefore sit idle for quite a while believing that the route and
connection are still intact before the lack of acknowledgement of the next packet reveals
that the assumption is incorrect. This conflict in behavior calls for a tradeoff decision to be
made.
Carrier-grade NATs (CGNs) employed by mobile network operators permit very long
periods of connection inactivity and mobile devices that get direct IPv6 address
allocations are not forced through a NAT at all. The push notification mechanisms
employed by all popular smartphone platforms use this to dramatically reduce the power
consumption of the devices by maintaining the route very infrequently (every 20
minutes or more) so the devices can remain in sleep mode with most systems turned off
while idly waiting for payload traffic. The downside of infrequent keep-alive traffic is that
the time it takes to detect a bad route is, at worst, as long as the keep-alive interval.
Ultimately, it is a tradeoff between battery-power and traffic-volume cost (on metered
subscriptions) and acceptable latency for commands and notifications in case of failures.
The device can actively detect potential issues and abandon the connection and create a
new one when, for instance, it hops to a different network or when it recovers from signal
loss.
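The keep-alive tradeoff described above can be put into rough numbers. In this sketch, the 60-byte packet size is an assumed figure for illustration only, and the worst-case detection time is simply taken to be the keep-alive interval, as stated in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def keepalive_tradeoff(interval_secs, packet_bytes=60):
    """Estimate daily keep-alive cost and worst-case failure detection time.

    packet_bytes is an assumed on-the-wire size for one keep-alive packet.
    """
    per_day = SECONDS_PER_DAY // interval_secs
    return {
        "keepalives_per_day": per_day,
        "bytes_per_day": per_day * packet_bytes,
        "worst_case_detection_secs": interval_secs,
    }


for interval in (30, 20 * 60):  # an aggressive interval vs. 20 minutes
    print(interval, keepalive_tradeoff(interval))
```

A 30-second interval wakes the radio 2,880 times a day but detects a dead route within half a minute. A 20-minute interval needs only 72 packets a day, at the cost of commands possibly waiting up to 20 minutes before a failure is noticed.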
The connection from the device to the gateway is protected end-to-end and ignores any
underlying link-level protection measures. The gateway authenticates with the device
and the device authenticates with the gateway, so neither is anonymous to the other. In
the simplest case, this can be done by exchanging a previously shared key. As we see
quite often in more capable devices, it can also be done via an X.509 certificate exchange
as performed by Transport Layer Security (TLS), or a combination of a TLS handshake
with server authentication where the device later supplies credentials or an authorization
token at the application level. The privacy and integrity protection of the route is also
established end-to-end, ideally as a byproduct of the authentication handshake so that a
potential attacker cannot waste cryptographic resources on either side without producing
proof of authorization.
Today, TLS/DTLS and Secure Shell (SSH) dominate as application-level connection
security protocols. SSH is popular, but it lacks a standard session-resumption gesture.
TLS supports both the X.509 certificate-exchange model and a simplified model (TLS-PSK) that uses previously shared keys. Removing support for X.509 certificate handling
and wire-level exchange reduces the footprint of the TLS library, and by reducing the
supported algorithms (for example, supporting only AES-256 and SHA-256), it's feasible
to use this protocol on compute- and memory-constrained devices while remaining
compatible with other application layer protocols that rely on TLS. The result of all this is
a secure peer connection between the device and a gateway that only the gateway can
feed.
Edge security. Because there are no ports open to listen on the edge device, the attack
surface on the local network and its devices is minimized.
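For the certificate-based variant of the device-to-gateway handshake described above, a device-side TLS configuration could be sketched with Python's standard `ssl` module as follows. The function name and its parameters are hypothetical, and the TLS 1.2 floor is an assumption for the example. The pre-shared-key (TLS-PSK) variant is not shown.

```python
import ssl


def build_device_tls_context(ca_file=None, client_cert=None, client_key=None):
    """Create a client-side TLS context for an outbound device connection.

    ca_file:      CA bundle used to verify the gateway (system default if None).
    client_cert/
    client_key:   optional device certificate for mutual TLS authentication.
    """
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # create_default_context already requires a valid gateway certificate
    # and checks that it matches the host name the device connects to.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert:
        context.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return context


context = build_device_tls_context()
print(context.verify_mode, context.check_hostname)
```

The device would then wrap its outbound TCP socket with `context.wrap_socket(sock, server_hostname=...)`, passing the gateway's host name. When no client certificate is loaded, this corresponds to the case in which the device supplies credentials or an authorization token at the application level after the handshake.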
projection into the gateway and its address remains completely stable. The gateway can
also be federated with external identity and authorization services, so that only callers
acting on behalf of particular users or systems can invoke particular device functions.
The gateway therefore provides basic network defense, API virtualization, and
authorization services all combined into one. This approach gets even better when it
includes or is based on an intermediary messaging infrastructure that provides a scalable
queuing model for both ingress (device to cloud) and egress (cloud to device) traffic.
Without this intermediary infrastructure, this approach would still suffer from the
issue that devices must be online and available to receive commands and
notifications when the control system sends them. With a per-device queue or
per-device subscription on a publish/subscribe infrastructure, the control system
can drop a command at any time, and the device can pick it up whenever it is
online. If the queue provides time-to-live expiration alongside a dead-lettering
mechanism for such expired messages, the control system can also know
immediately when a message has not been picked up and processed by the
device in the allotted time.
The queue also ensures that the device can never be overtaxed with commands
or notifications. The device maintains one connection into the gateway and it
fetches commands and notifications on its own schedule. Any backlog forms in
the gateway and can be handled there accordingly. The gateway can start
rejecting commands on the device's behalf if the backlog grows beyond a
threshold or the cited expiration mechanism kicks in and the control system gets
notified that the command cannot be processed at this time.
On the ingress-side (from the gateway perspective) using a queue has the same
kind of advantages for the back-end systems. If devices are connected at scale
and input from the devices comes in bursts or has significant spikes around
certain hours of the day, such as with telematics systems in passenger cars
during rush-hour, having the gateway deal with the traffic spikes keeps the back-end system robust. The ingestion queue also allows telemetry and other data to
be held temporarily when the back-end systems or their dependencies are taken
down for service or suffer from service degradation of any kind.
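The per-device queue behavior described above (time-to-live expiration, dead-lettering, and rejecting commands when the backlog grows too large) can be sketched as a small in-memory model. This is not the API of any real messaging service. The class name, method names, and the use of explicit `now` timestamps are assumptions that keep the example deterministic.

```python
from collections import deque


class DeviceCommandQueue:
    """In-memory model of a per-device command queue held at the gateway."""

    def __init__(self, max_backlog=100):
        self.max_backlog = max_backlog
        self._pending = deque()  # (expires_at, command), oldest first
        self.dead_letters = []   # expired commands, visible to the control system

    def enqueue(self, command, now, ttl_secs):
        """Control-system side: returns False if rejected on the device's behalf."""
        self._expire(now)
        if len(self._pending) >= self.max_backlog:
            return False
        self._pending.append((now + ttl_secs, command))
        return True

    def fetch(self, now):
        """Device side: called on the device's own schedule when it is online."""
        self._expire(now)
        return self._pending.popleft()[1] if self._pending else None

    def _expire(self, now):
        live = deque()
        for expires_at, command in self._pending:
            if expires_at <= now:
                self.dead_letters.append(command)  # dead-letter expired commands
            else:
                live.append((expires_at, command))
        self._pending = live


queue = DeviceCommandQueue(max_backlog=2)
accepted = [
    queue.enqueue("reboot", now=0, ttl_secs=60),
    queue.enqueue("set-rate=5s", now=10, ttl_secs=600),
    queue.enqueue("open-valve", now=20, ttl_secs=60),  # backlog full: rejected
]
first = queue.fetch(now=100)  # the device comes online after "reboot" expired
print(accepted, first, queue.dead_letters)
```

The control system learns about both failure cases without ever contacting the device: the rejected command through the return value, and the expired one through the dead-letter list.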
Architectures.67 The Microsoft patterns & practices team also has a large body of work on
Cloud Development that provides guidance on building scalable cloud systems.68
There are specific scalability areas that come up more frequently in IoT scenarios that may
not appear in other IT solutions, however. One area is identity. For web properties, the
concept of identity federation has taken hold, and most modern consumer web properties
now allow a user to use their identity from other well-known identity stores, such as an
account registered with Microsoft, Facebook, Google, Yahoo, and so on. Additionally,
corporate accounts can be federated with platform as a service (PaaS) vendors and partners.
But with the addition of devices, there will often be identities associated with those devices,
relationships between those devices and human identities, and relationships between
multiple humans and devices. This potentially complex set of relationships should be
considered early in an IoT project, and the solution should strive to simplify these
relationships as much as possible.
In our project experience we have not yet seen a pattern that satisfies this level of
complexity with satisfactory results. The initial projects have used Azure Active Directory for
human identities, and external data stores for device identity and the associations with
Azure Active Directory users. Design, prototyping, and testing are ongoing to find
more scalable, resilient, and feature-complete solutions.
to take into account the scalability parameters of the Service Bus72, and use as many topics
as needed to comfortably support the number of devices in the system and scale to
additional topics if and when additional devices are added to the system.
Device registration
Registering devices is the critical first step to take to
ensure that the system is secure and remains secure,
only allows data to be ingested from trusted endpoints,
and that devices only accept commands from trusted
systems. A device must be uniquely identified, the
system must authenticate its identity, and the device
must know that it is communicating securely with only
the correct collection endpoint.
Often a device will be created with the knowledge of the expected endpoints, or at least
have some influence over the collection point. An example of this is a vehicle whose
manufacturer is selling a connected vehicle experience. In this scenario, when the device is
manufactured, a unique key will be stored on the device. Either that key or a public key
associated with it will be stored in a database, and when the device is enabled, the service
can check the database and verify that the device is an approved device. These keys may
be service-generated, such as by Azure Active Directory Access Control Services (ACS), or
keys created to support the TLS-PSK pattern as described earlier in this paper, or keys
intended for service-specific authentication. Typically, even when the device carries a key
out of the factory, the device will become active in another step; for example, when a
customer purchases, installs, and configures the device. Configuration will associate a user
with the device, which transforms it to an active device. The device may be issued a new
key at this time.
In other cases, the set of potentially connected devices will not be known at manufacturing
time, so keys cannot be installed on the device prior to its release. In this case, device
registration must happen when the device is installed or activated. An example of this might
be a traffic service that will collect GPS and movement telemetry from a smartphone, and in
turn provide free traffic information for users who opt in to share data. In this case, there
would be a registration step where a user must identify the device to the service, the service
then sends a key to the device, and then that key is used to manage communication.
72 Microsoft, Service Bus Scalability
73 MSDN, Data Management Patterns and Guidance
Equally important to device registration is the ability to unregister the device, or disable it.
This is critical because even though the communication with the device is secure, the device
itself can become compromised. Being able to unregister the device and refuse
communication is a critical aspect of the system. With device-specific keys, the keys can be
revoked and the system can quickly stop accepting telemetry from the device.
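The registration lifecycle described in this section (issue a device-specific key, verify the device against the stored key, revoke the key when the device is compromised) can be sketched as follows. The in-memory `DeviceRegistry`, the HMAC-based challenge-response check, and all names are assumptions made for illustration. They stand in for whatever key store and authentication scheme a real system uses.

```python
import hashlib
import hmac
import secrets


class DeviceRegistry:
    """Service-side store of device-specific keys (in memory for the sketch)."""

    def __init__(self):
        self._keys = {}

    def register(self, device_id):
        """Issue a new key, at manufacturing time or at activation."""
        key = secrets.token_bytes(32)
        self._keys[device_id] = key
        return key

    def unregister(self, device_id):
        """Revoke the key so the device is refused from now on."""
        self._keys.pop(device_id, None)

    def verify(self, device_id, challenge, response):
        """Check that the caller holds the key issued to device_id."""
        key = self._keys.get(device_id)
        if key is None:
            return False
        expected = hmac.new(key, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def device_response(key, challenge):
    """Device side: prove possession of the key without sending it."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


registry = DeviceRegistry()
key = registry.register("vehicle-042")
challenge = secrets.token_bytes(16)  # fresh random value per connection
print(registry.verify("vehicle-042", challenge, device_response(key, challenge)))
registry.unregister("vehicle-042")
print(registry.verify("vehicle-042", challenge, device_response(key, challenge)))
```

Because each device has its own key, revoking one device does not affect any other device in the system.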
Acquiring data
IoT data acquisition is frequently referred to as data
ingestion. In literature about Big Data, the three Vs
(volume, variety, and velocity) are often cited.74 There are
other aspects to consider as well. In our initial
engagements in IoT, we have seen that device bandwidth,
connection speed, reliability, and cost have been major
influencers in the solution choices made. But each item in
this section is important, and the relative importance of
each will vary depending on a project's requirements. The
following sections discuss many aspects of data ingestion.
balance development environment support, device support, the need for compactness, and
storage and processing requirements on the message-receiving side.
Message types
Your system may require different message types that can differ in schema, data type, or
both of these. A real-world example of this is a connected vehicle system that predominantly
sends telemetry information for predictive maintenance. This system might also be used to
send audio or video clips for emergency management, accident recording, and so on. In
these cases, the media files are often enhanced with metadata related to the collection of
the media file. Additionally, the media messages may be of lower or higher priority and they
may require splitting, compression, resumption on error, and temporary local storage. If
different device types are involved, they may provide media files in formats or encoding
levels that are optimized or specific to those devices, which could require normalization at
the storage point.
Message priority
Different message types will often have different priorities in an IoT system. A message can
be a standard telemetry message that is intended specifically for cataloging, and used for
machine learning algorithms downstream. There can be other message types that are
considered events and alarms. An event could be an elevator door opening, a car starting, or
the temperature being increased in a home, whereas an alarm might be a broken window, a
car crash, or a full engine failure.
Message priority will be handled either by providing a separate endpoint for priority
messages, or by detecting attributes in the message itself to assign priority. Using a
separate endpoint for priority messages can reduce the chance of a high priority message
delivery being slowed by a flood of the standard flow messages. If the throughput of the
initial point of ingestion is considered adequate, then downstream detection is an option, for
instance creating a standard subscription and a high priority subscription on an Azure
Service Bus Topic.
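The attribute-based option can be illustrated with a small routing function that plays the role of the subscription filters. This is an in-memory sketch, not Service Bus code. The `kind` and `priority` attributes and the subscription names are hypothetical.

```python
HIGH_PRIORITY_KINDS = {"alarm"}  # hypothetical message attribute values


def route(message):
    """Return the name of the subscription that should receive the message."""
    if message.get("kind") in HIGH_PRIORITY_KINDS or message.get("priority") == "high":
        return "high-priority"
    return "standard"


def dispatch(messages):
    """Split a stream of messages into per-subscription lists."""
    subscriptions = {"high-priority": [], "standard": []}
    for message in messages:
        subscriptions[route(message)].append(message)
    return subscriptions


incoming = [
    {"kind": "telemetry", "door_close_secs": 3.1},
    {"kind": "event", "name": "door-opened"},
    {"kind": "alarm", "name": "engine-failure"},
    {"kind": "event", "name": "car-started", "priority": "high"},
]
result = dispatch(incoming)
print(len(result["high-priority"]), len(result["standard"]))
```

Downstream detection like this only helps if the shared ingestion point keeps up with the standard traffic. Otherwise, a separate endpoint for priority messages is the safer choice.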
There are also cases where device priority may be required. In a connected vehicle scenario,
there may be a premium service that has priority, or there may be sensors in a building with
relative priority, such as one that detects a broken window on the first floor that has higher
priority than one on the fifth floor of the building. In this case, the priority may be handled
similarly to message priority. Another approach is to use a separate service that handles the
higher priority devices.
Conditional messaging
In some of our projects, the solution required the message pattern to change based on
conditions. In this case, if a service technician received an alert that an elevator needed
attention, the technician could send a message to the device asking for it to increase the
detail and frequency of messaging. This would continue for a configurable timeframe.
This type of requirement means that the solution must be scaled to handle the conditional
events. For instance, if the devices could automatically increase the size and frequency of
messages, they could cause a dramatic increase in traffic to the system. Safeguards and
throttling should be considered to protect against unplanned data floods in such situations.
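One possible device-side safeguard is to let a command raise the reporting rate only for a limited time and never beyond a fixed floor. The following sketch shows that idea. The class, its parameters, and the default intervals are assumptions for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class ReportingPolicy:
    """Decides how often a device reports, with a temporary 'boost' mode."""

    def __init__(self, normal_interval=60.0, min_interval=5.0):
        self.normal_interval = normal_interval
        self.min_interval = min_interval  # safeguard: never report faster than this
        self._boost_interval = None
        self._boost_until = 0.0

    def boost(self, interval, duration, now):
        """Handle a command asking for more frequent messages for a while."""
        self._boost_interval = max(interval, self.min_interval)
        self._boost_until = now + duration

    def current_interval(self, now):
        """Seconds to wait between messages at time 'now'."""
        if self._boost_interval is not None and now < self._boost_until:
            return self._boost_interval
        return self.normal_interval


policy = ReportingPolicy()
policy.boost(interval=1.0, duration=600, now=100)  # technician asks for detail
print(policy.current_interval(200), policy.current_interval(700))
```

The requested one-second interval is clamped to the five-second floor, and the device falls back to its normal rate once the configured timeframe has passed. A matching throttle on the service side would still be needed to protect against misbehaving devices.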
Contextual messaging
Similar to conditional messaging, there are use cases that require contextual messaging,
which can follow multiple patterns. There may be situations where the device includes
contextual information in the messages that it sends. The data may include GPS
coordinates, and a vehicle may need to send additional telemetry when it travels above a
certain altitude, or if the ambient temperature rises above a trigger level. The context may
require more data in messages, the collection of data from other sensors on the device, or it
may require more or less frequent message transmission.
Message batching
The natural inclination may be to send messages immediately when data is generated, but
there are several reasons why messages may be batched. A device may be power
constrained, so the connectivity may only be turned on for a limited amount of time. The
connection may be unreliable, so it could make sense to batch the collected messages for a
single transmission once connectivity is available. The device may move in and out of
connectivity, or connectivity may be congested or less expensive at certain times of the day.
If you allow batched messages, the message receiver must be designed to accept them as
well as single messages. In this case, a message envelope that can contain multiple
messages or a single message can simplify the solution.
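A minimal sketch of such an envelope, assuming JSON as the message format, could look as follows. The field names `count` and `messages` and the function names are hypothetical.

```python
import json


def wrap(messages):
    """Wrap one message (a dict) or several (a list of dicts) in one envelope."""
    if isinstance(messages, dict):
        messages = [messages]
    return json.dumps({"count": len(messages), "messages": list(messages)})


def unwrap(envelope_json):
    """Return the list of messages in an envelope, whether it holds one or many."""
    envelope = json.loads(envelope_json)
    return envelope["messages"]
```

Because the receiver always gets a list, it needs only one code path for single and batched messages.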
Storing information
In an IoT solution, there are also several aspects to
consider for data storage. The following sections discuss
many aspects of this topic.
Transforming data
Generally, data goes through multiple transformation steps across its generation, transmission,
storage, and processing. As stated in the previous section, there
may be data transformation happening on the device itself, such as converting its format,
aggregation, and so on. This will rely on local processing capabilities. Other than the local
preprocessing, any other transformation would happen at the collection point.
For years, data processing has been thought of in terms of Extract, Transform, and Load
(ETL). With the advent of Big Data, much of the discussion has changed to Extract, Load,
and then Transform (ELT). The key concept in this transition is that your system is ingesting a
huge amount of data, and the transformation process costs significant compute power.
Additionally, while this transformation is happening, the data is at risk. If it has not yet been
serialized, and the server crashes, then the data is lost. With ELT, the system ingests the
data and immediately stores it. This minimizes the exposure of the data during ingestion,
and provides new opportunities for data transformation and analysis. First, the data can be
transformed asynchronously from ingestion. This helps reduce compute demand. Then the
data can be transformed multiple times, for multiple purposes, and this process also
supports the idea of collecting all data for extended periods of time. This is often referred to
as a data lake75, and this strategy suggests keeping all data for later analysis. The
rationale for this is that machine learning algorithms may find interesting patterns or trends
that would not be expected, and that these would warrant studying other seemingly
unneeded data.
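The ELT idea can be sketched in a few lines: store each message untouched on arrival, and run transformations later, as often as needed. The in-memory `RawEventStore` below is a hypothetical stand-in for durable storage, and the sample messages are made up for illustration.

```python
import json


class RawEventStore:
    """Stand-in for durable storage; a real system would persist the data."""

    def __init__(self):
        self._raw = []

    def ingest(self, raw_message):
        # Load first: keep the message exactly as received, without parsing it.
        self._raw.append(raw_message)

    def transform(self, func):
        # Transform later: apply any function to the stored raw messages.
        return [func(message) for message in self._raw]


store = RawEventStore()
store.ingest('{"device": "elevator-1", "temp_c": 21.5}')
store.ingest('{"device": "elevator-2", "temp_c": 48.0}')

# The same raw data can be transformed several times, for different purposes.
readings = store.transform(json.loads)
hot_devices = [r["device"] for r in readings if r["temp_c"] > 40.0]
message_sizes = store.transform(len)
```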
Location
Most IoT solutions will send data to a public or private cloud. If connected devices are
geographically distributed, there may be a case for storing the data across several locations
around the globe, in order to store the data closest to where it was generated. There may
also be government mandates that require an individual's data to remain in that person's
home country, or the data may only be interesting within the region in which it was
collected. However, in a large percentage of projects, the value is in the large body of data,
so data must be brought together into a single location for the most insightful analysis. In
this case, the considerations will center on the time constraints of the analysis (how often
are the algorithms run?), the physical limitations of the data centers, bandwidth, and the
cost of moving data.
Processing information
After the data is ingested, it must be processed.
Processing types range from very simple to long-running and complex. The following sections
discuss common IoT data-processing types.
Alarm processing
A common use case is to watch for specific data items on ingestion and then take action
based on that data. These could be alarms from devices, or any kind of simple event
processing. The characteristic of this type of processing is that there is a specific set of
76 Teradata, Hot and Cold Running Data
values that are to be monitored on specific attributes of the incoming data that can trigger
predetermined responses. While this type of event processing is logically straightforward,
the implementation still requires care because of the expected high volume of ingested data
and the likelihood that the events requiring a response are relatively important.
In alarm processing, the solution must also account for the potential of alarm floods. If a
systemic failure happens, for instance if home alarm systems send an alarm to the event
processing system when the power goes out, there may be a flood of alarms. Or, if there
is no battery backup, messages may be cached on the device, and then when the power
returns, all the devices send their entire set of messages at once. To handle these situations,
the devices may be designed to have a random offset for message delays, or the message
receiving service can implement a circuit breaker pattern 77 to circumvent failure when an
abnormal event pattern happens.
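Both mitigations can be sketched briefly. The code below is an illustration under stated assumptions, not a reference implementation: the device-side function picks a random send delay, and the receiver-side class is a simplified breaker that trips on arrival rate rather than on downstream failures. All names and parameters are hypothetical.

```python
import random


def replay_delay_seconds(max_offset_seconds, rng=random):
    """Device side: random delay before sending cached messages after an outage."""
    return rng.uniform(0.0, max_offset_seconds)


class CircuitBreaker:
    """Receiver side: reject messages for a cooldown period after a flood."""

    def __init__(self, max_per_window, window_seconds, cooldown_seconds):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.window_start = None
        self.count = 0
        self.open_until = None

    def accept(self, now):
        """Return True if a message arriving at time `now` should be processed."""
        if self.open_until is not None:
            if now < self.open_until:
                return False  # breaker is open: shed load
            # Cooldown is over: close the breaker and start a fresh window.
            self.open_until = None
            self.window_start = None
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count > self.max_per_window:
            # Abnormal arrival rate: open the breaker for the cooldown period.
            self.open_until = now + self.cooldown_seconds
            return False
        return True
```

Rejected messages would typically be retried by the device later, which is where the random offset helps spread the retries.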
Complex-event processing
Complex-event processing is used to detect conditions or states on data in motion that may
not be directly deduced from simple data evaluation. This might include the detection of a
certain set of events that arrive in a particular order or frequency, such as an event that is
innocuous if it appears once, but that indicates a problem if it occurs a certain number of
times in a certain timeframe, or if the same event is transmitted from a set of devices or
sensors. Imagine that your car sends telemetry to the manufacturer, and one of the items
that it reports is failed starts. By itself, this would mean very little to the manufacturer.
However, if the weather got very cold last night, and none of the SuperCar Model 8s in that
area started in the morning, that could tell the manufacturer that there is a systemic
problem with the car's battery or something related to the starting system.
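The failed-start scenario can be sketched as a sliding-window check per region. This is a simplified illustration, assuming events arrive in timestamp order; the class name, the region key, and the thresholds are hypothetical, and a production system would more likely use a stream-processing engine.

```python
from collections import defaultdict, deque


class FailedStartDetector:
    """Flag a region when enough distinct devices report a failed start in a window."""

    def __init__(self, min_devices, window_seconds):
        self.min_devices = min_devices
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # region -> deque of (timestamp, device_id)

    def observe(self, region, device_id, timestamp):
        """Record one failed start; return True if the region crosses the threshold."""
        window = self.events[region]
        window.append((timestamp, device_id))
        # Drop events that have fallen out of the time window.
        while window and timestamp - window[0][0] > self.window_seconds:
            window.popleft()
        distinct_devices = {device for _, device in window}
        return len(distinct_devices) >= self.min_devices
```

A single device that fails repeatedly does not trigger the detector; only the same event from several devices in one region does.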
The industry sees complex event processing as one of the keys to monetizing the vast
opportunity of IoT.78 When envisioning the solution, ensure that initial requirements are
discussed early in the project. This is an area where businesses will learn and improve over
time, but one which should be prototyped early in the process to prove out the concepts,
and to begin to develop the right mindset for capitalizing on the opportunities. This is a rich
area of development within Microsoft, our competitors, and the open source community.
Microsoft has developed StreamInsight,79 which can be deployed in the cloud. A popular
open source project is Apache Storm80 for real-time stream processing, and Amazon is
offering Kinesis for their cloud solutions, which includes stream processing.
Machine learning
Machine learning refers to the concept of studying data and deriving insights from the data.
The results will be a model that can be used to predict future outcomes from similar data
sets. The first step is to train the model. This is normally an iterative step performed by a
data scientist where a training set of data is used to infer a function, or model, from that
data. That model will be used to make decisions on incoming data. The model is typically
retrained periodically, so that the model can improve over time, learning from additional
new data and patterns.
Machine learning falls into two broad categories: supervised learning and unsupervised
learning. Supervised learning studies the data looking for a known set of desired outcomes.
For example, in the vehicle scenario, we may want to minimize the number of times that a
car needs its oil changed. We would then run studies against the data, looking for patterns that
give us information about the consequences of delaying oil changes, conditions, and so on.
In unsupervised learning, the concept is to naturally find patterns and relationships of any
kind in the data. After something interesting is observed, then these data points will be
further investigated until they are found to be either useful or not useful.
Common tools for machine learning include MATLAB82, Mahout83, and R84. In June 2014,
Microsoft introduced its own ML tooling, called Azure ML. 85 Azure ML is a machine learning
service that democratizes the practice of machine learning. It provides a visual experience
81 Wikipedia, MapReduce
82 Wikipedia, MATLAB
83 Wikipedia, Apache Mahout
84 Wikipedia, R (programming language)
for constructing data experiments, and easy-to-use implementations of many commonly
used machine learning algorithms, relieving the data scientist of implementing them in a
programming language. Azure ML integrates easily with Azure Storage, HDInsight, and
Windows Azure SQL Database, and it can expose the models as web services so that they
are simple to integrate into the runtime data flow or applications.
Data enhancement
Another core piece of the IoT architecture is data enhancement. The data collected from the
devices, the volume of it, and the hidden patterns within it provide tremendous value. Often,
however, combining the device data with other data sets is either critical for it to make sense
to the business, or it unlocks even more significant value when the data sets are analyzed
together. Enterprise data may be used for simple things, such as
relating device data to customer data. Other areas of opportunity include data markets that
publish datasets that are either sold or available for free. Microsoft offers the Azure
DataMarket86, which offers datasets from governments, research institutions, historical,
environmental, business organizations, and more. One of the most frequent datasets that
gets combined with device data is weather. Devices often exist all over the globe in different
conditions, so predictive maintenance will frequently factor in weather data, which is
normally sourced from weather data providers as opposed to collecting it with the device
itself.
Publishing insights
After data stored in the system has been
processed into information of value to others, the
question becomes how to approach this exposure
in a secure and compatible manner that is easy to
discover and consume. Some organizations want
to make their data available to partners both up
and down the supply chain to realize efficiencies
that result in lower costs and improve margins. Others are realizing the data they have can
be directly monetized as services available for consumption by individuals, corporations and
governments around the world. In addition to the stand-alone value of the data, it may also
be seen as valuable to augment other data services. Data that may seem uninteresting to
those within the organization could in reality be a key ingredient used in a number of
potential external applications or analytical recipes. For an in-depth discussion of
data-publishing considerations, see the paper Making Public Data Public from Microsoft. 87 The
following sections discuss many aspects of this topic.
85 Techcrunch, Microsoft announces Azure ML, Cloud-based Machine Learning Platform That
Can Predict Future Events
86 See https://datamarket.azure.com/
87 Microsoft, Making Public Data Public
Audience
The target audience for the data will have a significant impact on how it is published. Will it
be used to enhance analysis of other data? Will it be used through data visualization tools,
such as PowerBI or Tableau? Will it be metered and have a price associated with it? Or will
there be different views and price points of the data for different partners?
Publishing format
The choice of publishing format will be influenced by the targeted audience and the type of
information being published. Similar to the discussion earlier in this paper about the
incoming message format, the most likely choices for publishing data are XML, JSON, and
AtomPub. OData88 is a standardized protocol for creating and consuming data APIs. OData
originated at Microsoft, but it has become well-accepted in the industry. OData supports both
JSON and AtomPub, so it is widely consumable by nearly all current tools and programming
languages.
There are tools that can help scale, secure, and normalize the data publishing task. The
Microsoft Azure DataMarket89 is a global marketplace for data and applications that provides
discoverability, interface normalization, and a monetization approach. Microsoft Azure API
Management90 is a service that facilitates publishing APIs. It includes features for API
translation, versioning, aggregation, discovery, authorization, caching, and quotas. Both
Azure DataMarket and Azure API Management can be part of the publishing strategy, using
DataMarket for the broad exposure of large datasets, and API Management to expose APIs
securely with usage metrics and management capabilities.
Event Hubs
Event Hubs is a new feature of Microsoft Azure Service Bus. It stands next to topics and
queues as a Service Bus entity, and provides a different type of queue, offering time-based
retention, client-side cursors, publish-subscribe support, and high-scale stream ingestion.
Although it could be argued the use of topics could satisfy the technical requirement for
receiving data from devices, Event Hubs supports higher throughput and has an increased
horizontal capacity.
Architectural details
Starting at the logical architecture level, the main architectural components are depicted in
the following figure.
Master. Part of the requirements assumption for the architecture is that solutions built
on top of it will aim for a unified global or at least regional management model,
independent from technical scale limitations that might inform how large a particular
partition may grow.
This motivates an overarching architectural model with a common Master service,
shown on the far left of the figure, that takes care of shared management and
deployment tasks, as well as of device provisioning and placement, and several parallel
and independent deployments of Partition services that each take ownership of one or
more logical system partitions.
A benefit of introducing scale-units is that they significantly reduce the risk of full system
outages. If a system depends on a single data store and that store has availability issues,
the whole system is affected. However, if the system consists of 10 scale-units that each
maintain an independent store, issues in one store only affect 10 percent of the system.
The rationale for running all traffic ingestion through asynchronous Service Bus
messaging entities, instead of into a service edge that writes data straight to the
database, is that Service Bus already provides a scaled-out and secure network service
gateway for messaging, and it is specifically designed to deal with bad network
conditions, traffic bursts, and even sustained traffic peaks. A back-end datastore that is
the target of the ingested data should not be dimensioned to handle specific bursts, such
as vehicle telemetry during core European or U.S. East Coast rush hours.
The group called partition is a set of resources focused on handling data from a
well-defined and known device population that has been assigned to and configured into the
partition through provisioning. Cross-partition distribution of devices will be based on
your solution-specific logic, and allocation within the partition is handled by provisioning.
The partition group is the unit of scale. Through testing, the load specifications for the
partition have to be determined and a so-called scale-unit can be defined. A scale-unit is
a group of resources that can effectively support a well-known load profile for the
system, allowing replication of the scale-unit to provide support for an extrapolation of
this load profile. Within the partition group, there are two basic paths, ingestion
(sending data from the device to the cloud) and egress (sending data from the cloud to
the device). These paths accomplish the following:
Ingestion. Ingestion has a given device connect through its supported protocol,
delivering messages to its specific Event Hub, using its assigned credentials.
Egress. Egress routes messages (replies, Command & Control) to their device
destination.
Device Repo. The device repository contains configuration information about the
registered devices for a given partition.
Capacity modeling
Before cost can be modeled, the way that the system will scale needs to be considered and
the characteristics of the architecture need to be determined. Essentially, the attributes of
the previously mentioned scale-unit need to be defined.
There is a throughput ceiling for each of the components in the architecture, including each
of the Service Bus entities. The reason to be cautious when evaluating throughput is that
when dealing with distributed devices that send messages periodically, we cannot assume
perfect, random distribution of event submissions across any given period. There will be
bursts and we need to allow for ample capacity reserve to handle such bursts.
Assuming a scenario of a 10-minute event interval with one extra control interaction
feedback message per device per hour, seven messages per hour from each device can be
expected. Roughly 50,000 devices can then be associated with each entity that has an average
throughput capacity of 100 messages per second.
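The arithmetic behind this estimate can be checked with a short sketch. The function name and parameter names are illustrative; the input values are the ones stated in the scenario.

```python
def devices_per_entity(event_interval_minutes, control_messages_per_hour,
                       entity_messages_per_second):
    """Return (messages per device per hour, devices one entity can serve on average)."""
    per_device_per_hour = 60 / event_interval_minutes + control_messages_per_hour
    entity_per_hour = entity_messages_per_second * 3600
    return per_device_per_hour, int(entity_per_hour // per_device_per_hour)


per_device, devices = devices_per_entity(10, 1, 100)
# per_device is 7 messages per hour; devices is 51,428, or roughly 50,000.
```

This is an average-rate figure only. It leaves no headroom for bursts, so the real device count per entity should be set below it.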
Having covered the flow rate, we can conclude that storage throughput is of little concern.
However, storage capacity and the manageability of the event store are concerns. The
per-device event data at a resolution of one hour for 50,000 devices amounts to some 438
million event records per year. Even if these event records are limited in size to only 50
bytes, the payload data still amounts to 22 GB per year for each of the scale-units. This
underlines the need to keep an eye on the storage capacity and storage growth when
thinking about sizing scale-units.
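The storage figures follow from simple multiplication, sketched below. The function name is illustrative, and the result uses decimal gigabytes (10^9 bytes), which is an assumption about the unit used in the text.

```python
def yearly_event_storage(devices, records_per_device_per_hour, record_bytes):
    """Return (event records per year, payload size in decimal GB per year)."""
    records = devices * records_per_device_per_hour * 24 * 365
    return records, records * record_bytes / 1e9


records, gigabytes = yearly_event_storage(50_000, 1, 50)
# records is 438,000,000; gigabytes is about 21.9, which rounds to 22 GB.
```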
These considerations manifest in a capacity model in the deployment model, which informs
how many entities must be created in the Service Bus namespace backing a partition for a
given device population size like 50,000 devices and for a given load profile.
The load profile is currently informed by how many (telemetry) messages a device is
generally expected to send, how many commands or notifications the device is expected to
receive per hour, and what the average size of these messages is. The inputs should be
well-informed but generous estimates, because while changing the shape of a scale-unit layout
at a later time is possible, doing so may require re-provisioning the devices.
Determining partitions is not only motivated by capacity concerns, however. Because a
partition also forms a configuration scope, it provides a suitable mechanism to segregate
device populations by region, country, owner, operator, product, or other concerns. As an
example, one deployment can have up to 1,024 partitions.
Each partition corresponds to exactly one Service Bus namespace. Because there can only
be 50 namespaces per Azure subscription, and other dependent services have similar
quotas, a fully built-out architecture will therefore most likely span multiple subscriptions.
In summary, the attributes that we have found to determine the capacity model are:
Cost estimation
When estimating the cost of a solution built on top of this architecture, there are many
factors to consider. We will work through the list from the ingress of device data to sending
commands. Cost is estimated based on the architectural design and the scale necessary for
success. As such, the cost estimation formulas include variables for the scale that is needed.
Before we dig into the details, we feel the need to underscore the fact that cost modeling,
like capacity modeling, is an input for architectural decision making and business case
modeling, where the combination of all inputs should always be considered as a whole. As
an example, you might find using HTTP for communications will be somewhat less expensive
from a cost modeling perspective. However, choosing HTTP over AMQP will inherently impact
performance.
For all pricing related information in the cost estimation formulas outlined in this section, it is
important to state that prices will vary over time and the examples are aimed only at
explaining the formula itself. The latest pricing information can always be found at
http://azure.microsoft.com/en-us/pricing/overview/.
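\[
\text{Cost}_{\text{monthly ingress}} = \text{Cost}_{\text{base charge}} + \text{Cost}_{\text{brokered connections}} + \text{Cost}_{\text{throughput units}} + \text{Cost}_{\text{messages}} + \text{Cost}_{\text{protocol gateway}} + \text{Cost}_{\text{telemetry pump}} + \text{C}\ldots
\]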
This expands into a more detailed formula that we can work with to fill in the appropriate
variables:
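[Detailed ingress cost formula. Only fragments are recoverable: the base charge; a
brokered-connections term in T brokered connections, with tiered rates of $0.03, $0.025, and
$0.015 and tier boundaries at 100k and 500k; a throughput-units term, 744 x N throughput
units x $0.03; and a message term in N devices.]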
It should be noted that this formula uses the Standard tier offering of Event Hubs 91, which
offers additional brokered connections, filters, and additional storage capacity. The fixed
pricing elements in the formula use pricing from a point in time and are subject to change.
Also, the formula assumes a flat use of brokered connections, while actual billing is based on
peak use prorated per hour; the dynamics of your system will likely deviate.
The variables in this equation are:
Variable
Description
T brokered connections
N throughput units
N devices
The average number of messages sent into the system, per device, per
month.
msg
A
The average size of each message sent into the system, per month.
N supported scale
Cost worker
The cost per worker role for the ingress path when using custom
protocols and for the telemetry pump, per hour.
A egressGB
Cost egressGB
Example calculation
An example calculation where 1,000,000 deployed devices each send a message averaging 128
bytes every 60 seconds, with an average of 100,000 simultaneously connected
devices during the entire month, would yield the following results:
Variable
Value
91 See http://azure.microsoft.com/en-us/pricing/details/event-hubs/.
92 Microsoft, Microsoft Azure, Event Hubs pricing, FAQ What are throughput units and how
are they billed?
93 Microsoft, Microsoft Azure, Data Transfer Pricing Details
T brokered connections
N throughput units
N devices
1,000,000
msg
A
N supported scale
Cost worker
A egressGB
Cost egressGB
Not Applicable
1,000,00044,640ceil
1,000,000
Given these components, the egress path cost can be calculated using the following
formula:
notifications
single command messages
multi command messages
command response messages
This also expands into a more detailed formula that we can work with to fill in the appropriate
variables:
a
64
a
Anm
Acsm
Acmm
1,048,576
ceil ( 1,000,00012.5 )
a $ 0.20
a $ 0.50
100 a 2,500
a $ 0.80
a>2,500
a <100
Anm
A csm
A cmm
A rm
Description
N supported scale
Cost commandAPI
The cost per worker role for the command API host, per hour.
A nm
A csm
A cmm
A rm
Cost egressGB
Example calculation
An example calculation using 100,000 notifications per month of 20 KB each, 130,000
commands of 35 KB each with single replies of 80 KB each, and 20,000 commands of 20 KB
each with on average three (3) replies of 70 KB each would yield the following results:
Variable
Value
N supported scale
Cost commandAPI
$0.08 (A1)
A nm
100,000
A csm
150,000
A cmm
20,000
A rm
Cost egressGB
$0.138
100,000 ceil
121
+190,000 ceil (
( 6420 )+150,000 ceil ( 6435 )+20,000 ceil ( 210
)
64
64 )
1
1,000,000
Management cost
Besides the messaging-related components in the reference architecture, there is also the
concept of one or more masters for managing the system, as discussed previously in this paper.
The master is tasked with provisioning devices, creating appropriate queues and topics, storing
device information, provisioning security, and so on. The master contains the following cost
components:
Device Repo. The datastore collecting the registered devices per partition.
Given these components, the management cost can be calculated using the following
formula:
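\[
\text{Cost}_{\text{monthly mgmt}} = 744 \cdot N_{\text{supported scale}} \cdot \text{Cost}_{\text{master}} + \left(\text{partition repoGB} + \text{device repoGB} \cdot N_{\text{partitions}}\right) \cdot \text{Cost}_{\text{tsGRS}} + di \cdot \text{Cost}_{\text{tx}}
\]
Equation 3 - The cost estimation formula for management of the reference architecture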
Description
N supported scale
Cost master
The cost per worker role for the management host, per hour.
partition repoGB
device repoGB
N partitions
Cost tsGRS
The cost for Geo Redundant Storage (GRS) table storage ($0.095 / GB at
the time of writing).
di
The change for device information. Any change to the device information
stored in the system and subsequently in a device repository inside a
partition, will account for at least two operations on table storage.
Cost tx
Example calculation
An example calculation using 10,000 changes to device registration per month (either new
devices, changes in activation, or removed devices), leading to a total partition repository
size of 256 MB (assuming a single master instance is used) and a 128 MB device repository
per partition, using 10 partitions, would yield the following results:
Variable
Value
N supported scale
Cost master
$0.16 (medium)
partition repoGB
0.25
device repoGB
0.125
N partitions
10
Cost tsGRS
$0.095 / GB
di
10,000
Cost tx
$0.0036 / 100k
\[
\text{Cost}_{\text{monthly mgmt}} = 744 \cdot 2 \cdot \$0.16 + (0.25 + 0.125 \cdot 10) \cdot \$0.095 + 10{,}000 \cdot \frac{\$0.0036}{100{,}000} = \$238.08 + \$0.1425 + \$0.00036 \approx \$238.22
\]
As can be observed from the outcome of the formula, the cost of management for the
reference architecture is mostly dependent on the worker roles running to support it.
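The management cost calculation can be reproduced with a short sketch. The function and parameter names are illustrative. The inputs are taken from the example table, with two assumptions read off the worked calculation: a supported scale of 2, and a transaction term that charges the 10,000 device changes at the listed rate per 100,000 transactions.

```python
HOURS_PER_MONTH = 744


def monthly_management_cost(n_supported_scale, cost_master_per_hour,
                            partition_repo_gb, device_repo_gb, n_partitions,
                            cost_ts_grs_per_gb, device_changes, cost_tx_per_100k):
    """Return the (compute, storage, transactions) components of the monthly cost."""
    compute = HOURS_PER_MONTH * n_supported_scale * cost_master_per_hour
    storage = (partition_repo_gb + device_repo_gb * n_partitions) * cost_ts_grs_per_gb
    transactions = device_changes / 100_000 * cost_tx_per_100k
    return compute, storage, transactions


compute, storage, transactions = monthly_management_cost(
    n_supported_scale=2, cost_master_per_hour=0.16,
    partition_repo_gb=0.25, device_repo_gb=0.125, n_partitions=10,
    cost_ts_grs_per_gb=0.095, device_changes=10_000, cost_tx_per_100k=0.0036)
total = compute + storage + transactions
# compute is about 238.08, storage about 0.1425, transactions about 0.00036.
```

If every device change is counted as two table storage operations, the transaction term doubles, which still leaves the worker roles as the dominant cost.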
\[
\text{Cost}_{\text{total per month}} = \left(\text{Cost}_{\text{ingress}} + \text{Cost}_{\text{egress}}\right) \cdot N_{\text{partitions}} + \text{Cost}_{\text{management}}
\]
Important topics not yet covered
In this paper, we have strived to capture many of our learnings from implementing
predictive maintenance solutions in the Internet of Things (IoT) space. However, in addition
to the topics discussed, there is both much detail to add and more things to think about
when architecting for IoT. This final section touches on some of these topics.
Management solutions
There are standards put forth for managing devices95, such as OMA Device Management96 (of
which Microsoft implemented a subset, called Mobile Device Management 97), CPE WAN
Management Protocol98, Lightweight M2M99, and UPnP-DM100.
95 Blackberry, A Comparison of Protocols for Device Management and Software Updates
96 Wikipedia, OMA Device Management
As millions of devices become part of IoT systems, there is a clear need for IoT solutions that
can monitor and manage incidents in the systems, visualize information, effectively
control the environment, span the various connectivity options, and support legacy
systems.
Integration simplicity
As IoT promises to extend vertical solutions across horizontal markets, and connect systems
in ways never seen before to add value to businesses and people's lives, the integration
between these systems and how they are secured needs to happen in a way that
standardizes the integration. AMQP provides an example of this in regard to transport-layer
integration.
Conclusions
This paper has gone into great detail about the particulars of building IoT solutions, based on
our experience in working with enterprise customers. As you can see, IoT solutions can be
complex but also offer massive promise for increasing revenue, cutting costs, and finding new
business models based on innovative use of technology. An enterprise might believe that its
requirements are so unique that only a custom IoT solution can meet its needs. But the
unusual requirements of IoT solutions in security, communication, and scale make them
complex and expensive to build as custom solutions from the ground up.
The Microsoft Azure platform, on the other hand, has a comprehensive set of building blocks
that you need to build an IoT solution relatively quickly and painlessly by using the
mentioned reference architecture.