
Athletes from around the world come here to discover new places to be active.

Here’s what you should know about the heatmap and the data it reflects:

The heatmap shows 'heat' made by aggregated, public activities over the last two years.

The heatmap is updated monthly.

Activity that athletes mark as private is not visible.

Athletes may opt out by updating their privacy settings.

Areas with very little activity may not show any 'heat.'

In total, the updated heatmap includes:

700 million activities

1.4 trillion latitude/longitude points

7.7 trillion pixels rasterized

5 terabytes of raw input data

A total distance of 16 billion km (10 billion miles)

A total recorded activity duration of 100 thousand years


Beyond simply including more data, a total rewrite of the heatmap code permitted major
improvements in rendering quality. Highlights include twice the resolution, rasterizing activity data as
paths instead of as points, and an improved normalization technique that ensures a richer and more
beautiful visualization.
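
To illustrate the difference between rasterizing points and rasterizing paths, here is a minimal sketch (not Strava's actual code) of the path approach: every pixel on the line between two consecutive GPS samples receives heat, not just the pixels at the samples themselves. The Bresenham line algorithm used here is a standard stand-in; the real renderer's details are not described in this post.

    object PathRaster {
      // Rasterize one segment of an activity as a path: return every pixel on
      // the line between two consecutive GPS points (Bresenham's algorithm),
      // rather than only the two endpoint pixels.
      def linePixels(x0: Int, y0: Int, x1: Int, y1: Int): Seq[(Int, Int)] = {
        val dx = math.abs(x1 - x0); val sx = if (x0 < x1) 1 else -1
        val dy = -math.abs(y1 - y0); val sy = if (y0 < y1) 1 else -1
        var (x, y, err) = (x0, y0, dx + dy)
        val pixels = scala.collection.mutable.ArrayBuffer((x, y))
        while (x != x1 || y != y1) {
          val e2 = 2 * err
          if (e2 >= dy) { err += dy; x += sx }
          if (e2 <= dx) { err += dx; y += sy }
          pixels += ((x, y))
        }
        pixels.toSeq
      }
    }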

From 2015 to 2017, there were no updates to the Global Heatmap due to two engineering
challenges:

1. Our previous heatmap code was written in low-level C and was only designed to run on a
single machine. Updating the heatmap under this restriction would have taken months.
2. Accessing stream data required one S3 GET request per activity, so reading the input data
would have cost thousands of dollars and been challenging to orchestrate.

The heatmap generation code has been fully rewritten using Apache Spark and Scala. The new code
leverages new infrastructure enabling bulk activity stream access and is parallelized at every step
from input to output.
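
As a rough sketch of what one parallelized step can look like, the Spark/Scala fragment below projects decoded latitude/longitude points to Web Mercator pixel coordinates at a single zoom level and counts hits per pixel. It is illustrative only: the input path, column names, and output location are assumptions rather than Strava's actual schema, and the real job rasterizes paths rather than individual points.

    import org.apache.spark.sql.SparkSession

    object HeatCounts {
      // Project a latitude/longitude pair to global Web Mercator pixel
      // coordinates at the given zoom level (256-pixel tiles).
      def toPixel(lat: Double, lng: Double, zoom: Int): (Long, Long) = {
        val scale = 256L << zoom
        val x = ((lng + 180.0) / 360.0 * scale).toLong
        val sinLat = math.sin(math.toRadians(lat))
        val y = ((0.5 - math.log((1 + sinLat) / (1 - sinLat)) / (4 * math.Pi)) * scale).toLong
        (x.max(0).min(scale - 1), y.max(0).min(scale - 1))
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("heat-counts").getOrCreate()
        import spark.implicits._

        // Hypothetical layout: one row per decoded stream point with lat/lng columns.
        val points = spark.read.parquet("s3://example-bucket/stream-lake/")
          .select($"lat".as[Double], $"lng".as[Double])

        val zoom = 12
        val heat = points.rdd
          .map { case (lat, lng) => (toPixel(lat, lng, zoom), 1L) }
          .reduceByKey(_ + _)               // raw heat per pixel, computed in parallel

        heat.saveAsObjectFile("s3://example-bucket/heat-counts/zoom-12/")
        spark.stop()
      }
    }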

From Data Streams to a Data Lake

Streams are what we call the raw measurement sequences that define a Strava activity, such as
latitude/longitude, time, heart rate, power, elevation, or speed. Strava has more than 150 TB of
stream data from over 1 billion activities. This dataset offers enormous potential for building things
like the Global Heatmap, the Route Builder, and the Clusterer. However, until recently our streams had
only been persisted in a legacy format intended for the normal production pattern of reading a single
stream per request.

The Problem

There are two aspects of our historical stream storage schema that make rapid processing of
streams difficult. First, each stream is stored as a single file on S3 as a gzipped JSON array. At
today’s scale, iterating through the streams for all activities just once would cost $5000 for the S3
queries alone, because there is a charge per individual GET request. For this reason, it is generally
not a good idea to store billions of tiny files on S3 if you want to frequently access them all. Extracting
and deserializing the gzipped JSON would also take around 1000 CPU hours.

The second technical problem that prevents rapid processing is how activities are prefixed in S3. The
prefix for an activity’s streams is the auto-incrementing activity ID, a practice that AWS specifically
recommends against. The data for successive keys (ordered lexicographically) is clustered on the
same S3 shards during writes, so a sequential read of the data will not distribute the load well and
will easily trigger S3 rate limits. This means that rapid sequential stream access is not possible in a
reliable way, even if we are willing to pay the cost. The infrastructure bottlenecks around bulk stream
access have long been a major limitation in our ability to explore our data and develop new
capabilities.

The Solution

We needed a denormalized stream dataset optimized for rapid bulk access. We ended up choosing a
Spark-based “data lake” solution. With this solution, stream data is still stored in S3, but in a highly
compressed format using Parquet and Snappy. Instead of storing the raw values directly, Parquet
stores the data using a delta encoding, which is fantastically efficient at compressing typical stream
data. Using these methods we achieved a compression ratio of ~15x. The stream data is also grouped
into chunks of roughly 100 MB per file. Spark workers can now read vast amounts of stream data
in a small number of S3 requests.
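
A minimal Spark/Scala sketch of this kind of conversion is shown below. The paths, the decoded input format, and the partition count are assumptions made for illustration; the real pipeline's schema is not described in this post. Parquet chooses column encodings (such as delta encoding) internally when it writes the files.

    import org.apache.spark.sql.SparkSession

    object StreamsToDataLake {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("streams-to-data-lake").getOrCreate()

        // Hypothetical source: stream points already decoded from the legacy
        // one-gzipped-JSON-file-per-activity format.
        val streams = spark.read.json("s3://example-bucket/decoded-streams/")

        streams
          .repartition(2000)                   // tune so each output file is ~100 MB
          .write
          .option("compression", "snappy")     // Snappy-compressed Parquet
          .parquet("s3://example-bucket/stream-lake/")

        spark.stop()
      }
    }

Reading the data back for a job is then a single spark.read.parquet call over the chunked files.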

The costs of this approach are also relatively low. Since our data is stored on S3, rather than in an
always-running database, we only commit to the relatively low passive cost of S3 storage.
Additionally, we only pay for compute resources when a job actually needs them: using Mesos and
Spark, we can allocate an arbitrary amount of compute for our jobs.

Heat Normalization

Normalization is the function that maps the raw heat count of each pixel, after rasterization, from the
unbounded domain [0, Inf) to the bounded range [0, 1] of the color map. The choice of normalization
has a huge impact on how the heatmap appears visually. The function should be monotonic so that
larger raw counts are displayed as “hotter”, but there are many ways to approach this problem. A
single global normalization function would mean that only the most popular areas on Strava would
ever be rendered with the “hottest” colors.

A slick normalization technique is to use the CDF (cumulative distribution function) of the raw values.
That is, the normalized value of a given pixel is the fraction of pixels with a lower heat value. This
method yields maximal contrast by ensuring that there are an equal number of pixels of each color.
In image processing, this technique is known as histogram equalization. We use this technique with a
slight modification to prevent quantization artifacts in areas of very low raw heat.

Computing a CDF only from the raw heat values in a single tile does not work well in practice, because
a map view typically shows at least 5x5 tiles (each tile is 256x256 pixels). Thus, we compute the joint
CDF for a tile using that tile and all of its neighbors within a five-tile radius. This ensures that the
normalization function can only vary at scales just larger than the size of a typical viewer’s screen.

To compute an approximation of the CDF that can be evaluated quickly, you simply sort the
input data and then take a fixed number of samples. We also found that it is better to compute a
biased CDF by sampling more heavily towards the end of the sorted array. This is because in
most cases, only a small fraction of the pixels have interesting heat data.
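
The sketch below shows one plausible reading of this scheme (the sample count and the shape of the bias are invented for illustration, and the real implementation may differ): the sorted raw heat of a tile and its neighbors is sampled at positions skewed towards the high end, and a pixel is then normalized by locating its raw value among those samples.

    object HeatNormalization {
      // Approximate, intentionally biased CDF: sample the sorted raw heat values
      // (for a tile plus its neighbors) more densely towards the top of the array.
      def biasedCdfSamples(rawHeat: Array[Double],
                           numSamples: Int = 256,
                           bias: Double = 3.0): Array[Double] = {
        val sorted = rawHeat.sorted
        Array.tabulate(numSamples) { i =>
          val t = i.toDouble / (numSamples - 1)      // uniform in [0, 1]
          val q = 1.0 - math.pow(1.0 - t, bias)      // quantile, skewed towards 1
          sorted(((sorted.length - 1) * q).round.toInt)
        }
      }

      // Map a raw heat value to [0, 1] by finding where it falls among the samples.
      def normalize(samples: Array[Double], value: Double): Double = {
        val idx = samples.indexWhere(_ >= value)
        if (idx < 0) 1.0 else idx.toDouble / (samples.length - 1)
      }
    }

Because the samples crowd the top of the distribution, the minority of pixels with interesting heat receives most of the output range, while the mostly empty low end is compressed.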

Recursion Across Zoom Levels

So far we have only talked about generating heat for a single zoom level. To process lower zoom
levels, the raw heat data is simply summed: four tiles become one tile with a quarter of the
resolution. Then the normalization step is rerun. This continues until the lowest zoom level is reached
(a single tile for the whole world).
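
As a minimal sketch of that summing step (the flat 256x256 array layout is an assumption made for illustration, not a description of Strava's actual tile format):

    object ZoomDown {
      val TileSize = 256

      // Combine four child tiles (top-left, top-right, bottom-left, bottom-right)
      // into their parent tile: each 2x2 block of raw heat is summed into one pixel.
      def downsample(tl: Array[Double], tr: Array[Double],
                     bl: Array[Double], br: Array[Double]): Array[Double] = {
        val parent = new Array[Double](TileSize * TileSize)
        for (y <- 0 until TileSize; x <- 0 until TileSize) {
          // Pick the child this parent pixel comes from, then the 2x2 block inside it.
          val child = (x < TileSize / 2, y < TileSize / 2) match {
            case (true, true)   => tl
            case (false, true)  => tr
            case (true, false)  => bl
            case (false, false) => br
          }
          val cx = (x % (TileSize / 2)) * 2
          val cy = (y % (TileSize / 2)) * 2
          parent(y * TileSize + x) =
            child(cy * TileSize + cx) + child(cy * TileSize + cx + 1) +
            child((cy + 1) * TileSize + cx) + child((cy + 1) * TileSize + cx + 1)
        }
        parent
      }
    }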

It is very exciting when stages of a Spark job need to process an exponentially decreasing amount of
data, and thus take an exponentially decreasing amount of time. After waiting about an hour for the
first level stage to finish, it is a dramatic finish to see the final few levels take less than a second.

Serving It Up

Finally, we store the normalized heat value for each pixel as a single byte, because we display that
value using a color map represented as a length-256 array of colors. We store this data in
S3, grouping several neighboring tiles into one file to reduce the total number of S3 files required. At
request time, the server fetches and caches the corresponding precomputed S3 meta-tile, then
generates a PNG on the fly from this data using the requested colormap. Our CDN (CloudFront) then
caches all tile images. We also made various front-end updates: we are now using Mapbox GL, which
allows for smooth zooming as well as fancy rotation and pitch controls.
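
A minimal sketch of that on-the-fly rendering step might look like the following (the method signature and the ARGB colormap representation are assumptions; the actual server code is not shown in this post):

    import java.awt.image.BufferedImage
    import java.io.ByteArrayOutputStream
    import javax.imageio.ImageIO

    object TileRenderer {
      val TileSize = 256

      // Render a 256x256 tile: one byte of normalized heat per pixel, looked up
      // in a 256-entry ARGB colormap and encoded as a PNG.
      def renderPng(heat: Array[Byte], colormap: Array[Int]): Array[Byte] = {
        val img = new BufferedImage(TileSize, TileSize, BufferedImage.TYPE_INT_ARGB)
        for (y <- 0 until TileSize; x <- 0 until TileSize) {
          val level = heat(y * TileSize + x) & 0xff   // treat the byte as unsigned 0..255
          img.setRGB(x, y, colormap(level))
        }
        val out = new ByteArrayOutputStream()
        ImageIO.write(img, "png", out)
        out.toByteArray
      }
    }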
