
School of GeoSciences

DISSERTATION

For the degree of

MSc in Geographical Information Science
Student Name: Christopher J McCarthy
Date: August 2014

Does NoSQL have a place in GIS? - An open-source spatial database performance comparison with
proven RDBMS.

MSc Geographical Information Science 2013/2014

Author: Christopher J McCarthy


Supervisor: Bruce Gittings

Copyright of this dissertation is retained by the author and The University of Edinburgh. Ideas
contained in this dissertation remain the intellectual property of the author and their supervisors,
except where explicitly otherwise referenced. All rights reserved. The use of any part of this
dissertation reproduced, transmitted in any form or by any means, electronic, mechanical,
photocopying, recording, or otherwise or stored in a retrieval system without the prior written
consent of the author and The University of Edinburgh (Institute of Geography) is not permitted

I declare that this dissertation represents my own work, and that where the work of others has been
used it has been duly accredited. I further declare that the length of the components of this
dissertation is 6254 words for the Research Paper and 8591 words for the Technical Report.

Signed:

Date: 5th August 2014

Acknowledgements
Many thanks go out to the staff of the University of Edinburgh School of GeoSciences, in particular
Bruce Gittings, who have all led me through an exciting and rewarding MSc year. Their invaluable
knowledge, organisational skills and the overall reputation of the MSc GIS course have provided me
with a deep understanding of, and skillset for, the GI industry.
A big thanks to all my course comrades who have also battled through the MSc course and
provided friendship and support throughout the year.

Part 1: Research Paper

Abstract
With the relational database model now more than 40 years old and the use of big data continuously
increasing, NoSQL systems are marketed as a more efficient means of dealing with large quantities
of usually unstructured data. NoSQL systems may provide advantages over relational databases, but
they generally give up relational robustness in exchange for those advantages.
This project attempts to contribute to the GIS field by comparing open-source RDBMS and NoSQL
systems for storing and querying spatial data, with the overall goal of determining whether NoSQL
systems (specifically MongoDB) have a place within the GI world. Working with the open-source
spatial dataset OpenStreetMap, a scalable approach is taken, working from global to local scale
data. This approach aims to provide insight into how either system may present performance
advantages related to data size.
The research highlights how the performance of each system is limited by the system's functionality.
MongoDB's spatial capabilities are lacking in comparison to the PostgreSQL spatial extension
PostGIS. The outcome is that MongoDB cannot currently support the spatial needs of a specialist GIS
operative; however, if basic spatial functionality is all that is needed, MongoDB presents high
performance on large datasets. PostGIS has a complex, highly specialist realm of spatial
functionality, making it the best performing spatial system; however, its queries slow down as
dataset size increases.
The use of each system is dependent on the application, but at the present time this NoSQL system is
spatially outclassed and thus not worthy of the specialist GIS industry.

1 Introduction
1.1 Background
The growing volume of stored and retrieved data has continually been an issue across many
industries. Efficient database management is key to providing a lucrative service and upholding its
performance. Databases were created to provide a means of storing and retrieving data as required
in a highly organised way. Since their development in the 1960s a lot has changed: present day data
volumes have increased unprecedentedly and multiple types of database design have evolved. The most
widely used systems today are relational databases and a more recently developed architecture which
is gaining popularity, NoSQL systems (Strauch, 2013).
Relational databases (RDBMS) appeared in the 1970s. Created by Edgar Codd, they became a
popular choice due to the sound theoretical model they were based on (McFadden, et al., 1998).
Codd's model was a 0-12 step model that underpinned the advantages of relational databases where
redundancy and accessibility were concerned, establishing data completeness benefits. The system
stores data in tables, with tables containing relationships with each other. Each column represents
a field and each row represents a record (Rigaux, et al., 2002).
The NoSQL movement is the development of non-relational systems, an approach allowing a
better fit for unstructured and semi-structured data without the restrictions of poorly fitting data
models. The rise in large web applications has driven development in how database systems handle
data at scale. The NoSQL movement began in the early 21st century, focusing on creating web-scalable
database systems operating with hundreds of millions of users (Vaish, 2013). NoSQL (Not only SQL)
covers a loosely defined class of non-relational data stores. Although these NoSQL systems are
relatively new, their popularity has increased due to their ability to handle unstructured data
(Anon., 2010), as collected by the increasing number of data capture devices in the public domain
today, for example smartphones and GPS devices. They do not operate on a fixed schema but rely
heavily on metadata in order to achieve the fast performance that users require (Scherzinger, et
al., 2013). Although these systems boast about performance and provide advantages over relational
systems, they generally lack relational reliability because they offer weaker data integrity and
fewer facilities.
The traditional architectures of RDBMS have become less adaptable in dealing with large
volumes of unstructured data. Big Data, defined as data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process within a tolerable elapsed
time (Snijders, et al., 2012), applies to the field of spatial database management. To handle these
huge, often unstructured datasets, NoSQL systems provide distributed storage and indexing techniques
using map/reduce functions, overall producing a system better suited to these data types (Dean &
Ghemawat, 2008). Xiao & Yimin (2011) presented evidence that relational systems are expensive to
run and maintain, especially as dataset sizes increase towards big data. They concluded that
NoSQL systems handled huge volumes of data, network partitioning, wide applications, high
performance and scalability better than the traditional relational systems when loaded with remotely
sensed images. Stonebraker argues that RDBMS were architected more than 25 years ago, when
hardware, user requirements and database markets were different from those of today. His conclusion
is to retire RDBMS in favour of starting from scratch, developing specialized engines designed for
tomorrow's requirements rather than continuing to alter code lines written for yesterday's needs
(Stonebraker & Cetintemel, 2005). Non-spatial data is not the only data type to experience this
increase in data volumes. Spatially relevant data has escalated, collected through modern
smartphones, GPS devices, cameras and fitness monitoring devices. These newer means of spatial data
capture have contributed to the continual increase of data sources.

1.2 GIS
Within the field of geographical information science, spatial databases are the backbone of the
capabilities presented. Various RDBMS such as Oracle Spatial provide the means to work with spatial
data. The increasing use of GIS across a wide range of applications is pushing analysis and further
GIS processing techniques towards larger dataset scales.
NoSQL systems are being developed to incorporate spatial capabilities, leading to questions
surrounding a new spatial system within GIS. Incorporation of NoSQL within GIS is a logical choice
if spatial capabilities can be matched to current spatial systems. NoSQL DBMS are already being
used for spatial data, although to a limited extent. Pouria, et al. (2013) raise the issue of why
NoSQL systems are a suitable choice for large datasets and user bases within GI. Today's increasing
amounts of unstructured and semi-structured spatial data vary in format and are collected through a
wide range of methods; observations and measurements from sensors, geospatial data from Location
Based Services and social networks are examples.
With GIS presenting high levels of functionality, it is important that performance testing
represents realistic real-life scenarios. de Hass, et al. (2008) state that there are four
different spatial database user types: server builders who publish data via webservers, GIS users
who load datasets and perform complex analysis, data maintainers who primarily focus on maintaining
one core dataset, and power users who do all of the above. With no general GIS user profile to
target, this could explain the very specific spatial database benchmarks that have been previously
published (Gurret, et al., 1999; Patton, et al., 2000; Stonebraker, et al., 1993). Developing a
series of benchmarking queries representative of general and specific GIS spatial functions is the
aim for the performance testing of this project. The wide-ranging datasets used within GIS vary in
spatial data complexity, from precise scientific data to less specialist locational marketing and
sales datasets. It is hypothesised that these wide-ranging data types will be better handled in
different database systems due to the functionality or the design direction taken by the database
setup. Finding comparable spatial functionality that operates on all possible datasets is a key
factor in this project.

1.3 Previous works


Benchmarking of database systems is a means of assessing their performance under a controlled set of
operations. Both Goodchild (1992) and Suprio, et al. (2011) state that three basic steps have been
highlighted for system benchmarking: design, execution and analysis. Anand & Kodali (1999)
provide a solid review of benchmarking techniques; although not spatially inclined, it gives an
overall approach to design and methodology, while Vyas, et al. (2011) present a more spatially
based querying and benchmarking overview. Many examples of previous benchmarking of spatial
databases are very specific to the user's requirements. The use of focused functions does not always
give results representative of true system performance, and many outcomes have never been published
in the public domain due to commercial restrictions (Goodchild, 2002; Marble & Sen, 1986). When
bringing spatial NoSQL systems into the mix, the amount of research into the spatial capabilities of
these systems is drastically reduced. Comparison works showcasing performance differences between
RDBMS and NoSQL systems were also minimal, with only a couple of unpublished examples available.
That being said, performance testing outcomes can be very secretive, with software vendors closely
guarding their performance statistics (Marble & Sen, 1986).
Within this limited research into spatial benchmarking, only a few research projects exist. The
best known spatial benchmark is SEQUOIA 2000 (Stonebraker, et al., 1993). SEQUOIA 2000 is
specifically designed as an earth sciences database requirements capturing project, focused on
raster datasets and the performance of the processes orientated around that data type.
Stonebraker presents a very specific earth sciences benchmarking system; however, it is unclear
whether the tests presented are fully representative of the earth sciences processing workload, and
the results are reported as price/performance using a defined equation. This is not an ideal output
to be implemented by other systems.
VESPA is a spatial database benchmark that operates on vector datasets. Developed in response
to the earth science raster specific benchmark of Stonebraker, et al. (1993), VESPA was to target
certain benchmarking criteria not touched by SEQUOIA. The benchmark includes a wide range of
spatial queries and system queries such as update functions. The range of functions used in this
benchmark is wide; however, it gives little indication of how they compare with the available range
of spatial functionality. The VESPA benchmark uses a synthetic dataset that is scalable, allowing
performance with increasing dataset size to be analysed (Patton, et al., 2000).
Jackpine is a benchmark also designed for spatial database performance (Suprio, et al., 2011).
Jackpine was designed to cover both micro benchmarking of basic spatial operations and the
modelling of real-world spatial application loads. It exercises a selective range of the
spatial functionality available in the database system and continues with real-world operations
such as flood risk analysis, tasks often found in today's web based GIS systems.
Benchmarking Approach for Spatial Index Structures (BASIS) is a prototype system for
assessing spatial indexing performance (Gurret, et al., 1999). It focuses on the indexing of spatial
databases rather than a full database system's performance. This research does aid in highlighting
how spatial indexing procedures can have a massive influence on the overall database performance
outcomes during benchmarking and must be factored into the interpretation of any presented
benchmarking results.

2 Aims and Research Questions


The main aim of the project is to perform a series of performance comparison tests on an open-source
spatial RDBMS and a NoSQL system through the use of an OpenStreetMap spatial dataset, using a range
of benchmarking techniques. A series of benchmarking operations will be run on the same computing
platform and datasets, altering complexities and scales. This will present a view of how scalable
each system is and allow their respective performance timings to be shown. To allow a relatively
matched comparison, the various systems' features will be assessed. The results of this project will
lead to a series of numerical outputs providing answers to the following research questions in
relation to GIS uses of spatial databases.

2.1 Research Questions


1. Do NoSQL systems have a place in the GIS industry?
a. What are suitable methods to query each system spatially?
b. Do indexing structures play a significant role in spatial performance?
2. Are there advantages of a NoSQL system over the more accepted Relational Database
Management System in GIS?
3. In relation to the tested RDBMS and NoSQL systems, which provides the best functional
performance?

3 Research Methodology
3.1 Database systems
The database market provides many options for consumers to choose from. This project
required both an RDBMS and a NoSQL system to be selected for comparison, sticking to
open-source products and thus not incurring licensing issues and data fees. Systems in both
categories were assessed against some general criteria including familiarity, query language,
setup and, most importantly, spatial functionality. This was the first step and a fairly
important choice.
The chosen RDBMS was PostgreSQL 9.3.4 with the spatial extension PostGIS 2.1.3, due to its
documentation and its spatial functionality, proven within the GIS industry. PostgreSQL uses SQL as
its query language, which was a benefit as previous SQL experience existed. In addition,
PostgreSQL is a very competent system, ranking behind Oracle and MySQL in public domain
database usage (DB Engines, 2014). PostGIS also had the ability to run on the Windows operating
system, unlike some of the other possible options.
The chosen NoSQL database system was MongoDB, which does not require any further spatial
extensions. No previous experience with these new age systems existed, so a neutral decision was
made purely on spatial capabilities and support documentation, which is key when dealing with an
unfamiliar system. MongoDB 2.6.2 was the system that provided the most in-depth documentation
and spatial functionality. The spatial functionality was the best match to PostGIS for comparison
purposes, while the online help and documentation was levels above any other option. The system
also met the system hardware setup and was a straightforward install.
The candidate database systems were assessed, leading to the chosen databases for use in this
research (McCarthy, 2014).

3.2 Dataset
The spatial dataset had to remain in the open-source world and also meet the scalability
criteria for this project. A large open-source dataset that immediately stood out was
OpenStreetMap (OSM). The ability to access the entire planet OpenStreetMap data was a major bonus,
providing a big-data dataset of 36 GB compressed. The option to scale the OSM data from global to
regional and local scale provided exactly the spatial scales sought. Although this does not
represent the totality of GIS datasets, it is a common dataset that appears in many web mapping and
disaster relief mapping applications. OpenStreetMap data contains points, lines and polygons, and
with regular updates every week the data is very recent; although not necessary, this is a nice
advantage. The data was downloaded in a compressed XML format, and a third party loading tool for
each database system was used to import the dataset.

3.3 Measures of performance


Performance is essentially the measure of how efficiently a computer system can operate in relation
to the time and resources available. For database performance, what is measured is the query
operation time under continual interaction with traffic, disk and computer hardware.
Benchmarking systems provide a yardstick to measure these factors under varying
workloads (Dietrich, 1992).
Most database management systems have built-in time monitoring features. The various
ways in which database systems operate can lead to unrepresentative times; for example, MySQL
reports contrasting times between query completion and the retrieval of the query result, either of
which might be classed as the actual query time. MongoDB records query times in the system.profile
collection of the profiled database, while PostgreSQL offers the \timing and EXPLAIN commands, which
give varying levels of detail on the timing and processing of the query. With the same queries and
the same datasets, timing will be the comparable factor when running the spatial queries in this
research.
In addition to computational performance, functional performance will vary between the systems.
The range of spatial and non-spatial functionality can dictate the overall performance of a
system regardless of its computational performance.
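
As a minimal sketch of the client-side timing idea described in this section, the snippet below
measures wall-clock time for repeated executions of a query callable and reports the median. The
`time_query` helper and the stand-in `fake_query` are hypothetical names introduced for
illustration; they are not part of this project's tooling, and in practice the callable would wrap
a real database call.

```python
import statistics
import time


def time_query(run_query, repeats=5):
    """Return (median, all_timings) in seconds for `repeats` calls of run_query."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical callable wrapping one database query
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


def fake_query():
    """Stand-in workload so the sketch runs without a database."""
    return sum(i * i for i in range(10_000))


median_s, runs = time_query(fake_query, repeats=5)
print(f"median {median_s:.6f} s over {len(runs)} runs")
```

Reporting a median over several runs is one way to dampen the effect of caching and background
activity; it measures time as seen by the client, which can differ from server-side figures.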

3.3.1 Indexes and Data Models


Spatial datasets contain many spatial relationships. Querying these relationships can be made
far more efficient with the use of spatial indexing. It is highly inefficient to store
all relationships and interactions between every spatial object in a big-data OSM dataset; for this
reason relationships can be materialized on the queried data during query processing. Many methods
exist to achieve this. Of the systems in question, PostGIS uses a version of R-Tree indexing
implemented on the PostgreSQL GiST infrastructure (PostGIS, 2014), while MongoDB implements
its spatial indexes by encoding geographic hash codes on top of the standard MongoDB B-Tree
indexing structure (MongoDB Manual, 2014).
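
To illustrate how a geographic hash lets a one-dimensional B-Tree index serve two-dimensional data,
the sketch below interleaves the bits of a quantised longitude and latitude into one integer. This
is a generic geohash-style encoding written for illustration, not MongoDB's internal
implementation; the function names and the 16 bits per axis are assumptions.

```python
def geohash_bits(lon, lat, bits_per_axis=16):
    """Interleave quantised lon/lat bits into a single sortable integer."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    code = 0
    for _ in range(bits_per_axis):
        # Longitude bit: halve the longitude interval and record which half.
        lon_mid = (lon_lo + lon_hi) / 2
        if lon >= lon_mid:
            code = (code << 1) | 1
            lon_lo = lon_mid
        else:
            code = code << 1
            lon_hi = lon_mid
        # Latitude bit: same halving step on the latitude interval.
        lat_mid = (lat_lo + lat_hi) / 2
        if lat >= lat_mid:
            code = (code << 1) | 1
            lat_lo = lat_mid
        else:
            code = code << 1
            lat_hi = lat_mid
    return code


def shared_prefix_bits(a, b, total_bits=32):
    """Count the leading bits two codes have in common."""
    return total_bits - (a ^ b).bit_length()


edinburgh = geohash_bits(-3.19, 55.95)
glasgow = geohash_bits(-4.25, 55.86)
sydney = geohash_bits(151.21, -33.87)
# Nearby places share a longer code prefix than distant ones.
print(shared_prefix_bits(edinburgh, glasgow), shared_prefix_bits(edinburgh, sydney))
```

Because nearby locations tend to share a code prefix, a range scan over the sorted codes can stand
in for a spatial search, at the cost of edge cases where neighbours fall either side of a cell
boundary.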
The OSM dataset has clearly constructed relationships within a single OSM entry, helping
data completeness when an entry is created by a user. The database models were constructed by the
importing tools for each of these database systems. MongoDB has the flexibility of being
schema-less, whereas the RDBMS does not. The data model used for PostGIS, implemented through the
importing script, was trusted as being an effective and operationally sound model. The model
barely changes from the OSM dataset layout, so it is more complex, including topology relationships,
which highlights the functionality differences.

3.4 Queries & Implementation Overview


With reference to previous spatial benchmarking research, a series of spatial and non-spatial
queries was formulated to test both database systems (Table 1). These queries were developed with
real-life GIS operation in mind, to simulate the actual system load that might often be required in
everyday operations (McCarthy, 2014). During query creation it became clear that not all
functionality would be comparable like for like. The differences in functionality are elaborated on
below.

Table 1 Query Benchmark Approach

QUERY   QUERY DESCRIPTION
1       Insert
2       Update
3       Delete
4       Closest Point and Distance
5       Buffer
6       Distance Buffer
7       Bounding Box
8       Bounding Box Render to KML
9       Line Intersects Polygon
10      Area of Polygon
11      Length of Line in Polygon
12      Length of Lines

3.4.1 Hardware Environment


Since the hardware environment is a fundamental aspect of database performance, both systems were
installed on the same machine to remain neutral. This was an Intel i7 920 quad core (2.67 GHz) with
10 GB of RAM, with the databases stored on a Seagate Barracuda 1 TB HDD (64 MB cache, 10,000 rpm).
The computer was running 64-bit Windows 7 as the operating system. Tuning of the PostgreSQL
database was minimal and purely to prevent the importing errors that occurred; the parameters
remained as close to default as possible, while MongoDB was able to accept spatial datasets without
any parameter changes. The data loading tools OSM2PGSQL and MONGOSM were used. Both tools performed
similar operations: retrieving records from the given OSM file, inserting them into the tables
created and indexing them, with one non-spatial primary index and one spatial index on the
geometry column. The MongoDB indexes were created outside of the importing tool.

4 Results
4.1 Data Loading

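[Figure 1: bar chart of import times (s) for PostGIS and MongoDB across the Edinburgh, Scotland,
British Isles, Europe and Planet datasets. Bar labels as extracted, with series assignment not
recoverable: 453282, 64, 220, 58675, 1890, 585, 42, 1029, 4928, 402.]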

Figure 1 Import Times For Both Database Systems

Results from the system performance tests show MongoDB system operations to be superior to the
PostgreSQL system. Import times are dramatically quicker than for the PostGIS enabled system even
at the largest dataset scale, apart from the 2 second difference in the smallest dataset. PostgreSQL
struggles more with the importing process the larger the dataset is; during the Europe and
Planet imports it seemed as if an I/O bottleneck had been hit. PostgreSQL does not support
multi-core processing and often places 100% load on a single core (Kim & Jeong, 2007). Further
tuning of the system configuration file might have alleviated the bottleneck, but a lack of
knowledge and the risk of unbalancing the level work environment were major restrictions on
tweaking settings. In contrast, the MongoDB import process might have been limited by the Python
importing scripts; however, speed was constant, with no major bottlenecks or slowdowns witnessed.
This relates to the hardware environment and how MongoDB can operate more effectively in the same
amount of memory available to both systems, often a cause of bottlenecks. No MongoDB heap space
errors occurred.
MongoDB's import results would also have been faster because the OSM2PGSQL script created
indexes during the importing process. This was not the case with the MONGOSM import process, where
indexes had to be created manually afterwards. Although the MongoDB data loading process was still
faster with the indexing time added in, the PostGIS indexing was maximising the overall spatial
performance.

4.2 System Measurements

[Figure 2: chart of response times (s), axis range 0 to 0.014, for PostGIS and MongoDB across the
Edinburgh, Scotland, British Isles, Europe and Planet datasets.]

Figure 2 Database System Response Times

Before performance queries could be run, simple response tests were carried out. This was to ensure
that a stable work environment existed and that response times from the databases did not fluctuate
drastically due to hardware or software inconsistencies. The interaction time between the request
and the operation of the database system may only be fractions of a second; however, this had to be
assessed to rule out any resulting performance gains. Figure 2 reports these response times. It was
deemed that both databases had a stable operating status, with fractions of milliseconds
differentiating the systems.

Table 2 System Function Query Running Times

DATASET          INSERT (s)            DELETE (s)            UPDATE (s)
                 PostGIS    MongoDB    PostGIS    MongoDB    PostGIS    MongoDB
EDINBURGH        0.043      0.014      1.314      0.241      1.365      0.352
SCOTLAND         0.049      0.011      5.582      0.253      3.479      0.429
BRITISH ISLES    0.139      0.024      12.494     0.748      18.289     0.940
EUROPE           0.349      0.121      134.392    1.581      213.891    1.431
PLANET           0.682      0.124      381.482    2.488      492.482    35.392

Where MongoDB continued to outperform was in the remainder of the CRUD (Create, Read, Update and
Delete) operations. Table 2 shows its quicker query times in completing the tasks in question.
MongoDB being a document store means all the related data is stored in a single document; when
operating with the basic system functions, no join operations need to be performed on the fly,
resulting in better performance. This is a much-advertised performance advantage of NoSQL systems
over RDBMS, and it becomes more apparent as dataset scale increases. MongoDB's automatic sharding
ability across multiple systems would continue to increase this performance advantage in the case of
big data datasets on multi-node clusters. PostgreSQL operates on the ACID (Atomicity,
Consistency, Isolation, Durability) transactional model, and maintaining strong consistency and
isolation is a slower process than the BASE rule set used by MongoDB. The BASE (Basic
Availability, Soft-state, Eventual consistency) system is simpler and faster (Brewer, 2000), though
the effectiveness and security of data depend on the application and the tradeoffs made between
performance and database consistency.
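
The join-free access pattern described above can be sketched with plain Python data structures.
The field names and the two-table layout are simplified assumptions for illustration, not the
schemas produced by the import tools used in this project.

```python
# Relational-style layout: a way and its tags live in separate "tables".
ways = [{"way_id": 1, "name": "Princes Street"}]
way_tags = [
    {"way_id": 1, "key": "highway", "value": "primary"},
    {"way_id": 1, "key": "lit", "value": "yes"},
]


def fetch_way_relational(way_id):
    """Assemble one way by joining the two tables on way_id."""
    way = next(w for w in ways if w["way_id"] == way_id)
    tags = {t["key"]: t["value"] for t in way_tags if t["way_id"] == way_id}
    return {**way, "tags": tags}


# Document-style layout: everything about the way is stored together.
way_documents = {
    1: {"way_id": 1, "name": "Princes Street",
        "tags": {"highway": "primary", "lit": "yes"}},
}


def fetch_way_document(way_id):
    """Single lookup, no join step."""
    return way_documents[way_id]


print(fetch_way_relational(1) == fetch_way_document(1))
```

Both functions return the same record; the difference is that the relational version has to match
rows across two structures at read time, which is the work a document store avoids.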

[Figure 3: chart of database size (MB) after import for PostGIS and MongoDB, with the
corresponding XML OSM file size, across the Edinburgh, Scotland, British Isles, Europe and Planet
datasets.]

Figure 3 Database Sizes after Import with Corresponding OSM XML File Size

Database size after data import was smaller in PostgreSQL; on every dataset scale MongoDB used
more space (Figure 3). This was an unexpected outcome, as it was generally expected that a flat file
database might be less storage hungry. MongoDB is storage heavy because of the BSON and JSON
formats used; these are larger because the full key name is stored alongside the value for every
field in every document. MongoDB also assigns extra storage space to the data files; this is part of
the MongoDB system processes which pre-allocate data and journal files (MongoDB, 2014).
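
A rough way to see the cost of repeating key names in every document is to compare serialised
sizes. The sketch below uses JSON text length as a stand-in for BSON size, which is an assumption
(BSON is a binary format with its own per-field overhead), and the records are made up for
illustration.

```python
import json

# Hypothetical point records; the field names are illustrative only.
records = [
    {"node_id": i, "longitude": -3.19 + i * 0.001, "latitude": 55.95}
    for i in range(1000)
]

# Document style: every record carries its own key names.
document_bytes = sum(len(json.dumps(r).encode("utf-8")) for r in records)

# Table style: key names are stored once, rows hold values only.
header_bytes = len(json.dumps(list(records[0].keys())).encode("utf-8"))
row_bytes = sum(len(json.dumps(list(r.values())).encode("utf-8")) for r in records)
table_bytes = header_bytes + row_bytes

print(document_bytes, table_bytes, round(document_bytes / table_bytes, 2))
```

The ratio printed depends entirely on how long the key names are relative to the values, so it
should be read as an illustration of the mechanism rather than an estimate for the OSM imports.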

4.3 Benchmark queries


Table 3 Database System Nodes, Ways and Relations Count

DATASET          POSTGIS                                         MONGODB
                 Nodes        Ways        Relations   (s)        Nodes        Ways        Relations   (s)
EDINBURGH        629914       89192       1224        0.387      629914       89192       1226        0.134
SCOTLAND         9599366      781141      3410        0.639      9599927      781550      7767        0.154
BRITISH ISLES    68663589     7903649     69120       1.381      68663974     7903055     129212      0.159
EUROPE           910436599    149397410   162382      4.332      910436463    149397      2753972     0.175
PLANET           2378825598   237290582   2604175     16.521     2378825371   237290316   2604372     0.230

Operating on identical OSM spatial datasets, each system imported slightly different
numbers of OSM Nodes, Ways and Relations from the raw XML OSM file, as shown in Table 3.
MongoDB and PostGIS varied by minuscule amounts until the British Isles dataset, where MongoDB
recognised a higher number of Relations. This theme continued up through the remaining dataset
scales, with the final Planet.OSM set showing less variation again. These differences come down to
the different importing tools used. Generally the interpretation of Ways and Nodes is the same,
with the Relations causing the most dissimilarities. The differences can occur because each tool
applies different interpretation rules to the OSM data. A Way whose starting and ending Nodes are
not attached might be classed as an open way, whereas a Way whose starting and ending Nodes are
attached might be classed as a closed line or an area. These interpretations can differ between
importing tools; OSM2PGSQL has stricter import criteria, as displayed in the default.style script
which defines the importing and extraction of OSM data into geometries within the DBMS. The
contrasting importing of the raw data will alter the geometries recorded within the databases and
therefore the results returned when spatial queries are run.
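
The open versus closed distinction can be sketched as a simple rule on a Way's node references.
The rule and the `classify_way` helper below are illustrative assumptions; real importers such as
OSM2PGSQL also consult tags and their style configuration before deciding whether a closed Way is
a line or an area.

```python
def classify_way(node_refs):
    """Classify a way from its ordered list of node ids."""
    if len(node_refs) < 2:
        return "invalid"
    # A way is closed when it returns to its starting node; at least four
    # references are needed to enclose a ring of three distinct nodes.
    if node_refs[0] == node_refs[-1] and len(node_refs) >= 4:
        return "closed"  # candidate closed line or area
    return "open"


print(classify_way([10, 11, 12, 10]))  # ring of three distinct nodes
print(classify_way([10, 11, 12]))      # simple open line
```

Two tools that draw this line in slightly different places will report different feature counts
from the same file, which is the kind of divergence visible in Table 3.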

Table 4 MongoDB Outperformed PostGIS

DATASET          FIND NEAREST POINT (s)   DISTANCE BUFFER (s)
                 PostGIS    MongoDB       PostGIS    MongoDB
EDINBURGH        0.184      0.134         0.011      0.159
SCOTLAND         0.21       0.135         0.015      0.341
BRITISH ISLES    0.29       0.137         0.034      1.893
EUROPE           8.34       0.259         13.15      3.682
PLANET           59.783     0.361         53.614     7.928

MongoDB highlighted its performance advantage in two situations (Table 4): Find Nearest
Point and Distance Buffer. These queries used $geoNear and $near with a $maxDistance, which perform
the same function, with $geoNear providing further diagnostic information. These queries
outperformed PostGIS, with the Find Nearest Point query having the biggest advantage. The overall
indexing and database operation presents a more effective process than the relational system, but
where the quicker performance stems from is hard to identify and quantify without analysing
internal system operations and query processing. The $geoWithin operator's increased performance
over the $near function is down to the query results not being sorted; as MongoDB states, this gives
quicker query results because no extra sorting step is needed on return (MongoDB, 2014).
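
For orientation, the two kinds of query discussed above might look roughly as follows. The MongoDB
filters use the `$near`, `$geometry`, `$maxDistance`, `$geoWithin` and `$centerSphere` operators,
and the SQL uses the PostGIS `ST_DWithin`, `ST_SetSRID` and `ST_MakePoint` functions. The document
field `geometry`, the table `osm_points`, a longitude/latitude (SRID 4326) column `geom` and the
coordinates are all assumptions for illustration, not the project's actual schema or queries.
Nothing is executed against a database here.

```python
EARTH_RADIUS_M = 6378100.0  # approximate Earth radius, used to convert metres to radians


def near_filter(lon, lat, max_distance_m):
    """MongoDB filter: points within max_distance_m, returned nearest first."""
    return {"geometry": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [lon, lat]},
        "$maxDistance": max_distance_m,
    }}}


def within_filter(lon, lat, max_distance_m):
    """MongoDB filter: points within the same radius, returned unsorted."""
    return {"geometry": {"$geoWithin": {
        "$centerSphere": [[lon, lat], max_distance_m / EARTH_RADIUS_M],
    }}}


def postgis_dwithin_sql():
    """Parameterised PostGIS counterpart (placeholders: lon, lat, metres)."""
    return (
        "SELECT osm_id FROM osm_points "
        "WHERE ST_DWithin(geom::geography, "
        "ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, %s)"
    )


print(near_filter(-3.19, 55.95, 500))
print(within_filter(-3.19, 55.95, 500))
print(postgis_dwithin_sql())
```

The first filter asks the server to order results by distance, while the second only tests
membership of the circle, which is the sorting difference the paragraph above attributes the
$geoWithin speed advantage to.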
The remainder of the spatial queries all ran quicker on the PostGIS system (Table 5). One possible
factor is the use of tags in the WHERE criteria of the query, such as amenity=highway: adding an
unindexed clause to the query had a greater impact on the MongoDB system. Although no
special indexes were built on these values, this suggests that MongoDB is not well suited to
searching on tags within these spatial queries. PostgreSQL performs better in these situations,
possibly because its spatial indexing is more effective when processing the spatial part of the
query, speeding up the query overall.

Table 5 PostGIS Provided Better Spatial Performance On The Whole

DATASET          POINT IN POLYGON (s)   LINE INTERSECT (s)     BBOX GEOMETRIES (s)    BBOX KML RENDER (s)
                 PostGIS    MongoDB     PostGIS    MongoDB     PostGIS    MongoDB     PostGIS    MongoDB (JSON)
EDINBURGH        0.215      2.49        1.225      3.549       0.023      0.948       0.054      10.494
SCOTLAND         0.21       2.569       1.732      7.982       0.065      1.349       0.092      13.444
BRITISH ISLES    0.448      4.429       2.745      17.138      0.327      1.928       0.668      14.591
EUROPE           16.221     18.562      8.056      29.389      6.813      7.859       9.841      34.948
PLANET           45.897     76.284      98         92.298      29.332     34.384      12.948     54.582

PostGIS performs better with bounding box queries. MongoDB bounding box operators can only be used
once a spatial index has been created; even so, MongoDB is outperformed. PostGIS spatial indexing is a
major factor in securing the faster performance. Recording the bounding boxes of geometries, and which
features are related to others, in greater detail pays dividends in query speed. The exception is the
Line Intersect query on the Planet dataset, where MongoDB was the quicker system. PostGIS was the
faster system for this query up to this dataset size. The outcome could be down to a system slowdown
during operation, a slightly larger dataset due to the importing of the geometries, or a combination
of dataset size and query complexity, although this was not the case across the other Planet queries.
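The comparison in Table 5 can be restated as ratios. The sketch below is an illustration only: the timings are transcribed from Table 5, with the Europe and Planet rows assigned on the assumption that each stacked pair of values in the table reads Europe first and Planet second, and the names TABLE_5 and slowdown_ratios are hypothetical.

```python
# Timings in seconds transcribed from Table 5 as (PostGIS, MongoDB) per query.
QUERIES = ["POINT IN POLYGON", "LINE INTERSECT", "BBOX GEOMETRIES", "BBOX KML RENDER"]
TABLE_5 = {
    "EDINBURGH": [(0.215, 2.49), (1.225, 3.549), (0.023, 0.948), (0.054, 10.494)],
    "SCOTLAND": [(0.21, 2.569), (1.732, 7.982), (0.065, 1.349), (0.092, 13.444)],
    "BRITISH ISLES": [(0.448, 4.429), (2.745, 17.138), (0.327, 1.928), (0.668, 14.591)],
    "EUROPE": [(16.221, 18.562), (8.056, 29.389), (6.813, 7.859), (9.841, 34.948)],
    "PLANET": [(45.897, 76.284), (98.0, 92.298), (29.332, 34.384), (12.948, 54.582)],
}


def slowdown_ratios(table):
    """Return {dataset: {query: MongoDB time / PostGIS time}}.

    A ratio above 1 means PostGIS was the faster system for that cell.
    """
    return {
        dataset: {q: mongo / pg for q, (pg, mongo) in zip(QUERIES, rows)}
        for dataset, rows in table.items()
    }


ratios = slowdown_ratios(TABLE_5)

# Cells where MongoDB was the faster system (ratio below 1).
mongo_wins = [
    (dataset, query)
    for dataset, per_query in ratios.items()
    for query, ratio in per_query.items()
    if ratio < 1.0
]
```

Under this reading of the table, the Planet Line Intersect query is the only cell in which the ratio falls below 1.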
The rendering of spatial geometries is a task quite commonly run in the GIS industry (Shekhar, et
al., 2001). This query was added to compare the output capabilities of each system. MongoDB could not
export common spatial formats or OGC standard formats. PostGIS offered a choice of formats, from which
KML was selected. MongoDB offered only CSV or JSON. Although the two systems are not directly
comparable in this situation, PostGIS had significantly quicker output times.
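To make the format gap concrete, the sketch below shows the kind of client-side step needed to turn a JSON record into KML. It is an assumption-laden illustration rather than the method used in this project: the document shape, the sample coordinates and the function name point_to_kml are hypothetical, and only the Python standard library is used.

```python
import json
import xml.etree.ElementTree as ET


def point_to_kml(doc):
    """Serialise a GeoJSON-style Point document as a KML Placemark string."""
    lon, lat = doc["geometry"]["coordinates"]  # GeoJSON order is [lon, lat]
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = doc.get("name", "")
    point = ET.SubElement(placemark, "Point")
    # KML coordinates are also written longitude first, then latitude.
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(placemark, encoding="unicode")


# Hypothetical record, shaped like JSON output from a document store.
doc = json.loads(
    '{"name": "Edinburgh", "geometry": {"type": "Point", "coordinates": [-3.19, 55.95]}}'
)
kml = point_to_kml(doc)
```

PostGIS performs the equivalent serialisation inside the database, which is one reason the two output timings are not directly comparable.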
The spatial queries continued to highlight how limited MongoDB's spatial functionality is. The
Buffer, Area of Polygon, Length of Lines and Length of Lines within Polygon queries could not be run
due to a lack of functionality. PostGIS was a far superior spatial system, capable of complex spatial
queries. PostGIS performed relatively quickly, but with the increased dataset sizes of the Europe and
Planet OSM files a dramatic slowdown in performance was witnessed. This occurred for multiple spatial
queries run in PostGIS: Length of Lines in Polygons, Length of Lines, Area of Polygon and Buffer.
The increased dataset size has a performance-reducing effect on the PostgreSQL system, an outcome
similar to that found by Patton, et al. (2000). System configuration tuning could be a major factor
when larger datasets are processed in the system hardware environment. Effective tuning of parameters
would definitely aid system performance; besides this, however, the system indexing seemed to lose
effectiveness in contrast to the simplistic MongoDB index. It is hypothesized that, regardless of
database tuning and increased hardware performance, PostGIS query time will continue to increase
linearly with dataset size. The high density of data within a small area of the earth can reduce
spatial indexing effectiveness, coupled with the minimum bounding rectangles, especially those of the
line intersects, being quite large and thus not effective at narrowing the search space. Better
database setup, such as the introduction of table partitioning, might help, as this can boost query
performance by reducing index size, removing the heavy access load on a single table and allowing the
accessed indexes to fit into system memory (PostgreSQL, 2013).
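The point about large minimum bounding rectangles can be illustrated with a small filter-and-refine sketch. This is a simplified model, not the PostGIS implementation: the diagonal line, the grid of points and the function names are assumptions chosen for the example.

```python
def mbr(coords):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of a coordinate list."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains(box, point):
    """Filter step: a cheap test answered from the rectangle alone."""
    min_x, min_y, max_x, max_y = box
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y


def on_segment(a, b, p, tol=1e-9):
    """Refine step: exact test that point p lies on the segment from a to b."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if abs(cross) > tol:
        return False  # p is not collinear with the segment
    return (
        min(a[0], b[0]) - tol <= p[0] <= max(a[0], b[0]) + tol
        and min(a[1], b[1]) - tol <= p[1] <= max(a[1], b[1]) + tol
    )


# Hypothetical long diagonal line and a 10 x 10 grid of candidate points.
line = [(0.0, 0.0), (9.0, 9.0)]
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]

box = mbr(line)
candidates = [p for p in grid if mbr_contains(box, p)]
matches = [p for p in candidates if on_segment(line[0], line[1], p)]
```

In this example the rectangle of the diagonal line covers the whole grid, so the filter step passes all 100 points although only 10 lie on the line. The expensive refine step therefore runs on every candidate, which is the loss of index effectiveness described above.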

5 Discussion
5.1 Functionality
The database selection process at the start of the project tried to select comparable NoSQL and
relational database systems on the basis of functionality and spatial capabilities. The PostGIS
extension for PostgreSQL has already proved its functionality and reliability through many GIS
applications (CartoDB, 2014, PostGIS, 2012). MongoDB, boasting its latest spatial capability updates
and strong documentation, was selected as an adequate match. During the project it became clear that
various functionality thought to exist did not. MongoDB being undermatched spatially has possibly
decreased the overall potential of this system, but its performance advantages over PostGIS became
clear through the testing phase. This leads to an important question: does NoSQL have a place in
GIS?
In creating the spatial querying criteria, the vast spatial functionality of PostGIS was not fully
tested, in order to match MongoDB. The comparable spatial queries were based on the spatial
functionality available in both systems, and on that basis the methodology of this project was deemed
suitable for testing spatial performance. Purely on its spatial capabilities, MongoDB does not
currently meet adequate standards to operate in the specialised GIS industry.
The NoSQL system's basic functionality presents a promising outcome. Where it outperformed PostGIS
it did so exceptionally, with a clear jump in performance. This functionality is acceptable for the
majority of non-specialist GIS use, where location-based features are in use. It cannot simply be
accepted that general GIS users will only need this functionality, but in the case of a web-based app
working with spatial data collected by the public, it seems very capable of meeting the spatial needs
with high performance. The better CRUD performance overall highlights the potential of a NoSQL system
to provide a quick and efficient backbone for any application with a large user base.
The in-depth and vast spatial analysis used in specialist GIS applications today simply outweighs
MongoDB's capabilities. PostGIS has a much richer spatial functionality catalogue that allows
specialist GIS applications to continue to develop. However, where MongoDB could be, and is being,
used within the GIS industry is in simple web-based spatial services. Foursquare is a prime example of
a massive user base operating with basic spatial operations.
The spatial operations tested, such as the nearest neighbour queries (Table 4), show that MongoDB
provides fast performance on small to large datasets, outperforming the more sophisticated spatial
database as dataset size increases. Although it presents many performance benefits over PostGIS in the
comparable queries, MongoDB is still a young system with an even younger spatial component. In terms
of spatial functionality this NoSQL system does not have a place in the advanced GIS user environment,
but for less spatially inclined analysis and big user environments MongoDB seems a viable option.
PostGIS supports a selection of Open Geospatial Consortium (OGC) standards: the simple features
standards, the markup language for simple features and the standards for managing spatial tables
using SQL. This compliance with the OGC standards allows for open interaction with other GIS
components. Regardless of the OGC agenda to make systems accept interchangeable data formats, having
the option is a great asset in the GIS industry, in contrast to certain corporate database systems.
MongoDB does not support OGC standards, but with the implementation of various open-source software
the NoSQL system can comply with them (Baptista, et al., 2011).

5.2 Performance
5.2.1 Spatial
Overall, the spatial performance of PostgreSQL with the PostGIS spatial extension is better than that
of MongoDB, providing a fast and efficient service for system queries. An inferior spatial indexing
framework compounds the limited functionality of the NoSQL system and is a major factor in the overall
spatial performance of MongoDB. MongoDB does show promising signs for basic spatial functionality and,
with its scaling abilities, its possible use for web-based spatial apps seems much more realistic.
PostGIS is a specialist database system for GIS users, and its rich spatial functionality sets it
apart in this project. However, achieving maximum performance potential requires a deep understanding
of database tuning and setup, in comparison to the easy MongoDB install. MongoDB's spatial
functionality provides specialist database functionality while PostGIS provides a sophisticated
service to experienced spatial users. The basic spatial operations in Table 4 highlight MongoDB's
drastic performance benefits over PostGIS. Without parameter editing, MongoDB performed dramatically
better over the larger datasets, indicating that PostGIS requires much more advanced user setup and
otherwise suffers a degradation of performance.
PostGIS operates spatial indexes on R-tree over GiST, compared to MongoDB's B-tree indexes. The
import time deficit between the two systems can be explained by the creation of PostGIS's superior
spatial indexing system. The more complex R-tree over GiST framework might sacrifice performance
relative to a standard R-tree index (Nguyen, 2009, Simion, et al., 2013), but allowing for more complex
spatial functions than contains and intersects is priceless functionality in GIS systems. This is
where the quicker-to-build MongoDB B-tree indexes fall short (Pourabbas, 2014). B-tree indexes can
only be used for basic operations such as equality or ordering, which is somewhat restrictive in
spatial applications. Limited functionality in tandem with indexing shortfalls is what was witnessed
in this project, on both sides. A B-tree index is an efficient ordered key-value map, meaning it is
quick to find records when given the key (Samet, 1995). This is why MongoDB excels at the easier read,
write and CRUD performance tests (Rigaux, et al., 2002). The B-tree index cannot store
multi-dimensional data, which hinders the spatial performance of MongoDB. MongoDB applies geohashing
to its B-tree framework to get around this issue: it computes geohash values for coordinates within
the specified location range and indexes those values. This is not friendly to complex geometries, but
for basic operations it can be sufficient. A geohashed B-tree approach can, however, reduce search
efficiency, with neighbouring points on the geoid ending up at opposite ends of the plane (Rigaux, et
al., 2002). This difference in spatial capability highlights that indexing structures do play a
significant role in the spatial performance of database systems, but do not necessarily limit
performance where complex spatial functionality is not available.
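The weakness of a geohashed one-dimensional key can be shown with a short sketch. The encoder below follows the widely published public geohash scheme and is an assumption for illustration; MongoDB's internal geohash representation may differ. The function names and the sample coordinates are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=8):
    """Encode a latitude/longitude pair using the public geohash scheme."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < precision:
        # Bits alternate between longitude and latitude, longitude first.
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every five bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def common_prefix_len(a, b):
    """Number of leading characters two hashes share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two points a few tens of metres apart, on either side of the
# equator and prime meridian.
north_east = geohash(0.0001, 0.0001)
south_west = geohash(-0.0001, -0.0001)
shared = common_prefix_len(north_east, south_west)
```

The two hashes share no prefix at all, so in an ordered key-value index these neighbouring points sit far apart and a single range scan cannot retrieve both.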
The added topological schemas implemented within the PostGIS data model also affect spatial
operation performance (Batcheller, et al., 2007). Adjacency relationships stored within the PostGIS
system provide a performance advantage, possibly contributing to the advantage seen in Table 5.

6 Limitations
With PostGIS far outweighing the NoSQL system in functionality, the queries were scaled back to match
the capabilities of the lesser system. MongoDB still could not perform a few of these queries, which
meant unmatched performance results in certain tests. This limitation was caused by the functionality
gap and was hard to mitigate without further insight into the systems during the selection process
(McCarthy, 2014).
Time constraints were always an issue; loading became a rather time-consuming process, especially
with the larger datasets. Many importing failures with PostgreSQL wasted a large amount of time, but
these were hard to eradicate without previous experience of loading OSM data through the selected
method. The importing issues were caused by insufficient parameter settings relating to cache memory
stores and memory check parameters. Understanding how to set these parameters without altering
database performance was required in order to fully load all the datasets into PostgreSQL.
The use of the Windows operating system limited the abilities and options associated with
OSM2PGSQL. Various options for importing the dataset in WGS84 were not available on the Windows
system. This required some manual conversion in order to create the MongoDB coordinates.

7 Further work
Future research would include the performance testing of other NoSQL systems. Encompassing the
entire variety of NoSQL database types available could highlight whether certain types are better
suited to spatial data. Graph-type NoSQL systems promote focussed networking potential, and within GIS
network/topological analysis is a common task. Running more focused tasks on the databases best suited
to these uses could lead to specific GIS-related comparisons.
In addition, further assessment of the systems presented in this paper would include monitoring the
performance effects of parameter editing and management. Introducing increased database system tuning
could provide a better representation of real-life GIS database performance. Further testing of these
systems would also entail a web-based service comparison with each system providing the backbone data
store. Given the increased use of web-based GIS, it would be desirable to test the web hosting
functionality of each system, determining whether full functionality is available through web server
requests.

8 Conclusion
This paper has presented performance comparisons of two database systems tested under a series of
system and spatial functionality queries: an industry-accepted RDBMS, PostgreSQL with the PostGIS
spatial extension, and the new NoSQL system, MongoDB. MongoDB showed performance benefits during
routine system tasks and basic spatial functionality but was overshadowed in the majority of spatial
queries by the far more sophisticated PostGIS system.
The results suggest that MongoDB currently does not have a place in the specialist GIS industry.
However, by outperforming the RDBMS in a series of basic spatial operations, it can be argued to have
a place in basic GIS operations for the right application. These performance gains over the RDBMS are
down to computational performance advantages rather than functionality. In the functions where
MongoDB outperformed PostGIS, this is down to the sluggish operation and over-complexity of the
RDBMS, which are not always needed. The outperformance with large data and with basic spatial
functionality suggests it is an appropriate system for today's web-based applications, as currently
suggested by its use by Foursquare. With big data web applications currently evolving and web GIS
progressing, NoSQL systems may be a logical step in the future.
This research developed suitable methods to spatially query each system, replicating the real-life
GIS operations used in the GI industry. A series of spatial operations harnessed the spatial
functionality to a comparable level between each system. The suitable method of querying these
systems is always restricted by the less capable of the systems. The deficit in functionality in
either system is a necessary variable to present, regardless of the computational performance
benefits that a system might have.
The ability to scale the relational database management system to cope with larger datasets may
improve with the customisation of parameters, but the degradation of performance that comes with
increased dataset size, as highlighted in this paper, may require further system infrastructure to
keep performance fast in comparison to MongoDB on a single node. In contrast, PostGIS provides a solid
spatial tool for many GIS users looking for advanced analysis and manipulation of spatial datasets;
however, to achieve maximum performance a strong knowledge and skill set in database configuration is
needed. MongoDB can also be tuned, but its setup was more straightforward and it adapted to the system
environment better without complex parameter editing.
Spatial indexing plays a significant role in performance. During the querying it was clear that the
more complex PostGIS indexes benefit the specialist spatial queries, but when only basic functions are
needed they become a hindrance to performance and the more simplistic MongoDB indexing excels.
The overall choice of system will be oriented around the application and desired use; looking at
the user base, data size and the spatial functionality needed will undoubtedly highlight the best
choice. The common theme between both systems is the open-source factor, which provides these
high-performance systems without the restriction of user licences, making them both cost-effective
choices.
We should expect "to see increasing fragmentation of database technology based
on use case." So, expect more options in the database realm, not fewer. And, I'll
add, expect more options (built in and extensions) to store and manage
geospatial data in those new offerings.
(Michael Stonebraker)
Stonebraker looks forward with the idea of increasing fragmentation, something MongoDB's
auto-sharding features do well, allowing for better scalability without modification of system
schemas. In this project only a single node instance was used, but these NoSQL database systems could
be the systems of the future, increasing their geospatial scalability today and structuring globally
sized datasets across a wide system architecture.

9 References
Anand, G. & Kodali, R., 1999. Benchmarking the benchmarking models. Benchmarking: An
International Journal, 15(3), pp. 257-291.
Anon., 2010. NoSQL A Relational Database Management System. [Online]
Available at: http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20Page
[Accessed 24 02 2013].
Baptista, C. d. S. et al., 2011. Using OGC Services to Interoperate Spatial Data Stored in SQL and
NoSQL Databases. Campos do Jordao, Bairro Universitario.
Batcheller, J. K., Gittings, B. M. & Dowers, S., 2007. The Performance of Vector Oriented Data
Storage Strategies in ESRI's ArcGIS. Transactions in GIS, 11(1), pp. 47-65.
Brewer, E. A., 2000. Towards Robust Distributed Systems. [Online]
Available at: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
[Accessed 01 07 2014].
CartoDB, 2014. CartoDB. [Online]
Available at: http://cartodb.com/
[Accessed 02 08 2014].
DB Engines, 2014. DB-Engines Ranking. [Online]
Available at: http://db-engines.com/en/ranking
[Accessed 10 05 2014].
de Hass, W., Quak, W. & Vermaji, M., 2008. A spatial DBMS buyers guide, s.l.: Delft University of
Technology Section GIS Technology.
Dean, J. & Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters.
Communications of the ACM, Volume 51, pp. 107-113.
Dietrich, S. W., 1992. A Practitioner's Introduction to Database Performance Benchmarks and
Measurements. The Computer Journal, 35(4).
Goodchild, M. F., 1992. Geographical information science. Geographical Information Systems, 6(1),
pp. 31-45.
Goodchild, M. F., 2002. Measurement-based GIS. In: W. Shi, P. F. Fisher & M. F. Goodchild, eds.
Spatial Data Quality. London: Taylor & Francis, pp. 5-17.
Gurret, C., Manolopoulos, Y., Papadopoulos, A. N. & Rigaux, P., 1999. The BASIS System: A
benchmarking approach for spatial index structures. In: Spatio-Temporal Database Management.
s.l.:Springer Berlin Heidelberg, pp. 152-170.
Kim, S. W. & Jeong, B. S., 2007. Performance bottleneck of subsequence matching in time-series
databases: Observation, solution and performance evaluation. Information Services, 177(22), pp.
4841-4858.
Marble, D. F. & Sen, L., 1986. The development of standardised benchmarks for spatial database
systems. Proceedings of the Second International Symposium on Spatial Data Handling, Seattle,
Washington, pp. 488-496.
McCarthy, C., 2014. Does NoSQL have a place in GIS? - An open-source spatial database performance
comparison with proven RDBMS.
MongoDB Manual, 2014. Index Introduction. [Online]
Available at: http://docs.mongodb.org/manual/core/indexes-introduction/
[Accessed 13 05 2014].
MongoDB, 2014. $geoWithin. [Online]
Available at: http://docs.mongodb.org/manual/reference/operator/query/geoWithin/
[Accessed 06 06 2014].
MongoDB, 2014. MongoDB Manual. [Online]
Available at: http://docs.mongodb.org/manual/faq/storage/#why-are-the-files-in-my-data-directory-larger-than-the-data-in-my-database
[Accessed 01 07 2014].
Nguyen, T. T., 2009. Indexing PostGIS Databases and Spatial Query Performance Evaluations.
International Journal of Geoinformatics, 5(3), pp. 1-9.
Patton, N. et al., 2000. VESPA: A benchmark for vector spatial databases. Proceedings of the 17th
British National Conference on Databases: Advances in Databases, pp. 81-101.
PostGIS, 2012. InfoTerra. [Online]
Available at: http://postgis.net/2012/10/17/infoterra
[Accessed 02 08 2014].
PostGIS, 2014. Chapter 4. Using PostGIS. [Online]
Available at: http://postgis.refractions.net/documentation/manual-1.3SVN/ch04.html
[Accessed 12 05 2014].
PostgreSQL, 2013. PostgreSQL Manual: Partitioning. [Online]
Available at: http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html
[Accessed 05 06 2014].
Pourabbas, E., 2014. Geographical Information Systems: Trends and Technologies. Boca Raton: Taylor
and Francis Group.
Pouria, A., Winstanley, A. & Anahid, B., 2013. NoSQL storage and management of geospatial data
with emphasis on serving geospatial data using standard geospatial web services, s.l.: Department of
Computer Science, National University of Ireland Maynooth.
Rigaux, P., Scholl, M. O. & Voisard, A., 2002. Spatial Databases: With Application to GIS. San
Francisco: Morgan Kaufmann.
Rigaux, P., Scholl, M. & Voisard, A., 2002. Data-Driven Structures: The R-Tree. In: D. D. Cerra, ed.
Spatial Databases With Applications To GIS. San Francisco: Morgan Kaufmann Publishers, pp. 237-259.
Rigaux, P., Scholl, M. & Voisard, A., 2002. Spatial Databases With Application To GIS. 2nd ed. San
Francisco: Morgan Kaufmann.
Samet, H., 1995. Spatial Data Structures. In: W. Kim, ed. Modern Database Systems, The Object
Model, Interoperability and Beyond. College Park, MD: University of Maryland, pp. 361-385.

Scherzinger, S., Klettke, M. & Storl, U., 2013. Managing Schema Evolution in NoSQL Data Stores.
Proceedings of the 14th International Symposium on Database Programming Languages.
Shekhar, S. et al., 2001. WMS and GML based interoperable web mapping system. GIS '01
Proceedings of the 9th ACM international symposium on Advances in geographic information
systems, pp. 106-111.
Simion, B., Ilha, D. N., Brown, A. D. & Johnson, R., 2013. The Price of Generality in Spatial Indexing,
Toronto: Department of Computer Science, University of Toronto.
Snijders, C., Matzat, U. & Reips, U. D., 2012. Big Data: Big gaps of knowledge in the field of Internet.
International Journal of Internet Science, Volume 7, pp. 1-5.
Stonebraker, M. & Cetintemel, U., 2005. One Size Fits All: An Idea whose Time has Come and Gone.
ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pp. 2-11.
Stonebraker, M., Frew, J., Gardels, K. & Meredith, J., 1993. The SEQUOIA 2000 storage benchmark.
SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International conference on Management of
data, pp. 2-12.
Strauch, C., 2013. NoSQL Databases, Stuttgart: Stuttgart Media University.
Suprio, R., Bogdan, S. & Demke, A. B., 2011. Jackpine: A Benchmark to Evaluate Spatial Database
Performance. Data Engineering (ICDE), Volume 27, pp. 1139-1150.
Vaish, G., 2013. Getting Started with NoSQL. Birmingham: Packt Publishing Ltd.
Vyas, R. K., Paliwal, M. & Pal, B. L., 2011. Conceptual Review on Relational and Spatial Database
Query Processing and Benchmarking. International Journal of Advanced Research in Computer
Science, 2(5), pp. 578-580.
Xiao, Z. & Yimin, L., 2011. Remote Sensing Image Database based on NoSQL Database.
Geoinformatics, pp. 1-5.
