
Ingestion layer on Big Data Architectures
Concepts and challenges of batch and streaming ingestion systems in lambda architectures

European Business Intelligence and Big Data Summer School, July 2018
Big Data Management and Analytics

Marc Garnica Caparrós1 and Quang Duy Tran2


1 marcgarnicacaparros@gmail.com
2 duy9968@gmail.com

ABSTRACT

In the era of the Internet of Things, mobility and biometric measurements, with huge volumes of data becoming available at high
velocity, database systems are evolving in order to be ready to ingest, store and manipulate this data, with the final goal of making
better decisions and taking meaningful actions at the right time. Data science is experimenting with new models and technologies
across all the different variables of a Big Data system. Distributed storage and high-throughput analytics attract most of the
attention, but where does it all start? How are Big Data systems innovating and solving the challenges of an efficient ingestion
layer? This paper examines the state of the art of ingestion systems in large-scale database and analytics architectures and the
need for both batch ingestion processes and stream-based ingestion systems, focusing on and exploring the new streaming and
real-time ingestion technologies and their challenges.

Introduction
Data is being massively collected in a huge range of different application areas. The utility of Big Data for data-intensive
decision-making systems is nowadays undeniable. The importance of big data lies not only in the amount of data a system
has but in what the system does with it. The main goals of Big Data systems are essentially inherited from the emergence
of Data Warehousing and Business Intelligence architectures. The main limitation of traditional BI and data warehouse
systems is that they rely extensively on structured data and are mostly tied to relational databases. Big Data systems, on the
other hand, propose more flexible technologies to adapt to such large and heterogeneous amounts of data (with NoSQL or
co-relational database systems). In recent years, researchers and schools have defined the popular challenges of Big Data
systems as the 10 Vs of Big Data, as follows:

• Volume: Size of the data

• Validity: Data quality, governance and master data management on massive data

• Velocity: Speed of data generation

• Variability: Dynamic, evolving behavior of the data sources

• Variety: Heterogeneity of the data

• Venue: Distributed heterogeneity of data coming from different platforms

• Veracity: Accuracy of the data

• Vocabulary: Data modeling, semantics and structure

• Value: Usability of the data

• Vagueness: Broad range of tools and trends creating confusion


Big Data Architecture
Big Data systems architecture is one of the most popular research fields nowadays, as it is still not clear how to build an optimal
architecture for analytics that deals with the challenges presented above. In this study we take as base knowledge the Lambda
Architecture proposed by Nathan Marz and James Warren1. The lambda architecture strives for the following properties desired for
a Big Data system: 1) robustness and fault tolerance, 2) low-latency reads and updates, 3) scalability, 4) extensibility, 5) ad hoc
querying, 6) minimal maintenance and 7) debuggability.

Figure 1. Lambda Architecture Diagram1

In this direction, the lambda architecture, presented by Marz and successfully adopted by companies such as Twitter for
data analysis, identifies three main components to build a scalable and reliable data processing system:
• The batch layer, which stores an incrementally growing dataset and provides the ability to compute arbitrary functions on it.
• The serving layer, which saves the processed views on the dataset, using indexing techniques to speed up queries and
processes.
• The real-time or speed layer, able to perform analytics on fresh data with incremental algorithms to compensate for
batch-processing latency.
Several implementations and adoptions of the Lambda Architecture have appeared in companies and research groups in recent
years. A number of studies have proposed new reference architectures and implementations of the Lambda Architecture:
some of them present a modular architecture for big data systems2,3 and others present a multilayer-based
architecture4.
The Bolster architecture4 proposes a software reference architecture that deals with Big Data systems requirements by combining
the lambda-architecture approach and Semantic Web principles. From the different implementations and proposed architectures
of Big Data systems it can be abstracted that a combination of different layers and tools is the key to maintaining systems
with such requirements. Following this dividing strategy, Big Data architectures can be easily described as layered
architectures where each layer performs a particular function.
Recent research on Big Data architecture solutions5 puts in place a setup with all the components required to analyze all aspects of
big datasets. This setup makes it possible to obtain valuable insights and build correct reasoning and decision-making
systems on top of a big data ecosystem. A Big Data Management architecture is composed of a set of pluggable parts, each of them
with very tailored and concrete tasks2.

Figure 2. Big Data Framework powered by ElixirData

The presented framework (Figure 2) gives a clear picture of the different components originating from the Lambda Architecture
(Figure 1). Similarly, from the Bolster architecture4 an upper-layer abstraction with the equivalent components and
a metadata repository with the Semantic Web engine can be inferred (Figure 3).

Figure 3. Bolster architecture abstraction. Big Data Management course at Universitat Politècnica de Catalunya, Spain

Data Ingestion
An efficient, robust and fast ingestion layer is one of the key parts of the Lambda Architecture pattern. The ingestion layer
needs to control the arrival rate of the data and how fast this data can be delivered into the analytical models and views. Data
ingestion is the process of importing, transferring, loading and processing the data before passing it to the next stage for storage.
Data ingestion was already a crucial step in data warehousing and business intelligence systems by means of ETL (Extract,
Transform and Load) processes. In these, batch processes are in charge of (1) getting the data from the different sources, (2)
performing pre-defined transformations and minor data processing on the incoming data and (3) loading the data into the centralized
and materialized database. This approach is still used in Big Data systems implementing so-called data-at-rest infrastructures.
As a first step into the new paradigms and requirements of the Big Data era, some technologies started to appear that focus
on the direct management and processing of the data in order to obtain the highest possible throughput, evolving from data-at-rest
infrastructures to data-in-motion infrastructures.
These two approaches are both present in modern Big Data systems and need to coexist in order to adapt efficiently to the
characteristics of the sources and destinations. In many Big Data scenarios the source and the destination may not have the same
input rate, format or protocol, and some on-line transformations will be required for further analysis.

Big Data Ingestion


As introduced in previous sections of this study, data ingestion has always been a key point of any analytical architecture. In
Big Data systems, data ingestion needs to deal with the inherited problems of traditional data warehousing in addition to the Big
Data challenges already presented. Big data ingestion is about moving data, in many cases unstructured data, from where it is
originated into a system where it can be efficiently stored, distributed and analyzed.
Big data ingestion can be continuous or asynchronous, real-time or batched. Lambda layered architectures expect both
cases to happen, depending on the characteristics of the source and destination. In most Big Data use cases, such as
Internet of Things deployments, both the volume and the variety of data sources are expanding exponentially, and sources need to be
accommodated, often in real time. Being able to extract the data and pipeline it to a destination system for useful analysis is
a basic feature of any big data system, and at the same time it poses many challenges in terms of time and resources.

Challenges of Big Data Ingestion


The challenges of Big Data ingestion can be grouped into three main categories: complexity, security and data accuracy, and
massive data sources.

Complexity
In many cases, big data systems are purpose-built and over-engineered tools, which ends up in heavyweight processes that increase
the complexity of data ingestion and make it time-consuming and expensive. The combination of multiple tools and customized
parametrization prevents the on-time decision making required by the current business environment. Finally, existing streaming
data processing tools create lock-in problems through their software architecture and command line interfaces.

Security and Data Accuracy


The transport and collection of data need to fulfill the data security requirements, access roles and privacy constraints. In any
lambda-architecture implementation, preserving data security and privacy in both the batch layer and the speed layer is one of the most
challenging aspects. The verification of the source and its traceability increases the complexity of the ingestion layer.

Massive Data sources


In most cases, massive sources are present in both senses: volume and variety. Smart city environments, Industry 4.0 or
biometrics systems need to be ready to ingest a huge amount of continuous and unstructured data. Ingestion systems need
to be able to deal with all this data with limited resources and redirect it to the correct connector into the system.
Moreover, modern data sources evolve rapidly, and the system should be aware of this and remain efficient while interacting
with them.
Semi-structured or unstructured data makes it even more difficult to track the state of the current data and be aware
of the changes and updates on it.

Data sources
One of the emerging challenges and opportunities of Big Data systems is the number of different and rapidly changing data sources
potentially feeding the system, both internal and external to the enterprise boundary. Noisy data has always been a problem, but
in the petabyte era the concern has increased considerably. The wide variety of data, along with high velocity and volume, has
to be seamlessly merged into the big data ecosystem through the ingestion layer. Later this data needs to be consolidated and
established in order to let analytics engines and visualization tools operate on it.

Most Big Data systems rely on a Data Lake infrastructure for their storage layer. Data lakes are pools of data to
be processed and analyzed with as little global schema or configuration as possible. The data is tagged for inquiry or search
patterns. Data ingestion systems need to be able to process the incoming data from the data sources and collect it into the
storage layer.

Figure 4. Variety of data sources5

The variety of data sources ranges from industry data held in external and internal data management tools to social network
analytical data, with a broad range of possibilities in terms of formats (XML, JSON, HTML-based) and arrival modes
(streaming, pulled, pushed).

Ingestion layer responsibilities


Effective data ingestion starts by prioritizing and cataloging data sources, validating incoming data and routing (by means of
tagging) the data items to the correct destination in the Big Data stack. When the number of data sources and formats is huge,
one of the biggest challenges is to ingest data at a reasonable speed and process it efficiently, with the final goal of analyzing it.
Facing all the data ingestion challenges implies that a correct and efficient ingestion layer needs to be able to handle and
incorporate new data sources, technologies and applications. The extraction, ingestion and collection of data needs to be
performed correctly, consistently and with trustworthy data. Moreover, the process needs to be aligned with the data
generation speed, and the system needs to scale according to the input from the data sources and be fault tolerant.
In some specific cases, the data ingestion layer may also be in charge of performing some aggregations or minor processing
algorithms in order to reduce the amount of data ingested by the system. Nevertheless, this is a design consideration, as once
the data is aggregated it can no longer be traced back to the finest granularity.
The ingestion layer can be imagined as a set of blocks that can be plugged into or unplugged from the process depending on the
characteristics of the data. This sequence of blocks may include: 1) identification, first of all of the data itself but also of its
format and structure, 2) filtration and semantic ingestion using the knowledge of the system, 3) validation against the metadata
repository, 4) noise reduction, 5) transformation by means of splitting, converging, denormalizing or summarizing the data,
6) compression when the system is not able to ingest the full size of the data, even though this may incur losing part
of it, and 7) integration into the destination: the storage layer or the speed processing layer.
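As a rough illustration of this pluggable design, the following minimal Java sketch chains a configurable sequence of blocks over incoming records. The IngestionStage interface, the stage names and the sample record are our own illustrative assumptions, not part of any of the cited tools.

import java.util.ArrayList;
import java.util.List;

// A generic, pluggable ingestion stage: each block transforms a record,
// or returns null to drop it (e.g. validation or noise reduction).
interface IngestionStage {
    String apply(String record);
}

public class IngestionPipeline {
    private final List<IngestionStage> stages = new ArrayList<>();

    // Blocks can be plugged in or left out depending on the data characteristics.
    public IngestionPipeline plug(IngestionStage stage) {
        stages.add(stage);
        return this;
    }

    public String ingest(String record) {
        String current = record;
        for (IngestionStage stage : stages) {
            current = stage.apply(current);
            if (current == null) {
                return null; // record filtered out by this block
            }
        }
        return current;
    }

    public static void main(String[] args) {
        IngestionPipeline pipeline = new IngestionPipeline()
                .plug(r -> r.trim().isEmpty() ? null : r)  // identification / validation
                .plug(r -> r.replaceAll("\\s+", " "))      // noise reduction
                .plug(r -> r.toLowerCase());               // minor transformation

        System.out.println(pipeline.ingest("  Sensor-42  TEMP  21.5  "));
    }
}

The point of the sketch is only the structure: real systems would plug in stages backed by a metadata repository, schema registry or compression codec instead of string lambdas.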

Data ingestion patterns


Common solutions for the ingestion and collection layers in database systems are captured by the following set of patterns5, which
solve most of the problems commonly encountered.

1. Multisource Extractor Pattern: Dealing with the ingestion of multiple data sources types efficiently.

2. Protocol converter Pattern: Protocol exchange and translator abstracting the information received.

3. Just-in-Time Transformation Pattern: Batch models use traditional ETL methodologies to ingest large quantities of
unstructured data.

4. Real-Time Streaming Pattern: Instant analysis is increasingly demanded due to the need for immediate results and
real-time analytical dashboards; real-time ingestion of streaming data is required.

Ingestion layer: Batch and Stream ingestion
Depending on the system requirements and business analytics features of the Big Data management infrastructure, the system
will rely more on batch layer ingestion or on speed layer (streaming) ingestion. Following the
Lambda Architecture1, a powerful and sustainable Big Data architecture must include a combination of both ingestion
approaches.
Batch ingestion is a more mature concept, usually related to traditional data warehouse approaches and SQL
technologies. It basically uses structured batch sources to update the storage layer. Given that the number
of sources in Big Data systems can change regularly, batch ingestion systems should be able to adapt to new and changing
sources.
Stream ingestion is a newer concept, aligned with stream processing research and technologies. Its main goal
is to maintain real-time ingestion and scalability. In this case, the system usually faces unbounded memory
requirements and dynamic arrival rates and ordering. Data sources evolve over time and imperfection must be assumed.

Batch Ingestion
Batch ingestion was the preferred and most used ingestion pattern in traditional data warehouses and data marts. Historical data
was gathered from multiple sources regularly by bulk updates of the data warehouse. Some sophisticated systems were able to
track the changes in the data warehouse instance and specifically update new or changed data.
These processes were mainly realized through ETL (Extract, Transform and Load) processes. This chain was one
of the first abstractions of data pipelines and flows, applying the data ingestion patterns efficiently and adding data governance, data
cleaning and deduplication, aggregation, validation and fault tolerance.
As Big Data requirements appeared in traditional data warehouses, with an increasing number of sources often providing
semi-structured or unstructured data, batch ingestion processes needed to adapt and evolve in order to face the new
challenges. The ultimate goal of batch ingestion is to ingest structured or unstructured data into the storage layer.

RDBMS connectors
In most cases, batch ingestion is in charge of connecting to the more traditional data sources available. RDBMSs are not removed
from the ecosystem when Big Data analysis comes in; instead, they remain a key part of it. Batch ingestion systems need to
make sure they are compatible with the RDBMSs that hold enterprise data.
Batch ingestion systems support connections to RDBMSs and extraction of data by means of SQL queries. They act as
intermediate layers between relational databases and more flexible storage engines.
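A minimal sketch of such an intermediate layer using plain JDBC is shown below. The connection URL, the orders table, the updated_at column and the watermark value are hypothetical, and a matching JDBC driver is assumed to be on the classpath; dedicated tools such as Sqoop or Gobblin automate this kind of extraction.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Incremental batch extraction from an RDBMS: only rows modified since the
// last run are pulled and handed over to the storage layer.
public class JdbcBatchExtractor {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/enterprise"; // assumed source database
        Timestamp lastWatermark = Timestamp.valueOf("2018-07-01 00:00:00"); // assumed last run

        String sql = "SELECT id, customer, amount, updated_at FROM orders WHERE updated_at > ?";

        try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setTimestamp(1, lastWatermark);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // In a real ingestion layer these rows would be written to the
                    // storage layer (e.g. files in a data lake) instead of stdout.
                    System.out.printf("%d,%s,%.2f,%s%n",
                            rs.getLong("id"), rs.getString("customer"),
                            rs.getDouble("amount"), rs.getTimestamp("updated_at"));
                }
            }
        }
    }
}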

Batch main features


The main features a batch ingestion tool must have are the following:

• Multiple and pluggable sources.

• Full load and bulk import of data.

• Incremental load of data by means of triggers and/or incremental snapshots.

• Compression of information if the size exceeds a certain threshold.

• Security and integration.

• Loading data and interacting directly with upper levels of the storage engine, to update views and analysis engines.

• Scheduled jobs and tasks with user defined directives.

Batch ingestion tools


Apache Sqoop6 is the most widespread tool for bulk synchronous and asynchronous data transfers between storage engines like
Hadoop and structured datastores such as relational databases: Teradata, Netezza, Oracle, MySQL, Postgres and HSQLDB.
Basically, Sqoop connects to these RDBMSs and enables fast copies, parallel data transfers and load balancing during the
batch import of data.
Similarly, Apache Gobblin7 was developed by LinkedIn as a data ingestion framework to efficiently extract, transform and
load large volumes of data from a variety of sources. It combines auto-scalability, fault tolerance, data quality and metadata
management.

Batch Layer
Batch ingestion is the intermediate step between the data sources and batch processing. The batch processing layer is
responsible for two things: 1) storing a copy of the immutable and constantly growing (append-only) master dataset as it is
ingested and 2) pre-computing batch views on that dataset1. Since the raw dataset is stored, it is possible to re-compute views
based on ad-hoc queries. This re-computation is a continuous process: when new data arrives it is aggregated into
new views. However, because this operation computes over the whole dataset, it takes a lot of time and resources to run
each time; hence, the views are not expected to be updated frequently. This is the reason why the entire batch layer needs to
interconnect and combine its tasks for a more efficient batch processing.
After the views are computed, they are stored in the serving layer. This layer is a specialized distributed database that loads a
batch view and provides indexing on it. The database only needs to support batch updates and random reads, and can omit random writes.
This makes it noticeably simpler, since random writes cause most of the complexity in databases.1

Figure 5. Batch and serving layer

Data storage system


In this layer, every data unit will be written exactly once, since data is immutable. The only write operation is adding a new
data unit; modifying data is not needed. Moreover, as the layer is responsible for computing over the entire dataset, it needs to
be able to read lots of data at once. From these factors, we can define the requirements for the storage system1:

1. Efficient appends of new data: Data is only written once; it must be easy and efficient to add a new data unit to the
dataset.

2. Scalable storage: As we store the whole, constantly growing dataset, the system needs to scale as the data grows.

3. Support for parallel processing: Pre-computing the views requires running over the whole dataset, so parallel processing is
needed to deal with large amounts of data.

4. Tunable storage: Compression can help reduce storage, but decompressing when computing affects performance.
The batch layer has to be flexible in how it stores data to suit our needs.

5. Enforceable immutability: Preventing mutable operations avoids bugs and accidental corruption of existing data.

From these requirements, a filesystem is a natural fit for the batch storage system. Files are sequences of bytes, and reading and
writing are sequential. Moreover, we have full control over the bytes, which means we can choose how to compress them to tune storage
cost versus processing cost, and how to grant permissions to enforce immutability.
The problem with a normal filesystem is that it exists on a single machine, and hence it is limited in how we store and process data.
The solution to this is a distributed filesystem, where storage and computation are spread across many servers. When the size of
the data grows, the system can scale by adding more machines to the cluster. If one node fails, data remains accessible through other
nodes thanks to replication.1
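As a small sketch of this write-once model on a distributed filesystem, the snippet below appends a batch of records to the master dataset by creating a brand-new part file with the Hadoop FileSystem API. The directory layout, file naming scheme and sample records are illustrative assumptions; cluster configuration is expected to come from the usual Hadoop configuration files.

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// New data is never merged into existing files: each ingestion run creates a
// fresh, immutable part file under the master dataset directory.
public class MasterDatasetWriter {

    public static void appendBatch(List<String> records) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path newPart = new Path("/data/master/part-" + System.currentTimeMillis() + ".txt");
        try (FSDataOutputStream out = fs.create(newPart, false /* do not overwrite */)) {
            for (String record : records) {
                out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        appendBatch(List.of("sensor-42,21.5,2018-07-02T10:00:00Z",
                            "sensor-43,19.8,2018-07-02T10:00:01Z"));
    }
}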

Computing arbitrary views
The batch layer runs functions on the master dataset to pre-compute the views that are loaded by the serving layer. Queries are then
answered by merging data from the serving layer and the speed layer views.
The master dataset grows regularly; every time new data arrives, the intermediate views need to be updated, since they could
be out of date. There are two families of algorithms for updating them, namely recomputation and incremental algorithms. A
recomputation algorithm discards the old batch view and recomputes over the entire appended dataset, while an incremental
algorithm simply updates the view directly when new data becomes available. For example, if the query asks for the total number
of records, a recomputation algorithm would append the new data, count all records and replace the view, whereas an incremental
algorithm would count only the newly arrived records and add that count to the existing number.
Obviously, the incremental algorithm is more efficient than the other. However, there are trade-offs to be considered
between these two approaches:1

1. Performance: The intermediate views of the incremental algorithm might need to store more data depending on the query, as it
cannot compute non-cumulative results simply.

2. Human-fault tolerance: If incorrect results are found, we only need to fix the error and the result will be corrected on
the next run of the recomputation. However, such errors cannot be fixed easily with an incremental algorithm, as one result
affects the following views. We need to determine where the result was computed incorrectly, how it is wrong, and correct each record.

3. Generality: The incremental algorithm needs to be tailored for each query.

As discussed in1, recomputation is fundamental for a robust data processing system, while incremental algorithms can only
serve as a supplement to increase the efficiency of the system.
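A toy Java sketch of the record-count example above contrasts the two update strategies. The in-memory lists and counters are merely stand-ins for the master dataset and the serving-layer view, which in a real system would live in distributed storage.

import java.util.ArrayList;
import java.util.List;

// Contrast between recomputation and incremental view maintenance for a
// simple "total number of records" batch view.
public class CountViewExample {

    private final List<String> masterDataset = new ArrayList<>(); // append-only
    private long countView = 0;                                   // the batch view

    // Recomputation: throw the old view away and recount the whole dataset.
    public void recompute(List<String> newData) {
        masterDataset.addAll(newData);
        countView = masterDataset.size();
    }

    // Incremental: only look at the newly arrived records.
    public void incremental(List<String> newData) {
        masterDataset.addAll(newData);
        countView += newData.size();
    }

    public static void main(String[] args) {
        CountViewExample views = new CountViewExample();
        views.recompute(List.of("a", "b", "c"));
        views.incremental(List.of("d", "e"));
        System.out.println("count view = " + views.countView); // 5
    }
}

The incremental path touches only two records here, but a non-cumulative query (for example a distinct count) would force it to keep extra state, which is exactly the performance trade-off listed above.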

Stream Ingestion Layer


Studies working on instantiations of the Lambda Architecture1 and more detailed frameworks4 agree on the fact that the speed
layer needs to deal primarily with velocity. Batch procedures are by definition not able to deal with a data-in-motion model
just in time in order to obtain efficient and fast results. The target input is a continuous flow of incoming data, unbounded streams
of high timeliness that require new patterns and technologies to deal with this arrival rate. Data ingestion is only the first part of
a speed data pipeline that also includes a dispatcher process and analytical processing of the stream.
Even though these are three separate parts of the pipeline, the speed-oriented nature of the layer makes them closely
interconnected and related. This is the reason why most of the tools in the market offer all the roles in the same system.
Stream ingestion usually starts with the implementation of a message queue for pushed raw data streams (Internet of Things
sensors, biometrics or social data). Scalability is key in this component in order to be able to deal with high throughput rates
without losing any incoming data. The dispatcher is in charge of the first data quality check (schema checking and data integrity)
and also of event routing, moving the streams to the desired destination in the stack: either batch processing towards the storage
level or speed processing through real-time processing and views.
Finally, the stream processing component performs efficient algorithms over the streams. Computing over the full data is
sometimes not feasible in real-time stream processing, which implies the execution of summarization and synopsis
algorithms.

Message queues for stream ingestion


The concept of streams has become more and more popular with the disruption of the Internet of Things, sensor data and big data
deployments. Formally, a data stream8 is a sequence of data items read in increasing order. The common properties of a data
stream are that the arrival rate is not under the control of the system and is usually above the processing rate of the system. The
system is therefore always facing unbounded memory requirements, and the data is moving constantly.
In order to correctly ingest data streams for further stream processing, the system needs to implement message queues.
Message queues have been enriched with scalability mechanisms to handle high ingestion rates from heterogeneous systems.
Message queue tools need to be ready to ingest high volumes of streams without losing efficiency or data and with high
performance. Tools like Apache Kafka implement partitioned queues, which enable the system to scale horizontally as
required.
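A minimal sketch of pushing raw events into such a partitioned queue with the Kafka Java producer client follows. The broker address, the topic name and the payload are illustrative assumptions.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Pushes raw sensor readings into a Kafka topic; the topic's partitions are
// what allow the ingestion layer to scale horizontally.
public class SensorEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key (sensor id) determines the partition, keeping readings
            // of the same sensor ordered within one partition.
            producer.send(new ProducerRecord<>("sensor-events", "sensor-42", "21.5"));
            producer.flush();
        }
    }
}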

Data streams models


• Synchronous and asynchronous stream capture
Synchronous stream capture does not rely on logging of resources; it captures changes directly when they are made. Asynchronous
stream capture, on the other hand, receives the complete, time-ordered list of streams with the changes made.

• On-demand and continuous streams
The on-demand speed layer is based on the idea of using batch processes as much as possible and switching to stream processing
only when batch processes are likely to exceed response times. Most of the commercial streaming tools take this approach
through micro-batches and on-demand real-time engines, for efficiency and sustainability reasons. The continuous streaming
option requires more structured streams entering the system and a high enough arrival throughput to make the system
performance worthwhile. A micro-batch sketch follows this list.
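As a hedged illustration of the micro-batch approach, the sketch below uses Spark Structured Streaming (one of the engines mentioned later in this paper) to read the Kafka topic from the previous example in fixed five-second micro-batches. The broker, topic and trigger interval are illustrative assumptions, and the spark-sql-kafka connector is assumed to be on the classpath.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

// Reads the stream in small batches (micro-batching) instead of record by
// record, trading a little latency for throughput and fault tolerance.
public class MicroBatchIngestion {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("micro-batch-ingestion")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "sensor-events")
                .load();

        StreamingQuery query = events.writeStream()
                .format("console")                            // stand-in for a real sink
                .outputMode("append")
                .trigger(Trigger.ProcessingTime("5 seconds")) // one micro-batch every 5 s
                .start();

        query.awaitTermination();
    }
}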

Data pipeline: Tagging and routing


An essential feature of stream ingestion systems is the ability to redirect the incoming data streams to the target destinations.
In some cases the raw streams will be stored in the storage layer for batch processing, and in other cases the streams will
feed a real-time analytics data flow. Most of the time, the stream will feed both flows at once in order to gain the best of both.
Under this assumption, the system needs an engine able to understand heterogeneous data and tag the streams in order to route
them to their destination.
The most used tools for this purpose, such as Apache Kafka or Apache Storm, integrate with other tools such as Apache
HBase or Apache Spark for more efficient real-time analysis of streaming data, aligned around the following routing concepts of a
data stream pipeline (a consumer-side sketch follows the list below):

1. Topics - User-defined categories to route messages. Topics can be partitioned to improve performance and scalability.
2. Producers - Publishing messages to one or more topics.
3. Consumers - Consuming messages from the topics they subscribe to.
4. Brokers - Managing persistence and replication of data among topics.
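The consumer-side sketch below illustrates this tagging and routing idea with the Kafka Java client: the consumer subscribes to the ingestion topic and routes each record either to a batch/storage path or to a speed path based on a simple tag in the key. The group id, topic name and routing rule are hypothetical choices made for the example.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Reads ingested events and routes them: "raw-" tagged events go to the
// batch/storage path, everything else feeds the real-time (speed) path.
public class RoutingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ingestion-router");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.key() != null && record.key().startsWith("raw-")) {
                        System.out.println("to batch/storage: " + record.value());
                    } else {
                        System.out.println("to speed layer:   " + record.value());
                    }
                }
            }
        }
    }
}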

Streaming data ingestion tools


In this section, several technological solutions for streaming data ingestion are discussed.
Amazon Kinesis Data Firehose9 is a cloud-based service from Amazon Web Services. It can collect data from thousands
of sources such as website clickstreams, social media and IT logs, and deliver it to the preferred destination, such as Amazon S3,
Elasticsearch or Splunk.
Apache Flume10 is a distributed service for efficiently collecting, aggregating and moving large amounts of log data. Its
architecture is based on streaming data flows, providing robustness and fault tolerance with tunable reliability and
recovery mechanisms. It allows micro-batching operations as well as low-latency continuous real-time ingestion.
Apache Kafka is an open-source distributed stream processing platform that provides a unified, high-throughput, low-latency
platform for handling real-time data feeds11. The system is fault-tolerant, high-throughput and horizontally scalable12. Data is
ingested from processes called producers and routed to consumers.
Apache Beam13 is a Java-based unified model and set of SDKs for defining data processing workflows as well as data ingestion and
integration flows. It was open-sourced by Google and is the basis of the Google Cloud Dataflow system14.
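As a hedged sketch of how such a unified model describes an ingestion flow, the following minimal Apache Beam (Java SDK) pipeline reads text records from one location and writes them to another. The paths are placeholders, and a concrete runner (for example the direct runner or Cloud Dataflow) is assumed to be available on the classpath.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// A tiny ingestion flow expressed in Beam's unified model: the same pipeline
// definition can run on different runners (Direct, Spark, Flink, Dataflow).
public class BeamIngestionFlow {

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
                .apply("ReadRawRecords", TextIO.read().from("/tmp/ingest/input/*.txt"))
                .apply("WriteToLake", TextIO.write().to("/tmp/ingest/output/part"));

        pipeline.run().waitUntilFinish();
    }
}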

Speed layer
Similarly to what was introduced for the batch layer, the speed layer of the lambda architecture is the concatenation of stream
ingestion and stream (real-time) processing of information, which ends with the creation of real-time views that are
consumed by the serving layer. The speed layer is responsible for computing real-time views and maintaining them with
low latency. The key difference with the batch layer is that the views must be updated quickly, within milliseconds to seconds.
Unlike the batch layer, which uses recomputation algorithms to update views, the speed layer updates them with incremental
algorithms in order to keep the update time as low as possible.
As data comes in, views are updated with an incremental algorithm, using the previous views and the newly arrived data. This
requires a database supporting 1) random reads and 2) random writes in order to alter existing views, besides 3) scalability and
4) fault tolerance. These properties are common to NoSQL databases, which support a wide range of data models and index
types, so the system designer can choose based on their indexing requirements.
The architecture of the system also depends on whether views are updated synchronously or asynchronously. Synchronous
systems issue an update request to the database and block until it finishes, while asynchronous systems place requests in
a queue and run the updates periodically. Asynchronous updates are slower because of the delay between the arrival of a request
and its execution; however, they can increase throughput, as multiple messages are read at a time and batch updates can be used.
Moreover, synchronous updates could overload the database and interrupt the application.
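A small sketch of the asynchronous variant follows: update requests are buffered in a queue and a background thread drains them in batches into the real-time view. The in-memory map is only a stand-in for a NoSQL store, and the batch size and sensor keys are arbitrary choices for the example.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Asynchronous incremental maintenance of a real-time view: producers enqueue
// update requests, a background thread applies them in small batches.
public class AsyncViewUpdater {

    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    private final Map<String, Long> realTimeView = new ConcurrentHashMap<>(); // stand-in for a NoSQL store

    public void submit(String key) {
        requests.add(key); // non-blocking for the ingesting application
    }

    public void start() {
        Thread updater = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            while (true) {
                try {
                    batch.add(requests.take());  // wait for at least one request
                    requests.drainTo(batch, 99); // then grab up to 99 more
                    for (String key : batch) {
                        realTimeView.merge(key, 1L, Long::sum); // incremental count update
                    }
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        updater.setDaemon(true);
        updater.start();
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncViewUpdater updater = new AsyncViewUpdater();
        updater.start();
        updater.submit("sensor-42");
        updater.submit("sensor-42");
        Thread.sleep(200); // give the background thread time to apply the batch
        System.out.println(updater.realTimeView); // {sensor-42=2}
    }
}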

Summary
The lambda architecture and Big Data architecture frameworks rely on the combination of batch processes and streaming processes
to achieve the high computational performance and heterogeneous semantic management required to cope with the Big Data
challenges. This paper has focused on the ingestion roles and responsibilities in both layers and has presented the next
stages in the pipeline for the storage layers, batch processing and real-time analytics. As in any layered and pluggable framework,
each of the components needs to be examined and developed in detail, while the connections and interactions between them also need
to be technically supported. It has been argued that the management of the data sources, and the processes in charge of extracting
their data and feeding the system with it, are a key point in a Big Data ecosystem.
The main line of research on Big Data ingestion systems holds that the ingestion processes of the batch layer and the
speed layer should not be managed completely independently. Instead, they should coexist in the same tool or ecosystem in order to
take the best of each programming model for the system requirements.
In parallel, there is a strong research trend nowadays focused on the programming models of ingestion systems. Until
now, the most frequently used tools for Big Data ingestion, such as Apache Flume, Apache Kafka or Apache Sqoop, relied
on developer code to define the pipelines and data flows of the system from its sources. New research is trying to involve the
end user in this decision and modeling, deciding not only the underlying layer where the data will flow and the components
needed, but also the operations that the pipeline needs to perform at each stage.
Furthermore, the scale between real-time continuous analysis and batch processing seems to be gaining a new intermediate
component. Micro-batching, implemented in systems such as Apache Spark Streaming or Apache Flume, can produce only slightly
slower real-time results while keeping the benefits of batch processing: data accuracy and fault tolerance. This approach seems to be
leading the direction of Big Data ingestion systems. These systems should identify the incoming raw data and be able to
decide on the next step: real-time ingestion and processing, micro-batch processing, or iterative batch processing and storage.

References
1. Marz, N. & Warren, J. Big Data: Principles and Best Practices of Scalable Real-time Data Systems (Manning, 2015).
2. Santos, M. Y. et al. A big data analytics architecture for industry 4.0. In Rocha, Á., Correia, A. M., Adeli, H., Reis, L. P. &
Costanzo, S. (eds.) Recent Advances in Information Systems and Technologies, 175–184 (Springer International Publishing,
Cham, 2017).
3. Simmhan, Y. et al. Graywulf: Scalable software architecture for data intensive computing. In 2009 42nd Hawaii
International Conference on System Sciences, 1–10 (2009). DOI 10.1109/HICSS.2009.235.
4. Nadal, S. et al. A software reference architecture for semantic-aware big data systems. Inf. Softw. Technol. 90, 75 – 92
(2017). DOI https://doi.org/10.1016/j.infsof.2017.06.001.
5. Sawant, N. & Shah, H. Big Data Application Architecture Q&A: A Problem-Solution Approach (Apress, 2013).
6. Apache Sqoop.
7. Apache Gobblin.
8. Guha, S., Gunopulos, D. & Koudas, N. Correlating synchronous and asynchronous data streams. In Proceedings of the
Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, 529–534 (ACM,
New York, NY, USA, 2003). DOI 10.1145/956750.956814.
9. Amazon Kinesis Data Firehose (2018).
10. Apache Flume (2015).
11. Apache Kafka.
12. Namiot, D. On big data stream processing. Int. J. Open Inf. Technol. 3, 48–51 (2015).
13. Apache Beam.
14. Google Cloud Dataflow.

