Anda di halaman 1dari 17


As data volumes continue to grow exponentially, IT departments must evaluate new approaches to
ensure data management processes can be accomplished within finite time windows. New
strategies for enhancing scalability coupled with powerful and intuitive data integration technologies
ensure you can continue to achieve your objectives as your processing demands increase.
The goal of this white paper is to illustrate the SAP BusinessObjects Data Integrator performance and scalability for
different load strategies. We will explain both our push-down capabilities to the database as well as Data Integrators
own engine performance. Our objective is to help you learn from our best practices, which are illustrated with test
results, and help you size your Data Integrator architecture and estimate the time needed to load your data
The test scenarios we execute are based on real-world scenarios. This means:
Real-world hardware. We use two CPU dual-core machines for both source and target databases (Oracle) as
well as for the Data Integrator job server all running on Microsoft Windows 2003. For CPU-intensive jobs we
show scalability using a four CPU dual-core job server.
Representative data sets. Source data from the industry-standard TCP-C benchmark with tables from
6,000,000 records (customers table) to 60,000,000 records (order lines table).
J obs typically used in a data warehouse environment. Loading dimension tables with slowly changing
dimensions (type two) and fact tables with surrogate key lookups.
All tests were executed using Data Integrator 11.7 on Microsoft Windows servers.
This white paper concentrates on performance. But before we share the performance results, it is also important to
realize other factors, like ease-of-use and maintainability, are key elements in the evaluation of a data integration
product. All jobs used in our test scenarios are simple to develop, do not require technical database analysis skills, and
are easy to maintain and understand, thanks to the graphical representation. This is in sharp contrast to hand-coded
data integration projects which significantly increases development and maintenance efforts.
As a reference for our performance numbers, we first measured database (Oracle) and network performance by
copying a table without operations from the source to the target database using database links. This resulted in a
throughput of 54 GB/Hour, which is the maximum performance possible in our test environment and will be used as a
baseline to calculate overhead added by Data Integrator by adding more advanced transformations.
The main conclusions about Data Integrators performance are:
Data Integrator can leverage the power of the database server and do a full push down of the entire dataflow to
the source database, including loading into the target database via database links. Using this approach, no
data will be transferred between the database and the Data Integrator servers. Our tests illustrate that even
more complex data flows will be pushed down entirely to the database, resulting in fairly complex SQL
SAP AG 2009
statements generated by Data Integrators optimizer. In this scenario, Data Integrators performance will equal
the database performance.
For more complex transformations that cannot be translated to pure SQL for a full pushdown to the database,
Data Integrator will read from the source, process data in its engine, and write to the target. Data Integrators
optimizer will still push down as much as possible to the source database (joins, aggregations, functions), but
data will stream through the Data Integrator engine. For the initial load of a typical dimension table, including
data validation functions and lookups, we generated a throughput of 42 GB/Hour. In most cases, the database
and networknot the Data Integrator serverwere shown to be the bottlenecks. To illustrate the potential for
pure Data Integrator performance, without external constraints, artificial tests with random data generation and
no loading show pure engine performance of up to 142 GB/Hour.
For delta loads, Data Integrator can tightly integrate with CDC techniques provided by your database (such as
Oracle CDC) for optimal performance. In case this is not available on your source system, Data Integrators
built-in Table_Comparison transform can be used to detect changes by comparing the source with the target
table. In our 6,000,000 record customer table, we achieve a throughput of 12 GB/Hour for the delta load while
comparing each source record with the target and keeping history for changed records.
In the above cases, the Data Integrator engine is not the limiting factor. Most of the time Data Integrator is waiting on
the source or target database. Which is why CPU power on the Data Integrator server is not fully optimized. Tuning in
the above cases should be done at the database level (e.g. increase number of disks, more memory, CPU, etc.).
In the next test cases we strove to fully use Data Integrators engine performance. Whereas the common trend is to try
to push down as much as possible to the source database to improve performance, Data Integrators engine can
potentially equal or even exceed database performance (depending on database tuning) and will certainly put less
stress on your operational database sources.
SAP AG 2009
Test Conc ept
Our goal with this white paper is to provide data integration performance numbers for real-life customer scenarios,
rather than focusing on record-breaking performance numbers in an isolated and unrealistic test environment. Also
hardware and data sets used in our tests were chosen to reflect real-life scenarios. This approach will enable
customers to compare our approaches and results with their environment and use the numbers shown in this report to
estimate the time needed to meet their specific data loading requirements, which can be the foundation for a sizing
For the test scenarios weve chosen the example of loading a data warehouse from a transactional system. However
most examples can be applied to other data integration scenarios as well. For each scenario we show a number of
variations using different settings in Data Integrator and we change the degree of parallelism to show the performance
impact of these options. Where appropriate, we illustrate the execution results with the servers CPU and memory
Because the database and network are important factors that impact performance, we executed an initial test (scenario
0) to have a reference to compare other performance numbers to. In fact we will show that, as in most real-life data
integration projects, database and network tend to be the bottlenecks rather then the extracting, transforming, and
loading (ETL) tool itself.
The test scenarios we execute are based on real-world scenarios, this means:
Scenario 0: Straight Table Copy
To have reference numbers that will give an indication of database and network performance in our test environment,
we executed a simple data movement scenario where a table is copied from source to target database. We
recommend customers repeat this simple test in their environment to check what the baseline performance for their
environment is.
The scenario consists of three tests:
Load data from source to target directly, without going through the Data Integrator engine. This is
accomplished using a database link between source and target and an INSERT INTO ... SELECT ...
Read data via the Data Integrator engine and load into the target table using API bulk loading. This will
illustrate the overhead of having an extra server (hence an extra network hop) between source and target
The same test as above, without bulk loading but using INSERT statements. This technique is the only
alternative in case INSERTs and UPDATEs are mixed.
Scenario 1: Load Product Dimension
The goal of this scenario is to show our push-down capability, not only for simple data flows like scenario 0 (simple
table copy), but also for more complex scenarios. This way we leverage the power of the database and minimize the
amount of data send over the network. The job we designed joins the item table with the stock table (using an outer
join) and aggregates on products to get the total stock value (sum) and availability (count) in each warehouse. Both
initial and delta load are implemented, for the delta load we use slowly changing dimensions type 1, no history is kept.
Scenario 2: Straight Table Copy
To have reference numbers that will give an indication of database and network performance in our test environment,
we executed a simple data movement scenario where a table is copied from source to target database. We
SAP AG 2009
recommend customers repeat this simple test in their environment to check the baseline performance for their
In this scenario we load the customer dimension and need to do some additional operations:
Validate the customer data. In our example we defined two business rules stating customer name cannot be
NULL and the credit limit must be greater then 0. This is accomplished using Data Integrators validation
Lookup data warehouse keys for previously loaded dimensions like warehouse and district. You can use Data
Integrators lookup_ext function for this. Because lookup tables are relatively small, we can easily cache them
in memory (no paging will occur).
Generate a data warehouse key for each new customer (using the key_generation transform).
For delta loading, implement slowly changing dimensions type 2 (SCD2), building up history for changes in the
customers attributes. For each record, valid from and valid to dates are calculated and a current record
indicator is set. SCD2 is implemented using Table_Comparison and History_Preserving transforms.
Data will stream through the engine and we use bulk loading (API) at the target table if initial load, for delta load we
have a mixture of inserts and updates so need to fall back on parameterised SQL.
Scenario 3: J oin Tables to Load Fact Table
In this scenario we want to concentrate on Data Integrators own engine capabilities and illustrate the distributed data
flows feature in order to have better performance and maximize memory use. The scenario will load the fact table, by
joining tables from two different databases with no database link available, in this case the join will be calculated by
Data Integrator. After joining the two tables, a query is used to get data warehouse keys for the customer, district, and
warehouse dimension and provide default values for missing dimension keys. This job will require a lot of memory both
for executing the join (and caching the inner table in memory) and the lookup.
Scenario 4: Aggregate Data from File
In the last scenario we use a flat file as source (comma delimited) and aggregate the 60,000,000 lines (6 GB in file
size) to 2,000 groups calculating a sum for each group. With this example we limit the network and database
bottlenecks and show pure Data Integrator performance.
For the database, HP Proliant servers running Microsoft Windows 2003 SP2 with two dual core CPUs (which means
four logical CPUs) are used. Source and target database are Oracle 10g ( instances which are physically
located on different servers connected via a Gigabit Ethernet network.
The Data Integrator servers are located on separate servers, but these servers have the same characteristics as the
database servers. There are two different Data Integrator jobservers available: one with two dual core CPUs (hence
four logical CPUs), and one server with four dual core CPUs (eight logical CPUs). The 2 CPU machine is used for most
tests, only for the CPUintensive scenario four we show scalability with the 4 CPU server.
The table below lists the servers with their details:
SAP AG 2009
For the source data we used the TPC-C data model ( which represents a real life enterprise resource
planning-like (ERP) source system, representative data is generated automatically for all tables. Details about the
number of rows and total table size for the tables used in our scenarios are found below (W=200).
For the target schema, a star schema with fact table (ODER_FACT) and dimension tables (PRODUCT, CUSTOMER,
) is built. The customer dimension table implements a slowly changing dimension type two with additional columns to
specify valid from and valid to dates and a current record indicator.
SAP AG 2009
Below are the row counts and size of these tables after loading:
Between source and target schema a database link is set up, indexes are created and table partitioning is used. The
database is optimally tuned for the available hardware it runs on.
SAP AG 2009
Test Resul t s
This chapter show test results for the scenarios explained in the previous chapter. In the first section we show some
reference numbers about our environment (database and network). In the next two sections we make the distinction
between full push down, where the database will do all the work, and data streaming through the engine, where data
physically flows through the Data Integrator engine. The last section concentrates on the pure engine performance
illustrating that Data Intagrator can also execute memory and CPU intensive tasks.
Scenario 0: Straight Table Copy
In this section we show reference numbers for different types of loading in Data Integrator. There are no
transformations applied to the data, each of the tests is a straight copy. By copying the customers table from source to
target database via a database link, without passing through Data Integrator, we measure the maximum database and
network throughput. An INSERT INTO
SELECT statement is used to move the data which are 6,000,000 records or 3.7 GB and gives a throughput of 55
To measure the maximum performance with Data Integrator involved, we read the same table and bulk load it into the
target database using the Data Integrator engine. This results in a throughput of 54 GB/Hour which shows theres
hardly any overhead by introducing an extra server hop. We get this high performance because we use the bulk loader
technology in API mode which means we feed data directly into the target database files using Oracles Direct Path
Load API.
In the last reference test, we measured Data Integrators loading without bulk loaders enabled. In this case Data
Integrator generates insert statements and passes the values as parameters (parameterized SQL), the throughput
decreases to 27 GB/Hour, which shows for our environment (Oracle) the advantage of using bulk loading.
The tables below summarize the numbers with some extra execution details:
SAP AG 2009
Scenario 1: Load Product Dimension
A common technique used to optimize performance is to push down the full SQL statement to the database and letting
the database handle the load. This uses the database performance and results in tuning only needing to be done at the
database side. Data Integrators optimizer will always try to generate a full push down whenever this is possible,
generally when no transforms or functions are used that cannot be translated to SQL and when source and target
tables are in the same database or connected via database links.
The optimizer will always generate the SQL optimal for the database you are running against. Even the database
version can impact the SQL to insure optimal performance in your environment, so when a new database version
comes out, the generated SQL will be optimized for that version, without any change needed in the design of the data
flow. The data integration developer does not need to know advanced SQL options in order to write a high performing
data flow.
Database Links
Crucial to a full push down is the fact that all tables (sources and targets) are either in the same database or in
separate databases linked via a database link. Without this database link, only the SELECT can be pushed down to the
source and the INSERT will be handled by Data Integrator (see next scenario). In our environment, source and target
database are on separate machines and on the target database a database link is created to the source:
create database link SOURCE connect to ORABM_DI identified by <pwd>
using 'orcl_src'
This database link is imported in Data Integrator to link the target datastore to the source datastore.
For the initial load for the product dimension, the Data Integrator optimizer will generate an INSERT INTO statement
using Oracles /* APPEND */ hint for optimal performance. Operations like outer join, GROUP BY, and sum and count
aggregation functions are pushed down in the SELECT clause.
For the delta load the autocorrect load option at the target table is turned on in the design of the data flow. This option
corresponds to what is also know as UPSERTs: based on the key information a row will either be updated (if the record
SAP AG 2009
already exists in the target) or inserted (for new records). Using this option will result in a MERGE statement generated
by the optimizer. A developer can visualize the SQL generated by the optimizer in Data Integrator Designer using the
View optimized SQL functionality.
Below is the SQL generated by Data Integrator for the data flow in the screenshot above. Notice the MERGE statement
and the database links used:
WHERE ( "STOCK"."S_I_ID" (+) ="ITEM"."I_ID" ) and ( "STOCK"."S_QUANTITY" (+) >0)
) n
The performance for this scenario is pure Oracle performance. As illustrated in the reference scenario for our test
environment maximum throughput for a straight copy between the two servers is 55 GB/Hour.
Scenario 2: Load Customer Dimension (Initial Load)
When a full push down is not possible because more advanced transformations are needed which cannot be
translated to SQLonly the SELECT statement will be sent to the source. This SELECT is still optimized to push down
as much as possible to the database (e.g. joins, GROUP BY, database functions, ). The data will be passed from
source database via the Data Integrator server to the target database. At the target database, you can use bulk loading
for best performance. In this scenario, where we load the customer table, the validation transform needs to be
executed by Data Integrator and is blocking the full pushdown to the database, additionally we use a lookup function to
find data warehouse keys for district and warehouse dimensions.
For the initial load scenario, we get a throughput of 42 GB/Hour, CPU use on the Data Integrator server is low because
the bottleneck is database and network. To get an idea about the Data Integrator performance itself, we executed two
additional tests. For the first we used the same data flow but blocked all inserts before loading into the target (using
Map_operation transform), for the second test we additionally excluded the reading from the source by generating the
data in the engine with the same row size as the source data (using the Row_generation transform). The engine
throughput without data base and network overhead (so no reader and no loader) is 142 GB/Hour.
SAP AG 2009
Because the database is the bottleneck, CPU is not used 100% on the Data Integrator server. Tuning the Data
Integrator will not bring much performance improvements, throughput from the database side needs to be enhanced
first (e.g. by adding more disks for faster I/O). The graph below shows CPU use on the Data Integrator server during
the initial load indicating less then 50% of the available CPU power is used. This is typical behaviour for scenarios
where data streams through the engine.
Scenario 2: Load Customer Dimension (Delta Load)
For the delta load, you can use different techniques. Autocorrect load was already illustrated in the previous scenario,
which can result in a full push down. Another alternative is to use the database changed data capture (CDC)
capabilities. Data Integrator has native support for CDC on the database side: for Oracle, Data Integrator can subscribe
to CDC tables to get changed data only. In this case a throughput can be expected similar to an initial load because no
extra operations are needed to detect changes.
The third alternative is to use Data Integrators Table_comparison transform to compare source and target table in
order to detect changes. This is the most flexible solution, e.g. users can specify which columns to compare instead of
comparing all, and no special setup at the database side is needed. However, it will involve connections to both source
and target databases so performance will degrade due to extra bottlenecks.
SAP AG 2009
To tune performance for the Table_comparison transform, three comparison techniques can be used: row-by-row
select, cached comparison table, or sorted input.
Row-by-row select. For each row from the source, a query is executed on the target to verify if its a new, an
existing, or a changed row. This usually is the slowest method.
Cached comparison table. The compare columns from the target table are read into memory first and
comparison can be done in memory. Fast for smaller tables, for larger tables (>2 GB) paging will occur and
performance will degrade.
Sorted input. Requires incoming data be sorted on the compare keys from both source and target. This will be
fast from a Data Integrator perspective but the additional sorting on the database will put the bottleneck at that
In order to implement a slowly changing dimension type two, we keep history for each change made. The old record is
updated with a new valid to date and the current record indicator is reset. For the new version of the record, a new
data warehouse key is generated and valid from, valid to, and current record indicator are set.
The table below compares the three methods; sorted input is by far the fastest method but still significantly slower than
the initial load. Using cache does not bring performance benefit here because the cached table is 5 GB and access to
disk for paging slows down performance. The results (below) are with no changes detected between source and target,
so no inserts or updates at the target. Depending on the number of changes, more time will be needed by the database
to handle inserts and updates (bulk loader cannot be used here).
Scenario 3: J oin Tables to Load Fact Table
To load the fact table, we need to join the ORDER (6,000,000 records) and ORDER_LINE tables (60,000,000 records).
This join can be performed on the database or in the Data Integrator engine. In case the tables reside on different
databases (without dblink) or one of them is a text file, joining in the Data Integrator engine will be the only option.
Whereas the common trend is to try to push down as much as possible to the database, this scenario will proof that the
Data Integrator engine is capable of delivering similar performance as the database.
To join the two tables in Data Integrator we used the order table, which is the smallest, as inner table and cached it in
memory. The larger order_line table will not be cached in memory but will be read from the database and join each
record with the cached orders table. After joining the two tables also a lookup is performed to find the customers data
warehouse key. In order to do this the lookup columns from the customers table will be cached in memory.
The key factor to understand performance for this job is the memory usage : we will need about 1.2 GB to cache the
orders table and 1 GB to cache the customer lookup. So when doing everything in 1 process about 2.5 GB is needed
for caching and adding some additional memory for the data processing. For memory usage above 1.5 GB Data
Integrator will start paging to disk and memory access will de slower due to disk I/O. The larger the amount of memory
paged to disk, the slower the performance.
SAP AG 2009
To remedy this, it is possible to split a data flow in sub data flows where each sub data flow can access 1.5GB real
memory without starting to page. There are different techniques to create sub data flows:
Run as separate process. Memory intensive processes like lookup, sort, group by and others have this
option. The memory intensive process will run in a separate process and will benefit from more memory that
can be used.
Parall el join. A join can be split in smaller parts where each joined table is split up in smaller parts which are
individually joined and merged afterwards. By using the degree of parallelism (DOP) not only the join is plit up
in smaller parts, also other transforms in the dataflow will be parallelized.
Data_transfer transform. To break up a dataflow at any point in sub data flows. The data_tranfer transform
has the option to write to a table in order to use the database again for pushdown in downstream operations.
Each of these options can easily be set in an existing data flows without redesigning the data flow (e.g. runs as
separate process is just a checkbox). For our test we compare three execution plans:
1. Single process using pageable cache
2. Use lookup as separate process: one sub data flow for join and a second sub data flow for lookup. Each sub
data flow will use less then 1.5 GB so no paging will occur.
3. Split the join in two parallel joins, which will result in three sub data flows. Two of them performing each half of
the join, the third one merges the result and performs the lookup.
Below is a graph showing CPU use in the case lookup is executed as a separate process. This time CPU use is close
to the maximum (85%), total memory used is 2.4 GB.
SAP AG 2009
The next graph shows data for the same execution, this time for two individual sub data flows (using 1.1 GB and 1.3
GB of memory each).
The results for our three tests are very similar and result in a maximum throughput of 27 GB/Hour. When more memory
is used, pageable cache performance will be worse giving more benefit to splitting up the process in sub data flows that
fit into memory.
When the amount of memory needed by all sub data flows exceeds the total memory available on the server, paging
will occur at the operating system level (Windows). Also this will have a negative impact on performance. A solution
here might be to use grid computing and benefit from memory available on other servers in a Data Integrator server
SAP AG 2009
Scenario 4: Aggregate Data from File
We illustrate pure Data Integrator engine performance by aggregating a text file of 6 GB (60,000,000 records). to 2,000
groups. The parameters we tune in this scenario to improve performance are:
Parall el fil e reader. With the option parallel process threads (PPT) one can specify the number of parallel
threads that will be used to convert the text string read from the flat file to data organised in columns based on
the schema defined in the file format. This process is much more resource consuming then the reading from
disk itself.
Degree of parall elism (DOP). By setting a DOP greater than one for a dataflow, transforms in the dataflow will
be replicated and executed as parallel threads. Each replicated transform will treat only part of the records,
where data is distributed by a round-robin splitter to the different parallel transforms and merged again after the
last replicated transform.
The table below shows the results for various values for the two parameters. For our case the optimal combination is
found with four parallel threads for reading the file and no parallelism for the group by calculation (DOP =0). This gives
a throughput of 74 GB/Hr, doubling the throughput for the case without parallelism (33 GB/Hr).
In this case the most blocking factor is reading and interpreting the file. Increasing the degree of parallelism does not
make sense if the reader can not deliver the data fast enough. That is why, in this example, increasing the number of
parallel threads for reading improves performance dramatically, but degree of parallelism does not add extra value and
in fact decreases performance due to the extra overhead. The decrease in performance for high number of threads (>
4) is because the CPUs are already 100% used, adding more processes does not improve performance, overall
performance will go down because of the extra overhead (more processes to control, splitting and merging more data,
In this scenario there are no external bottlenecks like database or network, so we can further improve performance by
just adding more CPU power to the job server. The same job was executed on a 4 CPU dual core machine (eight
logical CPUs available). The maximum throughput in this case increases to 127 GB/Hr. Detailed results are in the table
SAP AG 2009
With more CPUs available, degree of parallelism starts to play a role when we have eight parallel reader processes.
Only in that case, the group by query gets data fast enough to make it efficient to handle the grouping in parallel.
As a general rule of thumb we can set the number of parallel threads equal to the number of (logical) CPUs available.
Parallel threads in this case can be file reader threads or parallel transforms (DOP).
With this rule, its easy to get significant performance gain (although probably not yet the optimal case) without
spending a lot of time in tuning. For our test cases this would give eight readers and DOP =eight for the 4 CPU dual
core server. This configuration comes close to the optimal configuration.
The results for both servers are summarized below in a chart comparing the original dataflow without parallelism, the
configuration with the best performance, and the configuration where we used the general rule of thumb.
SAP AG 2009
Finally we used this scenario to showcase the Data_Transfer transform. Using this transform, we can easily (within one
dataflow) copy the text file to a database table in order to pushdown the aggregation to the database (using INSERT
SELECT INTO). This might give good results in case the database server is much more powerful than the Data
Integrator server. In our test case, aggregation on database or inside Data Integrator takes about the same time. The
results for the pushdown to the database are worse in this case because it includes the additional time for loading the
file from the jobserver to the database first (using Data_Transfer).
SAP AG 2009
The idea of grid computing is to set up a group of Data Integrator job servers to act as a server grid. When a job is
launched on a server group, Data Integrators load balancing algorithm will send the job to the server in the group with
the lowest work load at that time. At the same time this is also a failover solution: when one of the servers in the grid
falls out, new jobs executed to the server group will get handled by the remaining servers. The load balancer is not a
separate process, but is built-in into the job launcher process to avoid a single point of failure.
When a job contains multiple data flows, the load balancer will, for each data flow, find the optimal job server,
spreading one job over any number of job servers. This will be very beneficial when data flows can be launched in
parallel (no dependencies defined). With the introduction of sub data flows in Data Integrator XI Release 2 Accelerated,
grid computing is extended with one extra level: even for sub data flows, you can use load balancing over a server
group. This can be useful for data flows requiring lots of memory with limited physical memory available on an
individual server. In that case, you can avoid paging by splitting the dataflow in sub data flows using the distributed
data flows options (e.g. run as a separate process) and executing each sub data flow on a separate machine. Each
sub data flow can use more memory, but there will be extra overhead because all data needs to be exchanged
between the different sub data flows. For fast (gigabit) networks, the overhead introduced by sending data from one
sub data flow over the network to another sub data flow can in be compensated by the fact that memory can be
accessed directly (without paging occurring).