
Table of Contents

Overview
Introduction to Azure Data Factory
Concepts
Pipelines and activities
Datasets
Scheduling and execution
Get Started
Tutorial: Create a pipeline to copy data
Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API
Tutorial: Create a pipeline to transform data
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
Tutorial: Move data between on-premises and cloud
FAQ
How To
Move Data
Copy Activity Overview
Data Factory Copy Wizard
Performance and tuning guide
Fault tolerance
Security considerations
Connectors
Data Management Gateway
Transform Data
HDInsight Hive Activity
HDInsight Pig Activity
HDInsight MapReduce Activity
HDInsight Streaming Activity
HDInsight Spark Activity
Machine Learning Batch Execution Activity
Machine Learning Update Resource Activity
Stored Procedure Activity
Data Lake Analytics U-SQL Activity
.NET custom activity
Invoke R scripts
Reprocess models in Azure Analysis Services
Compute Linked Services
Develop
Azure Resource Manager template
Samples
Functions and system variables
Naming rules
.NET API change log
Monitor and Manage
Monitoring and Management app
Azure Data Factory pipelines
Using .NET SDK
Troubleshoot Data Factory issues
Troubleshoot issues with using Data Management Gateway
Reference
Code samples
PowerShell
.NET
REST
JSON
Resources
Azure Roadmap
Case Studies
Learning path
MSDN Forum
Pricing
Pricing calculator
Release notes for Data Management Gateway
Request a feature
Service updates
Stack Overflow
Videos
Customer Profiling
Process large-scale datasets using Data Factory and Batch
Product Recommendations
Introduction to Azure Data Factory
8/15/2017 10 min to read

What is Azure Data Factory?


In the world of big data, how is existing data leveraged in business? Is it possible to enrich data generated in the
cloud by using reference data from on-premises data sources or other disparate data sources? For example, a
gaming company collects many logs produced by games in the cloud. It wants to analyze these logs to gain
insights into customer preferences, demographics, and usage behavior, so that it can identify up-sell and cross-sell
opportunities, develop compelling new features to drive business growth, and provide a better experience to
customers.
To analyze these logs, the company needs to use reference data such as customer information, game
information, and marketing campaign information that is in an on-premises data store. Therefore, the company
wants to ingest log data from the cloud data store and reference data from the on-premises data store. Then, it
wants to process the data by using Hadoop in the cloud (Azure HDInsight) and publish the result data into a cloud
data warehouse such as Azure SQL Data Warehouse or an on-premises data store such as SQL Server. It wants this
workflow to run once a week.
What is needed is a platform that allows the company to create a workflow that can ingest data from both on-
premises and cloud data stores, transform or process the data by using existing compute services such as
Hadoop, and publish the results to an on-premises or cloud data store for BI applications to consume.

Azure Data Factory is the platform for this kind of scenario. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud for orchestrating and automating data
movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven
workflows (called pipelines) that can ingest data from disparate data stores, process/transform the data by using
compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine
Learning, and publish output data to data stores such as Azure SQL Data Warehouse for business intelligence
(BI) applications to consume.
It is more of an Extract-and-Load (EL) and then Transform-and-Load (TL) platform than a traditional
Extract-Transform-and-Load (ETL) platform. The transformations that are performed process data by using
compute services, rather than performing transformations such as adding derived columns, counting the number
of rows, or sorting data.
Currently, in Azure Data Factory, the data that is consumed and produced by workflows is time-sliced data
(hourly, daily, weekly, etc.). For example, a pipeline may read input data, process data, and produce output data
once a day. You can also run a workflow just one time.

How does it work?


The pipelines (data-driven workflows) in Azure Data Factory typically perform the following three steps:
Connect and collect
Enterprises have data of various types located in disparate sources. The first step in building an information
production system is to connect to all the required sources of data and processing, such as SaaS services, file
shares, FTP, and web services, and to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to
integrate these data sources and processing. Such systems are expensive and hard to integrate and maintain,
and they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can
offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can
collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics
compute service. Or, collect data in Azure Blob storage and transform it later by using an Azure HDInsight
Hadoop cluster.
Transform and enrich
Once data is present in a centralized data store in the cloud, you want the collected data to be processed or
transformed by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine
Learning. You want to reliably produce transformed data on a maintainable and controlled schedule to feed
production environments with trusted data.
Publish
Deliver transformed data from the cloud to on-premises sources like SQL Server, or keep it in your cloud storage
sources for consumption by business intelligence (BI) and analytics tools and other applications.

Key components
An Azure subscription may have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of four key components that work together to provide the platform on which you can
compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory may have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task. For example, a pipeline could contain a group of activities that ingest data from an
Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The benefit is that the
pipeline lets you manage the activities as a set instead of managing each one individually. For example, you can
deploy and schedule the pipeline instead of scheduling the activities independently.
Activity
A pipeline may have one or more activities. Activities define the actions to perform on your data. For example,
you may use a Copy activity to copy data from one data store to another data store. Similarly, you may use a
Hive activity, which runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data
Factory supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the
following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy
data to and from that store.
| CATEGORY | DATA STORES |
| --- | --- |
| Azure | Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage |
| Databases | Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata* |
| NoSQL | Cassandra*, MongoDB* |
| File | Amazon S3, File System*, FTP, HDFS*, SFTP |
| Others | Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian* |

For more information, see Data Movement Activities article.


Data transformation activities
Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.

| DATA TRANSFORMATION ACTIVITY | COMPUTE ENVIRONMENT |
| --- | --- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Spark | HDInsight [Hadoop] |
| Machine Learning activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |

NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark programs from
Azure Data Factory for details. You can create a custom activity to run R scripts on your HDInsight cluster with R installed.
See Run R Script using Azure Data Factory.

For more information, see Data Transformation Activities article.


Custom .NET activities
If you need to move data to/from a data store that Copy Activity doesn't support, or transform data using your
own logic, create a custom .NET activity. For details on creating and using a custom activity, see Use custom
activities in an Azure Data Factory pipeline.
Datasets
An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data
structures within the data stores; they simply point to or reference the data you want to use in your activities as
inputs or outputs. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob
storage from which the pipeline should read the data. Or, an Azure SQL Table dataset specifies the table to which
the output data is written by the activity.
Linked services
Linked services are much like connection strings, which define the connection information needed for Data
Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data
source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies
a connection string to connect to the Azure Storage account, and an Azure Blob dataset specifies the blob
container and the folder that contains the data.
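For example, an Azure Storage linked service definition is just a name, a type, and a connection string, as in the following minimal sketch (the account name and key are placeholders):

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}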
Linked services are used for two purposes in Data Factory:
To represent a data store including, but not limited to, an on-premises SQL Server, Oracle database, file
share, or an Azure Blob Storage account. See the Data movement activities section for a list of supported data
stores.
To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive
activity runs on an HDInsight Hadoop cluster. See the Data transformation activities section for a list of supported
compute environments.
Relationship between Data Factory entities

Figure 2. Relationships between Dataset, Activity, Pipeline, and Linked service

Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate
movement of data between supported data stores and processing of data using compute services in other
regions or in an on-premises environment. It also allows you to monitor and manage workflows using both
programmatic and UI mechanisms.
Even though Data Factory is available only in the West US, East US, and North Europe regions, the service
powering the data movement in Data Factory is available globally in several regions. If a data store is behind a
firewall, a Data Management Gateway installed in your on-premises environment moves the data instead.
For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure
Machine Learning, are running in the West Europe region. You can create and use an Azure Data Factory instance
in North Europe and use it to schedule jobs on your compute environments in West Europe. It takes a few
milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job
on your computing environment does not change.

Get started with creating a pipeline


You can use one of these tools or APIs to create data pipelines in Azure Data Factory:
Azure portal
Visual Studio
PowerShell
.NET API
REST API
Azure Resource Manager template.
To learn how to build data factories with data pipelines, follow step-by-step instructions in the following
tutorials:

| TUTORIAL | DESCRIPTION |
| --- | --- |
| Move data between two cloud data stores | In this tutorial, you create a data factory with a pipeline that moves data from Blob storage to SQL database. |
| Transform data using Hadoop cluster | In this tutorial, you build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster. |
| Move data between an on-premises data store and a cloud data store using Data Management Gateway | In this tutorial, you build a data factory with a pipeline that moves data from an on-premises SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine. |
Pipelines and Activities in Azure Data Factory
8/15/2017 16 min to read

This article helps you understand pipelines and activities in Azure Data Factory and use them to
construct end-to-end data-driven workflows for your data movement and data processing scenarios.

NOTE
This article assumes that you have gone through Introduction to Azure Data Factory. If you do not have
hands-on experience with creating data factories, going through the data transformation tutorial and/or the data
movement tutorial would help you understand this article better.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. For
example, you may use a copy activity to copy data from an on-premises SQL Server database to Azure Blob
storage. Then, use a Hive activity that runs a Hive script on an Azure HDInsight cluster to
process/transform data from the blob storage to produce output data. Finally, use a second copy
activity to copy the output data to an Azure SQL Data Warehouse on top of which business intelligence
(BI) reporting solutions are built.
An activity can take zero or more input datasets and produce one or more output datasets. The
following diagram shows the relationship between pipeline, activity, and dataset in Data Factory:

A pipeline allows you to manage activities as a set instead of each one individually. For example, you
can deploy, schedule, suspend, and resume a pipeline, instead of dealing with activities in the pipeline
independently.
Data Factory supports two types of activities: data movement activities and data transformation
activities. Each activity can have zero or more input datasets and produce one or more output datasets.
An input dataset represents the input for an activity in the pipeline and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For
example, a dataset can be an input/output dataset of a Copy Activity or an HDInsightHive Activity. For
more information about datasets, see Datasets in Azure Data Factory article.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory
supports the following data stores. Data from any source can be written to any sink. Click a data store
to learn how to copy data to and from that store.

| CATEGORY | DATA STORES |
| --- | --- |
| Azure | Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage |
| Databases | Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata* |
| NoSQL | Cassandra*, MongoDB* |
| File | Amazon S3, File System*, FTP, HDFS*, SFTP |
| Others | Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian* |

NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway
on an on-premises/Azure IaaS machine.

For more information, see Data Movement Activities article.


Data transformation activities
Azure Data Factory supports the following transformation activities that can be added to pipelines
either individually or chained with another activity.

| DATA TRANSFORMATION ACTIVITY | COMPUTE ENVIRONMENT |
| --- | --- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Spark | HDInsight [Hadoop] |
| Machine Learning activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure SQL Data Warehouse, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |

NOTE
You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See Invoke Spark
programs from Azure Data Factory for details. You can create a custom activity to run R scripts on your
HDInsight cluster with R installed. See Run R Script using Azure Data Factory.

For more information, see Data Transformation Activities article.


Custom .NET activities
If you need to move data to/from a data store that the Copy Activity doesn't support, or transform data
using your own logic, create a custom .NET activity. For details on creating and using a custom
activity, see Use custom activities in an Azure Data Factory pipeline.
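For orientation, a custom activity is defined in pipeline JSON like any other activity, with its type set to DotNetActivity. The following is only a minimal sketch; the dataset, linked service, assembly, and package names are placeholders, and the exact typeProperties for your activity may differ.

{
    "name": "MyCustomActivity",
    "type": "DotNetActivity",
    "inputs": [ { "name": "InputDataset" } ],
    "outputs": [ { "name": "OutputDataset" } ],
    "linkedServiceName": "AzureBatchLinkedService",
    "typeProperties": {
        "assemblyName": "MyDotNetActivity.dll",
        "entryPoint": "MyNamespace.MyDotNetActivity",
        "packageLinkedService": "AzureStorageLinkedService",
        "packageFile": "customactivitycontainer/MyDotNetActivity.zip"
    },
    "policy": {
        "concurrency": 1,
        "retry": 3
    }
}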

Schedule pipelines
A pipeline is active only between its start time and end time. It is not executed before the start time or
after the end time. If the pipeline is paused, it does not get executed irrespective of its start and end
time. For a pipeline to run, it should not be paused. See Scheduling and Execution to understand how
scheduling and execution works in Azure Data Factory.

Pipeline JSON
Let us take a closer look at how a pipeline is defined in JSON format. The generic structure for a
pipeline looks as follows:

{
"name": "PipelineName",
"properties":
{
"description" : "pipeline description",
"activities":
[

],
"start": "<start date-time>",
"end": "<end date-time>",
"isPaused": true/false,
"pipelineMode": "scheduled/onetime",
"expirationTime": "15.00:00:00",
"datasets":
[
]
}
}

| TAG | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| name | Name of the pipeline. Specify a name that represents the action that the pipeline performs. Maximum number of characters: 260. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ., +, ?, /, <, >, *, %, &, :, \ | Yes |
| description | Specify the text describing what the pipeline is used for. | Yes |
| activities | The activities section can have one or more activities defined within it. See the next section for details about the activities JSON element. | Yes |
| start | Start date-time for the pipeline. Must be in ISO format, for example: 2016-10-14T16:32:41Z. It is possible to specify a local time, for example an EST time: 2016-02-27T06:00:00-05:00, which is 6 AM EST. The start and end properties together specify the active period for the pipeline. Output slices are only produced within this active period. | No. If you specify a value for the end property, you must specify a value for the start property. The start and end times can both be empty to create a pipeline, but you must specify both values to set an active period for the pipeline to run. If you do not specify start and end times when creating a pipeline, you can set them later by using the Set-AzureRmDataFactoryPipelineActivePeriod cmdlet. |
| end | End date-time for the pipeline. If specified, it must be in ISO format, for example: 2016-10-14T17:32:41Z. It is possible to specify a local time, for example an EST time: 2016-02-27T06:00:00-05:00, which is 6 AM EST. To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property. A pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it does not get executed irrespective of its start and end time. For a pipeline to run, it should not be paused. See Scheduling and Execution to understand how scheduling and execution works in Azure Data Factory. | No. If you specify a value for the start property, you must specify a value for the end property. See the notes for the start property. |
| isPaused | If set to true, the pipeline does not run; it is in the paused state. Default value = false. You can use this property to enable or disable a pipeline. | No |
| pipelineMode | The method for scheduling runs for the pipeline. Allowed values are: scheduled (default), onetime. Scheduled indicates that the pipeline runs at a specified time interval according to its active period (start and end time). Onetime indicates that the pipeline runs only once. Onetime pipelines, once created, cannot currently be modified or updated. See Onetime pipeline for details about the onetime setting. | No |
| expirationTime | Duration of time after creation for which the one-time pipeline is valid and should remain provisioned. If it does not have any active, failed, or pending runs, the pipeline is automatically deleted once it reaches the expiration time. The default value: "expirationTime": "3.00:00:00" | No |
| datasets | List of datasets to be used by activities defined in the pipeline. This property can be used to define datasets that are specific to this pipeline and not defined within the data factory. Datasets defined within this pipeline can only be used by this pipeline and cannot be shared. See Scoped datasets for details. | No |

Activity JSON
The activities section can have one or more activities defined within it. Each activity has the following
top-level structure:
{
"name": "ActivityName",
"description": "description",
"type": "<ActivityType>",
"inputs": "[]",
"outputs": "[]",
"linkedServiceName": "MyLinkedService",
"typeProperties":
{

},
"policy":
{
},
"scheduler":
{
}
}

The following table describes the properties in the activity JSON definition:

| TAG | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| name | Name of the activity. Specify a name that represents the action that the activity performs. Maximum number of characters: 260. Must start with a letter, a number, or an underscore (_). The following characters are not allowed: ., +, ?, /, <, >, *, %, &, :, \ | Yes |
| description | Text describing what the activity is used for. | Yes |
| type | Type of the activity. See the Data Movement Activities and Data Transformation Activities sections for different types of activities. | Yes |
| inputs | Input tables used by the activity. One input table: "inputs": [ { "name": "inputtable1" } ]. Two input tables: "inputs": [ { "name": "inputtable1" }, { "name": "inputtable2" } ]. | Yes |
| outputs | Output tables used by the activity. One output table: "outputs": [ { "name": "outputtable1" } ]. Two output tables: "outputs": [ { "name": "outputtable1" }, { "name": "outputtable2" } ]. | Yes |
| linkedServiceName | Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment. | Yes for HDInsight Activity and Azure Machine Learning Batch Scoring Activity. No for all others. |
| typeProperties | Properties in the typeProperties section depend on the type of the activity. To see the type properties for an activity, click the links to the activity in the previous section. | No |
| policy | Policies that affect the run-time behavior of the activity. If it is not specified, default policies are used. | No |
| scheduler | The scheduler property is used to define the desired scheduling for the activity. Its subproperties are the same as the ones in the availability property in a dataset. | No |

Policies
Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed.
The following table provides the details.

| PROPERTY | PERMITTED VALUES | DEFAULT VALUE | DESCRIPTION |
| --- | --- | --- | --- |
| concurrency | Integer. Max value: 10 | 1 | Number of concurrent executions of the activity. It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, having a larger concurrency value speeds up the data processing. |
| executionPriorityOrder | NewestFirst, OldestFirst | OldestFirst | Determines the ordering of the data slices that are being processed. For example, suppose you have 2 slices (one happening at 4 PM, and another one at 5 PM), and both are pending execution. If you set executionPriorityOrder to NewestFirst, the slice at 5 PM is processed first. Similarly, if you set executionPriorityOrder to OldestFirst, the slice at 4 PM is processed first. |
| retry | Integer. Max value can be 10 | 0 | Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure. |
| timeout | TimeSpan | 00:00:00 | Timeout for the activity. Example: 00:10:00 (implies a timeout of 10 minutes). If a value is not specified or is 0, the timeout is infinite. If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When a timeout occurs, the status is set to TimedOut. |
| delay | TimeSpan | 00:00:00 | Specify the delay before data processing of the slice starts. The execution of the activity for a data slice is started after the delay is past the expected execution time. Example: 00:10:00 (implies a delay of 10 minutes). |
| longRetry | Integer. Max value: 10 | 1 | The number of long retry attempts before the slice execution is failed. longRetry attempts are spaced by longRetryInterval. So if you need to specify a time between retry attempts, use longRetry. If both retry and longRetry are specified, each longRetry attempt includes retry attempts, and the maximum number of attempts is retry * longRetry. For example, suppose the activity policy has the following settings: retry: 3, longRetry: 2, longRetryInterval: 01:00:00. Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially there would be 3 consecutive execution attempts. After each attempt, the slice status would be Retry. After the first 3 attempts are over, the slice status would be LongRetry. After an hour (that is, the longRetryInterval value), there would be another set of 3 consecutive execution attempts. After that, the slice status would be Failed and no more retries would be attempted. Hence, overall 6 attempts were made. If any execution succeeds, the slice status would be Ready and no more retries are attempted. longRetry may be used in situations where dependent data arrives at non-deterministic times, or the overall environment under which data processing occurs is flaky. In such cases, doing retries one after another may not help, and doing so after an interval of time results in the desired output. Word of caution: do not set high values for longRetry or longRetryInterval. Typically, higher values imply other systemic issues. |
| longRetryInterval | TimeSpan | 00:00:00 | The delay between long retry attempts. |
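Putting these properties together, an activity's policy section might look like the following sketch (the values are illustrative; omit any property to fall back to its default):

"policy": {
    "concurrency": 2,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "longRetry": 2,
    "longRetryInterval": "01:00:00",
    "timeout": "01:00:00",
    "delay": "00:05:00"
}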

Sample copy pipeline


In the following sample pipeline, there is one activity of type Copy in the activities section. In this
sample, the copy activity copies data from an Azure Blob storage to an Azure SQL database.
{
"name": "CopyPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-12T00:00:00Z",
"end": "2016-07-13T00:00:00Z"
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to Copy.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset. See
Datasets article for defining datasets in JSON.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified
as the sink type. In the Data movement activities section, click the data store that you want to use as
a source or a sink to learn more about moving data to/from that data store.
For a complete walkthrough of creating this pipeline, see Tutorial: Copy data from Blob Storage to SQL
Database.

Sample transformation pipeline


In the following sample pipeline, there is one activity of type HDInsightHive in the activities section.
In this sample, the HDInsight Hive activity transforms data from an Azure Blob storage by running a
Hive script file on an Azure HDInsight Hadoop cluster.
{
"name": "TransformPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2016-04-01T00:00:00Z",
"end": "2016-04-02T00:00:00Z",
"isPaused": false
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to HDInsightHive.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called AzureStorageLinkedService), and in script folder in the container
adfgetstarted.
The defines section is used to specify the runtime settings that are passed to the Hive script as
Hive configuration values (for example, ${hiveconf:inputtable} and ${hiveconf:partitionedtable} ).

The typeProperties section is different for each transformation activity. To learn about type properties
supported for a transformation activity, click the transformation activity in the Data transformation
activities table.
For a complete walkthrough of creating this pipeline, see Tutorial: Build your first pipeline to process
data using Hadoop cluster.

Multiple activities in a pipeline


The previous two sample pipelines have only one activity in them. You can have more than one activity
in a pipeline.
If you have multiple activities in a pipeline and output of an activity is not an input of another activity,
the activities may run in parallel if input data slices for the activities are ready.
You can chain two activities by having the output dataset of one activity as the input dataset of the
other activity. The second activity executes only when the first one completes successfully.

In this sample, the pipeline has two activities: Activity1 and Activity2. Activity1 takes Dataset1 as an
input and produces Dataset2 as an output. Activity2 takes Dataset2 as an input and produces Dataset3 as an
output. Because the output of Activity1 (Dataset2) is the input of Activity2, Activity2 runs only after
Activity1 completes successfully and produces the Dataset2 slice. If Activity1 fails for some
reason and does not produce the Dataset2 slice, Activity2 does not run for that slice (for example:
9 AM to 10 AM).
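As an illustrative sketch (the dataset names, activity types, and start/end values below are placeholders, not a prescribed configuration), chaining is expressed simply by reusing Dataset2 as the output of Activity1 and the input of Activity2:

{
    "name": "ChainedActivitiesPipeline",
    "properties": {
        "description": "Activity2 runs only after Activity1 produces Dataset2",
        "activities": [
            {
                "name": "Activity1",
                "type": "Copy",
                "inputs": [ { "name": "Dataset1" } ],
                "outputs": [ { "name": "Dataset2" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                }
            },
            {
                "name": "Activity2",
                "type": "Copy",
                "inputs": [ { "name": "Dataset2" } ],
                "outputs": [ { "name": "Dataset3" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ],
        "start": "2016-07-12T00:00:00Z",
        "end": "2016-07-13T00:00:00Z"
    }
}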
You can also chain activities that are in different pipelines.

In this sample, Pipeline1 has only one activity, which takes Dataset1 as an input and produces Dataset2 as
an output. Pipeline2 also has only one activity, which takes Dataset2 as an input and produces Dataset3 as an
output.
For more information, see scheduling and execution.

Create and monitor pipelines


You can create pipelines by using one of these tools or SDKs.
Copy Wizard.
Azure portal
Visual Studio
Azure PowerShell
Azure Resource Manager template
REST API
.NET API
See the following tutorials for step-by-step instructions for creating pipelines by using one of these
tools or SDKs.
Build a pipeline with a data transformation activity
Build a pipeline with a data movement activity
Once a pipeline is created/deployed, you can manage and monitor your pipelines by using the Azure
portal blades or Monitor and Manage App. See the following topics for step-by-step instructions.
Monitor and manage pipelines by using Azure portal blades.
Monitor and manage pipelines by using Monitor and Manage App
Onetime pipeline
You can create and schedule a pipeline to run periodically (for example: hourly or daily) within the start
and end times you specify in the pipeline definition. See Scheduling activities for details. You can also
create a pipeline that runs only once. To do so, you set the pipelineMode property in the pipeline
definition to onetime as shown in the following JSON sample. The default value for this property is
scheduled.

{
"name": "CopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"name": "CopyActivity-0"
}
],
"pipelineMode": "OneTime"
}
}

Note the following:


Start and end times for the pipeline are not specified.
Availability of input and output datasets is specified (frequency and interval), even though Data
Factory does not use the values.
Diagram view does not show one-time pipelines. This behavior is by design.
One-time pipelines cannot be updated. You can clone a one-time pipeline, rename it, update
properties, and deploy it to create another one.

Next Steps
For more information about datasets, see Create datasets article.
For more information about how pipelines are scheduled and executed, see Scheduling and
execution in Azure Data Factory article.
Datasets in Azure Data Factory
8/8/2017 15 min to read

This article describes what datasets are, how they are defined in JSON format, and how they are
used in Azure Data Factory pipelines. It provides details about each section (for example, structure,
availability, and policy) in the dataset JSON definition. The article also provides examples for using
the offset, anchorDateTime, and style properties in a dataset JSON definition.

NOTE
If you are new to Data Factory, see Introduction to Azure Data Factory for an overview. If you do not have
hands-on experience with creating data factories, you can gain a better understanding by reading the data
transformation tutorial and the data movement tutorial.

Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that
together perform a task. The activities in a pipeline define actions to perform on your data. For
example, you might use a copy activity to copy data from an on-premises SQL Server to Azure Blob
storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to
process data from Blob storage to produce output data. Finally, you might use a second copy
activity to copy the output data to Azure SQL Data Warehouse, on top of which business
intelligence (BI) reporting solutions are built. For more information about pipelines and activities,
see Pipelines and activities in Azure Data Factory.
An activity can take zero or more input datasets, and produce one or more output datasets. An
input dataset represents the input for an activity in the pipeline, and an output dataset represents
the output for the activity. Datasets identify data within different data stores, such as tables, files,
folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder
in Blob storage from which the pipeline should read the data.
Before you create a dataset, create a linked service to link your data store to the data factory.
Linked services are much like connection strings, which define the connection information needed
for Data Factory to connect to external resources. Datasets identify data within the linked data
stores, such as SQL tables, files, folders, and documents. For example, an Azure Storage linked
service links a storage account to the data factory. An Azure Blob dataset represents the blob
container and the folder that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL database, you create two linked
services: Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset
(which refers to the Azure Storage linked service) and Azure SQL Table dataset (which refers to the
Azure SQL Database linked service). The Azure Storage and Azure SQL Database linked services
contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and
Azure SQL Database, respectively. The Azure Blob dataset specifies the blob container and blob
folder that contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the
SQL table in your SQL database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service
in Data Factory:
Dataset JSON
A dataset in Data Factory is defined in JSON format as follows:

{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: AzureBlob, AzureSql etc...>",
"external": <boolean flag to indicate external data. only for input datasets>,
"linkedServiceName": "<Name of the linked service that refers to a data store.>",
"structure": [
{
"name": "<Name of the column>",
"type": "<Name of the type>"
}
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
},
"availability": {
"frequency": "<Specifies the time unit for data slice production. Supported
frequency: Minute, Hour, Day, Week, Month>",
"interval": "<Specifies the interval within the defined frequency. For example,
frequency set to 'Hour' and interval set to 1 indicates that new data slices should be produced
hourly>"
},
"policy":
{
}
}
}

The following table describes properties in the above JSON:

| PROPERTY | DESCRIPTION | REQUIRED | DEFAULT |
| --- | --- | --- | --- |
| name | Name of the dataset. See Azure Data Factory - Naming rules for naming rules. | Yes | NA |
| type | Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). For details, see Dataset type. | Yes | NA |
| structure | Schema of the dataset. For details, see Dataset structure. | No | NA |
| typeProperties | The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see Dataset type. | Yes | NA |
| external | Boolean flag to specify whether a dataset is explicitly produced by a data factory pipeline or not. If the input dataset for an activity is not produced by the current pipeline, set this flag to true. Set this flag to true for the input dataset of the first activity in the pipeline. | No | false |
| availability | Defines the processing window (for example, hourly or daily) or the slicing model for the dataset production. Each unit of data consumed and produced by an activity run is called a data slice. If the availability of an output dataset is set to daily (frequency - Day, interval - 1), a slice is produced daily. For details, see Dataset availability. For details on the dataset slicing model, see the Scheduling and execution article. | Yes | NA |
| policy | Defines the criteria or the condition that the dataset slices must fulfill. For details, see the Dataset policy section. | No | NA |

Dataset example
In the following example, the dataset represents a table named MyTable in a SQL database.

{
"name": "DatasetSample",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties":
{
"tableName": "MyTable"
},
"availability":
{
"frequency": "Day",
"interval": 1
}
}
}

Note the following points:


type is set to AzureSqlTable.
tableName type property (specific to AzureSqlTable type) is set to MyTable.
linkedServiceName refers to a linked service of type AzureSqlDatabase, which is defined in
the next JSON snippet.
availability frequency is set to Day, and interval is set to 1. This means that the dataset slice
is produced daily.
AzureSqlLinkedService is defined as follows:

{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial
Catalog=<databasename>;User ID=<username>@<servername>;Password=<password>;Integrated
Security=False;Encrypt=True;Connect Timeout=30"
}
}
}

In the preceding JSON snippet:


type is set to AzureSqlDatabase.
connectionString type property specifies information to connect to a SQL database.
As you can see, the linked service defines how to connect to a SQL database. The dataset defines
what table is used as an input and output for the activity in a pipeline.

IMPORTANT
Unless a dataset is being produced by the pipeline, it should be marked as external. This setting generally
applies to the inputs of the first activity in a pipeline.

Dataset type
The type of the dataset depends on the data store you use. See the following table for a list of data
stores supported by Data Factory. Click a data store to learn how to create a linked service and a
dataset for that data store.

| CATEGORY | DATA STORES |
| --- | --- |
| Azure | Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage |
| Databases | Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata* |
| NoSQL | Cassandra*, MongoDB* |
| File | Amazon S3, File System*, FTP, HDFS*, SFTP |
| Others | Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian* |

NOTE
Data stores with * can be on-premises or on Azure infrastructure as a service (IaaS). These data stores
require you to install Data Management Gateway.

In the example in the previous section, the type of the dataset is set to AzureSqlTable. Similarly,
for an Azure Blob dataset, the type of the dataset is set to AzureBlob, as shown in the following
JSON:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}

Dataset structure
The structure section is optional. It defines the schema of the dataset as a collection of column
names and data types. You use the structure section to provide type information that is
used to convert types and map columns from the source to the destination. In the following
example, the dataset has three columns: slicetimestamp , projectname , and pageviews . They are of
type String, String, and Decimal, respectively.

structure:
[
{ "name": "slicetimestamp", "type": "String"},
{ "name": "projectname", "type": "String"},
{ "name": "pageviews", "type": "Decimal"}
]

Each column in the structure contains the following properties:

| PROPERTY | DESCRIPTION | REQUIRED |
| --- | --- | --- |
| name | Name of the column. | Yes |
| type | Data type of the column. | No |
| culture | .NET-based culture to be used when the type is a .NET type: Datetime or Datetimeoffset. The default is en-us. | No |
| format | Format string to be used when the type is a .NET type: Datetime or Datetimeoffset. | No |
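For example, a date column can carry the optional format and culture properties described above, as in this sketch (the column name and format string are illustrative):

structure:
[
    { "name": "slicetimestamp", "type": "String" },
    { "name": "lastlogindate", "type": "Datetime", "format": "yyyy-MM-dd", "culture": "en-us" }
]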

The following guidelines help you determine when to include structure information, and what to
include in the structure section.
For structured data sources, specify the structure section only if you want to map source
columns to sink columns, and their names are not the same. This kind of structured data
source stores data schema and type information along with the data itself. Examples of
structured data sources include SQL Server, Oracle, and Azure table.
As type information is already available for structured data sources, you should not include
type information when you do include the structure section.
For schema on read data sources (specifically Blob storage), you can choose to store
data without storing any schema or type information with the data. For these types of data
sources, include structure when you want to map source columns to sink columns. Also
include structure when the dataset is an input for a copy activity, and data types of source
dataset should be converted to native types for the sink.
Data Factory supports the following values for providing type information in structure:
Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime,
Datetimeoffset, and Timespan. These values are Common Language Specification (CLS)-
compliant, .NET-based type values.
Data Factory automatically performs type conversions when moving data from a source data store
to a sink data store.

Dataset availability
The availability section in a dataset defines the processing window (for example, hourly, daily, or
weekly) for the dataset. For more information about activity windows, see Scheduling and
execution.
The following availability section specifies that the output dataset is either produced hourly, or the
input dataset is available hourly:

"availability":
{
"frequency": "Hour",
"interval": 1
}

If the pipeline has the following start and end times:

"start": "2016-08-25T00:00:00Z",
"end": "2016-08-25T05:00:00Z",

The output dataset is produced hourly within the pipeline start and end times. Therefore, there are
five dataset slices produced by this pipeline, one for each activity window (12 AM - 1 AM, 1 AM - 2
AM, 2 AM - 3 AM, 3 AM - 4 AM, 4 AM - 5 AM).
The following table describes properties you can use in the availability section:

| PROPERTY | DESCRIPTION | REQUIRED | DEFAULT |
| --- | --- | --- | --- |
| frequency | Specifies the time unit for dataset slice production. Supported frequency: Minute, Hour, Day, Week, Month. | Yes | NA |
| interval | Specifies a multiplier for frequency. "Frequency x interval" determines how often the slice is produced. For example, if you need the dataset to be sliced on an hourly basis, you set frequency to Hour and interval to 1. Note that if you specify frequency as Minute, you should set the interval to no less than 15. | Yes | NA |
| style | Specifies whether the slice should be produced at the start or end of the interval: StartOfInterval or EndOfInterval. If frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of the month. If style is set to StartOfInterval, the slice is produced on the first day of the month. If frequency is set to Day and style is set to EndOfInterval, the slice is produced in the last hour of the day. If frequency is set to Hour and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for the 1 PM - 2 PM period, the slice is produced at 2 PM. | No | EndOfInterval |
| anchorDateTime | Defines the absolute position in time used by the scheduler to compute dataset slice boundaries. Note that if this property has date parts that are more granular than the specified frequency, the more granular parts are ignored. For example, if the interval is hourly (frequency: hour and interval: 1) and the anchorDateTime contains minutes and seconds, the minutes and seconds parts of anchorDateTime are ignored. | No | 01/01/0001 |
| offset | Timespan by which the start and end of all dataset slices are shifted. Note that if both anchorDateTime and offset are specified, the result is the combined shift. | No | NA |

offset example
By default, daily ( "frequency": "Day", "interval": 1 ) slices start at 12 AM (midnight) Coordinated
Universal Time (UTC). If you want the start time to be 6 AM UTC time instead, set the offset as
shown in the following snippet:

"availability":
{
"frequency": "Day",
"interval": 1,
"offset": "06:00:00"
}

anchorDateTime example
In the following example, the dataset is produced once every 23 hours. The first slice starts at the
time specified by anchorDateTime, which is set to 2017-04-19T08:00:00 (UTC).
"availability":
{
"frequency": "Hour",
"interval": 23,
"anchorDateTime":"2017-04-19T08:00:00"
}

offset/style example
The following dataset is monthly, and is produced on the 3rd of every month at 8:00 AM (
3.08:00:00 ):

"availability": {
"frequency": "Month",
"interval": 1,
"offset": "3.08:00:00",
"style": "StartOfInterval"
}

Dataset policy
The policy section in the dataset definition defines the criteria or the condition that the dataset
slices must fulfill.
Validation policies
| POLICY NAME | DESCRIPTION | APPLIED TO | REQUIRED | DEFAULT |
| --- | --- | --- | --- | --- |
| minimumSizeMB | Validates that the data in Azure Blob storage meets the minimum size requirements (in megabytes). | Azure Blob storage | No | NA |
| minimumRows | Validates that the data in an Azure SQL database or an Azure table contains the minimum number of rows. | Azure SQL database or Azure table | No | NA |

Examples
minimumSizeMB:

"policy":

{
"validation":
{
"minimumSizeMB": 10.0
}
}

minimumRows:
"policy":
{
"validation":
{
"minimumRows": 100
}
}

External datasets
External datasets are the ones that are not produced by a running pipeline in the data factory. If the
dataset is marked as external, the ExternalData policy may be defined to influence the behavior
of the dataset slice availability.
Unless a dataset is being produced by Data Factory, it should be marked as external. This setting
generally applies to the inputs of first activity in a pipeline, unless activity or pipeline chaining is
being used.

| NAME | DESCRIPTION | REQUIRED | DEFAULT VALUE |
| --- | --- | --- | --- |
| dataDelay | The time to delay the check on the availability of the external data for the given slice. For example, you can delay an hourly check by using this setting. The setting only applies to the present time. For example, if it is 1:00 PM right now and this value is 10 minutes, the validation starts at 1:10 PM. Note that this setting does not affect slices in the past. Slices with Slice End Time + dataDelay < Now are processed without any delay. Times greater than 23:59 hours should be specified by using the day.hours:minutes:seconds format. For example, to specify 24 hours, don't use 24:00:00. Instead, use 1.00:00:00. If you use 24:00:00, it is treated as 24 days (24.00:00:00). For 1 day and 4 hours, specify 1.04:00:00. | No | 0 |
| retryInterval | The wait time between a failure and the next retry attempt. This setting applies to the present time. If the previous try failed, the next try is after the retryInterval period. If it is 1:00 PM right now, we begin the first try. If the duration to complete the first validation check is 1 minute and the operation failed, the next retry is at 1:00 + 1 min (duration) + 1 min (retry interval) = 1:02 PM. For slices in the past, there is no delay. The retry happens immediately. | No | 00:01:00 (1 minute) |
| retryTimeout | The timeout for each retry attempt. If this property is set to 10 minutes, the validation should be completed within 10 minutes. If it takes longer than 10 minutes to perform the validation, the retry times out. If all attempts for the validation time out, the slice is marked as TimedOut. | No | 00:10:00 (10 minutes) |
| maximumRetry | The number of times to check for the availability of the external data. The maximum allowed value is 10. | No | 3 |
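As a sketch of how these settings are typically declared (assuming, as in common Data Factory samples, that they are nested under an externalData element in the dataset's policy section; the dataset and linked service names are placeholders):

{
    "name": "ExternalBlobInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "adfgetstarted/inputdata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true,
        "policy": {
            "externalData": {
                "dataDelay": "00:10:00",
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}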

Create datasets
You can create datasets by using one of these tools or SDKs:
Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API
See the following tutorials for step-by-step instructions for creating pipelines and datasets by
using one of these tools or SDKs:
Build a pipeline with a data transformation activity
Build a pipeline with a data movement activity
After a pipeline is created and deployed, you can manage and monitor your pipelines by using the
Azure portal blades, or the Monitoring and Management app. See the following topics for step-by-
step instructions:
Monitor and manage pipelines by using Azure portal blades
Monitor and manage pipelines by using the Monitoring and Management app

Scoped datasets
You can create datasets that are scoped to a pipeline by using the datasets property. These
datasets can only be used by activities within this pipeline, not by activities in other pipelines. The
following example defines a pipeline with two datasets (InputDataset-rdc and OutputDataset-rdc)
to be used within the pipeline.

IMPORTANT
Scoped datasets are supported only with one-time pipelines (where pipelineMode is set to OneTime).
See Onetime pipeline for details.

{
"name": "CopyPipeline-rdc",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset-rdc"
}
],
"outputs": [
{
"name": "OutputDataset-rdc"
}
],
"scheduler": {
"frequency": "Day",
"interval": 1,
"style": "StartOfInterval"
},
"name": "CopyActivity-0"
"name": "CopyActivity-0"
}
],
"start": "2016-02-28T00:00:00Z",
"end": "2016-02-28T00:00:00Z",
"isPaused": false,
"pipelineMode": "OneTime",
"expirationTime": "15.00:00:00",
"datasets": [
{
"name": "InputDataset-rdc",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "InputLinkedService-rdc",
"typeProperties": {
"fileName": "emp.txt",
"folderPath": "adftutorial/input",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
},
{
"name": "OutputDataset-rdc",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "OutputLinkedService-rdc",
"typeProperties": {
"fileName": "emp.txt",
"folderPath": "adftutorial/output",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
]
}
}

Next steps
For more information about pipelines, see Create pipelines.
For more information about how pipelines are scheduled and executed, see Scheduling and
execution in Azure Data Factory.
Data Factory scheduling and execution
7/10/2017 22 min to read

This article explains the scheduling and execution aspects of the Azure Data Factory application model. This
article assumes that you understand the basics of Data Factory application model concepts, including activities,
pipelines, linked services, and datasets. For basic concepts of Azure Data Factory, see the following articles:
Introduction to Data Factory
Pipelines
Datasets

Start and end times of pipeline


A pipeline is active only between its start time and end time. It is not executed before the start time or after the
end time. If the pipeline is paused, it is not executed irrespective of its start and end time. For a pipeline to run,
it should not be paused. You find these settings (start, end, paused) in the pipeline definition:

"start": "2017-04-01T08:00:00Z",
"end": "2017-04-01T11:00:00Z"
"isPaused": false

For more information about these properties, see the Create pipelines article.

Specify schedule for an activity


It is not the pipeline that is executed. It is the activities in the pipeline that are executed in the overall context of
the pipeline. You can specify a recurring schedule for an activity by using the scheduler section of activity
JSON. For example, you can schedule an activity to run hourly as follows:

"scheduler": {
"frequency": "Hour",
"interval": 1
},

As shown in the following diagram, specifying a schedule for an activity creates a series of tumbling windows within the pipeline start and end times. Tumbling windows are a series of fixed-size, non-overlapping, contiguous time intervals. These logical tumbling windows for an activity are called activity windows.
The scheduler property for an activity is optional. If you do specify this property, it must match the cadence
you specify in the definition of output dataset for the activity. Currently, output dataset is what drives the
schedule. Therefore, you must create an output dataset even if the activity does not produce any output.
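
For example, an activity that is scheduled hourly must be paired with an output dataset that is also available hourly. A minimal sketch of the two matching sections (everything else in the activity and dataset definitions is omitted):

Activity, in the pipeline JSON:

"scheduler": {
    "frequency": "Hour",
    "interval": 1
}

Output dataset:

"availability": {
    "frequency": "Hour",
    "interval": 1
}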

Specify schedule for a dataset


An activity in a Data Factory pipeline can take zero or more input datasets and produce one or more output
datasets. For an activity, you can specify the cadence at which the input data is available or the output data is
produced by using the availability section in the dataset definitions.
Frequency in the availability section specifies the time unit. The allowed values for frequency are: Minute,
Hour, Day, Week, and Month. The interval property in the availability section specifies a multiplier for
frequency. For example: if the frequency is set to Day and interval is set to 1 for an output dataset, the output
data is produced daily. If you specify the frequency as minute, we recommend that you set the interval to no
less than 15.
In the following example, the input data is available hourly and the output data is produced hourly (
"frequency": "Hour", "interval": 1 ).

Input dataset:

{
"name": "AzureSqlInput",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
Output dataset

{
"name": "AzureBlobOutput",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mypath/{Year}/{Month}/{Day}/{Hour}",
"format": {
"type": "TextFormat"
},
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" }
},
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" }}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

Currently, output dataset drives the schedule. In other words, the schedule specified for the output dataset
is used to run an activity at runtime. Therefore, you must create an output dataset even if the activity does not
produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
In the following pipeline definition, the scheduler property is used to specify schedule for the activity. This
property is optional. Currently, the schedule for the activity must match the schedule specified for the output
dataset.
{
"name": "SamplePipeline",
"properties": {
"description": "copy activity",
"activities": [
{
"type": "Copy",
"name": "AzureSQLtoBlob",
"description": "copy activity",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 100000,
"writeBatchTimeout": "00:05:00"
}
},
"inputs": [
{
"name": "AzureSQLInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 1
}
}
],
"start": "2017-04-01T08:00:00Z",
"end": "2017-04-01T11:00:00Z"
}
}

In this example, the activity runs hourly between the start and end times of the pipeline. The output data is produced hourly for three one-hour windows (8 AM - 9 AM, 9 AM - 10 AM, and 10 AM - 11 AM).
Each unit of data consumed or produced by an activity run is called a data slice. The following diagram shows
an example of an activity with one input dataset and one output dataset:
The diagram shows the hourly data slices for the input and output dataset. The diagram shows three input
slices that are ready for processing. The 10-11 AM activity is in progress, producing the 10-11 AM output slice.
You can access the time interval associated with the current slice in the dataset JSON by using variables:
SliceStart and SliceEnd. Similarly, you can access the time interval associated with an activity window by using the WindowStart and WindowEnd variables. The schedule of an activity must match the schedule of the output dataset for the activity. Therefore, the SliceStart and SliceEnd values are the same as the WindowStart and WindowEnd
values respectively. For more information on these variables, see Data Factory functions and system variables
articles.
You can use these variables for different purposes in your activity JSON. For example, you can use them to
select data from input and output datasets representing time series data (for example: 8 AM to 9 AM). This
example also uses WindowStart and WindowEnd to select relevant data for an activity run and copy it to a
blob with the appropriate folderPath. The folderPath is parameterized to have a separate folder for every
hour.
In the preceding example, the schedule specified for input and output datasets is the same (hourly). If the input
dataset for the activity is available at a different frequency, say every 15 minutes, the activity that produces this
output dataset still runs once an hour as the output dataset is what drives the activity schedule. For more
information, see Model datasets with different frequencies.
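
As a quick sketch of that case, only the availability sections of the two datasets differ; the activity still runs on the hourly cadence of the output dataset:

Input dataset (available every 15 minutes):

"availability": {
    "frequency": "Minute",
    "interval": 15
}

Output dataset (produced hourly, which drives the activity schedule):

"availability": {
    "frequency": "Hour",
    "interval": 1
}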

Dataset availability and policies


You have seen the usage of frequency and interval properties in the availability section of dataset definition.
There are a few other properties that affect the scheduling and execution of an activity.
Dataset availability
The following table describes properties you can use in the availability section:

frequency
Required: Yes. Default: NA.
Specifies the time unit for dataset slice production. Supported frequency: Minute, Hour, Day, Week, and Month.

interval
Required: Yes. Default: NA.
Specifies a multiplier for frequency. Frequency x interval determines how often the slice is produced. For example, if you need the dataset to be sliced on an hourly basis, you set frequency to Hour and interval to 1. Note: If you specify frequency as Minute, we recommend that you set the interval to no less than 15.

style
Required: No. Default: EndOfInterval.
Specifies whether the slice should be produced at the start or end of the interval (StartOfInterval or EndOfInterval). If frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of the month; if style is set to StartOfInterval, the slice is produced on the first day of the month. If frequency is set to Day and style is set to EndOfInterval, the slice is produced in the last hour of the day. If frequency is set to Hour and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for the 1 PM - 2 PM period, the slice is produced at 2 PM.

anchorDateTime
Required: No. Default: 01/01/0001.
Defines the absolute position in time used by the scheduler to compute dataset slice boundaries. Note: If the anchorDateTime has date parts that are more granular than the frequency, the more granular parts are ignored. For example, if the interval is hourly (frequency: Hour and interval: 1) and the anchorDateTime contains minutes and seconds, the minutes and seconds parts of the anchorDateTime are ignored.

offset
Required: No. Default: NA.
Timespan by which the start and end of all dataset slices are shifted. Note: If both anchorDateTime and offset are specified, the result is the combined shift.

offset example
By default, daily ( "frequency": "Day", "interval": 1 ) slices start at 12 AM UTC time (midnight). If you want the
start time to be 6 AM UTC time instead, set the offset as shown in the following snippet:

"availability":
{
"frequency": "Day",
"interval": 1,
"offset": "06:00:00"
}

anchorDateTime example
In the following example, the dataset is produced once every 23 hours. The first slice starts at the time specified
by the anchorDateTime, which is set to 2017-04-19T08:00:00 (UTC time).

"availability":
{
"frequency": "Hour",
"interval": 23,
"anchorDateTime":"2017-04-19T08:00:00"
}

offset/style Example
The following dataset is a monthly dataset and is produced on the 3rd of every month at 8:00 AM ( 3.08:00:00 ):
"availability": {
"frequency": "Month",
"interval": 1,
"offset": "3.08:00:00",
"style": "StartOfInterval"
}
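
If you need both an anchor and an offset, you can combine them; as noted in the table, the result is the combined shift. A sketch with illustrative values:

"availability":
{
    "frequency": "Hour",
    "interval": 23,
    "anchorDateTime": "2017-04-19T08:00:00",
    "offset": "00:30:00"
}

With this setting, slice boundaries are computed from the anchor and then shifted by an additional 30 minutes.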

Dataset policy
A dataset can have a validation policy defined that specifies how the data generated by a slice execution can be
validated before it is ready for consumption. In such cases, after the slice has finished execution, the output slice
status is changed to Waiting with a substatus of Validation. After the slices are validated, the slice status
changes to Ready. If a data slice has been produced but did not pass the validation, activity runs for
downstream slices that depend on this slice are not processed. Monitor and manage pipelines covers the
various states of data slices in Data Factory.
The policy section in dataset definition defines the criteria or the condition that the dataset slices must fulfill.
The following table describes properties you can use in the policy section:

minimumSizeMB
Applied to: Azure Blob. Required: No. Default: NA.
Validates that the data in an Azure blob meets the minimum size requirements (in megabytes).

minimumRows
Applied to: Azure SQL Database, Azure Table. Required: No. Default: NA.
Validates that the data in an Azure SQL database or an Azure table contains the minimum number of rows.

Examples
minimumSizeMB:

"policy":

{
"validation":
{
"minimumSizeMB": 10.0
}
}

minimumRows

"policy":
{
"validation":
{
"minimumRows": 100
}
}
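
If a dataset is both external and validated, the two kinds of settings can sit side by side in the same policy section. The following is a sketch, under the assumption that the validation rules shown above can be combined with the external-data retry settings described earlier in this document:

"external": true,
"policy":
{
    "validation":
    {
        "minimumRows": 100
    },
    "externalData":
    {
        "maximumRetry": 3
    }
}
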
For more information about these properties and examples, see Create datasets article.

Activity policies
Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The
following table provides the details.

concurrency
Permitted values: Integer. Max value: 10. Default value: 1.
Number of concurrent executions of the activity. It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, having a larger concurrency value speeds up the data processing.

executionPriorityOrder
Permitted values: NewestFirst, OldestFirst. Default value: OldestFirst.
Determines the ordering of data slices that are being processed. For example, if you have two slices (one for 4 PM and another for 5 PM) and both are pending execution: if you set executionPriorityOrder to NewestFirst, the 5 PM slice is processed first; if you set it to OldestFirst, the 4 PM slice is processed first.

retry
Permitted values: Integer. Max value: 10. Default value: 0.
Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure.

timeout
Permitted values: TimeSpan. Default value: 00:00:00.
Timeout for the activity. Example: 00:10:00 (implies a timeout of 10 minutes). If a value is not specified or is 0, the timeout is infinite. If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When timeout occurs, the status is set to TimedOut.

delay
Permitted values: TimeSpan. Default value: 00:00:00.
Specifies the delay before data processing of the slice starts. The execution of the activity for a data slice starts after the delay is past the expected execution time. Example: 00:10:00 (implies a delay of 10 minutes).

longRetry
Permitted values: Integer. Max value: 10. Default value: 1.
The number of long retry attempts before the slice execution is failed. longRetry attempts are spaced by longRetryInterval. So if you need to specify a time between retry attempts, use longRetry. If both retry and longRetry are specified, each longRetry attempt includes retry attempts and the maximum number of attempts is retry * longRetry.
For example, suppose the activity policy has the following settings: retry: 3, longRetry: 2, longRetryInterval: 01:00:00. Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially there would be 3 consecutive execution attempts. After each attempt, the slice status would be Retry. After the first 3 attempts are over, the slice status would be LongRetry. After an hour (that is, the longRetryInterval value), there would be another set of 3 consecutive execution attempts. After that, the slice status would be Failed and no more retries would be attempted. Hence, overall 6 attempts were made. If any execution succeeds, the slice status would be Ready and no more retries are attempted.
longRetry may be used in situations where dependent data arrives at non-deterministic times or the overall environment under which data processing occurs is flaky. In such cases, doing retries one after another may not help, and doing so after an interval of time produces the desired output. Word of caution: do not set high values for longRetry or longRetryInterval. Typically, higher values imply other systemic issues.

longRetryInterval
Permitted values: TimeSpan. Default value: 00:00:00.
The delay between long retry attempts.
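
Putting several of these properties together, a policy section for an activity might look like the following sketch; the specific values are illustrative, not recommendations:

"policy": {
    "concurrency": 2,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "timeout": "01:00:00",
    "delay": "00:05:00",
    "longRetry": 2,
    "longRetryInterval": "01:00:00"
}

With these values, each slice is attempted up to three times in quick succession, and that set of attempts is repeated once more after an hour before the slice is marked as Failed.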

For more information, see Pipelines article.

Parallel processing of data slices


You can set the start date for the pipeline in the past. When you do so, Data Factory automatically calculates
(back fills) all data slices in the past and begins processing them. For example, suppose you create a pipeline with a start date of 2017-04-01 and the current date is 2017-04-10. If the cadence of the output dataset is daily, Data Factory starts processing all the slices from 2017-04-01 to 2017-04-09 immediately because the start date is in
the past. The slice from 2017-04-10 is not processed yet because the value of style property in the availability
section is EndOfInterval by default. The oldest slice is processed first as the default value of
executionPriorityOrder is OldestFirst. For a description of the style property, see dataset availability section. For
a description of the executionPriorityOrder section, see the activity policies section.
You can configure back-filled data slices to be processed in parallel by setting the concurrency property in the
policy section of the activity JSON. This property determines the number of parallel activity executions that can
happen on different slices. The default value for the concurrency property is 1. Therefore, one slice is processed
at a time by default. The maximum value is 10. When a pipeline needs to go through a large set of available
data, having a larger concurrency value speeds up the data processing.
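
For example, to process backfilled slices five at a time, oldest slices first, you might set the activity policy as follows (a sketch; 5 is an arbitrary value within the maximum of 10):

"policy": {
    "concurrency": 5,
    "executionPriorityOrder": "OldestFirst"
}
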
Rerun a failed data slice
When an error occurs while processing a data slice, you can find out why the processing of a slice failed by
using Azure portal blades or Monitor and Manage App. See Monitoring and managing pipelines using Azure
portal blades or Monitoring and Management app for details.
Consider the following example, which shows two activities: Activity1 and Activity2. Activity1 consumes a slice
of Dataset1 and produces a slice of Dataset2, which is consumed as an input by Activity2 to produce a slice of
the Final Dataset.

The diagram shows that out of three recent slices, there was a failure producing the 9-10 AM slice for Dataset2.
Data Factory automatically tracks dependency for the time series dataset. As a result, it does not start the
activity run for the 9-10 AM downstream slice.
Data Factory monitoring and management tools allow you to drill into the diagnostic logs for the failed slice to
easily find the root cause for the issue and fix it. After you have fixed the issue, you can easily start the activity
run to produce the failed slice. For more information on how to rerun and understand state transitions for data
slices, see Monitoring and managing pipelines using Azure portal blades or Monitoring and Management app.
After you rerun the 9-10 AM slice for Dataset2, Data Factory starts the run for the 9-10 AM dependent slice on
the final dataset.
Multiple activities in a pipeline
You can have more than one activity in a pipeline. If you have multiple activities in a pipeline and the output of
an activity is not an input of another activity, the activities may run in parallel if input data slices for the
activities are ready.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. The activities can be in the same pipeline or in different pipelines. The second
activity executes only when the first one finishes successfully.
For example, consider the following case where a pipeline has two activities:
1. Activity A1 that requires external input dataset D1, and produces output dataset D2.
2. Activity A2 that requires input from dataset D2, and produces output dataset D3.
In this scenario, activities A1 and A2 are in the same pipeline. The activity A1 runs when the external data is
available and the scheduled availability frequency is reached. The activity A2 runs when the scheduled slices
from D2 become available and the scheduled availability frequency is reached. If there is an error in one of the
slices in dataset D2, A2 does not run for that slice until it becomes available.
The Diagram view with both activities in the same pipeline would look like the following diagram:

As mentioned earlier, the activities could be in different pipelines. In such a scenario, the diagram view would
look like the following diagram:

See the copy sequentially section in the appendix for an example.

Model datasets with different frequencies


In the samples, the frequencies for input and output datasets and the activity schedule window were the same.
Some scenarios require the ability to produce output at a frequency different than the frequencies of one or
more inputs. Data Factory supports modeling these scenarios.
Sample 1: Produce a daily output report for input data that is available every hour
Consider a scenario in which you have input measurement data from sensors available every hour in Azure
Blob storage. You want to produce a daily aggregate report with statistics such as mean, maximum, and
minimum for the day by using the Data Factory Hive activity.
Here is how you can model this scenario with Data Factory:
Input dataset
The hourly input files are dropped in the folder for the given day. Availability for input is set at Hour
(frequency: Hour, interval: 1).
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

Output dataset
One output file is created every day in the day's folder. Availability of output is set at Day (frequency: Day and
interval: 1).

{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}

Activity: hive activity in a pipeline


The hive script receives the appropriate DateTime information as parameters that use the WindowStart
variable as shown in the following snippet. The hive script uses this variable to load the data from the correct
folder for the day and run the aggregation to generate the output.
{
"name":"SamplePipeline",
"properties":{
"start":"2015-01-01T08:00:00",
"end":"2015-01-01T11:00:00",
"description":"hive activity",
"activities": [
{
"name": "SampleHiveActivity",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"linkedServiceName": "HDInsightLinkedService",
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adftutorial\\hivequery.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"Year": "$$Text.Format('{0:yyyy}',WindowStart)",
"Month": "$$Text.Format('{0:MM}',WindowStart)",
"Day": "$$Text.Format('{0:dd}',WindowStart)"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}

The following diagram shows the scenario from a data-dependency point of view.
The output slice for every day depends on 24 hourly slices from an input dataset. Data Factory computes these
dependencies automatically by figuring out the input data slices that fall in the same time period as the output
slice to be produced. If any of the 24 input slices is not available, Data Factory waits for the input slice to be
ready before starting the daily activity run.
Sample 2: Specify dependency with expressions and Data Factory functions
Let's consider another scenario. Suppose you have a hive activity that processes two input datasets. One of
them has new data daily, but one of them gets new data every week. Suppose you wanted to do a join across
the two inputs and produce an output every day.
The simple approach in which Data Factory automatically figures out the right input slices to process by
aligning to the output data slices time period does not work.
You must specify that for every activity run, Data Factory should use the previous week's data slice for the weekly
input dataset. You use Azure Data Factory functions as shown in the following snippet to implement this
behavior.
Input1: Azure blob
The first input is the Azure blob being updated daily.
{
"name": "AzureBlobInputDaily",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
}
}
}

Input2: Azure blob


Input2 is the Azure blob being updated weekly.

{
"name": "AzureBlobInputWeekly",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 7
}
}
}

Output: Azure blob


One output file is created every day in the folder for the day. Availability of output is set to day (frequency: Day,
interval: 1).
{
"name": "AzureBlobOutputDaily",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/",
"partitionedBy": [
{ "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}},
{ "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}},
{ "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}}
],
"format": {
"type": "TextFormat"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}

Activity: hive activity in a pipeline


The hive activity takes the two inputs and produces an output slice every day. You can specify every day's output slice to depend on the previous week's input slice for the weekly input as follows.
{
"name":"SamplePipeline",
"properties":{
"start":"2015-01-01T08:00:00",
"end":"2015-01-01T11:00:00",
"description":"hive activity",
"activities": [
{
"name": "SampleHiveActivity",
"inputs": [
{
"name": "AzureBlobInputDaily"
},
{
"name": "AzureBlobInputWeekly",
"startTime": "Date.AddDays(SliceStart, - Date.DayOfWeek(SliceStart))",
"endTime": "Date.AddDays(SliceEnd, -Date.DayOfWeek(SliceEnd))"
}
],
"outputs": [
{
"name": "AzureBlobOutputDaily"
}
],
"linkedServiceName": "HDInsightLinkedService",
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adftutorial\\hivequery.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"Year": "$$Text.Format('{0:yyyy}',WindowStart)",
"Month": "$$Text.Format('{0:MM}',WindowStart)",
"Day": "$$Text.Format('{0:dd}',WindowStart)"
}
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 2,
"timeout": "01:00:00"
}
}
]
}
}

See Data Factory functions and system variables for a list of functions and system variables that Data Factory
supports.

Appendix
Example: copy sequentially
It is possible to run multiple copy operations one after another in a sequential/ordered manner. For example,
you might have two copy activities in a pipeline (CopyActivity1 and CopyActivity2) with the following input
and output datasets:
CopyActivity1
Input: Dataset1. Output: Dataset2.
CopyActivity2
Input: Dataset2. Output: Dataset3.
CopyActivity2 would run only if the CopyActivity1 has run successfully and Dataset2 is available.
Here is the sample pipeline JSON:

{
"name": "ChainActivities",
"properties": {
"description": "Run activities in sequence",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset1"
}
],
"outputs": [
{
"name": "Dataset2"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlob1ToBlob2",
"description": "Copy data from a blob to another"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset2"
}
],
"outputs": [
{
"name": "Dataset3"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlob2ToBlob3",
"description": "Copy data from a blob to another"
}
],
"start": "2016-08-25T01:00:00Z",
"end": "2016-08-25T01:00:00Z",
"isPaused": false
}
}

Notice that in the example, the output dataset of the first copy activity (Dataset2) is specified as input for the
second activity. Therefore, the second activity runs only when the output dataset from the first activity is ready.
In the example, CopyActivity2 can have a different input, such as Dataset3, but you specify Dataset2 as an input
to CopyActivity2, so the activity does not run until CopyActivity1 finishes. For example:
CopyActivity1
Input: Dataset1. Output: Dataset2.
CopyActivity2
Inputs: Dataset3, Dataset2. Output: Dataset4.

{
"name": "ChainActivities",
"properties": {
"description": "Run activities in sequence",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset1"
}
],
"outputs": [
{
"name": "Dataset2"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlobToBlob",
"description": "Copy data from a blob to another"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "Dataset3"
},
{
"name": "Dataset2"
}
],
"outputs": [
{
"name": "Dataset4"
}
],
"policy": {
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "CopyFromBlob3ToBlob4",
"description": "Copy data from a blob to another"
}
],
"start": "2017-04-25T01:00:00Z",
"end": "2017-04-25T01:00:00Z",
"isPaused": false
}
}

Notice that in the example, two input datasets are specified for the second copy activity. When multiple inputs
are specified, only the first input dataset is used for copying data, but other datasets are used as dependencies.
CopyActivity2 would start only after the following conditions are met:
CopyActivity1 has successfully completed and Dataset2 is available. This dataset is not used when copying
data to Dataset4. It only acts as a scheduling dependency for CopyActivity2.
Dataset3 is available. This dataset represents the data that is copied to the destination.
Tutorial: Copy data from Blob Storage to SQL
Database using Data Factory
8/21/2017 4 min to read Edit Online

In this tutorial, you create a data factory with a pipeline to copy data from Blob storage to SQL
database.
The Copy Activity performs the data movement in Azure Data Factory. It is powered by a globally
available service that can copy data between various data stores in a secure, reliable, and scalable way.
See Data Movement Activities article for details about the Copy Activity.

NOTE
For a detailed overview of the Data Factory service, see the Introduction to Azure Data Factory article.

Prerequisites for the tutorial


Before you begin this tutorial, you must have the following prerequisites:
Azure subscription. If you don't have a subscription, you can create a free trial account in just a
couple of minutes. See the Free Trial article for details.
Azure Storage Account. You use the blob storage as a source data store in this tutorial. If you
don't have an Azure storage account, see the Create a storage account article for steps to create
one.
Azure SQL Database. You use an Azure SQL database as a destination data store in this tutorial.
If you don't have an Azure SQL database that you can use in the tutorial, see How to create and
configure an Azure SQL Database to create one.
SQL Server 2012/2014 or Visual Studio 2013. You use SQL Server Management Studio or Visual
Studio to create a sample database and to view the result data in the database.

Collect blob storage account name and key


You need the account name and account key of your Azure storage account to do this tutorial. Note
down account name and account key for your Azure storage account.
1. Log in to the Azure portal.
2. Click More services on the left menu and select Storage Accounts.
3. In the Storage Accounts blade, select the Azure storage account that you want to use in this
tutorial.
4. Select Access keys link under SETTINGS.
5. Click the copy button next to the Storage account name text box and save/paste it somewhere
(for example: in a text file).
6. Repeat the previous step to copy or note down the key1.

7. Close all the blades by clicking X.

Collect SQL server, database, user names


You need the names of Azure SQL server, database, and user to do this tutorial. Note down names of
server, database, and user for your Azure SQL database.
1. In the Azure portal, click More services on the left and select SQL databases.
2. In the SQL databases blade, select the database that you want to use in this tutorial. Note down
the database name.
3. In the SQL database blade, click Properties under SETTINGS.
4. Note down the values for SERVER NAME and SERVER ADMIN LOGIN.
5. Close all the blades by clicking X.

Allow Azure services to access SQL server


Ensure that the Allow access to Azure services setting is turned ON for your Azure SQL server so that the
Data Factory service can access your Azure SQL server. To verify and turn on this setting, do the
following steps:
1. Click More services hub on the left and click SQL servers.
2. Select your server, and click Firewall under SETTINGS.
3. In the Firewall settings blade, click ON for Allow access to Azure services.
4. Close all the blades by clicking X.

Prepare Blob Storage and SQL Database


Now, prepare your Azure blob storage and Azure SQL database for the tutorial by performing the
following steps:
1. Launch Notepad. Copy the following text and save it as emp.txt to C:\ADFGetStarted folder
on your hard drive.

John, Doe
Jane, Doe

2. Use tools such as Azure Storage Explorer to create the adftutorial container and to upload the
emp.txt file to the container.

3. Use the following SQL script to create the emp table in your Azure SQL Database.

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50),
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);

If you have SQL Server 2012/2014 installed on your computer: follow instructions from
Managing Azure SQL Database using SQL Server Management Studio to connect to your Azure
SQL server and run the SQL script. This article uses the classic Azure portal, not the new Azure
portal, to configure firewall for an Azure SQL server.
If your client is not allowed to access the Azure SQL server, you need to configure firewall for
your Azure SQL server to allow access from your machine (IP Address). See this article for steps
to configure the firewall for your Azure SQL server.
Create a data factory
You have completed the prerequisites. You can create a data factory using one of the following ways.
Click one of the options in the drop-down list at the top or the following links to perform the tutorial.
Copy Wizard
Azure portal
Visual Studio
PowerShell
Azure Resource Manager template
REST API
.NET API

NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. It does not
transform input data to produce output data. For a tutorial on how to transform data using Azure Data
Factory, see Tutorial: Build your first pipeline to transform data using Hadoop cluster.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.
Tutorial: Create a pipeline with Copy Activity using
Data Factory Copy Wizard
7/10/2017 6 min to read Edit Online

This tutorial shows you how to use the Copy Wizard to copy data from an Azure blob storage to an Azure
SQL database.
The Azure Data Factory Copy Wizard allows you to quickly create a data pipeline that copies data from a
supported source data store to a supported destination data store. Therefore, we recommend that you use the
wizard as a first step to create a sample pipeline for your data movement scenario. For a list of data stores
supported as sources and as destinations, see supported data stores.
This tutorial shows you how to create an Azure data factory, launch the Copy Wizard, go through a series of
steps to provide details about your data ingestion/movement scenario. When you finish steps in the wizard,
the wizard automatically creates a pipeline with a Copy Activity to copy data from an Azure blob storage to an
Azure SQL database. For more information about Copy Activity, see data movement activities.

Prerequisites
Complete prerequisites listed in the Tutorial Overview article before performing this tutorial.

Create data factory


In this step, you use the Azure portal to create an Azure data factory named ADFTutorialDataFactory.
1. Log in to Azure portal.
2. Click + NEW from the top-left corner, click Data + analytics, and click Data Factory.
3. In the New data factory blade:
a. Enter ADFTutorialDataFactory for the name. The name of the Azure data factory must be
globally unique. If you receive the error:
Data factory name ADFTutorialDataFactory is not available , change the name of the data
factory (for example, yournameADFTutorialDataFactoryYYYYMMDD) and try creating again. See
Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

b. Select your Azure subscription.


c. For Resource Group, do one of the following steps:
Select Use existing to select an existing resource group.
Select Create new to enter a name for a resource group.
Some of the steps in this tutorial assume that you use the name:
ADFTutorialResourceGroup for the resource group. To learn about resource groups,
see Using resource groups to manage your Azure resources.
d. Select a location for the data factory.
e. Select Pin to dashboard check box at the bottom of the blade.
f. Click Create.

4. After the creation is complete, you see the Data Factory blade as shown in the following image:

Launch Copy Wizard


1. On the Data Factory blade, click Copy data [PREVIEW] to launch the Copy Wizard.
NOTE
If you see that the web browser is stuck at "Authorizing...", disable/uncheck Block third-party cookies and
site data setting in the browser settings (or) keep it enabled and create an exception for
login.microsoftonline.com and then try launching the wizard again.

2. In the Properties page:


a. Enter CopyFromBlobToAzureSql for Task name
b. Enter description (optional).
c. Change the Start date time and the End date time so that the end date is set to today and start
date to five days earlier.
d. Click Next.

3. On the Source data store page, click Azure Blob Storage tile. You use this page to specify the source
data store for the copy task.
4. On the Specify the Azure Blob storage account page:
a. Enter AzureStorageLinkedService for Linked service name.
b. Confirm that From Azure subscriptions option is selected for Account selection method.
c. Select your Azure subscription.
d. Select an Azure storage account from the list of Azure storage accounts available in the
selected subscription. You can also choose to enter storage account settings manually by
selecting Enter manually option for the Account selection method, and then click Next.

5. On Choose the input file or folder page:


a. Double-click adftutorial (folder).
b. Select emp.txt, and click Choose

6. On the Choose the input file or folder page, click Next. Do not select Binary copy.

7. On the File format settings page, you see the delimiters and the schema that is auto-detected by the
wizard by parsing the file. You can also enter the delimiters manually for the copy wizard to stop auto-
detecting or to override. Click Next after you review the delimiters and preview data.
8. On the Destination data store page, select Azure SQL Database, and click Next.

9. On Specify the Azure SQL database page:


a. Enter AzureSqlLinkedService for the Connection name field.
b. Confirm that From Azure subscriptions option is selected for Server / database selection
method.
c. Select your Azure subscription.
d. Select Server name and Database.
e. Enter User name and Password.
f. Click Next.

10. On the Table mapping page, select emp for the Destination field from the drop-down list, click
down arrow (optional) to see the schema and to preview the data.
11. On the Schema mapping page, click Next.

12. On the Performance settings page, click Next.


13. Review information in the Summary page, and click Finish. The wizard creates two linked services, two
datasets (input and output), and one pipeline in the data factory (from where you launched the Copy
Wizard).

Launch Monitor and Manage application


1. On the Deployment page, click the link: Click here to monitor copy pipeline .

2. The monitoring application is launched in a separate tab in your web browser.

3. To see the latest status of the slices, click the Refresh button in the ACTIVITY WINDOWS list at the
bottom. You see five activity windows for five days between start and end times for the pipeline. The list is
not automatically refreshed, so you may need to click Refresh a couple of times before you see all the
activity windows in the Ready state.
4. Select an activity window in the list. See the details about it in the Activity Window Explorer on the
right.
Notice that the dates 11, 12, 13, 14, and 15 are in green color, which means that the daily output slices
for these dates have already been produced. You also see this color coding on the pipeline and the
output dataset in the diagram view. In the previous step, notice that two slices have already been
produced, one slice is currently being processed, and the other two are waiting to be processed (based
on the color coding).
For more information on using this application, see Monitor and manage pipeline using Monitoring
App article.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a
destination data store in a copy operation. The following table provides a list of data stores supported as
sources and destinations by the copy activity:

CATEGORY DATA STORE SUPPORTED AS A SOURCE SUPPORTED AS A SINK

Azure Azure Blob storage

Azure Cosmos DB
(DocumentDB API)

Azure Data Lake Store

Azure SQL Database

Azure SQL Data Warehouse

Azure Search Index

Azure Table storage

Databases Amazon Redshift

DB2*

MySQL*

Oracle*

PostgreSQL*

SAP Business Warehouse*

SAP HANA*

SQL Server*

Sybase*

Teradata*

NoSQL Cassandra*

MongoDB*

File Amazon S3

File System*

FTP

HDFS*

SFTP

Others Generic HTTP

Generic OData

Generic ODBC*

Salesforce

Web Table (table from HTML)

GE Historian*
For details about fields/properties that you see in the copy wizard for a data store, click the link for the data
store in the table.
Tutorial: Use Azure portal to create a Data Factory
pipeline to copy data
7/10/2017 18 min to read Edit Online

In this article, you learn how to use Azure portal to create a data factory with a pipeline that copies data from
an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks,
see supported data stores. The activity is powered by a globally available service that can copy data between
various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see
Data Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.

NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how
to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
Complete prerequisites listed in the tutorial prerequisites article before performing this tutorial.

Steps
Here are the steps you perform as part of this tutorial:
1. Create an Azure data factory. In this step, you create a data factory named ADFTutorialDataFactory.
2. Create linked services in the data factory. In this step, you create two linked services of types: Azure
Storage and Azure SQL Database.
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from
the blob storage is stored in this database. You created a SQL table in this database as part of
prerequisites.
3. Create input and output datasets in the data factory.
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. And, the input blob dataset specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory
service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset
specifies the table in the database to which the data from the blob storage is copied.
4. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity.
The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL
database. You can use a copy activity in a pipeline to copy data from any supported source to any
supported destination. For a list of supported data stores, see data movement activities article.
5. Monitor the pipeline. In this step, you monitor the slices of input and output datasets by using Azure
portal.

Create data factory


IMPORTANT
Complete prerequisites for the tutorial if you haven't already done so.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a
Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive
script to transform input data to produce output data. Let's start with creating the data factory in this step.
1. After logging in to the Azure portal, click New on the left menu, click Data + Analytics, and click Data
Factory.

2. In the New data factory blade:


a. Enter ADFTutorialDataFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error,
change the name of the data factory (for example, yournameADFTutorialDataFactory) and try
creating again. See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

Data factory name ADFTutorialDataFactory is not available

b. Select your Azure subscription in which you want to create the data factory.
c. For the Resource Group, do one of the following steps:
Select Use existing, and select an existing resource group from the drop-down list.
Select Create new, and enter the name of a resource group.
Some of the steps in this tutorial assume that you use the name:
ADFTutorialResourceGroup for the resource group. To learn about resource groups, see
Using resource groups to manage your Azure resources.
d. Select the location for the data factory. Only regions supported by the Data Factory service are
shown in the drop-down list.
e. Select Pin to dashboard.
f. Click Create.

IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
The name of the data factory may be registered as a DNS name in the future and hence become
publicly visible.
3. On the dashboard, you see the following tile with status: Deploying data factory.

4. After the creation is complete, you see the Data Factory blade as shown in the image.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use
two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of
types: AzureStorage and AzureSqlDatabase.
The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is
the one in which you created a container and uploaded the data as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the
blob storage is stored in this database. You created the emp table in this database as part of prerequisites.
Create Azure Storage linked service
In this step, you link your Azure storage account to your data factory. You specify the name and key of your
Azure storage account in this section.
1. In the Data Factory blade, click Author and deploy tile.

2. You see the Data Factory Editor as shown in the following image:

3. In the editor, click New data store button on the toolbar and select Azure storage from the drop-
down menu. You should see the JSON template for creating an Azure storage linked service in the right
pane.

4. Replace <accountname> and <accountkey> with the account name and account key values for your
Azure storage account.

5. Click Deploy on the toolbar. You should see the deployed AzureStorageLinkedService in the tree
view now.

For more information about JSON properties in the linked service definition, see Azure Blob Storage
connector article.
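
For reference, the deployed linked service typically looks similar to the following sketch. The connection string format shown here is an assumption based on the standard Azure Storage connection string; the placeholders are the values you replace in the editor:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
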
Create a linked service for the Azure SQL Database
In this step, you link your Azure SQL database to your data factory. You specify the Azure SQL server name,
database name, user name, and user password in this section.
1. In the Data Factory Editor, click New data store button on the toolbar and select Azure SQL Database
from the drop-down menu. You should see the JSON template for creating the Azure SQL linked service in
the right pane.
2. Replace <servername> , <databasename> , <username>@<servername> , and <password> with names of your
Azure SQL server, database, user account, and password.
3. Click Deploy on the toolbar to create and deploy the AzureSqlLinkedService.
4. Confirm that you see AzureSqlLinkedService in the tree view under Linked services.
For more information about these JSON properties, see Azure SQL Database connector.
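
For reference, the deployed linked service typically looks similar to the following sketch. The exact connection string produced by the template may differ, so treat this as an illustration only:

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Encrypt=True;Connection Timeout=30"
        }
    }
}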

Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to
your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent
input and output data that is stored in the data stores referred by AzureStorageLinkedService and
AzureSqlLinkedService respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses
at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies
the table in the database to which the data from the blob storage is copied.
Create input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service.
If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the
destination. In this tutorial, you specify a value for the fileName.
1. In the Editor for the Data Factory, click ... More, click New dataset, and click Azure Blob storage from
the drop-down menu.

2. Replace JSON in the right pane with the following JSON snippet:
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/",
"fileName": "emp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

type
The type property is set to AzureBlob because data resides in an Azure blob storage.

linkedServiceName
Refers to the AzureStorageLinkedService that you created earlier.

folderPath
Specifies the blob container and the folder that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder.

fileName
This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing.

format -> type
The input file is in the text format, so we use TextFormat.

columnDelimiter
The columns in the input file are delimited by the comma character ( , ).

frequency/interval
The frequency is set to Hour and interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times.

external
This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true.
For more information about these JSON properties, see Azure Blob connector article.
3. Click Deploy on the toolbar to create and deploy the InputDataset dataset. Confirm that you see the
InputDataset in the tree view.
Create output dataset
The Azure SQL Database linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure SQL database. The output SQL table dataset (OutputDataset) you create in this
step specifies the table in the database to which the data from the blob storage is copied.
1. In the Editor for the Data Factory, click ... More, click New dataset, and click Azure SQL from the drop-
down menu.
2. Replace JSON in the right pane with the following JSON snippet:

{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

type
The type property is set to AzureSqlTable because data is copied to a table in an Azure SQL database.

linkedServiceName
Refers to the AzureSqlLinkedService that you created earlier.

tableName
Specifies the table to which the data is copied.

frequency/interval
The frequency is set to Hour and interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times.

There are three columns ID, FirstName, and LastName in the emp table in the database. ID is an
identity column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
3. Click Deploy on the toolbar to create and deploy the OutputDataset dataset. Confirm that you see the
OutputDataset in the tree view under Datasets.

Create pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and
OutputDataset as an output.
Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
1. In the Editor for the Data Factory, click ... More, and click New pipeline. Alternatively, you can right-click
Pipelines in the tree view and click New pipeline.
2. Replace JSON in the right pane with the following JSON snippet:
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}

Note the following points:


In the activities section, there is only one activity whose type is set to Copy. For more information
about the copy activity, see data movement activities. In Data Factory solutions, you can also use
data transformation activities.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified
as the sink type. For a complete list of data stores supported by the copy activity as sources and
sinks, see supported data stores. To learn how to use a specific supported data store as a
source/sink, click the link in the table.
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The
end time is optional, but we use it in this tutorial. If you do not specify value for the end
property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9999-09-
09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For
descriptions of JSON properties in a copy activity definition, see data movement activities. For
descriptions of JSON properties supported by BlobSource, see Azure Blob connector article. For
descriptions of JSON properties supported by SqlSink, see Azure SQL Database connector article.
3. Click Deploy on the toolbar to create and deploy the ADFTutorialPipeline. Confirm that you see the
pipeline in the tree view.
4. Now, close the Editor blade by clicking X. Click X again to see the Data Factory home page for the
ADFTutorialDataFactory.
Congratulations! You have successfully created an Azure data factory with a pipeline to copy data from an
Azure blob storage to an Azure SQL database.

Monitor pipeline
In this step, you use the Azure portal to monitor what's going on in an Azure data factory.
Monitor pipeline using Monitor & Manage App
The following steps show you how to monitor pipelines in your data factory by using the Monitor & Manage
application:
1. Click Monitor & Manage tile on the home page for your data factory.

2. You should see Monitor & Manage application in a separate tab.

NOTE
If you see that the web browser is stuck at "Authorizing...", do one of the following: clear the Block third-party
cookies and site data check box (or) create an exception for login.microsoftonline.com, and then try to
open the app again.
3. Change the Start time and End time to include start (2017-05-11) and end times (2017-05-12) of your
pipeline, and click Apply.
4. You see the activity windows associated with each hour between pipeline start and end times in the list in
the middle pane.
5. To see details about an activity window, select the activity window in the Activity Windows list.

In Activity Window Explorer on the right, you see that the slices up to the current UTC time (8:12 PM)
are all processed (in green color). The 8-9 PM, 9 - 10 PM, 10 - 11 PM, 11 PM - 12 AM slices are not
processed yet.
The Attempts section in the right pane provides information about the activity run for the data slice. If
there was an error, it provides details about the error. For example, if the input folder or container does
not exist and the slice processing fails, you see an error message stating that the container or folder
does not exist.
6. Launch SQL Server Management Studio, connect to the Azure SQL Database, and verify that the rows
are inserted in to the emp table in the database.

For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines
using Monitoring and Management App.
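If you prefer scripting to the app for this step, the same activity windows can also be listed from Azure PowerShell. The following is a minimal sketch, assuming the AzureRM.DataFactories module is installed and that you substitute your own resource group name; it is not a required tutorial step.

# Sketch: list the activity windows for the tutorial data factory from PowerShell.
Get-AzureRmDataFactoryActivityWindow `
    -ResourceGroupName "<yourResourceGroupName>" `
    -DataFactoryName "ADFTutorialDataFactory"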
Monitor pipeline using Diagram View
You can also monitor data pipelines by using the diagram view.
1. In the Data Factory blade, click Diagram.
2. You should see the diagram similar to the following image:

3. In the diagram view, double-click InputDataset to see slices for the dataset.
4. Click See more link to see all the data slices. You see 24 hourly slices between pipeline start and end
times.
Notice that all the data slices up to the current UTC time are Ready because the emp.txt file exists all
the time in the root folder of the adftutorial blob container. The slices for future times are not in the
Ready state yet. Confirm that no slices show up in the Recently failed slices section at the bottom.
5. Close the blades until you see the diagram view (or) scroll left to see the diagram view. Then, double-click
OutputDataset.
6. Click See more link on the Table blade for OutputDataset to see all the slices.
7. Notice that all the slices up to the current UTC time move from the Pending execution state to the In
progress state, and then to the Ready state. The slices from the past (before the current time) are
processed from latest to oldest by default. For example, if the current time is 8:12 PM UTC, the slice for
7 PM - 8 PM is processed ahead of the 6 PM - 7 PM slice. The 8 PM - 9 PM slice is processed at the end of
its time interval by default, that is, after 9 PM.
8. Click any data slice from the list and you should see the Data slice blade. A piece of data associated
with an activity window is called a slice. A slice can be one file or multiple files.
If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are
blocking the current slice from executing in the Upstream slices that are not ready list.
9. In the DATA SLICE blade, you should see all activity runs in the list at the bottom. Click an activity run
to see the Activity run details blade.

In this blade, you see how long the copy operation took, what the throughput is, how many bytes of data
were read and written, the run start time, the run end time, and so on.
10. Click X to close all the blades until you get back to the home blade for the ADFTutorialDataFactory.
11. (Optional) Click the Datasets tile or the Pipelines tile to open the blades you saw in the preceding steps.
12. Launch SQL Server Management Studio, connect to the Azure SQL Database, and verify that the rows
are inserted in to the emp table in the database.
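If you would rather verify the copied rows from a script than from SQL Server Management Studio, a query such as the following also works. This is a sketch only; it assumes the SqlServer PowerShell module (for Invoke-Sqlcmd), and the server, database, and credential placeholders are yours to replace.

# Sketch: count the rows that the pipeline copied into the emp table.
Invoke-Sqlcmd -ServerInstance "<servername>.database.windows.net" `
    -Database "<databasename>" -Username "<username>" -Password "<password>" `
    -Query "SELECT COUNT(*) AS RowsCopied FROM dbo.emp"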
Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
You used the Azure portal to create the data factory, linked services, datasets, and a pipeline. Here are the
high-level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets that describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:

CATEGORY      DATA STORE
Azure         Azure Blob storage
              Azure Cosmos DB (DocumentDB API)
              Azure Data Lake Store
              Azure SQL Database
              Azure SQL Data Warehouse
              Azure Search Index
              Azure Table storage
Databases     Amazon Redshift
              DB2*
              MySQL*
              Oracle*
              PostgreSQL*
              SAP Business Warehouse*
              SAP HANA*
              SQL Server*
              Sybase*
              Teradata*
NoSQL         Cassandra*
              MongoDB*
File          Amazon S3
              File System*
              FTP
              HDFS*
              SFTP
Others        Generic HTTP
              Generic OData
              Generic ODBC*
              Salesforce
              Web Table (table from HTML)
              GE Historian*

To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Create a pipeline with Copy Activity using
Visual Studio
7/10/2017 19 min to read

In this article, you learn how to use the Microsoft Visual Studio to create a data factory with a pipeline that
copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read
through the Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks,
see supported data stores. The activity is powered by a globally available service that can copy data between
various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see
Data Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.

NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how
to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
1. Read through Tutorial Overview article and complete the prerequisite steps.
2. To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
3. You must have the following installed on your computer:
Visual Studio 2013 or Visual Studio 2015
Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to Azure Download
Page and click VS 2013 or VS 2015 in the .NET section.
Download the latest Azure Data Factory plugin for Visual Studio: VS 2013 or VS 2015. You can also
update the plugin by doing the following steps: On the menu, click Tools -> Extensions and
Updates -> Online -> Visual Studio Gallery -> Microsoft Azure Data Factory Tools for Visual
Studio -> Update.

Steps
Here are the steps you perform as part of this tutorial:
1. Create linked services in the data factory. In this step, you create two linked services of types: Azure
Storage and Azure SQL Database.
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from
the blob storage is stored in this database. You created a SQL table in this database as part of
prerequisites.
2. Create input and output datasets in the data factory.
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. And, the input blob dataset specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory
service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset
specifies the table in the database to which the data from the blob storage is copied.
3. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity.
The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL
database. You can use a copy activity in a pipeline to copy data from any supported source to any
supported destination. For a list of supported data stores, see data movement activities article.
4. Create an Azure data factory when deploying Data Factory entities (linked services, datasets/tables, and
pipelines).

Create Visual Studio project


1. Launch Visual Studio 2015. Click File, point to New, and click Project. You should see the New Project
dialog box.
2. In the New Project dialog, select the DataFactory template, and click Empty Data Factory Project.

3. Specify the name of the project, location for the solution, and name of the solution, and then click OK.
Create linked services
You create linked services in a data factory to link your data stores and compute services to the data factory. In
this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use
two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services of types: AzureStorage and AzureSqlDatabase.
The Azure Storage linked service links your Azure storage account to the data factory. This storage account is
the one in which you created a container and uploaded the data as part of prerequisites.
Azure SQL linked service links your Azure SQL database to the data factory. The data that is copied from the
blob storage is stored in this database. You created the emp table in this database as part of prerequisites.
Linked services link data stores or compute services to an Azure data factory. See supported data stores for all
the sources and sinks supported by the Copy Activity. See compute linked services for the list of compute
services supported by Data Factory. In this tutorial, you do not use any compute service.
Create the Azure Storage linked service
1. In Solution Explorer, right-click Linked Services, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Storage Linked Service from the list, and click Add.

3. Replace <accountname> and <accountkey> with the name of your Azure storage account and its key.

4. Save the AzureStorageLinkedService1.json file.


For more information about JSON properties in the linked service definition, see Azure Blob Storage
connector article.
Create the Azure SQL linked service
1. Right-click on Linked Services node in the Solution Explorer again, point to Add, and click New Item.
2. This time, select Azure SQL Linked Service, and click Add.
3. In the AzureSqlLinkedService1.json file, replace <servername> , <databasename> , <username@servername> ,
and <password> with names of your Azure SQL server, database, user account, and password.
4. Save the AzureSqlLinkedService1.json file.
For more information about these JSON properties, see Azure SQL Database connector.

Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database
to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that
represent input and output data that is stored in the data stores referred by AzureStorageLinkedService1 and
AzureSqlLinkedService1 respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses
at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies
the table in the database to which the data from the blob storage is copied.
Create input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService1 linked
service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are
copied to the destination. In this tutorial, you specify a value for the fileName.
Here, you use the term "tables" rather than "datasets". A table is a rectangular dataset and is the only type of
dataset supported right now.
1. Right-click Tables in the Solution Explorer, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Blob, and click Add.
3. Replace the JSON text with the following text and save the AzureBlobLocation1.json file.
{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService1",
"typeProperties": {
"folderPath": "adftutorial/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY              DESCRIPTION
type                  The type property is set to AzureBlob because data resides in an
                      Azure blob storage.
linkedServiceName     Refers to the AzureStorageLinkedService that you created earlier.
folderPath            Specifies the blob container and the folder that contains input
                      blobs. In this tutorial, adftutorial is the blob container and
                      folder is the root folder.
fileName              This property is optional. If you omit this property, all files
                      from the folderPath are picked. In this tutorial, emp.txt is
                      specified for the fileName, so only that file is picked up for
                      processing.
format -> type        The input file is in the text format, so we use TextFormat.
columnDelimiter       The columns in the input file are delimited by the comma
                      character ( , ).
frequency/interval    The frequency is set to Hour and interval is set to 1, which
                      means that the input slices are available hourly. In other words,
                      the Data Factory service looks for input data every hour in the
                      root folder of the blob container (adftutorial) you specified. It
                      looks for the data within the pipeline start and end times, not
                      before or after these times.
external              This property is set to true if the data is not generated by this
                      pipeline. The input data in this tutorial is in the emp.txt file,
                      which is not generated by this pipeline, so we set this property
                      to true.

For more information about these JSON properties, see Azure Blob connector article.
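Because InputDataset points at a specific blob, it can be worth confirming that emp.txt really exists in the adftutorial container before you deploy. The following optional sketch uses the Azure.Storage PowerShell module; the account name and key placeholders are the same values you used in the linked service.

# Sketch: confirm that the input blob referenced by InputDataset exists.
$ctx = New-AzureStorageContext -StorageAccountName "<accountname>" -StorageAccountKey "<accountkey>"
Get-AzureStorageBlob -Container "adftutorial" -Blob "emp.txt" -Context $ctx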
Create output dataset
In this step, you create an output dataset named OutputDataset. This dataset points to a SQL table in the
Azure SQL database represented by AzureSqlLinkedService1.
1. Right-click Tables in the Solution Explorer again, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure SQL, and click Add.
3. Replace the JSON text with the following JSON and save the AzureSqlTableLocation1.json file.

{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService1",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY              DESCRIPTION
type                  The type property is set to AzureSqlTable because data is copied
                      to a table in an Azure SQL database.
linkedServiceName     Refers to the AzureSqlLinkedService that you created earlier.
tableName             Specifies the table to which the data is copied.
frequency/interval    The frequency is set to Hour and interval is 1, which means that
                      the output slices are produced hourly between the pipeline start
                      and end times, not before or after these times.

The emp table in the database has three columns: ID, FirstName, and LastName. ID is an
identity column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.

Create pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and
OutputDataset as an output.
Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
1. Right-click Pipelines in the Solution Explorer, point to Add, and click New Item.
2. Select Copy Data Pipeline in the Add New Item dialog box and click Add.
3. Replace the JSON with the following JSON and save the CopyActivity1.json file.
{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z",
"isPaused": false
}
}

In the activities section, there is only one activity whose type is set to Copy. For more information
about the copy activity, see data movement activities. In Data Factory solutions, you can also use
data transformation activities.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is
specified as the sink type. For a complete list of data stores supported by the copy activity as
sources and sinks, see supported data stores. To learn how to use a specific supported data store
as a source/sink, click the link in the table.
Replace the value of the start property with the current day and the end value with the next day
(a sketch for generating these values in the right format follows these notes). You can specify
only the date part and skip the time part of the date time. For example, "2016-02-03", which is
equivalent to "2016-02-03T00:00:00Z".
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The
end time is optional, but we use it in this tutorial.
If you do not specify value for the end property, it is calculated as "start + 48 hours". To run the
pipeline indefinitely, specify 9999-09-09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For
descriptions of JSON properties in a copy activity definition, see data movement activities. For
descriptions of JSON properties supported by BlobSource, see Azure Blob connector article. For
descriptions of JSON properties supported by SqlSink, see Azure SQL Database connector
article.
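As noted above, you can set start to the current day and end to the next day. The following sketch (not part of the tutorial files) prints ISO-format values that you can paste into the pipeline JSON:

# Sketch: generate ISO-format start/end values for a one-day window beginning today (UTC).
$start = (Get-Date).ToUniversalTime().Date.ToString("s") + "Z"
$end   = (Get-Date).ToUniversalTime().Date.AddDays(1).ToString("s") + "Z"
"start = $start"
"end   = $end"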

Publish/deploy Data Factory entities


In this step, you publish Data Factory entities (linked services, datasets, and pipeline) you created earlier. You
also specify the name of the new data factory to be created to hold these entities.
1. Right-click project in the Solution Explorer, and click Publish.
2. If you see Sign in to your Microsoft account dialog box, enter your credentials for the account that has
Azure subscription, and click sign in.
3. You should see the following dialog box:

4. In the Configure data factory page, do the following steps:


a. select Create New Data Factory option.
b. Enter VSTutorialFactory for Name.

IMPORTANT
The name of the Azure data factory must be globally unique. If you receive an error about the name of
data factory when publishing, change the name of the data factory (for example,
yournameVSTutorialFactory) and try publishing again. See Data Factory - Naming Rules topic for naming
rules for Data Factory artifacts.

c. Select your Azure subscription for the Subscription field.


IMPORTANT
If you do not see any subscription, ensure that you logged in using an account that is an admin or co-
admin of the subscription.

d. Select the resource group for the data factory to be created.


e. Select the region for the data factory. Only regions supported by the Data Factory service are
shown in the drop-down list.
f. Click Next to switch to the Publish Items page.

5. In the Publish Items page, ensure that all the Data Factories entities are selected, and click Next to
switch to the Summary page.
6. Review the summary and click Next to start the deployment process and view the Deployment
Status.

7. In the Deployment Status page, you should see the status of the deployment process. Click Finish
after the deployment is done.

Note the following points:


If you receive the error: "This subscription is not registered to use namespace Microsoft.DataFactory",
do one of the following and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider.
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

You can run the following command to confirm that the Data Factory provider is registered.

Get-AzureRmResourceProvider

Login using the Azure subscription into the Azure portal and navigate to a Data Factory blade (or)
create a data factory in the Azure portal. This action automatically registers the provider for you.
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.

IMPORTANT
To create Data Factory instances, you need to be an admin or co-admin of the Azure subscription.

Monitor pipeline
Navigate to the home page for your data factory:
1. Log in to Azure portal.
2. Click More services on the left menu, and click Data factories.
3. Start typing the name of your data factory.
4. Click your data factory in the results list to see the home page for your data factory.

5. Follow instructions from Monitor datasets and pipeline to monitor the pipeline and datasets you have
created in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines.

Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
You used Visual Studio to create the data factory, linked services, datasets, and a pipeline. Here are the high-
level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets, which describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.
To see how to use a HDInsight Hive Activity to transform data by using Azure HDInsight cluster, see Tutorial:
Build your first pipeline to transform data using Hadoop cluster.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.

View all data factories in Server Explorer


This section describes how to use the Server Explorer in Visual Studio to view all the data factories in your
Azure subscription and create a Visual Studio project based on an existing data factory.
1. In Visual Studio, click View on the menu, and click Server Explorer.
2. In the Server Explorer window, expand Azure and expand Data Factory. If you see Sign in to Visual
Studio, enter the account associated with your Azure subscription and click Continue. Enter
password, and click Sign in. Visual Studio tries to get information about all Azure data factories in
your subscription. You see the status of this operation in the Data Factory Task List window.

Create a Visual Studio project for an existing data factory


Right-click a data factory in Server Explorer, and select Export Data Factory to New Project to create
a Visual Studio project based on an existing data factory.
Update Data Factory tools for Visual Studio
To update Azure Data Factory tools for Visual Studio, do the following steps:
1. Click Tools on the menu and select Extensions and Updates.
2. Select Updates in the left pane and then select Visual Studio Gallery.
3. Select Azure Data Factory tools for Visual Studio and click Update. If you do not see this entry, you
already have the latest version of the tools.

Use configuration files


You can use configuration files in Visual Studio to configure properties for linked services/tables/pipelines
differently for each environment.
Consider the following JSON definition for an Azure Storage linked service. Suppose you want to specify
connectionString with different values for accountname and accountkey based on the environment
(Dev/Test/Production) to which you are deploying Data Factory entities. You can achieve this behavior by
using a separate configuration file for each environment.

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"description": "",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Add a configuration file


Add a configuration file for each environment by performing the following steps:
1. Right-click the Data Factory project in your Visual Studio solution, point to Add, and click New item.
2. Select Config from the list of installed templates on the left, select Configuration File, enter a name
for the configuration file, and click Add.
3. Add configuration parameters and their values in the following format:

{
"$schema":
"http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
"AzureStorageLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
],
"AzureSqlLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "Server=tcp:spsqlserver.database.windows.net,1433;Database=spsqldb;User
ID=spelluru;Password=Sowmya123;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
]
}

This example configures connectionString property of an Azure Storage linked service and an Azure
SQL linked service. Notice that the syntax for specifying name is JsonPath.
If JSON has a property that has an array of values as shown in the following code:

"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],

Configure properties as shown in the following configuration file (use zero-based indexing):
{
"name": "$.properties.structure[0].name",
"value": "FirstName"
}
{
"name": "$.properties.structure[0].type",
"value": "String"
}
{
"name": "$.properties.structure[1].name",
"value": "LastName"
}
{
"name": "$.properties.structure[1].type",
"value": "String"
}

Property names with spaces


If a property name has spaces in it, use square brackets as shown in the following example (Database server
name):

{
"name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']",
"value": "MyAsqlServer.database.windows.net"
}

Deploy solution using a configuration


When you are publishing Azure Data Factory entities in VS, you can specify the configuration that you want to
use for that publishing operation.
To publish entities in an Azure Data Factory project using configuration file:
1. Right-click Data Factory project and click Publish to see the Publish Items dialog box.
2. Select an existing data factory or specify values for creating a data factory on the Configure data factory
page, and click Next.
3. On the Publish Items page: you see a drop-down list with available configurations for the Select
Deployment Config field.

4. Select the configuration file that you would like to use and click Next.
5. Confirm that you see the name of JSON file in the Summary page and click Next.
6. Click Finish after the deployment operation is finished.
When you deploy, the values from the configuration file are used to set values for properties in the JSON files
before the entities are deployed to Azure Data Factory service.

Use Azure Key Vault


It is not advisable and often against security policy to commit sensitive data such as connection strings to the
code repository. See ADF Secure Publish sample on GitHub to learn about storing sensitive information in
Azure Key Vault and using it while publishing Data Factory entities. The Secure Publish extension for Visual
Studio allows the secrets to be stored in Key Vault and only references to them are specified in linked services/
deployment configurations. These references are resolved when you publish Data Factory entities to Azure.
These files can then be committed to source repository without exposing any secrets.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:

CATEGORY      DATA STORE
Azure         Azure Blob storage
              Azure Cosmos DB (DocumentDB API)
              Azure Data Lake Store
              Azure SQL Database
              Azure SQL Data Warehouse
              Azure Search Index
              Azure Table storage
Databases     Amazon Redshift
              DB2*
              MySQL*
              Oracle*
              PostgreSQL*
              SAP Business Warehouse*
              SAP HANA*
              SQL Server*
              Sybase*
              Teradata*
NoSQL         Cassandra*
              MongoDB*
File          Amazon S3
              File System*
              FTP
              HDFS*
              SFTP
Others        Generic HTTP
              Generic OData
              Generic ODBC*
              Salesforce
              Web Table (table from HTML)
              GE Historian*

To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Create a Data Factory pipeline that moves
data by using Azure PowerShell
7/10/2017 17 min to read

In this article, you learn how to use PowerShell to create a data factory with a pipeline that copies data from an
Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks,
see supported data stores. The activity is powered by a globally available service that can copy data between
various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see
Data Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.

NOTE
This article does not cover all the Data Factory cmdlets. See Data Factory Cmdlet Reference for comprehensive
documentation on these cmdlets.
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how
to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
Complete prerequisites listed in the tutorial prerequisites article.
Install Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.

Steps
Here are the steps you perform as part of this tutorial:
1. Create an Azure data factory. In this step, you create a data factory named ADFTutorialDataFactoryPSH.
2. Create linked services in the data factory. In this step, you create two linked services of types: Azure
Storage and Azure SQL Database.
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a
container and uploaded data to this storage account as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from
the blob storage is stored in this database. You created a SQL table in this database as part of
prerequisites.
3. Create input and output datasets in the data factory.
The Azure storage linked service specifies the connection string that Data Factory service uses at run
time to connect to your Azure storage account. And, the input blob dataset specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory
service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset
specifies the table in the database to which the data from the blob storage is copied.
4. Create a pipeline in the data factory. In this step, you create a pipeline with a copy activity.
The copy activity copies data from a blob in the Azure blob storage to a table in the Azure SQL
database. You can use a copy activity in a pipeline to copy data from any supported source to any
supported destination. For a list of supported data stores, see data movement activities article.
5. Monitor the pipeline. In this step, you monitor the slices of input and output datasets by using PowerShell.

Create a data factory


IMPORTANT
Complete prerequisites for the tutorial if you haven't already done so.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a
Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive
script to transform input data to product output data. Let's start with creating the data factory in this step.
1. Launch PowerShell. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen,
you need to run the commands again.
Run the following command, and enter the user name and password that you use to sign in to the
Azure portal:

Login-AzureRmAccount

Run the following command to view all the subscriptions for this account:

Get-AzureRmSubscription

Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription:

Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext

2. Create an Azure resource group named ADFTutorialResourceGroup by running the following


command:

New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of
ADFTutorialResourceGroup in this tutorial.
3. Run the New-AzureRmDataFactory cmdlet to create a data factory named
ADFTutorialDataFactoryPSH:
$df=New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name ADFTutorialDataFactoryPSH -Location "West US"

This name may already have been taken. Therefore, make the name of the data factory unique by
adding a prefix or suffix (for example: ADFTutorialDataFactoryPSH05152017) and run the command
again.
Note the following points:
The name of the Azure data factory must be globally unique. If you receive the following error, change
the name (for example, yournameADFTutorialDataFactoryPSH). Use this name in place of
ADFTutorialDataFactoryPSH while performing steps in this tutorial. See Data Factory - Naming Rules for
Data Factory artifacts.

Data factory name ADFTutorialDataFactoryPSH is not available

To create Data Factory instances, you must be a contributor or administrator of the Azure subscription.
The name of the data factory may be registered as a DNS name in the future, and hence become publicly
visible.
You may receive the following error: "This subscription is not registered to use namespace
Microsoft.DataFactory." Do one of the following, and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider:

Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

Run the following command to confirm that the Data Factory provider is registered:

Get-AzureRmResourceProvider

Sign in by using the Azure subscription to the Azure portal. Go to a Data Factory blade, or create a
data factory in the Azure portal. This action automatically registers the provider for you.
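To check only the Data Factory provider instead of scanning the full output of Get-AzureRmResourceProvider, you can filter by namespace. This is an optional sketch:

# Sketch: show the registration state of the Microsoft.DataFactory provider only.
Get-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory |
    Select-Object ProviderNamespace, RegistrationState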

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use
two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of
types: AzureStorage and AzureSqlDatabase.
The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is
the one in which you created a container and uploaded the data as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the
blob storage is stored in this database. You created the emp table in this database as part of prerequisites.
Create a linked service for an Azure storage account
In this step, you link your Azure storage account to your data factory.
1. Create a JSON file named AzureStorageLinkedService.json in C:\ADFGetStartedPSH folder with
the following content: (Create the folder ADFGetStartedPSH if it does not already exist.)
IMPORTANT
Replace <accountname> and <accountkey> with name and key of your Azure storage account before saving
the file.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=
<accountname>;AccountKey=<accountkey>"
}
}
}

2. In Azure PowerShell, switch to the ADFGetStartedPSH folder.


3. Run the New-AzureRmDataFactoryLinkedService cmdlet to create the linked service:
AzureStorageLinkedService. This cmdlet, and other Data Factory cmdlets you use in this tutorial
requires you to pass values for the ResourceGroupName and DataFactoryName parameters.
Alternatively, you can pass the DataFactory object returned by the New-AzureRmDataFactory cmdlet
without typing ResourceGroupName and DataFactoryName each time you run a cmdlet.

New-AzureRmDataFactoryLinkedService $df -File .\AzureStorageLinkedService.json

Here is the sample output:

LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Properties : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties
ProvisioningState : Succeeded

Another way to create this linked service is to specify the resource group name and data factory name
instead of the DataFactory object.

New-AzureRmDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName


<Name of your data factory> -File .\AzureStorageLinkedService.json

Create a linked service for an Azure SQL database


In this step, you link your Azure SQL database to your data factory.
1. Create a JSON file named AzureSqlLinkedService.json in C:\ADFGetStartedPSH folder with the
following content:

IMPORTANT
Replace <servername>, <databasename>, <username@servername>, and <password> with names of
your Azure SQL server, database, user account, and password.
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=
<databasename>;User ID=<user>@<server>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}

2. Run the following command to create a linked service:

New-AzureRmDataFactoryLinkedService $df -File .\AzureSqlLinkedService.json

Here is the sample output:

LinkedServiceName : AzureSqlLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Properties : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties
ProvisioningState : Succeeded

Confirm that the Allow access to Azure services setting is turned on for your SQL database server. To
verify and turn it on, do the following steps (or use the PowerShell sketch that follows these steps):
a. Log in to the Azure portal
b. Click More services > on the left, and click SQL servers in the DATABASES category.
c. Select your server in the list of SQL servers.
d. On the SQL server blade, click Show firewall settings link.
e. In the Firewall settings blade, click ON for Allow access to Azure services.
f. Click Save on the toolbar.
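The same setting can be turned on from PowerShell. The following sketch assumes the AzureRM.Sql module; the resource group and server name placeholders are yours, and the -AllowAllAzureIPs switch creates the special firewall rule that corresponds to Allow access to Azure services.

# Sketch: enable "Allow access to Azure services" for the SQL server from PowerShell.
New-AzureRmSqlServerFirewallRule -ResourceGroupName "<yourResourceGroupName>" `
    -ServerName "<servername>" -AllowAllAzureIPs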

Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database
to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that
represent input and output data that is stored in the data stores referred by AzureStorageLinkedService and
AzureSqlLinkedService respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses
at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset) specifies
the table in the database to which the data from the blob storage is copied.
Create an input dataset
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service.
If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the
destination. In this tutorial, you specify a value for the fileName.
1. Create a JSON file named InputDataset.json in the C:\ADFGetStartedPSH folder, with the following
content:

{
"name": "InputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "emp.txt",
"folderPath": "adftutorial/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY              DESCRIPTION
type                  The type property is set to AzureBlob because data resides in an
                      Azure blob storage.
linkedServiceName     Refers to the AzureStorageLinkedService that you created earlier.
folderPath            Specifies the blob container and the folder that contains input
                      blobs. In this tutorial, adftutorial is the blob container and
                      folder is the root folder.
fileName              This property is optional. If you omit this property, all files
                      from the folderPath are picked. In this tutorial, emp.txt is
                      specified for the fileName, so only that file is picked up for
                      processing.
format -> type        The input file is in the text format, so we use TextFormat.
columnDelimiter       The columns in the input file are delimited by the comma
                      character ( , ).
frequency/interval    The frequency is set to Hour and interval is set to 1, which
                      means that the input slices are available hourly. In other words,
                      the Data Factory service looks for input data every hour in the
                      root folder of the blob container (adftutorial) you specified. It
                      looks for the data within the pipeline start and end times, not
                      before or after these times.
external              This property is set to true if the data is not generated by this
                      pipeline. The input data in this tutorial is in the emp.txt file,
                      which is not generated by this pipeline, so we set this property
                      to true.

For more information about these JSON properties, see Azure Blob connector article.
2. Run the following command to create the Data Factory dataset.

New-AzureRmDataFactoryDataset $df -File .\InputDataset.json

Here is the sample output:

DatasetName : InputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Availability : Microsoft.Azure.Management.DataFactories.Common.Models.Availability
Location : Microsoft.Azure.Management.DataFactories.Models.AzureBlobDataset
Policy : Microsoft.Azure.Management.DataFactories.Common.Models.Policy
Structure : {FirstName, LastName}
Properties : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties
ProvisioningState : Succeeded
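If you want to double-check what was deployed, you can read the dataset definition back from the service. This is an optional sketch that reuses the $df object from the earlier steps:

# Sketch: read the InputDataset definition back from the data factory.
Get-AzureRmDataFactoryDataset $df -Name InputDataset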

Create an output dataset


In this part of the step, you create an output dataset named OutputDataset. This dataset points to a SQL table
in the Azure SQL database represented by AzureSqlLinkedService.
1. Create a JSON file named OutputDataset.json in the C:\ADFGetStartedPSH folder with the following
content:
{
"name": "OutputDataset",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY              DESCRIPTION
type                  The type property is set to AzureSqlTable because data is copied
                      to a table in an Azure SQL database.
linkedServiceName     Refers to the AzureSqlLinkedService that you created earlier.
tableName             Specifies the table to which the data is copied.
frequency/interval    The frequency is set to Hour and interval is 1, which means that
                      the output slices are produced hourly between the pipeline start
                      and end times, not before or after these times.

The emp table in the database has three columns: ID, FirstName, and LastName. ID is an
identity column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
2. Run the following command to create the data factory dataset.

New-AzureRmDataFactoryDataset $df -File .\OutputDataset.json

Here is the sample output:


DatasetName : OutputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Availability : Microsoft.Azure.Management.DataFactories.Common.Models.Availability
Location : Microsoft.Azure.Management.DataFactories.Models.AzureSqlTableDataset
Policy :
Structure : {FirstName, LastName}
Properties : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties
ProvisioningState : Succeeded

Create a pipeline
In this step, you create a pipeline with a copy activity that uses InputDataset as an input and
OutputDataset as an output.
Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
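As a quick sanity check on that arithmetic (a sketch only, not a tutorial step), the slice count is the window length divided by the interval:

# Sketch: 24 hours between start and end, divided by a 1-hour interval, gives 24 slices.
$windowStart = [datetime]"2017-05-11T00:00:00Z"
$windowEnd   = [datetime]"2017-05-12T00:00:00Z"
$intervalHours = 1    # matches "frequency": "Hour", "interval": 1
($windowEnd - $windowStart).TotalHours / $intervalHours    # 24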
1. Create a JSON file named ADFTutorialPipeline.json in the C:\ADFGetStartedPSH folder, with the
following content:

{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputDataset"
}
],
"outputs": [
{
"name": "OutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
Note the following points:
In the activities section, there is only one activity whose type is set to Copy. For more information
about the copy activity, see data movement activities. In Data Factory solutions, you can also use
data transformation activities.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is
specified as the sink type. For a complete list of data stores supported by the copy activity as
sources and sinks, see supported data stores. To learn how to use a specific supported data store
as a source/sink, click the link in the table.
Replace the value of the start property with the current day and end value with the next day.
You can specify only the date part and skip the time part of the date time. For example, "2016-
02-03", which is equivalent to "2016-02-03T00:00:00Z"
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The
end time is optional, but we use it in this tutorial.
If you do not specify value for the end property, it is calculated as "start + 48 hours". To run the
pipeline indefinitely, specify 9999-09-09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For
descriptions of JSON properties in a copy activity definition, see data movement activities. For
descriptions of JSON properties supported by BlobSource, see Azure Blob connector article. For
descriptions of JSON properties supported by SqlSink, see Azure SQL Database connector
article.
2. Run the following command to create the data factory pipeline.

New-AzureRmDataFactoryPipeline $df -File .\ADFTutorialPipeline.json

Here is the sample output:

PipelineName : ADFTutorialPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
Properties : Microsoft.Azure.Management.DataFactories.Models.PipelinePropertie
ProvisioningState : Succeeded
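If you ever need to stop slice processing temporarily, for example while you fix input data, the pipeline can be paused and resumed. This is an optional sketch; the cmdlets are part of the same AzureRM.DataFactories module used throughout this tutorial.

# Sketch: pause the pipeline, then resume it when you are ready to continue.
Suspend-AzureRmDataFactoryPipeline $df -Name ADFTutorialPipeline
Resume-AzureRmDataFactoryPipeline $df -Name ADFTutorialPipeline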

Congratulations! You have successfully created an Azure data factory with a pipeline to copy data from an
Azure blob storage to an Azure SQL database.

Monitor the pipeline


In this step, you use Azure PowerShell to monitor what's going on in an Azure data factory.
1. Replace <DataFactoryName> with the name of your data factory and run Get-AzureRmDataFactory,
and assign the output to a variable $df.

$df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name <DataFactoryName>

For example:
$df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name
ADFTutorialDataFactoryPSH0516

Then, print the contents of $df to see the following output:

PS C:\ADFGetStartedPSH> $df

DataFactoryName : ADFTutorialDataFactoryPSH0516
DataFactoryId : 6f194b34-03b3-49ab-8f03-9f8a7b9d3e30
ResourceGroupName : ADFTutorialResourceGroup
Location : West US
Tags : {}
Properties : Microsoft.Azure.Management.DataFactories.Models.DataFactoryProperties
ProvisioningState : Succeeded

2. Run Get-AzureRmDataFactorySlice to get details about all slices of the OutputDataset, which is the
output dataset of the pipeline.

Get-AzureRmDataFactorySlice $df -DatasetName OutputDataset -StartDateTime 2017-05-11T00:00:00Z

This setting should match the Start value in the pipeline JSON. You should see 24 slices, one for each
hour from 12 AM of the current day to 12 AM of the next day.
Here are three sample slices from the output:

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
Start : 5/11/2017 11:00:00 PM
End : 5/12/2017 12:00:00 AM
RetryCount : 0
State : Ready
SubState :
LatencyStatus :
LongRetryCount : 0

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
Start : 5/11/2017 9:00:00 PM
End : 5/11/2017 10:00:00 PM
RetryCount : 0
State : InProgress
SubState :
LatencyStatus :
LongRetryCount : 0

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
Start : 5/11/2017 8:00:00 PM
End : 5/11/2017 9:00:00 PM
RetryCount : 0
State : Waiting
SubState : ConcurrencyLimit
LatencyStatus :
LongRetryCount : 0

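To focus on slices that are not yet in the Ready state, you can filter the cmdlet output. This is a minimal
sketch that assumes the same dataset and start date-time used above:

Get-AzureRmDataFactorySlice $df -DatasetName OutputDataset -StartDateTime 2017-05-11T00:00:00Z |
    Where-Object { $_.State -ne "Ready" } |
    Select-Object Start, End, State, SubState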
3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice. Copy the
date-time value from the output of the previous command to specify the value for the StartDateTime
parameter.

Get-AzureRmDataFactoryRun $df -DatasetName OutputDataset -StartDateTime "5/11/2017 09:00:00 PM"

Here is the sample output:

Id : c0ddbd75-d0c7-4816-a775-
704bbd7c7eab_636301332000000000_636301368000000000_OutputDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : ADFTutorialDataFactoryPSH0516
DatasetName : OutputDataset
ProcessingStartTime : 5/16/2017 8:00:33 PM
ProcessingEndTime : 5/16/2017 8:01:36 PM
PercentComplete : 100
DataSliceStart : 5/11/2017 9:00:00 PM
DataSliceEnd : 5/11/2017 10:00:00 PM
Status : Succeeded
Timestamp : 5/16/2017 8:00:33 PM
RetryAttempt : 0
Properties : {}
ErrorMessage :
ActivityName : CopyFromBlobToSQL
PipelineName : ADFTutorialPipeline
Type : Copy

For comprehensive documentation on Data Factory cmdlets, see Data Factory Cmdlet Reference.
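To confirm that the copied rows landed in the Azure SQL database, you can also query the emp table from
PowerShell. The following is only a sketch; it assumes the SqlServer module's Invoke-Sqlcmd cmdlet is
available and uses placeholder values for the server, database, and credentials:

Invoke-Sqlcmd -ServerInstance "<servername>.database.windows.net" -Database "<databasename>" `
    -Username "<username>" -Password "<password>" `
    -Query "SELECT COUNT(*) AS RowsCopied FROM emp"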

Summary
In this tutorial, you created an Azure data factory to copy data from an Azure blob to an Azure SQL database.
You used PowerShell to create the data factory, linked services, datasets, and a pipeline. Here are the high-
level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure storage account that holds input data.
b. An Azure SQL linked service to link your SQL database that holds the output data.
3. Created datasets that describe input data and output data for pipelines.
4. Created a pipeline with Copy Activity, with BlobSource as the source and SqlSink as the sink.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:

CATEGORY      DATA STORES

Azure         Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage

Databases     Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*

NoSQL         Cassandra*, MongoDB*

File          Amazon S3, File System*, FTP, HDFS*, SFTP

Others        Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian*

The connector article for each data store states whether that store is supported as a source, as a sink, or both.

To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Use Azure Resource Manager template to
create a Data Factory pipeline to copy data
7/10/2017 13 min to read Edit Online

This tutorial shows you how to use an Azure Resource Manager template to create an Azure data factory. The
data pipeline in this tutorial copies data from a source data store to a destination data store. It does not transform
input data to produce output data. For a tutorial on how to transform data using Azure Data Factory, see Tutorial:
Build a pipeline to transform data using Hadoop cluster.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see
supported data stores. The activity is powered by a globally available service that can copy data between various
data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data
Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.

NOTE
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
Go through Tutorial Overview and Prerequisites and complete the prerequisite steps.
Follow the instructions in the How to install and configure Azure PowerShell article to install the latest version of
Azure PowerShell on your computer. In this tutorial, you use PowerShell to deploy Data Factory entities.
(optional) See Authoring Azure Resource Manager Templates to learn about Azure Resource Manager
templates.

In this tutorial
In this tutorial, you create a data factory with the following Data Factory entities:

ENTITY DESCRIPTION

Azure Storage linked service Links your Azure Storage account to the data factory. Azure
Storage is the source data store and Azure SQL database is
the sink data store for the copy activity in the tutorial. It
specifies the storage account that contains the input data for
the copy activity.

Azure SQL Database linked service Links your Azure SQL database to the data factory. It
specifies the Azure SQL database that holds the output data
for the copy activity.
ENTITY DESCRIPTION

Azure Blob input dataset Refers to the Azure Storage linked service. The linked service
refers to an Azure Storage account and the Azure Blob
dataset specifies the container, folder, and file name in the
storage that holds the input data.

Azure SQL output dataset Refers to the Azure SQL linked service. The Azure SQL linked
service refers to an Azure SQL server and the Azure SQL
dataset specifies the name of the table that holds the output
data.

Data pipeline The pipeline has one activity of type Copy that takes the
Azure blob dataset as an input and the Azure SQL dataset as
an output. The copy activity copies data from an Azure blob
to a table in the Azure SQL database.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two
types of activities: data movement activities and data transformation activities. In this tutorial, you create a
pipeline with one activity (copy activity).

The following section provides the complete Resource Manager template for defining Data Factory entities so
that you can quickly run through the tutorial and test the template. To understand how each Data Factory entity
is defined, see Data Factory entities in the template section.

Data Factory JSON template


The top-level Resource Manager template for defining a data factory is:
{
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": { ...
},
"variables": { ...
},
"resources": [
{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "[variables('apiVersion')]",
"type": "Microsoft.DataFactory/datafactories",
"location": "westus",
"resources": [
{ ... },
{ ... },
{ ... },
{ ... }
]
}
]
}

Create a JSON file named ADFCopyTutorialARM.json in C:\ADFGetStarted folder with the following content:

{
"contentVersion": "1.0.0.0",
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage
account that contains the data to be copied." } },
"storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure storage
account." } },
"sourceBlobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in
the Azure Storage account." } },
"sourceBlobName": { "type": "string", "metadata": { "description": "Name of the blob in the container
that has the data to be copied to Azure SQL Database table" } },
"sqlServerName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Server that
will hold the output/copied data." } },
"databaseName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Database in
the Azure SQL server." } },
"sqlServerUserName": { "type": "string", "metadata": { "description": "Name of the user that has
access to the Azure SQL server." } },
"sqlServerPassword": { "type": "securestring", "metadata": { "description": "Password for the user." }
},
"targetSQLTable": { "type": "string", "metadata": { "description": "Table in the Azure SQL Database
that will hold the copied data." }
}
},
"variables": {
"dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]",
"azureSqlLinkedServiceName": "AzureSqlLinkedService",
"azureStorageLinkedServiceName": "AzureStorageLinkedService",
"blobInputDatasetName": "BlobInputDataset",
"sqlOutputDatasetName": "SQLOutputDataset",
"pipelineName": "Blob2SQLPipeline"
},
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US",
"resources": [
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parame
ters('storageAccountKey'))]"
}
}
},
{
"type": "linkedservices",
"name": "[variables('azureSqlLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlDatabase",
"description": "Azure SQL linked service",
"typeProperties": {
"connectionString": "
[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=',
parameters('databaseName'), ';User
ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False
;Encrypt=True;Connection Timeout=30')]"
}
}
},
{
"type": "datasets",
"name": "[variables('blobInputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"structure": [
{
"name": "Column0",
"type": "String"
},
{
"name": "Column1",
"type": "String"
}
],
"typeProperties": {
"folderPath": "[concat(parameters('sourceBlobContainer'), '/')]",
"fileName": "[parameters('sourceBlobName')]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}
},
{
"type": "datasets",
"name": "[variables('sqlOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureSqlLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "[variables('azureSqlLinkedServiceName')]",
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"typeProperties": {
"tableName": "[parameters('targetSQLTable')]"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
},
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('azureSqlLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('sqlOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"activities": [
{
"name": "CopyFromAzureBlobToAzureSQL",
"description": "Copy data frm Azure blob to Azure SQL",
"type": "Copy",
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('sqlOutputDatasetName')]"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "Column0:FirstName,Column1:LastName"
"columnMappings": "Column0:FirstName,Column1:LastName"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
]
}
]
}

Parameters JSON
Create a JSON file named ADFCopyTutorialARM-Parameters.json that contains parameters for the Azure
Resource Manager template.

IMPORTANT
Specify name and key of your Azure Storage account for storageAccountName and storageAccountKey parameters.
Specify Azure SQL server, database, user, and password for sqlServerName, databaseName, sqlServerUserName, and
sqlServerPassword parameters.

{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"storageAccountName": { "value": "<Name of the Azure storage account>" },
"storageAccountKey": {
"value": "<Key for the Azure storage account>"
},
"sourceBlobContainer": { "value": "adftutorial" },
"sourceBlobName": { "value": "emp.txt" },
"sqlServerName": { "value": "<Name of the Azure SQL server>" },
"databaseName": { "value": "<Name of the Azure SQL database>" },
"sqlServerUserName": { "value": "<Name of the user who has access to the Azure SQL database>" },
"sqlServerPassword": { "value": "<password for the user>" },
"targetSQLTable": { "value": "emp" }
}
}

IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you can use
with the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory
entities in these environments.

Create data factory


1. Start Azure PowerShell and do the following steps:
Run the following command and enter the user name and password that you use to sign in to the
Azure portal.

Login-AzureRmAccount

Run the following command to view all the subscriptions for this account.

Get-AzureRmSubscription

Run the following command to select the subscription that you want to work with.

Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext

2. Run the following command to deploy Data Factory entities using the Resource Manager template you
created in Step 1.

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFCopyTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFCopyTutorialARM-Parameters.json

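Before you move on to monitoring, you can optionally confirm that the deployment succeeded from
PowerShell. This is a minimal sketch that assumes the deployment name and resource group used above:

Get-AzureRmResourceGroupDeployment -ResourceGroupName ADFTutorialResourceGroup -Name MyARMDeployment |
    Select-Object DeploymentName, ProvisioningState, Timestamp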
Monitor pipeline
1. Log in to the Azure portal using your Azure account.
2. Click Data factories on the left menu (or) click More services and click Data factories under
INTELLIGENCE + ANALYTICS category.
3. In the Data factories page, search for and find your data factory (AzureBlobToAzureSQLDatabaseDF).

4. Click your Azure data factory. You see the home page for the data factory.
5. Follow instructions from Monitor datasets and pipeline to monitor the pipeline and datasets you have created
in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines.
6. When a slice is in the Ready state, verify that the data is copied to the emp table in the Azure SQL database.
For more information on how to use Azure portal blades to monitor pipeline and datasets you have created in
this tutorial, see Monitor datasets and pipeline .
For more information on how to use the Monitor & Manage application to monitor your data pipelines, see
Monitor and manage Azure Data Factory pipelines using Monitoring App.
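If you prefer PowerShell over the portal for this check, the following sketch (an optional addition, not part
of the original steps; it assumes the AzureRM Data Factory cmdlets are installed and that the template's
start date is 2017-05-11) lists the output slices of the generated data factory:

# Find the generated data factory (its name starts with AzureBlobToAzureSQLDatabaseDF) and list its output slices
$dfArm = Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup |
    Where-Object { $_.DataFactoryName -like "AzureBlobToAzureSQLDatabaseDF*" }
Get-AzureRmDataFactorySlice $dfArm -DatasetName SQLOutputDataset -StartDateTime 2017-05-11T00:00:00Z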

Data Factory entities in the template


Define data factory
You define a data factory in the Resource Manager template as shown in the following sample:

"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US"
}

The dataFactoryName is defined as:

"dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]"

It is a unique string based on the resource group ID.


Defining Data Factory entities
The following Data Factory entities are defined in the JSON template:
1. Azure Storage linked service
2. Azure SQL linked service
3. Azure blob dataset
4. Azure SQL dataset
5. Data pipeline with a copy activity
Azure Storage linked service
The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and
uploaded data to this storage account as part of prerequisites. You specify the name and key of your Azure
storage account in this section. See Azure Storage linked service for details about JSON properties used to define
an Azure Storage linked service.

{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parame
ters('storageAccountKey'))]"
}
}
}

The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these
parameters are passed by using a configuration file. The definition also uses the azureStorageLinkedServiceName
and dataFactoryName variables defined in the template.
Azure SQL Database linked service
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob
storage is stored in this database. You created the emp table in this database as part of prerequisites. You specify
the Azure SQL server name, database name, user name, and user password in this section. See Azure SQL linked
service for details about JSON properties used to define an Azure SQL linked service.
{
"type": "linkedservices",
"name": "[variables('azureSqlLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlDatabase",
"description": "Azure SQL linked service",
"typeProperties": {
"connectionString": "
[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=',
parameters('databaseName'), ';User
ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False
;Encrypt=True;Connection Timeout=30')]"
}
}
}

The connectionString uses sqlServerName, databaseName, sqlServerUserName, and sqlServerPassword


parameters whose values are passed by using a configuration file. The definition also uses the following variables
from the template: azureSqlLinkedServiceName, dataFactoryName.
Azure blob dataset
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. In Azure blob dataset definition, you specify names of blob container,
folder, and file that contains the input data. See Azure Blob dataset properties for details about JSON properties
used to define an Azure Blob dataset.
{
"type": "datasets",
"name": "[variables('blobInputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"structure": [
{
"name": "Column0",
"type": "String"
},
{
"name": "Column1",
"type": "String"
}
],
"typeProperties": {
"folderPath": "[concat(parameters('sourceBlobContainer'), '/')]",
"fileName": "[parameters('sourceBlobName')]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}

Azure SQL dataset


You specify the name of the table in the Azure SQL database that holds the copied data from the Azure Blob
storage. See Azure SQL dataset properties for details about JSON properties used to define an Azure SQL
dataset.
{
"type": "datasets",
"name": "[variables('sqlOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureSqlLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "[variables('azureSqlLinkedServiceName')]",
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"typeProperties": {
"tableName": "[parameters('targetSQLTable')]"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

Data pipeline
You define a pipeline that copies data from the Azure blob dataset to the Azure SQL dataset. See Pipeline JSON
for descriptions of JSON elements used to define a pipeline in this example.
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('azureSqlLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('sqlOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"activities": [
{
"name": "CopyFromAzureBlobToAzureSQL",
"description": "Copy data frm Azure blob to Azure SQL",
"type": "Copy",
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('sqlOutputDatasetName')]"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "Column0:FirstName,Column1:LastName"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 3,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}

Reuse the template


In the tutorial, you created a template for defining Data Factory entities and a template for passing values for
parameters. The pipeline copies data from an Azure Storage account to an Azure SQL database specified via
parameters. To use the same template to deploy Data Factory entities to different environments, you create a
parameter file for each environment and use it when deploying to that environment.
Example:
New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Dev.json

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Test.json

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Production.json

Notice that the first command uses the parameter file for the development environment, the second one for the
test environment, and the third one for the production environment.
You can also reuse the template to perform repeated tasks. For example, suppose you need to create many data
factories with one or more pipelines that implement the same logic, but each data factory uses different Storage
and SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or
production) with different parameter files to create the data factories.
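Because the deployment commands differ only in the parameter file name, you can drive them from a small
script. The following is only a sketch and assumes the parameter files follow the -Dev/-Test/-Production
naming shown earlier:

# Deploy the same Data Factory template once per environment
"Dev", "Test", "Production" | ForEach-Object {
    New-AzureRmResourceGroupDeployment -Name "MyARMDeployment-$_" `
        -ResourceGroupName ADFTutorialResourceGroup `
        -TemplateFile .\ADFCopyTutorialARM.json `
        -TemplateParameterFile ".\ADFCopyTutorialARM-Parameters-$($_).json"
}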

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:

CATEGORY      DATA STORES

Azure         Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage

Databases     Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*

NoSQL         Cassandra*, MongoDB*

File          Amazon S3, File System*, FTP, HDFS*, SFTP

Others        Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian*

The connector article for each data store states whether that store is supported as a source, as a sink, or both.

To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Use REST API to create an Azure Data
Factory pipeline to copy data
8/21/2017 17 min to read Edit Online

In this article, you learn how to use REST API to create a data factory with a pipeline that copies data from an
Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see
supported data stores. The activity is powered by a globally available service that can copy data between various
data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data
Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.

NOTE
This article does not cover all the Data Factory REST API. See Data Factory REST API Reference for comprehensive
documentation on Data Factory cmdlets.
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
Go through Tutorial Overview and complete the prerequisite steps.
Install Curl on your machine. You use the Curl tool with REST commands to create a data factory.
Follow instructions from this article to:
1. Create a Web application named ADFCopyTutorialApp in Azure Active Directory.
2. Get client ID and secret key.
3. Get tenant ID.
4. Assign the ADFCopyTutorialApp application to the Data Factory Contributor role.
Install Azure PowerShell.
Launch PowerShell and do the following steps. Keep Azure PowerShell open until the end of this tutorial.
If you close and reopen, you need to run the commands again.
1. Run the following command and enter the user name and password that you use to sign in to the
Azure portal:

Login-AzureRmAccount

2. Run the following command to view all the subscriptions for this account:

Get-AzureRmSubscription
3. Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription.

Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext

4. Create an Azure resource group named ADFTutorialResourceGroup by running the following


command in the PowerShell:

New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

If the resource group already exists, you specify whether to update it (Y) or keep it as (N).
Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of
your resource group in place of ADFTutorialResourceGroup in this tutorial.

Create JSON definitions


Create following JSON files in the folder where curl.exe is located.
datafactory.json

IMPORTANT
Name must be globally unique, so you may want to prefix/suffix ADFCopyTutorialDF to make it a unique name.

{
"name": "ADFCopyTutorialDF",
"location": "WestUS"
}

azurestoragelinkedservice.json

IMPORTANT
Replace accountname and accountkey with name and key of your Azure storage account. To learn how to get your
storage access key, see View, copy and regenerate storage access keys.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

For details about JSON properties, see Azure Storage linked service.
azuresqllinkedservice.json
IMPORTANT
Replace servername, databasename, username, and password with name of your Azure SQL server, name of SQL
database, user account, and password for the account.

{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect
Timeout=30"
}
}
}

For details about JSON properties, see Azure SQL linked service.
inputdataset.json

{
"name": "AzureBlobInput",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/",
"fileName": "emp.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY                DESCRIPTION

type                    The type property is set to AzureBlob because data resides in an Azure blob storage.

linkedServiceName       Refers to the AzureStorageLinkedService that you created earlier.

folderPath              Specifies the blob container and the folder that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder.

fileName                This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing.

format -> type          The input file is in the text format, so we use TextFormat.

columnDelimiter         The columns in the input file are delimited by comma character ( , ).

frequency/interval      The frequency is set to Hour and interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times.

external                This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true.

For more information about these JSON properties, see Azure Blob connector article.
outputdataset.json

{
"name": "AzureSqlOutput",
"properties": {
"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "emp"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY                DESCRIPTION

type                    The type property is set to AzureSqlTable because data is copied to a table in an Azure SQL database.

linkedServiceName       Refers to the AzureSqlLinkedService that you created earlier.

tableName               Specifies the table to which the data is copied.

frequency/interval      The frequency is set to Hour and interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times.

There are three columns ID, FirstName, and LastName in the emp table in the database. ID is an identity
column, so you need to specify only FirstName and LastName here.
For more information about these JSON properties, see Azure SQL connector article.
pipeline.json

{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from a blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"description": "Push Regional Effectiveness Campaign data to Azure SQL database",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "60:00:00"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-05-11T00:00:00Z",
"end": "2017-05-12T00:00:00Z"
}
}
Note the following points:
In the activities section, there is only one activity whose type is set to Copy. For more information about the
copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation
activities.
Input for the activity is set to AzureBlobInput and output for the activity is set to AzureSqlOutput.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink
type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data
stores. To learn how to use a specific supported data store as a source/sink, click the link in the table.
Replace the value of the start property with the current day and the end value with the next day. You can specify
only the date part and skip the time part of the date time. For example, "2017-02-03", which is equivalent to
"2017-02-03T00:00:00Z".
Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The end time is
optional, but we use it in this tutorial.
If you do not specify a value for the end property, it is calculated as "start + 48 hours". To run the pipeline
indefinitely, specify 9999-09-09 as the value for the end property.
In the preceding example, there are 24 data slices as each data slice is produced hourly.
For descriptions of JSON properties in a pipeline definition, see create pipelines article. For descriptions of JSON
properties in a copy activity definition, see data movement activities. For descriptions of JSON properties
supported by BlobSource, see Azure Blob connector article. For descriptions of JSON properties supported by
SqlSink, see Azure SQL Database connector article.

Set global variables


In Azure PowerShell, execute the following commands after replacing the values with your own:

IMPORTANT
See Prerequisites section for instructions on getting client ID, client secret, tenant ID, and subscription ID.

$client_id = "<client ID of application in AAD>"


$client_secret = "<client key of application in AAD>"
$tenant = "<Azure tenant ID>";
$subscription_id="<Azure subscription ID>";

$rg = "ADFTutorialResourceGroup"

Run the following command after updating the name of the data factory you are using:

$adf = "ADFCopyTutorialDF"

Authenticate with AAD


Run the following command to authenticate with Azure Active Directory (AAD):
$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F
grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F
client_secret=$client_secret };
$responseToken = Invoke-Command -scriptblock $cmd;
$accessToken = (ConvertFrom-Json $responseToken).access_token;

(ConvertFrom-Json $responseToken)
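A quick sanity check (an optional addition, not part of the original steps) helps catch authentication
problems before you start calling the REST API:

# Stop early if no access token came back from Azure AD
if ([string]::IsNullOrEmpty($accessToken)) {
    throw "Authentication failed. Verify the client ID, client secret, and tenant ID."
}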

Create data factory


In this step, you create an Azure Data Factory named ADFCopyTutorialDF. A data factory can have one or more
pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity copies data from a source
to a destination data store, and an HDInsight Hive activity runs a Hive script to transform input data to produce
output data. Run the following commands to create the data factory:
1. Assign the command to variable named cmd.

IMPORTANT
Confirm that the name of the data factory you specify here (ADFCopyTutorialDF) matches the name specified in the
datafactory.json.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json"


--data @datafactory.json
https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.Dat
aFactory/datafactories/ADFCopyTutorialDF?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in
the results; otherwise, you see an error message.

Write-Host $results

Note the following points:


The name of the Azure Data Factory must be globally unique. If you see the error in results: Data factory
name ADFCopyTutorialDF is not available, do the following steps:
1. Change the name (for example, yournameADFCopyTutorialDF) in the datafactory.json file.
2. In the first command where the $cmd variable is assigned a value, replace ADFCopyTutorialDF with the
new name and run the command.
3. Run the next two commands to invoke the REST API to create the data factory and print the results
of the operation.
See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.
If you receive the error: "This subscription is not registered to use namespace
Microsoft.DataFactory", do one of the following and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider:

Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

You can run the following command to confirm that the Data Factory provider is registered.

Get-AzureRmResourceProvider

Log in to the Azure portal using the Azure subscription and navigate to a Data Factory blade, or
create a data factory in the Azure portal. This action automatically registers the provider for you.
Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to
link source and destination data stores to your data store. Then, define input and output datasets to represent
data in linked data stores. Finally, create the pipeline with an activity that uses these datasets.

Create linked services


You create linked services in a data factory to link your data stores and compute services to the data factory. In
this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use
two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two
linked services named AzureStorageLinkedService and AzureSqlLinkedService of types: AzureStorage and
AzureSqlDatabase.
The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the
one in which you created a container and uploaded the data as part of prerequisites.
AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob
storage is stored in this database. You created the emp table in this database as part of prerequisites.
Create Azure Storage linked service
In this step, you link your Azure storage account to your data factory. You specify the name and key of your Azure
storage account in this section. See Azure Storage linked service for details about JSON properties used to define
an Azure Storage linked service.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json"


--data "@azurestoragelinkedservice.json"
https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.Dat
aFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.

Write-Host $results

Create Azure SQL linked service


In this step, you link your Azure SQL database to your data factory. You specify the Azure SQL server name,
database name, user name, and user password in this section. See Azure SQL linked service for details about
JSON properties used to define an Azure SQL linked service.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json"


--data @azuresqllinkedservice.json
https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.Dat
aFactory/datafactories/$adf/linkedservices/AzureSqlLinkedService?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.

Write-Host $results

Create datasets
In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to
your data factory. In this step, you define two datasets named AzureBlobInput and AzureSqlOutput that represent
input and output data that is stored in the data stores referred by AzureStorageLinkedService and
AzureSqlLinkedService respectively.
The Azure storage linked service specifies the connection string that Data Factory service uses at run time to
connect to your Azure storage account. And, the input blob dataset (AzureBlobInput) specifies the container and
the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at
run time to connect to your Azure SQL database. And, the output SQL table dataset (AzureSqlOutput) specifies the
table in the database to which the data from the blob storage is copied.
Create input dataset
In this step, you create a dataset named AzureBlobInput that points to a blob file (emp.txt) in the root folder of a
blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If
you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the
destination. In this tutorial, you specify a value for the fileName.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json"


--data "@inputdataset.json"
https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.Dat
aFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.

Write-Host $results

Create output dataset


The Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time
to connect to your Azure SQL database. The output SQL table dataset (AzureSqlOutput) you create in this step
specifies the table in the database to which the data from the blob storage is copied.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json"


--data "@outputdataset.json"
https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.Dat
aFactory/datafactories/$adf/datasets/AzureSqlOutput?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.

Write-Host $results

Create pipeline
In this step, you create a pipeline with a copy activity that uses AzureBlobInput as an input and
AzureSqlOutput as an output.
Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a
slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours.
Therefore, 24 slices of output dataset are produced by the pipeline.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json"


--data "@pipeline.json"
https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.Dat
aFactory/datafactories/$adf/datapipelines/ADFTutorialPipeline?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the pipeline has been successfully created, you see the JSON for the pipeline in the
results; otherwise, you see an error message.

Write-Host $results

Congratulations! You have successfully created an Azure data factory, with a pipeline that copies data from
Azure Blob Storage to Azure SQL database.

Monitor pipeline
In this step, you use Data Factory REST API to monitor slices being produced by the pipeline.

$ds ="AzureSqlOutput"

IMPORTANT
Make sure that the start and end times specified in the following command match the start and end times of the pipeline.

$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken"


https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactor
y/datafactories/$adf/datasets/$ds/slices?start=2017-05-11T00%3a00%3a00.0000000Z"&"end=2017-05-
12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01};

$results2 = Invoke-Command -scriptblock $cmd;

IF ((ConvertFrom-Json $results2).value -ne $NULL) {


ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table
} else {
(convertFrom-Json $results2).RemoteException
}

Run the Invoke-Command and the next one until you see a slice in Ready state or Failed state. When the slice is
in Ready state, check the emp table in your Azure SQL database for the output data.
For each slice, two rows of data from the source file are copied to the emp table in the Azure SQL database.
Therefore, you see 24 new records in the emp table when all the slices are successfully processed (in Ready
state).
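Instead of rerunning the two commands by hand, you can poll until every slice reaches a terminal state.
This is only a sketch, and it assumes $cmd is defined as shown above:

# Poll the dataset slices every 30 seconds until none are left in a non-terminal state
do {
    $results2 = Invoke-Command -scriptblock $cmd
    $slices = (ConvertFrom-Json $results2).value
    $slices | Format-Table Start, End, State
    Start-Sleep -Seconds 30
} while ($slices | Where-Object { $_.State -notin @("Ready", "Failed") })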

Summary
In this tutorial, you used REST API to create an Azure data factory to copy data from an Azure blob to an Azure
SQL database. Here are the high-level steps you performed in this tutorial:
1. Created an Azure data factory.
2. Created linked services:
a. An Azure Storage linked service to link your Azure Storage account that holds input data.
b. An Azure SQL linked service to link your Azure SQL database that holds the output data.
3. Created datasets, which describe input data and output data for pipelines.
4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.

Next steps
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:
CATEGORY      DATA STORES

Azure         Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse, Azure Search Index, Azure Table storage

Databases     Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*, SQL Server*, Sybase*, Teradata*

NoSQL         Cassandra*, MongoDB*

File          Amazon S3, File System*, FTP, HDFS*, SFTP

Others        Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian*

The connector article for each data store states whether that store is supported as a source, as a sink, or both.

To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Create a pipeline with Copy Activity using
.NET API
7/11/2017 14 min to read Edit Online

In this article, you learn how to use .NET API to create a data factory with a pipeline that copies data from an
Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the
Introduction to Azure Data Factory article before doing this tutorial.
In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a
supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see
supported data stores. The activity is powered by a globally available service that can copy data between various
data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data
Movement Activities.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by
setting the output dataset of one activity as the input dataset of the other activity. For more information, see
multiple activities in a pipeline.

NOTE
For complete documentation on .NET API for Data Factory, see Data Factory .NET API Reference.
The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to
transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites
Go through Tutorial Overview and Pre-requisites to get an overview of the tutorial and complete the
prerequisite steps.
Visual Studio 2012 or 2013 or 2015
Download and install Azure .NET SDK
Azure PowerShell. Follow instructions in How to install and configure Azure PowerShell article to install Azure
PowerShell on your computer. You use Azure PowerShell to create an Azure Active Directory application.
Create an application in Azure Active Directory
Create an Azure Active Directory application, create a service principal for the application, and assign it to the
Data Factory Contributor role.
1. Launch PowerShell.
2. Run the following command and enter the user name and password that you use to sign in to the Azure
portal.

Login-AzureRmAccount

3. Run the following command to view all the subscriptions for this account.

Get-AzureRmSubscription
4. Run the following command to select the subscription that you want to work with. Replace
<NameOfAzureSubscription> with the name of your Azure subscription.

Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext

IMPORTANT
Note down SubscriptionId and TenantId from the output of this command.

5. Create an Azure resource group named ADFTutorialResourceGroup by running the following command
in the PowerShell.

New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

If the resource group already exists, you specify whether to update it (Y) or keep it as (N).
If you use a different resource group, you need to use the name of your resource group in place of
ADFTutorialResourceGroup in this tutorial.
6. Create an Azure Active Directory application.

$azureAdApplication = New-AzureRmADApplication -DisplayName "ADFCopyTutorialApp" -HomePage


"https://www.contoso.org" -IdentifierUris "https://www.adfcopytutorialapp.org/example" -Password
"Pass@word1"

If you get the following error, specify a different URL and run the command again.

Another object with the same value for property identifierUris already exists.

7. Create the AD service principal.

New-AzureRmADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId

8. Add service principal to the Data Factory Contributor role.

New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ServicePrincipalName


$azureAdApplication.ApplicationId.Guid

9. Get the application ID.

$azureAdApplication

Note down the application ID (applicationID) from the output.


You should have following four values from these steps:
Tenant ID
Subscription ID
Application ID
Password (specified in the first command)
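As an optional convenience (not part of the original steps), you can capture the application ID in a variable
so it is easy to paste into App.config later in the walkthrough:

# The application (client) ID created above; the tenant ID and subscription ID were noted in step 4
$applicationId = $azureAdApplication.ApplicationId.Guid
Write-Host "Application ID: $applicationId"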
Walkthrough
1. Using Visual Studio 2012/2013/2015, create a C# .NET console application.
a. Launch Visual Studio 2012/2013/2015.
b. Click File, point to New, and click Project.
c. Expand Templates, and select Visual C#. In this walkthrough, you use C#, but you can use any .NET
language.
d. Select Console Application from the list of project types on the right.
e. Enter DataFactoryAPITestApp for the Name.
f. Select C:\ADFGetStarted for the Location.
g. Click OK to create the project.
2. Click Tools, point to NuGet Package Manager, and click Package Manager Console.
3. In the Package Manager Console, do the following steps:
a. Run the following command to install Data Factory package:
Install-Package Microsoft.Azure.Management.DataFactories
b. Run the following command to install Azure Active Directory package (you use Active Directory API in
the code): Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.19.208020213
4. Add the following appSettings section to the App.config file. These settings are used by the helper
method: GetAuthorizationHeader.
Replace values for <Application ID>, <Password>, <Subscription ID>, and <tenant ID> with your
own values.

<?xml version="1.0" encoding="utf-8" ?>


<configuration>
<appSettings>
<add key="ActiveDirectoryEndpoint" value="https://login.microsoftonline.com/" />
<add key="ResourceManagerEndpoint" value="https://management.azure.com/" />
<add key="WindowsManagementUri" value="https://management.core.windows.net/" />

<add key="ApplicationId" value="your application ID" />


<add key="Password" value="Password you used while creating the AAD application" />
<add key="SubscriptionId" value= "Subscription ID" />
<add key="ActiveDirectoryTenantId" value="Tenant ID" />
</appSettings>
</configuration>

5. Add the following using statements to the source file (Program.cs) in the project.

using System.Configuration;
using System.Collections.ObjectModel;
using System.Threading;
using System.Threading.Tasks;

using Microsoft.Azure;
using Microsoft.Azure.Management.DataFactories;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Common.Models;

using Microsoft.IdentityModel.Clients.ActiveDirectory;

6. Add the following code that creates an instance of the DataFactoryManagementClient class to the Main
method. You use this object to create a data factory, a linked service, input and output datasets, and a
pipeline. You also use this object to monitor slices of a dataset at runtime.
// create data factory management client
string resourceGroupName = "ADFTutorialResourceGroup";
string dataFactoryName = "APITutorialFactory";

TokenCloudCredentials aadTokenCredentials = new TokenCloudCredentials(


ConfigurationManager.AppSettings["SubscriptionId"],
GetAuthorizationHeader().Result);

Uri resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]);

DataFactoryManagementClient client = new DataFactoryManagementClient(aadTokenCredentials,


resourceManagerUri);

IMPORTANT
Replace the value of resourceGroupName with the name of your Azure resource group.
Update name of the data factory (dataFactoryName) to be unique. Name of the data factory must be globally
unique. See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

7. Add the following code that creates a data factory to the Main method.

// create a data factory


Console.WriteLine("Creating a data factory");
client.DataFactories.CreateOrUpdate(resourceGroupName,
new DataFactoryCreateOrUpdateParameters()
{
DataFactory = new DataFactory()
{
Name = dataFactoryName,
Location = "westus",
Properties = new DataFactoryProperties()
}
}
);

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For
example, a Copy Activity to copy data from a source to a destination data store and an HDInsight Hive
activity to run a Hive script to transform input data to produce output data. Let's start with creating the
data factory in this step.
8. Add the following code that creates an Azure Storage linked service to the Main method.

IMPORTANT
Replace storageaccountname and accountkey with the name and key of your Azure Storage account.
// create a linked service for input data store: Azure Storage
Console.WriteLine("Creating Azure Storage linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
new LinkedServiceCreateOrUpdateParameters()
{
LinkedService = new LinkedService()
{
Name = "AzureStorageLinkedService",
Properties = new LinkedServiceProperties
(
new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=
<storageaccountname>;AccountKey=<accountkey>")
)
}
}
);

You create linked services in a data factory to link your data stores and compute services to the data
factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake
Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination).
Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService
of types: AzureStorage and AzureSqlDatabase.
The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account
is the one in which you created a container and uploaded the data as part of prerequisites.
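Hard-coding the storage connection string keeps the tutorial short. As a variation (a sketch, assuming you add a hypothetical StorageConnectionString key to the appSettings section you created earlier), you could read it with ConfigurationManager instead:

// Hypothetical appSettings entry (not part of the tutorial's App.config):
//   <add key="StorageConnectionString" value="DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<accountkey>" />
string storageConnectionString = ConfigurationManager.AppSettings["StorageConnectionString"];

// Build the linked service properties from the configured value instead of an inline literal.
var storageLinkedServiceProperties = new LinkedServiceProperties(
    new AzureStorageLinkedService(storageConnectionString));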
9. Add the following code that creates an Azure SQL linked service to the Main method.

IMPORTANT
Replace servername, databasename, username, and password with the name of your Azure SQL server, the
database name, the user name, and the password.

// create a linked service for output data store: Azure SQL Database
Console.WriteLine("Creating Azure SQL Database linked service");
client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName,
new LinkedServiceCreateOrUpdateParameters()
{
LinkedService = new LinkedService()
{
Name = "AzureSqlLinkedService",
Properties = new LinkedServiceProperties
(
new AzureSqlDatabaseLinkedService("Data Source=tcp:
<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=
<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30")
)
}
}
);

AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the
blob storage is stored in this database. You created the emp table in this database as part of prerequisites.
10. Add the following code that creates input and output datasets to the Main method.

// create input and output datasets


Console.WriteLine("Creating input and output datasets");
string Dataset_Source = "InputDataset";
string Dataset_Destination = "OutputDataset";
string Dataset_Destination = "OutputDataset";

Console.WriteLine("Creating input dataset of type: Azure Blob");


client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,

new DatasetCreateOrUpdateParameters()
{
Dataset = new Dataset()
{
Name = Dataset_Source,
Properties = new DatasetProperties()
{
Structure = new List<DataElement>()
{
new DataElement() { Name = "FirstName", Type = "String" },
new DataElement() { Name = "LastName", Type = "String" }
},
LinkedServiceName = "AzureStorageLinkedService",
TypeProperties = new AzureBlobDataset()
{
FolderPath = "adftutorial/",
FileName = "emp.txt"
},
External = true,
Availability = new Availability()
{
Frequency = SchedulePeriod.Hour,
Interval = 1,
},

Policy = new Policy()


{
Validation = new ValidationPolicy()
{
MinimumRows = 1
}
}
}
}
});

Console.WriteLine("Creating output dataset of type: Azure SQL");


client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,
new DatasetCreateOrUpdateParameters()
{
Dataset = new Dataset()
{
Name = Dataset_Destination,
Properties = new DatasetProperties()
{
Structure = new List<DataElement>()
{
new DataElement() { Name = "FirstName", Type = "String" },
new DataElement() { Name = "LastName", Type = "String" }
},
LinkedServiceName = "AzureSqlLinkedService",
TypeProperties = new AzureSqlTableDataset()
{
TableName = "emp"
},
Availability = new Availability()
{
Frequency = SchedulePeriod.Hour,
Interval = 1,
},
}
}
});
In the previous step, you created linked services to link your Azure Storage account and Azure SQL
database to your data factory. In this step, you define two datasets named InputDataset and
OutputDataset that represent input and output data that is stored in the data stores referred by
AzureStorageLinkedService and AzureSqlLinkedService respectively.
The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time
to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the
container and the folder that contains the input data.
Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service
uses at run time to connect to your Azure SQL database. And, the output SQL table dataset (OutputDataset)
specifies the table in the database to which the data from the blob storage is copied.
In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder
of a blob container (adftutorial) in the Azure Storage account represented by the AzureStorageLinkedService
linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input
folder is copied to the destination. In this tutorial, you specify a value for the fileName.
In this step, you also create an output dataset named OutputDataset. This dataset points to a SQL table in the
Azure SQL database represented by AzureSqlLinkedService.
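As noted above, omitting fileName makes the copy pick up every blob under the folder. A minimal sketch of that variation, reusing the AzureBlobDataset type from the snippet in this step:

// Variation sketch: FolderPath only, no FileName, so all blobs under adftutorial/ become input.
var folderScopedBlobProperties = new AzureBlobDataset()
{
    FolderPath = "adftutorial/"
    // FileName intentionally omitted
};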
11. Add the following code that creates and activates a pipeline to the Main method. In this step, you
create a pipeline with a copy activity that uses InputDataset as an input and OutputDataset as an
output.
// create a pipeline
Console.WriteLine("Creating a pipeline");
DateTime PipelineActivePeriodStartTime = new DateTime(2017, 5, 11, 0, 0, 0, 0, DateTimeKind.Utc);
DateTime PipelineActivePeriodEndTime = new DateTime(2017, 5, 12, 0, 0, 0, 0, DateTimeKind.Utc);
string PipelineName = "ADFTutorialPipeline";

client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName,
new PipelineCreateOrUpdateParameters()
{
Pipeline = new Pipeline()
{
Name = PipelineName,
Properties = new PipelineProperties()
{
Description = "Demo Pipeline for data transfer between blobs",

// Initial value for pipeline's active period. With this, you won't need to set slice
status
Start = PipelineActivePeriodStartTime,
End = PipelineActivePeriodEndTime,

Activities = new List<Activity>()


{
new Activity()
{
Name = "BlobToAzureSql",
Inputs = new List<ActivityInput>()
{
new ActivityInput() {
Name = Dataset_Source
}
},
Outputs = new List<ActivityOutput>()
{
new ActivityOutput()
{
Name = Dataset_Destination
}
},
TypeProperties = new CopyActivity()
{
Source = new BlobSource(),
Sink = new SqlSink()
{
WriteBatchSize = 10000,
WriteBatchTimeout = TimeSpan.FromMinutes(10)
}
}
}
}
}
}
});

Note the following points:


In the activities section, there is only one activity whose type is set to Copy. For more information
about the copy activity, see data movement activities. In Data Factory solutions, you can also use data
transformation activities.
Input for the activity is set to InputDataset and output for the activity is set to OutputDataset.
In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as
the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see
supported data stores. To learn how to use a specific supported data store as a source/sink, click the
link in the table.
Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to
produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is
24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline, as the sketch after this list illustrates.
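Here is a quick sketch (not part of the tutorial code) of that slice arithmetic, using the start and end times declared in this step:

// A 24-hour active period divided into hourly slices yields 24 output slices.
TimeSpan activePeriod = PipelineActivePeriodEndTime - PipelineActivePeriodStartTime;
int expectedSlices = (int)activePeriod.TotalHours;   // one slice per hour
Console.WriteLine("Expected output slices: {0}", expectedSlices);   // prints 24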
12. Add the following code to the Main method to get the status of a data slice of the output dataset. The code
polls until a slice reaches the Ready or Failed state, or a five-minute timeout elapses.

// Pulling status within a timeout threshold


DateTime start = DateTime.Now;
bool done = false;

while (DateTime.Now - start < TimeSpan.FromMinutes(5) && !done)


{
Console.WriteLine("Pulling the slice status");
// wait before the next status check
Thread.Sleep(1000 * 12);

var datalistResponse = client.DataSlices.List(resourceGroupName, dataFactoryName,


Dataset_Destination,
new DataSliceListParameters()
{
DataSliceRangeStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString(),
DataSliceRangeEndTime = PipelineActivePeriodEndTime.ConvertToISO8601DateTimeString()
});

foreach (DataSlice slice in datalistResponse.DataSlices)


{
if (slice.State == DataSliceState.Failed || slice.State == DataSliceState.Ready)
{
Console.WriteLine("Slice execution is done with status: {0}", slice.State);
done = true;
break;
}
else
{
Console.WriteLine("Slice status is: {0}", slice.State);
}
}
}

13. Add the following code to the Main method to get run details for a data slice.
Console.WriteLine("Getting run details of a data slice");

// give it a few minutes for the output slice to be ready


Console.WriteLine("\nGive it a few minutes for the output slice to be ready and press any key.");
Console.ReadKey();

var datasliceRunListResponse = client.DataSliceRuns.List(


resourceGroupName,
dataFactoryName,
Dataset_Destination,
new DataSliceRunListParameters()
{
DataSliceStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString()
}
);

foreach (DataSliceRun run in datasliceRunListResponse.DataSliceRuns)


{
Console.WriteLine("Status: \t\t{0}", run.Status);
Console.WriteLine("DataSliceStart: \t{0}", run.DataSliceStart);
Console.WriteLine("DataSliceEnd: \t\t{0}", run.DataSliceEnd);
Console.WriteLine("ActivityId: \t\t{0}", run.ActivityName);
Console.WriteLine("ProcessingStartTime: \t{0}", run.ProcessingStartTime);
Console.WriteLine("ProcessingEndTime: \t{0}", run.ProcessingEndTime);
Console.WriteLine("ErrorMessage: \t{0}", run.ErrorMessage);
}

Console.WriteLine("\nPress any key to exit.");


Console.ReadKey();

14. Add the following helper method used by the Main method to the Program class.

NOTE
When you copy and paste the following code, make sure that the copied code is at the same level as the Main
method.

public static async Task<string> GetAuthorizationHeader()


{
AuthenticationContext context = new
AuthenticationContext(ConfigurationManager.AppSettings["ActiveDirectoryEndpoint"] +
ConfigurationManager.AppSettings["ActiveDirectoryTenantId"]);
ClientCredential credential = new ClientCredential(
ConfigurationManager.AppSettings["ApplicationId"],
ConfigurationManager.AppSettings["Password"]);
AuthenticationResult result = await context.AcquireTokenAsync(
resource: ConfigurationManager.AppSettings["WindowsManagementUri"],
clientCredential: credential);

if (result != null)
return result.AccessToken;

throw new InvalidOperationException("Failed to acquire token");


}

15. In the Solution Explorer, expand the project (DataFactoryAPITestApp), right-click References, and click
Add Reference. Select the check box for the System.Configuration assembly, and click OK.
16. Build the console application. Click Build on the menu and click Build Solution.
17. Confirm that there is at least one file in the adftutorial container in your Azure blob storage. If not, create
an emp.txt file in Notepad with the following content and upload it to the adftutorial container:
John, Doe
Jane, Doe
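
If you prefer to create and upload the file programmatically instead of using Notepad, here is a minimal sketch. It assumes the WindowsAzure.Storage NuGet package, which is not otherwise part of this tutorial, and a connection string placeholder that you fill in yourself.

// Minimal sketch (assumes the WindowsAzure.Storage NuGet package and
// "using Microsoft.WindowsAzure.Storage;" at the top of the file).
static void UploadSampleBlob(string storageConnectionString)
{
    var account = CloudStorageAccount.Parse(storageConnectionString);
    var container = account.CreateCloudBlobClient().GetContainerReference("adftutorial");
    container.CreateIfNotExists();                  // create the container if it doesn't exist yet
    var blob = container.GetBlockBlobReference("emp.txt");
    blob.UploadText("John, Doe\r\nJane, Doe");      // the two sample records shown above
}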

18. Run the sample by clicking Debug -> Start Debugging on the menu. When you see the Getting run
details of a data slice message, wait for a few minutes, and then press ENTER.
19. Use the Azure portal to verify that the data factory (APITutorialFactory, or the unique name you specified) is created with the following artifacts:
Linked services: AzureStorageLinkedService and AzureSqlLinkedService
Datasets: InputDataset and OutputDataset
Pipeline: ADFTutorialPipeline
20. Verify that the two employee records are created in the emp table in the specified Azure SQL database.

Next steps
For complete documentation on .NET API for Data Factory, see Data Factory .NET API Reference.
In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination
data store in a copy operation. The following table provides a list of data stores supported as sources and
destinations by the copy activity:

CATEGORY     DATA STORE

Azure        Azure Blob storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Azure SQL Database,
             Azure SQL Data Warehouse, Azure Search Index, Azure Table storage

Databases    Amazon Redshift, DB2*, MySQL*, Oracle*, PostgreSQL*, SAP Business Warehouse*, SAP HANA*,
             SQL Server*, Sybase*, Teradata*

NoSQL        Cassandra*, MongoDB*

File         Amazon S3, File System*, FTP, HDFS*, SFTP

Others       Generic HTTP, Generic OData, Generic ODBC*, Salesforce, Web Table (table from HTML), GE Historian*
To learn about how to copy data to/from a data store, click the link for the data store in the table.
Tutorial: Build your first pipeline to transform data
using Hadoop cluster
8/24/2017 4 min to read

In this tutorial, you build your first Azure data factory with a data pipeline. The pipeline transforms input
data by running Hive script on an Azure HDInsight (Hadoop) cluster to produce output data.
This article provides an overview and the prerequisites for the tutorial. After you complete the prerequisites, you
can do the tutorial by using one of the following tools/SDKs: Azure portal, Visual Studio, PowerShell, Resource
Manager template, REST API. Select one of the options in the drop-down list at the beginning of this article, or
one of the links at the end, to do the tutorial with one of these options.

Tutorial overview
In this tutorial, you perform the following steps:
1. Create a data factory. A data factory can contain one or more data pipelines that move and
transform data.
In this tutorial, you create one pipeline in the data factory.
2. Create a pipeline. A pipeline can have one or more activities (Examples: Copy Activity, HDInsight
Hive Activity). This sample uses the HDInsight Hive activity that runs a Hive script on a HDInsight
Hadoop cluster. The script first creates a table that references the raw web log data stored in Azure
blob storage and then partitions the raw data by year and month.
In this tutorial, the pipeline uses the Hive Activity to transform data by running a Hive query on an
Azure HDInsight Hadoop cluster.
3. Create linked services. You create a linked service to link a data store or a compute service to the
data factory. A data store such as Azure Storage holds input/output data of activities in the pipeline. A
compute service such as HDInsight Hadoop cluster processes/transforms data.
In this tutorial, you create two linked services: Azure Storage and Azure HDInsight. The Azure
Storage linked service links an Azure Storage Account that holds the input/output data to the data
factory. Azure HDInsight linked service links an Azure HDInsight cluster that is used to transform data
to the data factory.
4. Create input and output datasets. An input dataset represents the input for an activity in the pipeline
and an output dataset represents the output for the activity.
In this tutorial, the input and output datasets specify locations of input and output data in the Azure
Blob Storage. The Azure Storage linked service specifies what Azure Storage Account is used. An
input dataset specifies where the input files are located and an output dataset specifies where the
output files are placed.
See Introduction to Azure Data Factory article for a detailed overview of Azure Data Factory.
Here is the diagram view of the sample data factory you build in this tutorial. MyFirstPipeline has one
activity of type Hive that consumes AzureBlobInput dataset as an input and produces AzureBlobOutput
dataset as an output.
In this tutorial, inputdata folder of the adfgetstarted Azure blob container contains one file named
input.log. This log file has entries from three months: January, February, and March of 2016. Here are the
sample rows for each month in the input file.

2016-01-01,02:01:09,SAMPLEWEBSITE,GET,/blogposts/mvc4/step2.png,X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-
837317ece6a1,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+
(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-
,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-
post-scenarios.aspx,\N,200,0,0,53175,871
2016-02-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-
c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+
(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-
,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-
post-scenarios.aspx,\N,200,0,0,30184,871
2016-03-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-
c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+
(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-
,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-
post-scenarios.aspx,\N,200,0,0,30184,871

When the file is processed by the pipeline with HDInsight Hive Activity, the activity runs a Hive script on the
HDInsight cluster that partitions input data by year and month. The script creates three output folders that
contain a file with entries from each month.

adfgetstarted/partitioneddata/year=2016/month=1/000000_0
adfgetstarted/partitioneddata/year=2016/month=2/000000_0
adfgetstarted/partitioneddata/year=2016/month=3/000000_0

From the sample lines shown above, the first one (with 2016-01-01) is written to the 000000_0 file in the
month=1 folder. Similarly, the second one is written to the file in the month=2 folder and the third one is
written to the file in the month=3 folder.
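The partitioning itself is done by the Hive script (partitionweblogs.hql). Purely as an illustration of the mapping it applies to each row, here is a rough C# sketch; the parsing and names are illustrative, not part of the tutorial:

// Illustrative only: derive the year=/month= output folder from a log row's date column.
string logLine = "2016-01-01,02:01:09,SAMPLEWEBSITE,GET,/blogposts/mvc4/step2.png,...";
DateTime timestamp = DateTime.Parse(logLine.Split(',')[0]);
string outputFolder = string.Format(
    "adfgetstarted/partitioneddata/year={0}/month={1}/", timestamp.Year, timestamp.Month);
Console.WriteLine(outputFolder);   // adfgetstarted/partitioneddata/year=2016/month=1/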

Prerequisites
Before you begin this tutorial, you must have the following prerequisites:
1. Azure subscription - If you don't have an Azure subscription, you can create a free trial account in just a
couple of minutes. See the Free Trial article on how you can obtain a free trial account.
2. Azure Storage You use a general-purpose standard Azure storage account for storing the data in this
tutorial. If you don't have a general-purpose standard Azure storage account, see the Create a storage
account article. After you have created the storage account, note down the account name and access
key. See View, copy and regenerate storage access keys.
3. Download and review the Hive query file (HQL) located at:
https://adftutorialfiles.blob.core.windows.net/hivetutorial/partitionweblogs.hql. This query transforms
input data to produce output data.
4. Download and review the sample input file (input.log) located at:
https://adftutorialfiles.blob.core.windows.net/hivetutorial/input.log
5. Create a blob container named adfgetstarted in your Azure Blob Storage.
6. Upload partitionweblogs.hql file to the script folder in the adfgetstarted container. Use tools such as
Microsoft Azure Storage Explorer.
7. Upload input.log file to the inputdata folder in the adfgetstarted container.
After you complete the prerequisites, select one of the following tools/SDKs to do the tutorial:
Azure portal
Visual Studio
PowerShell
Resource Manager template
REST API
The Azure portal and Visual Studio provide a GUI way of building your data factories. The PowerShell,
Resource Manager template, and REST API options provide a scripting/programming way of building your
data factories.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source
data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial:
Copy data from Blob Storage to SQL Database.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input
dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.
Tutorial: Build your first Azure data factory using
Azure portal
8/21/2017 14 min to read

In this article, you learn how to use Azure portal to create your first Azure data factory. To do the tutorial using
other tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using
Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.

Prerequisites
1. Read through Tutorial Overview article and complete the prerequisite steps.
2. This article does not provide a conceptual overview of the Azure Data Factory service. We recommend that you
go through Introduction to Azure Data Factory article for a detailed overview of the service.

Create data factory


A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a
Copy Activity to copy data from a source to a destination data store and an HDInsight Hive activity to run a Hive
script to transform input data to produce output data. Let's start with creating the data factory in this step.
1. Log in to the Azure portal.
2. Click NEW on the left menu, click Data + Analytics, and click Data Factory.
3. In the New data factory blade, enter GetStartedDF for the Name.

IMPORTANT
The name of the Azure data factory must be globally unique. If you receive the error Data factory name
GetStartedDF is not available, change the name of the data factory (for example, yournameGetStartedDF) and
try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.
4. Select the Azure subscription where you want the data factory to be created.
5. Select existing resource group or create a resource group. For the tutorial, create a resource group named:
ADFGetStartedRG.
6. Select the location for the data factory. Only regions supported by the Data Factory service are shown in the
drop-down list.
7. Select Pin to dashboard.
8. Click Create on the New data factory blade.

IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.

9. On the dashboard, you see the following tile with status: Deploying data factory.

10. Congratulations! You have successfully created your first data factory. After the data factory has been
created successfully, you see the data factory page, which shows you the contents of the data factory.
Before creating a pipeline in the data factory, you need to create a few Data Factory entities first. You first create
linked services to link data stores/computes to your data factory, define input and output datasets to represent
input/output data in the linked data stores, and then create the pipeline with an activity that uses these datasets.

Create linked services


In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory.
The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked
service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what data
store/compute services are used in your scenario and link those services to the data factory by creating linked
services.
Create Azure Storage linked service
In this step, you link your Azure Storage account to your data factory. In this tutorial, you use the same Azure
Storage account to store input/output data and the HQL script file.
1. Click Author and deploy on the DATA FACTORY blade for GetStartedDF. You should see the Data
Factory Editor.
2. Click New data store and choose Azure storage.

3. You should see the JSON script for creating an Azure Storage linked service in the editor.

4. Replace account name with the name of your Azure storage account and account key with the access key of
the Azure storage account. To learn how to get your storage access key, see the information about how to
view, copy, and regenerate storage access keys in Manage your storage account.
5. Click Deploy on the command bar to deploy the linked service.

After the linked service is deployed successfully, the Draft-1 window should disappear and you see
AzureStorageLinkedService in the tree view on the left.
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically
created at runtime and deleted after it is done processing and idle for the specified amount of time.
1. In the Data Factory Editor, click ... More, click New compute, and select On-demand HDInsight
cluster.

2. Copy and paste the following snippet to the Draft-1 window. The JSON snippet describes the properties
that are used to create the HDInsight cluster on-demand.

{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

ClusterSize Specifies the size of the HDInsight cluster.

TimeToLive Specifies the idle time for the HDInsight cluster before it is deleted.

linkedServiceName Specifies the storage account that is used to store the logs that are generated by HDInsight.

Note the following points:

With this JSON, the Data Factory service creates a Linux-based HDInsight cluster for you. See On-demand
HDInsight Linked Service for details.
You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See
HDInsight Linked Service for details.
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This
behavior is by design. With an on-demand HDInsight linked service, an HDInsight cluster is created every
time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is
automatically deleted when the processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not
need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost.
The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-
datetimestamp". Use tools such as Microsoft Azure Storage Explorer to delete containers in your Azure
blob storage, or script the cleanup as shown in the sketch after this list.
See On-demand HDInsight Linked Service for details.
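If you prefer to script that cleanup, the following is a minimal sketch. It assumes the WindowsAzure.Storage NuGet package (not otherwise used in this tutorial), and the container prefix shown is illustrative; substitute the adf-yourdatafactoryname-linkedservicename- prefix that matches your factory.

using System;
using Microsoft.WindowsAzure.Storage;

class CleanupOnDemandHdiContainers
{
    static void Main()
    {
        // Connect to the storage account referenced by AzureStorageLinkedService.
        var account = CloudStorageAccount.Parse("<your storage connection string>");
        var blobClient = account.CreateCloudBlobClient();

        // Illustrative prefix; the on-demand HDInsight log containers follow the pattern
        // adfyourdatafactoryname-linkedservicename-datetimestamp described above.
        string containerPrefix = "adfgetstarteddf-hdinsightondemandlinkedservice-";

        foreach (var container in blobClient.ListContainers(containerPrefix))
        {
            Console.WriteLine("Deleting container: " + container.Name);
            container.Delete();
        }
    }
}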
3. Click Deploy on the command bar to deploy the linked service.

4. Confirm that you see both AzureStorageLinkedService and HDInsightOnDemandLinkedService in


the tree view on the left.

Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the AzureStorageLinkedService you have created earlier in this tutorial. The linked service points to an Azure
Storage account and datasets specify container, folder, file name in the storage that holds input and output data.
Create input dataset
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob
storage.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a
dataset called AzureBlobInput that represents input data for an activity in the pipeline. In addition, you
specify that the input data is located in the blob container called adfgetstarted and the folder called
inputdata.

{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

type The type property is set to AzureBlob because data


resides in an Azure blob storage.

linkedServiceName Refers to the AzureStorageLinkedService you created


earlier.

folderPath Specifies the blob container and the folder that contains
input blobs.
PROPERTY DESCRIPTION

fileName This property is optional. If you omit this property, all the
files from the folderPath are picked. In this tutorial, only
the input.log is processed.

type The log files are in text format, so we use TextFormat.

columnDelimiter columns in the log files are delimited by comma


character ( , )

frequency/interval frequency set to Month and interval is 1, which means


that the input slices are available monthly.

external This property is set to true if the input data is not


generated by this pipeline. In this tutorial, the input.log
file is not generated by this pipeline, so we set the
property to true.

For more information about these JSON properties, see Azure Blob connector article.
3. Click Deploy on the command bar to deploy the newly created dataset. You should see the dataset in the tree
view on the left.
Create output dataset
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.
1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob
storage.
2. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a
dataset called AzureBlobOutput, and specifying the structure of the data that is produced by the Hive
script. In addition, you specify that the results are stored in the blob container called adfgetstarted and the
folder called partitioneddata. The availability section specifies that the output dataset is produced on a
monthly basis.

{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}

See Create the input dataset section for descriptions of these properties. You do not set the external
property on an output dataset as the dataset is produced by the Data Factory service.
3. Click Deploy on the command bar to deploy the newly created dataset.
4. Verify that the dataset is created successfully.

Create pipeline
In this step, you create your first pipeline with an HDInsightHive activity. The input slice is available monthly
(frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also
set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output
dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any
output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the
following JSON are explained at the end of this section.
1. In the Data Factory Editor, click Ellipsis (...) More commands and then click New pipeline.

2. Copy and paste the following snippet to the Draft-1 window.

IMPORTANT
Replace storageaccountname with the name of your storage account in the JSON.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}

In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process
data on an HDInsight cluster.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called AzureStorageLinkedService), and in script folder in the container
adfgetstarted.
The defines section is used to specify the runtime settings that are passed to the hive script as Hive
configuration values (e.g ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
The start and end properties of the pipeline specifies the active period of the pipeline.
In the activity JSON, you specify that the Hive script runs on the compute specified by the
linkedServiceName HDInsightOnDemandLinkedService.
NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the
example.

3. Confirm the following:


a. input.log file exists in the inputdata folder of the adfgetstarted container in the Azure blob storage
b. partitionweblogs.hql file exists in the script folder of the adfgetstarted container in the Azure blob
storage. Complete the prerequisite steps in the Tutorial Overview if you don't see these files.
c. Confirm that you replaced storageaccountname with the name of your storage account in the
pipeline JSON.
4. Click Deploy on the command bar to deploy the pipeline. Since the start and end times are set in the past and
isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy.
5. Confirm that you see the pipeline in the tree view.

6. Congratulations, you have successfully created your first pipeline!

Monitor pipeline
Monitor pipeline using Diagram View
1. Click X to close Data Factory Editor blades and to navigate back to the Data Factory blade, and click
Diagram.

2. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial.
3. To view all activities in the pipeline, right-click pipeline in the diagram and click Open Pipeline.

4. Confirm that you see the HDInsightHive activity in the pipeline.

To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
5. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in Ready state. It
may take a couple of minutes for the slice to show up in Ready state. If it does not happen after you wait
for some time, see if you have the input file (input.log) placed in the right container (adfgetstarted) and
folder (inputdata).
6. Click X to close AzureBlobInput blade.
7. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently
being processed.

8. When processing is done, you see the slice in Ready state.


IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect
the pipeline to take approximately 30 minutes to process the slice.

9. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your
blob storage for the output data.

10. Click the slice to see details about it in a Data slice blade.
11. Click an activity run in the Activity runs list to see details about an activity run (Hive activity in our
scenario) in an Activity run details window.
From the log files, you can see the Hive query that was executed and status information. These logs are
useful for troubleshooting any issues. See Monitor and manage pipelines using Azure portal blades article
for more details.

IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial
again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Monitor pipeline using Monitor & Manage App


You can also use Monitor & Manage application to monitor your pipelines. For detailed information about using
this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.
1. Click Monitor & Manage tile on the home page for your data factory.
2. You should see Monitor & Manage application. Change the Start time and End time to match start
and end times of your pipeline, and click Apply.

3. Select an activity window in the Activity Windows list to see details about it.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used the Data Factory Editor in the Azure portal to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.

Next Steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure
SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.

See Also
TOPIC DESCRIPTION

Pipelines This article helps you understand pipelines and activities in


Azure Data Factory and how to use them to construct end-
to-end data-driven workflows for your scenario or business.

Datasets This article helps you understand datasets in Azure Data


Factory.

Scheduling and execution This article explains the scheduling and execution aspects of
Azure Data Factory application model.

Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Create a data factory by using Visual Studio
8/21/2017 22 min to read

This tutorial shows you how to create an Azure data factory by using Visual Studio. You create a Visual Studio
project using the Data Factory project template, define Data Factory entities (linked services, datasets, and
pipeline) in JSON format, and then publish/deploy these entities to the cloud.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.

NOTE
This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data using Azure
Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.

Walkthrough: Create and publish Data Factory entities


Here are the steps you perform as part of this walkthrough:
1. Create two linked services: AzureStorageLinkedService1 and HDInsightOnDemandLinkedService1.
In this tutorial, both input and output data for the hive activity are in the same Azure Blob Storage. You use
an on-demand HDInsight cluster to process existing input data to produce output data. The on-demand
HDInsight cluster is automatically created for you by Azure Data Factory at run time when the input data is
ready to be processed. You need to link your data stores or computes to your data factory so that the Data
Factory service can connect to them at runtime. Therefore, you link your Azure Storage Account to the data
factory by using the AzureStorageLinkedService1, and link an on-demand HDInsight cluster by using the
HDInsightOnDemandLinkedService1. When publishing, you specify the name for the data factory to be
created or an existing data factory.
2. Create two datasets: InputDataset and OutputDataset, which represent the input/output data that is
stored in the Azure blob storage.
These dataset definitions refer to the Azure Storage linked service you created in the previous step. For the
InputDataset, you specify the blob container (adfgetstarted) and the folder (inputdata) that contains a blob
with the input data. For the OutputDataset, you specify the blob container (adfgetstarted) and the folder
(partitioneddata) that holds the output data. You also specify other properties such as structure, availability,
and policy.
3. Create a pipeline named MyFirstPipeline.
In this walkthrough, the pipeline has only one activity: HDInsight Hive Activity. This activity transforms
input data to produce output data by running a hive script on an on-demand HDInsight cluster. To learn
more about the hive activity, see Hive Activity.
4. Create a data factory named DataFactoryUsingVS. Deploy the data factory and all Data Factory entities
(linked services, tables, and the pipeline).
5. After you publish, you use Azure portal blades and Monitoring & Management App to monitor the pipeline.
Prerequisites
1. Read through Tutorial Overview article and complete the prerequisite steps. You can also select the
Overview and prerequisites option in the drop-down list at the top to switch to the article. After you
complete the prerequisites, switch back to this article by selecting Visual Studio option in the drop-down list.
2. To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.
3. You must have the following installed on your computer:
Visual Studio 2013 or Visual Studio 2015
Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to Azure Download Page
and click VS 2013 or VS 2015 in the .NET section.
Download the latest Azure Data Factory plugin for Visual Studio: VS 2013 or VS 2015. You can also
update the plugin by doing the following steps: On the menu, click Tools -> Extensions and Updates
-> Online -> Visual Studio Gallery -> Microsoft Azure Data Factory Tools for Visual Studio ->
Update.
Now, let's use Visual Studio to create an Azure data factory.
Create Visual Studio project
1. Launch Visual Studio 2013 or Visual Studio 2015. Click File, point to New, and click Project. You should
see the New Project dialog box.
2. In the New Project dialog, select the DataFactory template, and click Empty Data Factory Project.

3. Enter a name for the project, location, and a name for the solution, and click OK.
Create linked services
In this step, you create two linked services: Azure Storage and HDInsight on-demand.
The Azure Storage linked service links your Azure Storage account to the data factory by providing the connection
information. Data Factory service uses the connection string from the linked service setting to connect to the
Azure storage at runtime. This storage holds input and output data for the pipeline, and the hive script file used by
the hive activity.
With the on-demand HDInsight linked service, the HDInsight cluster is automatically created at runtime when the
input data is ready to be processed. The cluster is deleted after it is done processing and has been idle for the
specified amount of time.

NOTE
You create a data factory by specifying its name and settings at the time of publishing your Data Factory solution.

Create Azure Storage linked service


1. Right-click Linked Services in the solution explorer, point to Add, and click New Item.
2. In the Add New Item dialog box, select Azure Storage Linked Service from the list, and click Add.

3. Replace <accountname> and <accountkey> with the name of your Azure storage account and its key. To learn
how to get your storage access key, see the information about how to view, copy, and regenerate storage
access keys in Manage your storage account.

4. Save the AzureStorageLinkedService1.json file.


Create Azure HDInsight linked service
1. In the Solution Explorer, right-click Linked Services, point to Add, and click New Item.
2. Select HDInsight On Demand Linked Service, and click Add.
3. Replace the JSON with the following JSON:

{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService1"
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

ClusterSize Specifies the size of the HDInsight Hadoop cluster.

TimeToLive Specifies the idle time for the HDInsight cluster before it is deleted.

linkedServiceName Specifies the storage account that is used to store the logs that are generated by the HDInsight Hadoop cluster.

IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by
design. With an on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed
unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for
troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these
containers follow a pattern: adf<yourdatafactoryname>-<linkedservicename>-datetimestamp . Use tools such as
Microsoft Storage Explorer to delete containers in your Azure blob storage.

For more information about JSON properties, see Compute linked services article.
4. Save the HDInsightOnDemandLinkedService1.json file.
Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the AzureStorageLinkedService1 you have created earlier in this tutorial. The linked service points to an Azure
Storage account and datasets specify container, folder, file name in the storage that holds input and output data.
Create input dataset
1. In the Solution Explorer, right-click Tables, point to Add, and click New Item.
2. Select Azure Blob from the list, change the name of the file to InputDataSet.json, and click Add.
3. Replace the JSON in the editor with the following JSON snippet:
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService1",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}

This JSON snippet defines a dataset called AzureBlobInput that represents input data for the hive activity
in the pipeline. You specify that the input data is located in the blob container called adfgetstarted and the
folder called inputdata .
The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

type The type property is set to AzureBlob because data


resides in Azure Blob Storage.

linkedServiceName Refers to the AzureStorageLinkedService1 you created


earlier.

fileName This property is optional. If you omit this property, all the
files from the folderPath are picked. In this case, only the
input.log is processed.

type The log files are in text format, so we use TextFormat.

columnDelimiter columns in the log files are delimited by the comma


character ( , )

frequency/interval frequency set to Month and interval is 1, which means


that the input slices are available monthly.

external This property is set to true if the input data for the
activity is not generated by the pipeline. This property is
only specified on input datasets. For the input dataset of
the first activity, always set it to true.

4. Save the InputDataset.json file.


Create output dataset
Now, you create the output dataset to represent output data stored in the Azure Blob storage.
1. In the Solution Explorer, right-click tables, point to Add, and click New Item.
2. Select Azure Blob from the list, change the name of the file to OutputDataset.json, and click Add.
3. Replace the JSON in the editor with the following JSON:

{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService1",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}

The JSON snippet defines a dataset called AzureBlobOutput that represents the output data produced by the
hive activity in the pipeline. You specify that the output data produced by the hive activity is placed in the
blob container called adfgetstarted and the folder called partitioneddata .
The availability section specifies that the output dataset is produced on a monthly basis. The output
dataset drives the schedule of the pipeline. The pipeline runs monthly between its start and end times.
See Create the input dataset section for descriptions of these properties. You do not set the external
property on an output dataset as the dataset is produced by the pipeline.
4. Save the OutputDataset.json file.
Create pipeline
You have created the Azure Storage linked service, and input and output datasets so far. Now, you create a
pipeline with a HDInsightHive activity. The input for the hive activity is set to AzureBlobInput and output is
set to AzureBlobOutput. A slice of an input dataset is available monthly (frequency: Month, interval: 1), and the
output slice is produced monthly too.
1. In the Solution Explorer, right-click Pipelines, point to Add, and click New Item.
2. Select Hive Transformation Pipeline from the list, and click Add.
3. Replace the JSON with the following snippet:

IMPORTANT
Replace <storageaccountname> with the name of your storage account.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService1",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2016-04-01T00:00:00Z",
"end": "2016-04-02T00:00:00Z",
"isPaused": false
}
}

IMPORTANT
Replace <storageaccountname> with the name of your storage account.

The JSON snippet defines a pipeline that consists of a single activity (Hive Activity). This activity runs a Hive
script to process input data on an on-demand HDInsight cluster to produce output data. In the activities
section of the pipeline JSON, you see only one activity in the array with type set to HDInsightHive.
In the type properties that are specific to HDInsight Hive activity, you specify what Azure Storage linked
service has the hive script file, the path to the script file, and parameters to the script file.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService), and in the script folder in the container adfgetstarted .
The defines section is used to specify the runtime settings that are passed to the hive script as Hive
configuration values (e.g ${hiveconf:inputtable} , ${hiveconf:partitionedtable}) .
The start and end properties of the pipeline specify the active period of the pipeline. You configured the
dataset to be produced monthly; therefore, only one slice is produced by the pipeline (because the month
is the same in the start and end dates).
In the activity JSON, you specify that the Hive script runs on the compute specified by the
linkedServiceName HDInsightOnDemandLinkedService.
4. Save the HiveActivity1.json file.
Add partitionweblogs.hql and input.log as a dependency
1. Right-click Dependencies in the Solution Explorer window, point to Add, and click Existing Item.
2. Navigate to the C:\ADFGettingStarted and select partitionweblogs.hql, input.log files, and click Add. You
created these two files as part of prerequisites from the Tutorial Overview.
When you publish the solution in the next step, the partitionweblogs.hql file is uploaded to the script folder in
the adfgetstarted blob container.
Publish/deploy Data Factory entities
In this step, you publish the Data Factory entities (linked services, datasets, and pipeline) in your project to the
Azure Data Factory service. In the process of publishing, you specify the name for your data factory.
1. Right-click project in the Solution Explorer, and click Publish.
2. If you see Sign in to your Microsoft account dialog box, enter your credentials for the account that has
Azure subscription, and click sign in.
3. You should see the following dialog box:

4. In the Configure data factory page, do the following steps:


a. select Create New Data Factory option.
b. Enter a unique name for the data factory. For example: DataFactoryUsingVS09152016. The name
must be globally unique.
c. Select the right subscription for the Subscription field.

IMPORTANT
If you do not see any subscription, ensure that you logged in using an account that is an admin or
co-admin of the subscription.

d. Select the resource group for the data factory to be created.
e. Select the region for the data factory.
f. Click Next to switch to the Publish Items page. (Press TAB to move out of the Name field if the
Next button is disabled.)

IMPORTANT
If you receive the error Data factory name DataFactoryUsingVS is not available when publishing,
change the name (for example, yournameDataFactoryUsingVS). See Data Factory - Naming Rules topic for
naming rules for Data Factory artifacts.

5. In the Publish Items page, ensure that all the Data Factories entities are selected, and click Next to switch
to the Summary page.
6. Review the summary and click Next to start the deployment process and view the Deployment Status.

7. In the Deployment Status page, you should see the status of the deployment process. Click Finish after the
deployment is done.
Important points to note:
If you receive the error: This subscription is not registered to use namespace Microsoft.DataFactory,
do one of the following and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider.
Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

You can run the following command to confirm that the Data Factory provider is registered.

Get-AzureRmResourceProvider

Login using the Azure subscription in to the Azure portal and navigate to a Data Factory blade (or)
create a data factory in the Azure portal. This action automatically registers the provider for you.
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.
To create Data Factory instances, you need to be an admin or co-admin of the Azure subscription.
Monitor pipeline
In this step, you monitor the pipeline using Diagram View of the data factory.
Monitor pipeline using Diagram View
1. Log in to the Azure portal and do the following steps:
a. Click More services and click Data factories.

b. Select the name of your data factory (for example: DataFactoryUsingVS09152016) from the list of
data factories.

2. In the home page for your data factory, click Diagram.


3. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial.

4. To view all activities in the pipeline, right-click pipeline in the diagram and click Open Pipeline.

5. Confirm that you see the HDInsightHive activity in the pipeline.

To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
6. In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in the Ready state. It
may take a couple of minutes for the slice to show up in the Ready state. If it does not happen after you wait
for some time, check whether you have the input file (input.log) placed in the right container ( adfgetstarted )
and folder ( inputdata ). Also, make sure that the external property on the input dataset is set to true.

7. Click X to close AzureBlobInput blade.


8. In the Diagram View, double-click the dataset AzureBlobOutput. You see the slice that is currently
being processed.

9. When processing is done, you see the slice in Ready state.

IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect
the pipeline to take approximately 30 minutes to process the slice.
10. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your
blob storage for the output data.

11. Click the slice to see details about it in a Data slice blade.
12. Click an activity run in the Activity runs list to see details about an activity run (Hive activity in our
scenario) in an Activity run details window.
From the log files, you can see the Hive query that was executed and status information. These logs are
useful for troubleshooting any issues.
See Monitor datasets and pipeline for instructions on how to use the Azure portal to monitor the pipeline and
datasets you have created in this tutorial.
Monitor pipeline using Monitor & Manage App
You can also use Monitor & Manage application to monitor your pipelines. For detailed information about using
this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.
1. Click Monitor & Manage tile.

2. You should see the Monitor & Manage application. Change the Start time and End time to match the start (04-
01-2016 12:00 AM) and end (04-02-2016 12:00 AM) times of your pipeline, and click Apply.
3. To see details about an activity window, select it in the Activity Windows list.

IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the
tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Additional notes
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a
Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive
script to transform input data. See supported data stores for all the sources and sinks supported by the Copy
Activity. See compute linked services for the list of compute services supported by Data Factory.
Linked services link data stores or compute services to an Azure data factory. See supported data stores for all
the sources and sinks supported by the Copy Activity. See compute linked services for the list of compute
services supported by Data Factory and transformation activities that can run on them.
See Move data from/to Azure Blob for details about JSON properties used in the Azure Storage linked service
definition.
You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute
Linked Services for details.
The Data Factory creates a Linux-based HDInsight cluster for you with the preceding JSON. See On-demand
HDInsight Linked Service for details.
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is
by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is
processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the
processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need
them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names
of these containers follow a pattern: adfyourdatafactoryname-linkedservicename-datetimestamp . Use
tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage, or use the
PowerShell sketch after these notes.
Currently, output dataset is what drives the schedule, so you must create an output dataset even if the activity
does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset.
This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data
using Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
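
If you prefer to script the cleanup instead of using Microsoft Storage Explorer, the following Azure PowerShell
sketch removes the leftover containers. It assumes the Azure.Storage cmdlets are available; the account name,
account key, and name pattern are placeholders that you must replace, and you should review the matched
containers before deleting anything.

# Sketch: delete containers that were created by the on-demand HDInsight linked service.
# Replace the placeholders, and make sure the pattern does not match containers you still need
# (for example, the adfgetstarted container used by this tutorial).
$ctx = New-AzureStorageContext -StorageAccountName "<accountname>" -StorageAccountKey "<accountkey>"
$leftovers = Get-AzureStorageContainer -Context $ctx |
    Where-Object { $_.Name -like "adf<yourdatafactoryname>-<linkedservicename>-*" }
$leftovers | ForEach-Object { Remove-AzureStorageContainer -Context $ctx -Name $_.Name -Force }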

Use Server Explorer to view data factories


1. In Visual Studio, click View on the menu, and click Server Explorer.
2. In the Server Explorer window, expand Azure and expand Data Factory. If you see Sign in to Visual
Studio, enter the account associated with your Azure subscription and click Continue. Enter your password,
and click Sign in. Visual Studio tries to get information about all Azure data factories in your subscription.
You see the status of this operation in the Data Factory Task List window.

3. You can right-click a data factory, and select Export Data Factory to New Project to create a Visual
Studio project based on an existing data factory.
Update Data Factory tools for Visual Studio
To update Azure Data Factory tools for Visual Studio, do the following steps:
1. Click Tools on the menu and select Extensions and Updates.
2. Select Updates in the left pane and then select Visual Studio Gallery.
3. Select Azure Data Factory tools for Visual Studio and click Update. If you do not see this entry, you
already have the latest version of the tools.

Use configuration files


You can use configuration files in Visual Studio to configure properties for linked services/tables/pipelines
differently for each environment.
Consider the following JSON definition for an Azure Storage linked service. Suppose you want to specify
connectionString with different values for accountname and accountkey based on the environment
(Dev/Test/Production) to which you are deploying Data Factory entities. You can achieve this behavior by using a
separate configuration file for each environment.

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"description": "",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Add a configuration file


Add a configuration file for each environment by performing the following steps:
1. Right-click the Data Factory project in your Visual Studio solution, point to Add, and click New item.
2. Select Config from the list of installed templates on the left, select Configuration File, enter a name for
the configuration file, and click Add.
3. Add configuration parameters and their values in the following format:

{
"$schema":
"http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json",
"AzureStorageLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
],
"AzureSqlLinkedService1": [
{
"name": "$.properties.typeProperties.connectionString",
"value": "Server=tcp:spsqlserver.database.windows.net,1433;Database=spsqldb;User
ID=spelluru;Password=Sowmya123;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
]
}

This example configures the connectionString property of an Azure Storage linked service and an Azure SQL
linked service. Notice that the syntax for specifying name is JsonPath.
If the JSON has a property with an array of values, as shown in the following code:

"structure": [
{
"name": "FirstName",
"type": "String"
},
{
"name": "LastName",
"type": "String"
}
],

Configure properties as shown in the following configuration file (use zero-based indexing):
{
"name": "$.properties.structure[0].name",
"value": "FirstName"
}
{
"name": "$.properties.structure[0].type",
"value": "String"
}
{
"name": "$.properties.structure[1].name",
"value": "LastName"
}
{
"name": "$.properties.structure[1].type",
"value": "String"
}

Property names with spaces


If a property name has spaces in it, use square brackets as shown in the following example (Database server
name):

{
"name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']",
"value": "MyAsqlServer.database.windows.net"
}

Deploy solution using a configuration


When you are publishing Azure Data Factory entities in VS, you can specify the configuration that you want to use
for that publishing operation.
To publish entities in an Azure Data Factory project using configuration file:
1. Right-click Data Factory project and click Publish to see the Publish Items dialog box.
2. Select an existing data factory or specify values for creating a data factory on the Configure data factory
page, and click Next.
3. On the Publish Items page, you see a drop-down list with available configurations for the Select
Deployment Config field.

4. Select the configuration file that you would like to use and click Next.
5. Confirm that you see the name of the JSON file in the Summary page and click Next.
6. Click Finish after the deployment operation is finished.
When you deploy, the values from the configuration file are used to set values for properties in the JSON files
before the entities are deployed to Azure Data Factory service.

Use Azure Key Vault


It is not advisable and often against security policy to commit sensitive data such as connection strings to the
code repository. See ADF Secure Publish sample on GitHub to learn about storing sensitive information in Azure
Key Vault and using it while publishing Data Factory entities. The Secure Publish extension for Visual Studio
allows the secrets to be stored in Key Vault and only references to them are specified in linked services/
deployment configurations. These references are resolved when you publish Data Factory entities to Azure. These
files can then be committed to the source repository without exposing any secrets.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used Visual Studio to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.

Next Steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure
SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.
You can chain two activities (run one activity after another) by setting the output dataset of one activity as the
input dataset of the other activity. See Scheduling and execution in Data Factory for detailed information.

See Also
TOPIC DESCRIPTION

Pipelines - This article helps you understand pipelines and activities in Azure Data Factory and how to use them
to construct data-driven workflows for your scenario or business.

Datasets - This article helps you understand datasets in Azure Data Factory.

Data Transformation Activities - This article provides a list of data transformation activities (such as the
HDInsight Hive transformation you used in this tutorial) supported by Azure Data Factory.

Scheduling and execution - This article explains the scheduling and execution aspects of the Azure Data Factory
application model.

Monitor and manage pipelines using Monitoring App - This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Build your first Azure data factory using
Azure PowerShell
8/21/2017 14 min to read

In this article, you use Azure PowerShell to create your first Azure data factory. To do the tutorial using other
tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data
store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see Tutorial: Copy data from
Blob Storage to SQL Database.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.

Prerequisites
Read through Tutorial Overview article and complete the prerequisite steps.
Follow instructions in How to install and configure Azure PowerShell article to install latest version of Azure
PowerShell on your computer.
(optional) This article does not cover all the Data Factory cmdlets. See Data Factory Cmdlet Reference for
comprehensive documentation on Data Factory cmdlets.

Create data factory


In this step, you use Azure PowerShell to create an Azure Data Factory named FirstDataFactoryPSH. A data
factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy
Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive script to
transform input data. Let's start with creating the data factory in this step.
1. Start Azure PowerShell and run the following command. Keep Azure PowerShell open until the end of this
tutorial. If you close and reopen, you need to run these commands again.
Run the following command and enter the user name and password that you use to sign in to the Azure
portal.

Login-AzureRmAccount

Run the following command to view all the subscriptions for this account.

Get-AzureRmSubscription

Run the following command to select the subscription that you want to work with. This subscription
should be the same as the one you used in the Azure portal.

Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext

2. Create an Azure resource group named ADFTutorialResourceGroup by running the following command:
New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of
ADFTutorialResourceGroup in this tutorial.
3. Run the New-AzureRmDataFactory cmdlet that creates a data factory named FirstDataFactoryPSH.

New-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH -Location "West US"

Note the following points:


The name of the Azure Data Factory must be globally unique. If you receive the error Data factory name
FirstDataFactoryPSH is not available, change the name (for example, yournameFirstDataFactoryPSH).
Use this name in place of FirstDataFactoryPSH while performing steps in this tutorial. See Data Factory -
Naming Rules topic for naming rules for Data Factory artifacts.
To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription.
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.
If you receive the error: "This subscription is not registered to use namespace
Microsoft.DataFactory", do one of the following and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider:

Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

You can run the following command to confirm that the Data Factory provider is registered:

Get-AzureRmResourceProvider

Log in to the Azure portal with the Azure subscription and navigate to a Data Factory blade, or create
a data factory in the Azure portal. This action automatically registers the provider for you.
Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to
link data stores/computes to your data factory, define input and output datasets to represent input/output data in
linked data stores, and then create the pipeline with an activity that uses these datasets.

Create linked services


In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory.
The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked
service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what data
store/compute services are used in your scenario and link those services to the data factory by creating linked
services.
Create Azure Storage linked service
In this step, you link your Azure Storage account to your data factory. You use the same Azure Storage account to
store input/output data and the HQL script file.
1. Create a JSON file named StorageLinkedService.json in the C:\ADFGetStarted folder with the following
content. Create the folder ADFGetStarted if it does not already exist.
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"description": "",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Replace account name with the name of your Azure storage account and account key with the access
key of the Azure storage account. To learn how to get your storage access key, see the information about
how to view, copy, and regenerate storage access keys in Manage your storage account.
2. In Azure PowerShell, switch to the ADFGetStarted folder.
3. You can use the New-AzureRmDataFactoryLinkedService cmdlet that creates a linked service. This
cmdlet and other Data Factory cmdlets you use in this tutorial require you to pass values for the
ResourceGroupName and DataFactoryName parameters. Alternatively, you can use Get-
AzureRmDataFactory to get a DataFactory object and pass the object without typing
ResourceGroupName and DataFactoryName each time you run a cmdlet. Run the following command to
assign the output of the Get-AzureRmDataFactory cmdlet to a $df variable.

$df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH

4. Now, run the New-AzureRmDataFactoryLinkedService cmdlet that creates the linked


StorageLinkedService service.

New-AzureRmDataFactoryLinkedService $df -File .\StorageLinkedService.json

If you hadn't run the Get-AzureRmDataFactory cmdlet and assigned the output to the $df variable, you
would have to specify values for the ResourceGroupName and DataFactoryName parameters as follows.

New-AzureRmDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName FirstDataFactoryPSH -File .\StorageLinkedService.json

If you close Azure PowerShell in the middle of the tutorial, you have to run the Get-AzureRmDataFactory
cmdlet next time you start Azure PowerShell to complete the tutorial.
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically
created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use
your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for
details.
1. Create a JSON file named HDInsightOnDemandLinkedService.json in the C:\ADFGetStarted folder
with the following content.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "StorageLinkedService"
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

ClusterSize - Specifies the size of the HDInsight cluster.

TimeToLive - Specifies the idle time for the HDInsight cluster before it is deleted.

linkedServiceName - Specifies the storage account that is used to store the logs that are generated by
HDInsight.

Note the following points:


The Data Factory creates a Linux-based HDInsight cluster for you with the JSON. See On-demand
HDInsight Linked Service for details.
You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See
HDInsight Linked Service for details.
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This
behavior is by design. With on-demand HDInsight linked service, a HDInsight cluster is created
every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is
automatically deleted when the processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not
need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost.
The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-
datetimestamp". Use tools such as Microsoft Storage Explorer to delete containers in your Azure
blob storage.
See On-demand HDInsight Linked Service for details.
2. Run the New-AzureRmDataFactoryLinkedService cmdlet that creates the linked service called
HDInsightOnDemandLinkedService.

New-AzureRmDataFactoryLinkedService $df -File .\HDInsightOnDemandLinkedService.json

Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the StorageLinkedService you have created earlier in this tutorial. The linked service points to an Azure Storage
account and datasets specify container, folder, file name in the storage that holds input and output data.
Create input dataset
1. Create a JSON file named InputTable.json in the C:\ADFGetStarted folder with the following content:

{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}

The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the
pipeline. In addition, it specifies that the input data is located in the blob container called adfgetstarted
and the folder called inputdata.
The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

type - The type property is set to AzureBlob because data resides in Azure blob storage.

linkedServiceName - Refers to the StorageLinkedService you created earlier.

fileName - This property is optional. If you omit this property, all the files from the folderPath are picked.
In this case, only the input.log is processed.

type (under format) - The log files are in text format, so we use TextFormat.

columnDelimiter - Columns in the log files are delimited by the comma character (,).

frequency/interval - frequency is set to Month and interval is 1, which means that the input slices are
available monthly.

external - This property is set to true if the input data is not generated by the Data Factory service.

2. Run the following command in Azure PowerShell to create the Data Factory dataset:

New-AzureRmDataFactoryDataset $df -File .\InputTable.json


Create output dataset
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.
1. Create a JSON file named OutputTable.json in the C:\ADFGetStarted folder with the following content:

{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}

The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the
pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and
the folder called partitioneddata. The availability section specifies that the output dataset is produced
on a monthly basis.
2. Run the following command in Azure PowerShell to create the Data Factory dataset:

New-AzureRmDataFactoryDataset $df -File .\OutputTable.json

Create pipeline
In this step, you create your first pipeline with a HDInsightHive activity. Input slice is available monthly
(frequency: Month, interval: 1), output slice is produced monthly, and the scheduler property for the activity is also
set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, output
dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any
output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the
following JSON are explained at the end of this section.
1. Create a JSON file named MyFirstPipelinePSH.json in the C:\ADFGetStarted folder with the following
content:

IMPORTANT
Replace storageaccountname with the name of your storage account in the JSON.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "StorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata"
}
},
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}

In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process
Data on an HDInsight cluster.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called StorageLinkedService), in the script folder in the container adfgetstarted.
The defines section is used to specify the runtime settings that are passed to the Hive script as Hive
configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
The start and end properties of the pipeline specify the active period of the pipeline.
In the activity JSON, you specify that the Hive script runs on the compute specified by the
linkedServiceName HDInsightOnDemandLinkedService.

NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties that are used
in the example.
2. Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage,
and run the following command to deploy the pipeline. Since the start and end times are set in the past
and isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy.

New-AzureRmDataFactoryPipeline $df -File .\MyFirstPipelinePSH.json

3. Congratulations, you have successfully created your first pipeline using Azure PowerShell!
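
Optionally, before moving on to monitoring, you can confirm that the pipeline exists in the data factory. This is
a minimal sketch that assumes the $df variable assigned earlier with Get-AzureRmDataFactory is still in scope.

# Sketch: retrieve the pipeline you just deployed.
Get-AzureRmDataFactoryPipeline $df -Name MyFirstPipeline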

Monitor pipeline
In this step, you use Azure PowerShell to monitor what's going on in an Azure data factory.
1. Run Get-AzureRmDataFactory and assign the output to a $df variable.

$df=Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH

2. Run Get-AzureRmDataFactorySlice to get details about all slices of AzureBlobOutput, which is the
output dataset of the pipeline.

Get-AzureRmDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01

Notice that the StartDateTime you specify here is the same start time specified in the pipeline JSON. Here is
the sample output:

ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : FirstDataFactoryPSH
DatasetName : AzureBlobOutput
Start : 7/1/2017 12:00:00 AM
End : 7/2/2017 12:00:00 AM
RetryCount : 0
State : InProgress
SubState :
LatencyStatus :
LongRetryCount : 0

3. Run Get-AzureRmDataFactoryRun to get the details of activity runs for a specific slice.

Get-AzureRmDataFactoryRun $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01

Here is the sample output:


Id : 0f6334f2-d56c-4d48-b427-
d4f0fb4ef883_635268096000000000_635292288000000000_AzureBlobOutput
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : FirstDataFactoryPSH
DatasetName : AzureBlobOutput
ProcessingStartTime : 12/18/2015 4:50:33 AM
ProcessingEndTime : 12/31/9999 11:59:59 PM
PercentComplete : 0
DataSliceStart : 7/1/2017 12:00:00 AM
DataSliceEnd : 7/2/2017 12:00:00 AM
Status : AllocatingResources
Timestamp : 12/18/2015 4:50:33 AM
RetryAttempt : 0
Properties : {}
ErrorMessage :
ActivityName : RunSampleHiveActivity
PipelineName : MyFirstPipeline
Type : Script

You can keep running this cmdlet until you see the slice in Ready state or Failed state; a simple polling
sketch follows. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted
container in your blob storage for the output data. Creation of an on-demand HDInsight cluster usually
takes some time.
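
The following Azure PowerShell sketch automates that polling. It assumes $df is still in scope, that there is a
single output slice for the period, and it simply rechecks the slice state every two minutes until the slice is
Ready or Failed.

# Sketch: poll the output slice until it reaches the Ready or Failed state.
while ($true)
{
    $slice = Get-AzureRmDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01 |
        Select-Object -First 1
    Write-Host ("{0}  slice state: {1}" -f (Get-Date), $slice.State)
    if ($slice.State -eq "Ready" -or $slice.State -eq "Failed") { break }
    Start-Sleep -Seconds 120
}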

IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the
pipeline to take approximately 30 minutes to process the slice.
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial
again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used Azure PowerShell to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.

Next steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to
Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL.
See Also
TOPIC DESCRIPTION

Data Factory Cmdlet Reference - See comprehensive documentation on Data Factory cmdlets.

Pipelines - This article helps you understand pipelines and activities in Azure Data Factory and how to use them
to construct end-to-end data-driven workflows for your scenario or business.

Datasets - This article helps you understand datasets in Azure Data Factory.

Scheduling and Execution - This article explains the scheduling and execution aspects of the Azure Data Factory
application model.

Monitor and manage pipelines using Monitoring App - This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Build your first Azure data factory using
Azure Resource Manager template
7/21/2017 12 min to read

In this article, you use an Azure Resource Manager template to create your first Azure data factory. To do the
tutorial using other tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.

NOTE
The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using
Azure Data Factory, see Tutorial: Copy data from Blob Storage to SQL Database.
The pipeline in this tutorial has only one activity of type: HDInsightHive. A pipeline can have more than one activity. And,
you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input
dataset of the other activity. For more information, see scheduling and execution in Data Factory.

Prerequisites
Read through Tutorial Overview article and complete the prerequisite steps.
Follow instructions in How to install and configure Azure PowerShell article to install latest version of Azure
PowerShell on your computer.
See Authoring Azure Resource Manager Templates to learn about Azure Resource Manager templates.

In this tutorial
ENTITY DESCRIPTION

Azure Storage linked service Links your Azure Storage account to the data factory. The
Azure Storage account holds the input and output data for
the pipeline in this sample.

HDInsight on-demand linked service Links an on-demand HDInsight cluster to the data factory.
The cluster is automatically created for you to process data
and is deleted after the processing is done.

Azure Blob input dataset Refers to the Azure Storage linked service. The linked service
refers to an Azure Storage account and the Azure Blob
dataset specifies the container, folder, and file name in the
storage that holds the input data.

Azure Blob output dataset Refers to the Azure Storage linked service. The linked service
refers to an Azure Storage account and the Azure Blob
dataset specifies the container, folder, and file name in the
storage that holds the output data.

Data pipeline The pipeline has one activity of type HDInsightHive, which
consumes the input dataset and produces the output
dataset.

A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two types
of activities: data movement activities and data transformation activities. In this tutorial, you create a pipeline with
one activity (Hive activity).
The following section provides the complete Resource Manager template for defining Data Factory entities so that
you can quickly run through the tutorial and test the template. To understand how each Data Factory entity is
defined, see Data Factory entities in the template section.

Data Factory JSON template


The top-level Resource Manager template for defining a data factory is:

{
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": { ...
},
"variables": { ...
},
"resources": [
{
"name": "[parameters('dataFactoryName')]",
"apiVersion": "[variables('apiVersion')]",
"type": "Microsoft.DataFactory/datafactories",
"location": "westus",
"resources": [
{ ... },
{ ... },
{ ... },
{ ... }
]
}
]
}

Create a JSON file named ADFTutorialARM.json in C:\ADFGetStarted folder with the following content:

{
"contentVersion": "1.0.0.0",
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
"storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage
account that contains the input/output data." } },
"storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure
storage account." } },
"blobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in
the Azure Storage account." } },
"inputBlobFolder": { "type": "string", "metadata": { "description": "The folder in the blob
container that has the input file." } },
"inputBlobName": { "type": "string", "metadata": { "description": "Name of the input file/blob." }
},
"outputBlobFolder": { "type": "string", "metadata": { "description": "The folder in the blob
container that will hold the transformed data." } },
"hiveScriptFolder": { "type": "string", "metadata": { "description": "The folder in the blob
container that contains the Hive query file." } },
"hiveScriptFile": { "type": "string", "metadata": { "description": "Name of the hive query (HQL)
file." } }
},
"variables": {
"dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]",
"azureStorageLinkedServiceName": "AzureStorageLinkedService",
"hdInsightOnDemandLinkedServiceName": "HDInsightOnDemandLinkedService",
"blobInputDatasetName": "AzureBlobInput",
"blobOutputDatasetName": "AzureBlobOutput",
"pipelineName": "HiveTransformPipeline"
},
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US",
"resources": [
{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',paramet
ers('storageAccountKey'))]"
}
}
},
{
"type": "linkedservices",
"name": "[variables('hdInsightOnDemandLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]"
}
}
},
{
"type": "datasets",
"name": "[variables('blobInputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"fileName": "[parameters('inputBlobName')]",
"folderPath": "[concat(parameters('blobContainer'), '/',
parameters('inputBlobFolder'))]",
"format": {
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true
}
},
{
"type": "datasets",
"name": "[variables('blobOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/',
parameters('outputBlobFolder'))]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
},
{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('hdInsightOnDemandLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('blobOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"description": "Pipeline that transforms data using Hive script.",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "[concat(parameters('blobContainer'), '/',
parameters('hiveScriptFolder'), '/', parameters('hiveScriptFile'))]",
"scriptLinkedService": "[variables('azureStorageLinkedServiceName')]",
"defines": {
"inputtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]",
"partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]"
}
},
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
"outputs": [
{
"name": "[variables('blobOutputDatasetName')]"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}
]
}
]
}

NOTE
You can find another example of Resource Manager template for creating an Azure data factory on Tutorial: Create a
pipeline with Copy Activity using an Azure Resource Manager template.

Parameters JSON
Create a JSON file named ADFTutorialARM-Parameters.json that contains parameters for the Azure Resource
Manager template.

IMPORTANT
Specify the name and key of your Azure Storage account for the storageAccountName and storageAccountKey
parameters in this parameter file.
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"storageAccountName": {
"value": "<Name of your Azure Storage account>"
},
"storageAccountKey": {
"value": "<Key of your Azure Storage account>"
},
"blobContainer": {
"value": "adfgetstarted"
},
"inputBlobFolder": {
"value": "inputdata"
},
"inputBlobName": {
"value": "input.log"
},
"outputBlobFolder": {
"value": "partitioneddata"
},
"hiveScriptFolder": {
"value": "script"
},
"hiveScriptFile": {
"value": "partitionweblogs.hql"
}
}
}

IMPORTANT
You may have separate parameter JSON files for development, testing, and production environments that you can use with
the same Data Factory JSON template. By using a PowerShell script, you can automate deploying Data Factory entities in
these environments.

Create data factory


1. Start Azure PowerShell and run the following command:
Run the following command and enter the user name and password that you use to sign in to the Azure
portal.

Login-AzureRmAccount

Run the following command to view all the subscriptions for this account.

Get-AzureRmSubscription
Run the following command to select the subscription that you want to work with. This subscription
should be the same as the one you used in the Azure portal.
Get-AzureRmSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzureRmContext
2. Run the following command to deploy Data Factory entities using the Resource Manager template you
created in Step 1.

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters.json
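
After the deployment completes, you can optionally confirm the result from the same PowerShell session. The
following sketch lists the data factories in the resource group; the generated name starts with HiveTransformDF
because the template appends uniqueString(resourceGroup().id) to that prefix.

# Sketch: confirm that the data factory defined by the template was created.
Get-AzureRmDataFactory -ResourceGroupName ADFTutorialResourceGroup |
    Select-Object DataFactoryName, Location, ResourceGroupName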

Monitor pipeline
1. After logging in to the Azure portal, click Browse and select Data factories.
2. In the Data Factories blade, click the data factory that you created (its name starts with HiveTransformDF).
3. In the Data Factory blade for your data factory, click Diagram.

4. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial.
5. In the Diagram View, double-click the dataset AzureBlobOutput. You see that the slice that is currently
being processed.

6. When processing is done, you see the slice in Ready state. Creation of an on-demand HDInsight cluster
usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately
30 minutes to process the slice.
7. When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your
blob storage for the output data.
See Monitor datasets and pipeline for instructions on how to use the Azure portal blades to monitor the pipeline
and datasets you have created in this tutorial.
You can also use Monitor and Manage App to monitor your data pipelines. See Monitor and manage Azure Data
Factory pipelines using Monitoring App for details about using the application.

IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the
tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

Data Factory entities in the template


Define data factory
You define a data factory in the Resource Manager template as shown in the following sample:

"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "2015-10-01",
"type": "Microsoft.DataFactory/datafactories",
"location": "West US"
}

The dataFactoryName is defined as:


"dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]",

It is a unique string based on the resource group ID.


Defining Data Factory entities
The following Data Factory entities are defined in the JSON template:
Azure Storage linked service
HDInsight on-demand linked service
Azure blob input dataset
Azure blob output dataset
Data pipeline with a Hive activity
Azure Storage linked service
You specify the name and key of your Azure storage account in this section. See Azure Storage linked service for
details about JSON properties used to define an Azure Storage linked service.

{
"type": "linkedservices",
"name": "[variables('azureStorageLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureStorage",
"description": "Azure Storage linked service",
"typeProperties": {
"connectionString": "
[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',paramet
ers('storageAccountKey'))]"
}
}
}

The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these
parameters are passed by using a parameter file. The definition also uses the variables azureStorageLinkedServiceName
and dataFactoryName defined in the template.
HDInsight on-demand linked service
See Compute linked services article for details about JSON properties used to define an HDInsight on-demand
linked service.
{
"type": "linkedservices",
"name": "[variables('hdInsightOnDemandLinkedServiceName')]",
"dependsOn": [
"[variables('dataFactoryName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]"
}
}
}

Note the following points:


The Data Factory creates a Linux-based HDInsight cluster for you with the above JSON. See On-demand
HDInsight Linked Service for details.
You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See HDInsight
Linked Service for details.
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior
is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice
needs to be processed unless there is an existing live cluster (timeToLive) and is deleted when the
processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need
them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names
of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp".
Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage.
See On-demand HDInsight Linked Service for details.
Azure blob input dataset
You specify the names of blob container, folder, and file that contains the input data. See Azure Blob dataset
properties for details about JSON properties used to define an Azure Blob dataset.
{
"type": "datasets",
"name": "[variables('blobInputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"fileName": "[parameters('inputBlobName')]",
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'))]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true
}
}

This definition uses the following parameters defined in the parameter template: blobContainer, inputBlobFolder, and
inputBlobName.
Azure Blob output dataset
You specify the names of blob container and folder that holds the output data. See Azure Blob dataset properties
for details about JSON properties used to define an Azure Blob dataset.

{
"type": "datasets",
"name": "[variables('blobOutputDatasetName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "[variables('azureStorageLinkedServiceName')]",
"typeProperties": {
"folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'))]",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}

This definition uses the following parameters defined in the parameter template: blobContainer and
outputBlobFolder.
Data pipeline
You define a pipeline that transforms data by running a Hive script on an on-demand Azure HDInsight cluster. See
Pipeline JSON for descriptions of JSON elements used to define a pipeline in this example.

{
"type": "datapipelines",
"name": "[variables('pipelineName')]",
"dependsOn": [
"[variables('dataFactoryName')]",
"[variables('azureStorageLinkedServiceName')]",
"[variables('hdInsightOnDemandLinkedServiceName')]",
"[variables('blobInputDatasetName')]",
"[variables('blobOutputDatasetName')]"
],
"apiVersion": "2015-10-01",
"properties": {
"description": "Pipeline that transforms data using Hive script.",
"activities": [
{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "[concat(parameters('blobContainer'), '/', parameters('hiveScriptFolder'), '/',
parameters('hiveScriptFile'))]",
"scriptLinkedService": "[variables('azureStorageLinkedServiceName')]",
"defines": {
"inputtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]",
"partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@',
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]"
}
},
"inputs": [
{
"name": "[variables('blobInputDatasetName')]"
}
],
"outputs": [
{
"name": "[variables('blobOutputDatasetName')]"
}
],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]"
}
],
"start": "2017-07-01T00:00:00Z",
"end": "2017-07-02T00:00:00Z",
"isPaused": false
}
}

Reuse the template


In the tutorial, you created a template for defining Data Factory entities and a template for passing values for
parameters. To use the same template to deploy Data Factory entities to different environments, you create a
parameter file for each environment and use it when deploying to that environment.
Example:

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Dev.json

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Test.json

New-AzureRmResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Production.json

Notice that the first command uses the parameter file for the development environment, the second one for the test
environment, and the third one for the production environment.
You can also reuse the template to perform repeated tasks. For example, you need to create many data factories
with one or more pipelines that implement the same logic but each data factory uses different Azure storage and
Azure SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or
production) with different parameter files to create data factories.
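
If you deploy to several environments regularly, you can wrap the commands shown above in a small loop. This is
only a sketch; the parameter file names are assumptions that follow the naming used in the earlier examples.

# Sketch: deploy the same template once per environment, each with its own parameter file.
$environments = @("Dev", "Test", "Production")
foreach ($environment in $environments)
{
    New-AzureRmResourceGroupDeployment -Name ("MyARMDeployment-" + $environment) `
        -ResourceGroupName ADFTutorialResourceGroup `
        -TemplateFile ADFTutorialARM.json `
        -TemplateParameterFile ("ADFTutorialARM-Parameters-" + $environment + ".json")
}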

Resource Manager template for creating a gateway


Here is a sample Resource Manager template for creating a logical gateway. Install a gateway on your
on-premises computer or Azure IaaS VM and register the gateway with the Data Factory service using a key. See
Move data between on-premises and cloud for details.

{
"contentVersion": "1.0.0.0",
"$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {
},
"variables": {
"dataFactoryName": "GatewayUsingArmDF",
"apiVersion": "2015-10-01",
"singleQuote": "'"
},
"resources": [
{
"name": "[variables('dataFactoryName')]",
"apiVersion": "[variables('apiVersion')]",
"type": "Microsoft.DataFactory/datafactories",
"location": "eastus",
"resources": [
{
"dependsOn": [ "[concat('Microsoft.DataFactory/dataFactories/',
variables('dataFactoryName'))]" ],
"type": "gateways",
"apiVersion": "[variables('apiVersion')]",
"name": "GatewayUsingARM",
"properties": {
"description": "my gateway"
}
}
]
}
]
}

This template creates a data factory named GatewayUsingArmDF with a gateway named: GatewayUsingARM.
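
To try the gateway template, save the JSON above to a file and deploy it with the same cmdlet used earlier in this
tutorial. The file name and path below are assumptions; adjust them to wherever you saved the template.

# Sketch: deploy the gateway Resource Manager template shown above.
New-AzureRmResourceGroupDeployment -Name GatewayARMDeployment `
    -ResourceGroupName ADFTutorialResourceGroup `
    -TemplateFile C:\ADFGetStarted\ADFTutorialGatewayArm.json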

See Also
TOPIC DESCRIPTION

Pipelines - This article helps you understand pipelines and activities in Azure Data Factory and how to use them
to construct end-to-end data-driven workflows for your scenario or business.

Datasets - This article helps you understand datasets in Azure Data Factory.

Scheduling and execution - This article explains the scheduling and execution aspects of the Azure Data Factory
application model.

Monitor and manage pipelines using Monitoring App - This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Tutorial: Build your first Azure data factory using
Data Factory REST API
8/21/2017 14 min to read

In this article, you use Data Factory REST API to create your first Azure data factory. To do the tutorial using other
tools/SDKs, select one of the options from the drop-down list.
The pipeline in this tutorial has one activity: HDInsight Hive activity. This activity runs a hive script on an Azure
HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a
month between the specified start and end times.

NOTE
This article does not cover all the REST API. For comprehensive documentation on REST API, see Data Factory REST API
Reference.
A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the
output dataset of one activity as the input dataset of the other activity. For more information, see scheduling and execution
in Data Factory.

Prerequisites
Read through Tutorial Overview article and complete the prerequisite steps.
Install Curl on your machine. You use the CURL tool with REST commands to create a data factory.
Follow instructions from this article to:
1. Create a Web application named ADFGetStartedApp in Azure Active Directory.
2. Get client ID and secret key.
3. Get tenant ID.
4. Assign the ADFGetStartedApp application to the Data Factory Contributor role. (A sample command for this step is sketched after this prerequisites list.)
Install Azure PowerShell.
Launch PowerShell and run the following command. Keep Azure PowerShell open until the end of this
tutorial. If you close and reopen, you need to run the commands again.
1. Run Login-AzureRmAccount and enter the user name and password that you use to sign in to the
Azure portal.
2. Run Get-AzureRmSubscription to view all the subscriptions for this account.
3. Run Get-AzureRmSubscription -SubscriptionName NameOfAzureSubscription | Set-
AzureRmContext to select the subscription that you want to work with. Replace
NameOfAzureSubscription with the name of your Azure subscription.
Create an Azure resource group named ADFTutorialResourceGroup by running the following command
in the PowerShell:

New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"

Some of the steps in this tutorial assume that you use the resource group named
ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of your
resource group in place of ADFTutorialResourceGroup in this tutorial.
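For prerequisite step 4 (assigning the ADFGetStartedApp application to the Data Factory Contributor role), the following is a minimal PowerShell sketch; it assumes the AzureRM modules are installed and that <application ID> is the client ID you noted for ADFGetStartedApp:

# Hedged sketch: grant the AAD application the Data Factory Contributor role
# at the subscription scope. Replace <application ID> with the app's client ID.
New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ServicePrincipalName "<application ID>"

You can also perform the same assignment in the Azure portal, as described in the article referenced in the prerequisites.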
Create JSON definitions
Create following JSON files in the folder where curl.exe is located.
datafactory.json

IMPORTANT
Name must be globally unique, so you may want to prefix/suffix FirstDataFactoryREST to make it a unique name.

{
"name": "FirstDataFactoryREST",
"location": "WestUS"
}

azurestoragelinkedservice.json

IMPORTANT
Replace accountname and accountkey with name and key of your Azure storage account. To learn how to get your
storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your
storage account.

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

hdinsightondemandlinkedservice.json

{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}

The following table provides descriptions for the JSON properties used in the snippet:

PROPERTY DESCRIPTION

ClusterSize Size of the HDInsight cluster.

TimeToLive Specifies the idle time for the HDInsight cluster before it
is deleted.

linkedServiceName Specifies the storage account that is used to store the logs
that are generated by HDInsight.

Note the following points:


The Data Factory creates a Linux-based HDInsight cluster for you with the above JSON. See On-demand
HDInsight Linked Service for details.
You could use your own HDInsight cluster instead of an on-demand HDInsight cluster (a sketch of such a linked
service follows these notes). See HDInsight Linked Service for details.
The HDInsight cluster creates a default container in the blob storage you specified in the JSON
(linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior
is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is
processed unless there is an existing live cluster (timeToLive) and is deleted when the processing is done.
As more slices are processed, you see many containers in your Azure blob storage. If you do not need
them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names
of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp".
Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage.
See On-demand HDInsight Linked Service for details.
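If you decide to bring your own HDInsight cluster (as mentioned in the notes above), the linked service JSON would look roughly like the following sketch. The cluster URI, user name, and password are placeholders, and the exact set of supported properties is documented in Compute Linked Services:

{
    "name": "HDInsightLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<yourclustername>.azurehdinsight.net/",
            "userName": "<cluster login user name>",
            "password": "<cluster login password>",
            "linkedServiceName": "AzureStorageLinkedService"
        }
    }
}

If you use this option, reference this linked service name from the pipeline activity later in the tutorial instead of HDInsightOnDemandLinkedService.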
inputdataset.json

{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"fileName": "input.log",
"folderPath": "adfgetstarted/inputdata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
},
"external": true,
"policy": {}
}
}

The JSON defines a dataset named AzureBlobInput, which represents input data for an activity in the pipeline. In
addition, it specifies that the input data is located in the blob container called adfgetstarted and the folder called
inputdata.
The following table provides descriptions for the JSON properties used in the snippet:
PROPERTY DESCRIPTION

type The type property is set to AzureBlob because data resides in


Azure blob storage.

linkedServiceName refers to the AzureStorageLinkedService you created earlier.

fileName This property is optional. If you omit this property, all the files
from the folderPath are picked. In this case, only the input.log
is processed.

type The log files are in text format, so we use TextFormat.

columnDelimiter columns in the log files are delimited by a comma character (,)

frequency/interval frequency set to Month and interval is 1, which means that


the input slices are available monthly.

external this property is set to true if the input data is not generated
by the Data Factory service.

outputdataset.json

{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adfgetstarted/partitioneddata",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Month",
"interval": 1
}
}
}

The JSON defines a dataset named AzureBlobOutput, which represents output data for an activity in the
pipeline. In addition, it specifies that the results are stored in the blob container called adfgetstarted and the
folder called partitioneddata. The availability section specifies that the output dataset is produced on a
monthly basis.
pipeline.json

IMPORTANT
Replace storageaccountname with name of your Azure storage account.
{
"name": "MyFirstPipeline",
"properties": {
"description": "My first Azure Data Factory pipeline",
"activities": [{
"type": "HDInsightHive",
"typeProperties": {
"scriptPath": "adfgetstarted/script/partitionweblogs.hql",
"scriptLinkedService": "AzureStorageLinkedService",
"defines": {
"inputtable":
"wasb://adfgetstarted@<stroageaccountname>.blob.core.windows.net/inputdata",
"partitionedtable":
"wasb://adfgetstarted@<stroageaccountname>t.blob.core.windows.net/partitioneddata"
}
},
"inputs": [{
"name": "AzureBlobInput"
}],
"outputs": [{
"name": "AzureBlobOutput"
}],
"policy": {
"concurrency": 1,
"retry": 3
},
"scheduler": {
"frequency": "Month",
"interval": 1
},
"name": "RunSampleHiveActivity",
"linkedServiceName": "HDInsightOnDemandLinkedService"
}],
"start": "2017-07-10T00:00:00Z",
"end": "2017-07-11T00:00:00Z",
"isPaused": false
}
}

In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on
a HDInsight cluster.
The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the
scriptLinkedService, called AzureStorageLinkedService), in the script folder of the adfgetstarted container.
The defines section specifies runtime settings that are passed to the Hive script as Hive configuration values (for
example, ${hiveconf:inputtable} and ${hiveconf:partitionedtable}).
The start and end properties of the pipeline specify the active period of the pipeline.
In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName
HDInsightOnDemandLinkedService.

NOTE
See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the preceding
example.

Set global variables


In Azure PowerShell, execute the following commands after replacing the values with your own:
IMPORTANT
See Prerequisites section for instructions on getting client ID, client secret, tenant ID, and subscription ID.

$client_id = "<client ID of application in AAD>"


$client_secret = "<client key of application in AAD>"
$tenant = "<Azure tenant ID>";
$subscription_id="<Azure subscription ID>";

$rg = "ADFTutorialResourceGroup"
$adf = "FirstDataFactoryREST"

Authenticate with AAD


$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret };
$responseToken = Invoke-Command -scriptblock $cmd;
$accessToken = (ConvertFrom-Json $responseToken).access_token;

(ConvertFrom-Json $responseToken)
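If you prefer to stay entirely in PowerShell rather than shelling out to curl, the following sketch obtains the same token with Invoke-RestMethod. It reuses the variables set in the previous step and is offered as an optional alternative, not as part of the official walkthrough:

# Hedged alternative to the curl call: request an AAD token with Invoke-RestMethod.
$body = @{
    grant_type    = "client_credentials"
    resource      = "https://management.core.windows.net/"
    client_id     = $client_id
    client_secret = $client_secret
}
$tokenResponse = Invoke-RestMethod -Method Post -Uri "https://login.microsoftonline.com/$tenant/oauth2/token" -Body $body
$accessToken = $tokenResponse.access_token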

Create data factory


In this step, you create an Azure Data Factory named FirstDataFactoryREST. A data factory can have one or more
pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source
to a destination data store and a HDInsight Hive activity to run a Hive script to transform data. Run the following
commands to create the data factory:
1. Assign the command to variable named cmd.
Confirm that the name of the data factory you specify here (FirstDataFactoryREST) matches the name
specified in the datafactory.json.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data @datafactory.json https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/FirstDataFactoryREST?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in
the results; otherwise, you see an error message.

Write-Host $results

Note the following points:


The name of the Azure Data Factory must be globally unique. If you see the error in results: Data factory
name FirstDataFactoryREST is not available, do the following steps:
1. Change the name (for example, yournameFirstDataFactoryREST) in the datafactory.json file. See Data
Factory - Naming Rules topic for naming rules for Data Factory artifacts.
2. In the first command where the $cmd variable is assigned a value, replace FirstDataFactoryREST with
the new name and run the command.
3. Run the next two commands to invoke the REST API to create the data factory and print the results of
the operation.
To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription.
The name of the data factory may be registered as a DNS name in the future and hence become publicly
visible.
If you receive the error: "This subscription is not registered to use namespace
Microsoft.DataFactory", do one of the following and try publishing again:
In Azure PowerShell, run the following command to register the Data Factory provider:

Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

You can run the following command to confirm that the Data Factory provider is registered:

Get-AzureRmResourceProvider

Login using the Azure subscription into the Azure portal and navigate to a Data Factory blade (or)
create a data factory in the Azure portal. This action automatically registers the provider for you.
Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to
link data stores and computes to your data factory, and then define input and output datasets to represent the data in
the linked data stores.
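Optionally, before creating the linked services, you can confirm that the data factory exists by issuing a GET request against the same resource URI. This is a convenience sketch that reuses the variables defined earlier and mirrors the PUT call above:

# Hedged sketch: retrieve the data factory definition to verify that it was created.
$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf?api-version=2015-10-01};
$results = Invoke-Command -scriptblock $cmd;
Write-Host $results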

Create linked services


In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory.
The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked
service is used to run a Hive script specified in the activity of the pipeline in this sample.
Create Azure Storage linked service
In this step, you link your Azure Storage account to your data factory. With this tutorial, you use the same Azure
Storage account to store input/output data and the HQL script file.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data @azurestoragelinkedservice.json https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.

Write-Host $results
Create Azure HDInsight linked service
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically
created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use
your own HDInsight cluster instead of using an on-demand HDInsight cluster. See Compute Linked Services for
details.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@hdinsightondemandlinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/hdinsightondemandlinkedservice?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the linked service has been successfully created, you see the JSON for the linked service
in the results; otherwise, you see an error message.

Write-Host $results

Create datasets
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to
the AzureStorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage
account, and the datasets specify the container, folder, and file name in the storage that holds the input and output data.
Create input dataset
In this step, you create the input dataset to represent input data stored in the Azure Blob storage.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@inputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.

Write-Host $results

Create output dataset


In this step, you create the output dataset to represent output data stored in the Azure Blob storage.
1. Assign the command to variable named cmd.
$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@outputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobOutput?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the
results; otherwise, you see an error message.

Write-Host $results

Create pipeline
In this step, you create your first pipeline with a HDInsightHive activity. Input slice is available monthly
(frequency: Month, interval: 1), output slice is produced monthly, and the scheduler property for the activity is also
set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, output
dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce
any output. If the activity doesn't take any input, you can skip creating the input dataset.
Confirm that you see the input.log file in the adfgetstarted/inputdata folder in the Azure blob storage, and
run the following command to deploy the pipeline. Since the start and end times are set in the past and
isPaused is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy.
1. Assign the command to variable named cmd.

$cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@pipeline.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datapipelines/MyFirstPipeline?api-version=2015-10-01};

2. Run the command by using Invoke-Command.

$results = Invoke-Command -scriptblock $cmd;

3. View the results. If the pipeline has been successfully created, you see the JSON for the pipeline in the
results; otherwise, you see an error message.

Write-Host $results

4. Congratulations, you have successfully created your first pipeline using the Data Factory REST API!

Monitor pipeline
In this step, you use Data Factory REST API to monitor slices being produced by the pipeline.
$ds ="AzureBlobOutput"

$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/$ds/slices?start=1970-01-01T00%3a00%3a00.0000000Z"&"end=2016-08-12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01};

$results2 = Invoke-Command -scriptblock $cmd;

IF ((ConvertFrom-Json $results2).value -ne $NULL) {


ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table
} else {
(convertFrom-Json $results2).RemoteException
}

IMPORTANT
Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the
pipeline to take approximately 30 minutes to process the slice.

Run the Invoke-Command and the next one until you see the slice in Ready state or Failed state. When the slice
is in Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the
output data. The creation of an on-demand HDInsight cluster usually takes some time.
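If you would rather not rerun the two monitoring commands by hand, the following sketch wraps them in a simple polling loop that prints the slice list every five minutes for up to an hour. It is a convenience sketch only; stop it once the slice reaches the Ready or Failed state:

# Hedged sketch: re-run the GET call defined above every 5 minutes and print the slices.
for ($i = 0; $i -lt 12; $i++) {
    $results2 = Invoke-Command -scriptblock $cmd
    ConvertFrom-Json $results2 | Select-Object -ExpandProperty value | Format-Table
    Start-Sleep -Seconds 300
}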

IMPORTANT
The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the
tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.

You can also use the Azure portal to monitor slices and troubleshoot any issues. See Monitor pipelines using Azure
portal for details.

Summary
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop
cluster. You used the Data Factory REST API to do the following steps:
1. Created an Azure data factory.
2. Created two linked services:
a. Azure Storage linked service to link your Azure blob storage that holds input/output files to the data
factory.
b. Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the
data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data
and produce output data.
3. Created two datasets, which describe input and output data for HDInsight Hive activity in the pipeline.
4. Created a pipeline with a HDInsight Hive activity.

Next steps
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script
on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to
Azure SQL, see Tutorial: Copy data from an Azure Blob to Azure SQL.

See Also
TOPIC DESCRIPTION

Data Factory REST API Reference See comprehensive documentation on the Data Factory REST API

Pipelines This article helps you understand pipelines and activities in


Azure Data Factory and how to use them to construct end-
to-end data-driven workflows for your scenario or business.

Datasets This article helps you understand datasets in Azure Data


Factory.

Scheduling and Execution This article explains the scheduling and execution aspects of
Azure Data Factory application model.

Monitor and manage pipelines using Monitoring App This article describes how to monitor, manage, and debug
pipelines using the Monitoring & Management App.
Move data between on-premises sources and the
cloud with Data Management Gateway
8/21/2017 15 min to read Edit Online

This article provides an overview of data integration between on-premises data stores and cloud data stores using
Data Factory. It builds on the Data Movement Activities article and other data factory core concepts articles: datasets
and pipelines.

Data Management Gateway


You must install Data Management Gateway on your on-premises machine to enable moving data to/from an on-
premises data store. The gateway can be installed on the same machine as the data store or on a different machine
as long as the gateway can connect to the data store.

IMPORTANT
See Data Management Gateway article for details about Data Management Gateway.

The following walkthrough shows you how to create a data factory with a pipeline that moves data from an on-
premises SQL Server database to an Azure blob storage. As part of the walkthrough, you install and configure the
Data Management Gateway on your machine.

Walkthrough: copy on-premises data to cloud


In this walkthrough you do the following steps:
1. Create a data factory.
2. Create a data management gateway.
3. Create linked services for source and sink data stores.
4. Create datasets to represent input and output data.
5. Create a pipeline with a copy activity to move the data.

Prerequisites for the tutorial


Before you begin this walkthrough, you must have the following prerequisites:
Azure subscription. If you don't have a subscription, you can create a free trial account in just a couple of
minutes. See the Free Trial article for details.
Azure Storage Account. You use the blob storage as a destination/sink data store in this tutorial. If you don't
have an Azure storage account, see the Create a storage account article for steps to create one.
SQL Server. You use an on-premises SQL Server database as a source data store in this tutorial.

Create data factory


In this step, you use the Azure portal to create an Azure Data Factory instance named ADFTutorialOnPremDF.
1. Log in to the Azure portal.
2. Click + NEW, click Intelligence + analytics, and click Data Factory.
3. In the New data factory page, enter ADFTutorialOnPremDF for the Name.
IMPORTANT
The name of the Azure data factory must be globally unique. If you receive the error: Data factory name
ADFTutorialOnPremDF is not available, change the name of the data factory (for example,
yournameADFTutorialOnPremDF) and try creating again. Use this name in place of ADFTutorialOnPremDF while
performing remaining steps in this tutorial.
The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.

4. Select the Azure subscription where you want the data factory to be created.
5. Select existing resource group or create a resource group. For the tutorial, create a resource group named:
ADFTutorialResourceGroup.
6. Click Create on the New data factory page.

IMPORTANT
To create Data Factory instances, you must be a member of the Data Factory Contributor role at the
subscription/resource group level.

7. After creation is complete, you see the Data Factory page as shown in the following image:
Create gateway
1. In the Data Factory page, click Author and deploy tile to launch the Editor for the data factory.

2. In the Data Factory Editor, click ... More on the toolbar and then click New data gateway. Alternatively, you
can right-click Data Gateways in the tree view, and click New data gateway.
3. In the Create page, enter adftutorialgateway for the name, and click OK.

NOTE
In this walkthrough, you create the logical gateway with only one node (on-premises Windows machine). You can
scale out a data management gateway by associating multiple on-premises machines with the gateway. You can scale
up by increasing the number of data movement jobs that can run concurrently on a node. This feature is also available for
a logical gateway with a single node. See Scaling data management gateway in Azure Data Factory article for details.

4. In the Configure page, click Install directly on this computer. This action downloads the installation
package for the gateway, installs, configures, and registers the gateway on the computer.
NOTE
Use Internet Explorer or a Microsoft ClickOnce compatible web browser.
If you are using Chrome, go to the Chrome web store, search with "ClickOnce" keyword, choose one of the ClickOnce
extensions, and install it.
Do the same for Firefox (install add-in). Click Open Menu button on the toolbar (three horizontal lines in the top-
right corner), click Add-ons, search with "ClickOnce" keyword, choose one of the ClickOnce extensions, and install it.

This is the easiest way (one-click) to download, install, configure, and register the gateway in a single
step. You can see the Microsoft Data Management Gateway Configuration Manager application is
installed on your computer. You can also find the executable ConfigManager.exe in the folder: C:\Program
Files\Microsoft Data Management Gateway\2.0\Shared.
You can also download and install gateway manually by using the links in this page and register it using the
key shown in the NEW KEY text box.
See Data Management Gateway article for all the details about the gateway.

NOTE
You must be an administrator on the local computer to install and configure the Data Management Gateway
successfully. You can add additional users to the Data Management Gateway Users local Windows group. The
members of this group can use the Data Management Gateway Configuration Manager tool to configure the
gateway.

5. Wait for a couple of minutes or wait until you see the following notification message:
6. Launch Data Management Gateway Configuration Manager application on your computer. In the
Search window, type Data Management Gateway to access this utility. You can also find the executable
ConfigManager.exe in the folder: C:\Program Files\Microsoft Data Management
Gateway\2.0\Shared

7. Confirm that you see the adftutorialgateway is connected to the cloud service message. The status bar at the
bottom displays Connected to the cloud service along with a green check mark.
On the Home tab, you can also do the following operations:
Register a gateway with a key from the Azure portal by using the Register button.
Stop the Data Management Gateway Host Service running on your gateway machine.
Schedule updates to be installed at a specific time of the day.
View when the gateway was last updated.
Specify time at which an update to the gateway can be installed.
8. Switch to the Settings tab. The certificate specified in the Certificate section is used to encrypt/decrypt
credentials for the on-premises data store that you specify on the portal. (optional) Click Change to use your
own certificate instead. By default, the gateway uses the certificate that is auto-generated by the Data Factory
service.

You can also do the following actions on the Settings tab:


View or export the certificate being used by the gateway.
Change the HTTPS endpoint used by the gateway.
Set an HTTP proxy to be used by the gateway.
9. (optional) Switch to the Diagnostics tab, check the Enable verbose logging option if you want to enable
verbose logging that you can use to troubleshoot any issues with the gateway. The logging information can
be found in Event Viewer under Applications and Services Logs -> Data Management Gateway node.
You can also perform the following actions in the Diagnostics tab:
Use the Test Connection section to test the connection to an on-premises data source using the gateway.
Click View Logs to see the Data Management Gateway log in an Event Viewer window.
Click Send Logs to upload a zip file with logs of last seven days to Microsoft to facilitate troubleshooting
of your issues.
10. On the Diagnostics tab, in the Test Connection section, select SqlServer for the type of the data store, enter
the name of the database server, name of the database, specify authentication type, enter user name, and
password, and click Test to test whether the gateway can connect to the database.
11. Switch to the web browser, and in the Azure portal, click OK on the Configure page and then on the New data
gateway page.
12. You should see adftutorialgateway under Data Gateways in the tree view on the left. If you click it, you
should see the associated JSON.

Create linked services


In this step, you create two linked services: AzureStorageLinkedService and SqlServerLinkedService. The
SqlServerLinkedService links an on-premises SQL Server database and the AzureStorageLinkedService linked
service links an Azure blob store to the data factory. You create a pipeline later in this walkthrough that copies data
from the on-premises SQL Server database to the Azure blob store.
Add a linked service to an on-premises SQL Server database
1. In the Data Factory Editor, click New data store on the toolbar and select SQL Server.
2. In the JSON editor on the right, do the following steps:
a. For the gatewayName, specify adftutorialgateway.
b. In the connectionString, do the following steps:
a. For servername, enter the name of the server that hosts the SQL Server database.
b. For databasename, enter the name of the database.
c. Click Encrypt button on the toolbar. You see the Credentials Manager application.

d. In the Setting Credentials dialog box, specify authentication type, user name, and password, and
click OK. If the connection is successful, the encrypted credentials are stored in the JSON and the
dialog box closes.
e. Close the empty browser tab that launched the dialog box if it is not automatically closed and
get back to the tab with the Azure portal.
On the gateway machine, these credentials are encrypted by using a certificate that the Data
Factory service owns. If you want to use the certificate that is associated with the Data
Management Gateway instead, see Set credentials securely.
c. Click Deploy on the command bar to deploy the SQL Server linked service. You should see the linked
service in the tree view.

Add a linked service for an Azure storage account


1. In the Data Factory Editor, click New data store on the command bar and click Azure storage.
2. Enter the name of your Azure storage account for the Account name.
3. Enter the key for your Azure storage account for the Account key.
4. Click Deploy to deploy the AzureStorageLinkedService.

Create datasets
In this step, you create input and output datasets that represent input and output data for the copy operation (On-
premises SQL Server database => Azure blob storage). Before creating datasets, do the following steps (detailed
steps follow the list):
Create a table named emp in the SQL Server Database you added as a linked service to the data factory and
insert a couple of sample entries into the table.
Create a blob container named adftutorial in the Azure blob storage account you added as a linked service to
the data factory.
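For the second item, you can create the blob container with the Azure Storage PowerShell cmdlets instead of the portal; a minimal sketch, assuming you replace the account name and key with your own values:

# Hedged sketch: create the adftutorial container used by this walkthrough.
$ctx = New-AzureStorageContext -StorageAccountName "<storageaccountname>" -StorageAccountKey "<storageaccountkey>"
New-AzureStorageContainer -Name "adftutorial" -Context $ctx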
Prepare On-premises SQL Server for the tutorial
1. In the database you specified for the on-premises SQL Server linked service (SqlServerLinkedService), use
the following SQL script to create the emp table in the database.

CREATE TABLE dbo.emp


(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50),
CONSTRAINT PK_emp PRIMARY KEY (ID)
)
GO

2. Insert some sample entries into the table:

INSERT INTO emp VALUES ('John', 'Doe')


INSERT INTO emp VALUES ('Jane', 'Doe')

Create input dataset


1. In the Data Factory Editor, click ... More, click New dataset on the command bar, and click SQL Server table.
2. Replace the JSON in the right pane with the following text:
{
"name": "EmpOnPremSQLTable",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "SqlServerLinkedService",
"typeProperties": {
"tableName": "emp"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}

Note the following points:


type is set to SqlServerTable.
tableName is set to emp.
linkedServiceName is set to SqlServerLinkedService (you created this linked service earlier in this
walkthrough).
For an input dataset that is not generated by another pipeline in Azure Data Factory, you must set
external to true. It denotes the input data is produced external to the Azure Data Factory service. You can
optionally specify any external data policies using the externalData element in the Policy section.
See Move data to/from SQL Server for details about JSON properties.
3. Click Deploy on the command bar to deploy the dataset.
Create output dataset
1. In the Data Factory Editor, click New dataset on the command bar, and click Azure Blob storage.
2. Replace the JSON in the right pane with the following text:

{
"name": "OutputBlobTable",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "adftutorial/outfromonpremdf",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

Note the following points:


type is set to AzureBlob.
linkedServiceName is set to AzureStorageLinkedService (you had created this linked service in Step
2).
folderPath is set to adftutorial/outfromonpremdf where outfromonpremdf is the folder in the
adftutorial container. Create the adftutorial container if it does not already exist.
The availability is set to hourly (frequency set to hour and interval set to 1). The Data Factory service
generates an output data slice every hour in the adftutorial/outfromonpremdf folder in the Azure blob storage.
If you do not specify a fileName for an output table, the generated files in the folderPath are named in
the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt).
To set folderPath and fileName dynamically based on the SliceStart time, use the partitionedBy property.
In the following example, folderPath uses Year, Month, and Day from the SliceStart (start time of the slice
being processed) and fileName uses Hour from the SliceStart. For example, if a slice is being produced for
2014-10-20T08:00:00, the folderName is set to wikidatagateway/wikisampledataout/2014/10/20 and the
fileName is set to 08.csv.

"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[

{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },


{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],

See Move data to/from Azure Blob Storage for details about JSON properties.
3. Click Deploy on the command bar to deploy the dataset. Confirm that you see both the datasets in the tree
view.

Create pipeline
In this step, you create a pipeline with one Copy Activity that uses EmpOnPremSQLTable as input and
OutputBlobTable as output.
1. In Data Factory Editor, click ... More, and click New pipeline.
2. Replace the JSON in the right pane with the following text:
{
"name": "ADFTutorialPipelineOnPrem",
"properties": {
"description": "This pipeline has one Copy activity that copies data from an on-prem SQL to Azure
blob",
"activities": [
{
"name": "CopyFromSQLtoBlob",
"description": "Copy data from on-prem SQL server to blob",
"type": "Copy",
"inputs": [
{
"name": "EmpOnPremSQLTable"
}
],
"outputs": [
{
"name": "OutputBlobTable"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from emp"
},
"sink": {
"type": "BlobSink"
}
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-05T00:00:00Z",
"end": "2016-07-06T00:00:00Z",
"isPaused": false
}
}

IMPORTANT
Replace the value of the start property with the current day and end value with the next day.

Note the following points:


In the activities section, there is only one activity whose type is set to Copy.
Input for the activity is set to EmpOnPremSQLTable and output for the activity is set to
OutputBlobTable.
In the typeProperties section, SqlSource is specified as the source type and BlobSink is specified
as the sink type.
SQL query select * from emp is specified for the sqlReaderQuery property of SqlSource.
Both start and end datetimes must be in ISO format. For example: 2014-10-14T16:32:41Z. The end time is
optional, but we use it in this tutorial.
If you do not specify value for the end property, it is calculated as "start + 48 hours". To run the pipeline
indefinitely, specify 9/9/9999 as the value for the end property.
You are defining the time duration in which the data slices are processed based on the Availability
properties that were defined for each Azure Data Factory dataset.
In the example, there are 24 data slices as each data slice is produced hourly.
3. Click Deploy on the command bar to deploy the pipeline. Confirm that the
pipeline shows up in the tree view under the Pipelines node.
4. Now, click X twice to close the page to get back to the Data Factory page for the ADFTutorialOnPremDF.
Congratulations! You have successfully created an Azure data factory, linked services, datasets, and a pipeline and
scheduled the pipeline.
View the data factory in a Diagram View
1. In the Azure portal, click Diagram tile on the home page for the ADFTutorialOnPremDF data factory. :

2. You should see the diagram similar to the following image:

You can zoom in, zoom out, zoom to 100%, zoom to fit, automatically position pipelines and datasets, and
show lineage information (highlights upstream and downstream items of selected items). You can double-
click an object (input/output dataset or pipeline) to see properties for it.

Monitor pipeline
In this step, you use the Azure portal to monitor what's going on in an Azure data factory. You can also use
PowerShell cmdlets to monitor datasets and pipelines. For details about monitoring, see Monitor and Manage
Pipelines.
1. In the diagram, double-click EmpOnPremSQLTable.

2. Notice that all the data slices are in the Ready state because the pipeline duration (start time to end time) is in
the past. It is also because you have inserted the data in the SQL Server database and it is there all the time.
Confirm that no slices show up in the Problem slices section at the bottom. To view all the slices, click See
More at the bottom of the list of slices.
3. Now, In the Datasets page, click OutputBlobTable.
4. Click any data slice from the list and you should see the Data Slice page. You see activity runs for the slice.
You see only one activity run usually.
If the slice is not in the Ready state, you can see the upstream slices that are not Ready and are blocking the
current slice from executing in the Upstream slices that are not ready list.
5. Click the activity run from the list at the bottom to see activity run details.
You would see information such as throughput, duration, and the gateway used to transfer the data.
6. Click X to close all the pages until you get back to the home page for the ADFTutorialOnPremDF.
7. (optional) Click Pipelines, click ADFTutorialOnPremDF, and drill through input tables (Consumed) or output
datasets (Produced).
8. Use tools such as Microsoft Storage Explorer to verify that a blob/file is created for each hour.
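For example, instead of Storage Explorer you can also list the generated blobs from PowerShell. This is a hedged sketch that assumes a storage context created as shown earlier, with your own account name and key:

# Hedged sketch: list blobs written to the output folder of the adftutorial container.
$ctx = New-AzureStorageContext -StorageAccountName "<storageaccountname>" -StorageAccountKey "<storageaccountkey>"
Get-AzureStorageBlob -Container "adftutorial" -Context $ctx | Where-Object { $_.Name -like "outfromonpremdf/*" } | Select-Object Name, LastModified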
Next steps
See Data Management Gateway article for all the details about the Data Management Gateway.
See Copy data from Azure Blob to Azure SQL to learn about how to use Copy Activity to move data from a
source data store to a sink data store.
Azure Data Factory - Frequently Asked Questions
8/15/2017 22 min to read Edit Online

General questions
What is Azure Data Factory?
Data Factory is a cloud-based data integration service that automates the movement and transformation of
data. Just like a factory that runs equipment to take raw materials and transform them into finished goods, Data
Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.
Data Factory allows you to create data-driven workflows to move data between both on-premises and cloud data
stores as well as process/transform data using compute services such as Azure HDInsight and Azure Data Lake
Analytics. After you create a pipeline that performs the action that you need, you can schedule it to run periodically
(hourly, daily, weekly etc.).
For more information, see Overview & Key Concepts.
Where can I find pricing details for Azure Data Factory?
See Data Factory Pricing Details page for the pricing details for the Azure Data Factory.
How do I get started with Azure Data Factory?
For an overview of Azure Data Factory, see Introduction to Azure Data Factory.
For a tutorial on how to copy/move data using Copy Activity, see Copy data from Azure Blob Storage to Azure
SQL Database.
For a tutorial on how to transform data using HDInsight Hive Activity. See Process data by running Hive script
on Hadoop cluster
What is Data Factory's region availability?
Data Factory is available in US West and North Europe. The compute and storage services used by data factories
can be in other regions. See Supported regions.
What are the limits on number of data factories/pipelines/activities/datasets?
See Azure Data Factory Limits section of the Azure Subscription and Service Limits, Quotas, and Constraints
article.
What is the authoring/developer experience with Azure Data Factory service?
You can author/create data factories using one of the following tools/SDKs:
Azure portal The Data Factory blades in the Azure portal provide a rich user interface for you to create data
factories and linked services. The Data Factory Editor, which is also part of the portal, allows you to easily create
linked services, tables, data sets, and pipelines by specifying JSON definitions for these artifacts. See Build your
first data pipeline using Azure portal for an example of using the portal/editor to create and deploy a data
factory.
Visual Studio You can use Visual Studio to create an Azure data factory. See Build your first data pipeline using
Visual Studio for details.
Azure PowerShell See Create and monitor Azure Data Factory using Azure PowerShell for a
tutorial/walkthrough for creating a data factory using PowerShell. See Data Factory Cmdlet Reference content
on MSDN Library for a comprehensive documentation of Data Factory cmdlets.
.NET Class Library You can programmatically create data factories by using Data Factory .NET SDK. See Create,
monitor, and manage data factories using .NET SDK for a walkthrough of creating a data factory using .NET SDK.
See Data Factory Class Library Reference for a comprehensive documentation of Data Factory .NET SDK.
REST API You can also use the REST API exposed by the Azure Data Factory service to create and deploy data
factories. See Data Factory REST API Reference for a comprehensive documentation of Data Factory REST API.
Azure Resource Manager Template See Tutorial: Build your first Azure data factory using Azure Resource
Manager template for details.
Can I rename a data factory?
No. Like other Azure resources, the name of an Azure data factory cannot be changed.
Can I move a data factory from one Azure subscription to another?
Yes. Use the Move button on your data factory blade as shown in the following diagram:

What are the compute environments supported by Data Factory?


The following table provides a list of compute environments supported by Data Factory and the activities that can
run on them.

COMPUTE ENVIRONMENT ACTIVITIES

On-demand HDInsight cluster or your own HDInsight cluster DotNet, Hive, Pig, MapReduce, Hadoop Streaming

Azure Batch DotNet

Azure Machine Learning Machine Learning activities: Batch Execution and Update
Resource

Azure Data Lake Analytics Data Lake Analytics U-SQL

Azure SQL, Azure SQL Data Warehouse, SQL Server Stored Procedure

How does Azure Data Factory compare with SQL Server Integration Services (SSIS)?
See the Azure Data Factory vs. SSIS presentation from one of our MVPs (Most Valuable Professionals): Reza Rad.
Some of the recent changes in Data Factory may not be listed in the slide deck. We are continuously adding more
capabilities to Azure Data Factory. We will incorporate these updates into the comparison of data integration
technologies from Microsoft sometime later this year.

Activities - FAQ
What are the different types of activities you can use in a Data Factory pipeline?
Data Movement Activities to move data.
Data Transformation Activities to process/transform data.
When does an activity run?
The availability configuration setting in the output data table determines when the activity is run. If input datasets
are specified, the activity checks whether all the input data dependencies are satisfied (that is, Ready state) before it
starts running.
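For example, an output dataset with the following availability section causes the activity that produces it to run once an hour; this mirrors the dataset definitions shown elsewhere in this documentation:

"availability": {
    "frequency": "Hour",
    "interval": 1
}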

Copy Activity - FAQ


Is it better to have a pipeline with multiple activities or a separate pipeline for each activity?
Pipelines are supposed to bundle related activities. If the datasets that connect them are not consumed by any other
activity outside the pipeline, you can keep the activities in one pipeline. This way, you would not need to chain
pipeline active periods so that they align with each other. Also, the data integrity in the tables internal to the
pipeline is better preserved when updating the pipeline. Pipeline update essentially stops all the activities within the
pipeline, removes them, and creates them again. From authoring perspective, it might also be easier to see the flow
of data within the related activities in one JSON file for the pipeline.
What are the supported data stores?
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the
following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data
to and from that store.

CATEGORY DATA STORE SUPPORTED AS A SOURCE SUPPORTED AS A SINK

Azure Azure Blob storage

Azure Cosmos DB
(DocumentDB API)

Azure Data Lake Store

Azure SQL Database

Azure SQL Data Warehouse

Azure Search Index

Azure Table storage

Databases Amazon Redshift

DB2*

MySQL*

Oracle*

PostgreSQL*

SAP Business Warehouse*

SAP HANA*

SQL Server*

Sybase*

Teradata*

NoSQL Cassandra*

MongoDB*

File Amazon S3

File System*

FTP

HDFS*

SFTP

Others Generic HTTP

Generic OData

Generic ODBC*

Salesforce

Web Table (table from HTML)

GE Historian*

NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-
premises/Azure IaaS machine.

What are the supported file formats?


Specifying formats
Azure Data Factory supports the following format types:
Text Format
JSON Format
Avro Format
ORC Format
Parquet Format
Specifying TextFormat
If you want to parse the text files or write the data in text format, set the format type property to TextFormat.
You can also specify the following optional properties in the format section. See TextFormat example section on
how to configure.

columnDelimiter
Description: The character used to separate columns in a file. You can consider using a rare unprintable character that is unlikely to exist in your data: for example, specify "\u0001", which represents Start of Heading (SOH).
Allowed values: Only one character is allowed. The default value is comma (','). To use a Unicode character, refer to Unicode Characters to get the corresponding code for it.
Required: No

rowDelimiter
Description: The character used to separate rows in a file.
Allowed values: Only one character is allowed. The default value is any of the following values on read: ["\r\n", "\r", "\n"], and "\r\n" on write.
Required: No

escapeChar
Description: The special character used to escape a column delimiter in the content of an input file. You cannot specify both escapeChar and quoteChar for a table.
Allowed values: Only one character is allowed. No default value. Example: if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define $ as the escape character and use the string "Hello$, world" in the source.
Required: No

quoteChar
Description: The character used to quote a string value. The column and row delimiters inside the quote characters would be treated as part of the string value. This property is applicable to both input and output datasets. You cannot specify both escapeChar and quoteChar for a table.
Allowed values: Only one character is allowed. No default value. For example, if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: Hello, world), you can define " (double quote) as the quote character and use the string "Hello, world" in the source.
Required: No

nullValue
Description: One or more characters used to represent a null value.
Allowed values: One or more characters. The default values are "\N" and "NULL" on read and "\N" on write.
Required: No

encodingName
Description: Specify the encoding name.
Allowed values: A valid encoding name; see Encoding.EncodingName Property. Example: windows-1250 or shift_jis. The default value is UTF-8.
Required: No

firstRowAsHeader
Description: Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads the first row as a header. For an output dataset, Data Factory writes the first row as a header. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
Allowed values: True, False (default)
Required: No

skipLineCount
Description: Indicates the number of rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. See Scenarios for using firstRowAsHeader and skipLineCount for sample scenarios.
Allowed values: Integer
Required: No

treatEmptyAsNull
Description: Specifies whether to treat a null or empty string as a null value when reading data from an input file.
Allowed values: True (default), False
Required: No

TextFormat example
The following sample shows some of the format properties for TextFormat.
"typeProperties":
{
"folderPath": "mycontainer/myfolder",
"fileName": "myblobname",
"format":
{
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";",
"quoteChar": "\"",
"NullValue": "NaN",
"firstRowAsHeader": true,
"skipLineCount": 0,
"treatEmptyAsNull": true
}
},

To use an escapeChar instead of quoteChar, replace the line with quoteChar with the following escapeChar:

"escapeChar": "$",

Scenarios for using firstRowAsHeader and skipLineCount


You are copying from a non-file source to a text file and would like to add a header line containing the schema
metadata (for example: SQL schema). Specify firstRowAsHeader as true in the output dataset for this scenario.
You are copying from a text file containing a header line to a non-file sink and would like to drop that line.
Specify firstRowAsHeader as true in the input dataset.
You are copying from a text file and want to skip a few lines at the beginning that contain no data or header
information. Specify skipLineCount to indicate the number of lines to be skipped. If the rest of the file contains a
header line, you can also specify firstRowAsHeader . If both skipLineCount and firstRowAsHeader are specified,
the lines are skipped first and then the header information is read from the input file
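For instance, to cover the last scenario, the format section of the input dataset might look like the following sketch, which skips two leading lines and then reads a header row:

"format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "skipLineCount": 2,
    "firstRowAsHeader": true
}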
Specifying JsonFormat
To import/export JSON files as-is into/from Azure Cosmos DB, see Import/export JSON documents section in
the Azure Cosmos DB connector with details.
If you want to parse the JSON files or write the data in JSON format, set the format type property to
JsonFormat. You can also specify the following optional properties in the format section. See JsonFormat
example section on how to configure.

PROPERTY DESCRIPTION REQUIRED

filePattern Indicate the pattern of data stored in No


each JSON file. Allowed values are:
setOfObjects and arrayOfObjects.
The default value is setOfObjects. See
JSON file patterns section for details
about these patterns.

jsonNodeReference If you want to iterate and extract data No


from the objects inside an array field
with the same pattern, specify the JSON
path of that array. This property is
supported only when copying data from
JSON files.

jsonPathDefinition Specify the JSON path expression for No


each column mapping with a
customized column name (start with
lowercase). This property is supported
only when copying data from JSON files,
and you can extract data from object or
array.

For fields under root object, start with


root $; for fields inside the array chosen
by jsonNodeReference property, start
from the array element. See JsonFormat
example section on how to configure.

encodingName Specify the encoding name. For the list No


of valid encoding names, see:
Encoding.EncodingName Property. For
example: windows-1250 or shift_jis. The
default value is: UTF-8.

nestingSeparator Character that is used to separate No


nesting levels. The default value is '.'
(dot).

JSON file patterns


Copy activity can parse below patterns of JSON files:
Type I: setOfObjects
Each file contains single object, or line-delimited/concatenated multiple objects. When this option is chosen
in an output dataset, copy activity produces a single JSON file with each object per line (line-delimited).
single object JSON example

{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}

line-delimited JSON example

{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"56
7834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"78
9037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"34
5626404","switch1":"Germany","switch2":"UK"}

concatenated JSON example


{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
}
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}

Type II: arrayOfObjects


Each file contains an array of objects.

[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]

JsonFormat example
Case 1: Copying data from JSON files
See below two types of samples when copying data from JSON files, and the generic points to note:
Sample 1: extract data from object and array
In this sample, you expect one root JSON object maps to single record in tabular result. If you have a JSON file with
the following content:

{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagmentProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}

and you want to copy it into an Azure SQL table in the following format, by extracting data from both the objects
and the array:

ID | DEVICETYPE | TARGETRESOURCETYPE | RESOURCEMANAGMENTPROCESSRUNID | OCCURRENCETIME
ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More
specifically:
The structure section defines the customized column names and the corresponding data types while converting to
tabular data. This section is optional unless you need to do column mapping. See the Specifying structure
definition for rectangular datasets section for more details.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. To copy
data from an array, you can use array[x].property to extract the value of the given property from the xth object,
or you can use array[*].property to find the value from any object that contains such a property.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "deviceType",
"type": "String"
},
{
"name": "targetResourceType",
"type": "String"
},
{
"name": "resourceManagmentProcessRunId",
"type": "String"
},
{
"name": "occurrenceTime",
"type": "DateTime"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType":
"$.context.custom.dimensions[0].TargetResourceType", "resourceManagmentProcessRunId":
"$.context.custom.dimensions[1].ResourceManagmentProcessRunId", "occurrenceTime": "
$.context.custom.dimensions[2].OccurrenceTime"}
}
}
}
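As noted above, you can also use the array[*].property form when the position of an object inside the array is not
fixed. The following is a minimal variant of the same jsonPathDefinition using the wildcard form; it is an
illustrative sketch, not part of the original sample:

"jsonPathDefinition": {
    "id": "$.id",
    "deviceType": "$.context.device.type",
    "targetResourceType": "$.context.custom.dimensions[*].TargetResourceType",
    "resourceManagmentProcessRunId": "$.context.custom.dimensions[*].ResourceManagmentProcessRunId",
    "occurrenceTime": "$.context.custom.dimensions[*].OccurrenceTime"
}

With the wildcard form, each value is taken from whichever object in the dimensions array contains that property,
so the mapping still works if the order of the objects changes.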

Sample 2: cross apply multiple objects with the same pattern from array
In this sample, you expect to transform one root JSON object into multiple records in the tabular result. If you
have a JSON file with the following content:

{
"ordernumber": "01",
"orderdate": "20170122",
"orderlines": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "sanmateo": "No 1" } ]
}

and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array and
cross-joining it with the common root info:
ORDERNUMBER ORDERDATE ORDER_PD ORDER_PRICE CITY

01 20170122 P1 23 [{"sanmateo":"No 1"}]

01 20170122 P2 13 [{"sanmateo":"No 1"}]

01 20170122 P3 231 [{"sanmateo":"No 1"}]

The input dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts). More
specifically:
The structure section defines the customized column names and the corresponding data types while converting to
tabular data. This section is optional unless you need to do column mapping. See the Specifying structure
definition for rectangular datasets section for more details.
jsonNodeReference indicates that data should be iterated and extracted from the objects with the same pattern
under the array orderlines.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. In this
example, "ordernumber", "orderdate", and "city" are under the root object, with JSON paths starting with "$.", while
"order_pd" and "order_price" are defined with paths derived from the array element, without "$.".

"properties": {
"structure": [
{
"name": "ordernumber",
"type": "String"
},
{
"name": "orderdate",
"type": "String"
},
{
"name": "order_pd",
"type": "String"
},
{
"name": "order_price",
"type": "Int64"
},
{
"name": "city",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonNodeReference": "$.orderlines",
"jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd":
"prod", "order_price": "price", "city": " $.city"}
}
}
}

Note the following points:

If the structure and jsonPathDefinition are not defined in the Data Factory dataset, the Copy Activity detects
the schema from the first object and flattens the whole object.
If the JSON input has an array, by default the Copy Activity converts the entire array value into a string. You can
choose to extract data from it by using jsonNodeReference and/or jsonPathDefinition, or skip it by not specifying
it in jsonPathDefinition.
If there are duplicate names at the same level, the Copy Activity picks the last one.
Property names are case-sensitive. Two properties with the same name but different casings are treated as two
separate properties.
Case 2: Writing data to a JSON file
If you have the following table in SQL Database:

ID ORDER_DATE ORDER_PRICE ORDER_BY

1 20170119 2000 David

2 20170120 3500 Patrick

3 20170121 4000 Jason

and for each record, you expect to write a JSON object in the following format:

{
"id": "1",
"order": {
"date": "20170119",
"price": 2000,
"customer": "David"
}
}

The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts).
More specifically, the structure section defines the customized property names in the destination file, and
nestingSeparator (default is ".") is used to identify the nesting level from the name. This section is optional unless
you want to change the property names compared with the source column names, or nest some of the properties.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "order.date",
"type": "String"
},
{
"name": "order.price",
"type": "Int64"
},
{
"name": "order.customer",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat"
}
}
}

Specifying AvroFormat
If you want to parse the Avro files or write the data in Avro format, set the format type property to AvroFormat.
You do not need to specify any properties in the Format section within the typeProperties section. Example:

"format":
{
"type": "AvroFormat",
}

To use Avro format in a Hive table, you can refer to the Apache Hive tutorial.
Note the following points:
Complex data types are not supported (records, enums, arrays, maps, unions and fixed).
Specifying OrcFormat
If you want to parse the ORC files or write the data in ORC format, set the format type property to OrcFormat.
You do not need to specify any properties in the Format section within the typeProperties section. Example:

"format":
{
"type": "OrcFormat"
}

IMPORTANT
If you are not copying ORC files as-is between on-premises and cloud data stores, you need to install the JRE 8 (Java
Runtime Environment) on your gateway machine. A 64-bit gateway requires 64-bit JRE and 32-bit gateway requires 32-bit
JRE. You can find both versions from here. Choose the appropriate one.

Note the following points:


Complex data types are not supported (STRUCT, MAP, LIST, UNION).
ORC files have three compression-related options: NONE, ZLIB, and SNAPPY. Data Factory supports reading data from
an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read the data.
However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC. Currently, there
is no option to override this behavior.
Specifying ParquetFormat
If you want to parse the Parquet files or write the data in Parquet format, set the format type property to
ParquetFormat. You do not need to specify any properties in the Format section within the typeProperties section.
Example:

"format":
{
"type": "ParquetFormat"
}

IMPORTANT
If you are not copying Parquet files as-is between on-premises and cloud data stores, you need to install the JRE 8 (Java
Runtime Environment) on your gateway machine. A 64-bit gateway requires 64-bit JRE and 32-bit gateway requires 32-bit
JRE. You can find both versions from here. Choose the appropriate one.

Note the following points:


Complex data types are not supported (MAP, LIST).
Parquet files have the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory
supports reading data from a Parquet file in any of these compressed formats. It uses the compression codec in the
metadata to read the data. However, when writing to a Parquet file, Data Factory chooses SNAPPY, which is the
default for the Parquet format. Currently, there is no option to override this behavior.
Where is the copy operation performed?
See Globally available data movement section for details. In short, when an on-premises data store is involved, the
copy operation is performed by the Data Management Gateway in your on-premises environment. And, when the
data movement is between two cloud stores, the copy operation is performed in the region closest to the sink
location in the same geography.

HDInsight Activity - FAQ


What regions are supported by HDInsight?
See the Geographic Availability section in HDInsight Pricing Details.
What region is used by an on-demand HDInsight cluster?
The on-demand HDInsight cluster is created in the same region where the storage you specified to be used with the
cluster exists.
How to associate additional storage accounts to your HDInsight cluster?
If you are using your own HDInsight Cluster (BYOC - Bring Your Own Cluster), see the following topics:
Using an HDInsight Cluster with Alternate Storage Accounts and Metastores
Use Additional Storage Accounts with HDInsight Hive
If you are using an on-demand cluster that is created by the Data Factory service, specify additional storage
accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. In the
JSON definition for the on-demand linked service, use additionalLinkedServiceNames property to specify
alternate storage accounts as shown in the following JSON snippet:

{
"name": "MyHDInsightOnDemandLinkedService",
"properties":
{
"type": "HDInsightOnDemandLinkedService",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:05:00",
"osType": "Linux",
"linkedServiceName": "LinkedService-SampleData",
"additionalLinkedServiceNames": [ "otherLinkedServiceName1", "otherLinkedServiceName2" ]
}
}
}

In the example above, otherLinkedServiceName1 and otherLinkedServiceName2 represent linked services whose
definitions contain credentials that the HDInsight cluster needs to access alternate storage accounts.

Slices - FAQ
Why are my input slices not in Ready state?
A common mistake is not setting external property to true on the input dataset when the input data is external to
the data factory (not produced by the data factory).
In the following example, you only need to set external to true on dataset1.
DataFactory1
Pipeline 1: dataset1 -> activity1 -> dataset2 -> activity2 -> dataset3
Pipeline 2: dataset3 -> activity3 -> dataset4
If you have another data factory with a pipeline that takes dataset4 (produced by pipeline 2 in data factory 1), mark
dataset4 as an external dataset because the dataset is produced by a different data factory (DataFactory1, not
DataFactory2).
DataFactory2
Pipeline 1: dataset4->activity4->dataset5
If the external property is properly set, verify whether the input data exists in the location specified in the input
dataset definition.
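As a minimal sketch of where that flag goes (the dataset name, store type, and blob path here are placeholder
assumptions), external sits at the top level of the dataset properties:

{
    "name": "dataset1",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder"
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}

An external dataset can also carry a policy section that controls how the service retries validation of the external
data; see the Datasets documentation for those options.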
How to run a slice at a time other than midnight when the slice is produced daily?
Use the offset property to specify the time at which you want the slice to be produced. See the Dataset availability
section for details about this property. Here is a quick example:

"availability":
{
"frequency": "Day",
"interval": 1,
"offset": "06:00:00"
}

Daily slices start at 6 AM instead of the default midnight.


How can I rerun a slice?
You can rerun a slice in one of the following ways:
Use Monitor and Manage App to rerun an activity window or slice. See Rerun selected activity windows for
instructions.
Click Run in the command bar on the DATA SLICE blade for the slice in the Azure portal.
Run Set-AzureRmDataFactorySliceStatus cmdlet with Status set to Waiting for the slice.

Set-AzureRmDataFactorySliceStatus -Status Waiting -ResourceGroupName $ResourceGroup -DataFactoryName $df -TableName $table -StartDateTime "02/26/2015 19:00:00" -EndDateTime "02/26/2015 20:00:00"

See Set-AzureRmDataFactorySliceStatus for details about the cmdlet.


How long did it take to process a slice?
Use Activity Window Explorer in Monitor & Manage App to know how long it took to process a data slice. See
Activity Window Explorer for details.
You can also do the following in the Azure portal:
1. Click Datasets tile on the DATA FACTORY blade for your data factory.
2. Click the specific dataset on the Datasets blade.
3. Select the slice that you are interested in from the Recent slices list on the TABLE blade.
4. Click the activity run from the Activity Runs list on the DATA SLICE blade.
5. Click Properties tile on the ACTIVITY RUN DETAILS blade.
6. You should see the DURATION field with a value. This value is the time taken to process the slice.
How to stop a running slice?
If you need to stop the pipeline from executing, you can use Suspend-AzureRmDataFactoryPipeline cmdlet.
Currently, suspending the pipeline does not stop the slice executions that are in progress. Once the in-progress
executions finish, no extra slice is picked up.
If you really want to stop all the executions immediately, the only way would be to delete the pipeline and create it
again. If you choose to delete the pipeline, you do NOT need to delete tables and linked services used by the
pipeline.
Move data by using Copy Activity
8/31/2017 9 min to read

Overview
In Azure Data Factory, you can use Copy Activity to copy data between on-premises and cloud data
stores. After the data is copied, it can be further transformed and analyzed. You can also use Copy
Activity to publish transformation and analysis results for business intelligence (BI) and application
consumption.

Copy Activity is powered by a secure, reliable, scalable, and globally available service. This article
provides details on data movement in Data Factory and Copy Activity.
First, let's see how data migration occurs between two cloud data stores, and between an on-
premises data store and a cloud data store.

NOTE
To learn about activities in general, see Understanding pipelines and activities.

Copy data between two cloud data stores


When both source and sink data stores are in the cloud, Copy Activity goes through the following
stages to copy data from the source to the sink. The service that powers Copy Activity:
1. Reads data from the source data store.
2. Performs serialization/deserialization, compression/decompression, column mapping, and type
conversion. It does these operations based on the configurations of the input dataset, output
dataset, and Copy Activity.
3. Writes data to the destination data store.
The service automatically chooses the optimal region to perform the data movement. This region is
usually the one closest to the sink data store.

Copy data between an on-premises data store and a cloud data store
To securely move data between an on-premises data store and a cloud data store, install Data
Management Gateway on your on-premises machine. Data Management Gateway is an agent that
enables hybrid data movement and processing. You can install it on the same machine as the data
store itself, or on a separate machine that has access to the data store.
In this scenario, Data Management Gateway performs the serialization/deserialization,
compression/decompression, column mapping, and type conversion. Data does not flow through the
Azure Data Factory service. Instead, Data Management Gateway directly writes the data to the
destination store.

See Move data between on-premises and cloud data stores for an introduction and walkthrough. See
Data Management Gateway for detailed information about this agent.
You can also move data from/to supported data stores that are hosted on Azure IaaS virtual
machines (VMs) by using Data Management Gateway. In this case, you can install Data Management
Gateway on the same VM as the data store itself, or on a separate VM that has access to the data
store.

Supported data stores and formats


Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory
supports the following data stores. Data from any source can be written to any sink. Click a data store
to learn how to copy data to and from that store.

NOTE
If you need to move data to/from a data store that Copy Activity doesn't support, use a custom activity in
Data Factory with your own logic for copying/moving data. For details on creating and using a custom
activity, see Use custom activities in an Azure Data Factory pipeline.

Azure: Azure Blob storage; Azure Cosmos DB (DocumentDB API); Azure Data Lake Store; Azure SQL Database;
Azure SQL Data Warehouse; Azure Search Index; Azure Table storage

Databases: Amazon Redshift; DB2*; MySQL*; Oracle*; PostgreSQL*; SAP Business Warehouse*; SAP HANA*;
SQL Server*; Sybase*; Teradata*

NoSQL: Cassandra*; MongoDB*

File: Amazon S3; File System*; FTP; HDFS*; SFTP

Others: Generic HTTP; Generic OData; Generic ODBC*; Salesforce; Web Table (table from HTML); GE Historian*

NOTE
Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management
Gateway on an on-premises/Azure IaaS machine.

Supported file formats


You can use Copy Activity to copy files as-is between two file-based data stores; in that case, you can skip the
format section in both the input and output dataset definitions. The data is copied efficiently without
any serialization/deserialization.
Copy Activity can also read from and write to files in the following formats: Text, JSON, Avro, ORC, and
Parquet. The compression codecs GZip, Deflate, BZip2, and ZipDeflate are supported. See
Supported file and compression formats for details.
For example, you can do the following copy operations:
Copy data from an on-premises SQL Server database and write it to Azure Data Lake Store in ORC format.
Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro
format.
Copy zipped files from an on-premises file system, decompress them on the fly, and write the extracted files to
Azure Data Lake Store.
Copy data in GZip compressed text (CSV) format from Azure Blob storage and write it to Azure SQL
Database.
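For the last scenario, a minimal sketch of the input dataset typeProperties is shown below; the container, folder,
and delimiter values are assumptions for illustration, not values from this article:

"typeProperties": {
    "folderPath": "mycontainer/myfolder",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}

To copy files as-is instead, omit the format section on both the input and output datasets so that no serialization
or deserialization is performed.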

Globally available data movement


Azure Data Factory is available only in the West US, East US, and North Europe regions. However, the
service that powers Copy Activity is available globally in the following regions and geographies. The
globally available topology ensures efficient data movement that usually avoids cross-region hops.
See Services by region for availability of Data Factory and Data Movement in a region.
Copy data between cloud data stores
When both source and sink data stores are in the cloud, Data Factory uses a service deployment in
the region that is closest to the sink in the same geography to move the data. Refer to the following
table for mapping:

GEOGRAPHY OF THE DESTINATION DATA STORES | REGION OF THE DESTINATION DATA STORE | REGION USED FOR DATA MOVEMENT
United States | East US | East US
United States | East US 2 | East US 2
United States | Central US | Central US
United States | North Central US | North Central US
United States | South Central US | South Central US
United States | West Central US | West Central US
United States | West US | West US
United States | West US 2 | West US
Canada | Canada East | Canada Central
Canada | Canada Central | Canada Central
Brazil | Brazil South | Brazil South
Europe | North Europe | North Europe
Europe | West Europe | West Europe
United Kingdom | UK West | UK South
United Kingdom | UK South | UK South
Asia Pacific | Southeast Asia | Southeast Asia
Asia Pacific | East Asia | Southeast Asia
Australia | Australia East | Australia East
Australia | Australia Southeast | Australia Southeast
India | Central India | Central India
India | West India | Central India
India | South India | Central India
Japan | Japan East | Japan East
Japan | Japan West | Japan East
Korea | Korea Central | Korea Central
Korea | Korea South | Korea Central

Alternatively, you can explicitly indicate the region of the Data Factory service to be used to perform the
copy by specifying the executionLocation property under the Copy Activity typeProperties section. Supported
values for this property are listed in the Region used for data movement column above. Note that your
data goes through that region over the wire during the copy. For example, to copy between Azure stores
in Korea, you can specify "executionLocation": "Japan East" to route it through the Japan region (see the
sample JSON later in this article as a reference).

NOTE
If the region of the destination data store is not in the preceding list or is undetectable, Copy Activity fails by
default instead of going through an alternative region, unless executionLocation is specified. The supported
region list will be expanded over time.

Copy data between an on-premises data store and a cloud data store
When data is being copied between on-premises (or Azure virtual machines/IaaS) and cloud stores,
Data Management Gateway performs data movement on an on-premises machine or virtual
machine. The data does not flow through the service in the cloud, unless you use the staged copy
capability. In this case, data flows through the staging Azure Blob storage before it is written into the
sink data store.

Create a pipeline with Copy Activity


You can create a pipeline with Copy Activity in a couple of ways:
By using the Copy Wizard
The Data Factory Copy Wizard helps you to create a pipeline with Copy Activity. This pipeline allows
you to copy data from supported sources to destinations without writing JSON definitions for linked
services, datasets, and pipelines. See Data Factory Copy Wizard for details about the wizard.
By using JSON scripts
You can use Data Factory Editor in the Azure portal, Visual Studio, or Azure PowerShell to create a
JSON definition for a pipeline (by using Copy Activity). Then, you can deploy it to create the pipeline
in Data Factory. See Tutorial: Use Copy Activity in an Azure Data Factory pipeline for a tutorial with
step-by-step instructions.
JSON properties (such as name, description, input and output tables, and policies) are available for all
types of activities. Properties that are available in the typeProperties section of the activity vary with
each activity type.
For Copy Activity, the typeProperties section varies depending on the types of sources and sinks.
Click a source/sink in the Supported sources and sinks section to learn about type properties that
Copy Activity supports for that data store.
Here's a sample JSON definition:

{
"name": "ADFTutorialPipeline",
"properties": {
"description": "Copy data from Azure blob to Azure SQL table",
"activities": [
{
"name": "CopyFromBlobToSQL",
"type": "Copy",
"inputs": [
{
"name": "InputBlobTable"
}
],
"outputs": [
{
"name": "OutputSQLTable"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
},
"executionLocation": "Japan East"
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2016-07-12T00:00:00Z",
"end": "2016-07-13T00:00:00Z"
}
}
The schedule that is defined in the output dataset determines when the activity runs (for example:
daily, frequency as day, and interval as 1). The activity copies data from an input dataset (source) to
an output dataset (sink).
You can specify more than one input dataset for Copy Activity. The additional datasets are used to verify
dependencies before the activity runs; however, only the data from the first dataset is copied to the
destination dataset. For more information, see Scheduling and execution.
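A minimal sketch of such an activity fragment follows; the dataset names are placeholders, not names used
elsewhere in this article:

"inputs": [
    { "name": "InputDatasetToCopy" },
    { "name": "DependencyOnlyDataset" }
],
"outputs": [
    { "name": "OutputDataset" }
]

Here only InputDatasetToCopy is copied; DependencyOnlyDataset simply has to be ready before the activity run
starts.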

Performance and tuning


See the Copy Activity performance and tuning guide, which describes key factors that affect the
performance of data movement (Copy Activity) in Azure Data Factory. It also lists the observed
performance during internal testing and discusses various ways to optimize the performance of Copy
Activity.

Fault tolerance
By default, Copy Activity stops copying data and returns a failure when it encounters incompatible data
between source and sink. You can explicitly configure it to skip and log the incompatible rows and copy
only the compatible data so that the copy succeeds. See Copy Activity fault
tolerance for more details.

Security considerations
See the Security considerations, which describes security infrastructure that data movement services
in Azure Data Factory use to secure your data.

Scheduling and sequential copy


See Scheduling and execution for detailed information about how scheduling and execution works in
Data Factory. It is possible to run multiple copy operations one after another in a sequential/ordered
manner. See the Copy sequentially section.

Type conversions
Different data stores have different native type systems. Copy Activity performs automatic type
conversions from source types to sink types with the following two-step approach:
1. Convert from native source types to a .NET type.
2. Convert from a .NET type to a native sink type.
The mapping from a native type system to a .NET type for a data store is in the respective data store
article. (Click the specific link in the Supported data stores table). You can use these mappings to
determine appropriate types while creating your tables, so that Copy Activity performs the right
conversions.
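If you also want to be explicit about types on the Data Factory side, the dataset structure section (used in the
JSON samples earlier in this article) lets you declare column names and types; the following is a minimal sketch
with assumed column names, not columns from any sample in this article:

"structure": [
    { "name": "userid", "type": "Guid" },
    { "name": "name", "type": "String" },
    { "name": "lastlogindate", "type": "DateTime" }
]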

Next steps
To learn more about Copy Activity, see Copy data from Azure Blob storage to Azure SQL
Database.
To learn about moving data from an on-premises data store to a cloud data store, see Move data
from on-premises to cloud data stores.
Azure Data Factory Copy Wizard
8/15/2017 4 min to read

The Azure Data Factory Copy Wizard eases the process of ingesting data, which is usually a first step in an end-to-
end data integration scenario. When going through the Azure Data Factory Copy Wizard, you do not need to
understand any JSON definitions for linked services, data sets, and pipelines. The wizard automatically creates a
pipeline to copy data from the selected data source to the selected destination. In addition, the Copy Wizard helps
you to validate the data being ingested at the time of authoring. This saves time, especially when you are ingesting
data for the first time from the data source. To start the Copy Wizard, click the Copy data tile on the home page of
your data factory.

Designed for big data


This wizard allows you to easily move data from a wide variety of sources to destinations in minutes. After you go
through the wizard, a pipeline with a copy activity is automatically created for you, along with dependent Data
Factory entities (linked services and data sets). No additional steps are required to create the pipeline.
NOTE
For step-by-step instructions to create a sample pipeline to copy data from an Azure blob to an Azure SQL Database table,
see the Copy Wizard tutorial.

The wizard is designed with big data in mind from the start, with support for diverse data and object types. You can
author Data Factory pipelines that move hundreds of folders, files, or tables. The wizard supports automatic data
preview, schema capture and mapping, and data filtering.

Automatic data preview


You can preview part of the data from the selected data source in order to validate whether the data is what you
want to copy. In addition, if the source data is in a text file, the Copy Wizard parses the text file to learn the row and
column delimiters and schema automatically.
Schema capture and mapping
The schema of input data may not match the schema of output data in some cases. In this scenario, you need to
map columns from the source schema to columns from the destination schema.

TIP
When copying data from SQL Server or Azure SQL Database into Azure SQL Data Warehouse, if the table does not exist in
the destination store, Data Factory supports automatic table creation by using the source's schema. Learn more from Move data to and
from Azure SQL Data Warehouse using Azure Data Factory.

Use a drop-down list to select a column from the source schema to map to a column in the destination schema. The
Copy Wizard tries to understand your pattern for column mapping. It applies the same pattern to the rest of the
columns, so that you do not need to select each of the columns individually to complete the schema mapping. If
you prefer, you can override these mappings by using the drop-down lists to map the columns one by one. The
pattern becomes more accurate as you map more columns. The Copy Wizard constantly updates the pattern, and
ultimately reaches the right pattern for the column mapping you want to achieve.
Filtering data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces the
volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy
operation. It provides a flexible way to filter data in a relational database by using the SQL query language, or files
in an Azure blob folder by using Data Factory functions and variables.
Filtering of data in a database
The following screenshot shows a SQL query using the Text.Format function and WindowStart variable.

Filtering of data in an Azure blob folder


You can use variables in the folder path to copy data from a folder that is determined at runtime based on system
variables. The supported variables are: {year}, {month}, {day}, {hour}, {minute}, and {custom}. For example:
inputfolder/{year}/{month}/{day}.
Suppose that you have input folders in the following format:

2016/03/01/01
2016/03/01/02
2016/03/01/03
...

Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and
click Choose. You should see 2016/03/01/02 in the text box. Now, replace 2016 with {year}, 03 with {month}, 01
with {day}, and 02 with {hour}, and press the Tab key. You should see drop-down lists to select the format for
these four variables:

As shown in the following screenshot, you can also use a custom variable and any supported format strings. To
select a folder with that structure, use the Browse button first. Then replace a value with {custom}, and press the
Tab key to see the text box where you can type the format string.
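Under the covers, these folder-path variables correspond to a parameterized folderPath plus a partitionedBy list in
the dataset JSON; the following is a rough sketch under that assumption, with illustrative names and formats rather
than the wizard's exact output:

"typeProperties": {
    "folderPath": "inputfolder/{Year}/{Month}/{Day}/{Hour}",
    "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
        { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
    ]
}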
Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). Both of these options can be used
for the breadth of the connectors across environments, including on-premises, cloud, and local desktop copy.
A one-time copy operation enables data movement from a source to a destination only once. It applies to data of
any size and any supported format. The scheduled copy allows you to copy data on a prescribed recurrence. You
can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.

Next steps
For a quick walkthrough of using the Data Factory Copy Wizard to create a pipeline with Copy Activity, see Tutorial:
Create a pipeline using the Copy Wizard.
Load 1 TB into Azure SQL Data Warehouse under 15
minutes with Data Factory
8/22/2017 7 min to read

Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data,
both relational and non-relational. Built on massively parallel processing (MPP) architecture, SQL Data Warehouse
is optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to scale storage
and compute independently.
Getting started with Azure SQL Data Warehouse is now easier than ever using Azure Data Factory. Azure Data
Factory is a fully managed cloud-based data integration service, which can be used to populate a SQL Data
Warehouse with the data from your existing system, and saving you valuable time while evaluating SQL Data
Warehouse and building your analytics solutions. Here are the key benefits of loading data into Azure SQL Data
Warehouse using Azure Data Factory:
Easy to set up: 5-step intuitive wizard with no scripting required.
Rich data store support: built-in support for a rich set of on-premises and cloud-based data stores.
Secure and compliant: data is transferred over HTTPS or ExpressRoute, and the global service presence ensures
your data never leaves the geographical boundary.
Unparalleled performance by using PolyBase: Using PolyBase is the most efficient way to move data into
Azure SQL Data Warehouse. With the staging blob feature, you can achieve high load speeds from all types of
data stores besides Azure Blob storage, which PolyBase supports by default.
This article shows you how to use the Data Factory Copy Wizard to load 1 TB of data from Azure Blob Storage into Azure
SQL Data Warehouse in under 15 minutes, at over 1.2-GBps throughput.
This article provides step-by-step instructions for moving data into Azure SQL Data Warehouse by using the Copy
Wizard.

NOTE
For general information about capabilities of Data Factory in moving data to/from Azure SQL Data Warehouse, see Move
data to and from Azure SQL Data Warehouse using Azure Data Factory article.
You can also build pipelines using Azure portal, Visual Studio, PowerShell, etc. See Tutorial: Copy data from Azure Blob to
Azure SQL Database for a quick walkthrough with step-by-step instructions for using the Copy Activity in Azure Data
Factory.

Prerequisites
Azure Blob Storage: this experiment uses Azure Blob Storage (GRS) for storing the TPC-H testing dataset. If you do
not have an Azure storage account, learn how to create a storage account.
TPC-H data: we are going to use TPC-H as the testing dataset. To do that, you need to use dbgen from the TPC-H
toolkit, which helps you generate the dataset. You can either download the source code for dbgen from TPC
Tools and compile it yourself, or download the compiled binary from GitHub. Run dbgen.exe with the
following commands to generate a 1-TB flat file for the lineitem table, spread across 10 files:
Dbgen -s 1000 -S **1** -C 10 -T L -v
Dbgen -s 1000 -S **2** -C 10 -T L -v

Dbgen -s 1000 -S **10** -C 10 -T L -v

Now copy the generated files to Azure Blob. Refer to Move data to and from an on-premises file
system by using Azure Data Factory for how to do that using ADF Copy.
Azure SQL Data Warehouse: this experiment loads data into an Azure SQL Data Warehouse created with 6,000
DWUs.
Refer to Create an Azure SQL Data Warehouse for detailed instructions on how to create a SQL Data
Warehouse database. To get the best possible load performance into SQL Data Warehouse using PolyBase,
we choose the maximum number of Data Warehouse Units (DWUs) allowed in the Performance setting, which
is 6,000 DWUs.

NOTE
When loading from Azure Blob, the data loading performance is directly proportional to the number of DWUs you
configure on the SQL Data Warehouse:
Loading 1 TB into 1,000 DWU SQL Data Warehouse takes 87 minutes (~200 MBps throughput).
Loading 1 TB into 2,000 DWU SQL Data Warehouse takes 46 minutes (~380 MBps throughput).
Loading 1 TB into 6,000 DWU SQL Data Warehouse takes 14 minutes (~1.2 GBps throughput).

To create a SQL Data Warehouse with 6,000 DWUs, move the Performance slider all the way to the right:

For an existing database that is not configured with 6,000 DWUs, you can scale it up by using the Azure portal.
Navigate to the database in the Azure portal; there is a Scale button in the Overview panel, shown in the
following image:
Click the Scale button to open the following panel, move the slider to the maximum value, and click the Save
button.

This experiment loads data into Azure SQL Data Warehouse by using the xlargerc resource class.
To achieve the best possible throughput, the copy needs to be performed by a SQL Data Warehouse user
belonging to the xlargerc resource class. Learn how to do that by following the Change a user resource class
example.
Create the destination table schema in the Azure SQL Data Warehouse database by running the following DDL
statement:

CREATE TABLE [dbo].[lineitem]


(
[L_ORDERKEY] [bigint] NOT NULL,
[L_PARTKEY] [bigint] NOT NULL,
[L_SUPPKEY] [bigint] NOT NULL,
[L_LINENUMBER] [int] NOT NULL,
[L_QUANTITY] [decimal](15, 2) NULL,
[L_EXTENDEDPRICE] [decimal](15, 2) NULL,
[L_DISCOUNT] [decimal](15, 2) NULL,
[L_TAX] [decimal](15, 2) NULL,
[L_RETURNFLAG] [char](1) NULL,
[L_LINESTATUS] [char](1) NULL,
[L_SHIPDATE] [date] NULL,
[L_COMMITDATE] [date] NULL,
[L_RECEIPTDATE] [date] NULL,
[L_SHIPINSTRUCT] [char](25) NULL,
[L_SHIPMODE] [char](10) NULL,
[L_COMMENT] [varchar](44) NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)

With the prerequisite steps completed, we are now ready to configure the copy activity using the Copy
Wizard.

Launch Copy Wizard


1. Log in to the Azure portal.
2. Click + NEW from the top-left corner, click Intelligence + analytics, and click Data Factory.
3. In the New data factory blade:
a. Enter LoadIntoSQLDWDataFactory for the name. The name of the Azure data factory must be globally
unique. If you receive the error: Data factory name LoadIntoSQLDWDataFactory is not available,
change the name of the data factory (for example, yournameLoadIntoSQLDWDataFactory) and try
creating again. See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
b. Select your Azure subscription.
c. For Resource Group, do one of the following steps:
a. Select Use existing to select an existing resource group.
b. Select Create new to enter a name for a resource group.
d. Select a location for the data factory.
e. Select Pin to dashboard check box at the bottom of the blade.
f. Click Create.
4. After the creation is complete, you see the Data Factory blade as shown in the following image:

5. On the Data Factory home page, click the Copy data tile to launch Copy Wizard.

NOTE
If the web browser is stuck at "Authorizing...", disable or uncheck the Block third-party cookies and site
data setting, or keep it enabled and create an exception for login.microsoftonline.com, and then try launching the
wizard again.

Step 1: Configure data loading schedule


The first step is to configure the data loading schedule.
In the Properties page:
1. Enter CopyFromBlobToAzureSqlDataWarehouse for Task name
2. Select Run once now option.
3. Click Next.
Step 2: Configure source
This section shows you the steps to configure the source: Azure Blob containing the 1-TB TPC-H line item files.
1. Select the Azure Blob Storage as the data store and click Next.

2. Fill in the connection information for the Azure Blob storage account, and click Next.
3. Choose the folder containing the TPC-H line item files and click Next.

4. Upon clicking Next, the file format settings are detected automatically. Check to make sure that column
delimiter is | instead of the default comma ,. Click Next after you have previewed the data.
Step 3: Configure destination
This section shows you how to configure the destination: lineitem table in the Azure SQL Data Warehouse
database.
1. Choose Azure SQL Data Warehouse as the destination store and click Next.

2. Fill in the connection information for Azure SQL Data Warehouse. Make sure you specify the user that is a
member of the role xlargerc (see the prerequisites section for detailed instructions), and click Next.
3. Choose the destination table and click Next.
4. In Schema mapping page, leave "Apply column mapping" option unchecked and click Next.

Step 4: Performance settings


Allow polybase is checked by default. Click Next.

Step 5: Deploy and monitor load results


1. Click Finish button to deploy.
2. After the deployment is complete, click Click here to monitor copy pipeline to monitor the copy run
progress. Select the copy pipeline you created in the Activity Windows list.

You can view the copy run details in the Activity Window Explorer in the right panel, including the data
volume read from source and written into destination, duration, and the average throughput for the run.
As you can see from the following screen shot, copying 1 TB from Azure Blob Storage into SQL Data
Warehouse took 14 minutes, effectively achieving 1.22 GBps throughput!
Best practices
Here are a few best practices for running your Azure SQL Data Warehouse database:
Use a larger resource class when loading into a CLUSTERED COLUMNSTORE INDEX.
For more efficient joins, consider using hash distribution by a select column instead of default round robin
distribution.
For faster load speeds, consider using heap for transient data.
Create statistics after you finish loading Azure SQL Data Warehouse.
See Best practices for Azure SQL Data Warehouse for details.

Next steps
Data Factory Copy Wizard - This article provides details about the Copy Wizard.
Copy Activity performance and tuning guide - This article contains the reference performance measurements
and tuning guide.
Copy Activity performance and tuning guide
8/21/2017 28 min to read

Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading
solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-
premises data stores. Blazing-fast data loading performance is key to ensure you can focus on the core big
data problem: building advanced analytics solutions and getting deep insights from all that data.
Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a
highly optimized data loading experience that is easy to configure and set up. With just a single copy activity,
you can achieve:
Loading data into Azure SQL Data Warehouse at 1.2 GBps. For a walkthrough with a use case, see Load 1
TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
Loading data into Azure Blob storage at 1.0 GBps
Loading data into Azure Data Lake Store at 1.0 GBps
This article describes:
Performance reference numbers for supported source and sink data stores to help you plan your project;
Features that can boost the copy throughput in different scenarios, including cloud data movement units,
parallel copy, and staged Copy;
Performance tuning guidance on how to tune the performance and the key factors that can impact copy
performance.

NOTE
If you are not familiar with Copy Activity in general, see Move data by using Copy Activity before reading this article.

Performance reference
As a reference, the following table shows the copy throughput numbers in MBps for given source and sink pairs,
based on in-house testing. For comparison, it also demonstrates how different settings of cloud data
movement units or Data Management Gateway scalability (multiple gateway nodes) can improve copy
performance.
Points to note:
Throughput is calculated by using the following formula: [size of data read from source]/[Copy Activity run
duration].
The performance reference numbers in the table were measured using TPC-H data set in a single copy
activity run.
In Azure data stores, the source and sink are in the same Azure region.
For hybrid copy between on-premises and cloud data stores, each gateway node was running on a machine
that was separate from the on-premises data store, with the specification below. When a single activity was
running on the gateway, the copy operation consumed only a small portion of the test machine's CPU, memory,
or network bandwidth. Learn more from Considerations for Data Management Gateway.

CPU 32 cores 2.20 GHz Intel Xeon E5-2660 v2

Memory 128 GB

Network Internet interface: 10 Gbps; intranet interface: 40 Gbps

TIP
You can achieve higher throughput by leveraging more data movement units (DMUs) than the default maximum DMUs,
which is 32 for a cloud-to-cloud copy activity run. For example, with 100 DMUs, you can copy data from
Azure Blob into Azure Data Lake Store at 1.0 GBps. See the Cloud data movement units section for details about this
feature and the supported scenario. Contact Azure support to request more DMUs.

Parallel copy
You can read data from the source or write data to the destination in parallel within a Copy Activity run.
This feature enhances the throughput of a copy operation and reduces the time it takes to move data.
This setting is different from the concurrency property in the activity definition. The concurrency property
determines the number of concurrent Copy Activity runs to process data from different activity windows (1
AM to 2 AM, 2 AM to 3 AM, 3 AM to 4 AM, and so on). This capability is helpful when you perform a historical
load. The parallel copy capability applies to a single activity run.
Let's look at a sample scenario. In the following example, multiple slices from the past need to be processed.
Data Factory runs an instance of Copy Activity (an activity run) for each slice:
The data slice from the first activity window (1 AM to 2 AM) ==> Activity run 1
The data slice from the second activity window (2 AM to 3 AM) ==> Activity run 2
The data slice from the third activity window (3 AM to 4 AM) ==> Activity run 3
And so on.
In this example, when the concurrency value is set to 2, Activity run 1 and Activity run 2 copy data from two
activity windows concurrently to improve data movement performance. However, if multiple files are
associated with Activity run 1, the data movement service copies files from the source to the destination one
file at a time.
Cloud data movement units
A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU,
memory, and network resource allocation) of a single unit in Data Factory. A DMU might be used in a cloud-to-
cloud copy operation, but not in a hybrid copy.
By default, Data Factory uses a single cloud DMU to perform a single Copy Activity run. To override this default,
specify a value for the cloudDataMovementUnits property as follows. For information about the level of
performance gain you might get when you configure more units for a specific copy source and sink, see the
performance reference.

"activities":[
{
"name": "Sample copy activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "InputDataset" }],
"outputs": [{ "name": "OutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"cloudDataMovementUnits": 32
}
}
]

The allowed values for the cloudDataMovementUnits property are 1 (default), 2, 4, 8, 16, 32. The actual
number of cloud DMUs that the copy operation uses at run time is equal to or less than the configured value,
depending on your data pattern.

NOTE
If you need more cloud DMUs for a higher throughput, contact Azure support. A setting of 8 or above currently works
only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to
Blob storage/Data Lake Store/Azure SQL Database.
parallelCopies
You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You
can think of this property as the maximum number of threads within Copy Activity that can read from your
source or write to your sink data stores in parallel.
For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the
source data store and to the destination data store. The default number of parallel copies that it uses depends
on the type of source and sink that you are using.

SOURCE AND SINK | DEFAULT PARALLEL COPY COUNT DETERMINED BY SERVICE
Copy data between file-based stores (Blob storage; Data Lake Store; Amazon S3; an on-premises file system; an
on-premises HDFS) | Between 1 and 32. Depends on the size of the files and the number of cloud data movement
units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the gateway
machine used for a hybrid copy (to copy data to or from an on-premises data store).
Copy data from any source data store to Azure Table storage | 4
All other source and sink pairs | 1

Usually, the default behavior should give you the best throughput. However, to control the load on machines
that host your data stores, or to tune copy performance, you may choose to override the default value and
specify a value for the parallelCopies property. The value must be between 1 and 32 (both inclusive). At run
time, for the best performance, Copy Activity uses a value that is less than or equal to the value that you set.

"activities":[
{
"name": "Sample copy activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "InputDataset" }],
"outputs": [{ "name": "OutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"parallelCopies": 8
}
}
]

Points to note:
When you copy data between file-based stores, the parallelCopies setting determines the parallelism at the file
level. The chunking within a single file happens underneath automatically and transparently, and it's
designed to use the most suitable chunk size for a given source data store type to load data in parallel and
orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the
copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile,
Copy Activity cannot take advantage of file-level parallelism.
When you specify a value for the parallelCopies property, consider the load increase on your source and
sink data stores, and on the gateway if it is a hybrid copy. This happens especially when you have multiple
activities or concurrent runs of the same activities that run against the same data store. If you notice that
either the data store or the gateway is overwhelmed with the load, decrease the parallelCopies value to relieve
the load.
When you copy data from stores that are not file-based to stores that are file-based, the data movement
service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.

NOTE
You must use Data Management Gateway version 1.11 or later to use the parallelCopies feature when you do a hybrid
copy.

To better use these two properties, and to enhance your data movement throughput, see the sample use cases.
You don't need to configure parallelCopies to take advantage of the default behavior. If you do configure it and
parallelCopies is too small, multiple cloud DMUs might not be fully utilized.
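If you choose to set both, they sit side by side in the Copy Activity typeProperties; the following is a minimal
sketch, with illustrative source/sink types and values only:

"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "AzureDataLakeStoreSink" },
    "cloudDataMovementUnits": 32,
    "parallelCopies": 8
}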
Billing impact
It's important to remember that you are charged based on the total time of the copy operation. If a copy job
used to take one hour with one cloud unit and now it takes 15 minutes with four cloud units, the overall bill
remains almost the same. For example, you use four cloud units. The first cloud unit spends 10 minutes, the
second one, 10 minutes, the third one, 5 minutes, and the fourth one, 5 minutes, all in one Copy Activity run.
You are charged for the total copy (data movement) time, which is 10 + 10 + 5 + 5 = 30 minutes. Using
parallelCopies does not affect billing.

Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an
interim staging store. Staging is especially useful in the following cases:
1. You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data
Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data
Warehouse. However, the source data must be in Blob storage, and it must meet additional criteria. When
you load data from a data store other than Blob storage, you can activate data copying via interim staging
Blob storage. In that case, Data Factory performs the required data transformations to ensure that it meets
the requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse. For more
details, see Use PolyBase to load data into Azure SQL Data Warehouse. For a walkthrough with a use case,
see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
2. Sometimes it takes a while to perform a hybrid data movement (that is, to copy between an on-
premises data store and a cloud data store) over a slow network connection. To improve
performance, you can compress the data on-premises so that it takes less time to move data to the staging
data store in the cloud. Then you can decompress the data in the staging store before you load it into the
destination data store.
3. You don't want to open ports other than port 80 and port 443 in your firewall, because of
corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL
Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication
on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, take advantage of
the gateway to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then,
load the data into SQL Database or SQL Data Warehouse from Blob storage staging. In this flow, you don't
need to enable port 1433.
How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging data
store (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory
automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the
staging storage after the data movement is complete.
In the cloud copy scenario (both source and sink data stores are in the cloud), gateway is not used. The Data
Factory service performs the copy operations.

In the hybrid copy scenario (source is on-premises and sink is in the cloud), the gateway moves data from the
source data store to a staging data store. Data Factory service moves data from the staging data store to the
sink data store. Copying data from a cloud data store to an on-premises data store via staging also is
supported with the reversed flow.

When you activate data movement by using a staging store, you can specify whether you want the data to be
compressed before moving data from the source data store to an interim or staging data store, and then
decompressed before moving data from an interim or staging data store to the sink data store.
Currently, you can't copy data between two on-premises data stores by using a staging store. We expect this
option to be available soon.
Configuration
Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in
Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the
additional properties listed in the next table. If you don't have one, you also need to create an Azure Storage or
Storage shared access signature linked service for staging.

enableStaging: Specify whether you want to copy data via an interim staging store. Default value: False.
Required: No.

linkedServiceName: Specify the name of an AzureStorage or AzureStorageSas linked service, which refers to the
instance of Storage that you use as an interim staging store. You cannot use Storage with a shared access
signature to load data into SQL Data Warehouse via PolyBase; you can use it in all other scenarios.
Default value: N/A. Required: Yes, when enableStaging is set to TRUE.

path: Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the
service creates a container to store temporary data. Specify a path only if you use Storage with a shared access
signature, or if you require temporary data to be in a specific location. Default value: N/A. Required: No.

enableCompression: Specifies whether data should be compressed before it is copied to the destination. This
setting reduces the volume of data being transferred. Default value: False. Required: No.
Here's a sample definition of Copy Activity with the properties that are described in the preceding table:

"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [{ "name": "OnpremisesSQLServerInput" }],
"outputs": [{ "name": "AzureSQLDBOutput" }],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "MyStagingBlob",
"path": "stagingcontainer/path",
"enableCompression": true
}
}
}
]

Billing impact
You are charged based on two steps: copy duration and copy type.
When you use staging during a cloud copy (copying data from a cloud data store to another cloud data
store), you are charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data
store), you are charged for [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud
copy unit price].
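For example, assume a hybrid staged copy in which step 1 (on-premises source to staging Blob storage, via the
gateway) runs for 2 hours and step 2 (staging Blob storage to the cloud sink) runs for 1 hour. With hypothetical
unit prices of $0.25 per hour for hybrid copy and $0.10 per hour for cloud copy (see the Data Factory pricing
page for actual rates), the charge for that run is (2 x $0.25) + (1 x $0.10) = $0.60.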

Performance tuning steps


We suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity:
1. Establish a baseline. During the development phase, test your pipeline by using Copy Activity against a
representative data sample. You can use the Data Factory slicing model to limit the amount of data you
work with.
Collect execution time and performance characteristics by using the Monitoring and Management
App. Choose Monitor & Manage on your Data Factory home page. In the tree view, choose the
output dataset. In the Activity Windows list, choose the Copy Activity run. Activity Windows lists
the Copy Activity duration and the size of the data that's copied. The throughput is listed in Activity
Window Explorer. To learn more about the app, see Monitor and manage Azure Data Factory pipelines
by using the Monitoring and Management App.

Later in the article, you can compare the performance and configuration of your scenario to Copy
Activity's performance reference from our tests.
2. Diagnose and optimize performance. If the performance you observe doesn't meet your
expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or
reduce the effect of bottlenecks. A full description of performance diagnosis is beyond the scope of this
article, but here are some common considerations:
Performance features:
Parallel copy
Cloud data movement units
Staged copy
Data Management Gateway scalability
Data Management Gateway
Source
Sink
Serialization and deserialization
Compression
Column mapping
Other considerations
3. Expand the configuration to your entire data set. When you're satisfied with the execution results and
performance, you can expand the definition and pipeline active period to cover your entire data set.

Considerations for Data Management Gateway


Gateway setup: We recommend that you use a dedicated machine to host Data Management Gateway. See
Considerations for using Data Management Gateway.
Gateway monitoring and scale-up/out: A single logical gateway with one or more gateway nodes can serve
multiple Copy Activity runs concurrently. In the Azure portal, you can view a near-real-time snapshot of resource
utilization (CPU, memory, network in/out, and so on) on a gateway machine, as well as the number of concurrent
jobs running versus the limit; see Monitor gateway in the portal. If you have a heavy need for hybrid data
movement, either with a large number of concurrent Copy Activity runs or with a large volume of data to copy,
consider scaling the gateway up or out to better utilize your resources, or provision more resources to
empower the copy.

Considerations for the source


General
Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.
For Microsoft data stores, see the monitoring and tuning topics that are specific to each data store. These topics
can help you understand data store performance characteristics, minimize response times, and maximize throughput.
If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance.
See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case,
see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
File -based data stores
(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)
Average file size and file count: Copy Activity transfers data one file at a time. With the same amount of
data to be moved, the overall throughput is lower if the data consists of many small files rather than a few
large files due to the bootstrap phase for each file. Therefore, if possible, combine small files into larger files
to gain higher throughput.
File format and compression: For more ways to improve performance, see the Considerations for
serialization and deserialization and Considerations for compression sections.
For the on-premises file system scenario, in which Data Management Gateway is required, see the
Considerations for Data Management Gateway section.
Relational data stores
(Includes SQL Database; SQL Data Warehouse; Amazon Redshift; SQL Server databases; and Oracle, MySQL,
DB2, Teradata, Sybase, and PostgreSQL databases, etc.)
Data pattern: Your table schema affects copy throughput. To copy the same amount of data, a large row size
gives you better performance than a small row size, because the same volume of data can be retrieved in
fewer rows and fewer batches, which the database handles more efficiently.
Query or stored procedure: Optimize the logic of the query or stored procedure you specify in the Copy
Activity source to fetch data more efficiently.
For on-premises relational databases, such as SQL Server and Oracle, which require the use of Data
Management Gateway, see the Considerations for Data Management Gateway section.

Considerations for the sink


General
Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.
For Microsoft data stores, refer to monitoring and tuning topics that are specific to data stores. These topics can
help you understand data store performance characteristics and how to minimize response times and
maximize throughput.
If you are copying data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost
performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough
with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
File -based data stores
(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)
Copy behavior: If you copy data from another file-based data store, Copy Activity has three options via
the copyBehavior property: preserve hierarchy, flatten hierarchy, or merge files. Either preserving or
flattening the hierarchy has little or no performance overhead, but merging files increases the performance
overhead. A configuration sketch follows this list.
File format and compression: See the Considerations for serialization and deserialization and
Considerations for compression sections for more ways to improve performance.
Blob storage: Currently, Blob storage supports only block blobs for optimized data transfer and
throughput.
For on-premises file systems scenarios that require the use of Data Management Gateway, see the
Considerations for Data Management Gateway section.
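Here is a minimal sketch of the typeProperties section of a Copy Activity that merges source files into the sink.
It assumes a file-system source and a Blob storage sink like the ones used elsewhere in this article; the allowed
copyBehavior values are PreserveHierarchy, FlattenHierarchy, and MergeFiles.

"typeProperties": {
    "source": {
        "type": "FileSystemSource",
        "recursive": true
    },
    "sink": {
        "type": "BlobSink",
        "copyBehavior": "MergeFiles"
    }
}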
Relational data stores
(Includes SQL Database, SQL Data Warehouse, SQL Server databases, and Oracle databases)
Copy behavior: Depending on the properties you've set for sqlSink, Copy Activity writes data to the
destination database in different ways.
By default, the data movement service uses the Bulk Copy API to insert data in append mode, which
provides the best performance.
If you configure a stored procedure in the sink, the database applies the data one row at a time
instead of as a bulk load. Performance drops significantly. If your data set is large, when applicable,
consider switching to using the sqlWriterCleanupScript property.
If you configure the sqlWriterCleanupScript property for each Copy Activity run, the service
triggers the script, and then you use the Bulk Copy API to insert the data. For example, to overwrite
the entire table with the latest data, you can specify a script to first delete all records before bulk-
loading the new data from the source.
Data pattern and batch size:
Your table schema affects copy throughput. To copy the same amount of data, a large row size gives
you better performance than a small row size because the database can more efficiently commit
fewer batches of data.
Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using
the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property
to a higher value to benefit from lower batch overhead and higher throughput. If the row size of
your data is large, be careful when you increase writeBatchSize; a high value might lead to a copy
failure caused by overloading the database. A configuration sketch follows this list.
For on-premises relational databases like SQL Server and Oracle, which require the use of Data
Management Gateway, see the Considerations for Data Management Gateway section.
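As a sketch (the table name and values are placeholders, not recommendations), writeBatchSize and
sqlWriterCleanupScript are set on the sink inside the Copy Activity's typeProperties:

"sink": {
    "type": "SqlSink",
    "writeBatchSize": 10000,
    "writeBatchTimeout": "00:05:00",
    "sqlWriterCleanupScript": "DELETE FROM MyTargetTable"
}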
NoSQL stores
(Includes Table storage and Azure Cosmos DB)
For Table storage:
Partition: Writing data to interleaved partitions dramatically degrades performance. Sort your
source data by partition key so that the data is inserted efficiently into one partition after another, or
adjust the logic to write the data to a single partition.
For Azure Cosmos DB:
Batch size: The writeBatchSize property sets the number of parallel requests to the Azure Cosmos
DB service to create documents. You can expect better performance when you increase
writeBatchSize because more parallel requests are sent to Azure Cosmos DB. However, watch for
throttling when you write to Azure Cosmos DB (the error message is "Request rate is large"). Various
factors can cause throttling, including document size, the number of terms in the documents, and the
target collection's indexing policy. To achieve higher copy throughput, consider using a collection with a
higher performance level (for example, S3).
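A sketch of raising the batch size on an Azure Cosmos DB sink follows, assuming the DocumentDbCollectionSink
type used by the DocumentDB connector; the value shown is illustrative only, so tune it while watching for
throttling:

"sink": {
    "type": "DocumentDbCollectionSink",
    "writeBatchSize": 100
}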

Considerations for serialization and deserialization


Serialization and deserialization can occur when your input data set or output data set is a file. See Supported
file and compression formats for details about the file formats that Copy Activity supports.
Copy behavior:
Copying files between file-based data stores:
When input and output data sets both have the same file format settings, or have no file format settings,
the data movement service executes a binary copy without any serialization or deserialization. You see
higher throughput than in scenarios where the source and sink file format settings differ from each other.
When input and output data sets are both in text format and only the encoding type is different, the
data movement service does only an encoding conversion. It doesn't do any serialization and
deserialization; the encoding conversion causes only some performance overhead compared to a binary copy.
When the input and output data sets have different file formats or different configurations, like
delimiters, the data movement service deserializes the source data into a stream, transforms it, and then
serializes it into the output format you indicated. This operation results in a much more significant
performance overhead than the other scenarios.
When you copy files to/from a data store that is not file-based (for example, from a file-based store to a
relational store), the serialization or deserialization step is required. This step results in significant
performance overhead.
File format: The file format you choose might affect copy performance. For example, Avro is a compact binary
format that stores metadata with data. It has broad support in the Hadoop ecosystem for processing and
querying. However, Avro is more expensive for serialization and deserialization, which results in lower copy
throughput compared to text format. Make your choice of file format holistically across the processing flow:
consider the form in which the data is stored in source data stores or is extracted from external systems, the
best format for storage, analytical processing, and querying, and the format in which the data should be
exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for
read and write performance might be a good choice when you consider the overall analytical process.

Considerations for compression


When your input or output data set is a file, you can set Copy Activity to perform compression or
decompression as it writes data to the destination. When you choose compression, you make a tradeoff
between input/output (I/O) and CPU. Compressing the data costs extra in compute resources. But in return, it
reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput.
Codec: Copy Activity supports gzip, bzip2, and Deflate compression types. Azure HDInsight can consume all
three types for processing. Each compression codec has advantages. For example, bzip2 has the lowest copy
throughput, but you get the best Hive query performance with bzip2 because you can split it for processing.
Gzip is the most balanced option, and it is used the most often. Choose the codec that best suits your end-to-
end scenario.
Level: You can choose from two options for each compression codec: fastest compressed and optimally
compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting
file is not optimally compressed. The optimally compressed option spends more time on compression and
yields a minimal amount of data. You can test both options to see which provides better overall performance in
your case.
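For illustration, compression is requested in a dataset's typeProperties alongside the format; the following
sketch (the folder path and format settings are placeholders) asks for gzip at the optimally compressed level:

"typeProperties": {
    "folderPath": "adfcontainer/compresseddata/",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}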
A consideration: To copy a large amount of data between an on-premises store and the cloud, consider using
interim blob storage with compression. Using interim storage is helpful when the bandwidth of your corporate
network and your Azure services is the limiting factor, and you want the input data set and output data set both
to be in uncompressed form. More specifically, you can break a single copy activity into two copy activities. The
first copy activity copies from the source to an interim or staging blob in compressed form. The second copy
activity copies the compressed data from staging, and then decompresses while it writes to the sink.

Considerations for column mapping


You can set the columnMappings property in Copy Activity to map all or a subset of the input columns to the
output columns. After the data movement service reads the data from the source, it needs to perform column
mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput.
If your source data store is queryable, for example, if it's a relational store like SQL Database or SQL Server, or
if it's a NoSQL store like Table storage or Azure Cosmos DB, consider pushing the column filtering and
reordering logic to the query property instead of using column mapping. This way, the projection occurs while
the data movement service reads data from the source data store, where it is much more efficient.
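One way to express the mapping, following the TabularTranslator pattern used by Data Factory column mapping
(the column names here are placeholders), is a translator in the Copy Activity's typeProperties:

"typeProperties": {
    "source": { "type": "SqlSource" },
    "sink": { "type": "SqlSink" },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
    }
}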

Other considerations
If the size of data you want to copy is large, you can adjust your business logic to further partition the data
using the slicing mechanism in Data Factory. Then, schedule Copy Activity to run more frequently to reduce the
data size for each Copy Activity run.
Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same
data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded
performance, copy job internal retries, and in some cases, execution failures.

Sample scenario: Copy from an on-premises SQL Server to Blob storage


Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To
make the copy job faster, the CSV files should be compressed into bzip2 format.
Test and analysis: The throughput of Copy Activity is less than 2 MBps, which is much slower than the
performance benchmark.
Performance analysis and tuning: To troubleshoot the performance issue, let's look at how the data is
processed and moved.
1. Read data: Gateway opens a connection to SQL Server and sends the query. SQL Server responds by
sending the data stream to Gateway via the intranet.
2. Serialize and compress data: Gateway serializes the data stream to CSV format, and compresses the data
to a bzip2 stream.
3. Write data: Gateway uploads the bzip2 stream to Blob storage via the Internet.
As you can see, the data is being processed and moved in a streaming sequential manner: SQL Server > LAN >
Gateway > WAN > Blob storage. The overall performance is gated by the minimum throughput across
the pipeline.
One or more of the following factors might cause the performance bottleneck:
Source: SQL Server itself has low throughput because of heavy loads.
Data Management Gateway:
LAN: Gateway is located far from the SQL Server machine and has a low-bandwidth connection.
Gateway: Gateway has reached its load limitations to perform the following operations:
Serialization: Serializing the data stream to CSV format has slow throughput.
Compression: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps
with Core i7).
WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1
= 1,544 kbps; T2 = 6,312 kbps).
Sink: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of
60 MBps.)
In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip
compression codec might ease this bottleneck.

Sample scenarios: Use parallel copy


Scenario I: Copy 1,000 1-MB files from the on-premises file system to Blob storage.
Analysis and performance tuning: For example, if you have installed the gateway on a quad-core machine,
Data Factory uses 16 parallel copies to move files from the file system to Blob storage concurrently. This
parallel execution should result in high throughput. You can also explicitly specify the parallel copies count.
When you copy many small files, parallel copies dramatically help throughput by using resources more
effectively.

Scenario II: Copy 20 blobs of 500 MB each from Blob storage to Data Lake Store, and then tune
performance.
Analysis and performance tuning: In this scenario, Data Factory copies the data from Blob storage to Data
Lake Store by using a single copy (parallelCopies set to 1) and a single cloud data movement unit. The
throughput you observe will be close to that described in the performance reference section.

Scenario III: Individual file size is greater than dozens of MBs and total volume is large.
Analysis and performance tuning: Increasing parallelCopies doesn't result in better copy performance
because of the resource limitations of a single cloud DMU. Instead, you should specify more cloud DMUs to get
more resources to perform the data movement. Do not specify a value for the parallelCopies property; Data
Factory handles the parallelism for you. In this case, if you set cloudDataMovementUnits to 4, you get
roughly four times the throughput.
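As a sketch of where these settings live (the source and sink types are placeholders for your own scenario),
cloudDataMovementUnits and, when you choose to control it explicitly, parallelCopies are specified in the Copy
Activity's typeProperties. For Scenario III you would omit parallelCopies and let the service manage it.

"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "BlobSink" },
    "cloudDataMovementUnits": 4,
    "parallelCopies": 8
}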

Reference
Here are performance monitoring and tuning references for some of the supported data stores:
Azure Storage (including Blob storage and Table storage): Azure Storage scalability targets and Azure
Storage performance and scalability checklist
Azure SQL Database: You can monitor the performance and check the database transaction unit (DTU)
percentage
Azure SQL Data Warehouse: Its capability is measured in data warehouse units (DWUs); see Manage
compute power in Azure SQL Data Warehouse (Overview)
Azure Cosmos DB: Performance levels in Azure Cosmos DB
On-premises SQL Server: Monitor and tune for performance
On-premises file server: Performance tuning for file servers
Add fault tolerance in Copy Activity by skipping
incompatible rows
8/21/2017 3 min to read Edit Online

Azure Data Factory Copy Activity offers you two ways to handle incompatible rows when copying data between
source and sink data stores:
You can abort and fail the copy activity when incompatible data is encountered (default behavior).
You can continue to copy all of the data by adding fault tolerance and skipping incompatible data rows. In
addition, you can log the incompatible rows in Azure Blob storage. You can then examine the log to learn the
cause for the failure, fix the data on the data source, and retry the copy activity.

Supported scenarios
Copy Activity supports three scenarios for detecting, skipping, and logging incompatible data:
Incompatibility between the source data type and the sink native type
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains three INT type columns. The CSV file rows that contain numeric data, such as 123,456,789 are
copied successfully to the sink store. However, the rows that contain non-numeric values, such as
123,456,abc are detected as incompatible and are skipped.

Mismatch in the number of columns between the source and the sink
For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that
contains six columns. The CSV file rows that contain six columns are copied successfully to the sink store.
The CSV file rows that contain more or fewer than six columns are detected as incompatible and are
skipped.
Primary key violation when writing to a relational database
For example: Copy data from a SQL server to a SQL database. A primary key is defined in the sink SQL
database, but no such primary key is defined in the source SQL server. The duplicated rows that exist in the
source cannot be copied to the sink. Copy Activity copies only the first row of the source data into the sink.
The subsequent source rows that contain the duplicated primary key value are detected as incompatible and
are skipped.

Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in Copy Activity:
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": "BlobStorage",
"path": "redirectcontainer/erroroutput"
}
}

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
enableSkipIncompatibleRow | Enable skipping incompatible rows during copy or not. | True, False (default) | No
redirectIncompatibleRowSettings | A group of properties that can be specified when you want to log the incompatible rows. | | No
linkedServiceName | The linked service of Azure Storage to store the log that contains the skipped rows. | The name of an AzureStorage or AzureStorageSas linked service, which refers to the storage instance that you want to use to store the log file. | No
path | The path of the log file that contains the skipped rows. | Specify the Blob storage path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you. | No

Monitoring
After the copy activity run completes, you can see the number of skipped rows in the monitoring section:
If you configure to log the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-
GUID].csv
In the log file, you can see the rows that were skipped and the root cause of the incompatibility.
Both the original data and the corresponding error are logged in the file. An example of the log file content is as
follows:

data1, data2, data3, UserErrorInvalidDataValue,Column 'Prop_2' contains an invalid value 'data3'. Cannot
convert 'data3' to type 'DateTime'.,
data4, data5, data6, Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot insert duplicate
key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4).

Next steps
To learn more about Azure Data Factory Copy Activity, see Move data by using Copy Activity.
Azure Data Factory - Security considerations for data
movement
8/21/2017 10 min to read Edit Online

Introduction
This article describes basic security infrastructure that data movement services in Azure Data Factory use to secure
your data. Azure Data Factory management resources are built on Azure security infrastructure and use all possible
security measures offered by Azure.
In a Data Factory solution, you create one or more data pipelines. A pipeline is a logical grouping of activities that
together perform a task. These pipelines reside in the region where the data factory was created.
Even though Data Factory is available in only the West US, East US, and North Europe regions, the data movement
service is available globally in several regions. The Data Factory service ensures that data does not leave a
geographical area/region unless you explicitly instruct the service to use an alternate region when the data
movement service is not yet deployed to that region.
Azure Data Factory itself does not store any data except for linked service credentials for cloud data stores, which
are encrypted using certificates. It lets you create data-driven workflows to orchestrate movement of data between
supported data stores and processing of data using compute services in other regions or in an on-premises
environment. It also allows you to monitor and manage workflows using both programmatic and UI mechanisms.
Data movement using Azure Data Factory has been certified for:
HIPAA/HITECH
ISO/IEC 27001
ISO/IEC 27018
CSA STAR
If you are interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust
Center.
In this article, we review security considerations in the following two data movement scenarios:
Cloud scenario- In this scenario, both your source and destination are publicly accessible through internet.
These include managed cloud storage services like Azure Storage, Azure SQL Data Warehouse, Azure SQL
Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce, and web
protocols such as FTP and OData. You can find a complete list of supported data sources here.
Hybrid scenario- In this scenario, either your source or destination is behind a firewall or inside an on-
premises corporate network or the data store is in a private network/ virtual network (most often the source)
and is not publicly accessible. Database servers hosted on virtual machines also fall under this scenario.

Cloud scenarios
Securing data store credentials
Azure Data Factory protects your data store credentials by encrypting them by using certificates managed by
Microsoft. These certificates are rotated every two years (which includes renewal of certificate and migration of
credentials). These encrypted credentials are securely stored in an Azure Storage managed by Azure Data
Factory management services. For more information about Azure Storage security, refer Azure Storage Security
Overview.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory
and a cloud data store are via secure channel HTTPS or TLS.

NOTE
All connections to Azure SQL Database and Azure SQL Data Warehouse always require encryption (SSL/TLS) while data is
in transit to and from the database. While authoring a pipeline using a JSON editor, add the encryption property and set it
to true in the connection string. When you use the Copy Wizard, the wizard sets this property by default. For Azure
Storage, you can use HTTPS in the connection string.
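For example, a minimal sketch of an Azure SQL Database linked service with encryption enabled in the connection
string might look like the following (the server, database, and credential values are placeholders):

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}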

Data encryption at rest


Some data stores support encryption of data at rest. We suggest that you enable data encryption mechanism for
those data stores.
Azure SQL Data Warehouse
Transparent Data Encryption (TDE) in Azure SQL Data Warehouse helps with protecting against the threat of
malicious activity by performing real-time encryption and decryption of your data at rest. This behavior is
transparent to the client. For more information, see Secure a database in SQL Data Warehouse.
Azure SQL Database
Azure SQL Database also supports transparent data encryption (TDE), which helps with protecting against the
threat of malicious activity by performing real-time encryption and decryption of the data without requiring
changes to the application. This behavior is transparent to the client. For more information, see Transparent Data
Encryption with Azure SQL Database.
Azure Data Lake Store
Azure Data Lake store also provides encryption for data stored in the account. When enabled, Data Lake store
automatically encrypts data before persisting and decrypts before retrieval, making it transparent to the client
accessing the data. For more information, see Security in Azure Data Lake Store.
Azure Blob Storage and Azure Table Storage
Azure Blob Storage and Azure Table storage support Storage Service Encryption (SSE), which automatically
encrypts your data before persisting to storage and decrypts before retrieval. For more information, see Azure
Storage Service Encryption for Data at Rest.
Amazon S3
Amazon S3 supports both client and server encryption of data at rest. For more information, see Protecting Data
Using Encryption. Currently, Data Factory does not support Amazon S3 inside a virtual private cloud (VPC).
Amazon Redshift
Amazon Redshift supports cluster encryption for data at rest. For more information, see Amazon Redshift Database
Encryption. Currently, Data Factory does not support Amazon Redshift inside a VPC.
Salesforce
Salesforce supports Shield Platform Encryption, which allows encryption of all files, attachments, and custom fields.
For more information, see Understanding the Web Server OAuth Authentication Flow.

Hybrid Scenarios (using Data Management Gateway)


Hybrid scenarios require Data Management Gateway to be installed in an on-premises network or inside a virtual
network (Azure) or a virtual private cloud (Amazon). The gateway must be able to access the local data stores. For
more information about the gateway, see Data Management Gateway.
The command channel allows communication between data movement services in Data Factory and Data
Management Gateway. The communication contains information related to the activity. The data channel is used
for transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials
The credentials for your on-premises data stores are stored locally (not in the cloud). They can be set in the
following ways.
Using plain-text (less secure) via HTTPS from Azure Portal/ Copy Wizard. The credentials are passed in plain-
text to the on-premises gateway.
Using JavaScript Cryptography library from Copy Wizard.
Using click-once based credentials manager app. The click-once application executes on the on-premises
machine that has access to the gateway and sets credentials for the data store. This option and the next one are
the most secure options. The credential manager app, by default, uses the port 8050 on the machine with
gateway for secure communication.
Use New-AzureRmDataFactoryEncryptValue PowerShell cmdlet to encrypt credentials. The cmdlet uses the
certificate that gateway is configured to use to encrypt the credentials. You can use the encrypted credentials
returned by this cmdlet and add it to EncryptedCredential element of the connectionString in the JSON file
that you use with the New-AzureRmDataFactoryLinkedService cmdlet or in the JSON snippet in the Data
Factory Editor in the portal. This option and the click-once application are the most secure options.
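As a sketch (the server, database, and gateway names are placeholders), the encrypted value returned by the cmdlet
goes into the EncryptedCredential element of the connection string in the linked service JSON:

{
    "name": "OnPremisesSqlLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;EncryptedCredential=<encrypted credential returned by New-AzureRmDataFactoryEncryptValue>",
            "gatewayName": "<gatewayname>"
        }
    }
}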
JavaScript cryptography library-based encryption
You can encrypt data store credentials using JavaScript Cryptography library from the Copy Wizard. When you
select this option, the Copy Wizard retrieves the public key of gateway and uses it to encrypt the data store
credentials. The credentials are decrypted by the gateway machine and protected by Windows DPAPI.
Supported browsers: IE8, IE9, IE10, IE11, Microsoft Edge, and latest Firefox, Chrome, Opera, Safari browsers.
Click-once credentials manager app
You can launch the click-once based credential manager app from Azure portal/Copy Wizard when authoring
pipelines. This application ensures that credentials are not transferred in plain text over the wire. By default, it uses
the port 8050 on the machine with gateway for secure communication. If necessary, this port can be changed.
Currently, Data Management Gateway uses a single certificate. This certificate is created during the gateway
installation (applies to Data Management Gateway created after November 2016 and version 2.4.xxxx.x or later).
You can replace this certificate with your own SSL/TLS certificate. This certificate is used by the click-once credential
manager application to securely connect to the gateway machine for setting data store credentials. It stores data
store credentials securely on-premises by using the Windows DPAPI on the machine with gateway.

NOTE
Older gateways that were installed before November 2016 or of version 2.3.xxxx.x continue to use credentials encrypted and
stored on cloud. Even if you upgrade the gateway to the latest version, the credentials are not migrated to an on-premises
machine

GATEWAY VERSION (DURING CREATION) | CREDENTIALS STORED | CREDENTIAL ENCRYPTION/SECURITY
<= 2.3.xxxx.x | On cloud | Encrypted using certificate (different from the one used by the Credential Manager app)
>= 2.4.xxxx.x | On premises | Secured via DPAPI

Encryption in transit
All data transfers are via secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during
communication with Azure services.
You can also use IPSec VPN or Express Route to further secure the communication channel between your on-
premises network and Azure.
A virtual network is a logical representation of your network in the cloud. You can connect an on-premises network
to your Azure virtual network (VNet) by setting up an IPSec VPN (site-to-site) or ExpressRoute (Private Peering).
The following table summarizes the network and gateway configuration recommendations based on different
combinations of source and destination locations for hybrid data movement.

SOURCE | DESTINATION | NETWORK CONFIGURATION | GATEWAY SETUP
On-premises | Virtual machines and cloud services deployed in virtual networks | IPSec VPN (point-to-site or site-to-site) | Gateway can be installed either on-premises or on an Azure virtual machine (VM) in the VNet
On-premises | Virtual machines and cloud services deployed in virtual networks | ExpressRoute (Private Peering) | Gateway can be installed either on-premises or on an Azure VM in the VNet
On-premises | Azure-based services that have a public endpoint | ExpressRoute (Public Peering) | Gateway must be installed on-premises

The following images show the usage of Data Management Gateway for moving data between an on-premises
database and Azure services using Express route and IPSec VPN (with Virtual Network):
Express Route:

IPSec VPN:
Firewall configurations and whitelisting IP address of gateway
Firewall requirements for on-premises/private network
In an enterprise, a corporate firewall runs on the central router of the organization, and Windows Firewall runs
as a daemon on the local machine on which the gateway is installed.
The following table provides outbound port and domain requirements for the corporate firewall.

DOMAIN NAMES | OUTBOUND PORTS | DESCRIPTION
*.servicebus.windows.net | 443, 80 | Required by the gateway to connect to data movement services in Data Factory.
*.core.windows.net | 443 | Used by the gateway to connect to the Azure Storage account when you use the staged copy feature.
*.frontend.clouddatahub.net | 443 | Required by the gateway to connect to the Azure Data Factory service.
*.database.windows.net | 1433 | (Optional) Needed when your destination is Azure SQL Database or Azure SQL Data Warehouse. Use the staged copy feature to copy data to Azure SQL Database or Azure SQL Data Warehouse without opening port 1433.
*.azuredatalakestore.net | 443 | (Optional) Needed when your destination is Azure Data Lake Store.

NOTE
You may have to manage ports/ whitelisting domains at the corporate firewall level as required by respective data sources.
This table only uses Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake Store as examples.

The following table provides inbound port requirements for Windows Firewall.

INBOUND PORTS | DESCRIPTION
8050 (TCP) | Required by the credential manager application to securely set credentials for on-premises data stores on the gateway.

IP configurations/ whitelisting in data store


Some data stores in the cloud also require whitelisting of the IP address of the machine that accesses them. Ensure
that the IP address of the gateway machine is whitelisted or configured in the firewall appropriately.
The following cloud data stores require whitelisting of the IP address of the gateway machine. Some of these data
stores, by default, may not require whitelisting of the IP address.
Azure SQL Database
Azure SQL Data Warehouse
Azure Data Lake Store
Azure Cosmos DB
Amazon Redshift

Frequently asked questions


Question: Can the Gateway be shared across different data factories? Answer: We do not support this feature yet.
We are actively working on it.
Question: What are the port requirements for the gateway to work? Answer: The gateway makes HTTP-based
connections to the open internet. Outbound ports 443 and 80 must be opened for the gateway to make this
connection. Open inbound port 8050 only at the machine level (not at the corporate firewall level) for the Credential
Manager application. If Azure SQL Database or Azure SQL Data Warehouse is used as the source or destination, you
need to open port 1433 as well. For more information, see the Firewall configurations and whitelisting IP addresses
section.
Question: What are the certificate requirements for the gateway? Answer: The current gateway requires a certificate
that is used by the credential manager application for securely setting data store credentials. This certificate is a
self-signed certificate created and configured by the gateway setup. You can use your own TLS/SSL certificate instead.
For more information, see the Click-once credentials manager app section.

Next steps
For information about performance of copy activity, see Copy activity performance and tuning guide.
Move data from Amazon Redshift using Azure
Data Factory
7/27/2017 7 min to read Edit Online

This article explains how to use the Copy Activity in Azure Data Factory to move data from Amazon Redshift.
The article builds on the Data Movement Activities article, which presents a general overview of data movement
with the copy activity.
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores supported as
sinks by the copy activity, see supported data stores. Data Factory currently supports moving data from Amazon
Redshift to other data stores, but not moving data from other data stores to Amazon Redshift.

Prerequisites
If you are moving data to an on-premises data store, install Data Management Gateway on an on-premises
machine. Then, grant the Data Management Gateway machine (by its IP address) access to the Amazon
Redshift cluster. See Authorize access to the cluster for instructions.
If you are moving data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address
and SQL ranges used by the Azure data centers.

Getting started
You can create a pipeline with a copy activity that moves data from an Amazon Redshift source by using
different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an Amazon Redshift data store, see JSON example: Copy data from Amazon Redshift to
Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Amazon Redshift:

Linked service properties


The following table provides descriptions of the JSON elements specific to the Amazon Redshift linked service.
PROPERTY | DESCRIPTION | REQUIRED
type | The type property must be set to: AmazonRedshift. | Yes
server | IP address or host name of the Amazon Redshift server. | Yes
port | The number of the TCP port that the Amazon Redshift server uses to listen for client connections. | No, default value: 5439
database | Name of the Amazon Redshift database. | Yes
username | Name of the user who has access to the database. | Yes
password | Password for the user account. | Yes

Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections
such as structure, availability, and policy are similar for all dataset types (Azure SQL, Azure blob, Azure table,
etc.).
The typeProperties section is different for each type of dataset. It provides information about the location of
the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes
the Amazon Redshift dataset) has the following properties:

PROPERTY | DESCRIPTION | REQUIRED
tableName | Name of the table in the Amazon Redshift database that the linked service refers to. | No (if query of RelationalSource is specified)

Copy activity properties


For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
Properties such as name, description, input and output tables, and policies are available for all types of activities.
Whereas, properties available in the typeProperties section of the activity vary with each activity type. For
Copy activity, they vary depending on the types of sources and sinks.
When source of copy activity is of type RelationalSource (which includes Amazon Redshift), the following
properties are available in typeProperties section:

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No (if tableName of dataset is specified)

JSON example: Copy data from Amazon Redshift to Azure Blob


This sample shows how to copy data from an Amazon Redshift database to an Azure Blob Storage. However,
data can be copied directly to any of the sinks stated here using the Copy Activity in Azure Data Factory.
The sample has the following data factory entities:
A linked service of type AmazonRedshift.
A linked service of type AzureStorage.
An input dataset of type RelationalTable.
An output dataset of type AzureBlob.
A pipeline with Copy Activity that uses RelationalSource and BlobSink.
The sample copies data from a query result in Amazon Redshift to a blob every hour. The JSON properties used
in these samples are described in sections following the samples.
Amazon Redshift linked service:

{
"name": "AmazonRedshiftLinkedService",
"properties":
{
"type": "AmazonRedshift",
"typeProperties":
{
"server": "< The IP address or host name of the Amazon Redshift server >",
"port": <The number of the TCP port that the Amazon Redshift server uses to listen for client
connections.>,
"database": "<The database name of the Amazon Redshift database>",
"username": "<username>",
"password": "<password>"
}
}
}

Azure Storage linked service:

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Amazon Redshift input dataset:


Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is
not produced by an activity in the data factory. Set this property to true on an input dataset that is not produced
by an activity in the pipeline.
{
"name": "AmazonRedshiftInputDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": "AmazonRedshiftLinkedService",
"typeProperties": {
"tableName": "<Table name>"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}

Azure Blob output dataset:


Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is
dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year,
month, day, and hours parts of the start time.
{
"name": "AzureBlobOutputDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/fromamazonredshift/yearno={Year}/monthno={Month}/dayno={Day}/hourno=
{Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

Copy activity in a pipeline with an Amazon Redshift source (RelationalSource) and a Blob sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to
run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set
to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.
{
"name": "CopyAmazonRedshiftToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-
ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonRedshiftInputDataset"
}
],
"outputs": [
{
"name": "AzureBlobOutputDataSet"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AmazonRedshiftToBlob"
}
],
"start": "2014-06-01T18:00:00Z",
"end": "2014-06-01T19:00:00Z"
}
}

Type mapping for Amazon Redshift


As mentioned in the data movement activities article, Copy activity performs automatic type conversions from
source types to sink types with the following two-step approach:
1. Convert from native source types to .NET type
2. Convert from .NET type to native sink type
When moving data from Amazon Redshift, the following mappings are used from Amazon Redshift types to .NET
types.

AMAZON REDSHIFT TYPE .NET BASED TYPE

SMALLINT Int16

INTEGER Int32

BIGINT Int64

DECIMAL Decimal

REAL Single

DOUBLE PRECISION Double

BOOLEAN String

CHAR String

VARCHAR String

DATE DateTime

TIMESTAMP DateTime

TEXT String

Map source to sink columns


To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in
Azure Data Factory.

Repeatable read from relational sources


When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In
Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a
slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same
data is read no matter how many times a slice is run. See Repeatable read from relational sources.
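A sketch of what a retry policy can look like on an external input dataset follows; the property names assume the
externalData policy section of Data Factory datasets, and the values are illustrative only:

"availability": {
    "frequency": "Hour",
    "interval": 1
},
"external": true,
"policy": {
    "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
    }
}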

Performance and Tuning


See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data
movement (Copy Activity) in Azure Data Factory and various ways to optimize it.

Next Steps
See the following articles:
Copy Activity tutorial for step-by-step instructions for creating a pipeline with a Copy Activity.
Move data from Amazon Simple Storage Service
by using Azure Data Factory
6/27/2017 8 min to read Edit Online

This article explains how to use the copy activity in Azure Data Factory to move data from Amazon Simple
Storage Service (S3). It builds on the Data movement activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from Amazon S3 to any supported sink data store. For a list of data stores supported as sinks
by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data
from Amazon S3 to other data stores, but not moving data from other data stores to Amazon S3.

Required permissions
To copy data from Amazon S3, make sure you have been granted the following permissions:
s3:GetObject and s3:GetObjectVersion for Amazon S3 Object Operations.
s3:ListBucket for Amazon S3 Bucket Operations. If you are using the Data Factory Copy Wizard,
s3:ListAllMyBuckets is also required.

For details about the full list of Amazon S3 permissions, see Specifying Permissions in a Policy.
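For illustration only, a hypothetical IAM policy that grants these permissions for a single bucket (the bucket name
is a placeholder; refer to the AWS documentation linked above for the authoritative syntax) might look like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
            "Resource": "arn:aws:s3:::examplebucket/*"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListBucket" ],
            "Resource": "arn:aws:s3:::examplebucket"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListAllMyBuckets" ],
            "Resource": "*"
        }
    ]
}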

Getting started
You can create a pipeline with a copy activity that moves data from an Amazon S3 source by using different
tools or APIs.
The easiest way to create a pipeline is to use the Copy Wizard. For a quick walkthrough, see Tutorial: Create a
pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. For step-by-step instructions to create a
pipeline with a copy activity, see the Copy activity tutorial.
Whether you use tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools or APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are
used to copy data from an Amazon S3 data store, see the JSON example: Copy data from Amazon S3 to Azure
Blob section of this article.
NOTE
For details about supported file and compression formats for a copy activity, see File and compression formats in Azure
Data Factory.

The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Amazon S3.

Linked service properties


A linked service links a data store to a data factory. You create a linked service of type AwsAccessKey to link
your Amazon S3 data store to your data factory. The following table provides descriptions of the JSON elements
specific to the Amazon S3 (AwsAccessKey) linked service.

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
accessKeyID | ID of the secret access key. | string | Yes
secretAccessKey | The secret access key itself. | Encrypted secret string | Yes

Here is an example:

{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AwsAccessKey",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": "<secret access key>"
}
}
}

Dataset properties
To specify a dataset to represent input data in Amazon S3, set the type property of the dataset to
AmazonS3. Set the linkedServiceName property of the dataset to the name of the Amazon S3 linked service.
For a full list of sections and properties available for defining datasets, see Creating datasets.
Sections such as structure, availability, and policy are similar for all dataset types (such as SQL database, Azure
blob, and Azure table). The typeProperties section is different for each type of dataset, and provides
information about the location of the data in the data store. The typeProperties section for a dataset of type
AmazonS3 has the following properties:

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
bucketName | The S3 bucket name. | String | Yes
key | The S3 object key. | String | No
prefix | Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when key is empty. | String | No
version | The version of the S3 object, if S3 versioning is enabled. | String | No
format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text format, JSON format, Avro format, Orc format, and Parquet format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | | No
compression | Specify the type and level of compression for the data. The supported types are: GZip, Deflate, BZip2, and ZipDeflate. The supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | | No

NOTE
bucketName + key specifies the location of the S3 object, where bucket is the root container for S3 objects, and key is
the full path to the S3 object.

Sample dataset with prefix


{
"name": "dataset-s3",
"properties": {
"type": "AmazonS3",
"linkedServiceName": "link- testS3",
"typeProperties": {
"prefix": "testFolder/test",
"bucketName": "testbucket",
"format": {
"type": "OrcFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}

Sample dataset (with version)

{
"name": "dataset-s3",
"properties": {
"type": "AmazonS3",
"linkedServiceName": "link- testS3",
"typeProperties": {
"key": "testFolder/test.orc",
"bucketName": "testbucket",
"version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL",
"format": {
"type": "OrcFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}

Dynamic paths for S3


The preceding sample uses fixed values for the key and bucketName properties in the Amazon S3 dataset.

"key": "testFolder/test.orc",
"bucketName": "testbucket",

You can have Data Factory calculate these properties dynamically at runtime, by using system variables such as
SliceStart.

"key": "$$Text.Format('{0:MM}/{0:dd}/test.orc', SliceStart)"


"bucketName": "$$Text.Format('{0:yyyy}', SliceStart)"

You can do the same for the prefix property of an Amazon S3 dataset. For a list of supported functions and
variables, see Data Factory functions and system variables.

Copy activity properties


For a full list of sections and properties available for defining activities, see Creating pipelines. Properties such
as name, description, input and output tables, and policies are available for all types of activities. Properties
available in the typeProperties section of the activity vary with each activity type. For the copy activity,
properties vary depending on the types of sources and sinks. When a source in the copy activity is of type
FileSystemSource (which includes Amazon S3), the following property is available in typeProperties section:

PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
recursive | Specifies whether to recursively list S3 objects under the directory. | true/false | No

JSON example: Copy data from Amazon S3 to Azure Blob storage


This sample shows how to copy data from Amazon S3 to an Azure Blob storage. However, data can be copied
directly to any of the sinks that are supported by using the copy activity in Data Factory.
The sample provides JSON definitions for the following Data Factory entities. You can use these definitions to
create a pipeline to copy data from Amazon S3 to Blob storage, by using the Azure portal, Visual Studio, or
PowerShell.
A linked service of type AwsAccessKey.
A linked service of type AzureStorage.
An input dataset of type AmazonS3.
An output dataset of type AzureBlob.
A pipeline with copy activity that uses FileSystemSource and BlobSink.
The sample copies data from Amazon S3 to an Azure blob every hour. The JSON properties used in these
samples are described in sections following the samples.
Amazon S3 linked service

{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AwsAccessKey",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": "<secret access key>"
}
}
}

Azure Storage linked service

{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Amazon S3 input dataset


Setting "external": true informs the Data Factory service that the dataset is external to the data factory. Set this
property to true on an input dataset that is not produced by an activity in the pipeline.

{
"name": "AmazonS3InputDataset",
"properties": {
"type": "AmazonS3",
"linkedServiceName": "AmazonS3LinkedService",
"typeProperties": {
"key": "testFolder/test.orc",
"bucketName": "testbucket",
"format": {
"type": "OrcFormat"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true
}
}

Azure Blob output dataset


Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is
dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year,
month, day, and hours parts of the start time.
{
"name": "AzureBlobOutputDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/fromamazons3/yearno={Year}/monthno={Month}/dayno={Day}/hourno=
{Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

Copy activity in a pipeline with an Amazon S3 source and a blob sink


The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to
run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource, and sink type is
set to BlobSink.
{
"name": "CopyAmazonS3ToBlob",
"properties": {
"description": "pipeline for copy activity",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "AmazonS3InputDataset"
}
],
"outputs": [
{
"name": "AzureBlobOutputDataSet"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "AmazonS3ToBlob"
}
],
"start": "2014-08-08T18:00:00Z",
"end": "2014-08-08T19:00:00Z"
}
}

NOTE
To map columns from a source dataset to columns from a sink dataset, see Mapping dataset columns in Azure Data
Factory.

Next steps
See the following articles:
To learn about key factors that impact performance of data movement (copy activity) in Data Factory,
and various ways to optimize it, see the Copy activity performance and tuning guide.
For step-by-step instructions for creating a pipeline with a copy activity, see the Copy activity tutorial.
Copy data to or from Azure Blob Storage using
Azure Data Factory
8/21/2017 31 min to read Edit Online

This article explains how to use the Copy Activity in Azure Data Factory to copy data to and from Azure Blob
Storage. It builds on the Data Movement Activities article, which presents a general overview of data
movement with the copy activity.

Overview
You can copy data from any supported source data store to Azure Blob Storage or from Azure Blob Storage
to any supported sink data store. The following table provides a list of data stores supported as sources or
sinks by the copy activity. For example, you can move data from a SQL Server database or an Azure SQL
database to an Azure blob storage. And, you can copy data from Azure blob storage to an Azure SQL Data
Warehouse or an Azure Cosmos DB collection.

Supported scenarios
You can copy data from Azure Blob Storage to the following data stores:

CATEGORY DATA STORE

Azure Azure Blob storage


Azure Data Lake Store
Azure Cosmos DB (DocumentDB API)
Azure SQL Database
Azure SQL Data Warehouse
Azure Search Index
Azure Table storage

Databases SQL Server


Oracle

File File system

You can copy data from the following data stores to Azure Blob Storage:

CATEGORY DATA STORE

Azure Azure Blob storage


Azure Cosmos DB (DocumentDB API)
Azure Data Lake Store
Azure SQL Database
Azure SQL Data Warehouse
Azure Table storage
CATEGORY DATA STORE

Databases Amazon Redshift


DB2
MySQL
Oracle
PostgreSQL
SAP Business Warehouse
SAP HANA
SQL Server
Sybase
Teradata

NoSQL Cassandra
MongoDB

File Amazon S3
File System
FTP
HDFS
SFTP

Others Generic HTTP


Generic OData
Generic ODBC
Salesforce
Web Table (table from HTML)
GE Historian

IMPORTANT
Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob
storage. The activity supports reading from block, append, or page blobs, but supports writing to only block
blobs. Azure Premium Storage is not supported as a sink because it is backed by page blobs.
Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need
to delete source data after a successful copy, create a custom activity to delete the data and use the activity in the
pipeline. For an example, see the Delete blob or folder sample on GitHub.

Get started
You can create a pipeline with a copy activity that moves data to/from an Azure Blob Storage by using
different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. This article has a walkthrough for creating a
pipeline to copy data from an Azure Blob Storage location to another Azure Blob Storage location. For a
tutorial on creating a pipeline to copy data from an Azure Blob Storage to Azure SQL Database, see Tutorial:
Create a pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from
a source data store to a sink data store:
1. Create a data factory. A data factory may contain one or more pipelines.
2. Create linked services to link input and output data stores to your data factory. For example, if you are
copying data from an Azure blob storage to an Azure SQL database, you create two linked services to link
your Azure storage account and Azure SQL database to your data factory. For linked service properties
that are specific to Azure Blob Storage, see linked service properties section.
3. Create datasets to represent input and output data for the copy operation. In the example mentioned in
the last step, you create a dataset to specify the blob container and folder that contains the input data.
And, you create another dataset to specify the SQL table in the Azure SQL database that holds the data
copied from the blob storage. For dataset properties that are specific to Azure Blob Storage, see dataset
properties section.
4. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the
example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity.
Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and
BlobSink in the copy activity. For copy activity properties that are specific to Azure Blob Storage, see copy
activity properties section. For details on how to use a data store as a source or a sink, click the link in the
previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that
are used to copy data to/from an Azure Blob Storage, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Azure Blob Storage.

Linked service properties


There are two types of linked services you can use to link an Azure Storage account to an Azure data factory: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage account, whereas the Azure Storage SAS (Shared Access Signature) linked service provides the data factory with restricted/time-bound access to it. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.
Azure Storage Linked Service
The Azure Storage linked service allows you to link an Azure storage account to an Azure data factory by
using the account key, which provides the data factory with global access to the Azure Storage. The following properties are specific to the Azure Storage linked service.

type: The type property must be set to: AzureStorage. Required: Yes.

connectionString: Specify the information needed to connect to Azure Storage for the connectionString property. Required: Yes.

See the following article for steps to view/copy the account key for an Azure Storage: View, copy, and
regenerate storage access keys.
Example:
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Azure Storage Sas Linked Service


A shared access signature (SAS) provides delegated access to resources in your storage account. It allows
you to grant a client limited permissions to objects in your storage account for a specified period of time and
with a specified set of permissions, without having to share your account access keys. The SAS is a URI that
encompasses in its query parameters all the information necessary for authenticated access to a storage
resource. To access storage resources with the SAS, the client only needs to pass in the SAS to the
appropriate constructor or method. For detailed information about SAS, see Shared Access Signatures:
Understanding the SAS Model

IMPORTANT
Azure Data Factory currently supports only a Service SAS, not an Account SAS. See Types of Shared Access Signatures for details about these two types and how to construct them. Note that the SAS URL that can be generated from the Azure portal or Storage Explorer is an Account SAS, which is not supported.

The Azure Storage SAS linked service allows you to link an Azure Storage Account to an Azure data factory
by using a Shared Access Signature (SAS). It provides the data factory with restricted/time-bound access to
all/specific resources (blob/container) in the storage. The following properties are specific to the Azure Storage SAS linked service.

type: The type property must be set to: AzureStorageSas. Required: Yes.

sasUri: Specify the Shared Access Signature URI to the Azure Storage resources such as blob, container, or table. Required: Yes.

Example:

{
"name": "StorageSasLinkedService",
"properties": {
"type": "AzureStorageSas",
"typeProperties": {
"sasUri": "<Specify SAS URI of the Azure Storage resource>"
}
}
}

When creating a SAS URI, consider the following:

Set appropriate read/write permissions on objects based on how the linked service (read, write,
read/write) is used in your data factory.
Set the expiry time appropriately. Make sure that the access to Azure Storage objects does not expire within
the active period of the pipeline.
The URI should be created at the right container/blob or table level based on the need. A SAS URI to an Azure
blob allows the Data Factory service to access that particular blob. A SAS URI to an Azure blob container
allows the Data Factory service to iterate through blobs in that container. If you need to provide access to
more or fewer objects later, or to update the SAS URI, remember to update the linked service with the new URI.

Dataset properties
To specify a dataset to represent input or output data in an Azure Blob Storage, you set the type property of
the dataset to: AzureBlob. Set the linkedServiceName property of the dataset to the name of the Azure
Storage or Azure Storage SAS linked service. The type properties of the dataset specify the blob container
and the folder in the blob storage.
For a full list of JSON sections & properties available for defining datasets, see the Creating datasets article.
Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure
SQL, Azure blob, Azure table, etc.).
Data factory supports the following CLS-compliant .NET based type values for providing type information in
structure for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal,
Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type
conversions when moving data from a source data store to a sink data store.
The typeProperties section is different for each type of dataset and provides information about the location, format, etc., of the data in the data store. The typeProperties section for a dataset of type AzureBlob has the following properties:

folderPath: Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\. Required: Yes.

fileName: Name of the blob. fileName is optional and case-sensitive. If you specify a fileName, the activity (including Copy) works on that specific blob. When fileName is not specified, Copy includes all blobs in the folderPath for an input dataset. When fileName is not specified for an output dataset and preserveHierarchy is not specified in the activity sink, the name of the generated file is in the format Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). Required: No.

partitionedBy: An optional property that you can use to specify a dynamic folderPath and fileName for time series data. For example, folderPath can be parameterized for every hour of data. See the Using partitionedBy property section for details and examples. Required: No.

format: The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections. If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. Required: No.

compression: Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. Required: No.
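For example, an AzureBlob input dataset that reads a single GZip-compressed, tab-delimited blob might use these properties together as in the following sketch (the container, folder, and file names are illustrative placeholders):

{
    "name": "AzureBlobCompressedInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/myfolder/",
            "fileName": "data.txt.gz",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": "\t"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}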

Using partitionedBy property


As mentioned in the previous section, you can specify a dynamic folderPath and filename for time series data
with the partitionedBy property, Data Factory functions, and the system variables.
For more information on time series datasets, scheduling, and slices, see Creating Datasets and Scheduling &
Execution articles.
Sample 1

"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],

In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the
specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for
each slice. For example: wikidatagateway/wikisampledataout/2014100103 or
wikidatagateway/wikisampledataout/2014100104.
Sample 2
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],

In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by
folderPath and fileName properties.
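As an illustration, for a slice that starts at 2014-10-01 03:00, the two properties would evaluate roughly as follows (shown here only to make the substitution concrete):

"folderPath": "wikidatagateway/wikisampledataout/2014/10/01",
"fileName": "03.csv"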

Copy activity properties


For a full list of sections & properties available for defining activities, see the Creating Pipelines article.
Properties such as name, description, input and output datasets, and policies are available for all types of
activities. Properties available in the typeProperties section of the activity, however, vary with each activity
type. For Copy activity, they vary depending on the types of sources and sinks. If you are moving data from
an Azure Blob Storage, you set the source type in the copy activity to BlobSource. Similarly, if you are
moving data to an Azure Blob Storage, you set the sink type in the copy activity to BlobSink. This section
provides a list of properties supported by BlobSource and BlobSink.
BlobSource supports the following properties in the typeProperties section:

recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Allowed values: True (default), False. Required: No.

BlobSink supports the following properties in the typeProperties section:

copyBehavior: Defines the copy behavior when the source is BlobSource or FileSystem. Allowed values:
PreserveHierarchy: preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
FlattenHierarchy: all files from the source folder are placed in the first level of the target folder. The target files have auto-generated names.
MergeFiles: merges all files from the source folder into one file. If the file/blob name is specified, the merged file name is the specified name; otherwise, it is an auto-generated file name.
Required: No.
BlobSource also supports these two properties for backward compatibility.
treatEmptyAsNull: Specifies whether to treat a null or empty string as a null value.
skipHeaderLineCount: Specifies how many lines need to be skipped. It is applicable only when the input
dataset uses TextFormat.
Similarly, BlobSink supports the following property for backward compatibility.
blobWriterAddHeader: Specifies whether to add a header of column definitions while writing to an
output dataset.
Datasets now support the following properties that implement the same functionality: treatEmptyAsNull,
skipLineCount, firstRowAsHeader.
The following list provides guidance on using the new dataset properties in place of these blob source/sink properties:

skipHeaderLineCount on BlobSource: use skipLineCount and firstRowAsHeader on the dataset. Lines are skipped first, and then the first row is read as a header.
treatEmptyAsNull on BlobSource: use treatEmptyAsNull on the input dataset.
blobWriterAddHeader on BlobSink: use firstRowAsHeader on the output dataset.

See Specifying TextFormat section for detailed information on these properties.
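For example, a TextFormat section on a dataset that replaces the legacy blob source/sink properties might look like the following sketch (the delimiter and line-count values are assumptions for illustration):

"format": {
    "type": "TextFormat",
    "columnDelimiter": ",",
    "skipLineCount": 1,
    "firstRowAsHeader": true,
    "treatEmptyAsNull": true
}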


recursive and copyBehavior examples
This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. In each case, assume the source folder Folder1 has the following structure:

Folder1
    File1
    File2
    Subfolder1
        File3
        File4
        File5

recursive = true, copyBehavior = preserveHierarchy: The target folder Folder1 is created with the same structure as the source: File1 and File2 at the top level, and Subfolder1 containing File3, File4, and File5.

recursive = true, copyBehavior = flattenHierarchy: The target folder Folder1 is created with all five files (File1 through File5) in its first level, each with an auto-generated name.

recursive = true, copyBehavior = mergeFiles: The target folder Folder1 is created with one file whose contents are File1 + File2 + File3 + File4 + File5 merged together, with an auto-generated file name.

recursive = false, copyBehavior = preserveHierarchy: The target folder Folder1 is created with File1 and File2 only. Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = flattenHierarchy: The target folder Folder1 is created with auto-generated names for File1 and File2 only. Subfolder1 with File3, File4, and File5 is not picked up.

recursive = false, copyBehavior = mergeFiles: The target folder Folder1 is created with one file (with an auto-generated name) that merges the contents of File1 and File2. Subfolder1 with File3, File4, and File5 is not picked up.
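For example, a copy activity that recursively reads a folder tree and flattens it into a single target folder might combine the two settings as in this sketch of the source and sink sections only:

"source": {
    "type": "BlobSource",
    "recursive": true
},
"sink": {
    "type": "BlobSink",
    "copyBehavior": "FlattenHierarchy"
}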

Walkthrough: Use Copy Wizard to copy data to/from Blob Storage


Let's look at how to quickly copy data to/from an Azure blob storage. In this walkthrough, both the source and
destination data stores are of type Azure Blob Storage. The pipeline in this walkthrough copies data from a
folder to another folder in the same blob container. This walkthrough is intentionally simple to show you the
settings and properties involved when using Blob Storage as a source or sink.
Prerequisites
1. Create a general-purpose Azure storage account if you don't have one already. You use the blob
storage as both the source and destination data store in this walkthrough. If you don't have an Azure
storage account, see the Create a storage account article for steps to create one.
2. Create a blob container named adfblobconnector in the storage account.
3. Create a folder named input in the adfblobconnector container.
4. Create a file named emp.txt with the following content and upload it to the input folder by using tools
such as Azure Storage Explorer:

John, Doe
Jane, Doe

Create the data factory
5. Sign in to the Azure portal.
6. Click + NEW from the top-left corner, click Intelligence + analytics, and click Data Factory.
7. In the New data factory blade:
a. Enter ADFBlobConnectorDF for the name. The name of the Azure data factory must be globally
unique. If you receive the error "Data factory name ADFBlobConnectorDF is not available", change
the name of the data factory (for example, yournameADFBlobConnectorDF) and try creating it again.
See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
b. Select your Azure subscription.
c. For Resource Group, select Use existing to select an existing resource group (or) select Create
new to enter a name for a resource group.
d. Select a location for the data factory.
e. Select Pin to dashboard check box at the bottom of the blade.
f. Click Create.
8. After the creation is complete, you see the Data Factory blade as shown in the following image:
Copy Wizard
1. On the Data Factory home page, click the Copy data [PREVIEW] tile to launch Copy Data Wizard in
a separate tab.

NOTE
If you see that the web browser is stuck at "Authorizing...", disable/uncheck Block third-party cookies and
site data setting (or) keep it enabled and create an exception for login.microsoftonline.com and then try
launching the wizard again.

2. In the Properties page:


a. Enter CopyPipeline for Task name. The task name is the name of the pipeline in your data
factory.
b. Enter a description for the task (optional).
c. For Task cadence or Task schedule, keep the Run regularly on schedule option. If you want to
run this task only once instead of run repeatedly on a schedule, select Run once now. If you select,
Run once now option, a one-time pipeline is created.
d. Keep the settings for Recurring pattern. This task runs daily between the start and end times you
specify in the next step.
e. Change the Start date time to 04/21/2017.
f. Change the End date time to 04/25/2017. You may want to type the date instead of browsing
through the calendar.
g. Click Next.
3. On the Source data store page, click Azure Blob Storage tile. You use this page to specify the source
data store for the copy task. You can use an existing data store linked service (or) specify a new data store.
To use an existing linked service, you would select FROM EXISTING LINKED SERVICES and select the
right linked service.
4. On the Specify the Azure Blob storage account page:
a. Keep the auto-generated name for Connection name. The connection name is the name of the
linked service of type: Azure Storage.
b. Confirm that From Azure subscriptions option is selected for Account selection method.
c. Select your Azure subscription or keep Select all for Azure subscription.
d. Select an Azure storage account from the list of Azure storage accounts available in the selected
subscription. You can also choose to enter storage account settings manually by selecting Enter
manually option for the Account selection method.
e. Click Next.
5. On Choose the input file or folder page:
a. Double-click adfblobconnector.
b. Select input, and click Choose. In this walkthrough, you select the input folder. You could also
select the emp.txt file in the folder instead.

6. On the Choose the input file or folder page:


a. Confirm that the file or folder is set to adfblobconnector/input. If the files are in sub folders,
for example, 2017/04/01, 2017/04/02, and so on, enter
adfblobconnector/input/{year}/{month}/{day} for file or folder. When you press TAB out of the text
box, you see three drop-down lists to select formats for year (yyyy), month (MM), and day (dd).
b. Do not select Copy file recursively. Select this option only when you want to recursively traverse through folders for files
to be copied to the destination.
c. Do not select the Binary copy option. Select this option to perform a binary copy of the source file to the
destination. Do not select it for this walkthrough so that you can see more options in the next pages.
d. Confirm that the Compression type is set to None. Select a value for this option if your source
files are compressed in one of the supported formats.
e. Click Next.

7. On the File format settings page, you see the delimiters and the schema that is auto-detected by the
wizard by parsing the file.
a. Confirm the following options:
   The file format is set to Text format. You can see all the supported formats in the drop-down list, for example: JSON, Avro, ORC, Parquet.
   The column delimiter is set to Comma (,). You can see the other column delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
   The row delimiter is set to Carriage Return + Line feed (\r\n). You can see the other row delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
   The skip line count is set to 0. If you want a few lines to be skipped at the top of the file, enter the number here.
   The first data row contains column names option is not set. If the source files contain column names in the first row, select this option.
   The treat empty column value as null option is set.
b. Expand Advanced settings to see the advanced options available.
c. At the bottom of the page, see the preview of data from the emp.txt file.
d. Click SCHEMA tab at the bottom to see the schema that the copy wizard inferred by looking at the
data in the source file.
e. Click Next after you review the delimiters and preview data.
8. On the Destination data store page, select Azure Blob Storage, and click Next. You are using the
Azure Blob Storage as both the source and destination data stores in this walkthrough.

9. On Specify the Azure Blob storage account page:


a. Enter AzureStorageLinkedService for the Connection name field.
b. Confirm that From Azure subscriptions option is selected for Account selection method.
c. Select your Azure subscription.
d. Select your Azure storage account.
e. Click Next.
10. On the Choose the output file or folder page:
a. Specify the folder path as adfblobconnector/output/{year}/{month}/{day}, and then press TAB.
b. For the year, select yyyy.
c. For the month, confirm that it is set to MM.
d. For the day, confirm that it is set to dd.
e. Confirm that the compression type is set to None.
f. Confirm that the copy behavior is set to Merge files. If the output file with the same name
already exists, the new content is added to the same file at the end.
g. Click Next.

11. On the File format settings page, review the settings, and click Next. One of the additional options here
is to add a header to the output file. If you select that option, a header row is added with names of the
columns from the schema of the source. You can rename the default column names when viewing the
schema for the source. For example, you could change the first column to First Name and second column
to Last Name. Then, the output file is generated with a header with these names as column names.
12. On the Performance settings page, confirm that cloud units and parallel copies are set to Auto, and
click Next. For details about these settings, see Copy activity performance and tuning guide.

13. On the Summary page, review all settings (task properties, settings for source and destination, and copy
settings), and click Next.
14. Review information in the Summary page, and click Finish. The wizard creates two linked services, two
datasets (input and output), and one pipeline in the data factory (from where you launched the Copy
Wizard).

Monitor the pipeline (copy task)


1. Click the link Click here to monitor copy pipeline on the Deployment page.
2. You should see the Monitor and Manage application in a separate tab.
3. Change the start time at the top to 04/19/2017 and end time to 04/27/2017 , and then click Apply.
4. You should see five activity windows in the ACTIVITY WINDOWS list. The WindowStart times should
cover all days from pipeline start to pipeline end times.
5. Click Refresh button for the ACTIVITY WINDOWS list a few times until you see the status of all the
activity windows is set to Ready.
6. Now, verify that the output files are generated in the output folder of adfblobconnector container. You
should see the following folder structure in the output folder:
2017/04/21, 2017/04/22, 2017/04/23, 2017/04/24, and 2017/04/25.
For detailed information about monitoring and managing data factories, see the Monitor and manage Data Factory pipeline article.
Data Factory entities
Now, switch back to the tab with the Data Factory home page. Notice that there are two linked services, two
datasets, and one pipeline in your data factory now.

Click Author and deploy to launch Data Factory Editor.


You should see the following Data Factory entities in your data factory:
Two linked services. One for the source and the other one for the destination. Both the linked services
refer to the same Azure Storage account in this walkthrough.
Two datasets. An input dataset and an output dataset. In this walkthrough, both use the same blob
container but refer to different folders (input and output).
A pipeline. The pipeline contains a copy activity that uses a blob source and a blob sink to copy data from
an Azure blob location to another Azure blob location.
The following sections provide more information about these entities.
Linked services
You should see two linked services, one for the source and the other for the destination. In this
walkthrough, both definitions look the same except for the names. The type of the linked service is set to
AzureStorage. The most important property of the linked service definition is the connectionString, which is
used by Data Factory to connect to your Azure Storage account at runtime. Ignore the hubName property in
the definition.
Source blob storage linked service

{
"name": "Source-BlobStorage-z4y",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
}
}
}

Destination blob storage linked service
{
"name": "Destination-BlobStorage-z4y",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString":
"DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
}
}
}

For more information about Azure Storage linked service, see Linked service properties section.
Datasets
There are two datasets: an input dataset and an output dataset. The type of the dataset is set to AzureBlob
for both.
The input dataset points to the input folder of the adfblobconnector blob container. The external property
is set to true for this dataset as the data is not produced by the pipeline with the copy activity that takes this
dataset as an input.
The output dataset points to the output folder of the same blob container. The output dataset also uses the
year, month, and day of the SliceStart system variable to dynamically evaluate the path for the output file.
For a list of functions and system variables supported by Data Factory, see Data Factory functions and
system variables. The external property is set to false (default value) because this dataset is produced by
the pipeline.
For more information about properties supported by Azure Blob dataset, see Dataset properties section.
Input dataset

{
"name": "InputDataset-z4y",
"properties": {
"structure": [
{ "name": "Prop_0", "type": "String" },
{ "name": "Prop_1", "type": "String" }
],
"type": "AzureBlob",
"linkedServiceName": "Source-BlobStorage-z4y",
"typeProperties": {
"folderPath": "adfblobconnector/input/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}

Output dataset
{
"name": "OutputDataset-z4y",
"properties": {
"structure": [
{ "name": "Prop_0", "type": "String" },
{ "name": "Prop_1", "type": "String" }
],
"type": "AzureBlob",
"linkedServiceName": "Destination-BlobStorage-z4y",
"typeProperties": {
"folderPath": "adfblobconnector/output/{year}/{month}/{day}",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
},
"partitionedBy": [
{ "name": "year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy"
} },
{ "name": "month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" }
},
{ "name": "day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
]
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}

Pipeline
The pipeline has just one activity. The type of the activity is set to Copy. In the type properties for the activity,
there are two sections, one for source and the other one for sink. The source type is set to BlobSource as the
activity is copying data from a blob storage. The sink type is set to BlobSink as the activity copying data to a
blob storage. The copy activity takes InputDataset-z4y as the input and OutputDataset-z4y as the output.
For more information about properties supported by BlobSource and BlobSink, see Copy activity properties
section.
{
"name": "CopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"copyBehavior": "MergeFiles",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset-z4y"
}
],
"outputs": [
{
"name": "OutputDataset-z4y"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-Blob path_ adfblobconnector_input_->OutputDataset-z4y"
}
],
"start": "2017-04-21T22:34:00Z",
"end": "2017-04-25T05:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}

JSON examples for copying data to and from Blob Storage


The following examples provide sample JSON definitions that you can use to create a pipeline by using
Azure portal or Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Blob
Storage and Azure SQL Database. However, data can be copied directly from any of the supported sources to any of the
supported sinks by using the Copy Activity in Azure Data Factory.
JSON Example: Copy data from Blob Storage to SQL Database
The following sample shows:
1. A linked service of type AzureSqlDatabase.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureBlob.
4. An output dataset of type AzureSqlTable.
5. A pipeline with a Copy activity that uses BlobSource and SqlSink.
The sample copies time-series data from an Azure blob to an Azure SQL table hourly. The JSON properties
used in these samples are described in sections following the samples.
Azure SQL linked service:

{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}

Azure Storage linked service:

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and
AzureStorageSas. For the former, you specify the connection string that includes the account key; for
the latter, you specify the Shared Access Signature (SAS) URI. See the Linked Services section for details.
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for
the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder
path uses year, month, and day part of the start time and file name uses the hour part of the start time.
external: true setting informs Data Factory that the table is external to the data factory and is not
produced by an activity in the data factory.
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
"fileName": "{Hour}.csv",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}

Azure SQL output dataset:


The sample copies data to a table named MyTable in an Azure SQL database. Create the table in your Azure
SQL database with the same number of columns as you expect the Blob CSV file to contain. New rows are
added to the table every hour.

{
"name": "AzureSqlOutput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

A copy activity in a pipeline with Blob source and SQL sink:


The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set
to SqlSink.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoSQL",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}

JSON Example: Copy data from Azure SQL to Azure Blob


The following sample shows:
1. A linked service of type AzureSqlDatabase.
2. A linked service of type AzureStorage.
3. An input dataset of type AzureSqlTable.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy activity that uses SqlSource and BlobSink.
The sample copies time-series data from an Azure SQL table to an Azure blob hourly. The JSON properties
used in these samples are described in sections following the samples.
Azure SQL linked service:
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}

Azure Storage linked service:

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and
AzureStorageSas. For the former, you specify the connection string that includes the account key; for
the latter, you specify the Shared Access Signature (SAS) URI. See the Linked Services section for details.
Azure SQL input dataset:
The sample assumes you have created a table MyTable in Azure SQL and it contains a column called
timestampcolumn for time series data.
Setting external: true informs Data Factory service that the table is external to the data factory and is not
produced by an activity in the data factory.

{
"name": "AzureSqlInput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}

Azure Blob output dataset:


Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is
dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year,
month, day, and hours parts of the start time.

{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
"partitionedBy": [
{
"name": "Year",
"value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

A copy activity in a pipeline with SQL source and Blob sink:


The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled
to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to
BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to
copy.
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureSQLtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureSQLInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >=
\\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}

NOTE
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data
Factory.

Performance and Tuning


See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data
movement (Copy Activity) in Azure Data Factory and various ways to optimize it.
Move data to and from Azure Cosmos DB using
Azure Data Factory
6/27/2017 11 min to read Edit Online

This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Cosmos
DB (DocumentDB API). It builds on the Data Movement Activities article, which presents a general overview of
data movement with the copy activity.
You can copy data from any supported source data store to Azure Cosmos DB or from Azure Cosmos DB to
any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see
the Supported data stores table.

IMPORTANT
The Azure Cosmos DB connector supports only the DocumentDB API.

To copy data as-is to/from JSON files or another Cosmos DB collection, see Import/Export JSON documents.

Getting started
You can create a pipeline with a copy activity that moves data to/from Azure Cosmos DB by using different
tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy
Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell,
Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step
instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a
source data store to a sink data store:
1. Create linked services to link input and output data stores to your data factory.
2. Create datasets to represent input and output data for the copy operation.
3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the
pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data
Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are
used to copy data to/from Cosmos DB, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities
specific to Cosmos DB:

Linked service properties


The following properties are specific to the Azure Cosmos DB linked service.

type: The type property must be set to: DocumentDb. Required: Yes.

connectionString: Specify the information needed to connect to the Azure Cosmos DB database. Required: Yes.

Example:

{
"name": "CosmosDbLinkedService",
"properties": {
"type": "DocumentDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
}
}

Dataset properties
For a full list of sections & properties available for defining datasets please refer to the Creating datasets
article. Sections like structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure
SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of
the data in the data store. The typeProperties section for the dataset of type DocumentDbCollection has the
following properties.

collectionName: Name of the Cosmos DB document collection. Required: Yes.

Example:

{
"name": "PersonCosmosDbTable",
"properties": {
"type": "DocumentDbCollection",
"linkedServiceName": "CosmosDbLinkedService",
"typeProperties": {
"collectionName": "Person"
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
}
}
}

Schema by Data Factory


For schema-free data stores such as Azure Cosmos DB, the Data Factory service infers the schema in one of
the following ways:
1. If you specify the structure of data by using the structure property in the dataset definition, the Data
Factory service honors this structure as the schema. In this case, if a row does not contain a value for a
column, a null value will be provided for it.
2. If you do not specify the structure of data by using the structure property in the dataset definition, the Data
Factory service infers the schema by using the first row in the data. In this case, if the first row does not
contain the full schema, some columns will be missing in the result of the copy operation.
Therefore, for schema-free data sources, the best practice is to specify the structure of data using the structure
property.
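For example, a DocumentDbCollection dataset that pins down the expected schema with a structure section might look like the following sketch (the column names are illustrative and borrowed from the Person examples later in this article):

{
    "name": "PersonCosmosDbTable",
    "properties": {
        "structure": [
            { "name": "PersonId", "type": "Int32" },
            { "name": "FirstName", "type": "String" },
            { "name": "LastName", "type": "String" }
        ],
        "type": "DocumentDbCollection",
        "linkedServiceName": "CosmosDbLinkedService",
        "typeProperties": {
            "collectionName": "Person"
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}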

Copy activity properties


For a full list of sections & properties available for defining activities please refer to the Creating Pipelines
article. Properties such as name, description, input and output tables, and policy are available for all types of
activities.

NOTE
The Copy Activity takes only one input and produces only one output.

Properties available in the typeProperties section of the activity on the other hand vary with each activity type
and in case of Copy activity they vary depending on the types of sources and sinks.
In case of Copy activity when source is of type DocumentDbCollectionSource the following properties are
available in typeProperties section:

query: Specifies the query to read data. Allowed values: a query string supported by Azure Cosmos DB. Example: SELECT c.BusinessEntityID, c.PersonType, c.NameStyle, c.Title, c.Name.First AS FirstName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\". Required: No. If not specified, the SQL statement that is executed is: select <columns defined in structure> from mycollection.

nestingSeparator: Special character to indicate that the document is nested. Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables you to denote the hierarchy via nestingSeparator, which is . (dot) in the examples above. With the separator, the copy activity generates the Name object with three child elements First, Middle, and Last, according to Name.First, Name.Middle, and Name.Last in the table definition. Allowed values: any character. Required: No.
DocumentDbCollectionSink supports the following properties:

nestingSeparator: A special character in the source column name that indicates a nested document is needed. For example, Name.First in the output table produces the following JSON structure in the Cosmos DB document:

"Name": {
    "First": "John"
},

Allowed values: a character that is used to separate nesting levels; the default value is . (dot). Required: No.

writeBatchSize: Number of parallel requests to the Azure Cosmos DB service to create documents. You can fine-tune performance when copying data to/from Cosmos DB by using this property. You can expect better performance when you increase writeBatchSize, because more parallel requests are sent to Cosmos DB. However, you'll need to avoid throttling, which can throw the error message "Request rate is large". Throttling is determined by a number of factors, including the size of documents, the number of terms in documents, the indexing policy of the target collection, and so on. For copy operations, you can use a better collection (for example, S3) to have the most throughput available (2,500 request units/second). Allowed values: integer. Required: No (default: 5).

writeBatchTimeout: Wait time for the operation to complete before it times out. Allowed values: timespan. Example: 00:30:00 (30 minutes). Required: No.
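For example, a sink section that writes nested Name documents and limits parallel requests might look like this sketch (the batch size and timeout values are illustrative assumptions):

"sink": {
    "type": "DocumentDbCollectionSink",
    "nestingSeparator": ".",
    "writeBatchSize": 2,
    "writeBatchTimeout": "00:30:00"
}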

Import/Export JSON documents


Using this Cosmos DB connector, you can easily:
Import JSON documents from various sources into Cosmos DB, including Azure Blob storage, Azure Data Lake, an on-premises file system, or other file-based stores supported by Azure Data Factory.
Export JSON documents from a Cosmos DB collection into various file-based stores.
Migrate data between two Cosmos DB collections as-is.
To achieve such a schema-agnostic copy:
When using the Copy Wizard, check the "Export as-is to JSON files or Cosmos DB collection" option.
When using JSON editing, do not specify the "structure" section in the Cosmos DB dataset(s) or the "nestingSeparator" property on the Cosmos DB source/sink in the copy activity. To import from or export to JSON files, in the file store dataset specify the format type as "JsonFormat", configure "filePattern", and skip the rest of the format settings; see the JSON format section for details.
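For example, the file store dataset used in such an as-is import or export might specify only JsonFormat and a filePattern, as in the following sketch of the typeProperties section (the folder path is an illustrative placeholder; setOfObjects is one of the supported filePattern values):

"typeProperties": {
    "folderPath": "mycontainer/jsondocs/",
    "format": {
        "type": "JsonFormat",
        "filePattern": "setOfObjects"
    }
}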

JSON examples
The following examples provide sample JSON definitions that you can use to create a pipeline by using Azure
portal or Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Cosmos DB and
Azure Blob Storage. However, data can be copied directly from any of the sources to any of the sinks stated
here using the Copy Activity in Azure Data Factory.

Example: Copy data from Azure Cosmos DB to Azure Blob


The sample below shows:
1. A linked service of type DocumentDb.
2. A linked service of type AzureStorage.
3. An input dataset of type DocumentDbCollection.
4. An output dataset of type AzureBlob.
5. A pipeline with Copy Activity that uses DocumentDbCollectionSource and BlobSink.
The sample copies data in Azure Cosmos DB to Azure Blob. The JSON properties used in these samples are
described in sections following the samples.
Azure Cosmos DB linked service:

{
"name": "CosmosDbLinkedService",
"properties": {
"type": "DocumentDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
}
}
}

Azure Blob storage linked service:

{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
}
}
}

Azure Document DB input dataset:


The sample assumes you have a collection named Person in an Azure Cosmos DB database.
Setting external: true and specifying the externalData policy informs the Azure Data Factory service that the
table is external to the data factory and is not produced by an activity in the data factory.
{
"name": "PersonCosmosDbTable",
"properties": {
"type": "DocumentDbCollection",
"linkedServiceName": "CosmosDbLinkedService",
"typeProperties": {
"collectionName": "Person"
},
"external": true,
"availability": {
"frequency": "Day",
"interval": 1
}
}
}

Azure Blob output dataset:


Data is copied to a new blob every hour with the path for the blob reflecting the specific datetime with hour
granularity.

{
"name": "PersonBlobTableOut",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "docdb",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"nullValue": "NULL"
}
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}

Sample JSON document in the Person collection in a Cosmos DB database:

{
"PersonId": 2,
"Name": {
"First": "Jane",
"Middle": "",
"Last": "Doe"
}
}

Cosmos DB supports querying documents using a SQL like syntax over hierarchical JSON documents.
Example:

SELECT Person.PersonId, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last


AS LastName FROM Person

The following pipeline copies data from the Person collection in the Azure Cosmos DB database to an Azure
blob. As part of the copy activity the input and output datasets have been specified.

{
"name": "DocDbToBlobPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName,
Person.Name.Last AS LastName FROM Person",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 1000,
"writeBatchTimeout": "00:00:59"
}
},
"inputs": [
{
"name": "PersonCosmosDbTable"
}
],
"outputs": [
{
"name": "PersonBlobTableOut"
}
],
"policy": {
"concurrency": 1
},
"name": "CopyFromDocDbToBlob"
}
],
"start": "2015-04-01T00:00:00Z",
"end": "2015-04-02T00:00:00Z"
}
}

Example: Copy data from Azure Blob to Azure Cosmos DB


The sample