
Azure SQL Data Warehouse

Concepts & Tips

Catalin Esanu
Data Solution Architect
cesanu@Microsoft.com
Azure SQL Data Warehouse

Elastic scale & performance
• Scales to petabytes of data with MPP processing
• Resize compute nodes in under 1 minute
• Faster time to insight than SMP offerings
• Designed for on-demand workloads

End-to-end platform built for the cloud
• Integrated with the Azure platform and other Microsoft services
• Enables hybrid solutions
• Built on SQL Server experience & technology
• Infrastructure, management, and support provided – with SLA

Market-leading price/performance
• Simple compute & storage billing
• Pay for what you need
• High performance without rewriting applications
• Low cost for latent data
Quick glance – Azure SQL Data Warehouse

• Delivered as a cloud PaaS service
• Massively parallel processing (MPP) to scale to petabytes
• In-memory columnstore for up to 100x speed improvement
• PolyBase enables joining relational and Hadoop data
• The only MPP data warehouse available in Azure as PaaS

What is MPP?
MPP stands for “Massively Parallel Processing”

• A divide-and-conquer strategy
• Take one big problem, break it into smaller tasks, and execute them individually
• Team approach: “Many hands make light work”

Requires
• A method for scheduling tasks
• A communication plan to maximize efficiency
• A distribution method for exchanging the “goods” (intermediate results)
SCALE UP vs SCALE OUT

UP – SMP (Symmetric Multi-Processing) | OUT – MPP
Diminishing returns                   | Linear scale
Non-linear costs at scale             | Incremental cost
Parallel execution hard               | Parallel execution by default
Low-mid complexity                    | Complex queries
High concurrency                      | Medium concurrency
Shared everything                     | Shared nothing
What will we talk about?
• Supported/Unsupported features (high level)
• High level architecture
• Data distribution
• Concurrency
• Resource classes
• PolyBase
• Best Practices
• Getting started - demo
Supported Features
- Partitions
- Stored Procedures
- Functions
- Indexes (Clustered Columnstore Indexes, Clustered/Nonclustered with secondary indexes)
- DMVs
- PowerShell cmdlets
- Temporary tables
- Variables
- Loops
- Schemas
- Views
- Drivers: ODBC, JDBC, PHP, ADO.NET
- Loading with PolyBase and bcp
- Transparent Data Encryption
- Snapshots and Geo-Backup
- Dynamic SQL
- Pivot/Unpivot
- Analytical Functions
Unsupported Features
- Cursors
- MERGE statement
- Cross-database queries
- Table-valued parameters
- Table variables
- CLR integration
- CTEs (only partial support)
- ANSI joins in UPDATE/DELETE statements
- Full list here with possible workarounds
Azure SQL Data Warehouse Architecture

• Storage and compute are decoupled, enabling a true elastic service and separate charging for both compute and storage.
• Application or user connections arrive at the Control Node; data loading via SSIS, REST, OLE DB, ADO.NET, ODBC, WebHDFS, AzCopy, PowerShell.
• The Massively Parallel Processing (MPP) engine on the Control Node coordinates the Compute Nodes, each backed by a SQL database.
• DMS (Data Movement Service) executes across all database nodes.
• Compute scales from 100 DWU to 2000 DWU, up or down when required (SLA <= 60 seconds). Pause, Resume, Stop, Start.
• Storage is Blob storage [WASB(S)]; add/load data to WASB(S) without incurring compute costs.
• Runs on Azure infrastructure, alongside HDInsight.

Azure SQL Data Warehouse – Control Node

The Control Node hosts the Massively Parallel Processing (MPP) engine and fronts the Compute Nodes.

• Endpoint for connections
• Regular SQL endpoint (TCP 1433)
• Persists no user data (metadata only)
• Coordinates compute activity using the MPP engine

Azure SQL Data Warehouse – Compute Nodes

Each Compute Node is an Azure SQL Database that stores and processes its share of the data.

• An increase of DWU will increase the number of compute nodes

Azure SQL Data Warehouse – Blob storage

• RA-GRS (read-access geo-redundant) storage
• Petabytes of storage
• Ingest data without incurring compute costs

Data Distribution Concept

CREATE TABLE myTable (<column definitions>)
WITH ( DISTRIBUTION = HASH (id) );

A database is divided into 60 distributions (D1 … D60), persisted in Azure Storage Blobs. Distributing a table spreads its rows across these distributions.

Why Distribute Data?

• Divide & conquer: one big query becomes lots of small queries to solve in parallel
• Evenly spreading the data leads to even use of the system’s resources

Consequences of Data Distribution
Good for scalability
• Data is spread (distributed) across servers
Introduces overheads
• Data is not all in the same place!
• Don’t know where specific values are located
• May need to move the data, then process it
MPP SQL Table Geometries

Distributed: a table structure that is spread across all MPP nodes of the data warehouse database
• HASH: the value of a single column is hashed to determine the distribution where each record will be inserted
• ROUND_ROBIN: records are distributed in a “round robin” manner across all distributions

Replicated: a table structure that exists as a full copy within each MPP node (not supported in Azure SQL Data Warehouse)
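
For illustration, a minimal sketch of the two supported geometries, using hypothetical FactSales and StageSales tables (names and columns are assumptions, not from the deck):

-- Hash-distributed: rows are co-located by customer_id; a good fit for large fact tables
CREATE TABLE FactSales
(
    sale_id     BIGINT NOT NULL,
    customer_id INT    NOT NULL,
    amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH (customer_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Round-robin: rows are spread evenly with no distribution key; a good fit for staging
CREATE TABLE StageSales
(
    sale_id     BIGINT NOT NULL,
    customer_id INT    NOT NULL,
    amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN
);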
Data Movement

Data movement will NOT be invoked when:
• Two distribution-compatible tables are joined
• A table is joined to a replicated table
• An aggregation is distribution compatible

Data movement DOES occur when:
• Two distribution-incompatible tables are joined
• Round-robin tables are joined (they are distribution incompatible with all tables except replicated tables)
• An aggregation is distribution incompatible
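
As a sketch of the difference, assume the hypothetical FactSales table above plus a FactReturns table hash-distributed on the same customer_id column (both tables are illustrative assumptions):

-- No data movement: both tables are hash-distributed on customer_id,
-- and the join is on the distribution column, so each node joins locally
SELECT s.customer_id, SUM(s.amount) AS sales, SUM(r.amount) AS returns
FROM FactSales s
JOIN FactReturns r ON s.customer_id = r.customer_id
GROUP BY s.customer_id;

-- Data movement (shuffle): joining on a non-distribution column forces
-- rows to be redistributed across nodes before the join can run locally
SELECT *
FROM FactSales s
JOIN FactReturns r ON s.sale_id = r.sale_id;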
Concurrency limits

• Up to 1,024 concurrent connections. All can submit queries concurrently, but queueing may occur.
• Concurrency limits are governed by two concepts: concurrent queries and concurrency slots.
• Concurrent queries are the queries executing at the same time: up to 32.
• Concurrency slots are allocated based on DWU: each 100 DWU provides 4 concurrency slots. For example, DW100 allocates 4 concurrency slots and DW1000 allocates 40. Each query consumes one or more concurrency slots, depending on its resource class.
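
To watch this in practice, a minimal sketch using the sys.dm_pdw_exec_requests DMV (the status filter values are an assumption and may vary by version):

-- Inspect active and waiting requests; queued requests appear here
-- until a concurrency slot frees up
SELECT request_id, status, submit_time, start_time, command
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY submit_time;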
Resource classes

• Allow control over the memory allocation and CPU cycles given to a query.
• Are assigned in the form of database roles.
• The four resource classes are smallrc, mediumrc, largerc, and xlargerc.
• smallrc is the default (and cannot be changed).
• Use sp_addrolemember/sp_droprolemember to manage resource class assignments:
  EXEC sp_addrolemember 'largerc', 'loaduser';
• Some queries are exempt from resource class assignment (and from the concurrency limit).
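
A minimal sketch of the full assignment lifecycle, assuming a hypothetical loaduser account:

-- Promote a user to largerc for a heavy load, then drop back to the default
EXEC sp_addrolemember 'largerc', 'loaduser';
-- ... run the load ...
EXEC sp_droprolemember 'largerc', 'loaduser';

-- Check current resource class role memberships
SELECT r.name AS resource_class, m.name AS member
FROM sys.database_role_members rm
JOIN sys.database_principals r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE r.name IN ('mediumrc', 'largerc', 'xlargerc');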
Resource Groups Limits Breakdown

DWU    | Max concurrent queries | Concurrency slots | Slots used by smallrc / mediumrc / largerc / xlargerc | Memory (GB) allocated to smallrc / mediumrc / largerc / xlargerc
DW100  | 4  | 4   | 1 / 1 / 2 / 4     | 6 / 6 / 12 / 23
DW200  | 8  | 8   | 1 / 2 / 4 / 8     | 6 / 12 / 23 / 47
DW300  | 12 | 12  | 1 / 2 / 4 / 8     | 6 / 12 / 23 / 47
DW400  | 16 | 16  | 1 / 4 / 8 / 16    | 6 / 23 / 47 / 94
DW500  | 20 | 20  | 1 / 4 / 8 / 16    | 6 / 23 / 47 / 94
DW600  | 24 | 24  | 1 / 4 / 8 / 16    | 6 / 23 / 47 / 94
DW1000 | 32 | 40  | 1 / 8 / 16 / 32   | 6 / 47 / 94 / 188
DW1200 | 32 | 48  | 1 / 8 / 16 / 32   | 6 / 47 / 94 / 188
DW1500 | 32 | 60  | 1 / 8 / 16 / 32   | 6 / 47 / 94 / 188
DW2000 | 32 | 80  | 1 / 16 / 32 / 64  | 6 / 94 / 188 / 375
DW3000 | 32 | 120 | 1 / 16 / 32 / 64  | 6 / 94 / 188 / 375
DW6000 | 32 | 240 | 1 / 32 / 64 / 128 | 6 / 188 / 375 / 750


Service Capacity Limits – full list here

Category                   | Description                              | Maximum
Data Warehouse Units (DWU) | Max DWU for a single SQL Data Warehouse  | 6000
Database connection        | Concurrent open sessions                 | 1,024
Database connection        | Maximum memory for prepared statements   | 20 MB
Workload management        | Maximum concurrent queries               | 32
Tempdb                     | Max size of tempdb                       | 399 GB per DW100; at DW1000, tempdb is sized at 3.99 TB
Database                   | Max size                                 | 240 TB compressed on disk
Table                      | Max size                                 | 60 TB compressed on disk
Table                      | Tables per database                      | 2 billion
Table                      | Columns per table                        | 1,024 columns
Table                      | Bytes per column                         | Depends on the column data type: 8,000 bytes for char types, 4,000 for nvarchar, 2 GB for MAX types
Table                      | Bytes per row, defined size              | 8,060 bytes
Table                      | Partitions per table                     | 15,000
Index                      | Non-clustered indexes per table          | 999 (rowstore tables only)
Index                      | Clustered indexes per table              | 1
Query                      | Queued queries on user tables            | 1,000
Query                      | Concurrent queries on system views       | 100
Query                      | Queued queries on system views           | 1,000
Query                      | Maximum parameters                       | 2,098
Batch                      | Maximum size                             | 65,536 * 4,096 bytes
SELECT results             | Columns per row                          | 4,096
SELECT                     | Nested subqueries                        | 32


PolyBase and queries

PolyBase provides a scalable, T-SQL-compatible query processing framework for combining relational and non-relational data.
PolyBase View in SQL Server 2016

• Execute T-SQL queries against relational data in SQL Server and semi-structured data in Hadoop or Azure Blob Storage
• Leverage existing T-SQL skills and BI tools to gain insights from different data stores
PolyBase use cases

PolyBase Queries

Query an external data table as SQL data; data is returned as defined in the external table:

SELECT
  [vin],
  [speed],
  [datetimestamp]
FROM dbo.SensorData;

Join SQL data with external data; joins work between internal and external tables, all T-SQL commands are supported, and PolyBase optimizes between SQL-side execution and pushdown to MapReduce:

SELECT
  [make],
  [model],
  [modelYear],
  [speed],
  [datetimestamp]
FROM dbo.AutomobileData
LEFT JOIN dbo.SensorData
  ON dbo.AutomobileData.[vin] = dbo.SensorData.[vin];
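
The external tables referenced above must be defined first. A hedged sketch of that setup, with placeholder storage account, container, and object names (the exact options depend on whether the data lives in Hadoop or Azure Blob Storage, and private storage also needs a DATABASE SCOPED CREDENTIAL):

-- Hypothetical setup for dbo.SensorData over CSV files in Azure Blob Storage
CREATE EXTERNAL DATA SOURCE SensorStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://sensors@mystorageaccount.blob.core.windows.net'
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.SensorData
(
    [vin]           VARCHAR(17) NOT NULL,
    [speed]         INT,
    [datetimestamp] DATETIME2
)
WITH (
    LOCATION = '/readings/',
    DATA_SOURCE = SensorStore,
    FILE_FORMAT = CsvFormat
);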
PolyBase query example #1

-- select on external table (data in HDFS)
SELECT *
FROM Customer
WHERE c_nationkey = 3 AND c_acctbal < 0;

Possible execution plan:
1. CREATE temp table T (executed on the compute nodes)
2. IMPORT FROM HDFS: the Customer file is read from HDFS into T
3. EXECUTE QUERY: Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0
PolyBase query example #2

-- select and aggregate on external table (data in HDFS)
SELECT AVG(c_acctbal)
FROM Customer
WHERE c_acctbal < 0
GROUP BY c_nationkey;

What happens here?
Step 1: The query optimizer compiles the predicate into Java and generates a MapReduce (MR) job.
Step 2: The engine submits the MR job to the Hadoop cluster. The job applies the filter and computes the aggregate on Customer, leaving its output in hdfsTemp:
  <US, $-975.21>
  <UK, $-63.52>
  <FRA, $-119.13>
PolyBase query example #2 (continued)

Execution plan:
1. Run the MR job on Hadoop: apply the filter and compute the aggregate on Customer; output is left in hdfsTemp
2. CREATE temp table T on the DW compute nodes
3. IMPORT hdfsTemp: read hdfsTemp into T
4. RETURN OPERATION: Select * from T

Notes:
1. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job.
2. The query optimizer makes a cost-based decision on which operators to push.
Summary: PolyBase

Query relational and non-relational data with T-SQL: a single T-SQL query, issued from your apps, spans SQL Server and Hadoop, on-premises and in Azure.

Best Practices – read more
1. Reduce cost with pause and scale
2. Drain transactions before pausing or scaling
3. Maintain statistics
4. Group INSERT statements into batches
5. Use PolyBase to load and export data quickly – with CTAS, if possible (see the sketch after this list)
6. Load, then query, external tables
7. Hash-distribute large tables
8. Do not over-partition
9. Minimize transaction sizes
10. Use the smallest possible column size
11. Use temporary heap tables for transient data
12. Optimize clustered columnstore tables
13. Use a larger resource class to improve query performance, or a smaller resource class to increase concurrency
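
For practice #5, a minimal sketch of a CTAS load from the hypothetical dbo.SensorData external table defined earlier (table and statistics names are illustrative):

-- CREATE TABLE AS SELECT: materialize external data into a distributed,
-- columnstore-backed internal table in one parallel operation
CREATE TABLE dbo.SensorDataInternal
WITH (
    DISTRIBUTION = HASH ([vin]),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [vin], [speed], [datetimestamp]
FROM dbo.SensorData;

-- Statistics are not created automatically; create them after loading (practice #3)
CREATE STATISTICS st_vin ON dbo.SensorDataInternal ([vin]);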
Getting started with Azure SQL Data Warehouse

The getting-started steps:
• Connect & Query – connect to and query using tools such as sqlcmd and Visual Studio 2015
• Migrate – migrate data and schema from SQL Server and Azure SQL Database
• Load Data – ETL/ELT using Microsoft or third-party data loading tools
• Secure – connection security, authentication, authorization, encryption, and auditing
• Visualize – dynamic reporting and visualization using Power BI

Connect & Query
In order to connect to and query SQL Data Warehouse, be sure to:
• Install or update Microsoft Visual Studio 2015 and SQL Server Data Tools (SSDT):
  https://msdn.microsoft.com/en-us/library/mt204009.aspx
• and/or the latest version of SQL Server Management Studio (SSMS):
  https://msdn.microsoft.com/en-us/library/mt238290.aspx
• Connect to SQL Data Warehouse with the sqlcmd command prompt utility included with SQL Server
• Execute sample queries with SSDT/SSMS/sqlcmd to test your connections
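
As a sketch of the sqlcmd connection, with placeholder server, database, and login names (the -I flag enables quoted identifiers, which SQL Data Warehouse requires):

sqlcmd -S <server>.database.windows.net -d <database> -U <user> -P <password> -I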
Getting started with Azure SQL Data Warehouse

Migrate
The Data Warehouse Migration Utility
(https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-migrate-migration-utility/)
is a tool designed to migrate schema and data from SQL Server and Azure SQL Database to SQL Data Warehouse, with these steps:

• Download, install, and launch the Data Warehouse Migration Utility, then connect to the source and destination databases
• Click Migrate Schema to generate a schema migration script for the selected tables
• Click Migrate Data to generate scripts that move data first to flat files on your server, and then directly into SQL Data Warehouse

You may also use a tool such as DWUCalculator (http://dwucalculator.azurewebsites.net) for a rough estimation of the required DWU.
Getting started with Azure SQL Data Warehouse

Load Data
Perform ETL/ELT and load data into your SQL Data Warehouse using:
• Azure Data Factory
• AzCopy
• The bcp command-line utility
• PolyBase
• SQL Server Integration Services (SSIS)
• Third-party data loading tools
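
A hedged one-line sketch of a bcp import, with placeholder names (-c uses character format, -t sets the field terminator, -q enables quoted identifiers; the target table here is the hypothetical StageSales from earlier):

bcp dbo.StageSales in sales.csv -S <server>.database.windows.net -d <database> -U <user> -P <password> -q -c -t ','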
Getting started with Azure SQL Data Warehouse

Secure
Establishing security for your SQL Data Warehouse is critical to getting started:
• Connection security: restrict and secure connections to your database using firewall rules and connection encryption
• Authentication: establish SQL authentication with username and password
• Authorization: define role memberships and permissions for user accounts
• Encryption: use Transparent Data Encryption to encrypt data at rest, i.e., stored in database files and backups
• Auditing: set up SQL Data Warehouse auditing to record events in your database to an audit log in an Azure Storage account
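
A minimal sketch of the authentication and authorization steps, with a hypothetical login name and placeholder password:

-- In the master database: create a SQL authentication login
CREATE LOGIN report_reader WITH PASSWORD = '<strong password>';

-- In the data warehouse database: map the login to a user and authorize it
CREATE USER report_reader FOR LOGIN report_reader;
EXEC sp_addrolemember 'db_datareader', 'report_reader';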
Getting started with Azure SQL Data Warehouse

Visualize
Power BI integration allows users to leverage the compute power of SQL Data Warehouse with the dynamic reporting and visualization of Power BI, including:
• Direct connect: a more advanced connection with logical pushdown against SQL Data Warehouse, allowing for quicker analysis at larger scale
• Open in Power BI: the Open in Power BI button passes instance information to Power BI, allowing for a more seamless connection

Read about other BI solutions supporting Azure SQL DW here.
What’s next?
- Read more about Azure:
  https://azure.microsoft.com/en-us/solutions/?v=4
  https://azure.microsoft.com/en-us/get-started/
  https://azure.microsoft.com/en-us/documentation/samples/
- Go to https://azure.microsoft.com/en-us/documentation/services/sql-data-warehouse for more details about the product.
- Vote for your favorite upcoming features on the User Voice page:
  https://feedback.azure.com/forums/307516-sql-data-warehouse
- Contact me at cesanu@Microsoft.com – I’d love to hear your thoughts and impressions about the product.
