
QUERY OPTIMIZATION IN MICROSOFT SQL SERVER PDW

Naveen Baskaran

MICROSOFT CORPORATION
PAPER EXPLAINS
Why do we need massively parallel processing (MPP)?
What is PDW?
Why is query optimization (QO) complex in PDW?
What changes were made to the SQL Server optimizer?
How is cost calculated?
PDW
Composed of hardware and software
Shared-nothing (loosely coupled) architecture
Standard SQL Server uses symmetric multiprocessing (SMP) on a single server
MPP runs several servers in parallel and independently
Cost-effective
Easy to add extra servers and storage
Components (CPUs, memory, storage, etc.) can be upgraded or replaced individually
SQL SERVER PDW
ARCHITECTURE
CONTROL NODE & COMPUTE NODE
Control Node:
Distributes queries among the compute nodes
Accepts client connections via ODBC, OLE DB, ADO.NET
Contains additional software to support the distributed architecture of PDW
Manages DMS, the communication layer between nodes
Contains the shell database
Performs database authentication and authorization
Compute Node:
Hosts a single SQL Server instance
Runs services for communication and data transfer
Each node holds a portion of the user data (hash-partitioned)
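The hash-partitioning of user data across compute nodes can be sketched as follows. This is a minimal illustration, not PDW internals: the node count, sample rows, and use of Python's built-in `hash()` are all assumptions for the sketch.

```python
# Sketch: assigning rows to compute nodes by hashing the distribution
# column. Node count, table data, and hash function are illustrative
# assumptions, not PDW's actual implementation.

NUM_COMPUTE_NODES = 4

def node_for(distribution_key) -> int:
    """Map a distribution-column value to a compute node."""
    # PDW uses its own hash function; Python's hash() stands in here.
    return hash(distribution_key) % NUM_COMPUTE_NODES

def partition(rows, key_index):
    """Split rows into one bucket per compute node."""
    buckets = {n: [] for n in range(NUM_COMPUTE_NODES)}
    for row in rows:
        buckets[node_for(row[key_index])].append(row)
    return buckets

customers = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave")]
buckets = partition(customers, key_index=0)
```

Each row lands on exactly one node, determined solely by its distribution-column value, which is what makes co-located joins possible later.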
SHELL DATABASE
Contains the metadata of user tables
No user data
Indistinguishable from a database that contains actual data
Used for testing and debugging of compilation issues
Additionally stores:
 Users and privileges
 Checks for security and access rights
Provides the same security model as SQL Server
Contains global statistics
Local statistics are computed on each node and merged into the global statistics
DATA MOVEMENT SERVICE
Responsible for moving data between all the nodes of the appliance
In one case, intermediate results move from one compute node to another
In another, one or more compute nodes move intermediate results to the control node
The control node computes final aggregations and sorts prior to returning the results
Uses temp tables to store intermediate results
In some cases, queries produce final results directly on the compute nodes and send them back to the client (DMS is not involved in the process)
DSQL PLAN AND ITS EXECUTION
A DSQL plan contains
SQL operations – executed directly in SQL Server
DMS operations – move data among the nodes
Temp table operations – stage intermediate results
Return operations – push data to the client
Query plans are executed serially, one step at a time.
However, each serial step runs in parallel across the nodes.
DSQL PLAN EXAMPLE
Assume the Customer table is distributed on c_custkey and the Orders table on o_orderkey
SELECT c_custkey, o_orderdate
FROM Orders, Customer
WHERE o_custkey = c_custkey AND o_totalprice > 100
The join is not distribution-compatible: Orders is distributed on o_orderkey, not on the join column o_custkey
DSQL plan
1. DMS operation: repartitions the Orders table on o_custkey
2. Return SQL operation: each node joins its local partitions and sends the final tuples to the client
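The two-step plan above can be simulated in miniature: shuffle Orders on o_custkey so matching rows become co-located, then join locally on each node. The node count, sample data, and hash function are illustrative assumptions.

```python
# Simulation of the DSQL plan: a DMS shuffle move followed by local
# joins. Two nodes and toy tables; hash() stands in for PDW's hash.

NODES = 2

def node_for(key):
    return hash(key) % NODES

# Customer distributed on c_custkey; Orders distributed on o_orderkey.
customer = {n: [] for n in range(NODES)}
for c_custkey in [1, 2, 3]:
    customer[node_for(c_custkey)].append((c_custkey,))

orders = {n: [] for n in range(NODES)}
for row in [(10, 1, 150.0, "2024-01-05"),
            (11, 2, 90.0, "2024-02-01"),
            (12, 3, 300.0, "2024-03-10")]:
    orders[node_for(row[0])].append(row)  # (orderkey, custkey, price, date)

# Step 1 - DMS shuffle move: repartition Orders on o_custkey.
shuffled = {n: [] for n in range(NODES)}
for rows in orders.values():
    for row in rows:
        shuffled[node_for(row[1])].append(row)

# Step 2 - each node joins its now co-located partitions locally.
result = []
for n in range(NODES):
    local_custkeys = {c[0] for c in customer[n]}
    for (_, o_custkey, price, date) in shuffled[n]:
        if o_custkey in local_custkeys and price > 100:
            result.append((o_custkey, date))
```

After the shuffle, every order row sits on the same node as its customer row, so no cross-node communication is needed during the join itself.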
QUERY OPTIMIZATION
PDW parser
SQL Server compilation
XML generator
PDW query optimizer
COST-BASED QUERY OPTIMIZATION IN PDW
DSQL PLAN GENERATION
Input: physical operator tree
Output: DSQL formatted plan
Framework: QRel programming
Sends SQL queries to the compute nodes instead of the operator tree (unlike other MPP systems, e.g. Greenplum)
SQL statements are executed on the underlying compute nodes, and DMS operations are used to transfer data
Similar to the Aster Data approach
DSQL PLAN GENERATION – QREL FRAMEWORK
DMS OPERATIONS
7 DATA MOVEMENT OPERATIONS:
1. Shuffle Move (many-to-many). Rows are moved from each compute node to the target table based on a hash of the value in the specified distribution column.
2. Partition Move (many-to-one). Rows are moved from each compute node to the target table on the target node (typically the control node, but this is not a requirement).
3. Control-Node Move (from the control node to the compute nodes). A table on the control node is replicated to all compute nodes.
4. Broadcast Move. Rows are moved from each compute node to the target table on all compute nodes.
5. Trim Move. Initiated against a replicated table on all compute nodes when the destination is a distributed table on those same nodes. Hashing takes place so that each node keeps only the rows it is responsible for.
6. Replicated Broadcast. A table that resides on only one compute node is replicated via a broadcast move.
7. Remote Copy to Single Node. Either a remote copy of a replicated table (from the control node or from a compute node) or a remote copy of a distributed table.
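Two of the moves above, Broadcast and Trim, can be sketched as follows. The node count, hash function, and sample table are illustrative assumptions, not PDW internals.

```python
# Sketch: a broadcast move copies the full table to every compute node;
# a trim move then reduces a replicated table so each node keeps only
# the rows it would own in the distributed version. Illustrative only.
import copy

NODES = 3

def node_for(key):
    return hash(key) % NODES

def broadcast_move(rows):
    """Copy the same rows to the target table on all compute nodes."""
    return {n: copy.deepcopy(rows) for n in range(NODES)}

def trim_move(replicated, key_index=0):
    """Each node keeps only the rows it is responsible for."""
    return {n: [r for r in rows if node_for(r[key_index]) == n]
            for n, rows in replicated.items()}

nation = [(0, "US"), (1, "UK"), (2, "DE")]
replicated = broadcast_move(nation)   # every node holds all 3 rows
trimmed = trim_move(replicated)       # rows split back across nodes
```

After the trim, the union of all nodes' rows equals the original table, with each row on exactly one node.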
COST OF DMS OPERATIONS
Separated into two components: source (sending side) and target (receiving side)
Source and target costs are each divided into sub-components:
Source: C_source = max(C_reading, C_sending)
Target: C_target = max(C_receiving, C_writing)
DMS operation cost: C_dms = max(C_source, C_target)
Data transmission happens asynchronously
Source and target operations run in parallel on each node
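The max-based cost model above can be sketched as a small function. How PDW derives the four component costs is not shown here; the numeric values in the example are illustrative assumptions.

```python
# Sketch of the DMS cost model: because source and target work runs in
# parallel, the slower side determines the cost of the whole operation.

def dms_cost(c_reading, c_sending, c_receiving, c_writing):
    c_source = max(c_reading, c_sending)    # sending side
    c_target = max(c_receiving, c_writing)  # receiving side
    return max(c_source, c_target)          # slower side dominates

# e.g. a network-bound shuffle where sending/receiving dominate:
cost = dms_cost(c_reading=2.0, c_sending=7.5, c_receiving=7.5, c_writing=3.0)
```

Here `cost` is 7.5: speeding up reading or writing alone would not make this move cheaper, since the network is the bottleneck on both sides.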
QUERIES?
THANK YOU!!
