Oracle Data Guard Switchover and Failover Internals: How Fast Can You Go?
Presenting with:
Michael Smith, Oracle
Srinagesh Battula, Intel
Copyright © 2011, Oracle and/or its affiliates. All rights reserved.
Latin America 2011: December 6–8, 2011
Tokyo 2012: April 4–6, 2012
Program Agenda
• Overview
– What is Data Guard
– Disaster Recovery and High Availability
– Switchover / Failover, how fast?
– Alerting / Monitoring
• Intel Case Study
– HA / DR Objectives
– Overall Architecture
– Observed Opportunities / Benefits
– HA / DR Metrics
• Failover / Switchover Details
– Failover process flow
– Switchover process flow
– Demo
– Best Practices
Switchover – Primary Database ⇄ Standby Database
• Used for database upgrades, tech refresh, data center moves, OS or
hardware maintenance …
• Manually execute via SQL, Enterprise Manager GUI, or Broker CLI
[Chart: switchover time in seconds for Database and Application — Manual SQL*Plus vs. Manual Broker]
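As a hedged sketch of the manual paths above (the standby name "boston" is illustrative, not from the deck), a planned switchover can be driven either from SQL*Plus or as a single Broker CLI command:

```sql
-- SQL*Plus on the primary: confirm readiness, then convert roles
SELECT switchover_status FROM v$database;   -- expect TO STANDBY
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY
  WITH SESSION SHUTDOWN;
-- ...then on the (old) standby:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY
  WITH SESSION SHUTDOWN;

-- Broker CLI: the same planned role transition in one command
DGMGRL> SWITCHOVER TO 'boston';
```

The Broker variant also restarts both databases in their new roles, which is where most of the manual steps disappear.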
Failover
• Manually execute via SQL, Enterprise Manager GUI, or Broker CLI
• Automate failover using Data Guard Fast-Start Failover (FSFO)
[Chart: failover time in seconds for Database and Application — Manual SQL*Plus vs. Manual Broker vs. Automatic FSFO]
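A minimal sketch of the manual failover paths (standby name is hypothetical); the SQL*Plus sequence runs on the standby being promoted:

```sql
-- SQL*Plus manual failover on the physical standby
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY
  WITH SESSION SHUTDOWN;

-- Broker CLI equivalent, again a single command
DGMGRL> FAILOVER TO 'boston';
```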
• Flashback Database reinstates the failed primary as a standby database
• Built-in controls prevent split brain
[Chart: time in seconds]
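Enabling automatic failover and reinstating a failed primary are likewise single Broker commands (a sketch; the threshold value and database name are illustrative):

```sql
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> START OBSERVER;            -- run from a third host
-- After an automatic failover, Flashback Database lets the Broker
-- turn the failed primary back into a standby:
DGMGRL> REINSTATE DATABASE 'chicago';
```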
• Overview
– What is Data Guard
– Disaster Recovery and High Availability
– Switchover / Failover, how fast?
– Alerting / Monitoring
• Intel Case Study
– HA / DR Objectives
– Overall Architecture
– Observed Opportunities / Benefits
– HA / DR Metrics
• Failover / Switchover Details
– Failover process flow
– Switchover process flow
– Demo
– Best Practices
Agenda
Profile of Intel’s DB Ecosystem & HA/DR Challenges
HA/DR Objectives
HA 1.0 vs HA 2.0
Opportunities with HA 2.0
Conclusion
Intel Corporation
The World’s Largest Semiconductor Manufacturer
Intel’s DB Ecosystem Profile & HA/DR Challenges
Intel’s Factory Automation DBs are used for critical manufacturing decisions
(Operational and Planning, Engineering Analysis, Process Control).
They include both Mission Critical OLTP and Mission Important DSS-type systems
(ranging from a few hundred gigabytes up to multiple TB).
Application clients include both vendor and homegrown apps (ODP.NET based).
High availability of data pertinent to the advanced manufacturing processes of an
ever-expanding Intel product pipeline is vital.
• Current blue print of DB HA/DR for Mission Critical Databases
• Oracle Fail Safe / storage-based replication and “Vanilla” Data Guard (codename: HA 1.0)
• HA/DR Challenges
• Current HA implementations are too complex to leverage for operational needs such as MTTR
and Maintenance Downtime reduction.
• Activation of redundant env is time consuming and requires decisions to be made during
failure incidents.
HA/DR Objectives
• Comprehensive High Availability & Disaster Recovery
– Holistic end-to-end HA/DR across all tiers of the stack, i.e., Storage, Network,
Server, Database, and Application
• Simplified Manageability of the Stack
• Leverage HA/DR investment for Managed DT Reduction during
OS/Server/SAN and DB patches/migrations
• Cost Effective
• HA/DR Target Metrics
– Recovery Point Objective (RPO): Zero Data Loss
– Recovery Time Objective (RTO) for failovers: < 60 secs
HA 1.0 DB Architecture vs. HA 2.0 DB Architecture
[Diagram: both architectures place the Primary in Data Center A and the Standby in Data Center B, with an Observer reachable over the public network. HA 1.0: Oracle Fail Safe active/passive DB instances at each site with Data Guard ASYNC between sites, plus private and storage networks. HA 2.0: Broker-enabled Data Guard SYNC between a primary and standby running on Oracle Grid Infrastructure.]
Storage Cost Reduction – 33% DB storage cost reduction compared to HA 1.0
• Number of copies of the db for a given application reduced from 3 to 2
(eliminated storage mirroring for the db)
• Leveraged ASM for Dynamic Capacity Rightsizing
• Lack of clear visibility into data growth forced the databases’ NTFS LUNs to be
pre-allocated and oversized; HA 2.0 eliminates the issue of pockets of unused
free space on various NTFS LUNs

Leverage DR investment for MDT Reduction – Reduction of maintenance DT for SAN and DB patching
• SAN and DB patching take 2+ hours in HA 1.0 compared to 120 seconds in HA 2.0
• Rolling upgrade for OS/patching/server maintenance takes 120 seconds in HA 2.0

Enhanced Data Protection – Superior data protection and corruption-prevention capability
• Automatic validation of redo blocks before they are applied
• Fast failover to an uncorrupted standby database upon prod db corruption

Simplified Manageability – Operational efficiency via reduced manageability overhead
• Switchovers and failovers performed via a single Broker CLI command, compared
to the numerous steps that need to be executed otherwise
• Automatic reinstatement of the standby (via Flashback technology) upon
failover, compared to 1 day of time to rebuild the standby
HA 2.0 DB Architecture
[Diagram: DB instances with Grid Infra at each site, Broker-enabled Data Guard SYNC between Primary and Standby, connected by public, private, and storage networks.]
• Data Guard (Broker-enabled) Fast-Start Failover, Zero Data Loss configuration
(SYNC / Max Availability mode / Real-Time Apply)
• Cost-effective HA/DR architecture with a single standby database running on
Intel Xeon™ servers
• High availability of Observers driven by Enterprise Manager Grid Control
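A zero-data-loss configuration of this shape is typically set up along these lines (a minimal sketch; the db_unique_name "stby" is an assumption, not from the deck):

```sql
-- On the primary: ship redo synchronously to the standby
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=stby';

-- Raise the protection mode to Maximum Availability
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

-- On the standby: apply redo as it arrives (Real-Time Apply)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT;
```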
HA 2.0 Stack Composition
[Diagram: public network; services and listeners; Database Instance A with Data Guard Broker and Database Instance B with DG Broker; ASM Instance A and ASM Instance B; Oracle Restart A and Oracle Restart B; operating system; Node A and Node B; private network; DATA disks and a Redo LUN.]
HA 2.0 Stack Composition: Oracle 11gR2, Windows 2008 R2, Oracle ASM, Oracle Restart, Oracle Data
Guard Broker – Fast-Start Failover, Enterprise Manager Grid Control for monitoring and FSFO Observer High
Availability.
HA 2.0 Platform Reference Architecture
• Database: Oracle 11g R2 64-bit (11.2.0.2)
• HA/DR: Broker-managed Oracle Data Guard (Max Availability mode) in a Fast-Start
Failover configuration; Oracle Restart (part of Grid Infra install)
• Observer HA: via OEM GC; custom script (OBMAN) to automatically move the
Observer away from the PRY data center upon role transition
• Monitoring: Oracle Enterprise Manager Grid Control (10.2.0.5 OMS; 10.2.0.4 OMR)
and 11gR1 OMA or later; separate Observer home on GC nodes – 11gR2 32-bit client
• Backups: RMAN integration with GC; centralized RCAT for all OLTP apps, running
on the GC repository DB as a separate schema; basic compressed backup sets,
weekly full and daily incrementals; backups run on PRY (with Block Change Tracking)
• OS: 64-bit Windows 2008 R2 Server
• Storage/File System: EVA8400; app-specific Data/Index, Backup, Redo, Archive,
Control → ASM; two ASM disk groups (+DATA, +FRA) and a REDO LUN (RAID 1); 2nd
member of Redo/Arch/control files on a dedicated SAN-mirrored LUN for
double-failure coverage; Oracle and Grid Infrastructure (ASM) binaries → NTFS
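The backup policy above can be sketched in RMAN along these lines (a hedged sketch; the catalog owner and net alias "gcrep" are assumptions):

```sql
-- Connect to the target and the central recovery catalog (RCAT)
$ rman TARGET / CATALOG rcat_owner@gcrep

-- Weekly full, as a basic compressed backup set
RMAN> BACKUP AS COMPRESSED BACKUPSET INCREMENTAL LEVEL 0 DATABASE;

-- Daily incrementals; Block Change Tracking keeps these fast on the primary
RMAN> BACKUP AS COMPRESSED BACKUPSET INCREMENTAL LEVEL 1 DATABASE;

-- Block Change Tracking is enabled once, from SQL*Plus:
-- ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;
```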
Lessons Learned
Intel created Homegrown custom scripts/Utilities to:
– Auto Relocate the FSFO Observer (where the current PSB’s Data Center is) upon role
transition
– Upload OMA dynamic properties so that Grid Control reflects real state of PRY/PSB
Next Steps:
• Automatic failover for ODP.NET client apps via FAN/FCF
• Optimized connect-time failover
Conclusion
Auto-Failovers via “Data Guard Broker enabled Fast-Start Failover” allowed:
• Elimination of human element in role transition decisions during DB infra
failures
• Simplified Manageability to perform planned role transitions
• Reduction in number of multi-vendor integration touch points in the DB
HA/DR stack
– Very thin database infrastructure footprint
• Zero-Data loss and Near-Zero (few seconds) downtime for Auto-Failovers
• A single cost-effective vehicle for Database HA, DR, data protection, and MDT
reduction, with superior operational efficiency and low MTTR
Thanks to the TMG Engg Team for the contributions.
Thank you for attending the session.
Contact Info:
srinagesh.battula@intel.com
Program Agenda
• Overview
– What is Data Guard
– Disaster Recovery and High Availability
– Switchover / Failover, how fast?
– Alerting / Monitoring
• Intel Case Study
– HA / DR Objectives
– Overall Architecture
– Observed Opportunities / Benefits
– HA / DR Metrics
• Failover / Switchover Details
– Failover process flow
– Switchover process flow
– Demo
– Best Practices
Failover – Primary Connection (1)
• JDBC: Subscribe to ONS
• Connected to OrderEntry Service
[Diagram: app servers with JDBC ONS daemons connected to a 2-node RAC primary; Data Guard redo transport to a 2-node RAC standby. AUSTIN Data Center (primary), HOUSTON Data Center (standby).]

Failover – Primary Connection (2)
• JDBC: Timeout ensues
• No longer connected to OrderEntry Service

Failover Started by Data Guard Broker (3)
Failover – New Primary
New connections directed to the new primary:
• JDBC: Subscribe to ONS
• Connected to OrderEntry Service
Old primary’s existing connections:
• JDBC: Timeout continues
• No longer connected to OrderEntry Service
[Diagram: the app server farm establishes new connections to the standby site, now the new primary, while connections to the old primary time out.]
• You get client failover for switchover for free if you have followed
the steps (well… sorta)
• Physical standby
– Clients disconnected as primary is converted to a standby
– Clients go through TAF retry logic (OCI) or application retry logic (JDBC)
– Clients connected to the standby are disconnected as it is converted to primary
– Once both databases come up in new roles, services start and clients reconnect
• Logical standby
– Services are stopped automatically for a Data Guard Broker switchover
– Manually disconnect connections to both primary and standby
– Perform switchover
– Once both databases come up in new roles, services start and clients reconnect
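The TAF retry path mentioned above is driven by the client's connect descriptor; a minimal tnsnames.ora sketch (host and service names are hypothetical) giving OCI clients connect-time failover across both sites plus TAF retry:

```
ORDERENTRY =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = austin-scan)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = houston-scan)(PORT = 1521)))
    (CONNECT_DATA =
      (SERVICE_NAME = orderentry)
      (FAILOVER_MODE =
        (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))))
```

Because the role-specific service only runs on the current primary, the same alias finds the database regardless of which data center holds the primary role.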