Case Study: The Mysterious Performance Drop

Author: Roderick Manalac, Consulting Technical Advisor, Oracle USA

Skill Level Rating for this Case Study: Expert

About Oracle Case Studies


Oracle Case Studies are intended as learning tools and as a way to share information or
knowledge related to a complex event, process, or procedure, or to a series of related
events. Each case study is based on the experience of its writer or writers.

Each Case Study contains a skill level rating. The rating indicates how much experience
with the subject matter the reader should have to follow the information in the case study.
Ratings are:

• Expert: significant experience with the subject matter
• Intermediate: some experience with the subject matter
• Beginner: little experience with the subject matter

Case Study Abstract


Sometimes the simplest or seemingly innocent actions can have significant ramifications
on the performance of a very busy system. Diagnosing these types of problems
sometimes requires some understanding of obscure Oracle behaviors. This article will
describe how two minor features combined can cause an interesting performance issue,
and how the issue was diagnosed and resolved.

Case History
A customer's Applications environment slowed down every afternoon for several
consecutive workdays. On most days, the slowdown lasted only 10 or 20 minutes and
performance then returned to normal. On a few days, however, performance degraded and
either stayed unacceptable or kept worsening until the customer was forced to shut down
and restart (“bounce”) the database during business hours. Good performance would then
return until the following afternoon. The Application had been running fine for the
months prior to these events. The customer stated that nothing had changed recently in
the environment to trigger this behavior: no patches had been applied, and no hardware
had been added or removed.
Analysis
Fortunately, the customer already had Statspack configured to capture performance
snapshots every 30 minutes, so the first step was to review some Statspack reports. Ideally,
one would look for significant differences between a normal processing day with
acceptable performance and a bad day. In this case, we also had the luxury of analyzing
the periods immediately before, during, and after the performance issue.

On acceptable days before the problems appeared, the top “Timed Events” were “CPU
time” and some I/O-related events. In the first Statspack period that included the problem
window, "latch free" jumped to the top. On days when performance corrected itself,
CPU returned to the top and everything generally reverted to the statistics seen in the
“BEFORE” problem reports. However, on the days when the problem did not correct itself,
I/O-related events appeared on top, followed by "latch free" and "buffer busy waits", while
CPU usage dropped.

So even the summary information at the beginning of each Statspack report was telling a
different story. The system was displaying three distinct performance profiles, which
could humorously be labeled “The Good”, “The Bad”, and “The Ugly”.

The “Good”, Statspack Report [1]:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                       38,323    28.81
db file scattered read                          4,115,952      28,707    21.58
db file sequential read                         2,169,347      18,995    14.28
buffer busy waits                               1,722,685      15,287    11.49
log file sync                                     208,260      12,209     9.18

The “Bad”, Statspack Report [1]:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                        413,223       1,667    35.14
db file sequential read                           241,965       1,218    25.68
db file scattered read                            485,092         601    12.68
buffer busy waits                                  38,232         259     5.45
CPU time                                                          205     4.32

[1] Unfortunately, the original data was not available at the time of publication. Some
representative values were used in these examples.

The “Ugly”, Statspack Report [1]:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file scattered read                          3,111,783     118,912    43.42
db file sequential read                         1,408,059      43,565    15.91
latch free                                      4,281,865      30,869    11.27
buffer busy waits                               1,146,414      23,682     8.65
CPU time                                                       20,754     7.58
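
The three profiles can be told apart mechanically from the Top 5 sections alone. The following is a minimal Python sketch, not part of the original analysis: it copies the (event, time in seconds) pairs from the three excerpts above and reports which event dominates and where "CPU time" ranks. The names PROFILES, top_event and cpu_rank are hypothetical helpers introduced only for this illustration.

```python
# (event name, time in seconds) copied from the three Top 5 Timed Events
# excerpts above. The values are the representative ones from the text.
PROFILES = {
    "Good": [("CPU time", 38323), ("db file scattered read", 28707),
             ("db file sequential read", 18995), ("buffer busy waits", 15287),
             ("log file sync", 12209)],
    "Bad": [("latch free", 1667), ("db file sequential read", 1218),
            ("db file scattered read", 601), ("buffer busy waits", 259),
            ("CPU time", 205)],
    "Ugly": [("db file scattered read", 118912), ("db file sequential read", 43565),
             ("latch free", 30869), ("buffer busy waits", 23682),
             ("CPU time", 20754)],
}


def top_event(events):
    """Return the name of the event with the largest time."""
    return max(events, key=lambda e: e[1])[0]


def cpu_rank(events):
    """1-based position of 'CPU time' when events are sorted by time, descending."""
    ordered = sorted(events, key=lambda e: e[1], reverse=True)
    return [name for name, _ in ordered].index("CPU time") + 1


for label, events in PROFILES.items():
    print(f"{label:5s} top event: {top_event(events):24s} CPU rank: {cpu_rank(events)}")
```

A different top event in each profile, combined with CPU time sliding from first to last place, is the same signal the summary sections gave at a glance.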

In “The Good”, a more detailed survey of the Statspack report showed little CPU used for
parsing or recursive calls. This meant that most of the CPU time was likely being used
to process SQL, and the few I/O waits meant that most of the popular data was residing
happily in cache.

The “Good”, Statspack Report, Statistics [1]:

. . .
Instance Activity Stats for DB: PROD  Instance: prod1  Snaps: 3140 -3141

Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
CPU time                                   3,832,300         11,207         7.58
. . .
parse count (hard)                                21            0.0          0.0
parse time cpu                                   975            0.3          0.5
. . .
recursive cpu usage                              312            0.1          0.1
. . .

In “The Bad”, the latch free waits led us to study the latch section of the Statspack report
in more detail. In that section, most of the sleeps appeared to revolve around the library
cache. The statistics section also showed a higher “parse count (hard)” than “The Good”, and
the library cache section added corroboration by reporting a large number of
reloads and invalidations in the SQL AREA (with reloads almost equal to invalidations).

The “Bad”, Statspack Report, Latch and Library Cache Statistics [1]:

. . .
Instance Activity Stats for DB: PROD  Instance: prod1  Snaps: 3340 -3341

Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
. . .
parse count (failures)                            83            0.0          0.0
parse count (hard)                             1,521            0.4          0.8
parse count (total)                           10,780            8.3         14.9
. . .

Latch Sleep breakdown for DB: PROD  Instance: prod1  Snaps: 3340 -3341
-> ordered by misses desc

                                      Get                         Spin &
Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
library cache                 143,525,937   1,344,491     218,264 1161117/1551
                                                                  99/22432/574
                                                                  3/0
shared pool                    56,948,537     446,574     105,545 353300/81370
                                                                  /11553/351/0
. . .

Library Cache Activity for DB: PROD  Instance: prod1  Snaps: 3340 -3341
->"Pct Misses" should be very low

                         Get    Pct            Pin    Pct             Invali-
Namespace           Requests   Miss       Requests   Miss    Reloads  dations
--------------- ------------ ------ -------------- ------ ---------- --------
BODY                   6,679    0.0          6,679    0.0          0        0
CLUSTER                  223    0.4            284    0.7          0        0
INDEX                    781   11.8            710   13.0          0        0
SQL AREA           2,490,099    3.5     37,707,666    0.6     28,618      341
TABLE/PROCEDURE    3,015,349    0.2        940,923    7.2     23,303        0
TRIGGER               12,400    0.0         12,400    0.0          0        0
. . .
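
To put the parse and latch figures above on a common scale, it can help to turn them into ratios. The following Python sketch is only an illustration using the representative values from this excerpt; the helper pct and the variable names are assumptions made for the example, not Statspack output columns.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, guarding against a zero denominator."""
    return 100.0 * part / whole if whole else 0.0


# Values copied from the "Bad" excerpt above.
hard_parses, total_parses = 1_521, 10_780
latches = {
    "library cache": {"gets": 143_525_937, "misses": 1_344_491, "sleeps": 218_264},
    "shared pool": {"gets": 56_948_537, "misses": 446_574, "sleeps": 105_545},
}

# Share of all parse calls that were hard parses in this snapshot interval.
hard_parse_pct = pct(hard_parses, total_parses)
print(f"hard parses: {hard_parse_pct:.1f}% of all parses")

# For each latch: how often a get missed, and how often a miss ended in a sleep.
for name, s in latches.items():
    miss_pct = pct(s["misses"], s["gets"])
    sleep_pct = pct(s["sleeps"], s["misses"])
    print(f"{name}: missed {miss_pct:.2f}% of gets, slept on {sleep_pct:.1f}% of misses")
```

The miss percentages look small in isolation; it is the sleeps per miss, together with the jump in hard parses relative to “The Good”, that points at contention around the library cache.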

Finally, “The Ugly” report showed much higher I/O activity than the other two profiles.
The hard parsing activity had disappeared. More tellingly, a regularly executed SQL
statement that appeared under “SQL ordered by Reads” in this report was nowhere to be
found in that section of the other reports, even though it was listed under “SQL ordered
by Executions” in all of them.

The “Ugly”, Statspack Report, Top SQL [1]:

. . .
SQL ordered by Reads for DB: PROD  Instance: prod1  Snaps: 3440 -3441
-> End Disk Reads Threshold: 1000

                                                        CPU    Elapsd
 Physical Reads   Executions Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        490,078           58        8,449.6   14.0   367.73    664.48 3381540416
SELECT col1, col2, col3 FROM large_table

. . .

SQL ordered by Executions for DB: PROD  Instance: prod1  Snaps: 3440 -3441
-> End Executions Threshold: 10

                                                  CPU per   Elap per
  Executions  Rows Processed    Rows per Exec    Exec (s)   Exec (s) Hash Value
------------ --------------- ---------------- ----------- ---------- ----------
          58         390,304           6729.4        0.00       0.00 1208562063
SELECT col1, col2, col3 FROM large_table

. . .

Conclusion and Learnings


At this point, it would be easy to jump to the conclusion that some user or job had
executed DBMS_STATS (or ANALYZE) against some key tables. This would invalidate
many SQL statements referencing those tables, so that they could be reparsed with the
new cost-based optimizer (CBO) statistics (explaining the high latch contention reported
in the “Bad” profile). Then on certain days, these statistics led CBO to choose poor
execution plans (causing the “Ugly” profile). Thus, one potential solution would be to
stop gathering table statistics in the middle of the day. Another possible fix would be to
preserve the good execution plan for the rogue SQL statement using an OUTLINE.

Unfortunately, there were a couple of holes in that theory that would at least rule out
the first solution. Chiefly, the customer was insistent that no DBA or scheduled job was
gathering new statistics against the objects. Secondly, even if this led to bad execution
plans, why would performance be restored after the database was shut down and
restarted? [SQL trace output later confirmed that the execution plans would change after
the SQL was invalidated, but good execution plans were restored after a database
instance bounce.]

Some activity was definitely invalidating SQL. So if it was not DBMS_STATS, then
what? If the cursors were just aging out under shared pool space pressure, the report
would show reloads without invalidations. No patches were being applied, no
columns were being added, and so on. No space maintenance was going on, so no indexes
were rebuilt or tables moved. It turns out that just granting or revoking privileges on
objects is sufficient to invalidate dependent SQL as well. At that point, the customer did
admit they had been giving database access to new employees over the last few weeks.
They had not considered that a change in operational practice could have this impact.
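
The distinction between a cursor that ages out and one that is invalidated can be pictured with a toy model. The Python sketch below is purely illustrative and does not reflect Oracle internals: the class ToyLibraryCache and its methods are hypothetical names, and the only behavior it encodes is the one described above (aging out leads to a reload with no invalidation, while a grant on a referenced table leads to an invalidation followed by a reload).

```python
class ToyLibraryCache:
    """Tracks per-statement loads and invalidations, loosely like V$SQLAREA."""

    def __init__(self):
        # sql_text -> {"table", "loaded", "loads", "invalidations"}
        self.cursors = {}

    def execute(self, sql_text, table):
        c = self.cursors.setdefault(
            sql_text,
            {"table": table, "loaded": False, "loads": 0, "invalidations": 0})
        if not c["loaded"]:        # hard parse: the cursor is (re)loaded
            c["loaded"] = True
            c["loads"] += 1
        return c

    def age_out(self, sql_text):
        """Space pressure: the cursor is freed, but nothing is invalidated."""
        self.cursors[sql_text]["loaded"] = False

    def grant(self, table):
        """A grant or revoke on `table` invalidates loaded cursors that use it."""
        for c in self.cursors.values():
            if c["table"] == table and c["loaded"]:
                c["loaded"] = False
                c["invalidations"] += 1


cache = ToyLibraryCache()
q = "SELECT col1, col2, col3 FROM large_table"
cache.execute(q, "large_table")   # initial load
cache.age_out(q)
cache.execute(q, "large_table")   # reload, no invalidation recorded
cache.grant("large_table")
stats = cache.execute(q, "large_table")   # reload after an invalidation
print(stats["loads"], stats["invalidations"])
```

In this model, reloads caused purely by space pressure never move the invalidation counter, which is why a report showing both reloads and invalidations pointed away from simple aging.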

But what could cause the execution plan to change once in a while when a particular
SQL statement was hard parsed again? Dynamic sampling was not enabled. This left a
more obscure feature called bind peeking.

If optimizer_features_enable is set to 9.0.0 or higher, then CBO will calculate some costs
for inequality predicates (or equality predicates against columns with histograms) based on
the bind values supplied by the first session that (re)loads the SQL into the shared pool.
The SQL in question happened to fit the criteria where the bind value supplied could
significantly impact what CBO thought was the best execution plan. A simple example
appears at the end of this study.
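
The interaction between bind peeking and invalidation can also be sketched as a small simulation. The Python below is a deliberately simplified model, not the optimizer's real costing: the uniform-distribution selectivity estimate, the 5% threshold, the 0 to 50,000 value range, and the names choose_plan and PeekingCursor are all assumptions made for this illustration.

```python
def choose_plan(low, high, min_id=0, max_id=50_000, threshold=0.05):
    """Pick a plan from the estimated fraction of rows in [low, high].

    The uniform estimate and the threshold are assumptions for this sketch;
    the real optimizer compares the costs of complete plans.
    """
    lo, hi = max(low, min_id), min(high, max_id)
    selectivity = max(hi - lo, 0) / (max_id - min_id)
    return "INDEX RANGE SCAN" if selectivity < threshold else "INDEX FAST FULL SCAN"


class PeekingCursor:
    """One shared cursor: binds are peeked only when a plan must be built."""

    def __init__(self):
        self.plan = None        # None means the next execution hard parses
        self.hard_parses = 0

    def execute(self, low, high):
        if self.plan is None:   # first load, or reload after invalidation
            self.plan = choose_plan(low, high)
            self.hard_parses += 1
        return self.plan        # later bind values reuse the plan unchanged

    def invalidate(self):
        """Stands in for a GRANT or REVOKE on a referenced object."""
        self.plan = None


cursor = PeekingCursor()
history = [cursor.execute(1000, 1001),    # narrow range is peeked
           cursor.execute(0, 50_000)]     # wide range reuses the narrow plan
cursor.invalidate()
history.append(cursor.execute(0, 50_000))    # wide range is peeked this time
history.append(cursor.execute(1000, 1001))   # narrow range reuses the wide plan
print(history)
```

Under these assumptions the plan depends only on whichever bind values happen to arrive first after each invalidation, which matches the pattern of performance turning bad on some afternoons and not on others.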

With all the observations and evidence now falling into place, the consensus was that the
safest solution was to limit new user additions to off-hours maintenance windows. The
customer also created an OUTLINE for the problem SQL. Adding a HINT was not possible
because it was a third-party application. Changing optimizer_features_enable to disable
bind peeking might have had negative impacts on other SQL.

References
The scripts below illustrate how invalidation and bind peeking work. They were tested
against a recently installed vanilla 10gR2 seed database. Ideally you need to manually go
back and forth between two SQL*Plus sessions sitting side-by-side to best view what is
happening.

Session 1:
REM T1: Set up test case
var x number;
var y number;

create table bigtab as select * from all_objects;

create index bt_ix on bigtab (object_id);

execute dbms_stats.gather_table_stats (ownname=>'SCOTT', -
tabname=>'BIGTAB', CASCADE => TRUE, -
method_opt => 'FOR ALL COLUMNS SIZE 1');

REM Start with narrow range of values

begin :x := 1000; :y := 1001; end;
/

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM Go to session 2 - exec plan should be index range scan

REM T3: now let's re-execute with a wide range

begin :x := 0; :y := 50000; end;
/

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM Go to session 2 again - exec plan should still be the same

REM T5: So let's invalidate this puppy

GRANT SELECT ON BIGTAB TO ORDSYS;

REM now query against v$sql_plan and v$sql in session 2 shows
REM "no rows selected"

REM T7: so now let's reload the cursor

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM now session 2 shows a new plan with FAST FULL SCAN and it will
REM be used from now on no matter what the bind values are.

REM T9: so let's turn bind peeking off

alter session set "_optim_peek_user_binds" = false;

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM now v$sqlarea will show version count of 2 (because some other
REM session may still have bind peeking enabled)

Session 2:
column operation format a20
column options format a20
column object_name format a20

REM T2: After initial load with narrow range binds

select sql_id, sql_text from v$sql where sql_text like
'SELECT COUNT(*) FROM BIGTAB%';

SQL_ID
-------------
SQL_TEXT
-----------------------------------------------------------------
2aa40mj45939v
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT                 AGGREGATE
FILTER
INDEX                RANGE SCAN           BT_IX

REM T4: Plan for second SQL still the same

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT                 AGGREGATE
FILTER
INDEX                RANGE SCAN           BT_IX

REM T6: after grant is issued

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

no rows selected

select sql_id from v$sql where sql_id = '2aa40mj45939v';

no rows selected

REM T8: Now cursor is reloaded

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT                 AGGREGATE
FILTER
INDEX                FAST FULL SCAN       BT_IX

select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';

     LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
         2             1          1             1

REM T10: After session altered "_optim_peek_user_binds" = FALSE
REM and query re-executed.

select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';

     LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
         3             1          2             2

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v' order by plan_hash_value;

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
FILTER
SORT                 AGGREGATE
SELECT STATEMENT
INDEX                FAST FULL SCAN       BT_IX
FILTER
SORT                 AGGREGATE
SELECT STATEMENT
INDEX                RANGE SCAN           BT_IX
