Each Case Study contains a skill level rating. The rating indicates the skill level a reader should have for the material covered in the case study.
Case History
A customer's Applications environment slowed down every afternoon for several
consecutive workdays. On most days the slowdown lasted only 10 or 20 minutes
before performance returned to normal. On a few days, however, performance
degraded and remained unacceptable, or worsened steadily, until the customer was
forced to shut down and restart ("bounce") the database during business hours.
Good performance would then return until the following afternoon. The application
had been running fine for months prior to these events, and the customer stated
that nothing in the environment had changed recently that could trigger this
behavior: no patches were applied, and no hardware was added or removed.
Analysis
Fortunately, the customer already had Statspack configured to capture performance
snapshots every 30 minutes, so it was time to examine some Statspack reports.
Ideally, one would look for significant differences between a normal processing
day with acceptable performance and a bad day. In this case, we also had the
luxury of analyzing the periods immediately before, during, and after the
performance issue.
On acceptable days before the problems appeared, the top "Timed Events" were "CPU
time" and some I/O-related events. In the first Statspack period that included the
problem window, "latch free" jumped to the top. On days where performance
corrected itself, CPU returned to the top and the statistics generally reverted to
those seen in the "before" reports. However, on the days where the problem did not
correct itself, I/O-related events appeared on top, followed by "latch free" and
"buffer busy waits", while CPU was not being used as much.
So even the summary information at the beginning of each Statspack report told a
different story. The system was displaying three distinct performance profiles,
which could humorously be labeled "The Good", "The Bad", and "The Ugly".
1 Unfortunately, the original data was not available at the time of publication. Some representative values were used in these examples.
The "Ugly" Statspack report:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                              % Total
Event                               Waits       Time (s)       Ela Time
--------------------------- ------------- -------------- --------------
db file scattered read          3,111,783        118,912          43.42
db file sequential read         1,408,059         43,565          15.91
latch free                      4,281,865         30,869          11.27
buffer busy waits               1,146,414         23,682           8.65
CPU time                                          20,754           7.58
In "The Good", a more detailed survey of the Statspack report showed little CPU
used for parsing or recursive calls. This meant that most of the CPU time was
likely being used to process SQL, and the few I/O waits meant that most of the
popular data was residing happily in cache.
In "The Bad", the "latch free" waits led one to study the latch section of
Statspack in more detail. In that section, most of the sleeps revolved around the
library cache. The statistics section also showed a higher "parse count (hard)"
compared to "The Good", and the library cache section added more corroboration by
reporting a large number of reloads and invalidations in the SQL AREA (with
reloads almost equal to invalidations).
Latch Sleep breakdown for DB: PROD Instance: prod1 Snaps: 3340 -3341
-> ordered by misses desc
Library Cache Activity for DB: PROD Instance: prod1 Snaps: 3340 -3341
->"Pct Misses" should be very low
Finally, "The Ugly" report showed much higher I/O activity than the other two
profiles, and the hard-parsing activity had disappeared. More telling, a regularly
executed SQL statement that appeared in the "SQL ordered by Reads" section was
nowhere to be found in that section of the other reports, even though it was
listed in "SQL ordered by Executions" in all reports.
                                                    CPU      Elapsd
 Physical Reads Executions Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ---------- -------------- ------ -------- --------- ----------
        490,078         58        8,449.6   14.0   367.73    664.48 3381540416
SELECT col1, col2, col3 FROM large_table
. . .
SQL ordered by Executions for DB: PROD Instance: prod1 Snaps: 3440 -3441
-> End Executions Threshold: 10
. . .
Unfortunately, there were a couple of holes in that theory that would at least
rule out the first solution. Chiefly, the customer was insistent that no DBA or
scheduled job was gathering new statistics against the objects. Secondly, even if
this led to bad execution plans, why would performance be restored after the
database was shut down and restarted? [SQL trace output later confirmed that the
execution plans changed after the SQL was invalidated, and that good execution
plans were restored after a database instance bounce.]
Some activity was definitely invalidating the SQL. If it was not DBMS_STATS, then
what? If the cursors were simply aging out under shared pool space pressure, the
report would show reloads without invalidations. No patches were being applied and
no columns were being added. No space maintenance was going on, so no indexes had
been rebuilt or tables moved. It turns out that merely granting or revoking
privileges on an object is sufficient to invalidate dependent SQL as well. At that
point, the customer admitted they had been granting database access to new
employees over the last few weeks. They had not considered that this kind of
operational change could have such an impact.
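The distinction between the two symptoms can be sketched in a toy model. The Python below is purely illustrative (it is not Oracle code, and the class and method names are invented for this sketch): a cursor evicted for space reasons later shows up as a reload only, while a GRANT on a dependent object marks the cursor invalid, so the next hard parse counts as both an invalidation and a reload, which matches the "reloads almost equal to invalidations" pattern seen in the report.

```python
# Toy model of library cache reloads vs. invalidations (illustrative only).
class LibraryCache:
    def __init__(self):
        self.cursors = set()        # SQL currently cached in the shared pool
        self.seen = set()           # SQL that has been parsed at least once
        self.deps = {}              # object name -> dependent SQL statements
        self.reloads = 0
        self.invalidations = 0

    def parse(self, sql, objects=()):
        if sql in self.cursors:
            return "soft parse"     # cursor found, no work needed
        if sql in self.seen:
            self.reloads += 1       # was cached before; must be rebuilt
        self.cursors.add(sql)
        self.seen.add(sql)
        for obj in objects:
            self.deps.setdefault(obj, set()).add(sql)
        return "hard parse"

    def age_out(self, sql):
        # Shared pool space pressure: cursor evicted but NOT invalidated.
        self.cursors.discard(sql)

    def grant(self, obj):
        # GRANT/REVOKE on obj invalidates every dependent cursor.
        for sql in self.deps.get(obj, set()):
            if sql in self.cursors:
                self.cursors.discard(sql)
                self.invalidations += 1

lc = LibraryCache()
lc.parse("q1", ["BIGTAB"])          # initial hard parse
lc.age_out("q1"); lc.parse("q1", ["BIGTAB"])   # reload, no invalidation
lc.grant("BIGTAB"); lc.parse("q1", ["BIGTAB"]) # invalidation, then reload
print(lc.reloads, lc.invalidations)  # → 2 1
```

In the toy run, only the grant contributes to the invalidation count, while both re-parses contribute to reloads.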
But what could cause the execution plan to change once in a while when a
particular SQL statement was hard parsed again? Dynamic sampling was not enabled.
This left a more obscure feature called bind peeking.

If optimizer_features_enable is set to 9.0.0 or higher, the CBO will calculate
some costs for inequality predicates (or equality predicates against columns with
histograms) based on the bind values supplied by the first session that (re)loads
the SQL into the shared pool. The SQL in question happened to fit the criteria
where the bind value supplied could significantly change what the CBO thought was
the best execution plan. A simple example appears at the end of this study.
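The mechanism can be simulated with a short Python sketch (again illustrative only, not Oracle code; the selectivity threshold, plan names, and class are invented for this sketch): the plan is chosen once from the first set of peeked bind values and then reused by every later execution, however unsuitable, until an invalidation forces a re-peek.

```python
# Toy simulation of bind peeking (illustrative only, not Oracle internals).
class SharedPool:
    def __init__(self):
        self.plans = {}  # sql_text -> plan chosen at first hard parse

    def execute(self, sql_text, lo, hi, table_rows=100_000):
        plan = self.plans.get(sql_text)
        if plan is None:
            # Hard parse: "peek" at the binds and cost the range predicate.
            selectivity = (hi - lo) / table_rows
            plan = "INDEX RANGE SCAN" if selectivity < 0.05 else "FAST FULL SCAN"
            self.plans[sql_text] = plan  # cached and shared by all sessions
        return plan

    def invalidate(self, sql_text):
        # e.g. a GRANT or DBMS_STATS run; the next execution re-peeks.
        self.plans.pop(sql_text, None)

pool = SharedPool()
q = "SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y"

print(pool.execute(q, 100, 200))     # narrow range -> INDEX RANGE SCAN
print(pool.execute(q, 0, 90_000))    # wide range, but cached plan is reused
pool.invalidate(q)
print(pool.execute(q, 0, 90_000))    # re-peek on wide range -> FAST FULL SCAN
```

This mirrors the case history: whichever session happened to re-parse first after an invalidation fixed the plan for everyone, and a bounce simply gave the "lucky" narrow-range binds another chance to go first.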
With all the observations and evidence now falling into place, the consensus was
that the safest solution was to limit new user additions to off-hours maintenance
windows. They also created an OUTLINE for the problem SQL. Adding a HINT was not
possible because it was a third-party application, and changing
optimizer_features_enable to disable bind peeking might have had negative impacts
on other SQL.
References
The scripts below illustrate how invalidation and bind peeking work. They were
tested against a recently installed vanilla 10gR2 seed database. Ideally, you
should manually go back and forth between two SQL*Plus sessions sitting side by
side to best view what is happening.
Session 1:
REM T1: Set up test case
var x number;
var y number;

exec dbms_stats.gather_table_stats(ownname => USER, -
     tabname => 'BIGTAB', cascade => TRUE, -
     method_opt => 'FOR ALL COLUMNS SIZE 1');

REM Goto session 2 again - exec plan should still be the same
REM Now session 2 shows a new plan with FAST FULL SCAN and it will
REM be used from now on no matter what the bind values are.
REM Now v$sqlarea will show a version count of 2 (because some other
REM session may still have bind peeking enabled)
Session 2:
column operation format a20
column options format a20
column object_name format a20

SQL_ID
-------------
SQL_TEXT
-----------------------------------------------------------------
2aa40mj45939v
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y

no rows selected

no rows selected
---------- ------------- ---------- -------------
         2             1          1             1