Solving Library Cache Lock


                                                             Avg
                                                            wait   % DB
Event                              Waits     Time(s)        (ms)   time Wait Class
------------------------------ ------------ ----------- -------- ------ ----------
library cache lock                         4         868  2.2E+05   42.7 Concurrency
DB CPU                                               165             8.1
db file sequential read               20,734         139        7    6.9 User I/O
latch free                            22,348           2        0     .1 Other
log file sync                            595           1        1     .0 Commit

The one that stands out is obviously the “library cache lock”: it’s a concurrency-class event, so it should always raise suspicion. The wait time for this event is 868 s, i.e. it corresponds to about 1.8 sessions waiting on it on average, which sounded about right: the subsystem would have about a dozen sessions running, of which 2-5 would typically be active at the same time, so this event would have had a massive effect on its performance.
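As a quick sanity check on that figure (the snapshot interval isn’t shown in the excerpt above, so it has to be inferred):

    average sessions waiting  =  total wait time / snapshot elapsed time
    1.8                       ≈  868 s / ~480 s    (i.e. roughly an 8-minute snapshot)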
The next step would be to look in V$ACTIVE_SESSION_HISTORY to confirm that the waits on “library cache lock” were indeed associated with the subsystem in question — and yes, it turned out to be the case. Now I had to identify the source of this contention, which was a bit tricky, because the BLOCKING_SESSION column of V$ACTIVE_SESSION_HISTORY was blank. There are ways to find the object holding the lock using X$ tables, but I didn’t have SYS access, so instead I decided to check one common scenario for library cache locking: stats gathering. And indeed, there was a stats-gathering job in progress, and it was stuck waiting on a library cache lock held by a runaway session (KILLED status, but still not cleaned up by PMON for whatever reason), while holding a library cache lock itself, thus making yet more sessions wait. Killing the stats-gathering job resolved the issue.
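For reference, here is a minimal sketch of the kind of checks described above. The 30-minute window and the exact columns are illustrative and should be adapted to the actual incident:

-- which sessions and programs were waiting on the event, and on what SQL
select session_id, session_serial#, program, module, sql_id, count(*) as samples
  from v$active_session_history
 where event = 'library cache lock'
   and sample_time > sysdate - 30/1440
 group by session_id, session_serial#, program, module, sql_id
 order by samples desc;

-- is a stats-gathering job currently running?
select owner, job_name, session_id, elapsed_time
  from dba_scheduler_running_jobs;

-- are there killed-but-not-cleaned-up sessions still hanging around?
select sid, serial#, status, program, event
  from v$session
 where status = 'KILLED';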
I’d like to focus on several points here.
1. Identifying the problem. There is a general rule that says you obtain the maximum return on your tuning effort by addressing the events with the largest % DB time (see the great article by Millsap explaining its application to database performance tuning here). But it’s important to understand that this rule doesn’t apply to database performance incidents of a local scale, i.e. when a significant part of the processes on the database is experiencing serious performance degradation, but the database as a whole is still running more or less okay. In such cases we have to focus not on the biggest events, but on any events that are out of the ordinary. However, we should only consider events whose contribution to the overall wait time is comparable to the size of the problem. That’s why it’s convenient to express the performance impact of events in average sessions (see the first sketch after this list).
2. In a situation like this, people are often tempted to use the traditional top-down approach: they turn to the top resource-consuming SQL and pronounce it the root cause. Many innocent long-running queries have been killed that way. Well, that’s wrong: top resource consumers only matter when the affected SQL is experiencing a deficit of that specific resource. If it needs CPU time but all the CPUs are taken by another heavy, overparallelized query, then yes, it’s justified. But more often than not, this approach won’t work. Instead, you have to go where the symptoms take you, and that’s where ASH steps in.
Diagrammatically, I would picture this as an AWR -> ASH approach: use AWR to identify the suspect(s), and use ASH to confirm your suspicions.
3. In this case (as in many other cases) ASH is unable to provide the necessary information about the culprit directly. When this happens, sometimes you can use wait event parameters (the P2 column of ASH) to decipher the SID of the session holding the resource, as with “cursor: pin S” waits in 10g (see the second sketch after this list). Sometimes you can use X$ tables to obtain this information. But you can always use good judgement to make a reasonable guess and confirm it more quickly than any rigorous analysis could, and that’s very important when resolving ongoing production issues. In this particular case, once I suspected the gather-stats job, it took me just a few seconds to confirm that it was indeed responsible.
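To illustrate point 1, here is a minimal sketch of expressing event impact in average sessions directly from ASH. It assumes a one-hour window and relies on the fact that ASH samples active sessions roughly once per second, so sample counts approximate seconds of DB time:

select case when session_state = 'ON CPU' then 'ON CPU' else event end as event,
       count(*)                as ash_seconds,
       round(count(*)/3600, 2) as avg_active_sessions
  from v$active_session_history
 where sample_time > sysdate - 1/24
 group by case when session_state = 'ON CPU' then 'ON CPU' else event end
 order by count(*) desc;

And to illustrate point 3, a sketch of decoding the blocker’s SID from P2 for the 10g mutex waits mentioned above. The exact encoding depends on platform and version (on many 64-bit builds the SID sits in the top four bytes of P2), so verify it on your own system before relying on it:

select sample_time, session_id, sql_id,
       trunc(p2 / 4294967296) as blocking_sid   -- 2^32; some 32-bit builds use trunc(p2/65536)
  from v$active_session_history
 where event = 'cursor: pin S wait on X'
 order by sample_time desc;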
