Latch, mutex and beyond

July 11, 2010

Latch get and spin instrumentation. The unknown knowns. V2

Filed under: Instrumentation,Latch — andreynikolaev @ 12:11 am
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total     Waits Time (s)
Event                                               Waits    Time (s) Ela Time   per sec  per sec
-------------------------------------------- ------------ ----------- -------- --------- --------
enqueue                                         1,801,215   3,281,392    59.82     499.9   910.74
buffer busy waits                               1,984,703   1,235,865    22.53     550.8   343.01
latch free                                      6,425,043     847,386    15.45   1,783.2   235.19
SQL*Net break/reset to client                      50,394      35,937      .66      14.0     9.97
CPU time                                                       23,828      .43               6.61

This is a Statspack report from a 9.2.0.8 instance suffering from heavy latch contention.
Every time I see such a CPU-bound Oracle instance with latch contention, I ask myself: which part of this CPU power is currently being burned on useless latch spin attempts? How many processes are spinning for latches? How can we estimate this?

Unfortunately, I do not have an answer for contemporary versions yet. But in this post I would like to show that we had such estimates before 11g.

We all know that latch waits are well instrumented in the Oracle Wait Interface. Oracle 11.2 has 32 specific latch wait events and one generic 'latch free' event. But all these events cover latch sleeps only. The Oracle Wait Interface knows nothing about latch gets and spins.
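
To see the latch wait events in your version, simply query v$event_name; the exact list varies from release to release:

select name from v$event_name where name like 'latch%' order by name;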

It turns out that Oracle has instrumented the latch acquisition as well, and even documented it. I do not know why it is not more widely known. This instrumentation resides in the process array, v$process. The fixed table behind the v$process view is x$ksupr.

Of course, my post is about v$process.latchwait and v$process.latchspin. The corresponding x$ksupr columns are named ksllawat and ksllaspn. Four other interesting x$ksupr columns, named ksllalaq, ksllawhy, ksllawer and ksllalow, are not included in the v$process view. This is why I have to use x$ksupr, not v$process, in this post. Note that all these cryptic column names begin with kslla. This is the name of the latch-related structure embedded into each x$ksupr row.
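
You can list all the kslla columns yourself using the usual fixed table dictionary trick (run as SYS; x$kqfco and x$kqfta describe the x$ columns and tables):

select c.kqfconam column_name
  from x$kqfco c, x$kqfta t
 where c.kqfcotab = t.indx
   and t.kqftanam = 'X$KSUPR'
   and c.kqfconam like 'KSLLA%';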

As I wrote in my previous post, latch acquisition functions like kslgetl(laddress, wait, why, where) use the "where" and "why" parameters to specify the reason for the latch get.

For the demonstration I chose the latch named "second spare latch". This "spare" level 0 latch exists, but is not used, in Oracle 10.2.0.4.
I took this latch with the "why" and "where" parameters equal to 4 and 5 using oradebug:


SQL> select addr,level#,gets from v$latch where name='second spare latch';
ADDR         LEVEL#       GETS
-------- ---------- ----------
50006434          0          0
SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x50006434 1 4 5
...

Then I made another request for this latch from a second SQL*Plus connection, with pid=22:


SQL> oradebug call kslgetl 0x50006434 1 2 3

I will monitor this second latch acquisition in the corresponding x$ksupr/v$process row.

At first, the kslgetl function sets the value of x$ksupr.ksllalaq to the address of the latch being acquired (laddress) and tries to get the latch by an "immediate get":


SQL> select indx pid,ksllalaq,ksllawat,ksllalow,ksllawhy,ksllawer from x$ksupr
where indx=22;
       PID KSLLALAQ KSLLAWAT   KSLLALOW   KSLLAWHY   KSLLAWER
---------- -------- -------- ---------- ---------- ----------
        22 50006434 00                0          0          0

In 10g and below, ksllalaq equals laddress during all the active phases of latch acquisition, both the immediate get and the spin. Unfortunately, 11g sets ksllalaq during the immediate latch get only. I hope this is a bug. My guess is that ksllalaq means KSLLA Latch AcQuired.

Of course, this state is very short-lived. To see these latch get steps, one needs either the ultrafast sampling technique invented by Tanel Poder, or to artificially slow down the latch acquisition using gdb.

If the immediate get fails, the Oracle process begins to spin for the latch. In the past, Oracle filled ksllaspn (v$process.latchspin) during this spin. This is obsolete now. The last Oracle version that set ksllaspn to laddress while spinning for the latch was 8i. I have no idea why Oracle decided not to maintain this instrumentation in contemporary versions. ksllaspn always has the value '00' now.
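
You can check this on your own version: even under heavy latch contention, the following query returns no rows in 9i and above:

select indx pid, ksllaspn from x$ksupr where ksllaspn != hextoraw('00');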

After the unsuccessful immediate get and spin, the Oracle process begins to wait (sleep) for the latch:


       PID KSLLALAQ KSLLAWAT   KSLLALOW   KSLLAWHY   KSLLAWER
---------- -------- -------- ---------- ---------- ----------
        22 00       50006434          0          2          3

The process sets its ksllawat (v$process.latchwait) column to laddress during the latch wait. And only during the latch wait are the ksllawhy and ksllawer columns set to the "why" and "where" parameters of kslgetl, respectively.

This is also the time when the Oracle Wait Interface (v$session_wait) shows the wait on the 'latch free' event:


SQL>select sid,event,p1raw,p2,p3,state  from v$session_wait where sid=153 ;

 SID EVENT                P1RAW            P2         P3 STATE
---- -------------------- -------- ---------- ---------- ----------
 153 latch free           50006434         46          0 WAITING
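
The p1raw of the 'latch free' event is the latch address, so it should always agree with ksllawat. Here is a sketch of a join between the two views, through v$session.paddr and the process address:

select w.sid, p.indx pid, w.p1raw, p.ksllawat
  from v$session_wait w, v$session s, x$ksupr p
 where w.sid = s.sid
   and s.paddr = p.addr
   and w.event = 'latch free';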

When the latch is finally acquired, all these columns become '00'. But the ksllalow column comes into action. This column is a bitmask that indicates the levels of all the latches currently held by the process. I think Oracle uses this column to check the "increasing latch level" rule that avoids deadlocks. As a result, if a process holds any latch, its ksllalow is not '00'. For example, ksllalow has the value 1 (2^0) for a latch of level 0:


       PID KSLLALAQ KSLLAWAT   KSLLALOW   KSLLAWHY   KSLLAWER
---------- -------- -------- ---------- ---------- ----------
        22 00       00                1          0          0
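
Since the column is a bitmask of levels, it can be decoded in SQL. A sketch, assuming ksllalow is exposed as a number (as the listings above suggest) and checking levels 0 through 15:

select p.indx pid, lvl.l latch_level
  from x$ksupr p,
       (select rownum - 1 l from dual connect by rownum <= 16) lvl
 where bitand(p.ksllalow, power(2, lvl.l)) > 0;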

Of course, we cannot see the complete story from x$ksupr alone. Inside each slot of the process array, Oracle maintains an array of the latches currently held by the process. This data is externalized to SQL via x$ksuprlat/v$latchholder. When we query this fixed table, Oracle actually scans through the process table (x$ksupr) and reports the latches held. Tanel Poder used this for systematic troubleshooting of latch contention. Also, when the latch is acquired, Oracle saves the values of the "where" and "why" parameters into the latch structure. These can be seen in the kslltwhr and kslltwhy columns of the x$ksllt fixed table (or x$kslltr_parent and x$kslltr_children in 11g).
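
While pid 22 from our demonstration holds the spare latch, both places can be queried (the x$ksllt query assumes kslltnam is the latch name column, as in 10g):

select pid, sid, laddr, name from v$latchholder;

select kslltwhr, kslltwhy from x$ksllt where kslltnam = 'second spare latch';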

When the process finally releases the latch, all the described columns return to '00'.

This instrumentation gives us a lot of possibilities. The most exciting one is:

select count(*) from x$ksupr where ksllalaq <> hextoraw('00');

In 10g and below, this query shows the number of processes actively acquiring latches. This is also the number of CPUs currently busy with latch acquisitions!

Since this number fluctuates quickly, we should sample the table and average over time. Such an average provides an estimate of the part of the "load average" consumed by latches. The following block samples x$ksupr ten times per second for 10 seconds and reports the average:


SET SERVEROUTPUT ON
DECLARE
   Samples      BINARY_INTEGER := 100;   -- number of samples
   SampleFreq   NUMBER := 1 / 10;        -- sampling interval, seconds
   lavg         NUMBER := 0;
BEGIN
   FOR i IN 1 .. Samples
   LOOP
      FOR lload IN (SELECT COUNT (1) laq
                      FROM x$ksupr
                     WHERE ksllalaq != HEXTORAW ('00'))
      LOOP
         lavg := lavg + lload.laq;
      END LOOP;
      DBMS_LOCK.sleep (SampleFreq);
   END LOOP;
   lavg := lavg / Samples;
   DBMS_OUTPUT.put_line ('Average latch acquisition load=' || lavg);
END;
/

If it is comparable to the number of processors, we definitely have a problem. This number also gives us an estimate of the CPU power that we can "tune" by the "_SPIN_COUNT" parameter alone.

Unfortunately, in 11g we lost this possibility. I think this is a bug. The ksllalaq column is now non-'00' during the immediate get only. This is much less useful for diagnostics and tuning of latch contention.

Other applications of the x$ksupr latch get instrumentation.
One can find all the processes waiting for latches using x$ksupr alone:

select * from x$ksupr where ksllawat <> hextoraw('00');

and all the processes holding any latch:

select * from x$ksupr where ksllalow <> hextoraw('00');

The number of processes waiting for each latch:

select count(*), ksllawat from x$ksupr
 where ksllawat <> hextoraw('00')
 group by ksllawat order by 1 desc;

But this can also be done from the Oracle Wait Interface (v$session_wait) and has no advantage over it.
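
For comparison, here is the Wait Interface equivalent counting waiters per latch address (a sketch for the generic 'latch free' event only; versions with specific 'latch:' events need those included too):

select count(*), p1raw latch_address
  from v$session_wait
 where event = 'latch free' and state = 'WAITING'
 group by p1raw order by 1 desc;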

The v$process.latchwait column is actually used in catblock.sql:

...
union all                                         /* procs waiting on latch */
  select sid, 'LATCH', 'None', 'Exclusive', latchwait,' '
     from v$session s, v$process p
    where latchwait is not null
     and  p.addr = s.paddr


I think the idea to count the number of such processes first appeared in "Scaling Oracle8i" by James Morle a decade ago:

“Highlights of V$PROCESS (apart from the server side SPID) are the LATCHWAIT and LATCHSPIN columns. These columns show the address of the latch on which the process is waiting or spinning, respectively, and can be a good source of information when aggregated as a count.”

In summary, x$ksupr/v$process contains additional information about the current state of latch acquisition:

x$ksupr.ksllalaq – address of the latch the process is currently acquiring – during the immediate get (and the spin before 11g)

x$ksupr.ksllawat – address of the latch the process is currently waiting for (v$process.latchwait)

x$ksupr.ksllalow – if not "00", the process currently holds a latch; a bitmask of held latch levels

x$ksupr.ksllawer – the "where" parameter of the latch acquisition. Joins to x$ksllwc

x$ksupr.ksllawhy – the "why" parameter of the latch acquisition
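
Putting it all together, a single snapshot query over the columns above lists every process currently acquiring, waiting for, or holding a latch:

select indx pid, ksllalaq, ksllawat, ksllalow, ksllawhy, ksllawer
  from x$ksupr
 where ksllalaq <> hextoraw('00')
    or ksllawat <> hextoraw('00')
    or ksllalow <> hextoraw('00');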

I do not know why this documented instrumentation is not better known.


