Latch, mutex and beyond

October 25, 2011

Mutex waits. Part II. “Cursor: Pin S” in Oracle 11.2 _mutex_wait_scheme=0. Steps out of shadow.

Filed under: 11.2,Contention,DTrace,Instrumentation,Mutex,OS tuning,Patches — andreynikolaev @ 4:23 pm

I would like to describe how Oracle versions 11.2.0.1-11.2.0.2.1 waited for mutexes. This algorithm also appears to be used in post-11.2.0.2.2 PSUs and new 11.2.0.3 patchset as _mutex_wait_scheme=0.

My previous post demonstrated that before version 11.2:

  • “Cursor: pin S” was pure wait for CPU. Long “cursor: pin S” waits indicated CPU starvation.
  • Mutex contention was almost invisible to Oracle Wait Interface
  • Spin time to acquire mutex was accounted as CPU time. It was service time, not waiting time.

Things changed. Mutex waits in Oracle 11.2 significantly differ from previous versions. Contemporary mutex waits are not CPU aggressive anymore, completely visible to Oracle Wait Interface and highly tunable.

I. Oracle 11.2.0.1. Mutex spin visibility

In previous post, I introduced the testcase in which session waited for mutex during 49 sec. In Oracle 11.2.0.1 for Solaris SPARC on T2000 server, this testcase results in the following AWR report:

sqlplus /nolog @cursor_pin_s_waits.sql
...

Top 5 Timed Foreground Events

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
DB CPU   55   99.84  
cursor: pin S 1 49 49261 89.05 Concurrency
db file scattered read 87 0 1 0.14 User I/O
db file sequential read 360 0 0 0.06 User I/O
control file sequential read 89 0 0 0.00 System I/O

Previously, in versions 10.2-11.1 Oracle Wait Interface registered millions “cursor: pin S” waits of microseconds duration. In 11.2.0.1 the testcase demonstrated single “wait” for 49 sec. However, truss reveals many yield() syscalls exactly like in previous versions:

truss -p 14287
...
/1:	yield()						= 0
/1:	yield()						= 0
/1:	yield()						= 0
/1:	yield()						= 0
...

Essentially, this is busy wait exactly as in 10.2. It is not surprising to find millions of “sleeps” in v$mutex_sleep_history section of AWR report:

Mutex Sleep Summary

Mutex Type Location Sleeps Wait Time (ms)
Cursor Pin kksfbc [KKSCHLCREA] 2,734,166 0


Like 10.2, the “wait time” appeared in this table supposedly is the summary time spent in CPU runqueue during yield() syscalls. It is zero in my testcase, I think due to a bug. Oracle 11.2.0.1 mutex is the same “classic” spinlock without sleeps as before.

On the other hand, Oracle 11.2 Wait Interface computes “wait time” as a duration between the first spin and successful mutex acquisition. As a result we observe only one wait in v$system_event.

Oracle 11.2 mutex wait includes both spinning and waiting. Therefore, it is “expected” to observe longer mutex waits after upgrade. You may see a lot of such bugs on MOS. Previously invisible, mutex contention steps out of the shadow in 11.2

Naturally, in 11.2 the significant part of “cursor: pin S” mutex wait time is pure spinning. This spinning also accounted as CPU time. However, the double accounting is not propagated into “DB time” statistics. As a result, in previous “Top 5 Events” table we observe two rows constituting 90% of “DB Time” each.

It is straightforward to compute an average time of “spin and yield” mutex cycle. We observed 2,734,166 cycles during 49.261 sec. One cycle took around 49/2,734,166 \approx 18 us. This is the large time. Another core of my 1.2GHz T2000 can execute 22 thousands instructions during this time and release the mutex.

In summary, the 11.2.0.1 the “Cursor: Pin S” waits became more visible, but still aggressive as were in previous versions. This also applies to “library cache: mutex X” events.

II. Oracle 11.2.0.2. No more excessive CPU consumption

The testcase in 11.2.0.2 on the same server demonstrated almost twice less CPU consumption and sleeps:

Top 5 Timed Foreground Events

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
cursor: pin S 1 49 49259 91.71 Concurrency
DB CPU   36   67.21  
db file scattered read 1175 0 0 0.07 User I/O
db file sequential read 482 0 0 0.06 User I/O
control file sequential read 91 0 0 0.01 System I/O


Mutex Sleep Summary

Mutex Type Location Sleeps Wait Time (ms)
Cursor Pin kkslce [KKSCHLPIN2] 1,739,264 0

The mpstat utility confirms that the process waiting for “cursor: pin S” no more consumes the processor entirely. The truss output showed that in addition to yield() syscalls the waiting process sometimes issue pollsys() syscall with 1 ms timeout. This is sleeping system call.

DTrace script syscalls_in_mutex_wait.d traces system calls made by Oracle process inside the mutex wait. The script reveals interesting pattern. Process slept during 1 ms after 99 yields:

sql>@cursor_pin_s_waits.sql
...
kgxSharedExamine()
  yield() call repeated 99 times
  pollsys()   timeout=1 ms
  yield() call repeated 99 times
  pollsys()   timeout=1 ms
  yield() call repeated 99 times
  pollsys()   timeout=1 ms
...

It is these short waits that decrease CPU consumption. It is worth to note that for consistency with other platforms I configured millisecond timer resolution in my Solaris using hires_tick=1. See the discussion below.

Let us crosscheck the times. “Spin and yield” cycles consumed 1,739,264\times18us=31 sec. The remaining 49-31=18 seconds were spent in 1,739,264/100=17,392 sleeps. Average sleep duration was 18/17,392\approx 1.03ms as expected.

On Linux platform Oracle 11.2.0.2 uses sched_yield() and select() syscalls in mutex wait. Both calls counted as sleeps in v$mutex_sleep.

Therefore, the 11.2.0.2 the “cursor: pin S” wait event no more starve CPU. This reduces probability of CPU thrashing due to mutex waits.

III. Yield and sleep tuning parameters

It makes sense to dig deeper inside this waits pattern because it is used since 11.2.0.2.2 PSU as _mutex_wait_scheme=0. In fact, my further experiments used 11.2.0.2.2.

One millisecond sleep per hundred yields? These values should be tunable. Quick search found interesting family of parameters:

SELECT indx + 1 num,ksppinm NAME,ksppstvl VALUE,
       DECODE (BITAND (ksppiflg / 65536, 3),1, 'IMMEDIATE',2, 'DEFERRED',3, 'IMMEDIATE','FALSE') 
           ISSYS_MODIFIABLE,
       ksppdesc DESCRIPTION
  FROM x$ksppi x  join x$ksppcv using (indx ) 
 where ksppinm like '%yield%';
NUM NAME VALUE ISSYS_MOD DESCRIPTION
31 _wait_yield_mode yield IMMEDIATE Wait Yield – Mode
32 _wait_yield_hp_mode yield IMMEDIATE Wait Yield – High Priority Mode
33 _wait_yield_sleep_time_msecs 1 IMMEDIATE Wait Yield – Sleep Time (in milliseconds)
34 _wait_yield_sleep_freq 100 IMMEDIATE Wait Yield – Sleep Frequency
35 _wait_yield_yield_freq 20 IMMEDIATE Wait Yield – Yield Frequency

All these “_wait_yield…” parameters were introduced in 11.2.0.2.

The default value of _wait_yield_mode is “yield”. In this mode Oracle process first yields the CPU during mutex waits. The _wait_yield_sleep_freq value (minus one) is the number of yield() syscalls attempted before sleep. The _wait_yield_sleep_time_msecs defines the sleep duration. Let me experiment with them:


SQL>alter system set "_mutex_wait_scheme"=0;
SQL>alter system set "_wait_yield_sleep_time_msecs"=2  "_wait_yield_sleep_freq"=3;
...
kgxSharedExamine()
  yield() call repeated 2 times
  pollsys(...,timeout=2 ms,...)
  yield() call repeated 2 times
  pollsys(...,timeout=2 ms,...)
...


After such change, the process sleeps for mutex during 2 milliseconds on each 3d yield.

The complementary wait mode for scheme 0 mutexes is ‘sleep’. In this mode Oracle process first sleeps then yields. Look at the trace:


sql>alter system set "_wait_yield_mode"='sleep';
sql>@cursor_pin_s_waits.sql
...
kgxSharedExamine()
  pollsys()   timeout=1 ms    call repeated 19 times
  yield()
  pollsys()   timeout=1 ms    call repeated 19 times
  yield()
  pollsys()   timeout=1 ms    call repeated 19 times
  yield()
...


The number of sleeps between yields is determined by the _wait_yield_yield_freq parameter, which defaults to 20. Again, I will demonstrate the effect of parameters change:


sql>alter system set "_wait_yield_mode"='sleep';
sql>alter system set "_wait_yield_sleep_time_msecs"=3 "_wait_yield_yield_freq"=3;
sql>@cursor_pin_s_waits.sql
...
kgxSharedExamine()
  pollsys()   timeout=3 ms    call repeated 2 times
  yield()
  pollsys()   timeout=3 ms    call repeated 2 times
  yield()
  pollsys()   timeout=3 ms    call repeated 2 times
  yield()
...

These parameters provide great flexibility to tune mutex sleeps and yields. I like this mutex wait scheme, because it is more tunable than others. For example, I can easily simulate 10g mutex wait behavior:


sql>alter system set "_wait_yield_mode"='yield' "_wait_yield_sleep_time_msecs"=0;
sql>@cursor_pin_s_waits.sql
...
kgxSharedExamine()
 yield() call repeated 1021648 times
...

In addition, I can simulate the Patch 6904068 with _first_spare_parameter=1 without installing the patch itself:


sql>alter system set "_wait_yield_sleep_time_msecs"=10 "_wait_yield_sleep_freq"=0;
sql>@cursor_pin_s_waits.sql
...
kgxSharedExamine()
  pollsys()   timeout=10 ms    call repeated 4356 times
...

I will discuss in separate posts when and how this scheme should be used to achieve performance gains. Anticipating schemes comparison, I would like to emphasize that this is the fastest and most CPU consuming scheme under moderate mutex contention.

IV. Mutex waits in high priority processes

The “yield” mutex wait mode may cause starvation when Oracle processes run at different priorities. Higher priority processes yielding the CPU are not preempted by lower priority processes. We know that since 10g some of Oracle processes have Real-Time priority. By default only LMS and VKTM processes scheduled in RT class. These processes do not use mutexes. However, we may need to schedule other processes as real-time.

The _wait_yield_hp_mode parameter controls how high priority Oracle processes wait for mutexes. It allows to specify different wait modes for standard and high priority processes. Look how it works:

SQL>alter system set "_high_priority_processes"='LMS*|VKTM|SMON' scope=spfile;
SQL>startup force;
SQL>alter system set "_mutex_wait_scheme"=0;
sql>alter system set "_wait_yield_mode" ='yield' "_wait_yield_hp_mode" ='sleep';


Now SMON process runs at real-time priority class (RT):

$ ps -afec
     UID   PID  PPID  CLS PRI    STIME TTY         TIME CMD
...
  oracle  24070     1   RT 101 05:50:05 ?           0:14 ora_smon_E112
...

It is easy to find using DTrace that SMON in my instance repeatedly acquires mutex at address 0x42CD4C328.

$/usr/sbin/dtrace -ZCs syscalls_in_mutex_wait.d -64 -p 24070
...
kgxSharedExamine(0x10C1E7BE0,mutex=0x42CD4C328,aol=0x43DF48798)
...


SQL protected by this mutex is: “lock table sys.mon_mods$ in exclusive mode nowait“. I will artificially block this mutex using mutex_e_mode.sql script and wait for the next SQL execution:

$sqlplus /nolog @mutex_e_mode 42CD4C328
...
kgxSharedExamine(0x10C1E7BE0,mutex=0x42CD4C328,aol=0x43DF48798)
  pollsys()   timeout=1 ms    call repeated 19 times
  yield()
  pollsys()   timeout=1 ms    call repeated 19 times
  yield()
...


This confirmed the ‘sleep‘ mode for mutex waits in high priority process. At the same time other processes use ‘yield‘ mode.

IV. Platform sleeps granularity

With default Solaris OS parameters the CPU consumption of my testcase drops to 7 seconds. This is due to default 100Hz Solaris timer granularity. It is easy to check using dtruss that “millisecond” sleep actually took 10ms elapsed time. It looks exactly like setting “_wait_yield_sleep_time_msecs”=10 parameter.

Therefore it is not enough to request the 1 millisecond timeout in pollsys() or select(). Operating system itself must support milliseconds sleep granularity.

Not long ago, most Unixes used coarse-grained sleeps with 1 cs precision. For example, even realtime pre-2.6 Linux kernels could not wait with milliseconds sleeps accuracy. Nowadays, the latest RedHat and Oracle Enterprise Linux versions certified for Oracle 11.2 support milliseconds resolution. The same is true for HP-UX.

Solaris had built-in high-resolution waits functionality for twenty years, but still use centisecond granularity by default. To turn on the high-resolution 1000Hz timer one need to add “set hires_tick = 1” into /etc/system and reboot. Most Oracle instances on Solaris operate at 100Hz timer frequency.

Whether this granularity choice makes sense? It depends upon load and degree of mutex contention. The 1 ms sleeps are 10 times shorter then waits used by Patch 6904068 and default mutex wait scheme 2. If you have enough CPU power, short waits may be more beneficial because 10ms is huge time in spinlock world. Session waiting for 1 cs will skip 20 million instructions executed on another 2GHz CPU. Latch waits, for comparison, normally have less then millisecond duration.

Look at the following graph. It compares time to complete my “library cache mutex X” testcase for Solaris T2000 with 100Hz and 1000Hz timers.

Predictably, when the number of threads is small, both graphs are identical. In region of moderate contention, the high-granularity configuration performs upto 10% percents better. However, in high contention region the 100Hz timer definitely wins.

Complementary to the completion time is throughput. For my testcase it is the number of calls to dbms_application_info.set_client_info() per second. The graph below compares the throughputs achieved for both wait resolutions:

We see almost linear increase of througput until all 8 cores of T2000 became busy. Then the Chip MultiТhreading changed the slope. The linearity continues upto 2 CMT threads per core. After this point thrashing diminishes the throughput. Again, the 1000Hz wait resolution is more effective at moderate-, and 100Hz is more robust at the high-contention regions

This section demonstrates that adjustment of timer granularity provides additional tuning possibility for _mutex_wait_scheme=0.

Summary

Here I discussed how “cursor: pin S” mutex wait works in Oracle 11g Release 2 _mutex_wait_scheme=0.
Contemporary mutex waits are not CPU aggressive anymore and became visible to Oracle Wait Interface. I described related Oracle parameters and influence of OS sleep granularity. The next post will explore other mutex wait schemes.

About these ads

11 Comments »

  1. Very informative blog . Thank You.
    Best regards.
    GregG

    Comment by goryszewskig — October 26, 2011 @ 3:41 pm | Reply

  2. Hi,
    reading this entry third time and still not sure how do You know that:
    “SQL protected by this mutex is: “lock table sys.mon_mods$ in exclusive mode nowait“ .
    I mean how to find which mutex protect which sql.
    Regards
    GregG

    Comment by goryszewskigGregG — December 5, 2011 @ 2:45 pm | Reply

    • Actually I had got the mutex 0x42CD4C328 by mutex_e_mode.sql script first and then observed which SQL hangs.

      Unfortunately there is no fixed table to find “Cursor pin” mutex protecting SQL.

      You can find mutex address in x$mutex_sleep_history only after sleep for mutex occured.

      Comment by andreynikolaev — December 14, 2011 @ 12:52 pm | Reply

  3. Thanks, got that.
    Regards
    GregG

    Comment by Greg — December 14, 2011 @ 6:08 pm | Reply

  4. [...] explored the _mutex_wait_scheme 0 in detail previously. Here is just a brief summary. This [...]

    Pingback by Mutex waits. Part III. Contemporary Oracle wait schemes diversity. « Latch, mutex and beyond — July 30, 2012 @ 12:22 pm | Reply

  5. Can setting this parameter _mutex_wait_scheme and _mutex_wait_time cause SQLs having high version count? We ran into issues with SQLs starting having high version count after we applied some 11.2.0.2 patches and also setting these two parameters.

    Comment by Wenju Weng — October 3, 2012 @ 9:08 pm | Reply

    • The mutex wait scheme change can not directly induce high SQL version counts iself.
      However, the 11.2.0.2 is known to have bugs concerning high SQL version counts.
      This is interesting.

      1. What values of _mutex_wait… parameters did you set?
      2. Can you show example of SQL having high version count and result of script from MOS Note 438755.1?
      3. Please look on MOS Note:
        Patch 10187168 – Enhancement to obsolete parent cursors if VERSION_COUNT exceeds a threshold [ID 10187168.8]
        for possible workarounds.

      Best Regards
      Andrey Nikolaev

      Comment by andreynikolaev — October 4, 2012 @ 9:49 am | Reply

  6. Thanks for your response, Andrey.

    These are the parameter settings –

    NAME TYPE VALUE
    ———————————— ———– ——————————
    _mutex_wait_scheme integer 1
    _mutex_wait_time integer 10

    Our system is a SAP environment, these happen to be recommended values based on SAP note.
    The high version count and Cursor mutex type of waits happened after we applied SBP “11.2.0.2.3 – 201109″. It seems for each thread that runs the same query, one child number is spawned. The reason for cursor not sharing was very generic – Optimizer mismatch (12). But when I looked at v$sql_optimizer_env, the two child were using the same optimizer environments. At one point, when the high version count reaches certain threshold, the Cursor Mutex type of waits accumulated and caused our system freeze.
    It seems this high version count issue is associated with plan baselines or sql profiles. The workaround that I have found was to remove the SQL Profile and SQL Plan baselines associated with top version count/mutex wait query and then flush the query out of memory by running a grant statement on the underlying table to invalidate this query.
    However, this workaround is costing us because we rely on SQL Profiles or Plan baselines to enforce optimal execution plans. Optimizer is unable to derive the best plan for certain top queries on its own because the statistics can’t really capture the data dependencies or relationship. BTW, we turned off bind peek and adaptive cursor sharing based on SAP recommendation also.
    So I am desperate to find a way to enforce optimal execution plan without incurring high version count. I will try outline or sql patch methods next to see whether that helps.
    We opened a SR# with Oracle, but so far it looks like they wanted us to live with it.
    Other aggressive means to enforce optimal execution plans would be to turn on bind peek, reduce or consolidate similar indexes, gather extended column statistics (I tried this one, not much luck), or manipulate the statistics (tried this one also, a bit luck, but very nervous about doing so)
    since these top queries get executed million times a day, we can’t really allow child cursors to creep up like this without sharing – too much hard parses. So I doubt obsoleting parent cursors is an option for us. Also, that would cause SQL_ID to change right? We use SQL_ID for tuning and monitoring a lot, so a bit inconvenient also for performance management.

    Comment by Wenju Weng — October 4, 2012 @ 8:37 pm | Reply

    • Hello!
      This is even more interesting.

      1. What type of mutex events do you observe?
      Can you show the “Top 5″ and “Mutex Sleeps” sections from AWR.

      2. >the two child were using the same optimizer environments.
      How many cursor childs do you observe?
      Are the OPTIMIZER_ENV values in V$SQL identical for all childs?

      3. Run my mutex_waits.sql (or mutex_ash_waits.sql) script.
      This will show object that caused mutex contention.
      Is this object the SQL in problem or something else?

      4. What 11.2.0.2 PSU and other patches did you install?
      Your description looks similar to bug 12902872 having base bug 11719151.
      Was the the patch 11719151 installed already?

      5. Concerning the patch “10187168 – Enhancement to obsolete parent cursors”
      > we can’t really allow child cursors to creep up like this without sharing – too much hard parses.

      This depends on number of child cursors. If you have thousands versions, the hard parses may be preferreable.

      > Also, that would cause SQL_ID to change right?
      No, obsolescence of child cursors will not change SQL_ID.

      Best Regards
      Andrey Nikolaev

      Comment by andreynikolaev — October 5, 2012 @ 7:45 pm | Reply

  7. Hi, Andrey, To answer your questions,
    1. Normally it’s these three types of mutex wait –
    ‘library cache: mutex X’,’cursor: pin S wait on X’,’cursor: pin S’
    2. The child version count can go up to more than 1000 versions for certain queries. The Mutex wait events are only triggered when version count reaches certain threshold. But the threshold is not consistent with all queries we had problem with – it depends on the query itself and also how many multiple sessions are running the query at the time.
    3. We don’t have mutex wait as of right now. I set up a job to monitor the mutex wait events using v$active_session_history, and if it reaches certain thresholds, then the job would drop the plan baselines or sql profiles associated with the query. The Mutex wait events are always associated with Queries with sql plan baselines or SQL Profiles. And our workaround was to drop them and then flush the objects out of memory to allow reloads.
    4. We applied SBP “11.2.0.2.3 – 201109″ and started to see the issue in July. Then we applied the bundle patch for 11.2.0.2.6, but it didn’t help with the issue. Then Oracle support devised a backported patch for us (patch 14641531), it didn’t resolve the high version count issue.
    The bug 12902872 descriptions do sound like the problem we had. But in my opinion, the problem is not within the parameter optimizer_use_sql_plan_baselines setting itself, but with the sql plan baselines for certain type of queries. Because for most of queries that are using plan baselines in our environments , we don’t have the high version count or mutex issue. Only for a handful queries (20+ queries among over 10K different queries, (We have been using plan baselines extensively since our upgrade to 11.2.0.2 from 10g in 2011, the base release was working fine for us until we applied the SBP 11.2.0.2.3 in July this year)). In addition, not only we can not use plan baselines for these queries, we can’t use SQL Profiles either to enforce the optimal plan usage for these queries.

    I have a good news to share with you though. Since the Oracle support was unable to provide a good solution to us (I was told to live with it), and we really do need plan enforcements for these top queries, I experienced with the other plan enforcement feature in 11g – SQL Patch and it seems that it’s working like a charm. I was able to enforce the correct execution plans for these queries and also there’s only one version associated with each of these queries. My next step is to completely breakaway from sql plan baselines and sql profiles by converting them into SQL Patches. We have about 200 queries that need plan enforcements – I will be writing a script to convert them to use SQL Patches.
    If you are interested in the details of our problems, please refer to SR 3-6200142891 – Not sure whether you are able to view it. So far I had not told Oracle support that we have found an alternative.

    Thanks for your help and interest in this – Wenju

    Comment by Wenju Weng — October 8, 2012 @ 7:09 pm | Reply

    • Hi Wenju,

      We are facing similar problem in our environment due to mutex X waits. would you able to share the logic/code of your job “to monitor the mutex wait events using v$active_session_history, and if it reaches certain thresholds, then the job would drop the plan baselines or sql profiles associated”

      Thanks,
      Nilesh

      Comment by Nilesh — November 21, 2012 @ 5:21 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 65 other followers

%d bloggers like this: