In previous posts, I investigated how the Oracle process spins and waits for the latch. Now we need the tool to estimate when the latch acquisition works efficiently and when we need to tune it. This tool is the latch statistics. Contemporary Oracle documentation describes v$latch statistics columns as:
Statistic: | x$ksllt column | Documentation description: | When and how it changed: |
GETS | kslltwgt “wait gets” |
Number of times the latch was requested in willing-to-wait mode | Incremented by one after latch acquisition. Therefore protected by latch |
MISSES | kslltwff “wait fails” |
Number of times the latch was requested in willing-to-wait mode and the requestor had to wait | Incremented by one after latch acquisition if miss occured |
SLEEPS | kslltwsl “wait sleeps” |
Number of times a willing-to-wait latch request resulted in a session sleeping while waiting for the latch | Incremented by number of times process slept during latch acquisition |
SPIN_GETS | ksllthst0 | Willing-to-wait latch requests which missed the first try but succeeded while spinning | Incremented by one after latch acquisition if miss but not sleep occured. Counts only the first spin |
WAIT_TIME | kslltwtt “wait time” |
Elapsed time spent waiting for the latch (in microseconds) | Incremented by wait time spent during latch acquisition. |
IMMEDIATE_GETS | kslltngt “nowait gets” |
Number of times a latch was requested in no-wait mode | Incremented by one after each no-wait latch get. May not be protected by latch |
IMMEDIATE_MISSES | kslltnfa “nowait fails” |
Number of times a no-wait latch request did not succeed | Incremented by one after unsuccessful no-wait latch get. Not protected by latch |
Since Oracle 10.2 all the following columns described as “have been deprecated … ;they will always have a value of zero”:
- SLEEP1-SLEEP10
- WAITERS_WOKEN
- WAITS_HOLDING_LATCH
- MISSES_WAITLIST: Number of latch misses on the associated wait list latch
- SLEEPS_WAITLIST: Number of sleeps while trying to acquire the associated wait list latch
- WAIT_TIME_WAITLIST: Wait time for the associated wait list latch
Several other never documented latch statistics have been deprecated also in Oracle 10.2. We have lost important additional information about latch performance. Here I will discuss the remaining statistics set.
As I demonstrated in previous post, since 9.2 Oracle uses completely new latch acquisition algorithm:
Immediate latch get Spin latch get Add the process to the queue of latch waiters Sleep until posted
We need to interpret the latch statistics from this new algorithm point of view.
GETS, MISSES, etc. are the integral statistics counted from the start of the instance. These values depend on complete workload history. AWR and Statspack report changes of integral statistics per snapshot interval. Usually these values are “averaged by hour”, which is much longer then typical latch performance spike.
Another problem with AWR/Statspack report is averaging over child latches. By default AWR gathers only summary data from v$latch. This greatly distorts latch efficiency coefficients.
The latch statistics should not be averaged over child latches.
Parent and child latch statistics are stored in AWR repository only when “statistics_level”=’ALL’ or the snapshots were manually created using dbms_workload_repository.create_snapshot(‘ALL’). The Statspack gathers parent and child latch statistics separately only at 10 level. To avoid averaging distortions the following analysis uses the latch statistics from v$latch_parent and v$latch_children (or x$ksllt in Oracle version less then 11g)
The current workload is characterized by differential latch statistics and ratios. There exist several ways to choose the basic set of statistics. I will choose the most close to Statspack way and define key statistics as:
Description: | Definition: | AWR equivalent: | Dimensions: |
Latch requests arrival rate | |
|
|
Latch gets efficiency | |
– | |
Latch sleeps ratio | |
|
– |
Latch wait time per second | |
||
Latch spin efficiency | |
– |
At this post, I will consider only wait latch gets. The no-wait (“IMMEDIATE”) gets add some complexity only for several latches. I will also assume to be small enough that workload do not change significantly.
Other statistics reported by AWR depend on these key statistics:
- Latch miss rate is
. This is “Misses” from AWR divided by snapshot interval
- Latch waits (sleeps) rate is
. This is per second value of “Sleeps” from AWR
From the queuing theory point of view, the latch is G/G/1/(SIRO+FIFO) system with interesting queue discipline including Serve In Random Order spin and First In First Out sleep wait. Using the latch statistics, I can roughly estimate queuing characteristics of latch. I expect that the accuracy of such estimations is about 20-30%.
As a first approximation, I will assume that incoming latch requests stream is Poisson distributed and latch holding (service) times are exponentially distributed. Therefore, our first latch model will be M/M/1/(SIRO+FIFO).
Oracle measures more statistics then usually consumed by classic queuing models. It is interesting what these additional statistics can be used for.
Average service time:
The PASTA (Poisson Arrivals See Time Averages) property connects ratio with the latch utilization. For Poisson streams the latch gets efficiency should be equal to utilization:
However, this is not exact for server with finite number of processors. The Oracle process occupies the CPU while acquiring the latch. As a result, the latch get see the utilization induced by other processors only. Compare this with MVA arrival theorem. In some benchmarks there may be only
Oracle shadow processes that generates the latch load. In such case we should substitute
instead
in the following estimate:
Here I introduced the the multiplier to correct naive utilization estimation. Clearly, the
multiplier confirms that the entire approach is inapplicable to single CPU machine. Usually
significantly differs from one only for demonstrations on my Dual-Core notebook. For servers its impact is below precision of my estimates. For example for small 8 CPU server the
multiplier adds only 14% correction.
We can experimentally check the accuracy of these formulas and, therefore, Poisson arrivals approximation. can be independently measured by sampling of v$latchholder. The latchprofx.sql script by Tanel Poder do this at high frequency. Within our accuracy we can expect that
and
should be at least of the same order.
To calculate the complete statistics set, I wrote latch_stats_10g.sql script. The script uses latch address as the parameter. It computes differential latch statistics for small interval (30sec) and estimates latch utilization and queue length by sampling. Note that this is only demonstration script. To measure such quantities accurately we need to perform statistical analysis and least squares fits.
We know that , where
is average service (latch holding) time. This allows us to estimate the latch holding time as:
This is interesting. We obtained the first estimation of latch holding time directly from statistics. In AWR terms this formula looks like
The inverse of service time is the service rate . This is the estimation of the upper limit for latch request frequency the instance can serve. As we will see later, this estimation is unreliable.
Wait time:
Look more closely on the summary wait time per second . Each latch acquisition increments the “WAIT_TIME” statistics by time it waited for the latch. According to the Little law, average latch sleeping time is related the length of wait (sleep) queue:
The right hand side of this identity is exactly the “wait time per second” statistic. Therefore:
We can experimentally confirm this conclusion because can be independently measured by sampling of v$process.latchwait column.
Recurrent sleeps:
In ideal situation the process spins and sleeps only once. Consequently, the latch statistics should satisfy the following identity:
Or, equivalently:
In reality, some processes had to sleep for the latch several times. This occurred when the sleeping process was posted, but another process got the latch before the first process received the CPU. The awakened process spins and sleeps again. As a results the previously equality became invalid.
Before 10.2 Oracle directly counted these sequential sleeps in SLEEP? statistics. In 10.2 these statistics became obsolete. However, we can estimate the rate of such “sleep misses” from other basic statistics. The recurrent sleep increments only the SLEEP counter. The SPIN_GETS statistics not changed. The is the ratio of inefficient latch sleeps to misses. The ratio of “unsuccessful sleep” to “sleeps” is given by:
Recurrent sleeps ratio
If this ratio exceeds several percents, it should be investigated. Frequent “unsuccessful sleeps” are inefficient and usually a symptom of OS waits posting problems.
Latch acquisition time:
Latch acquisition time is the average time that process spent to acquire the latch. This is the sum of spin time and wait time. Oracle does not measure the spin time explicitly. However, we can measure it on Solaris platform using simple DTrace script:
#!/usr/sbin/dtrace -s #pragma D option quiet BEGIN { in_latch_acquisition=0; ts=0; n=0; } pid$target::kslgetl:entry / arg0 == $1 / { ts=timestamp; in_latch_acquisition=1; } pid$target::kslgetl:return / in_latch_acquisition / { @timeavg = avg(timestamp - ts); in_latch_acquisition=0; n++; } tick-10s { normalize(@timeavg,1000); printf("Latch gets=%d",n); printa(" average acquisition time=%@d mus\n",@timeavg); trunc(@timeavg); n=0; }
This script sets DTrace triggers (probes) at entry and return of kslgetl() function inside the Oracle shadow process. The script measures timestamp difference between entry to and return from the latch acquisition function. This is the time the process spent acquiring the latch. Using DTrace aggregation this time is averaged over latch acquisitions and reported each 10 sec. Typical output looks like:
/usr/sbin/dtrace -32 -s latch_time.d -p 17141 0x5B7C75F8 ... Latch gets=45810 average acquisition time=24 mus Latch gets=48422 average acquisition time=20 mus Latch gets=49989 average acquisition time=19 mus ...
On other platforms we should rely on statistics. Fortunately in Oracle 9.2-10.2 one can count the average number of spinning processes by sampling x$ksupr.ksllalaq. The process set this column equal to address of acquired latch during active phase of latch get. Oracle 8i and before even fill the v$process.latchspin during latch spinning. Little law allows us to connect average number of spinning processes with the spinning time:
As a result the average latch acquisition time is:
Note that according to queuing theory the “Serve In Random Order” discipline of latch spin do not affect average latch acquisition time. It does not depend on queuing discipline. In steady state the number of processes served during the passage of incoming request through the system should be equal to the number of spinning and waiting processes.
In Oracle 11g the latch spin is no longer instrumented due to a bug. The 11g spin is invisible for SQL. This do not allow us to estimate and related quantities. I will be very usefull if Oracle Develoment fix this and reimplements the LATCHSPIN functionality in future versions.
Comparison of results:
Let me compare the results of DTrace measurements and latch statistics discussed. I wrote latch_times.d script to measure both latch acquisition and holding times and their distribution. For the following demonstration, I artifically induced the library cache contention repeatedly selecting from v$sql_plan. Typical script results look like:
/usr/sbin/dtrace -32 -s latch_times.d -p 17242 0x5B7C75F8 ... latch gets traced: 165180 latch acquisition time histogram for "Library cache latch" laddress=5b7c75f8 Acquisition time: value ------------- Distribution ------------- count 4096 | 0 8192 |@@ 7324 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 151748 32768 |@ 4493 65536 | 1676 131072 | 988 262144 | 464 524288 | 225 1048576 | 211 2097152 | 53 4194304 | 21 8388608 | 1 16777216 | 1 33554432 | 0 Holding time: value ------------- Distribution ------------- count 8192 | 0 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@ 105976 32768 |@@@@@@@@@@@@ 50877 65536 |@@ 6962 131072 | 1986 262144 | 829 524288 | 330 1048576 | 205 2097152 | 34 4194304 | 6 8388608 | 0 average acquisition time =26 mus average holding time =37 mus
Here the histogram shows latch acquisition and holding time distributions in logarithmic scale. The values are in nanoseconds. Compare the above average times with with the results of latch_stats_10g.sql script under the same conditions:
Latch statistics for 0x5B7C75F8 "library cache" level#=5 child#=1 Requests rate: lambda= 20812.2 Hz Miss /get: rho= .078164 Est. Utilization: eta*rho= .156327 Sampled Utilization: U= .143333 Slps /Miss: kappa= .013027 Wait_time/sec: W= .025497 Sampled queue length L= .043333 Spin_gets/miss: sigma= .987073 Sampled spinnning :Ns= .123 Derived statistics: Secondary sleeps ratio = 1.02E-04 Avg latch holding time = 7.5 musec . sleeping time = 1.2 musec . acquisition time = 7.2 musec context switch time estimation tau/S = 5.7
We can see that and
are really close to sampled
and
respectively. The holding and acquisition times from both methods are of the same order. Since both methods are intrusive I conclude that this is remarkable agreement. On the other hand for the purpose of contention diagnostics the main result of this post is:
The latch holding time for the contemporary servers should be normally in microseconds range.
[…] Avg “Library Cache” latch holding time (10-20us) […]
Pingback by Latch Timescales « Latch, mutex and beyond — December 22, 2011 @ 12:44 pm |