Latch, mutex and beyond

October 4, 2009

Oracle latches and general Spinlocks

Filed under: Latch,Spinlock,Theory — andreynikolaev @ 7:45 am

First of all, we need to describe the common background. According to Oracle® Database Concepts 11g Release 2 (11.2), a latch is: “A simple, low-level serialization mechanism to protect shared data structures in the System Global Area.”

Latches and mutexes are Oracle's proprietary realizations of the general spinlock concept. Later I will show that the Oracle latch is one of the simplest and most “obsolete” spinlocks – TTS plus wait. Here I will describe where (and why) the latch sits in general spinlock theory.

According to Wikipedia, a spinlock is “a lock where the thread simply waits in a loop (“spins”), repeatedly checking until the lock becomes available. As the thread remains active but isn’t performing a useful task, the use of such a lock is a kind of busy waiting”.

The use of spinlocks for multiprocessor synchronization was first introduced by Edsger Dijkstra in “Solution of a Problem in Concurrent Programming Control”, CACM, 1965. Since that time a lot of research has been done in the field of mutual exclusion algorithms. Various sophisticated spinlock realizations have been proposed and evaluated. A contemporary review of such algorithms may be found in J.H. Anderson, Yong-Jik Kim, “Shared-memory Mutual Exclusion: Major Research Trends Since 1986” (2003).

Most contemporary spinlock algorithms use much more complex structures than the Oracle latch. These algorithms were designed and benchmarked to work under conditions of ~100% latch utilization. The “classic” spinlock optimization goal is to minimize the number of cache invalidations (or Remote Memory References) as a measure of effectiveness and performance. For the current state of spinlock theory see The Art of Multiprocessor Programming, M. Herlihy and N. Shavit, 2008, Chapter 7, “Spin Locks and Contention”.

To relate “classic” spinlock theory to the Oracle latch we need to dig deeper into contemporary SMP/CMT architecture. An oversimplified, but sufficient for our purposes, view of it is the following:

Each CPU has its own very fast local cache memory. Typically it consists of two or three levels. Nowadays these caches are commonly shared between several CPUs/Cores/HyperThreads. RAM access is much slower and needs hundreds of clock cycles. The unit of data transfer between the local caches, and between cache and RAM, is the cache line – 128 bytes on Pentium (64 bytes on SPARC v9).

When a CPU updates a location inside a cache line, the line becomes dirty and will eventually be copied back to RAM. The shared bus may be any type of interconnect, including a single shared bus, crossbar, NUMA interconnect, hypercube, etc.

This picture looks familiar to any Oracle DBA and resembles Real Application Clusters. Indeed, each CPU is analogous to a Cluster Node, the Shared Bus to the Cluster interconnect, and Memory to Disk. The local CPU cache has its analogue in the Buffer Cache of each instance. Cache lines look like Oracle 8 PCM locks (remember GC_FILES_TO_LOCKS) and cache line transfers resemble Cache Fusion! And immediately we realize that to improve latch performance we need to reduce the Shared Bus/Interconnect transfers.

PL/SQL style pseudocode for the simplest spinlock looks like this:


function latch_get(latch in out byte) return number is
begin
   loop
      /*+ Atomically */ if (latch = 0) then
         latch := 0xFF;     -- latch was free: mark it busy and stop spinning
         exit;
      end if;
   end loop;
   return 1;
end;

The code with the “atomic” hint must be processed as a whole by a single CPU instruction. This instruction must also lock the shared bus and invalidate the corresponding cache lines of other CPUs, to prevent several CPUs from updating this location simultaneously.

Such an atomic instruction is commonly named Test_and_Set. It atomically tests the memory (cache) address and sets it to the desired value. The previous contents of the memory address are returned as the result of the instruction. If the instruction returned 0xFF, the latch was busy. If the return value is 0, the latch was free and this process has just taken it. The actual name (and behavior) depends on the platform. The Test_and_Set instruction is named XCHG on Intel and ldstub on SPARC. I will discuss the platform dependent realization details in a separate post.
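For illustration only, such a primitive can be sketched with a GCC-style compiler builtin (the latch_t and test_and_set names below are mine, not Oracle's; the real latch code is platform specific assembler):

typedef volatile unsigned char latch_t;          /* 0x00 = free, 0xFF = busy */

static unsigned char test_and_set(latch_t *latch)
{
    /* atomically store 0xFF and return the PREVIOUS value:
       0x00 - the latch was free and this process has just taken it,
       0xFF - the latch was already busy                              */
    return __sync_lock_test_and_set(latch, 0xFF);
}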

Using this “common” instruction name, one can write C style spinlock pseudocode in a more compact way:

while( Test_and_Set(lock));

This spinlock realization is called TS (Test and Set). It cyclically tests the latch location atomically until it becomes free. Each T&S operation requires a shared bus cycle. Overuse of atomic operations during latch contention will saturate the shared bus and seriously affect all other simultaneous memory operations.
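A complete TS latch get/release pair might look like this sketch (again assuming the GCC __sync builtins; the names are illustrative):

typedef volatile unsigned char latch_t;          /* 0x00 = free, 0xFF = busy */

static void ts_latch_get(latch_t *latch)
{
    /* every iteration is an atomic Test_and_Set, i.e. a locked shared bus transaction */
    while (__sync_lock_test_and_set(latch, 0xFF))
        ;                                        /* previous value was 0xFF: keep spinning */
}

static void ts_latch_release(latch_t *latch)
{
    __sync_lock_release(latch);                  /* atomically store 0 with release semantics */
}

Note that every spin iteration here is a locked bus transaction, which is exactly what saturates the bus under contention.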

The processor releasing the lock must also contend for bus cycles with the processors trying to acquire it. SMPs do not give priority to the processor releasing the lock. This saturation becomes even worse as the number of CPUs increases.

The obvious spinlock optimization, which is used for the Oracle latch, is TTS (“Test and Test-and-Set”). A TTS spinlock polls by reading the lock nonatomically, and only uses Test_and_Set when the lock is not busy.

It has a slightly more complex pseudocode:

while(lock|| Test_and_Set(lock));

The latency of this spinlock is good; it needs only one additional memory reference to acquire the lock. While the lock is busy, the TTS spinlock does not consume bus cycles; it reads from the local cache. But it has a serious drawback – invalidation storms on latch release.
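Expanded into the same sketch style, a TTS get might look like this (illustrative names, GCC __sync builtins assumed):

typedef volatile unsigned char latch_t;          /* 0x00 = free, 0xFF = busy */

static void tts_latch_get(latch_t *latch)
{
    for (;;) {
        while (*latch)                           /* nonatomic poll: spins inside the local cache */
            ;
        if (__sync_lock_test_and_set(latch, 0xFF) == 0)
            return;                              /* the latch was still free: we own it */
        /* another CPU won the race: fall back to the read-only spin */
    }
}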

When the TTS lock is released, its value is set to 0 atomically. This causes the corresponding cache line to be updated, and the copies in other caches to be invalidated. One of the spinning processors acquires the lock; the others continue to spin.

On an SMP, this causes a flurry of bus activity when the lock is released. Indeed, when the lock is set to 0, some of the spinning processors see a cache miss and request the new lock value into their local caches, with a Test_and_Set pending. The remaining processors have not yet read the lock and have read requests pending.

One processor executes its TS and acquires the lock. This TS operation invalidates the caches of the processors with pending TS requests. Some processors with pending reads load the (now busy) lock into their caches. Other processors with pending TS requests now execute their TS, fail to get the lock, but invalidate all the caches! And this is repeated again and again.

Each subsequent TS uses a bus cycle and invalidates the cached copies of the lock, even though the TS write does not change the memory location’s value. And each processor again must reload the lock value into its cache. This degrades performance.

T. Anderson investigated this phenomenon experimentally in “The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors”. His graphs show that the quiesce time is quadratic for a small number of spinning processors (n), but behaves linearly at large n.

We can roughly explain this as follows. The time to quiesce the invalidation storm should be proportional to the number of bus operations (cycles). If n CPUs are spinning for the lock, then each cache invalidation causes O(n-1) bus cycles to transfer the new value. When the lock is freed, the free value is propagated to the other local caches. If n is small, this free value is propagated to all n spinning CPUs, and each of them makes a TS attempt on the lock, causing about n*(n-1) cache invalidations. Only one of the spinning CPUs takes the lock.

This is why we should estimate O(n^2) time to quiesce this storm. If n becomes large, the free lock value propagates only to a limited number of nearby CPUs (a processor board), and the invalidation storm takes O(n) bus cycles. The transition regime happens somewhere around the number of CPUs on a single board. But we do not have enough data to confirm such estimates.
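In formula form (just a sketch of the same estimate; k stands for the limited number of nearby CPUs that actually see the free value when n is large):

$$
T_{\text{quiesce}} \;\propto\;
\begin{cases}
n + n(n-1) = O(n^2), & n \text{ small (the free value reaches all } n \text{ spinners)},\\[4pt]
k\,(n-1) = O(n), & n \text{ large (only } k \ll n \text{ nearby CPUs see the free value)}.
\end{cases}
$$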

To deal with invalidation storms, T. Anderson proposed several alternative approaches. The simplest one is to introduce an artificial delay after noticing that the lock was released:

while(lock|| Delay()||Test_and_Set(lock));

This greatly reduces the invalidation storm bus traffic on SMPs, but increases the spinlock acquisition latency. The delay value may be static (a different value for each CPU) or increased exponentially on each collision.
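A sketch of the exponentially increasing variant could look like this (the backoff policy and the cap are illustrative choices, not Anderson's exact algorithm; GCC __sync builtins assumed):

typedef volatile unsigned char latch_t;          /* 0x00 = free, 0xFF = busy */

static void tts_backoff_latch_get(latch_t *latch)
{
    unsigned delay = 1;
    for (;;) {
        while (*latch)                           /* read-only spin in the local cache */
            ;
        if (__sync_lock_test_and_set(latch, 0xFF) == 0)
            return;                              /* got the latch */
        /* collision: back off so the spinners do not storm the bus together */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                    /* crude busy delay; volatile keeps it from being optimized away */
        if (delay < 1024)
            delay *= 2;                          /* capped exponential backoff */
    }
}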

To overcome both the invalidation and the bus saturation bounds, sophisticated queuing spinlock realizations were proposed by T. Anderson, as well as by several other authors. Anderson queuing locks are used in Linux as Fairlocks/Jlocks. It is also worth mentioning the famous MCS spinlocks, which were introduced in “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors” by John M. Mellor-Crummey and Michael L. Scott. In such spinlocks each process spins on its own local variable and wakes the next spinner on release. This paper received the 2006 Dijkstra Prize in Distributed Computing. It is these MCS spinlocks that are widely used in Java.
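To give a flavor of such queuing locks, here is a minimal MCS-style sketch (illustrative names; GCC __sync builtins assumed; a production implementation needs additional memory barriers and cache line padding):

#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *volatile next;
    volatile int              locked;
} mcs_node_t;

typedef struct { mcs_node_t *volatile tail; } mcs_lock_t;

static void mcs_get(mcs_lock_t *lk, mcs_node_t *me)
{
    me->next   = NULL;
    me->locked = 1;
    /* atomically append ourselves to the tail of the waiters queue */
    mcs_node_t *prev = __sync_lock_test_and_set(&lk->tail, me);
    if (prev != NULL) {
        prev->next = me;                         /* link in behind the previous waiter */
        while (me->locked)                       /* spin on our OWN, locally cached node */
            ;
    }
}

static void mcs_release(mcs_lock_t *lk, mcs_node_t *me)
{
    if (me->next == NULL) {
        /* no visible successor: try to swing the tail back to empty */
        if (__sync_bool_compare_and_swap(&lk->tail, me, NULL))
            return;
        while (me->next == NULL)                 /* a successor is joining: wait for the link */
            ;
    }
    me->next->locked = 0;                        /* hand the lock directly to the next spinner */
}

Each waiter spins only on its own node, and the release hands the lock directly to the next waiter, so there is no invalidation storm and the lock is FIFO fair.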

T. Anderson provided benchmarks for several spinlock algorithms in his article.

I compare the described spinlock realizations in a single table:

Spinlock realization    Pseudocode                                    Drawbacks
TS                      while( Test_and_Set(lock));                   Bus saturation
TTS                     while(lock|| Test_and_Set(lock));             Invalidation storms (open door)
Delay                   while(lock|| Delay()||Test_and_Set(lock));    Slow under contention
                        (adjustable delay after noticing
                        that the lock was released)
Anderson, MCS, etc.     Queues                                        Higher CPU and memory overhead


As I mentioned already, the Oracle latch uses simple TTS with a subsequent sleep. Why does Oracle not use more advanced spinlocks?

I suppose the reason is that these algorithms were designed to work under conditions of ~100% lock and bus utilization. All the benchmarks that I found compared spinlocks under such conditions. Most benchmarks compare only “empty” spinlocks, without any “critical section”. Under such a load, the CPUs are fully occupied acquiring and immediately releasing the spinlock. But this is clearly not the Oracle latch case.

The Oracle server always does some work under latch protection. The CPU consumed in this critical section should be much larger than the CPU consumed by the latch acquisition itself.

Latch contention will usually saturate the CPUs at low latch utilization values. Typically a Statspack/AWR report shows a latch miss/get ratio (= latch utilization) on the order of 5-10%. I have never seen bus saturation caused by Oracle latches on modern hardware. It was critical in 1990-91, when the above theory was developed, but not now.

To evaluate bus health one can use the Hardware Performance Counters present in contemporary processors. Utilities like VTune, OProfile, busstat, etc. provide access to these counters.

5 Comments »

  1. […] We know a lot about the exclusive latches. This is Oracle realization of TTS spinlocks. Only one process at a time can hold the exclusive latch. Other processes compete for the exclusive […]

    Pingback by Shared latch behaves like enqueue « Latch, mutex and beyond — November 17, 2010 @ 10:16 pm

  2. Hi, Andrey,

    “Nowadays these caches are commonly shared between several CPUs/Cores/HyperThreads” – Only level 3 is shared, levels 1 and 2 are PRIVATE to the core. See IBM Power 7, Sun Sparc IV, Intel x86-64 Nehalem architectures.

    Comment by Yuri Pudovchenko — January 31, 2011 @ 9:26 am

    • Thank you. Of course the complicated contemporary L? cache architecture affects latches and mutexes.
      It would be very interesting if someone repeated Anderson's experiments nowadays.

      Comment by andreynikolaev — January 31, 2011 @ 10:19 am

  3. […] KGX MUTEXES (USING VOLATILES & END WAIT BEFORE MUTEX GET). With this patch Oracle mutex become TTS spinlock. Mutex contention no longer saturates the bus and “poisons” overall system […]

    Pingback by 11.2.0.2 is the right patchset for mutexes « Latch, mutex and beyond — March 19, 2011 @ 10:05 am

  4. […] unlike latch, 10.2-11.1 mutex is "classic" spinlock without sleeps. If the mutex holding time is always small, this algorithm minimizes the elapsed […]

    Pingback by Mutex waits. Part 1. “Cursor: Pin S” in Oracle 10.2-11.1. Invisible and aggressive. « Latch, mutex and beyond — July 9, 2011 @ 1:19 pm

