Latch, mutex and beyond

September 20, 2013

V$KERNEL_IO_OUTLIER

Filed under: 12.1,DTrace — andreynikolaev @ 9:10 pm

In these pre-OOW 2013 days, everyone is excited about the new features of Oracle 12c. This post describes a particularly interesting one – IO outlier profiling in V$KERNEL_IO_OUTLIER.

According to the Oracle documentation and Glenn Fawcett's blog, V$KERNEL_IO_OUTLIER uses DTrace to collect information about the amount of time an IO spends in operating system components. DTrace is the Solaris dynamic tracing tool that allows one to observe practically all OS activity.
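
For readers who have never used it, a classic DTrace one-liner (just an illustration of mine, unrelated to the view) counts system calls across the whole machine, broken down by executable name:

$ /usr/sbin/dtrace -n 'syscall:::entry { @[execname] = count(); }'

Press Ctrl-C to stop the one-liner and print the aggregated counts.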

V$KERNEL_IO_OUTLIER should provide an operating-system-level latency breakdown of IOs taking an excessive amount of time to complete. This would be an essential diagnostic feature. However, many people complained that V$KERNEL_IO_OUTLIER is always empty. This was enough for me to start investigating it.

I will discuss the view, when and how it works, the underlying DTrace script and how it can be used for pre-12c Oracle.

I. Oracle and Solaris configuration.

The Oracle documentation explicitly states that the view is only populated on the Solaris platform. Of course, the oracle user needs OS privileges to run DTrace:

$ su - root
Password:
Oracle Corporation      SunOS 5.10      Generic Patch   January 2005
# usermod -K defaultpriv=basic,dtrace_kernel,dtrace_proc,dtrace_user oracle
#


To encourage outlier detection, I also lowered the parameter _io_outlier_threshold to 100 ms in my Oracle 12.1 test database on a VMware Solaris 10 sandbox. However, this was not enough. V$IO_OUTLIER and V$LGWRIO_OUTLIER were full of rows, but V$KERNEL_IO_OUTLIER was still empty.
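
For reference, the current value and description of this hidden parameter can be checked as SYS with the usual fixed-table query, and it can be changed in the spfile before a restart. This is only a sketch; confirm the units from the parameter description on your own system:

SQL> select p.ksppinm name, v.ksppstvl value, p.ksppdesc description
       from x$ksppi p, x$ksppcv v
      where p.indx = v.indx and p.ksppinm = '_io_outlier_threshold';

SQL> alter system set "_io_outlier_threshold" = 100 scope=spfile;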

It is worth looking closely at the documentation's description of the V$KERNEL_IO_OUTLIER columns:

Column                 Description
---------------------  ------------------------------------------------------------
TIMESTAMP              Epoch time
IO_SIZE                Size of the I/O, in KB
IO_OFFSET              Offset into the device of the I/O
DEVICE_NAME            Name of the device to which the I/O was targeted
PROCESS_NAME           Name of the process that issued the I/O
TOTAL_LATENCY          Total time in microseconds the I/O spent in the kernel
SETUP_LATENCY          Time in microseconds spent during initial I/O setup before sending to SCSI target device driver
QUEUE_TO_HBA_LATENCY   Time in microseconds spent in the SCSI target device driver before being sent to the Host Bus Adaptor
TRANSFER_LATENCY       Time in microseconds spent in the Host Bus Adaptor and physically transferring the I/O to the storage device
CLEANUP_LATENCY        Time in microseconds spent freeing resources used by the completed I/O
PID                    Process ID that issued the I/O

From the above table, I can conclude that:

  • The columns do not contain any information about filesystem latencies. Therefore, V$KERNEL_IO_OUTLIER was designed to work with ‘raw devices’ or ASM disks only.
  • Such a set of latencies describes synchronous IO. No AIO-related info is present. Therefore, V$KERNEL_IO_OUTLIER was designed to work with disk_asynch_io=false only.

These are the reasons why most people did not observe any data in this view. The combination of raw devices and synchronous IO is very unusual for Solaris. I suppose it should be used for diagnostics only.
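
For completeness, reproducing such a configuration on a sandbox looks roughly like this (a sketch only; /dev/rdsk/c1t1d0s4 is a placeholder raw device and the log size is arbitrary):

SQL> alter system set disk_asynch_io = false scope=spfile;
SQL> alter database add logfile group 4 ('/dev/rdsk/c1t1d0s4') size 100m reuse;
SQL> shutdown immediate
SQL> startup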

Indeed, when I placed redo logs on raw devices and restarted the Oracle instance with disk_asynch_io=false, V$KERNEL_IO_OUTLIER started populating:

SQL> select * from v$kernel_io_outlier;
 TIMESTAMP    IO_SIZE  IO_OFFSET DEVICE_NAME     PROCESS_NAME    TOTAL_LATENCY SETUP_LATENCY QUEUE_TO_HBA_LATENCY TRANSFER_LATENCY CLEANUP_LATENCY        PID     CON_ID            
---------- ---------- ---------- --------------- --------------- ------------- ------------- -------------------- ---------------- --------------- ---------- ----------            
1379336049        156   22510592 cmdk@1,0:t,raw  ora_lg00_O12           122298            43                 1263           120436             555       9803          0            
1379336812          0   23491584 cmdk@1,0:t,raw  ora_lgwr_O12           137050            62                  550           136407              30       9797          0            
1379337914        806   26398720 cmdk@1,0:t,raw  ora_lgwr_O12           535984        487226                  489            48237              29       9797          0            
1379337923        806   28894208 cmdk@1,0:t,raw  ora_lgwr_O12           181463        177612                  657             3167              24       9797          0            
1379337926        806   29720064 cmdk@1,0:t,raw  ora_lgwr_O12           849048        707903                  560           140553              31       9797          0            
...

In my case, the majority of the IO time was spent in the SETUP and TRANSFER stages.

II. How it works.

First of all, we need to find the process responsible for the DTrace profiling. The fixed table X$KSBTABACT lists all actions performed by Oracle background processes. We are interested in the following:

SQL>select PROCESS_NAME_KSBTABACT,ACTION_DESCRIPTION_KSBTABACT from X$KSBTABACT;
PROCESS_NAME_KSBTABACT ACTION_DESCRIPTION_KSBTABACT
---------------------- ---------------------------------------------
...
GEN0                   DTrace based Kernel IO  Outlier  Processing
...
3210 rows selected.

Therefore, this is the GEN0 process. Now we can use DTrace to trace the DTrace function calls made during GEN0 operation:

$ pgrep -lf gen0
 2860 ora_gen0_O12
$ /usr/sbin/dtrace -n 'pid$target::dtrace*:entry ' -p 2860 
dtrace: description 'pid$target::dtrace*:entry ' matched 94 probes
CPU     ID                    FUNCTION:NAME
  0  59303              dtrace_status:entry
  0  59223      dtrace_aggregate_snap:entry
  0  59224      dtrace_aggregate_walk:entry
  0  59236     dtrace_aggregate_clear:entry
  0  59303              dtrace_status:entry
  0  59223      dtrace_aggregate_snap:entry
  0  59224      dtrace_aggregate_walk:entry
  0  59236     dtrace_aggregate_clear:entry
...

The process repeatedly reads data from some DTrace aggregation (array). The interval between reads is 3 seconds. It seems that Oracle populates V$KERNEL_IO_OUTLIER with the results of a DTrace script running in the background.
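
That is just the standard DTrace aggregation workflow: a producer script fills an aggregation, and a consumer periodically snapshots and clears it (GEN0 does this through libdtrace calls such as dtrace_aggregate_snap(), as seen above). A self-contained sketch of mine illustrating the same idea in D, aggregating pwrite() latencies and draining the aggregation every 3 seconds, would look like this:

#!/usr/sbin/dtrace -qs

/* Sketch of the aggregate-and-drain pattern, not Oracle's script. */

syscall::pwrite:entry
{
        self->ts = timestamp;
}

syscall::pwrite:return
/self->ts/
{
        @lat[execname] = quantize(timestamp - self->ts);   /* fill the aggregation */
        self->ts = 0;
}

tick-3sec
{
        printa(@lat);           /* snapshot: print the accumulated data            */
        trunc(@lat);            /* clear it, as dtrace_aggregate_clear() does      */
}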

In order to see the underlying script itself, one can examine the arguments of the dtrace_program_strcompile() function. The GEN0 process calls this function and runs the compiled DTrace code immediately after its startup.
To catch this compilation, I wrote the script gen0.d:

#!/usr/sbin/dtrace -ZCws
#pragma D option quiet 

BEGIN {
  printf("Waiting for GEN0 process to start\n");
  init = 1;
}

/* Fires when a newly exec'ed oracle process whose arguments contain "gen0"
   appears. The process is stopped, and a second dtrace is attached to it to
   print the script text passed to dtrace_program_strcompile() (attaching
   resumes the stopped process); if that dtrace fails, prun resumes the process
   instead. stop() and system() are destructive actions and need -w above. */
syscall::exece:return 
/ init && execname == "oracle" && (strstr(curpsinfo->pr_psargs, "gen0") != NULL) / 
{
  printf(" %s,\t %d\n DTrace script is:\n", curpsinfo->pr_psargs, pid);
  stop(); 
  system("/usr/sbin/dtrace -Zq -x strsize=2048 -n 'pid$target::dtrace_program_strcompile:entry{self->s = arg1;}' -n 'pid$target::dtrace_program_strcompile:return{trace(copyinstr(self->s));}' -p %d || prun %d ", pid, pid);
  init = 0; 
}

It starts DTracing the GEN0 process just after its creation. Let us run the script and restart the Oracle instance:

$ ./gen0.d
Waiting for GEN0 process to start
 ora_gen0_O12,    3617
 DTrace script is:
syscall::pwrite:entry/execname == "oracle" &&  substr(strtok(curpsinfo->pr_psargs, " "), -3) == "O12" && fds[arg0].fi_fs == "specfs" && strstr(fds[arg0].fi_path
...

Now we can investigate how V$KERNEL_IO_OUTLIER works and even run the script standalone.

Briefly speaking, the Oracle script measures the time differences between the consecutively firing syscall::pwrite:entry, io:::start, io:::wait-start, io:::wait-done, and syscall::pwrite:return DTrace probes. If the overall duration of a pwrite() syscall to a raw device exceeds the threshold (500 ms by default), the script adds the measurements to a DTrace aggregation. The Oracle GEN0 process reads the aggregation and fills V$KERNEL_IO_OUTLIER.
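
To make the probe sequence concrete, here is a stripped-down sketch of mine (it is not Oracle's script and not the kernel_io_outlier.d below; the file name pwrite_outlier.d is arbitrary, and 500 ms is just the default threshold described above). It times a synchronous pwrite() by an oracle process to a raw device and prints the four intervals that correspond to the SETUP, QUEUE_TO_HBA, TRANSFER and CLEANUP columns:

#!/usr/sbin/dtrace -qs

/* pwrite_outlier.d - illustrative sketch only, not the original Oracle script.
   Assumes one physical IO per pwrite(), which holds for raw-device writes
   below the maxphys limit. */

syscall::pwrite:entry
/execname == "oracle" && fds[arg0].fi_fs == "specfs"/
{
        self->entry = timestamp;        /* syscall entered                       */
        self->size  = arg2;             /* bytes requested                       */
}

io:::start
/self->entry/
{
        self->start = timestamp;        /* buf handed to the device driver       */
}

io:::wait-start
/self->entry/
{
        self->wait = timestamp;         /* thread begins waiting for the IO      */
}

io:::wait-done
/self->entry/
{
        self->done = timestamp;         /* completion seen by the waiting thread */
}

syscall::pwrite:return
/self->entry && self->done && timestamp - self->entry > 500000000/
{
        /* intervals in microseconds, mapped onto the view columns */
        printf("%-16s pid %d: %d KB total %d setup %d queue %d transfer %d cleanup %d us\n",
            execname, pid, self->size / 1024,
            (timestamp - self->entry) / 1000,       /* TOTAL_LATENCY        */
            (self->start - self->entry) / 1000,     /* SETUP_LATENCY        */
            (self->wait - self->start) / 1000,      /* QUEUE_TO_HBA_LATENCY */
            (self->done - self->wait) / 1000,       /* TRANSFER_LATENCY     */
            (timestamp - self->done) / 1000);       /* CLEANUP_LATENCY      */
}

syscall::pwrite:return
/self->entry/
{
        self->entry = 0;
        self->start = 0;
        self->wait  = 0;
        self->done  = 0;
        self->size  = 0;
}

Run it as root, or as the oracle user with the DTrace privileges granted in part I.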

But DTrace can do the same by itself. The script kernel_io_outlier.d consists of the original V$KERNEL_IO_OUTLIER script plus additional printing probes. It takes the ORACLE_SID as an argument and reports IO outliers in the same format as the view:

$ ./kernel_io_outlier.d O12
 TIMESTAMP    IO_SIZE  IO_OFFSET DEVICE_NAME     PROCESS_NAME    TOTAL_LATENCY SETUP_LATENCY QUEUE_TO_HBA_LATENCY TRANSFER_LATENCY CLEANUP_LATENCY      PID
---------- ---------- ---------- --------------- --------------- ------------- ------------- -------------------- ---------------- --------------- --------
1379698569        807    6185984  cmdk@1,0:t,raw    ora_lg00_O12        287553        283701                  492             2973             385     4348
1379698571        806    7838208  cmdk@1,0:t,raw    ora_lg00_O12        601281        482628                   30           118478             144     4348
1379698571        807    8664064  cmdk@1,0:t,raw    ora_lg01_O12        927317        580352                  397           346413             153     4352
...

The script can be used to “extend” V$KERNEL_IO_OUTLIER diagnostic functionality to pre-12c Oracle databases.

Summary.
Currently, as of Oracle 12.1, V$KERNEL_IO_OUTLIER works only on the Solaris platform, for raw devices and synchronous IO. But I hope that this is only the beginning, and that in the future the Oracle database and DTrace will become more tightly integrated.

6 Comments

  1. Hi Andrey,
    thank you very much for the re-engineering of the view V$KERNEL_IO_OUTLIER. I think it is very obvious now why the view does not contain any data in most cases.

    However, I interpreted the following column description (from the official documentation), “Time in microseconds spent during initial I/O setup before sending to SCSI target device driver”, to mean that this column can include the time from VFS down to the block devices (the next step would be the SCSI call), which makes more sense and could be realized by probes on zfs_read, zfs_write, zio_wait, zio_vdev_io_done, vdev_disk_io_done, io:::start, scsi, etc.

    Hopefully Oracle will extend this view / feature in the future with such probes as we all know how ZFS can become slow in various cases :-))

    Thanks once again.

    Regards
    Stefan

    Comment by Stefan Koehler — September 21, 2013 @ 8:04 am

  2. That's just awesome! I am very impressed.
    Is the script modified, or read 1:1 from the GEN0 process? I am just wondering why it uses only the last 3 chars of the SID (substr(strtok(curpsinfo->pr_psargs, " "), -3)) and not the whole SID.

    Comment by Manuel Zach (@mzachh) — September 27, 2013 @ 6:34 pm

    • The part of the script before the /*————*/ line is almost original.
      I only added the outlier flag variable for printing, replaced the ORACLE_SID literal with the $$1 argument, and rescaled the units to microseconds and kilobytes.

      I supposed that Oracle uses only the last 3 chars of the SID for performance reasons.
      This was wrong.
      I just discovered that Oracle generates here a literal number equal to the length of the ORACLE_SID.
      I corrected the script.
      Thank you for this observation.

      Comment by andreynikolaev — September 28, 2013 @ 12:32 pm

  3. […] v$kernel_io_outlier table. And this is were i gave up.. until Google found this amazing post from Andrey Nikolaev He had already solved the parts i didn’t manage to resolve in a very good manner He also […]

    Pingback by The v$KERNEL_IO_OUTLIER | The Oracle DB inn — May 21, 2014 @ 1:38 pm

  4. […] we can see this in action using  V$KERNEL_IO_OUTLIER view.Like explained by Andrey Nikolaev in  https://andreynikolaev.wordpress.com/2013/09/20/vkernel_io_outlier/ this view is based on a DTrace script running in the background.He also managed to rewrite the […]

    Pingback by Simulating V$KERNEL_IO_OUTLIER on LINUX using Systemtap | Hatem Mahmoud Oracle's blog — February 25, 2015 @ 12:39 pm

