How to Analyze High CPU Utilization In Solaris (Doc ID 1008930.1)

Modified: 10-Oct-2013
Type: TROUBLESHOOTING
Migrated ID: 212290

APPLIES TO:

Solaris SPARC Operating System - Version 8.0 and later
Solaris x64/x86 Operating System - Version 8 6/00 U1 and later
Oracle Solaris Express - Version 2010.11 and later
All Platforms
***Checked for relevance on 21-Jan-2013***

PURPOSE

High system (kernel) time can affect system performance negatively. When the CPUs start spending the majority of their time running kernel code, performance problems follow, ranging from slow interactive response and database time-outs to cluster fail-overs caused by heartbeat detection failures and slow networking. The events mainly responsible for triggering high kernel overhead are discussed in this document.

If you are running Solaris 10 or later, you can use DTrace to monitor the events responsible for high kernel CPU usage. See the knowledge article: How to use DTrace and mdb to Interpret vmstat Statistics (Doc ID 1009494.1).

TROUBLESHOOTING STEPS

System time is CPU time spent running kernel code. It depends strongly on the characteristics of the workload: if the type of work done by the application invokes heavy kernel activity, kernel time will be relatively higher. vmstat (cpu:sy) and mpstat (sys) report the percentage of CPU time spent running kernel code. mpstat reports percent system time for each CPU separately, whereas vmstat reports system time averaged across all CPUs.

Kernel Lock Contention: smtx (spins on a mutex lock, as reported by mpstat(1M)), srw (spins on a reader/writer lock, as reported by mpstat(1M))

Locks are used to protect critical kernel data structures from simultaneous access; a lock is always needed for a shared writable area. In Solaris, a thread sleeps on one of the following synchronization objects:

Mutex: serializes access to critical kernel data structures.
Condition variables: implement the kernel sleep/wakeup facility.
Semaphores: synchronize access to a shareable resource.
RW lock: concurrent access for readers, exclusive access for a writer.

Mutex locks are the most commonly used primitive in the Solaris[TM] kernel for protecting critical data structures. The smtx event indicates that a kernel thread is spinning while waiting for a mutex: the thread holding the lock is running on some other CPU, and the thread trying to get the lock spins on its CPU expecting that the mutex owner will release the lock soon. smtx represents an entry point into the adaptive mutex code.

In some cases the kernel can distinguish between reading a piece of data and modifying it. For such cases it uses reader/writer locks (rwlocks); contention on rwlocks is reported by the "srw" counter. Any number of threads can hold an rwlock for reading, as long as no thread holds it for writing.

There are thousands of locks in the kernel, and they are all acquired through this same entry point. There is no way to know, just by looking at the smtx or srw count, whether the count refers to attempts to get the same lock, the same 10 locks, or 100 different locks. To find out which kernel resource(s) are seeing contention, use lockstat.
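On Solaris 10 and later, the DTrace lockstat provider records adaptive mutex spin events and can show which kernel stacks are doing the spinning. The one-liner below is a minimal sketch, assuming the lockstat provider's adaptive-spin probe is available on your release and that you have privileges to run dtrace; the 30-second tick simply bounds the run:

# dtrace -n 'lockstat:::adaptive-spin { @[stack(5)] = count(); } tick-30s { exit(0); }'

The aggregation counts spin events by (truncated) kernel stack, pointing at the code paths responsible. lockstat(1M), described next, gives a more complete per-lock breakdown.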
To record kernel lock contention events, such as spins on an adaptive mutex or blocks on a reader/writer lock, type:

#lockstat -o lockstat.out -C -i 997 -s10 -D20 -n 1000000 sleep 60

It is recommended to run lockstat(1M) for a short interval only, because logging the events of interest modifies the running kernel and adds overhead. See Document 1005868.1: A Primer On Lockstat, and the lockstat(1M) man page for details.

High Paging Activity: minf/mjf (mpstat: minor fault/major fault), mf (vmstat: minor fault)

A page fault occurs when a process attempts to access data that it does not currently have mapped. It is considered a minor fault if the data is already in memory and the only thing required is the creation of a new mapping for it. A major fault occurs when the page is not in memory and has to be created, possibly populated with data from the backing store, and then mapped. For example, a process maps the "libc.so" library into its address space and references a page within it. A page fault occurs; however, the physical page may already be present in memory, having been brought in by some other process earlier, in which case the kernel only needs to establish a virtual-to-physical mapping without performing an expensive I/O operation.

Pages which are no longer mapped by any process but still have a valid vnode/offset to some data are kept on a list called the cache list. If a page is needed which is on the cache list, it can be removed from the list and remapped. This is reported by vmstat in the "re" column as reclaims.

Major page faults, in conjunction with the vmstat page-in statistics described below, should be examined when high system time is reported.

pi/po (vmstat: kilobytes paged in/paged out)

A page is brought into memory when an application accesses a virtual address that is not mapped to a physical page. Each page of physical memory has an associated file (vnode) and offset; the file and offset identify the backing store for the page. The backing store is the location to which the physical page contents are migrated out (paged out) or migrated in (paged in). A process's memory pages that are part of the text or data segment use the filesystem where the program's binary resides as their backing store. Memory pages used for the process stack or heap (typically allocated by malloc) are not read from disk but are created zero-filled; should they need to be paged out, a swap device is used. These are also known as "anonymous" pages, as they have no identity in a filesystem.

You can find the source of page-in/page-out activity using "vmstat -p". Page-in and page-out activity at the file system level, reported as "fpi/fpo/fpf", is a sign that the application is actively doing read(), write() or mmap() operations on files residing on the file system. Anonymous memory page-ins and page-outs are reported as "api/apo/apf". Anonymous page-in/page-out activity is a sign of memory pressure, since I/O to the swap device indicates a memory shortage. High paging activity to or from the file system and swap (anonymous memory pages) can result in high CPU sys usage.

Signs of Memory Shortage: sr (vmstat: scan rate)

sr is the number of pages scanned by the scanner (page daemon) per second. On Solaris 8 and above, a non-zero value in the sr column means a shortage of memory. For Solaris 7 and below, the sr column alone cannot tell you about memory shortages; both the sr column and I/O to the swap device should be examined to determine memory shortages on Solaris 7 and below.
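On Solaris 10 and later, the DTrace vminfo provider can attribute this paging and fault activity to individual processes. The one-liner below is a sketch, assuming the anonpgin, fspgin and maj_fault probes of the vminfo provider are available on your release; adjust the interval to taste:

# dtrace -n 'vminfo:::anonpgin, vminfo:::fspgin, vminfo:::maj_fault { @[probename, execname] = count(); } tick-30s { exit(0); }'

A rising anonpgin count against a particular executable is direct evidence that the memory shortage indicated by sr is hurting that workload, whereas fspgin dominated output points back at file system I/O.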
Multiple factors can cause memory pressure:

The memory requirement of the workload is larger than physical memory.
Large memory allocation in the kernel due to a bug or excessive tuning.
An application or kernel driver memory leak.

To find the source of large memory allocations in a user application, use the pmap(1), ps(1) or prstat(1M) utilities. For kernel memory allocation, use kstat(1M):

# kstat -n system_pages|grep pp_kernel
        pp_kernel       3345881

where pp_kernel represents the number of 8k (SPARC) or 4k (Intel) pages allocated to the kernel.

You can also get a breakdown of memory usage by application, file system and kernel by running the command below:

# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     158364              1237   15%
ZFS File Data              684405              5346   67%
Anon                        30341               237    3%
Exec and libs                5932                46    1%
Page cache                  26912               210    3%
Free (cachelist)            34968               273    3%
Free (freelist)             88081               688    9%
Total                     1029003              8039
Physical                  1013090              7914

Where:
Kernel: memory allocated by the kernel and drivers
ZFS File Data: memory allocated for caching ZFS file system blocks
Anon: memory allocated by applications for heap, stack and COW pages
Exec and libs: memory allocated for application binaries and libraries
Page cache: memory allocated to the UFS file system cache
Free (cachelist): available memory, with vnode and offset still intact
Free (freelist): available memory

The page scanner is activated when free memory, represented by the kernel variable "freemem", drops below the "lotsfree" threshold (lotsfree is set to 1/64 of total memory). The page scanner steals memory (in 8k page-sized chunks) from application and filesystem buffers that have not been referenced recently, syncs them with the backing store if they have been modified, and then frees them for other uses. A high scan rate is a sign that the page scanner daemon is running more frequently and for longer periods, which in turn increases CPU kernel usage. You can monitor the memory thresholds by running:

# kstat -n system_pages
name:   system_pages            class:    pages
        availrmem               13420400
        crtime                  163.131762
        desfree                 128635
        desscan                 25
        econtig                 189792256
        fastscan                677273
        freemem                 7151671
        kernelbase              16777216
        lotsfree                257271
        minfree                 64317
        nalloc                  50280640
        nalloc_calls            18093
        nfree                   37487807
        nfree_calls             12382
        nscan                   0
        pagesfree               7151671
        pageslocked             3029092
        pagestotal              16465378
        physmem                 16487075
        pp_kernel               3346379
        slowscan                100
        snaptime                4952283.6547535

where values are reported in 8k (SPARC) or 4k (Intel) page units.

See:
Document 1348585.1 Monitoring Memory and Swap usage to avoid hang
Document 1012111.1 Solaris: Discrepancy in freemem and available swap on systems configured without physical swap
Document 1003383.1 Page Cache on Solaris 8 consumes large amounts of memory
for more detail.

Fsflush CPU Usage:

The primary task of the file system flush daemon, fsflush, is to find dirty file system pages and flush them to the backing store of the file system in which the file(s) reside. By default, fsflush is configured to wake up every 5 seconds (in recent releases this has changed to 1 second) and search 1/6th of physical memory for dirty pages, thus flushing all of physical memory within 30 seconds. On large-memory systems, the fsflush daemon will consume a large amount of CPU if there are many dirty pages in the file system page cache. In some circumstances, the daemon will not be able to complete its scan of 1/6th of physical memory at each scheduled wake-up.
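The wake-up interval and the overall flush window are controlled by the tune_t_fsflushr and autoup kernel tunables (autoup is discussed below); their current values can be read from the live kernel with mdb(1M). A minimal sketch, assuming the standard tunable names:

# echo "tune_t_fsflushr/D" | mdb -k
# echo "autoup/D" | mdb -k

Each command prints the named kernel variable in decimal, so you can confirm what interval and window are actually in effect before considering any tuning.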
The amount of time being spent by fsflush can be observed by taking multiple snapshots with ps(1) and viewing the CPU time (TIME column) consumed by the daemon:

# ps -aef | grep fsflush
     UID   PID  PPID   C    STIME TTY      TIME CMD
    root     3     0   0 10:14:13 ?      132:13 fsflush

The time spent in fsflush can also be observed on an idle machine. A 96 GB machine with many file system pages will use 100% of a CPU for approximately 3 seconds each time the file system flush daemon wakes up:

# mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr  sys  wt idl
  0    4   0  109   201  101   15    0    0   12    0    64    0  99*   0   1
  1   39   0    4     1    1    6    0    0    3    0    92    0   0    0 100
  2   15   0 1473    10   10   14    0    0    1    0    21    2   0    2  95
  3   21   0 1726     1    1   56    0    0   47    0   286    1   0    3  96
  4   22   2  123     0    0   11    0    0    1    0    68    0   0    1  98

A larger value for the autoup parameter reduces fsflush CPU usage at the cost of keeping dirty buffers in memory for a longer period. If fsflush runs for a long period it can affect application performance, since it runs in kernel mode. It is therefore recommended to increase the default (30 seconds) to (nGB x 4); for example, a system configured with 96 GB should have an autoup value of 384. Edit the /etc/system file, add the following line, and reboot:

set autoup=384

High Cross-calls:

A cross-call is an operation that must be performed on one or all CPUs. The kernel uses cross-calls for two main purposes:

To implement interprocessor interrupts. Interprocessor interrupts are delivered by the kernel using the "poke_cpu()" function and are used for tasks such as preempting the dispatcher, delivering a signal to a thread running on another CPU, or starting and stopping threads running on a different processor when the /proc interface is used.

To synchronize the memory management units (MMUs) on all CPUs. When a page is unmapped, because a user process exits or issues an unmap operation, a cross-call is needed to flush the TLB (Translation Lookaside Buffer) entries for the unmapped page from the MMU of every CPU on which the thread last ran.

The TLB is part of the memory management unit (MMU) and is used for caching translation entries (PTEs on SPARC, TTEs on UltraSPARC). The objective of the TLB is to cache as many recent page translations in hardware as possible, so that it can satisfy a process's or thread's memory accesses by performing all of the virtual-to-physical translation on the fly. If the translation is not found in the TLB, the system has to look up the virtual-to-physical translation in a larger table in main memory, which is several orders of magnitude slower.

When a TLB entry is invalidated, all CPUs which may have a copy of that entry must be notified to invalidate their copy. This is done via a cross-call. A cross-call is implemented with a cross-trap: cross-calls use a special cross-trap which records the presence of a cross-call, and once the cross-trap is done it immediately executes the cross-call. The majority of cross-call events are triggered by the HAT (Hardware Address Translation) layer flushing stale TLB entries on the other CPUs, either because a process is exiting or because a piece of memory is being reclaimed for other use. The kernel tracks the CPUs on which a process runs, so that when the process exits it can send cross-traps only to those CPUs the process ran on. For unmapped kernel pages, the cross-call is made to all CPUs. Some number of cross-traps come from the activities of the cache scrubber.
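On Solaris 10 and later, the DTrace sysinfo provider's xcalls probe (the same statistic mpstat reports in the xcal column) can identify which processes and kernel code paths are generating the cross-calls. A minimal sketch, bounded to a 10-second run:

# dtrace -n 'sysinfo:::xcalls { @[execname, stack(5)] = count(); } tick-10s { exit(0); }'

Stacks dominated by HAT/TLB-flush routines confirm that unmap activity, rather than interprocessor interrupts, is driving the xcal numbers.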
The trapstat(1M) utility (-t option), available in Solaris 9 and above, breaks out TLB misses by kernel and user ("k" vs. "u"), differentiates between TSB hits and TSB misses ("tlb-miss" vs. "tsb-miss"), and reports the amount of time spent in each type of handler (%tim), as shown below:

cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
  0 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k|         0  0.0         0  0.0 |        31  0.0         0  0.0 | 0.0
  1 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  1 k|         0  0.0         0  0.0 |        11  0.0         0  0.0 | 0.0
  2 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  2 k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  3 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  3 k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0

It is not recommended to run "trapstat -t" during busy hours due to its high overhead (up to 7% on memory-intensive workloads). If you are interested only in a summary of TLB misses on each CPU, run trapstat without any options; it has a relatively lower overhead than running it with the "-t" option.

MPSS (Multiple Page Size Support), introduced in Solaris[TM] 9, improves the TLB hit rate by increasing the page size cached in the hardware TLB from 8 KB up to 4 MB. In addition, the LPOOB (Large Page Out Of Box) feature introduced in Solaris 10 uses large pages for process heap, stack and private anonymous memory. Large pages improve performance for some workloads because a larger working set can be cached in the TLB, which avoids frequent TLB misses. For detail on how MPSS improves application performance and how to measure TLB hits using trapstat, see Document 1010458.1, Taking advantage of Multiple Page Size Support in Solaris[TM] 9 and later.

Frequency of Interrupts: intr/ithr (mpstat: interrupts/interrupts handled as threads)

An interrupt is the mechanism used by devices such as disks and network cards to signal completion of an I/O or some other activity. An interrupt causes the CPU to stop executing the running thread, since the interrupt requires immediate processing. The kernel services the interrupt by vectoring CPU execution to the interrupt handler routine that matches the interrupting device. A high number of interrupts on a particular CPU can be caused by bad hardware or by binding several devices that generate many interrupts to the same processor.

The "ithr" column records low-level interrupts that are handled as separate threads. Interrupts at level 10 or below are handled in the same way as other Solaris threads, whereas interrupts above level 10 use (borrow) the LWP of the executing thread; the interrupted thread in that case is pinned, which avoids the overhead of a context switch. Interrupt handler routines at level 10 or below can block if necessary, using the regular synchronization primitives described above (handlers above level 10 cannot block). Each processor maintains a pool of partially initialized interrupt threads, one for each of the lower 9 levels, plus a system-wide thread for clock interrupts. This approach allows simple, fast allocation of threads at interrupt dispatch time and low-latency interrupt response, without the scaling impact caused by interrupts blocking on synchronous data access.

The intrstat(1M) utility, available in Solaris 10, reports interrupt statistics.
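For example, running intrstat over 5-second intervals prints, for each interrupt-generating device, the number of interrupts and the percentage of each CPU's time spent servicing them:

# intrstat 5

If supported on your release, the -c cpulist option restricts the report to particular CPUs (for example, # intrstat -c 0,1 5), which is useful when mpstat shows the interrupt load concentrated on specific processors.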
kstat(1M) can also be used to capture interrupt statistics per CPU:

        level-1         288227202512057
        level-10        32914470420237
        level-11        0
        level-12        108930
        level-13        86205745317
        level-14        2341214414862
        level-15        0
        level-2         0
        level-3         0
        level-4         349787125
        level-5         52446672
        level-6         1114480
        level-7         10737622850102
        level-8         0
        level-9         697429014107

The trapstat(1M) utility (-i option), on Solaris 9 and later, breaks out the int-vec and level-1 through level-15 traps by interrupt number, programmable interrupt level (PIL), interrupting device and device instance number, as shown below:

# trapstat -i
inum pil device   |    cpu0    cpu1    cpu4    cpu5
  db   5 fas#0    |      51       0       0       0
  e2   5 soc#0    |       0       0       0     123
  d3   5 soc#1    |       2       0       0       0
....
....
 343   5 soc#17   |       0       0       0       0
  dc   7 hme#0    |     125       0       0       0
 1dc   7 hme#1    |       0       0     170       0
 25c   7 hme#2    |      71       0       0       0
...

Number Of System Calls:

A system call is the only way for a user application to access kernel resources. When a user program makes a system call, such as opening a file, the open routine (a wrapper to the open system call) pushes the system call number (which identifies the particular system call in the kernel) onto the user stack and then invokes a special SPARC trap instruction, Tcc, also known as a software trap. The primary function of this instruction is to change the execution mode of the processor to kernel mode; the kernel then invokes the system call handler, syscall().

Not all system calls have the same overhead. The type of system calls issued by an application, and the time taken to service them, can be monitored with the truss(1) utility:

# truss -c ls
# truss -aefd -o truss.out -p pid_of_process

For detail about how to use truss, see:
Document 1005788.1 Using truss for program and error analysis
Document 1011708.1 Basics of Using the truss Command
Document 1007414.1 truss interpretation: an example

truss reports the system calls generated by a particular process. If you are interested in system calls generated system-wide, use prex(1). prex is used for enabling and tracing TNF probes in a user process or in the kernel:

# prex -k
prex> list probes syscall
name=syscall_end enable=off trace=off file=../../sparc/os/syscall.c line=853 funcs=
name=syscall_start enable=off trace=off file=../../sparc/os/syscall.c line=541 funcs=
prex> buffer alloc 2m
Buffer of size 2097152 bytes allocated
prex> enable syscall
prex> trace syscall
prex> ktrace on

Use tnfxtract(1) to dump the buffer containing the syscall probes into a file, and tnfdump(1) to convert the binary file into ASCII. The file /etc/name_to_sysnum provides the syscall name-to-number mapping; this information can be used when interpreting the file containing the syscall TNF probe data.

Kernel Profiling:

All of the system activities discussed above invoke various kernel routines to perform their tasks. In Solaris 8 and above, the lockstat(1M) utility has a kernel profiling option that offers excellent tracing inside the kernel. Kernel profiling offers insight into where in the kernel the CPUs are spending the majority of their time, or which kernel routines are monopolizing the CPUs in kernel mode. It lists all the kernel routines active during the sample period and the CPU time spent running them. To collect kernel profile data for 30 seconds, type:

#lockstat -o lockstat_profile.out -I -i 971 -s 40 -n 720000 -W sleep 30

When the "-I" option is specified, lockstat collects profiling data instead of recording lock events. The "-i" option requests that the profiling interrupt fire at 971 Hz (the maximum value is 1000 Hz).
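On Solaris 10 and later, a quick DTrace alternative for this kind of kernel profiling uses the profile provider to sample the kernel program counter and aggregate by function. This is a minimal sketch (997 Hz is chosen, like lockstat's rate, to avoid running in lock-step with the clock; trunc keeps only the 20 hottest entries):

# dtrace -n 'profile-997hz /arg0/ { @[func(arg0)] = count(); } tick-30s { trunc(@, 20); exit(0); }'

The /arg0/ predicate restricts the samples to those taken while the CPU was in the kernel (arg0 is the kernel PC and is zero in user mode), so the resulting per-function counts correspond directly to sys time.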
By default, lockstat uses 10,000 buckets for recording profiling events such as PC, CPU and PIL. When lockstat runs out of buckets for recording events, it prints the message:

lockstat: warning: ran out of data records

To validate that no events were dropped by lockstat, compare the number of events reported in the output header:

$ head -2 lockstat.out
Profiling interrupt: 1165219 events in 600.018 seconds (1942 events/sec)

against the expected number of events, NCPU x frequency x sample period:

$ bc -l
2*971*600.018
1165234.956

To make sure there are plenty of buckets available for profiling a busy workload, specify a larger value with the "-n" option. A simple way to check the number of buckets actually used:

$ wc -l lockstat.out
6234 lockstat.out

For more information, see the lockstat man page and Document 1001812.1 - How to determine what is consuming CPU system time using the lockstat command.

Prstat:

When it comes to performance, what matters most is how the application is affected by the activities described above. The prstat(1M) utility has an option to collect microstate process accounting, which provides information similar to that reported by mpstat and vmstat. However, instead of reporting system-wide information, the prstat "-m" option reports process-specific events such as:

SYS  The percentage of time the process has spent in system mode.
DFL  The percentage of time the process has spent processing data page faults. This is a sign that a memory shortage is affecting application performance.
LCK  The percentage of time the process has spent waiting for user locks. This is a sign of resource contention in user code.
SLP  The percentage of time the process has spent sleeping.
LAT  The percentage of time the process has spent waiting for a CPU. This means there are too many runnable threads in the CPU dispatch queue competing for CPU resources.
VCX  The number of voluntary context switches.
ICX  The number of involuntary context switches. This means the thread is getting preempted by a higher-priority runnable thread or has exhausted its time slice.
SCL  The number of system calls.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community, Oracle Solaris Kernel Community.

REFERENCES

NOTE:1278725.1 - Using DTrace to understand mpstat output
NOTE:1285485.1 - GUDS - A Script for Gathering Solaris Performance Data
NOTE:1001812.1 - How to Determine What is Consuming CPU System Time Using the lockstat Command
NOTE:1005868.1 - A Primer On Lockstat
NOTE:1007815.1 - Care needed interpreting vmstat runqueue statistics
NOTE:1009494.1 - How to use DTrace and mdb to Interpret vmstat Statistics
NOTE:1009871.1 - Cpu usage statistics are misleading before Oracle Solaris[TM] 10 Operating System
NOTE:1010458.1 - Taking advantage of Multiple Page Size Support (MPSS) in Solaris[TM] 9 and later