linux-kernel - Re: Unusually high system CPU usage with recent kernels

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20130824210336.GJ3871@linux.vnet.ibm.com>
Date:	Sat, 24 Aug 2013 14:03:36 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Tibor Billes <tbilles@....com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Unusually high system CPU usage with recent kernels

On Sat, Aug 24, 2013 at 09:59:45PM +0200, Tibor Billes wrote:
> From: Paul E. McKenney Sent: 08/22/13 12:09 AM
> > On Wed, Aug 21, 2013 at 11:05:51PM +0200, Tibor Billes wrote:
> > > > From: Paul E. McKenney Sent: 08/21/13 09:12 PM
> > > > On Wed, Aug 21, 2013 at 08:14:46PM +0200, Tibor Billes wrote:
> > > > > > From: Paul E. McKenney Sent: 08/20/13 11:43 PM
> > > > > > On Tue, Aug 20, 2013 at 10:52:26PM +0200, Tibor Billes wrote:
> > > > > > > > From: Paul E. McKenney Sent: 08/20/13 04:53 PM
> > > > > > > > On Tue, Aug 20, 2013 at 08:01:28AM +0200, Tibor Billes wrote:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > I was using the 3.9.7 stable release and tried to upgrade to the 3.10.x series.
> > > > > > > > > The 3.10.x series was showing unusually high (>75%) system CPU usage in some
> > > > > > > > > situations, making things really slow. The latest stable I tried is 3.10.7.
> > > > > > > > > I also tried 3.11-rc5, they both show this behaviour. This behaviour doesn't
> > > > > > > > > show up when the system is idling, only when doing some CPU intensive work,
> > > > > > > > > like compiling with multiple threads. Compiling with only one thread seems not
> > > > > > > > > to trigger this behaviour.
> > > > > > > > > 
> > > > > > > > > To be more precise I did a `perf record -a` while compiling a large C++ program
> > > > > > > > > with scons using 4 threads, the result is appended at the end of this email.
> > > > > > > > 
> > > > > > > > New one on me! You are running a mainstream system (x86_64), so I am
> > > > > > > > surprised no one else noticed.
> > > > > > > > 
> > > > > > > > Could you please send along your .config file?
> > > > > > > 
> > > > > > > Here it is
> > > > > > 
> > > > > > Interesting. I don't see RCU stuff all that high on the list, but
> > > > > > the items I do see lead me to suspect RCU_FAST_NO_HZ, which has some
> > > > > > relevance to the otherwise inexplicable group of commits you located
> > > > > > with your bisection. Could you please rerun with CONFIG_RCU_FAST_NO_HZ=n?
> > > > > > 
> > > > > > If that helps, there are some things I could try.
> > > > > 
> > > > > It did help. I didn't notice anything unusual when running with CONFIG_RCU_FAST_NO_HZ=n.
> > > > 
> > > > Interesting. Thank you for trying this -- and we at least have a
> > > > short-term workaround for this problem. I will put a patch together
> > > > for further investigation.
> > > 
> > > I don't specifically need this config option so I'm fine without it in
> > > the long term, but I guess it's not supposed to behave like that.
> > 
> > OK, good, we have a long-term workload for your specific case,
> > even better. ;-)
> > 
> > But yes, there are situations where RCU_FAST_NO_HZ needs to work
> > a bit better. I hope you will bear with me with a bit more
> > testing...
> >
> > > > In the meantime, could you please tell me how you were measuring
> > > > performance for your kernel builds? Wall-clock time required to complete
> > > > one build? Number of builds completed per unit time? Something else?
> > > 
> > > Actually, I wasn't all this sophisticated. I have a system monitor
> > > applet on my top panel (using MATE, Linux Mint), four little graphs,
> > > one of which shows CPU usage. Different colors indicate different kind
> > > of CPU usage. Blue shows user space usage, red shows system usage, and
> > > two more for nice and iowait. During a normal compile it's almost
> > > completely filled with blue user space CPU usage, only the top few
> > > pixels show some iowait and system usage. With CONFIG_RCU_FAST_NO_HZ
> > > set, about 3/4 of the graph was red system CPU usage, the rest was
> > > blue. So I just looked for a pile of red on my graphs when I tested
> > > different kernel builds. But also compile speed was horrible I couldn't
> > > wait for the build to finish. Even the UI got unresponsive.
> > 
> > We have been having problems with CPU accounting, but this one looks
> > quite real.
> > 
> > > Now I did some measuring. In the normal case a compile finished in 36
> > > seconds, compiled 315 object files. Here are some output lines from
> > > dstat -tasm --vm during compile:
> > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > >     time     |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | used  free| used  buff  cach  free|majpf minpf alloc  free
> > > 21-08 21:48:05| 91   8   2   0   0   0|   0  5852k|   0     0 |   0     0 |1413  1772 |   0  7934M| 581M 58.0M  602M 6553M|   0    71k   46k   54k
> > > 21-08 21:48:06| 93   6   1   0   0   0|   0  2064k| 137B  131B|   0     0 |1356  1650 |   0  7934M| 649M 58.0M  604M 6483M|   0    72k   47k   28k
> > > 21-08 21:48:07| 86  11   4   0   0   0|   0  5872k|   0     0 |   0     0 |2000  2991 |   0  7934M| 577M 58.0M  627M 6531M|   0    99k   67k   79k
> > > 21-08 21:48:08| 87   9   3   0   0   0|   0  2840k|   0     0 |   0     0 |2558  4164 |   0  7934M| 597M 58.0M  632M 6507M|   0    96k   57k   51k
> > > 21-08 21:48:09| 93   7   1   0   0   0|   0  3032k|   0     0 |   0     0 |1329  1512 |   0  7934M| 641M 58.0M  626M 6469M|   0    61k   48k   39k
> > > 21-08 21:48:10| 93   6   0   0   0   0|   0  4984k|   0     0 |   0     0 |1160  1146 |   0  7934M| 572M 58.0M  628M 6536M|   0    50k   40k   57k
> > > 21-08 21:48:11| 86   9   6   0   0   0|   0  2520k|   0     0 |   0     0 |2947  4760 |   0  7934M| 605M 58.0M  631M 6500M|   0   103k   55k   45k
> > > 21-08 21:48:12| 90   8   2   0   0   0|   0  2840k|   0     0 |   0     0 |2674  4179 |   0  7934M| 671M 58.0M  635M 6431M|   0    84k   59k   42k
> > > 21-08 21:48:13| 90   9   1   0   0   0|   0  4656k|   0     0 |   0     0 |1223  1410 |   0  7934M| 643M 58.0M  638M 6455M|   0    90k   62k   68k
> > > 21-08 21:48:14| 91   8   1   0   0   0|   0  3572k|   0     0 |   0     0 |1432  1828 |   0  7934M| 647M 58.0M  641M 6447M|   0    81k   59k   57k
> > > 21-08 21:48:15| 91   8   1   0   0   0|   0  5116k| 116B    0 |   0     0 |1194  1295 |   0  7934M| 605M 58.0M  644M 6487M|   0    69k   54k   64k
> > > 21-08 21:48:16| 87  10   3   0   0   0|   0  5140k|   0     0 |   0     0 |1761  2586 |   0  7934M| 584M 58.0M  650M 6502M|   0   105k   64k   68k
> > > 
> > > The abnormal case compiled only 182 object file in 6 and a half minutes,
> > > then I stopped it. The same dstat output for this case:
> > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > >     time     |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | used  free| used  buff  cach  free|majpf minpf alloc  free
> > > 21-08 22:10:49| 27  62   0   0  11   0|   0     0 | 210B    0 |   0     0 |1414  3137k|   0  7934M| 531M 57.6M  595M 6611M|   0  1628  1250   322 
> > > 21-08 22:10:50| 25  60   4   0  11   0|   0    88k| 126B    0 |   0     0 |1337  3110k|   0  7934M| 531M 57.6M  595M 6611M|   0    91   128   115 
> > > 21-08 22:10:51| 26  63   0   0  11   0|   0   184k| 294B    0 |   0     0 |1411  3147k|   0  7934M| 531M 57.6M  595M 6611M|   0  1485   814   815 
> > > 21-08 22:10:52| 26  63   0   0  11   0|   0     0 | 437B  239B|   0     0 |1355  3160k|   0  7934M| 531M 57.6M  595M 6611M|   0    24    94    97 
> > > 21-08 22:10:53| 26  63   0   0  11   0|   0     0 | 168B    0 |   0     0 |1397  3155k|   0  7934M| 531M 57.6M  595M 6611M|   0   479   285   273 
> > > 21-08 22:10:54| 26  63   0   0  11   0|   0  4096B| 396B  324B|   0     0 |1346  3154k|   0  7934M| 531M 57.6M  595M 6611M|   0    27   145   145 
> > > 21-08 22:10:55| 26  63   0   0  11   0|   0    60k|   0     0 |   0     0 |1353  3148k|   0  7934M| 531M 57.6M  595M 6610M|   0    93   117    36 
> > > 21-08 22:10:56| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1341  3172k|   0  7934M| 531M 57.6M  595M 6610M|   0   158    87    74 
> > > 21-08 22:10:57| 26  62   1   0  11   0|   0     0 |  42B   60B|   0     0 |1332  3162k|   0  7934M| 531M 57.6M  595M 6610M|   0    56    82    78 
> > > 21-08 22:10:58| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1334  3178k|   0  7934M| 531M 57.6M  595M 6610M|   0    26    56    56 
> > > 21-08 22:10:59| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1336  3179k|   0  7934M| 531M 57.6M  595M 6610M|   0     3    33    32 
> > > 21-08 22:11:00| 26  63   0   0  11   0|   0    24k|  90B  108B|   0     0 |1347  3172k|   0  7934M| 531M 57.6M  595M 6610M|   0    41    73    71
> > > 
> > > I have four logical cores so 25% makes up 1 core. I don't know if the ~26% user CPU usage has anthing to do with this fact or just coincidence. The rest is ~63% system and ~11% hardware interrupt. Do these support what you suspect?
> > 
> > The massive increase in context switches does come as a bit of a surprise!
> > It does rule out my initial suspicion of lock contention, but then again
> > the fact that you have only four CPUs made that pretty unlikely to begin
> > with.
> > 
> > 2.4k average context switches in the good case for the full run vs. 3,156k
> > for about half of a run in the bad case. That is an increase of more
> > than three orders of magnitude!
> > 
> > Yow!!!
> > 
> > Page faults are actually -higher- in the good case. You have about 6.5GB
> > free in both cases, so you are not running out of memory. Lots more disk
> > writes in the good case, perhaps consistent with its getting more done.
> > Networking is negligible in both cases.
> > 
> > Lots of hardware interrupts in the bad case as well. Would you be willing
> > to take a look at /proc/interrupts before and after to see which one you
> > are getting hit with? (Or whatever interrupt tracking tool you prefer.)
> 
> Here are the results.
> 
> Good case before:
>            CPU0       CPU1       CPU2       CPU3       
>   0:         17          0          0          0   IO-APIC-edge      timer
>   1:        356          1         68          4   IO-APIC-edge      i8042
>   8:          0          0          1          0   IO-APIC-edge      rtc0
>   9:        330         14        449         71   IO-APIC-fasteoi   acpi
>  12:         10        108        269       2696   IO-APIC-edge      i8042
>  16:         36         10        111          2   IO-APIC-fasteoi   ehci_hcd:usb1
>  17:         20          3         25          4   IO-APIC-fasteoi   mmc0
>  21:          3          0         34          0   IO-APIC-fasteoi   ehci_hcd:usb2
>  40:          0          1         12         11   PCI-MSI-edge      mei_me
>  41:      10617        173       9959        292   PCI-MSI-edge      ahci
>  42:        862         11        186         26   PCI-MSI-edge      xhci_hcd
>  43:        107         77         27        102   PCI-MSI-edge      i915
>  44:       5322         20        434         22   PCI-MSI-edge      iwlwifi
>  45:        180          0        183         86   PCI-MSI-edge      snd_hda_intel
>  46:          0          3          0          0   PCI-MSI-edge      eth0
> NMI:          1          0          0          0   Non-maskable interrupts
> LOC:      16312      15177      10840       8995   Local timer interrupts
> SPU:          0          0          0          0   Spurious interrupts
> PMI:          1          0          0          0   Performance monitoring interrupts
> IWI:       1160        523       1031        481   IRQ work interrupts
> RTR:          3          0          0          0   APIC ICR read retries
> RES:      14976      16135       9973      10784   Rescheduling interrupts
> CAL:        482        457        151        370   Function call interrupts
> TLB:         70        106        352        230   TLB shootdowns
> TRM:          0          0          0          0   Thermal event interrupts
> THR:          0          0          0          0   Threshold APIC interrupts
> MCE:          0          0          0          0   Machine check exceptions
> MCP:          2          2          2          2   Machine check polls
> ERR:          0
> MIS:          0
> 
> Good case after:
>            CPU0       CPU1       CPU2       CPU3       
>   0:         17          0          0          0   IO-APIC-edge      timer
>   1:        367          1         81          4   IO-APIC-edge      i8042
>   8:          0          0          1          0   IO-APIC-edge      rtc0
>   9:        478         14        460         71   IO-APIC-fasteoi   acpi
>  12:         10        108        269       2696   IO-APIC-edge      i8042
>  16:         36         10        111          2   IO-APIC-fasteoi   ehci_hcd:usb1
>  17:         20          3         25          4   IO-APIC-fasteoi   mmc0
>  21:          3          0         34          0   IO-APIC-fasteoi   ehci_hcd:usb2
>  40:          0          1         12         11   PCI-MSI-edge      mei_me
>  41:      16888        173       9959        292   PCI-MSI-edge      ahci
>  42:       1102         11        186         26   PCI-MSI-edge      xhci_hcd
>  43:        107        132         27        136   PCI-MSI-edge      i915
>  44:       6943         20        434         22   PCI-MSI-edge      iwlwifi
>  45:        180          0        183         86   PCI-MSI-edge      snd_hda_intel
>  46:          0          3          0          0   PCI-MSI-edge      eth0
> NMI:          4          3          3          3   Non-maskable interrupts
> LOC:      26845      24780      19025      17746   Local timer interrupts
> SPU:          0          0          0          0   Spurious interrupts
> PMI:          4          3          3          3   Performance monitoring interrupts
> IWI:       1637        751       1287        695   IRQ work interrupts
> RTR:          3          0          0          0   APIC ICR read retries
> RES:      26511      26673      18791      20194   Rescheduling interrupts
> CAL:        510        480        151        370   Function call interrupts
> TLB:        361        292        575        461   TLB shootdowns
> TRM:          0          0          0          0   Thermal event interrupts
> THR:          0          0          0          0   Threshold APIC interrupts
> MCE:          0          0          0          0   Machine check exceptions
> MCP:          2          2          2          2   Machine check polls
> ERR:          0
> MIS:          0
> 
> Bad case before:
>            CPU0       CPU1       CPU2       CPU3       
>   0:         17          0          0          0   IO-APIC-edge      timer
>   1:        172          3         78          3   IO-APIC-edge      i8042
>   8:          0          1          0          0   IO-APIC-edge      rtc0
>   9:       1200        148        395         81   IO-APIC-fasteoi   acpi
>  12:       1625          2        348         10   IO-APIC-edge      i8042
>  16:         26         23        115          0   IO-APIC-fasteoi   ehci_hcd:usb1
>  17:         16          3         12         21   IO-APIC-fasteoi   mmc0
>  21:          2          2         33          0   IO-APIC-fasteoi   ehci_hcd:usb2
>  40:          0          0         14         10   PCI-MSI-edge      mei_me
>  41:      15776        374       8497        687   PCI-MSI-edge      ahci
>  42:       1297        829        115         24   PCI-MSI-edge      xhci_hcd
>  43:        103        149          9        212   PCI-MSI-edge      i915
>  44:      13151        101        511         91   PCI-MSI-edge      iwlwifi
>  45:        153        159          0        122   PCI-MSI-edge      snd_hda_intel
>  46:          0          1          1          0   PCI-MSI-edge      eth0
> NMI:         32         31         31         31   Non-maskable interrupts
> LOC:      82504      82732      74172      75985   Local timer interrupts
> SPU:          0          0          0          0   Spurious interrupts
> PMI:         32         31         31         31   Performance monitoring interrupts
> IWI:      17816      16278      13833      13282   IRQ work interrupts
> RTR:          3          0          0          0   APIC ICR read retries
> RES:      18784      21084      13313      12946   Rescheduling interrupts
> CAL:        393        422        306        356   Function call interrupts
> TLB:        231        176        235        191   TLB shootdowns
> TRM:          0          0          0          0   Thermal event interrupts
> THR:          0          0          0          0   Threshold APIC interrupts
> MCE:          0          0          0          0   Machine check exceptions
> MCP:          3          3          3          3   Machine check polls
> ERR:          0
> MIS:          0
> 
> Bad case after:
>            CPU0       CPU1       CPU2       CPU3       
>   0:         17          0          0          0   IO-APIC-edge      timer
>   1:        415          3         85          3   IO-APIC-edge      i8042
>   8:          0          1          0          0   IO-APIC-edge      rtc0
>   9:       1277        148        428         81   IO-APIC-fasteoi   acpi
>  12:       1625          2        348         10   IO-APIC-edge      i8042
>  16:         26         23        115          0   IO-APIC-fasteoi   ehci_hcd:usb1
>  17:         16          3         12         21   IO-APIC-fasteoi   mmc0
>  21:          2          2         33          0   IO-APIC-fasteoi   ehci_hcd:usb2
>  40:          0          0         14         10   PCI-MSI-edge      mei_me
>  41:      17814        374       8497        687   PCI-MSI-edge      ahci
>  42:       1567        829        115         24   PCI-MSI-edge      xhci_hcd
>  43:        103        177          9        242   PCI-MSI-edge      i915
>  44:      14956        101        511         91   PCI-MSI-edge      iwlwifi
>  45:        153        159          0        122   PCI-MSI-edge      snd_hda_intel
>  46:          0          1          1          0   PCI-MSI-edge      eth0
> NMI:         36         35         34         34   Non-maskable interrupts
> LOC:      92429      92708      81714      84071   Local timer interrupts
> SPU:          0          0          0          0   Spurious interrupts
> PMI:         36         35         34         34   Performance monitoring interrupts
> IWI:      22594      19658      17439      14257   IRQ work interrupts
> RTR:          3          0          0          0   APIC ICR read retries
> RES:      21491      24670      14618      14569   Rescheduling interrupts
> CAL:        441        439        306        356   Function call interrupts
> TLB:        232        181        274        465   TLB shootdowns
> TRM:          0          0          0          0   Thermal event interrupts
> THR:          0          0          0          0   Threshold APIC interrupts
> MCE:          0          0          0          0   Machine check exceptions
> MCP:          3          3          3          3   Machine check polls
> ERR:          0
> MIS:          0

Lots more local timer interrupts, which is consistent with the higher
time in interrupt handlers for the bad case.

> > One hypothesis is that your workload and configuration are interacting
> > with RCU_FAST_NO_HZ to force very large numbers of RCU grace periods.
> > Could you please check for this by building with CONFIG_RCU_TRACE=y,
> > mounting debugfs somewhere and dumping rcu/rcu_sched/rcugp before and
> > after each run?
> 
> Good case before:
> completed=8756  gpnum=8757  age=0  max=21
> after:
> completed=14686  gpnum=14687  age=0  max=21
> 
> Bad case before:
> completed=22970  gpnum=22971  age=0  max=21
> after:
> completed=26110  gpnum=26111  age=0  max=21

In the good case, (14686-8756)/40=148.25 grace periods per second, which
is a fast but reasonable rate given your HZ=250.  Not a large difference
in the number of grace periods, but extrapolating for the longer runtime,
maybe ten times as much.  But not much change in grace-period rate per
unit time.

> The test scenario was the following in both cases (mixed english and pseudo-bash):
> reboot, login, start terminal
> cd project
> rm -r build
> cat /proc/interrupts >> somefile ; cat /sys/kernel/debug/rcu/rcu_sched/rcugp >> somefile
> scons -j4
> wait ~40 sec (good case finished, Ctrl-C in bad case)
> cat /proc/interrupts >> somefile ; cat /sys/kernel/debug/rcu/rcu_sched/rcugp >> somefile
> 
> I stopped the build in the bad case after about the same time the good
> case finished, so the extra interrupts and RCU grace periods because of the
> longer runtime don't fake the results.

That procedure works for me, thank you for laying it out carefully.

I believe I see what is going on and how to fix it, though it may take
me a bit to work things through and get a good patch.

Thank you very much for your testing efforts!

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/