Date:	Thu, 23 Feb 2012 16:46:54 +0100
From:	Lesław Kopeć <leslaw.kopec@...za-klasa.pl>
To:	Aman Gupta <aman@...1.net>
CC:	linux-kernel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Chase Douglas <chase.douglas@...onical.com>,
	Damien Wyart <damien.wyart@...e.fr>,
	Kyle McMartin <kyle@...hat.com>,
	Venkatesh Pallipadi <venki@...gle.com>,
	Jonathan Nieder <jrnieder@...il.com>
Subject: Re: Inconsistent load average on tickless kernels

On 02/06/2012 07:51 AM, Aman Gupta wrote:
> I have an LVS/DR cluster of 10 machines that receive similar traffic
> via a round-robin strategy. These machines run Debian Lenny with
> 2.6.26, and consistently have a 15-minute load average between 4-12
> depending on the time of day.
> 
> Upgrading any one of these machines to a newer kernel compiled with
> NO_HZ=y causes the reported load average to drop significantly. [...]

I can confirm Aman's results on kernels 2.6.32 and higher on a similar
setup. I ran a test on a cluster of diskless PHP workers. The servers
were running identical hardware and software, so the workload should
have been the same. However, the reported load average differed
depending on which kernel the host was running.

I have tested the following vanilla kernels:
* 2.6.32.55-*
* 2.6.32.55-*-74f5187ac8 (2.6.32.55 with patch 74f5187ac8)
* 2.6.32.55-*-0f004f5a69 (2.6.32.55 with patches 74f5187ac8 and 0f004f5a69)
* 2.6.37-rc5-*-0f004f5a69 (2.6.37 at commit 0f004f5a69)
* 2.6.37-rc5-*-pre-0f004f5a69 (2.6.37 at commit 6313e3c217)

Each kernel was compiled with CONFIG_NO_HZ enabled (no-hz variant) and
disabled (hz variant). Here's a snapshot of the 15-minute load average
on each kernel:
kernel                          no-hz   hz
2.6.32.55-*                     0.59    0.57
2.6.32.55-*-74f5187ac8          3.56    11.79
2.6.32.55-*-0f004f5a69          0.61    11.76
2.6.37-rc5-*-0f004f5a69         0.67    11.65
2.6.37-rc5-*-pre-0f004f5a69     3.97    12.05
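
For anyone who wants to collect comparable snapshots, here is a minimal
sketch in Python of reading the 15-minute value from /proc/loadavg and
tagging it with the running kernel release. The function names
(parse_loadavg, snapshot) are made up for illustration, and the sample
line used below is invented, not taken from the machines in this test.

```python
import platform


def parse_loadavg(text):
    """Parse a line in /proc/loadavg format into (load1, load5, load15).

    The first three whitespace-separated fields are the 1-, 5- and
    15-minute load averages; the remaining fields are ignored here.
    """
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def snapshot(path="/proc/loadavg"):
    """Return the kernel release and the current 15-minute load average."""
    with open(path) as f:
        _load1, _load5, load15 = parse_loadavg(f.read())
    return {"kernel": platform.release(), "load15": load15}
```

Calling snapshot() on each host at the same moment gives one row per
kernel; parse_loadavg("0.61 0.58 0.59 2/345 6789") can be used to check
the parser without touching /proc.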

I've also uploaded load average [1] and CPU utilization [2] charts for a
visual comparison.

My observations are:

1. On tickless kernels the load is very low when either neither or both
of the patches (74f5187ac8 and 0f004f5a69) are applied.

2. Kernels that have only patch 74f5187ac8 applied show the smallest
difference between the hz and no-hz variants. Still, the no-hz kernels
return values lower than their hz siblings.

3. Non-tickless kernels seem to report correct load values: the overall
trend and the values match CPU utilization. The only exception is
2.6.32.55-hz, which reports the same values as 2.6.32.55-no-hz.

4. If x processes are using all available cycles, the load is correctly
incremented by x. This behavior is consistent on all kernels.
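
As a reference point for observation 4, here is a rough sketch in Python
of the exponentially damped average that the load figures are based on.
It assumes the usual scheme of sampling the number of active (runnable
or uninterruptible) tasks about every 5 seconds, and it uses floating
point instead of the kernel's fixed-point arithmetic. The names decay
and simulate_load are hypothetical. It does not model how tickless idle
affects the sampling; it only shows what a correct sampler should
converge to.

```python
import math

SAMPLE_INTERVAL_S = 5.0  # assumed sampling interval, roughly 5 seconds


def decay(window_s, interval_s=SAMPLE_INTERVAL_S):
    """Per-sample decay factor for an averaging window given in seconds."""
    return math.exp(-interval_s / window_s)


def simulate_load(active_counts, window_s=900.0, start=0.0):
    """Feed a sequence of sampled active-task counts through the average.

    window_s=900.0 corresponds to the 15-minute load average.
    """
    e = decay(window_s)
    load = start
    for n in active_counts:
        # Old value decays, the new sample is blended in.
        load = load * e + n * (1.0 - e)
    return load
```

With x tasks active at every sample, simulate_load([x] * 5000)
approaches x, which matches observation 4. If tasks are only seen
active in a fraction of the samples, the result settles near that
fraction times the task count, so a sampler that misses activity
reports a lower load.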


Steps to reproduce: run a bunch of CPU-bound processes that together do
not use all available cycles. In my case the biggest difference between
expected and measured load is at around 30% CPU utilization.
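
A synthetic way to approximate such a workload, sketched in Python: each
worker spins for part of a short period and sleeps for the rest. This is
not the PHP workload from my test. The function names and the defaults
(30% utilization, 100 ms period) are assumptions chosen for
illustration.

```python
import multiprocessing
import time


def split_period(utilization, period_s):
    """Split one period into (busy, idle) seconds for a target utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    busy_s = utilization * period_s
    return busy_s, period_s - busy_s


def burn(utilization=0.3, period_s=0.1, total_s=60.0):
    """Busy-loop for a fraction of each period, sleep for the remainder.

    Returns the number of periods completed before total_s elapsed.
    """
    busy_s, idle_s = split_period(utilization, period_s)
    deadline = time.monotonic() + total_s
    periods = 0
    while time.monotonic() < deadline:
        spin_until = time.monotonic() + busy_s
        while time.monotonic() < spin_until:
            pass  # consume CPU cycles
        if idle_s > 0:
            time.sleep(idle_s)  # let the CPU go idle
        periods += 1
    return periods


def run_workers(count, utilization=0.3, period_s=0.1, total_s=60.0):
    """Start `count` worker processes and wait for all of them to finish."""
    workers = [
        multiprocessing.Process(target=burn,
                                args=(utilization, period_s, total_s))
        for _ in range(count)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

For example, run_workers(4, total_s=1800.0) keeps four partially busy
processes alive long enough for the 15-minute average to settle; the
result can then be compared between the hz and no-hz builds.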


Have there been any other patches that correct the load calculation?
Maybe I'm testing it the wrong way? I'd appreciate any suggestions. I'd
be happy to test new patches. Sadly, I cannot propose any fixes as the
kernel sources are still a mystery to me.


[1] http://img841.imageshack.us/img841/2204/kernelload.png
[2] http://img854.imageshack.us/img854/8194/kernelcpu.png

-- 
Lesław Kopeć

View attachment "dmesg-2.6.32.55.log" of type "text/plain" (82738 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (263 bytes)
