Date:	Thu, 7 Apr 2016 11:07:41 -0400
From:	Waiman Long <waiman.long@....com>
To:	Andy Lutomirski <luto@...capital.net>
CC:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	X86 ML <x86@...nel.org>, Jiang Liu <jiang.liu@...ux.intel.com>,
	Borislav Petkov <bp@...e.de>,
	Andy Lutomirski <luto@...nel.org>,
	Scott J Norton <scott.norton@....com>,
	Douglas Hatch <doug.hatch@....com>,
	Randy Wright <rwright@....com>
Subject: Re: [PATCH] x86/hpet: Reduce HPET counter read contention

On 04/07/2016 12:58 AM, Andy Lutomirski wrote:
> On Wed, Apr 6, 2016 at 7:02 AM, Waiman Long <Waiman.Long@....com> wrote:
>> On a large system with many CPUs, using HPET as the clock source can
>> have a significant impact on the overall system performance because
>> of the following reasons:
>>   1) There is a single HPET counter shared by all the CPUs.
>>   2) HPET counter reading is a very slow operation.
>>
>> Using HPET as the default clock source may happen when, for example,
>> the TSC clock calibration exceeds the allowable tolerance. Sometimes
>> the performance slowdown can be so severe that the system may crash
>> because of an NMI watchdog soft lockup, for example.
>>
>> This patch attempts to reduce HPET read contention by using the fact
>> that if more than one task is trying to access the HPET at the same
>> time, it is more efficient for one task in the group to read the HPET
>> counter and share it with the rest of the group than for each group
>> member to read the HPET counter individually.
>>
>> This is done by using a combination word with a sequence number and
>> a bit lock. The task that gets the bit lock will be responsible for
>> reading the HPET counter and updating the sequence number. The others
>> will monitor the change in the sequence number and pick up the saved
>> HPET counter value accordingly.
>>
>> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
>> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
>> with and without the patch gives the following performance numbers
>> (with HPET or TSC as the clock source):
>>
>> TSC             = 646515 jobs/min
>> HPET w/o patch  = 566708 jobs/min
>> HPET with patch = 638791 jobs/min
>>
>> The perf profile showed a reduction in the %CPU time consumed by
>> read_hpet from 4.99% without the patch to 1.41% with it.
>>
>> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
>> other hand, the performance numbers of the same benchmark were:
>>
>> TSC             = 3145329 jobs/min
>> HPET w/o patch  = 1108537 jobs/min
>> HPET with patch = 3019934 jobs/min
>>
>> The corresponding perf profile showed a drop in the CPU consumption of
>> the read_hpet function from more than 34% to just 2.96%.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@....com>
>> ---
>>   arch/x86/kernel/hpet.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 files changed, 109 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>> index a1f0e4a..9e3de73 100644
>> --- a/arch/x86/kernel/hpet.c
>> +++ b/arch/x86/kernel/hpet.c
>> @@ -759,11 +759,112 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>>   #endif
>>
>>   /*
>> + * Reading the HPET counter is a very slow operation. If a large number of
>> + * CPUs are trying to access the HPET counter simultaneously, it can cause
>> + * massive delay and slow down system performance dramatically. This may
>> + * happen when HPET is the default clock source instead of TSC. For a
>> + * really large system with hundreds of CPUs, the slowdown may be so
>> + * severe that it may actually crash the system because of an NMI watchdog
>> + * soft lockup, for example.
>> + *
>> + * If multiple CPUs are trying to access the HPET counter at the same time,
>> + * we don't actually need to read the counter multiple times. Instead, the
>> + * other CPUs can use the counter value read by the first CPU in the group.
>> + *
>> + * A sequence number whose lsb is a lock bit is used to control which CPU
>> + * has the right to read the HPET counter directly and which CPUs are going
>> + * to get the indirect value read by the lock holder. For the latter group,
>> + * if the sequence number differs from the expected locked value, they
>> + * can assume that the saved HPET value is up-to-date and return it.
>> + *
>> + * This mechanism is only activated on systems with a large number of CPUs.
>> + * Currently, it is enabled when nr_cpus > 64.
>> + */
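
For reference, the scheme described in the comment above boils down to
something like the following minimal standalone sketch. It uses C11
atomics rather than the kernel's primitives, hpet_readl() is stubbed out
(in the kernel it is a slow MMIO read of the HPET main counter), and the
names HPET_RD_LOCKED, HPET_RD_SEQ_STEP and read_hpet_combined() are
illustrative, not taken from the patch:

#include <stdatomic.h>
#include <stdint.h>

#define HPET_RD_LOCKED   0x1U   /* lsb: some CPU is reading the HPET now */
#define HPET_RD_SEQ_STEP 0x2U   /* sequence number sits above the lock bit */

static _Atomic uint32_t hpet_seq;    /* combination word: seq + lock bit */
static _Atomic uint64_t hpet_saved;  /* last value published by a lock holder */

/* Stub for the slow MMIO read of the HPET main counter. */
static uint64_t hpet_readl(void)
{
	return 0;
}

static uint64_t read_hpet_combined(void)
{
	uint32_t seq = atomic_load(&hpet_seq);

	if (!(seq & HPET_RD_LOCKED) &&
	    atomic_compare_exchange_strong(&hpet_seq, &seq,
					   seq | HPET_RD_LOCKED)) {
		/* Won the lock bit: do the one slow read and publish it. */
		uint64_t now = hpet_readl();

		atomic_store(&hpet_saved, now);
		/* Bump the sequence number, which also drops the lock bit. */
		atomic_store(&hpet_seq, seq + HPET_RD_SEQ_STEP);
		return now;
	}

	/*
	 * Found the word locked, or lost the CAS race (in which case the
	 * failed CAS updated seq to the value it observed): spin until
	 * the sequence number moves, then use the published value.
	 */
	while (atomic_load(&hpet_seq) == (seq | HPET_RD_LOCKED))
		;	/* cpu_relax() in kernel code */

	return atomic_load(&hpet_saved);
}

Folding the lock bit into the sequence word means a single
compare-and-swap both elects the reader and tells the losers which value
to spin on; once the sequence number moves past the value they observed,
the saved counter is at least as fresh as a read they could have started
themselves.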
> Reading the HPET is so slow that all the atomic ops in the world won't
> make a dent.  Why not just turn this optimization on unconditionally?
>
> --Andy

I am always wary of introducing a regression on lesser systems, like a 
single-socket machine with a few cores. That is why I put in the check 
to enable this optimization conditionally. I have no issue with taking 
that out and letting it be the default as long as no one objects.

Cheers,
Longman
