Message-ID: <578657EF.6010500@hpe.com>
Date:	Wed, 13 Jul 2016 11:02:07 -0400
From:	Waiman Long <waiman.long@....com>
To:	Waiman Long <Waiman.Long@....com>
CC:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
	<x86@...nel.org>, Jiang Liu <jiang.liu@...ux.intel.com>,
	Borislav Petkov <bp@...e.de>,
	Andy Lutomirski <luto@...nel.org>,
	Scott J Norton <scott.norton@....com>,
	Douglas Hatch <doug.hatch@....com>,
	Randy Wright <rwright@....com>
Subject: Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention

On 06/17/2016 04:20 PM, Waiman Long wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>   1) There is a single HPET counter shared by all the CPUs.
>   2) HPET counter reading is a very slow operation.
>
> Using HPET as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance. Sometimes
> the performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup, for example.
>
> This patch attempts to reduce HPET read contention by exploiting the
> fact that when more than one CPU tries to access the HPET at the same
> time, it is more efficient for one CPU in the group to read the HPET
> counter and share it with the rest of the group than for each group
> member to read the counter individually.
>
> This is done by using a combination word holding a sequence number
> and a bit lock. The CPU that gets the bit lock is responsible for
> reading the HPET counter and updating the sequence number. The others
> monitor the change in the sequence number and grab the HPET counter
> value accordingly. This change is enabled only in SMP configurations.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch has the following performance numbers
> (with HPET or TSC as clock source):
>
> TSC		= 646515 jobs/min
> HPET w/o patch	= 566708 jobs/min
> HPET with patch	= 638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without patch to 1.41% with patch.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers of the same benchmark were:
>
> TSC		= 3145329 jobs/min
> HPET w/o patch	= 1108537 jobs/min
> HPET with patch	= 3019934 jobs/min
>
> The corresponding perf profile showed a drop of CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long<Waiman.Long@....com>
> ---
>   v3->v4:
>    - Move hpet_save inside the CONFIG_SMP block to fix a compilation
>      warning in non-SMP build.
>
>   v2->v3:
>    - Make the hpet optimization the default for SMP configuration. So
>      no documentation change is needed.
>    - Remove threshold checking code as it should not be necessary and
>      can be potentially unsafe.
>
>   v1->v2:
>    - Reduce the CPU threshold to 32.
>    - Add a kernel parameter to explicitly enable or disable hpet
>      optimization.
>    - Change hpet_save.hpet type to u32 to make sure that read & write
>      is atomic on i386.
>
>   arch/x86/kernel/hpet.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 84 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a1f0e4a..bc5bb53 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -759,12 +759,96 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>   #endif
>
>   /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delay and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC. For a
> + * really large system with hundreds of CPUs, the slowdown may be so
> + * severe that it may actually crash the system because of an NMI watchdog
> + * soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.
> + *
> + * A sequence number whose lsb is a lock bit is used to control which CPU
> + * has the right to read the HPET counter directly and which CPUs are going
> + * to get the indirect value read by the lock holder. For the latter group,
> + * if the sequence number differs from the expected locked value, they
> + * can assume that the saved HPET value is up-to-date and return it.
> + */
> +#define HPET_SEQ_LOCKED(seq)	((seq) & 1)	/* Odd == locked */
> +
> +/*
>    * Clock source related code
>    */
> +#ifdef CONFIG_SMP
> +static struct {
> +	/* Sequence number + bit lock */
> +	int seq ____cacheline_aligned_in_smp;
> +
> +	/* Current HPET value		*/
> +	u32 hpet ____cacheline_aligned_in_smp;
> +} hpet_save;
> +
> +static cycle_t read_hpet(struct clocksource *cs)
> +{
> +	int seq;
> +
> +	seq = READ_ONCE(hpet_save.seq);
> +	if (!HPET_SEQ_LOCKED(seq)) {
> +		int old, new = seq + 1;
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		/*
> +		 * Set the lock bit (lsb) to get the right to read HPET
> +		 * counter directly. If successful, read the counter, save
> +		 * its value, and increment the sequence number. Otherwise,
> +		 * increment the sequence number to the expected locked value
> +		 * for comparison later on.
> +		 */
> +		old = cmpxchg(&hpet_save.seq, seq, new);
> +		if (old == seq) {
> +			u32 time;
> +
> +			time = hpet_save.hpet = hpet_readl(HPET_COUNTER);
> +
> +			/* Unlock */
> +			smp_store_release(&hpet_save.seq, new + 1);
> +			local_irq_restore(flags);
> +			return (cycle_t)time;
> +		}
> +		local_irq_restore(flags);
> +		seq = new;
> +	}
> +
> +	/*
> +	 * Wait until the locked sequence number changes which indicates
> +	 * that the saved HPET value is up-to-date.
> +	 */
> +	while (READ_ONCE(hpet_save.seq) == seq) {
> +		/*
> +		 * Since reading the HPET is much slower than a single
> +		 * cpu_relax() instruction, we use two here in an attempt
> +		 * to reduce the amount of cacheline contention in the
> +		 * hpet_save.seq cacheline.
> +		 */
> +		cpu_relax();
> +		cpu_relax();
> +	}
> +
> +	return (cycle_t)READ_ONCE(hpet_save.hpet);
> +}
> +#else /* CONFIG_SMP */
> +/*
> + * For UP
> + */
>   static cycle_t read_hpet(struct clocksource *cs)
>   {
>   	return (cycle_t)hpet_readl(HPET_COUNTER);
>   }
> +#endif /* CONFIG_SMP */
>
>   static struct clocksource clocksource_hpet = {
>   	.name		= "hpet",

This patch was created in response to a problem we have on the 16-socket 
Broadwell-EX systems (up to 768 logical CPUs) that were under 
development. About 10% of the kernel boots experienced soft lockups:

[   71.618132] NetLabel: Initializing
[   71.621967] NetLabel:  domain hash size = 128
[   71.626848] NetLabel:  protocols = UNLABELED CIPSOv4
[   71.632418] NetLabel:  unlabeled traffic allowed by default
[   71.638679] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[   71.655313] Switching to clocksource hpet
[   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
[   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
[   95.694203] Modules linked in:
[   95.694697] CPU: 145 PID: 0 Comm: swapper/145 Not tainted 3.10.0-327.el7.x86_64 #1
[   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
[   95.696145] Hardware name: HP Superdome2 16s x86, BIOS Bundle: 008.001.006 SFW: 041.063.152 01/16/2016
[   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]
[   95.699774] task: ffff8cf7fecf4500 ti: ffff89787c748000 task.ti: ffff89787c748000

During the bootup process, there is a short period during which the 
clocksource is switched to hpet in order to calibrate the TSCs. It is 
switched back to tsc once the calibration is done, and it is during 
this short period that the soft lockups may happen.

This patch eliminates the sporadic boot soft lockup problem. Please 
consider merging it upstream.

Thanks,
Longman
