Date:	Mon, 11 Apr 2016 13:21:27 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Waiman Long <Waiman.Long@....com>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	linux-doc@...nel.org, X86 ML <x86@...nel.org>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Borislav Petkov <bp@...e.de>,
	Andy Lutomirski <luto@...nel.org>,
	Scott J Norton <scott.norton@....com>,
	Douglas Hatch <doug.hatch@....com>,
	Randy Wright <rwright@....com>
Subject: Re: [PATCH v3] x86/hpet: Reduce HPET counter read contention

On Mon, Apr 11, 2016 at 1:09 PM, Waiman Long <Waiman.Long@....com> wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>  1) There is a single HPET counter shared by all the CPUs.
>  2) HPET counter reading is a very slow operation.
>
> Using HPET as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance. Sometimes
> the performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup, for example.
>
> This patch attempts to reduce HPET read contention by exploiting the
> fact that if more than one CPU is trying to access the HPET at the
> same time, it is more efficient for one CPU in the group to read the
> HPET counter and share it with the rest of the group than for each
> group member to read the counter individually.
>
> This is done by using a combination word with a sequence number and
> a bit lock. The CPU that gets the bit lock is responsible for reading
> the HPET counter and updating the sequence number. The others monitor
> the change in the sequence number and pick up the saved HPET value
> accordingly. This change is enabled only on SMP configurations.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch has the following performance numbers
> (with HPET or TSC as clock source):
>
> TSC             = 646515 jobs/min
> HPET w/o patch  = 566708 jobs/min
> HPET with patch = 638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without patch to 1.41% with patch.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers of the same benchmark were:
>
> TSC             = 3145329 jobs/min
> HPET w/o patch  = 1108537 jobs/min
> HPET with patch = 3019934 jobs/min
>
> The corresponding perf profile showed a drop of CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long <Waiman.Long@....com>
> ---
>  v2->v3:
>   - Make the hpet optimization the default for SMP configuration. So
>     no documentation change is needed.
>   - Remove threshold checking code as it should not be necessary and
>     can be potentially unsafe.
>
>  v1->v2:
>   - Reduce the CPU threshold to 32.
>   - Add a kernel parameter to explicitly enable or disable hpet
>     optimization.
>   - Change hpet_save.hpet type to u32 to make sure that read & write
>     is atomic on i386.
>
>  arch/x86/kernel/hpet.c |   83 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 83 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a1f0e4a..ff1f140 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -759,12 +759,95 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>  #endif
>
>  /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delays and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC. For a
> + * really large system with hundreds of CPUs, the slowdown may be so
> + * severe that it may actually crash the system because of an NMI watchdog
> + * soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.
> + *
> + * A sequence number whose lsb is a lock bit is used to control which CPU
> + * has the right to read the HPET counter directly and which CPUs are going
> + * to get the indirect value read by the lock holder. For the latter group,
> + * if the sequence number differs from the expected locked value, they
> + * can assume that the saved HPET value is up-to-date and return it.
> + */
> +static struct {
> +       /* Sequence number + bit lock */
> +       int seq ____cacheline_aligned_in_smp;
> +
> +       /* Current HPET value           */
> +       u32 hpet ____cacheline_aligned_in_smp;
> +} hpet_save;
> +#define HPET_SEQ_LOCKED(seq)   ((seq) & 1)     /* Odd == locked */
> +
> +/*
>   * Clock source related code
>   */
> +#ifdef CONFIG_SMP
> +static cycle_t read_hpet(struct clocksource *cs)
> +{
> +       int seq;
> +
> +       seq = READ_ONCE(hpet_save.seq);
> +       if (!HPET_SEQ_LOCKED(seq)) {
> +               int old, new = seq + 1;
> +               unsigned long flags;
> +
> +               local_irq_save(flags);
> +               /*
> +                * Set the lock bit (lsb) to get the right to read HPET
> +                * counter directly. If successful, read the counter, save
> +                * its value, and increment the sequence number. Otherwise,
> +                * increment the sequence number to the expected locked value
> +                * for comparison later on.
> +                */
> +               old = cmpxchg(&hpet_save.seq, seq, new);
> +               if (old == seq) {
> +                       u32 time = hpet_readl(HPET_COUNTER);
> +
> +                       WRITE_ONCE(hpet_save.hpet, time);

I think a plain store is okay due to the smp_store_release below.
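
Roughly, the pairing looks like this in a userspace C11 sketch (not
the kernel primitives; the patch uses READ_ONCE() on the reader side
where this sketch uses an acquire load, and the names are made up):

#include <stdatomic.h>
#include <stdint.h>

static atomic_int seq;        /* odd == writer holds the bit lock */
static uint32_t saved_value;  /* payload; plain, non-atomic storage */

/* Writer: already holds the bit lock, i.e. seq == locked_seq (odd). */
static void publish(uint32_t v, int locked_seq)
{
	saved_value = v;  /* plain store */
	/*
	 * The release orders the payload store before the seq update
	 * and pairs with the acquire load in wait_for_value() below.
	 */
	atomic_store_explicit(&seq, locked_seq + 1, memory_order_release);
}

/* Reader: observed seq == locked_seq, so it waits for the writer. */
static uint32_t wait_for_value(int locked_seq)
{
	while (atomic_load_explicit(&seq, memory_order_acquire) == locked_seq)
		;
	return saved_value;  /* visible thanks to the release/acquire pairing */
}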

> +
> +                       /* Unlock */
> +                       smp_store_release(&hpet_save.seq, new + 1);
> +                       local_irq_restore(flags);
> +                       return (cycle_t)time;
> +               }
> +               local_irq_restore(flags);
> +               seq = new;
> +       }
> +
> +       /*
> +        * Wait until the locked sequence number changes which indicates
> +        * that the saved HPET value is up-to-date.
> +        */
> +       while (READ_ONCE(hpet_save.seq) == seq) {
> +               /*
> +                * Since reading the HPET is much slower than a single
> +                * cpu_relax() instruction, we use two here in an attempt
> +                * to reduce the amount of cacheline contention in the
> +                * hpet_save.seq cacheline.
> +                */
> +               cpu_relax();
> +               cpu_relax();
> +       }
> +
> +       return (cycle_t)READ_ONCE(hpet_save.hpet);
> +}

I wonder if this could be simplified.  Pseudocode:

static DEFINE_SPINLOCK(hpet_lock);
static u32 last_hpet_counter;

u32 time;
unsigned long flags;

local_irq_save(flags);

if (spin_trylock(&hpet_lock)) {
	time = hpet_readl(HPET_COUNTER);
	WRITE_ONCE(last_hpet_counter, time);
	spin_unlock(&hpet_lock);
} else {
	spin_unlock_wait(&hpet_lock);
	/*
	 * When this function started, hpet_lock was locked.  Now it's
	 * unlocked, which means that time is at least as new as whatever
	 * the lock holder returned.
	 */
	time = READ_ONCE(last_hpet_counter);
}

local_irq_restore(flags);
return time;

Should be faster under heavy contention, too: spinlocks are very
nicely optimized.

I *think* this is sufficiently atomic.
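
For comparison, a rough userspace model of the trylock-or-wait idea in
C11 atomics (not the kernel API; read_hw_counter() is a made-up stand-in
for hpet_readl(), and this leaves out the irq disabling):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_bool hpet_lock;          /* true == held */
static _Atomic uint32_t last_counter;  /* last published value */

extern uint32_t read_hw_counter(void); /* stand-in for hpet_readl() */

static uint32_t read_counter(void)
{
	bool expected = false;

	if (atomic_compare_exchange_strong(&hpet_lock, &expected, true)) {
		/* Lock holder: read the hardware, publish, unlock. */
		uint32_t t = read_hw_counter();
		atomic_store_explicit(&last_counter, t, memory_order_relaxed);
		atomic_store_explicit(&hpet_lock, false, memory_order_release);
		return t;
	}

	/* Contended path: wait for the holder to drop the lock,
	 * like spin_unlock_wait() above. */
	while (atomic_load_explicit(&hpet_lock, memory_order_acquire))
		;

	/* The acquire load pairs with the holder's release store, so
	 * this value is at least as new as when we started waiting. */
	return atomic_load_explicit(&last_counter, memory_order_relaxed);
}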

--Andy
