Message-ID: <alpine.LFD.2.02.1105042348350.3005@ionos>
Date:	Thu, 5 May 2011 00:04:41 +0200 (CEST)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Dave Kleikamp <dkleikamp@...il.com>
cc:	Chris Mason <chris.mason@...cle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Tim Chen <tim.c.chen@...ux.intel.com>,
	linux-kernel@...r.kernel.org
Subject: Re: idle issues running sembench on 128 cpus

Dave,

On Wed, 4 May 2011, Dave Kleikamp wrote:

> Thomas,
> I've been looking at the performance of sembench on a 128-CPU system, and
> I'm running into some issues in the idle loop.
> 
> Initially, I was seeing a lot of contention on the clockevents_lock in
> clockevents_notify(). Assuming it only protects clockevents_chain, and
> not the handlers themselves, I changed this to an rwlock (with thoughts
> of moving to RCU if that worked out); see the sketch after the quote.
> 
> This didn't help, but it exposed an underlying problem: high contention on
> tick_broadcast_lock in tick_broadcast_oneshot_control(). I think with this
> many CPUs, tick_handle_oneshot_broadcast() holds that lock for a long time,
> causing the idle CPUs to spin on it.
> 
> I am able to avoid this problem with either of the kernel parameters
> "idle=mwait" or "processor.max_cstate=1". Similarly, enabling
> CONFIG_INTEL_IDLE=y and using the kernel parameter intel_idle.max_cstate=1
> exposes a different spinlock, pm_qos_lock, but I found a patch that fixes
> that contention:
> https://lists.linux-foundation.org/pipermail/linux-pm/2011-February/030266.html
> https://patchwork.kernel.org/patch/550721/
> 
> Of course, we'd like to find a way to reduce the spinlock contention rather
> than resort to prohibiting the CPUs from entering the C3 state at all. I
> don't see a simple fix, and I'd like to know whether you've seen anything
> like this before and have given it any thought.
> 
> I also don't know whether it makes sense to be able to tune the cpuidle
> governors to add more resistance to entering the C3 state, or even to
> switch to a performance governor at runtime, similar to cpufreq.
> 
> I'd like to hear your thoughts before I dive any deeper into this.
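
For illustration, here is a minimal userspace analogue of the rwlock idea
in the quote above, assuming the chain really is read-mostly. It uses plain
pthreads and hypothetical names (chain_register, chain_notify); it is a
sketch of the pattern, not the actual clockevents code:

/* Userspace sketch of the conversion described in the quote: the
 * notifier chain is read-mostly, so concurrent notifiers take the
 * lock shared and only registration takes it exclusive.
 * Build: gcc -O2 -pthread -o chain chain.c */
#include <pthread.h>
#include <stdio.h>

#define MAX_HANDLERS	16

typedef void (*handler_t)(void);

static pthread_rwlock_t chain_lock = PTHREAD_RWLOCK_INITIALIZER;
static handler_t chain[MAX_HANDLERS];
static int nr_handlers;

static int chain_register(handler_t h)
{
	int ret = -1;

	pthread_rwlock_wrlock(&chain_lock);	/* rare writer: mutates the chain */
	if (nr_handlers < MAX_HANDLERS) {
		chain[nr_handlers++] = h;
		ret = 0;
	}
	pthread_rwlock_unlock(&chain_lock);
	return ret;
}

static void chain_notify(void)
{
	int i;

	pthread_rwlock_rdlock(&chain_lock);	/* readers run concurrently */
	for (i = 0; i < nr_handlers; i++)
		chain[i]();
	pthread_rwlock_unlock(&chain_lock);
}

static void hello(void)
{
	puts("handler ran");
}

int main(void)
{
	chain_register(hello);
	chain_notify();
	return 0;
}

RCU would go one step further and let chain_notify() run with no lock at
all, at the cost of deferring the freeing of unregistered entries.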

Tick broadcasting for more than a few CPUs simply does not scale, and it
never will.

There is no real way to avoid the global lock if all you have is _ONE_
working global event device and N CPUs trying to work around their f*cked
up local APICs when deeper C-states are entered.
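
To make the scaling point concrete, here is a self-contained sketch (plain
pthreads, nothing kernel-specific) that mimics the pattern: every "CPU"
takes the same global lock on each simulated idle transition, so throughput
collapses as the thread count grows. Run it under time(1) at several thread
counts:

/* Userspace analogue of the tick_broadcast_lock situation: N "CPUs"
 * each take one global lock per simulated idle transition. Compare
 * e.g. "time ./bcast 2" with "time ./bcast 16".
 * Build: gcc -O2 -pthread -o bcast bcast.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define TRANSITIONS	1000000		/* idle entries/exits per thread */

static pthread_mutex_t broadcast_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long broadcast_state;	/* stand-in for broadcast bookkeeping */

static void *idle_cpu(void *unused)
{
	int i;

	for (i = 0; i < TRANSITIONS; i++) {
		/* every idle transition funnels through one global lock */
		pthread_mutex_lock(&broadcast_lock);
		broadcast_state++;
		pthread_mutex_unlock(&broadcast_lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int i, n = argc > 1 ? atoi(argv[1]) : 4;

	if (n < 1)
		n = 1;

	pthread_t tid[n];

	for (i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, idle_cpu, NULL);
	for (i = 0; i < n; i++)
		pthread_join(tid[i], NULL);

	printf("%d threads, %lu transitions total\n", n, broadcast_state);
	return 0;
}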

The same problem exists with the TSC, which stops in deeper C-states. There
you just don't see lock contention, because we rely on the HW serialization
of the HPET or PM_TIMER, which is a huge bottleneck when you do
timekeeping-related work at high frequency on more than a handful of cores
at the same time. Just benchmark a tight loop of gettimeofday() or
clock_gettime() on such a machine with and without max_cstate=1 on the
kernel command line.
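
A minimal sketch of that benchmark, assuming nothing beyond POSIX; run one
copy per core at the same time to see the shared-clocksource serialization:

/* Tight timekeeping loop, as suggested above. Compare the rate with
 * and without max_cstate=1 on the kernel command line.
 * Build: gcc -O2 -o clkbench clkbench.c -lrt */
#include <stdio.h>
#include <time.h>

#define ITERS	10000000UL

static double seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct timespec ts;
	unsigned long i;
	double start, elapsed;

	start = seconds();
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	elapsed = seconds() - start;

	printf("%lu calls in %.3f s (%.0f calls/s)\n",
	       ITERS, elapsed, ITERS / elapsed);
	return 0;
}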

We could perhaps get away w/o the locking for the NOHZ=n and HIGHRES=n
case, but I doubt you want that, given that you don't want to restrict
C-states either; C-states do not make much sense without NOHZ=y at least.

We tried to beat sense into unnamed HW manufacturers for years, and it took
a little more than a decade before they started to act on it :(

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
