Date:   Fri, 28 Jul 2017 10:28:32 +0100
From:   Will Deacon <will.deacon@....com>
To:     Vikram Mulukutla <markivx@...eaurora.org>
Cc:     qiaozhou <qiaozhou@...micro.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        John Stultz <john.stultz@...aro.org>, sboyd@...eaurora.org,
        LKML <linux-kernel@...r.kernel.org>,
        Wang Wilbur <wilburwang@...micro.com>,
        Marc Zyngier <marc.zyngier@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel-owner@...r.kernel.org, sudeep.holla@....com
Subject: Re: [Question]: try to fix contention between expire_timers and
 try_to_del_timer_sync

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> On 2017-07-26 18:29, qiaozhou wrote:
> >On 2017-07-26 22:16, Thomas Gleixner wrote:
> >>On Wed, 26 Jul 2017, qiaozhou wrote:
> >>For that particular timer case we can clear base->running_timer w/o the
> >>lock held (see patch below), but this kind of
> >>
> >>      lock -> test -> unlock -> retry
> >>
> >>loops are all over the place in the kernel, so this is going to hurt you
> >>sooner than later in some other place.
> >It's true. This is the way spinlocks are normally and widely used in the
> >kernel. I'll also ask the ARM experts whether we can do something to
> >avoid or reduce the chance of such an issue. ARMv8.1 has a single
> >instruction (ldadda) to replace the ldaxr/stxr loop; hopefully it can
> >improve things and reduce the chance.
> 
> I think we should have this discussion now - I brought this up earlier [1]
> and I promised a test case that I completely forgot about - but here it
> is (attached). Essentially a Big CPU in an acquire-check-release loop
> will have an unfair advantage over a little CPU concurrently attempting
> to acquire the same lock, in spite of the ticket implementation. If the Big
> CPU needs the little CPU to make forward progress: livelock.
> 
> We've run into the same loop construct in other spots in the kernel, and
> the reason a real symptom is so rare is that the retry loop on the 'Big'
> CPU needs to be interrupted just once by, say, an IRQ/FIQ and the
> livelock is broken. If the entire retry loop is within an
> interrupt-disabled critical section, then the odds of livelocking are
> much higher.
> 
> An example of the problem on a previous kernel is here [2]. Changes to the
> workqueue code since may have fixed this particular instance.
> 
> One solution was to use udelay(1) in such loops instead of cpu_relax(), but
> that's not very 'relaxing'. I'm not sure if there's something we could do
> within the ticket spin-lock implementation to deal with this.
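
For readers following along, the loop construct being discussed has roughly
the shape of the user-space sketch below. This is only an illustration of the
pattern, not the attached test case or the kernel code: pthreads stand in for
the kernel spinlock, and shared_lock, timer_running, waiter() and finisher()
are invented names. The 'waiter' keeps taking and dropping the lock just to
re-check a flag that only the other thread can clear; on an asymmetric system
the faster core can win the lock on every iteration, so the slower core never
gets in to clear the flag.

/*
 * Hypothetical illustration of the lock -> test -> unlock -> retry pattern
 * discussed above (user space, pthreads); not the attached test case.
 * Build with: gcc -O2 -pthread sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_spinlock_t shared_lock;
static bool timer_running = true;	/* stand-in for base->running_timer */

/* "Big" core: re-acquires the lock on every pass just to look at the flag,
 * analogous to the try_to_del_timer_sync() retry loop. */
static void *waiter(void *unused)
{
	for (;;) {
		pthread_spin_lock(&shared_lock);
		bool still_running = timer_running;
		pthread_spin_unlock(&shared_lock);

		if (!still_running)
			break;
		/* cpu_relax() would go here; on a big core this loop can
		 * re-take the lock before the little core ever wins it. */
	}
	return NULL;
}

/* "Little" core: needs the lock exactly once to clear the flag, analogous
 * to expire_timers() clearing base->running_timer. */
static void *finisher(void *unused)
{
	pthread_spin_lock(&shared_lock);
	timer_running = false;
	pthread_spin_unlock(&shared_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_spin_init(&shared_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&a, NULL, waiter, NULL);
	pthread_create(&b, NULL, finisher, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("done\n");
	return 0;
}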

Does bodging cpu_relax to back off to wfe after a while help? The event
stream will wake it up if nothing else does. Nasty patch below, but I'd be
interested to know whether or not it helps.

Will

--->8

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 64c9e78f9882..1f5a29c8612e 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -149,9 +149,11 @@ extern void release_thread(struct task_struct *);
 
 unsigned long get_wchan(struct task_struct *p);
 
+void __cpu_relax(unsigned long pc);
+
 static inline void cpu_relax(void)
 {
-	asm volatile("yield" ::: "memory");
+	__cpu_relax(_THIS_IP_);
 }
 
 /* Thread switching */
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index 67368c7329c0..be8a698ea680 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -72,6 +72,8 @@ EXPORT_SYMBOL(_mcount);
 NOKPROBE_SYMBOL(_mcount);
 #endif
 
+EXPORT_SYMBOL(__cpu_relax);
+
 	/* arm-smccc */
 EXPORT_SYMBOL(__arm_smccc_smc);
 EXPORT_SYMBOL(__arm_smccc_hvc);
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 659ae8094ed5..c394c3704b7f 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -403,6 +403,31 @@ unsigned long get_wchan(struct task_struct *p)
 	return ret;
 }
 
+static DEFINE_PER_CPU(u64, __cpu_relax_data);
+
+#define CPU_RELAX_WFE_THRESHOLD	10000
+void __cpu_relax(unsigned long pc)
+{
+	u64 new, old = raw_cpu_read(__cpu_relax_data);
+	u32 old_pc, new_pc;
+	bool wfe = false;
+
+	old_pc = (u32)old;
+	new = new_pc = (u32)pc;
+
+	if (old_pc == new_pc) {
+		if ((old >> 32) > CPU_RELAX_WFE_THRESHOLD) {
+			asm volatile("sevl; wfe; wfe\n" ::: "memory");
+			wfe = true;
+		} else {
+			new = old + (1UL << 32);
+		}
+	}
+
+	if (this_cpu_cmpxchg(__cpu_relax_data, old, new) == old && !wfe)
+		asm volatile("yield" ::: "memory");
+}
+
 unsigned long arch_align_stack(unsigned long sp)
 {
 	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
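
A note on how the hack above is meant to behave, plus a rough user-space
rendering of the same idea. __cpu_relax_data packs the caller's PC (from
_THIS_IP_) into the low 32 bits of a per-CPU u64 and a spin count into the
high 32 bits; while the same call site keeps relaxing, the count is bumped,
and once it passes CPU_RELAX_WFE_THRESHOLD the yield is replaced by
sevl; wfe; wfe, where the first wfe appears intended to consume the
locally-set event so that the second one really sleeps until the event
stream (or a genuine event) wakes the core. The sketch below mirrors that
logic in plain C as an illustration only: relax(), RELAX_WAIT_THRESHOLD and
the nanosleep() stand-in for wfe are invented names, and __LINE__ stands in
for _THIS_IP_.

/*
 * Hypothetical user-space rendering of the back-off logic in the patch
 * above.  A per-thread u64 holds the call-site id in its low 32 bits and a
 * count in its high 32 bits; once one call site has spun past the
 * threshold, fall back to a heavier wait (nanosleep() here, wfe in the
 * patch).
 */
#include <stdint.h>
#include <time.h>

#define RELAX_WAIT_THRESHOLD	10000

static __thread uint64_t relax_data;	/* per-CPU in the patch, per-thread here */

static void relax(uint32_t callsite)
{
	uint64_t old = relax_data;

	if ((uint32_t)old == callsite) {
		if ((old >> 32) > RELAX_WAIT_THRESHOLD) {
			/* Analogous to sevl; wfe; wfe: stop burning cycles
			 * and wait for something to happen. */
			struct timespec ts = { .tv_sec = 0, .tv_nsec = 100000 };
			nanosleep(&ts, NULL);
			return;
		}
		relax_data = old + (1ULL << 32);	/* same call site: bump count */
	} else {
		relax_data = (uint64_t)callsite;	/* new call site: restart count */
	}
	/* The patch issues a "yield" at this point. */
}

int main(void)
{
	/* Spin at a single call site; after the threshold, relax() starts
	 * sleeping instead of spinning.  __LINE__ stands in for _THIS_IP_. */
	for (int i = 0; i < 20000; i++)
		relax(__LINE__);
	return 0;
}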
