Message-ID: <20180205141538.4l3nbdmkejys7ok4@lakrids.cambridge.arm.com>
Date:   Mon, 5 Feb 2018 14:15:39 +0000
From:   Mark Rutland <mark.rutland@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     efault@....de, linux-kernel@...r.kernel.org,
        alexander.levin@...izon.com, tglx@...utronix.de, mingo@...nel.org,
        linux-arm-kernel@...ts.infradead.org
Subject: Re: Runqueue spinlock recursion on arm64 v4.15

On Mon, Feb 05, 2018 at 03:02:01PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 05, 2018 at 01:36:00PM +0000, Mark Rutland wrote:
> > On Fri, Feb 02, 2018 at 10:07:26PM +0000, Mark Rutland wrote:
> > > On Fri, Feb 02, 2018 at 08:55:06PM +0100, Peter Zijlstra wrote:
> > > > On Fri, Feb 02, 2018 at 07:27:04PM +0000, Mark Rutland wrote:
> > > > > ... in some cases, owner_cpu is -1, so I guess we're racing with an
> > > > > unlock. I only ever see this on the runqueue locks in wake up functions.
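
(FWIW, the splats come from the CONFIG_DEBUG_SPINLOCK check in
kernel/locking/spinlock_debug.c, which looks roughly like:

	static inline void debug_spin_lock_before(raw_spinlock_t *lock)
	{
		SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
		SPIN_BUG_ON(lock->owner == current, lock, "recursion");
		SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
							lock, "cpu recursion");
	}

so a stale lock->owner that happens to equal current gets reported as
recursion even though the lock is really held by someone else.)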
> > > > 
> > > > So runqueue locks are special in that the owner changes over a context
> > > > switch, maybe something goes funny there?
> > > 
> > > Aha! I think that's it!
> > > 
> > > In finish_lock_switch() we do:
> > > 
> > > 	smp_store_release(&prev->on_cpu, 0);
> > > 	...
> > > 	rq->lock.owner = current;
> > > 
> > > As soon as we update prev->on_cpu, prev can be scheduled on another CPU, and
> > > can thus see a stale value for rq->lock.owner (e.g. if it tries to wake up
> > > another task on that rq).
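
The gist of the fix is just to publish the new owner before prev->on_cpu
is released, i.e. roughly:

	/* prev cannot yet run elsewhere; make the owner visible first */
	rq->lock.owner = current;
	...
	smp_store_release(&prev->on_cpu, 0);

so that by the time prev can be scheduled on another CPU and poke at this
rq's lock, the owner field is already up to date.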
> > 
> > I hacked in a forced vCPU preemption between the two using a sled of WFE
> > instructions, and now I can trigger the problem in seconds rather than
> > hours.
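
(The forced preemption isn't part of any fix; it was a throwaway hack,
something like the below between the two stores, relying on WFE typically
trapping to the hypervisor so the vCPU gets yielded:

	int i;

	smp_store_release(&prev->on_cpu, 0);

	/* HACK: widen the race window; each WFE may trap and yield the vCPU */
	for (i = 0; i < 10000; i++)
		asm volatile("wfe" ::: "memory");

	rq->lock.owner = current;

The iteration count and exact placement are illustrative only.)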
> > 
> > With the patch below applied, things seem to be fine so far.
> > 
> > So I'm pretty sure this is it. I'll clean up the patch text and resend
> > that in a bit.
> 
> Also try to send it against an up-to-date scheduler tree; we just
> moved some stuff around in that area.

Ah, will do. I guess I should base it on tip sched/urgent?

Thanks,
Mark.
