linux-kernel - Re: [Problem] Cache line starvation

Message-ID: <20180927142547.ucgh5elb7pxs46dq@linutronix.de>
Date:   Thu, 27 Sep 2018 16:25:47 +0200
From:   Kurt Kanzenbach <kurt.kanzenbach@...utronix.de>
To:     Will Deacon <will.deacon@....com>
Cc:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        linux-kernel@...r.kernel.org,
        Daniel Wagner <daniel.wagner@...mens.com>,
        Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        Boqun Feng <boqun.feng@...il.com>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Mark Rutland <mark.rutland@....com>
Subject: Re: [Problem] Cache line starvation

Hi Will,

On Wed, Sep 26, 2018 at 01:53:02PM +0100, Will Deacon wrote:
> Hi all,
>
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > cores).
> >
> > Instrumentation always shows the same picture:
> >
> > CPU0                                         CPU1
> > => do_syscall_64                              => do_syscall_64
> > => SyS_ptrace                                   => syscall_slow_exit_work
> > => ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock
> > => wait_task_inactive                              rt_spin_lock_slowunlock()
> >    -> while task_running()                         __rt_mutex_unlock_common()
> >   /   check_task_state()                           mark_wakeup_next_waiter()
> >  |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
> >  |     .                                               .
> >  |     raw_spin_unlock_irq(&p->pi_lock);               .
> >   \  cpu_relax()                                       .
> >    -                                                   .
> >     *IRQ*                                          <lock acquired>
> >
> > In the error case we observe that the while() loop is repeated more than
> > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> > other side does not make progress waiting for the same lock with interrupts
> > disabled.
> >
> > This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> > the other CPU is able to acquire pi_lock and the situation relaxes.
> >
> > Peter suggested to do a clwb(&p->pi_lock); before the cpu_relax() in
> > wait_task_inactive() which on both the Core2Duo and the SKL gets runtime
> > patched to clflush(). That hides it as well.
>
> Given the broadcast nature of cache-flushing, I'd be pretty nervous about
> adding it on anything other than a case-by-case basis. That doesn't sound
> like something we'd want to maintain... It would also be interesting to know
> whether the problem is actually before the cache (i.e. if the lock actually
> sits in the store buffer on CPU0). Does MFENCE/DSB after the unlock() help at
> all?
>
> We've previously seen something similar to this on arm64 in big/little
> systems where the big cores can loop around and re-take a spinlock before
> the little guys can get in the queue or take a ticket. I bodged that in
> cpu_relax(), but there's a magic heuristic which I couldn't figure out how
> to specify:
>
> https://lkml.org/lkml/2017/7/28/172
>
> For A72 (which is the core I think you're using) it would be interesting to
> try both:
>
> 	(1) Removing the prfm instruction from spin_lock(), and
> 	(2) Setting bit 42 of CPUACTLR_EL1 on each CPU (probably needs a
> 	    firmware change)

Correct, we use the Cortex-A72.

I followed your suggestions. I've removed the prefetch instructions from
the spin lock implementation in the v4.9 kernel. In addition, I've
modified armv8/start.S in U-Boot to set bit 42 in CPUACTLR_EL1
(S3_1_c15_c2_0). We've also made sure that this bit is actually set
on each CPU by reading the register value back in the kernel.
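
For illustration only, the bit manipulation behind that check might look
like the sketch below. It is not the actual U-Boot or kernel change. The
macro and function names are made up here; only the bit position (42)
comes from this thread. Reading or writing the real register has to
happen at a sufficiently privileged exception level, so the helpers work
on a plain 64-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 42, as discussed in the thread. The macro name is hypothetical. */
#define CPUACTLR_BIT42 (UINT64_C(1) << 42)

/* Return the register value with bit 42 set (what firmware would write back). */
static inline uint64_t cpuactlr_set_bit42(uint64_t val)
{
	return val | CPUACTLR_BIT42;
}

/* Check a value read back from the register for bit 42. */
static inline bool cpuactlr_bit42_is_set(uint64_t val)
{
	return (val & CPUACTLR_BIT42) != 0;
}
```

Using UINT64_C(1) rather than a plain 1 matters here, because shifting a
32-bit int by 42 is undefined in C.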

However, the issue still triggers fine. With stress-ng we're able to
generate latencies in the millisecond range. The only workaround we've
found so far is to add a "delay" in cpu_relax().

Any ideas what we can test further?
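
As a rough userspace analogy of that workaround (not the kernel code and
not the actual patch), the sketch below polls a flag under a lock and
inserts a delay before retaking it, so another waiter gets a window to
acquire the lock. All names (demo_lock_t, relax_with_delay, poll_until)
and the busy-loop delay are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set lock standing in for pi_lock. */
typedef struct {
	atomic_flag locked;
} demo_lock_t;

static void demo_lock(demo_lock_t *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void demo_unlock(demo_lock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Stand-in for a cpu_relax() with an added delay: a bounded busy loop. */
static void relax_with_delay(unsigned iterations)
{
	for (volatile unsigned i = 0; i < iterations; i++)
		;
}

/*
 * Poll *done under the lock, similar in shape to the while loop in
 * wait_task_inactive(). Returns the number of polls performed, capped
 * at max_polls so the loop always terminates.
 */
static unsigned poll_until(demo_lock_t *l, atomic_bool *done,
			   unsigned delay, unsigned max_polls)
{
	unsigned polls = 0;

	while (polls < max_polls) {
		demo_lock(l);
		bool finished = atomic_load(done);
		demo_unlock(l);
		polls++;
		if (finished)
			break;
		/* Back off before retaking the lock. */
		relax_with_delay(delay);
	}
	return polls;
}
```

The delay length is a tunable with no obviously right value, which is
the same difficulty as the heuristic mentioned in the quoted mail.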

Thanks,
Kurt

>
> That should prevent the lock() operation from speculatively pulling in the
> cacheline in a unique state.
>
> More recent Arm CPUs have atomic instructions which, apart from CAS,
> *should* avoid this starvation issue entirely.
>
> Will
>