linux-kernel - Re: [GIT pull] timer updates for 4.9

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.20.1610240947160.4872@nanos>
Date:   Mon, 24 Oct 2016 11:39:53 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
cc:     LKML <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [GIT pull] timer updates for 4.9

On Sun, 23 Oct 2016, Linus Torvalds wrote:
> So I found what looks like a bug in lock_timer_base() wrt migration.
> 
> This code:
> 
>         for (;;) {
>                 struct timer_base *base;
>                 u32 tf = timer->flags;
> 
>                 if (!(tf & TIMER_MIGRATING)) {
>                         base = get_timer_base(tf);
>                         spin_lock_irqsave(&base->lock, *flags);
>                         if (timer->flags == tf)
>                                 return base;
>                         spin_unlock_irqrestore(&base->lock, *flags);
>                 }
>                 cpu_relax();
>         }
> 
> looks subtly buggy. I think that load of "tf" needs a READ_ONCE() to
> make sure that gcc doesn't simply reload the valid of "timer->flags"
> at random points.

You are right, that needs a READ_ONCE(). Stupid me.
 
> Yes, the spin_lock_irqsave() is a barrier, but that's the only one.
> Afaik, gcc could decide that "I need to spill tf, so I'll just reload
> it" after looking up get_timer_base().
>
> And no, I don't think this is the cause of my problem, but I suspect
> that something _like_ fragility in lock_timer_base() could cause this.

It might explain it, when this really ends up with the wrong base.

> I dunno. That whole thing looks very fragile to begin with: is it
> really ok to change the expiry time of a timer without holding any
> locks what-so-ever? The timer may just be firing on another CPU, and
> you may be setting the expiry time on a timer that isn't ever going to
> fire again.

A timer firing on the other CPU is not an issue, but yes, we should not do
that unlocked. This certainly is a naive over optimization.

I'll go through the locking once more with a fine comb and send out fixes
later today along with another NOHZ issue which I decoded over the weekend.
 
> And there may well be valid reasons why I'm full of crap, and there's
> some reason why this is all safe. Maybe the GP fault I saw was my
> fault after all, in some way that I can't for the life of me figure
> out right now..

Is your pr_cont thing serialized or does it rely on the timer locking (or
the lack of it)?

Thanks,

	tglx