linux-kernel - Re: [tip:sched/core] sched: Fix ancient race in do

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+55aFz20=H0ZubZEAmEbv0gL8j3C5Kgt4m2+5MU63uVipZt0Q@mail.gmail.com>
Date:	Sun, 29 Jan 2012 09:44:42 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	mingo@...hat.com, hpa@...or.com, linux-kernel@...r.kernel.org,
	a.p.zijlstra@...llo.nl, y-goto@...fujitsu.com,
	akpm@...ux-foundation.org, tglx@...utronix.de, mingo@...e.hu,
	linux-tip-commits@...r.kernel.org
Subject: Re: [tip:sched/core] sched: Fix ancient race in do_exit()

On Sun, Jan 29, 2012 at 8:07 AM, Oleg Nesterov <oleg@...hat.com> wrote:
> On 01/28, Linus Torvalds wrote:
>>
>> It would be much nicer to just clear the rwsem waiter->task thing
>> *after* waking the task up, which would avoid this race entirely,
>> afaik.
>
> How? The problem is that wake_up_process(tsk) sees this task in
> TASK_UNINTERRUPTIBLE state (the first "p->state & state" check in
> try_to_wake_up), but then this task changes its state to TASK_DEAD
> without schedule() and ttwu() does s/TASK_DEAD/TASK_RUNNING/.

Exactly.

The problem is that THE TASK HAS LONG SINCE GONE AWAY FROM THE RWSEM!

The task is already finishing up the exit (it might even have
*completed* the exit, which is why we take the extra reference count
to the task) when the wakeup happens.

THAT is the problem. The fact that in that big problem ("we do a
spurious wakeup for the rwsem long after the task no longer waits for
it") we then have a really small and tiny race (the race between
TASK_UNINTERRUPTIBLE -> (TASK_RUNNING ->) TASK_DEAD) is just a tiny
small detail.

So the real problem is that rwsem's are broken. Normal wakeups never
have this issue, because they use the spinlock for the waitqueue to
serialize: either a process wakes up in a timely manner, or the
process has removed itself from the wait-queue - you never see wakeups
long after the process has stopped caring.

The whole problem with the exit sequence is just a random detail. The
same spurious wakeup could happen in other places too - it just
doesn't happen to cause an Oops anywhere else (that we know about -
who knows if some code woudl be unhappy with getting woken up
unexpectedly).

>> Tell me, why wouldn't that work? rwsem_down_failed_common() does
>>
>>         /* wait to be given the lock */
>>         for (;;) {
>>                 if (!waiter.task)
>>                         break;
>>                ...
>>
>> so then we wouldn't need the task refcount crap in rwsem either etc,
>> and we'd get rid of all races with wakeup.
>>
>> I wonder why we're clearing that whole waiter->task so early.
>
> I must have missed something. I can't understand how this can help,
> and "clear the rwsem waiter->task thing *after* waking" looks
> obviously wrong.

Why? It helps because it means that the task that did the "down()" on
the rwsem will *never* leave its task pointer around to be spuriously
woken later: it will wait for the waiter->task entry to be NULL - so
that we know that the wakeup path has used the task _and_ woken it up,
so there is no possibility for the task pointer still being active and
used to wake up some *random* state later.

> If we do this, then we can miss the "!!waiter.task"
> condition. The loop above actually does
>
>        set_task_state(TASK_UNINTERRUPTIBLE);
>
>        if (!waiter.task)
>                break;
>        schedule();
>
> and
>        wake_up_process(tsk);
>        waiter->task = NULL;
>
> can happen right after set_task_state().

So? That's the *point*. We want to wait until the wakeup is *done*
before we return from down(). We do *not* want to "break" (and go off
doing something else) while some other CPU may still be using 'task'.

But yes, if you're talking about TASK_UNINTERRUPTIBLE, we do need to
just remove the setting of that entirely. It needs to be set *before*
adding us to the list, not after. That's just a bug - we get woken up
when we've been given the lock.

IOW, I think the fix should look like the attched.  TOTALLY UNTESTED!
This is certainly a much scarier patch, but I think it fixes the
actual *reason* for the problem, rather than just working around the
symptom.

So it may be completely and utterly broken for some subtle reason, but
I don't see what it would be. It seems to clean up and simplify the
logic, and remove all the bogus workarounds for the fact that we used
to do things stupidly.

But maybe there's some reason for those "stupid" things. I just don't see it.

                   Linus

View attachment "patch.diff" of type "text/x-patch" (1273 bytes)