linux-kernel - Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?

Message-Id: <1593436373.x8otyji40u.astroid@bobo.none>
Date:   Mon, 29 Jun 2020 23:16:02 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andi Kleen <ak@...ux.intel.com>,
        Davidlohr Bueso <dave@...olabs.net>, Jan Kara <jack@...e.cz>,
        Lukas Czerner <lczerner@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Oleg Nesterov <oleg@...hat.com>
Subject: Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss
 wakeup?

Excerpts from Nicholas Piggin's message of June 29, 2020 1:28 pm:
> Excerpts from Linus Torvalds's message of June 28, 2020 3:39 pm:
>> On Fri, Jun 26, 2020 at 8:43 AM Peter Zijlstra <peterz@...radead.org> wrote:
>>>
>>> I ended up with something like the below.. but it is too warm to think
>>> properly.
>>>
>>> I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
>>> all that even less.
>> 
>> Ugh.
>> 
>> I think I have a much simpler approach, actually.
>> 
>> So the *problem* is purely around that
>> 
>>                 if (behavior == EXCLUSIVE) {
>>                         if (!test_and_set_bit_lock(bit_nr, &page->flags))
>>                                 break;
>>                 } else ..
>> 
>> and in particular it is purely *after* that test_and_set_bit_lock()
>> case. We have two cases:
>> 
>>  (a) *If* we get the lock there, we're good, and all done, and we have
>> the lock. We don't care about any other wakeups, because they'll be
>> stale anyway (the thing that released the lock that we just took) and
>> because we got the lock, no other exclusive waiters should be woken up
>> anyway (and we will in turn wake up any waiters when we release it)
>> 
>>  (b) we did *not* get the lock, because somebody else got it and is
>> about to immediately unlock again. And that _future_ wakeup that they
>> send might get lost because it might end up targeting us (but we might
>> just exit and not care).
>> 
>> Agreed?
>> 
>> So the only case that really matters is that we didn't get the lock,
>> but we must *not* be woken up afterwards.
>> 
>> So how about the attached trivial two-liner? We solve the problem by
>> simply marking ourselves TASK_RUNNING, which means that we won't be
>> counted as an exclusive wakeup.
>> 
>> Ok, so the "one" line to do that is actually two lines:
>> 
>>         __set_current_state(TASK_RUNNING);
>>         smp_mb__before_atomic();
>> 
>> and there's four lines of comments to go with it, but it really is
>> very simple: if we do that before we do the test_and_set_bit_lock(),
>> no wakeups will be lost, because we won't be sleeping for that wakeup.
>> 
>> I'm not entirely happy about that "smp_mb__before_atomic()". I think
>> it's right in practice that test_and_set_bit_lock() (when it actually
>> does a write) has at LEAST atomic semantics, so I think it's good.
>> But it's not pretty.
>> 
>> But I don't want to use a heavy
>> 
>>         set_current_state(TASK_RUNNING);
>>         if (!test_and_set_bit_lock(bit_nr, &page->flags)) ..
>> 
>> sequence, because at least on x86, that test_and_set_bit_lock()
>> already has a memory barrier in it, so the extra memory barrier from
>> set_current_state() is all kinds of pointless.
>> 
>> Hmm?
> 
> Wow good catch. Does bit_is_set even have to be true? If it's woken due 
> to a signal, it may still be on the waitqueue right?

No, ignore this part (as you explained, it was just a thinko on my side,
and your patch of course would not have worked if this were the case):
the exclusive wake up doesn't get lost if schedule() was called, because
the task state goes back to running regardless of what woke it.
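
To make that accounting concrete, here is a small userspace model (not
kernel code) of how an exclusive wake up is only consumed by a waiter
that was actually asleep. The names task_state, waiter, try_wake and
wake_up_model are made up for illustration; the real logic lives in
__wake_up_common() and try_to_wake_up(), and this sketch ignores locking
and memory ordering entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task states for the model. */
enum task_state { TASK_RUNNING, TASK_SLEEPING };

/* Hypothetical stand-in for a wait queue entry plus its task. */
struct waiter {
	enum task_state state;
	int exclusive;	/* models WQ_FLAG_EXCLUSIVE */
	int woken;	/* set when this waiter consumed a wake up */
};

/* Returns 1 only if the waiter was asleep and is now running. */
static int try_wake(struct waiter *w)
{
	if (w->state != TASK_SLEEPING)
		return 0;
	w->state = TASK_RUNNING;
	w->woken = 1;
	return 1;
}

/*
 * Walk the queue in order and stop once nr_exclusive exclusive waiters
 * have really been woken. A waiter that is already running does not
 * count, so the wake up moves on to the next entry.
 * Returns the number of waiters woken.
 */
static int wake_up_model(struct waiter *q, size_t n, int nr_exclusive)
{
	int woken = 0;

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int ret = try_wake(&q[i]);

		woken += ret;
		if (ret && q[i].exclusive && --nr_exclusive == 0)
			break;
	}
	return woken;
}
```

In this model, a waiter that is still queued but already TASK_RUNNING is
skipped, and the single exclusive wake up reaches the next sleeper. The
lost wake up needs a waiter that is queued, still marked sleeping, and
about to bail out without taking the lock.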

I would still prefer changing it to the fix below, though.

> works, but it looks like a pretty standard variant of "don't lose
> wakeups" bug.
> 
> prepare_to_wait_event() has a pretty good pattern (and comment), I would
> favour using that (test the signal when inserting on the waitqueue).
> 
> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>         for (;;) {
>                 spin_lock_irq(&q->lock);
>  
> +               if (signal_pending_state(state, current)) {
> +                       /* Must not lose an exclusive wake up, see
> +                        * prepare_to_wait_event comment */
> +                       list_del_init(&wait->entry);
> +                       spin_unlock_irq(&q->lock);
> +                       ret = -EINTR;
> +                       break;
> +               }
> +
>                 if (likely(list_empty(&wait->entry))) {
>                         __add_wait_queue_entry_tail(q, wait);
>                         SetPageWaiters(page);
> @@ -1157,11 +1166,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>                                 break;
>                 }
>  
> -               if (signal_pending_state(state, current)) {
> -                       ret = -EINTR;
> -                       break;
> -               }
> -
>                 if (behavior == DROP) {
>                         /*
>                          * We can no longer safely access page->flags:
> 
> - mutex_lock_common does the signal check under its wait queue lock.
> 
> - rwsem looks like it does it backward and checks if it was woken if it
> got a signal and tries to handle it that way (hopefully okay, I prefer
> the standard pattern).
> 
> - futex unqueues and tests for wakeup before testing signal.
> 
> Etc. And it's not even exclusive to signals of course, those are just 
> the typical asynchronous thing that would wake us without removing from
> the wait queue. Bit of a shame there is no standard pattern to follow
> though.
> 
> I wonder how you could improve that. finish_wait could WARN_ON an
> exclusive waiter being found on the queue?
> 
> @@ -377,6 +377,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
>          *    the list).
>          */
>         if (!list_empty_careful(&wq_entry->entry)) {
> +               WARN_ON(wq_entry->flags & WQ_FLAG_EXCLUSIVE);
>                 spin_lock_irqsave(&wq_head->lock, flags);
>                 list_del_init(&wq_entry->entry);
>                 spin_unlock_irqrestore(&wq_head->lock, flags);
> 
> That doesn't catch a limited count of wake ups; maybe if you passed a 
> success value in to finish_wait, it could warn in case a failure has 
> WQ_FLAG_WOKEN. That doesn't help things that invent their own waitqueues
> mind you. I wonder if we could do some kind of generic annotations for
> anyone implementing wait queues to call, which could have debug checks
> implemented?
> 
> Thanks,
> Nick
> 
>
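
The "test the signal when inserting on the waitqueue" pattern from the
quoted message can be sketched in userspace roughly as follows. This is
only a single-threaded model under stated assumptions: wq_model,
prepare_to_wait_model and wake_one_model are hypothetical names, the
queue is a plain array, and the spinlock that prepare_to_wait_event()
holds around this step is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Hypothetical FIFO of waiter ids standing in for a wait queue. */
struct wq_model {
	int entries[MAX_WAITERS];
	size_t count;
};

static bool wq_contains(const struct wq_model *q, int id)
{
	for (size_t i = 0; i < q->count; i++)
		if (q->entries[i] == id)
			return true;
	return false;
}

/* Remove id from the queue if present, keeping FIFO order. */
static void wq_remove(struct wq_model *q, int id)
{
	for (size_t i = 0; i < q->count; i++) {
		if (q->entries[i] != id)
			continue;
		for (size_t j = i + 1; j < q->count; j++)
			q->entries[j - 1] = q->entries[j];
		q->count--;
		return;
	}
}

/*
 * Model of the prepare step. In the kernel this runs with the wait
 * queue lock held, so "signal seen" and "no longer queued" become
 * visible to wakers together.
 */
static int prepare_to_wait_model(struct wq_model *q, int id,
				 bool signal_pending)
{
	assert(q != NULL);
	if (signal_pending) {
		/* Leave the queue first, so no wake up can target us. */
		wq_remove(q, id);
		return -EINTR;
	}
	if (wq_contains(q, id))
		return 0;
	if (q->count == MAX_WAITERS)
		return -ENOSPC;	/* model-only limit */
	q->entries[q->count++] = id;
	return 0;
}

/* Wake a single exclusive waiter: pop the head, or -1 if empty. */
static int wake_one_model(struct wq_model *q)
{
	int id;

	if (q->count == 0)
		return -1;
	id = q->entries[0];
	wq_remove(q, id);
	return id;
}
```

Because the interrupted waiter is dequeued in the same step that reports
-EINTR, a later single wake up in this model always lands on a waiter
that is still going to act on it.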