lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wgfjK_-Wfkb6yXYpB5WfOv5yP2NiFO68yQfBfnzRu4yYQ@mail.gmail.com>
Date:   Tue, 30 Jun 2020 11:02:23 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Oleg Nesterov <oleg@...hat.com>
Cc:     Nicholas Piggin <npiggin@...il.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Davidlohr Bueso <dave@...olabs.net>, Jan Kara <jack@...e.cz>,
        Lukas Czerner <lczerner@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?

On Tue, Jun 30, 2020 at 4:51 AM Oleg Nesterov <oleg@...hat.com> wrote:
>
> On 06/30, Oleg Nesterov wrote:
> >
> > may be someting like this
>
> or this ...

Guys, I'd suggest we either

 (a) do the minimal ugly one-liner workaround

or

 (b) do something *much* more radical.

What would that "radical" thing be? Just move the "behavior" bit into
the "struct wait_page_key" itself, and handle it at wakeup time!

Right now we have

 - wake_page_function() basically tests that it's the right bit and
page, and that the bit has actually cleared

 - then it calls "autoremove_wake_function()" to wake things up and
remove you from the wait list if that wqoke you up.

I would suggest that if we want to clean this up, we do

 - add "behavior" to  "struct wait_page_key"

 - now, for the "exclusive" case, instead of doing

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;

   the wake_page_function() would actually do

        if (test_and_set_bit(key->bit_nr, &key->page->flags))
                return -1;
        /*
         * Careful, careful: we need to make sure that the *last*
         * thing that touches 'wq_entry' is setting WQ_FLAG_WOKEN
         */
        list_del_init(&wq_entry->entry);
        default_wake_function(wq_entry, mode, sync, key);
        smp_store_release(&wq_entry->flags, WQ_FLAG_WOKEN);

        /* No point in waking up anybody else */
        return -1;

while for the SHARED and DROP cases it would do

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;
        list_del_init(&wq_entry->entry);
        default_wake_function(wq_entry, mode, sync, key);
        smp_store_release(&wq_entry->flags, WQ_FLAG_WOKEN);

        return 0;

and now the actual _waiting_ side can basically just do

        /* All the wait-queue setup is done outside the loop */
        wait_page.behavior = behavior;

        spin_lock_irq(&q->lock);
        .. test the big again here, do the rigth thing for mode, exit if done ..
        __add_wait_queue_entry_tail(q, wait);
        SetPageWaiters(page);
        set_current_state(state);
        spin_unlock_irq(&q->lock);

        for (;;) {
                set_current_state(state);

                /*
                 * Have we already been woken and are all done?
                 * We don't even need to remove ourselves from the
                 * wait-queues, that's been done for us.
                 */
                if (wait->flags & WQ_FLAG_WOKEN)
                        return 0;

                if (signal_pending_state(state, current))
                         break;

                io_schedule();
        }
        /* We got interrupted by a signal? */
        finish_wait(q, wait);

        /* But maybe we raced with being woken and are all done? */
        if (wait->flags & WQ_FLAG_WOKEN)
                        return 0;

        return -EINTR;

NOTE! The above doesn't do the thrashing/delayacct handling, and that
was intentional for clarity, because I actually think that should be
done in an outer function that just calls this actual "do the
lock/wait"

Look, doesn't that seem *much* simpler and clearer? Having that flag
in the wait-queue (that stays around even after it's been removed from
the list) shows that the state we're testing for - or the state we
wanted - is already ours.

Note that we have a "woken_wake_function()" that does that
WQ_FLAG_WOKEN thing, but I don't think it's careful about that final
WQ_FLAG_WOKEN bit setting, and the above function cares about that
(notice how it checks WQ_FLAG_WOKEN without ever taking the spinlock
that protects the other members of it, so the stack space can get
deallocated immediately once it sees that flag.

The above code has been written in my MUA, it may be entirely broken,
but it _feels_ right to me.

And yes, it's a lot bigger change than my one-liner. But honestly,
this is the kind of thing that the whole "struct wait_page_key" thing
is all about and exists for.

What do people think? I think it clarifies the logic and makes things
potentially much clearer. In fact, now the actual _waiting_ code never
looks at the struct page at all, so it can drop the reference to the
page early, outside the loop, adn that "drop" thing doesn't even have
to be visible on the waker side either (the wakeup logic is the same
as SHARED)

Hmm?

              Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ