[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f10f4c25-49c0-7ef5-55c2-769c8fd9bf90@linux.intel.com>
Date: Fri, 25 Aug 2017 15:19:53 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...e.hu>, Andi Kleen <ak@...ux.intel.com>,
Kan Liang <kan.liang@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
Christopher Lameter <cl@...ux.com>,
"Eric W . Biederman" <ebiederm@...ssion.com>,
Davidlohr Bueso <dave@...olabs.net>,
linux-mm <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in
wake_up_page_bit
On 08/25/2017 12:58 PM, Linus Torvalds wrote:
> On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen <tim.c.chen@...ux.intel.com> wrote:
>> Now that we have added breaks in the wait queue scan and allow bookmark
>> on scan position, we put this logic in the wake_up_page_bit function.
>
> Oh, _this_ is the other patch you were talking about. I thought it was
> the NUMA counter threashold that was discussed around the same time,
> and that's why I referred to Mel.
>
> Gods, _this_ patch is ugly. No, I'm not happy with it at all. It
> makes that wait_queue_head much bigger, for this disgusting one use.
>
> So no, this is no good.
>
> Now, maybe the page_wait_table[] itself could be changed to be
> something that is *not* just the wait-queue head.
>
> But if we need to change the page_wait_table[] itself to have more
> information, then we should make it not be a wait-queue at all, we
> should make it be a list of much more specialized entries, indexed by
> the {page,bit} tuple.
>
> And once we do that, we probably *could* use something like two
> specialized lists: one that is wake-all, and one that is wake-one.
>
> So you'd have something like
>
> struct page_wait_struct {
> struct list_node list;
> struct page *page;
> int bit;
> struct llist_head all;
> struct llist_head exclusive;
> };
>
> and then the "page_wait_table[]" would be just an array of
>
> struct page_wait_head {
> spinlock_t lock;
> struct hlist_head list;
> };
>
> and then the rule is:
>
> - each page/bit combination allocates one of these page_wait_struct
> entries when somebody starts waiting for it for the first time (and we
> use the spinlock in the page_wait_head to serialize that list).
>
> - an exclusive wait (ie somebody who waits to get the bit for
> locking) adds itself to the 'exclusive' llist
>
> - non-locking waiters add themselves to the 'all' list
>
> - we can use llist_del_all() to remove the 'all' list and then walk
> it and wake them up without any locks
>
> - we can use llist_del_first() to remove the first exclusive waiter
> and wait _it_ up without any locks.
>
> Hmm? How does that sound? That means that we wouldn't use the normal
> wait-queues at all for the page hash waiting. We'd use this two-level
> list: one list to find the page/bit thing, and then two lists within
> that fdor the wait-queues for just *that* page/bit.
>
> So no need for the 'key' stuff at all, because the page/bit data would
> be in the first data list, and the second list wouldn't have any of
> these traversal issues where you have to be careful and do it one
> entry at a time.
If I have the waker count and flags on the wait_page_queue structure
will that have been more acceptable?
Also I think patch 1 is still a good idea for a fail safe mechanism
in case there are other long wait list.
That said, I do think your suggested approach is cleaner. However, it
is a much more substantial change. Let me take a look and see if I
have any issues implementing it.
Tim
>
> Linus
>
Powered by blists - more mailing lists