linux-kernel - Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150918062541.GA7769@js1304-P5Q-DELUXE>
Date:	Fri, 18 Sep 2015 15:25:41 +0900
From:	Joonsoo Kim <iamjoonsoo.kim@....com>
To:	Mel Gorman <mgorman@...hsingularity.net>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Rik van Riel <riel@...hat.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	David Rientjes <rientjes@...gle.com>,
	Michal Hocko <mhocko@...nel.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable
 to sleep, unwilling to sleep and avoiding waking kswapd

On Wed, Sep 09, 2015 at 01:22:03PM +0100, Mel Gorman wrote:
> On Tue, Sep 08, 2015 at 03:49:58PM +0900, Joonsoo Kim wrote:
> > 2015-08-24 21:09 GMT+09:00 Mel Gorman <mgorman@...hsingularity.net>:
> > > __GFP_WAIT has been used to identify atomic context in callers that hold
> > > spinlocks or are in interrupts. They are expected to be high priority and
> > > have access one of two watermarks lower than "min" which can be referred
> > > to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> > > watermark and can be called the "high priority reserve".
> > >
> > > Over time, callers had a requirement to not block when fallback options
> > > were available. Some have abused __GFP_WAIT leading to a situation where
> > > an optimisitic allocation with a fallback option can access atomic reserves.
> > >
> > > This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
> > > cannot sleep and have no alternative. High priority users continue to use
> > > __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> > > willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers
> > > that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> > > as a caller that is willing to enter direct reclaim and wake kswapd for
> > > background reclaim.
> > 
> > Hello, Mel.
> > 
> > I think that it is better to do one thing at one patch.
> 
> This was a case where the incremental change felt unnecessary. The purpose
> of the patch is to "distinguish between being unable to sleep, unwilling
> to sleep and avoiding waking kswapd". Splitting that up is possible but
> I'm not convinced it helps.
> 
> > To distinguish real atomic, we just need to introduce __GFP_ATOMIC and
> > make GFP_ATOMIC to __GFP_ATOMIC | GFP_HARDER and change related
> > things. __GFP_WAIT changes isn't needed at all for this purpose. It can
> > reduce patch size and provides more good bisectability.
> > 
> > And, I don't think that introducing __GFP_KSWAPD_RECLAIM is good thing.
> > Basically, kswapd reclaim should be enforced.
> 
> Several years ago, I would have agreed. Now there are callers that want
> to control kswapd and I think it made more sense to clearly state whether
> RECLAIM and KSWAPD are allowed instead of having RECLAIM and NO_KSWAPD
> flags -- i.e. flags that consistently allow or consistently deny.
> 
> > New flag makes user who manually
> > manipulate gfp flag more difficult. Without this change, your second hazard will
> > be disappeared although it is almost harmless.
> > 
> > And, I doubt that this big one shot change is preferable. AFAIK, even if changes
> > are one to one mapping and no functional difference, each one is made by
> > one patch and send it to correct maintainer. I guess there is some difficulty
> > in this patch to do like this, but, it could. Isn't it?
> > 
> 
> Splitting this into one patch per maintainer would be a review and bisection
> nightmare. If I saw someone else doing that I would wonder if they were
> just trying to increase their patch count for no reason.
> 
> > Some nitpicks are below.
> > 
> > > <SNIP>
> > >
> > > diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> > > index 03e75fef15b8..86809bd2026d 100644
> > > --- a/arch/arm/xen/mm.c
> > > +++ b/arch/arm/xen/mm.c
> > > @@ -25,7 +25,7 @@
> > >  unsigned long xen_get_swiotlb_free_pages(unsigned int order)
> > >  {
> > >         struct memblock_region *reg;
> > > -       gfp_t flags = __GFP_NOWARN;
> > > +       gfp_t flags = __GFP_NOWARN|___GFP_KSWAPD_RECLAIM;
> > 
> > Please use __XXX rather than ___XXX.
> > 
> 
> Fixed.
> 
> > > <SNIP>
> > >
> > > @@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > >                  * We solve this, and guarantee forward progress, with a rescuer
> > >                  * workqueue per bio_set. If we go to allocate and there are
> > >                  * bios on current->bio_list, we first try the allocation
> > > -                * without __GFP_WAIT; if that fails, we punt those bios we
> > > -                * would be blocking to the rescuer workqueue before we retry
> > > -                * with the original gfp_flags.
> > > +                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> > > +                * bios we would be blocking to the rescuer workqueue before
> > > +                * we retry with the original gfp_flags.
> > >                  */
> > >
> > >                 if (current->bio_list && !bio_list_empty(current->bio_list))
> > > -                       gfp_mask &= ~__GFP_WAIT;
> > > +                       gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> > 
> > How about introduce helper function to mask out __GFP_DIRECT_RECLAIM?
> > It can be used many places.
> > 
> 
> In this case, the pattern for removing a single flag is easier to recognise
> than a helper whose implementation must be examined.
> 
> > >                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> > >                 if (!p && gfp_mask != saved_gfp) {
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 627ed0c593fb..e3605acaaffc 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
> > >   * @bio: bio to allocate request for (can be %NULL)
> > >   * @gfp_mask: allocation mask
> > >   *
> > > - * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
> > > - * function keeps retrying under memory pressure and fails iff @q is dead.
> > > + * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> > > + * this function keeps retrying under memory pressure and fails iff @q is dead.
> > >   *
> > >   * Must be called with @q->queue_lock held and,
> > >   * Returns ERR_PTR on failure, with @q->queue_lock held.
> > > @@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> > >         if (!IS_ERR(rq))
> > >                 return rq;
> > >
> > > -       if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> > > +       if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
> > >                 blk_put_rl(rl);
> > >                 return rq;
> > >         }
> > > @@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
> > >   * BUG.
> > >   *
> > >   * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> > > - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> > > - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> > > - * completion of a bio that hasn't been submitted yet, thus resulting in a
> > > - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> > > - * of bio_alloc(), as that avoids the mempool deadlock.
> > > + * given to how you allocate bios. In particular, you cannot use
> > > + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> > > + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> > > + * thus resulting in a deadlock. Alternatively bios should be allocated using
> > > + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
> > >   * If possible a big IO should be split into smaller parts when allocation
> > >   * fails. Partial allocation should not be an error, or you risk a live-lock.
> > >   */
> > > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > > index 1a27f45ec776..381cb50a673c 100644
> > > --- a/block/blk-ioc.c
> > > +++ b/block/blk-ioc.c
> > > @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
> > >  {
> > >         struct io_context *ioc;
> > >
> > > -       might_sleep_if(gfp_flags & __GFP_WAIT);
> > > +       might_sleep_if(gfpflags_allow_blocking(gfp_flags));
> > >
> > >         do {
> > >                 task_lock(task);
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index 9b6e28830b82..a8b46659ce4e 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
> > >         if (tag != -1)
> > >                 return tag;
> > >
> > > -       if (!(data->gfp & __GFP_WAIT))
> > > +       if (!gfpflags_allow_blocking(data->gfp))
> > >                 return -1;
> > >
> > >         bs = bt_wait_ptr(bt, hctx);
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 7d842db59699..7d80379d7a38 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
> > >                 if (percpu_ref_tryget_live(&q->mq_usage_counter))
> > >                         return 0;
> > >
> > > -               if (!(gfp & __GFP_WAIT))
> > > +               if (!gfpflags_allow_blocking(gfp))
> > >                         return -EBUSY;
> > >
> > >                 ret = wait_event_interruptible(q->mq_freeze_wq,
> > > @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
> > >
> > >         ctx = blk_mq_get_ctx(q);
> > >         hctx = q->mq_ops->map_queue(q, ctx->cpu);
> > > -       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> > > +       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
> > >                         reserved, ctx, hctx);
> > >
> > >         rq = __blk_mq_alloc_request(&alloc_data, rw);
> > > -       if (!rq && (gfp & __GFP_WAIT)) {
> > > +       if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
> > >                 __blk_mq_run_hw_queue(hctx);
> > >                 blk_mq_put_ctx(ctx);
> > 
> > Is there any reason not to use gfpflags_allow_nonblocking() here?
> > There are some places not using this helper and reason isn't
> > specified.
> > 
> 
> Strictly speaking the helper could be used. However, in cases where the
> same function manipulates or examines the flag in any way, I did not use
> the helper. It's in all those cases, I thought the final result was
> easier to follow.
> > >
> > >  /*
> > > + * A caller that is willing to wait may enter direct reclaim and will
> > > + * wake kswapd to reclaim pages in the background until the high
> > > + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> > > + * avoid unnecessary delays when a fallback option is available but
> > > + * still allow kswapd to reclaim in the background. The kswapd flag
> > > + * can be cleared when the reclaiming of pages would cause unnecessary
> > > + * disruption.
> > > + */
> > > +#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> > 
> > Convention is that combination of gfp flags don't use __XXX.
> > 
> 
> I don't understand. GFP_MOVABLE_MASK, GFP_USER and a bunch of other
> combinations use __XXX.

Hello, Mel.
Sorry for late response.

Yes, GFP_XXX can consist of multiple __GFP_XXX.
But, __GFP_XXX doesn't consist of multiple __GFP_YYY.
Your __GFP_WAIT seems to be a first one.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/