linux-kernel - Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY

Message-ID: <20160617213931.GA13688@cmpxchg.org>
Date:	Fri, 17 Jun 2016 17:39:31 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	Vlastimil Babka <vbabka@...e.cz>
Cc:	Michal Hocko <mhocko@...nel.org>,
	Dave Chinner <david@...morbit.com>, linux-mm@...ck.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD

On Fri, Jun 17, 2016 at 10:30:06PM +0200, Vlastimil Babka wrote:
> On 17.6.2016 20:22, Johannes Weiner wrote:
> > On Thu, Jun 16, 2016 at 01:26:06PM +0200, Michal Hocko wrote:
> >> @@ -54,6 +54,13 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >>  			lflags &= ~__GFP_FS;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Default page/slab allocator behavior is to retry for ever
> >> +	 * for small allocations. We can override this behavior by using
> >> +	 * __GFP_RETRY_HARD which will tell the allocator to retry as long
> >> +	 * as it is feasible but rather fail than retry for ever for all
> >> +	 * request sizes.
> >> +	 */
> >>  	if (flags & KM_MAYFAIL)
> >>  		lflags |= __GFP_RETRY_HARD;
> > 
> > I think this example shows that __GFP_RETRY_HARD is not a good flag
> > because it conflates two seemingly unrelated semantics; the comment
> > doesn't quite make up for that.
> > 
> > When the flag is set,
> > 
> > - it allows costly orders to invoke the OOM killer and retry
> 
> No, it's not allowing the OOM killer for costly orders, only non-costly, AFAIK.
> Mainly it allows more aggressive compaction (especially after my series [1]).

Ah, you're right. It calls into the may_oom function but that skips
actual killing for costly orders.
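
For reference, that order check can be modelled in a few lines of
userspace C. This is only a sketch of the behavior described above, not
the kernel's code: sk_oom_kill_allowed() is a made-up name, the other
conditions of the may_oom path are left out, and the threshold assumes
PAGE_ALLOC_COSTLY_ORDER is 3.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same value as the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper, not a kernel function. It models only the
 * order check: costly orders never get as far as an actual OOM kill.
 */
static bool sk_oom_kill_allowed(unsigned int order)
{
	return order <= PAGE_ALLOC_COSTLY_ORDER;
}

/* Usage example: orders 0..3 may trigger a kill, order 4 and up may not. */
static void sk_oom_demo(void)
{
	assert(sk_oom_kill_allowed(0));
	assert(sk_oom_kill_allowed(PAGE_ALLOC_COSTLY_ORDER));
	assert(!sk_oom_kill_allowed(PAGE_ALLOC_COSTLY_ORDER + 1));
}
```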

> > - it allows !costly orders to fail
> > 
> > While 1. is obvious from the name, 2. is not. Even if we don't want
> > full-on fine-grained naming for every reclaim methodology and retry
> > behavior, those two things just shouldn't be tied together.
> 
> Well, if allocation is not allowed to fail, it's like trying "indefinitely hard"
> already. Telling it it should "try hard" then doesn't make any sense without
> also being able to fail.

I can see that argument, but it's really anything but obvious at the
callsite. Dave's response to Michal's patch was a good demonstration.
And I don't think adding comments fixes an unintuitive interface.

> > I don't see us failing !costly order per default anytime soon, and
> > they are common, so adding a __GFP_MAYFAIL to explicitly override
> > that behavior seems like a good idea to me. That would make the XFS
> > callsite here perfectly obvious.
> > 
> > And you can still combine it with __GFP_REPEAT.
> 
> But that would mean the following meaningful combinations for non-costly orders
> (assuming e.g. GFP_KERNEL which allows reclaim/compaction in the first place).

I would ignore order here. Part of what makes this interface
unintuitive is when we expect different flags to be passed for
different orders, especially because the orders are often
variable. Michal's __GFP_RETRY_HARD is an improvement in the sense
that it ignores the order and tries to do the right thing regardless
of it. The interface should really be about the intent at the
callsite, not about implementation details of the allocator.

But adding TRY_HARD to express "this can fail" isn't intuitive.

> __GFP_NORETRY - that one is well understood hopefully, and implicitly mayfail

Yeah. Never OOM, never retry etc. The callsite can fall back, and
prefers that over OOM kills and disruptive allocation latencies.

> __GFP_MAYFAIL - ???

May OOM for certain orders and retry a few times, but still fail. The
callsite can fall back, but it wouldn't come for free. E.g. it might
have to abort an explicitly requested user operation.

This is the default for costly orders, so it has an effect only on
non-costly orders. But that's where I would separate interface from
implementation: you'd use it e.g. in callsites where you have variable
orders but always the same fallback. XFS does that extensively.
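
As a rough illustration of "variable orders but always the same
fallback", here is a userspace sketch of such a callsite. Nothing in it
is XFS or kernel code: sk_alloc_or_fallback() and the two toy allocators
are invented for the example, and malloc() merely stands in for whatever
the real fallback would be. The point is only that the callsite looks
the same whatever the size, because the first attempt is allowed to
fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator signature used only by this sketch. */
typedef void *(*sk_alloc_fn)(size_t size);

/*
 * Try the primary allocator once. It is allowed to fail, so the caller
 * needs no size-dependent logic: any failure takes the same fallback.
 */
static void *sk_alloc_or_fallback(size_t size, sk_alloc_fn primary,
				  sk_alloc_fn fallback)
{
	void *p;

	assert(primary && fallback);
	p = primary(size);
	if (p)
		return p;
	return fallback(size);
}

/* Toy allocator that always fails, to exercise the fallback path. */
static void *sk_alloc_fails(size_t size)
{
	(void)size;
	return NULL;
}

/* Toy fallback: plain heap allocation. */
static void *sk_alloc_heap(size_t size)
{
	return malloc(size);
}
```

Calling sk_alloc_or_fallback(64, sk_alloc_fails, sk_alloc_heap) returns
heap memory via the fallback; passing sk_alloc_fails for both arguments
returns NULL, which the caller then has to handle.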

> __GFP_MAYFAIL | __GFP_REPEAT - ???
> 
> Which one of the last two tries harder? How specifically? Will they differ by
> (not) allowing OOM? Won't that be just extra confusing?

Adding __GFP_REPEAT would always be additive. This combination would
mean: try the hardest not to fail, but don't lock up in cases when the
order happens to be !costly.

Again, I'm not too thrilled about that flag as it's so damn vague. But
that's more about how we communicate latency/success expectations. My
concern is exclusively about its implication of MAYFAIL.
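
To make the proposed semantics concrete, here is a small userspace model
of the combinations discussed above. It is a sketch under assumptions,
not allocator code: the SK_* bits are illustrative values rather than
the kernel's, SK_MAYFAIL stands for the hypothetical __GFP_MAYFAIL, and
struct sk_policy is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same value as the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Illustrative bits only; these are not the kernel's flag values. */
#define SK_NORETRY 0x1u	/* models __GFP_NORETRY */
#define SK_MAYFAIL 0x2u	/* models the hypothetical __GFP_MAYFAIL */
#define SK_REPEAT  0x4u	/* models __GFP_REPEAT */

/* Hypothetical summary of what the caller has asked for. */
struct sk_policy {
	bool can_fail;		/* caller must handle a NULL return */
	bool retries;		/* allocator may loop at all */
	bool extra_effort;	/* REPEAT is purely additive */
};

static struct sk_policy sk_policy_for(unsigned int flags, unsigned int order)
{
	bool costly = order > PAGE_ALLOC_COSTLY_ORDER;
	struct sk_policy p;

	p.retries = !(flags & SK_NORETRY);
	p.extra_effort = p.retries && (flags & SK_REPEAT);
	/* Costly orders can already fail; !costly ones need an opt-in. */
	p.can_fail = costly || (flags & (SK_NORETRY | SK_MAYFAIL));
	return p;
}

/* Usage example: REPEAT alone never makes a !costly request failable. */
static void sk_policy_demo(void)
{
	assert(!sk_policy_for(SK_REPEAT, 0).can_fail);
	assert(sk_policy_for(SK_MAYFAIL | SK_REPEAT, 0).can_fail);
	assert(sk_policy_for(SK_MAYFAIL | SK_REPEAT, 0).extra_effort);
}
```

In this model the ability to fail and the amount of effort are
independent bits, which is the separation argued for above.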

> > For a generic allocation site like this, __GFP_MAYFAIL | __GFP_REPEAT
> > does the right thing for all orders, and it's self-explanatory: try
> > hard, allow falling back.
> > 
> > Whether we want a __GFP_REPEAT or __GFP_TRY_HARD at all is a different
> > topic. In the long term, it might be better to provide best-effort per
> > default and simply annotate MAYFAIL/NORETRY callsites that want to
> > give up earlier. Because as I mentioned at LSFMM, it's much easier to
> > identify callsites that have a convenient fallback than callsites that
> > need to "try harder." Everybody thinks their allocations are oh so
> > important. The former is much more specific and uses obvious criteria.
> 
> For higher-order allocations, best-effort might also mean significant system
> disruption, not just latency of the allocation itself. One example is hugetlbfs
> allocations (echo X > .../nr_hugepages) where the admin is willing to pay this
> cost. But to do that by default and rely on everyone else passing NORETRY
> wouldn't go far. So I think the TRY_HARD kind of flag makes sense.

I think whether the best-effort behavior should be opt-in or opt-out,
or how fine-grained the latency/success control over the allocator
should be is a different topic. I'd prefer defaulting to reliability
and annotating low-latency requirements, but I can see TRY_HARD work
too. It just shouldn't imply MAY_FAIL.