linux-kernel - Re: [PATCH] bcachefs: Switch to memalloc_flags


Message-ID: <CALOAHbCssCSb7zF6VoKugFjAQcMACmOTtSCzd7n8oGfXdsxNsg@mail.gmail.com>
Date: Fri, 30 Aug 2024 17:14:28 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: Kent Overstreet <kent.overstreet@...ux.dev>, Michal Hocko <mhocko@...e.com>, 
	Matthew Wilcox <willy@...radead.org>, linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Dave Chinner <dchinner@...hat.com>
Subject: Re: [PATCH] bcachefs: Switch to memalloc_flags_do() for vmalloc allocations

On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@...morbit.com> wrote:
>
> On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > according to call path and allocation size, you still need to be
> > checking for failure - in the same way that you shouldn't be using
> > BUG_ON() if you cannot prove that the condition won't occur in real world
> > usage.
>
> We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> now. This was the default Irix kernel allocator behaviour (it had a
> forwards progress guarantee and would never fail allocation unless
> told it could do so). We've been using the same "guaranteed not to
> fail" semantics on Linux since the original port started 25 years
> ago via open-coded loops.
>
> IOWs, __GFP_NOFAIL semantics have been production tested for a
> couple of decades on Linux via XFS, and nobody here can argue that
> XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> as it is used by XFS is reliable and lives up to the "will not fail"
> guarantee that it is supposed to have.
>
> Fundamentally, __GFP_NOFAIL came about to replace the callers doing
>
>         do {
>                 p = kmalloc(size);
>         } while (!p);
>
> so that they blocked until memory allocation succeeded. The call
> sites do not check for failure, because -failure never occurs-.
>
> The MM devs want to have visibility of these allocations - they may
> not like them, but having __GFP_NOFAIL means it's trivial to audit
> all the allocations that use these semantics.  IOWs, __GFP_NOFAIL
> was created with an explicit guarantee that it -will not fail- for
> normal allocation contexts so it could replace all the open-coded
> will-not-fail allocation loops.
>
> Given this guarantee, we recently removed these historic allocation
> wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
> allocation call sites. There's nearly a hundred memory allocation
> locations in XFS that are tagged with __GFP_NOFAIL.
>
> If we're now going to have the "will not fail" guarantee taken away
> from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> it be used anywhere else that a "will not fail" guarantee is
> required.
>
> Put simply: __GFP_NOFAIL will be rendered completely useless if it
> can fail due to external scoped memory allocation contexts.  This
> will force us to revert all __GFP_NOFAIL allocations back to
> open-coded will-not-fail loops.
>
> This is not a step forwards for anyone.

Hello Dave,

I've noticed that XFS has increasingly replaced kmem_alloc() with
__GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
__GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
occurrences. In kmem_alloc(), there's an explicit
memalloc_retry_wait() to throttle the allocator under heavy memory
pressure, which aligns with your filesystem design. However, using
__GFP_NOFAIL removes this throttling mechanism, potentially causing
issues when the system is under heavy memory load. I'm concerned that
this shift might not be a beneficial trend.
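
To make the difference concrete, here is a rough userspace sketch of a
retry loop that throttles between failed attempts, which is the general
pattern described above. It is not the XFS code: struct fake_pressure,
try_alloc(), retry_wait() and alloc_retry_throttled() are hypothetical
names, malloc() stands in for the kernel allocator, and the wait is
modelled as a counter increment rather than a real sleep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of memory pressure: the next `failures_left`
 * allocation attempts fail, after which attempts succeed. */
struct fake_pressure {
    int failures_left;  /* attempts that will still fail */
    int waits;          /* throttling waits performed so far */
};

/* One allocation attempt; may fail, like an allocation without NOFAIL. */
static void *try_alloc(struct fake_pressure *fp, size_t size)
{
    if (fp->failures_left > 0) {
        fp->failures_left--;
        return NULL;
    }
    return malloc(size);
}

/* Stand-in for a throttling wait between retries. A real caller would
 * sleep here so that reclaim can make progress; this sketch only counts. */
static void retry_wait(struct fake_pressure *fp)
{
    fp->waits++;
}

/* Open-coded "will not fail" loop with throttling: retry until the
 * allocation succeeds, backing off after every failed attempt. */
static void *alloc_retry_throttled(struct fake_pressure *fp, size_t size)
{
    void *p;

    assert(fp != NULL && size > 0);
    for (;;) {
        p = try_alloc(fp, size);
        if (p)
            return p;
        retry_wait(fp);
    }
}
```

In this model the caller never sees NULL, and every failed attempt is
followed by exactly one wait. My concern is about what happens to that
wait step when the loop is replaced by a bare __GFP_NOFAIL allocation.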

We have been using XFS for our big data servers for years, and it has
consistently performed well with older kernels like 4.19.y. However,
after upgrading all our servers from 4.19.y to 6.1.y over the past two
years, we have frequently encountered livelock issues caused by memory
exhaustion. To mitigate this, we've had to limit the RSS of
applications, which isn't an ideal solution and represents a worrying
trend.

-- 
Regards
Yafang