linux-kernel - Re: [PATCH] bcachefs: Switch to memalloc_flags

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZtcPjBeF9TjCe8Sl@tiehlicka>
Date: Tue, 3 Sep 2024 15:30:52 +0200
From: Michal Hocko <mhocko@...e.com>
To: Theodore Ts'o <tytso@....edu>
Cc: Yafang Shao <laoar.shao@...il.com>, Dave Chinner <david@...morbit.com>,
	Kent Overstreet <kent.overstreet@...ux.dev>,
	Matthew Wilcox <willy@...radead.org>, linux-fsdevel@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Dave Chinner <dchinner@...hat.com>
Subject: Re: [PATCH] bcachefs: Switch to memalloc_flags_do() for vmalloc
 allocations

On Tue 03-09-24 08:44:16, Theodore Ts'o wrote:
> On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote:
> >
> > When setting GFP_NOFAIL, it's important to not only enable direct
> > reclaim but also the OOM killer. In scenarios where swap is off and
> > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> > result in an infinite loop. In other words, GFP_NOFAIL should not be
> > used with GFP_NOFS. Unfortunately, many call sites do combine them.
> > For example:
> > 
> > XFS:
> > 
> > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
> > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL
> > 
> > EXT4:
> > 
> > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
> > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL
> > 
> > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
> > could provide further insight.
> 
> GFP_NOFS is needed because we need to signal to the mm layer to avoid
> recursing into file system layer --- for example, to clean a page by
> writing it back to the FS.  Since we may have taken various file
> system locks, recursing could lead to deadlock, which would make the
> system (and the user) sad.
> 
> If the mm layer wants to OOM kill a process, that should be fine as
> far as the file system is concerned --- this could reclaim anonymous
> pages that don't need to be written back, for example.  And we don't
> need to write back dirty pages before the process killed.  So I'm a
> bit puzzled why (as you imply; I haven't dug into the mm code in
> question) GFP_NOFS implies disabling the OOM killer?

Yes, because there might be a lot of fs pages pinned while performing
NOFS allocation and that could fire the OOM killer way too prematurely.
This has been quite some time ago since this was introduced but I do
remember workloads hitting that. Also there is usually kswapd making
sufficient progress to move forward. There are cases where kswapd is
completely stuck and other __GFP_FS allocations triggering full direct
reclaim or background kworkers freeing some memory and OOM killer
doesn't have good enough picture to make an educated guess the oom
killer is the only available way forward.

A typical example would be a workload that would care is trashing but
still making a slow progress which is acceptable which is acceptable
because the most important workload makes a decent progress (the working
set fits in or is mlocked) and rebuilding the state is more harmfull
than a slow IO.

-- 
Michal Hocko
SUSE Labs