linux-ext4 - Re: [PATCH] ext4: Fix deadlock on page reclaim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <3D2360FA-AD48-48AE-B1CE-D1CF58C1B8AB@dilger.ca>
Date:   Mon, 29 Jul 2019 12:40:44 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     Damien Le Moal <damien.lemoal@....com>
Cc:     "Theodore Y. Ts'o" <tytso@....edu>,
        Dave Chinner <david@...morbit.com>,
        Christoph Hellwig <hch@...radead.org>,
        "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        Christoph Hellwig <hch@....de>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        Naohiro Aota <Naohiro.Aota@....com>,
        Masato Suzuki <masato.suzuki@....com>
Subject: Re: [PATCH] ext4: Fix deadlock on page reclaim

On Jul 26, 2019, at 8:59 PM, Damien Le Moal <damien.lemoal@....com> wrote:
> 
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>> 
>>>> This looks like something that could hit every file systems, so
>>>> shouldn't we fix this in common code?  We could also look into
>>>> just using memalloc_nofs_save for the page cache allocation path
>>>> instead of the per-mapping gfp_mask.
>>> 
>>> I think it has to be the entire IO path - any allocation from the
>>> underlying filesystem could recurse into the top level filesystem
>>> and then deadlock if the memory reclaim submits IO or blocks on
>>> IO completion from the upper filesystem. That's a bloody big hammer
>>> for something that is only necessary when there are stacked
>>> filesystems like this....
>> 
>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>> sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process which executes them doing
> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
> context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
in the BIO so that the worker thread will also get the correct behaviour
when it is processing this IO submission.

> We tried a simpler setup using loopback mount (XFS used directly in an ext4
> file) and running the same workload. We failed to recreate a similar deadlock in
> this case, but I am strongly suspecting that it can happen too. It is simply
> much harder to hit because the IO path from XFS to ext4 is all in-kernel and
> asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs
> which makes it relatively easy to get inter-dependent writes or read+write
> queued back-to-back and create the deadlock.
> 
> So back to Dave's point, we may be needing the big-hammer solution in the case
> of stacked file systems, while a non-stack setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell if it is at the top of the
> stack (big hammer not needed) or lower than the top level (big hammer needed) ?
> 
> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
> applications or system setup is correct. So not so safe.

Cheers, Andreas






Download attachment "signature.asc" of type "application/pgp-signature" (874 bytes)