lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 29 Jul 2019 01:07:34 +0000
From:   Damien Le Moal <Damien.LeMoal@....com>
To:     Dave Chinner <david@...morbit.com>
CC:     "Theodore Y. Ts'o" <tytso@....edu>,
        Christoph Hellwig <hch@...radead.org>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        Christoph Hellwig <hch@....de>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        Naohiro Aota <Naohiro.Aota@....com>,
        Masato Suzuki <masato.suzuki@....com>
Subject: Re: [PATCH] ext4: Fix deadlock on page reclaim

On 2019/07/29 8:42, Dave Chinner wrote:
> On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this....
>>>
>>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> Right, I'm talking about using memalloc_nofs_save() as a huge hammer
> in the pread/pwrite() calling context, not the bio submission
> context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
> below).

Yes, I understood your point. And I agree that it indeed would be a big hammer.
We should be able to do better than that :)

>> One simple hack would be an fcntl() or mount option to tell the FS to use
>> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
>> applications or system setup is correct. So not so safe.
> 
> Wasn't there discussion at some point in the past about an interface
> for special processes to be able to mark themselves as PF_MEMALLOC
> (some kind of prctl, I think) for things like FUSE daemons? That
> would prevent direct reclaim recursion for these userspace daemons
> that are in the kernel memory reclaim IO path. It's the same
> situation there, isn't it? How does fuse deal with this problem?

I do not recall such discussion. But indeed FUSE may give some hints. Good idea.
Thanks. I will check.


-- 
Damien Le Moal
Western Digital Research

Powered by blists - more mailing lists