[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BYAPR04MB5816C422BCCC0ACBC82E731EE7DD0@BYAPR04MB5816.namprd04.prod.outlook.com>
Date: Mon, 29 Jul 2019 01:07:34 +0000
From: Damien Le Moal <Damien.LeMoal@....com>
To: Dave Chinner <david@...morbit.com>
CC: "Theodore Y. Ts'o" <tytso@....edu>,
Christoph Hellwig <hch@...radead.org>,
Andreas Dilger <adilger.kernel@...ger.ca>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
Christoph Hellwig <hch@....de>,
Johannes Thumshirn <jthumshirn@...e.de>,
Naohiro Aota <Naohiro.Aota@....com>,
Masato Suzuki <masato.suzuki@....com>
Subject: Re: [PATCH] ext4: Fix deadlock on page reclaim
On 2019/07/29 8:42, Dave Chinner wrote:
> On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code? We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this....
>>>
>>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
>
> Right, I'm talking about using memalloc_nofs_save() as a huge hammer
> in the pread/pwrite() calling context, not the bio submission
> context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
> below).
Yes, I understood your point. And I agree that it indeed would be a big hammer.
We should be able to do better than that :)
>> One simple hack would be an fcntl() or mount option to tell the FS to use
>> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
>> applications or system setup is correct. So not so safe.
>
> Wasn't there discussion at some point in the past about an interface
> for special processes to be able to mark themselves as PF_MEMALLOC
> (some kind of prctl, I think) for things like FUSE daemons? That
> would prevent direct reclaim recursion for these userspace daemons
> that are in the kernel memory reclaim IO path. It's the same
> situation there, isn't it? How does fuse deal with this problem?
I do not recall such discussion. But indeed FUSE may give some hints. Good idea.
Thanks. I will check.
--
Damien Le Moal
Western Digital Research
Powered by blists - more mailing lists