[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BYAPR04MB5816BD641DF55A93986DF826E7DC0@BYAPR04MB5816.namprd04.prod.outlook.com>
Date: Tue, 30 Jul 2019 02:06:33 +0000
From: Damien Le Moal <Damien.LeMoal@....com>
To: Andreas Dilger <adilger@...ger.ca>
CC: "Theodore Y. Ts'o" <tytso@....edu>,
Dave Chinner <david@...morbit.com>,
Christoph Hellwig <hch@...radead.org>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
Christoph Hellwig <hch@....de>,
Johannes Thumshirn <jthumshirn@...e.de>,
Naohiro Aota <Naohiro.Aota@....com>,
Masato Suzuki <masato.suzuki@....com>
Subject: Re: [PATCH] ext4: Fix deadlock on page reclaim
Andreas,
On 2019/07/30 3:40, Andreas Dilger wrote:
> On Jul 26, 2019, at 8:59 PM, Damien Le Moal <damien.lemoal@....com> wrote:
>>
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code? We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this....
>>>
>>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
>
> One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
> in the BIO so that the worker thread will also get the correct behaviour
> when it is processing this IO submission.
But there is no BIO to work with here: dm-zoned BIOs are transformed into an
application (tcmu-runner running the ZBC file handler) calls to pread()/pwrite()
into an ext4 file. And processing of these calls trigger the direct reclaim and
FS recursion which cause the deadlock. So BIOs are not the proper vehicle to
transport the information about the NOFS allocation.
If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the
cleanest approach.
Best regards.
--
Damien Le Moal
Western Digital Research
Powered by blists - more mailing lists