linux-kernel - Re: [PATCH 1/2] mm/filemap: don't initiate writeback if mapping has no dirty pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <c28d4243-aeb9-901a-46e9-bfe2e704cd8f@yandex-team.ru>
Date:   Tue, 30 Jul 2019 21:15:22 +0300
From:   Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:     Jan Kara <jack@...e.cz>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
        Jens Axboe <axboe@...nel.dk>,
        Johannes Weiner <hannes@...xchg.org>,
        linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/2] mm/filemap: don't initiate writeback if mapping has
 no dirty pages



On 30.07.2019 18:48, Jan Kara wrote:
> On Tue 30-07-19 17:57:18, Konstantin Khlebnikov wrote:
>> On 30.07.2019 17:14, Jan Kara wrote:
>>> On Tue 23-07-19 11:16:51, Konstantin Khlebnikov wrote:
>>>> On 23.07.2019 3:52, Andrew Morton wrote:
>>>>>
>>>>> (cc linux-fsdevel and Jan)
>>>
>>> Thanks for CC Andrew.
>>>
>>>>> On Mon, 22 Jul 2019 12:36:08 +0300 Konstantin Khlebnikov <khlebnikov@...dex-team.ru> wrote:
>>>>>
>>>>>> Functions like filemap_write_and_wait_range() should do nothing if inode
>>>>>> has no dirty pages or pages currently under writeback. But they anyway
>>>>>> construct struct writeback_control and this does some atomic operations
>>>>>> if CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and
>>>>>> updates state of writeback ownership, on slow path might be more work.
>>>>>> Current this path is safely avoided only when inode mapping has no pages.
>>>>>>
>>>>>> For example generic_file_read_iter() calls filemap_write_and_wait_range()
>>>>>> at each O_DIRECT read - pretty hot path.
>>>
>>> Yes, but in common case mapping_needs_writeback() is false for files you do
>>> direct IO to (exactly the case with no pages in the mapping). So you
>>> shouldn't see the overhead at all. So which case you really care about?
>>>
>>>>>> This patch skips starting new writeback if mapping has no dirty tags set.
>>>>>> If writeback is already in progress filemap_write_and_wait_range() will
>>>>>> wait for it.
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> --- a/mm/filemap.c
>>>>>> +++ b/mm/filemap.c
>>>>>> @@ -408,7 +408,8 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>>>>>>     		.range_end = end,
>>>>>>     	};
>>>>>> -	if (!mapping_cap_writeback_dirty(mapping))
>>>>>> +	if (!mapping_cap_writeback_dirty(mapping) ||
>>>>>> +	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
>>>>>>     		return 0;
>>>>>>     	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
>>>>>
>>>>> How does this play with tagged_writepages?  We assume that no tagging
>>>>> has been performed by any __filemap_fdatawrite_range() caller?
>>>>>
>>>>
>>>> Checking also PAGECACHE_TAG_TOWRITE is cheap but seems redundant.
>>>>
>>>> To-write tags are supposed to be a subset of dirty tags:
>>>> to-write is set only when dirty is set and cleared after starting writeback.
>>>>
>>>> Special case set_page_writeback_keepwrite() which does not clear to-write
>>>> should be for dirty page thus dirty tag is not going to be cleared either.
>>>> Ext4 calls it after redirty_page_for_writepage()
>>>> XFS even without clear_page_dirty_for_io()
>>>>
>>>> Anyway to-write tag without dirty tag or at clear page is confusing.
>>>
>>> Yeah, TOWRITE tag is intended to be internal to writepages logic so your
>>> patch is fine in that regard. Overall the patch looks good to me so I'm
>>> just wondering a bit about the motivation...
>>
>> In our case file mixes cached pages and O_DIRECT read. Kind of database
>> were index header is memory mapped while the rest data read via O_DIRECT.
>> I suppose for sharing index between multiple instances.
> 
> OK, that has always been a bit problematic but you're not the first one to
> have such design ;). So feel free to add:
> 
> Reviewed-by: Jan Kara <jack@...e.cz>
> 
> to your patch.

Thanks.

O_DIRECT has long history of misunderstandings =)
It looks some cases are still not documented.
My favourite: O_DIRECT write into hole goes into cache, at least for ext4.

> 
>> On this path we also hit this bug:
>> https://lore.kernel.org/lkml/156355839560.2063.5265687291430814589.stgit@buzz/
>> so that's why I've started looking into this code.
> 
> I see. OK.
> 
> 								Honza
>