Message-ID: <1b99e874-e9df-0b06-c856-edb94eca16dc@huaweicloud.com>
Date: Fri, 2 Aug 2024 19:13:11 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Dave Chinner <david@...morbit.com>
Cc: linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, djwong@...nel.org, hch@...radead.org,
brauner@...nel.org, jack@...e.cz, yi.zhang@...wei.com,
chengzhihao1@...wei.com, yukuai3@...wei.com
Subject: Re: [PATCH 5/6] iomap: drop unnecessary state_lock when setting ifs
uptodate bits
On 2024/8/2 14:29, Dave Chinner wrote:
> On Fri, Aug 02, 2024 at 10:57:41AM +0800, Zhang Yi wrote:
>> On 2024/8/2 8:05, Dave Chinner wrote:
>>> On Wed, Jul 31, 2024 at 05:13:04PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@...wei.com>
>>>>
>>>> Commit '1cea335d1db1 ("iomap: fix sub-page uptodate handling")' fixed a
>>>> race issue when submitting multiple read bios for a page that spans more
>>>> than one file system block by adding a spinlock (now named state_lock)
>>>> to keep the page uptodate state consistent. However, the race condition
>>>> only happened between the read I/O submitting and completing threads,
>>>
>>> when we do writeback on a folio that has multiple blocks on it we
>>> can submit multiple bios for that, too. Hence the write completions
>>> can race with each other and write submission, too.
>>>
>>> Yes, write bio submission and completion only need to update ifs
>>> accounting using an atomic operation, but the same race condition
>>> exists even though the folio is fully locked at the point of bio
>>> submission.
>>>
>>>
>>>> it's
>>>> sufficient to use page lock to protect other paths, e.g. buffered write
>>>                     ^^^^ folio
>>>> path.
>>>>
>>>> After large folio support, the spinlock could affect the buffered write
>>>> performance more, so dropping it could reduce some unnecessary locking
>>>> overhead.
>>>
>>> From the point of view of code that is simple to understand and
>>> maintain, I think this is a bad idea. The data structure is currently
>>> protected by the state lock in all situations, but this change now makes
>>> it protected by the state lock in one case and by the folio lock in a
>>> different case.
>>
>> Yeah, I agree that this is a side effect of this change. After this
>> patch, we have to be careful to distinguish between the two cases below,
>> B1 and B2, as Willy mentioned.
>>
>> B. If ifs_set_range_uptodate() is called from iomap_set_range_uptodate(),
>> either we know:
>> B1. The caller of iomap_set_range_uptodate() holds the folio lock, and this
>> is the only place that can call ifs_set_range_uptodate() for this folio
>> B2. The caller of iomap_set_range_uptodate() holds the state lock
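
(For reference, a rough sketch of the helper these cases refer to, based on
recent fs/iomap/buffered-io.c; the exact shape may differ by kernel version:

	static void iomap_set_range_uptodate(struct folio *folio, size_t off,
			size_t len)
	{
		struct iomap_folio_state *ifs = folio->private;
		unsigned long flags;

		if (ifs) {
			/* today the state_lock serialises every uptodate bitmap update */
			spin_lock_irqsave(&ifs->state_lock, flags);
			ifs_set_range_uptodate(folio, ifs, off, len);
			spin_unlock_irqrestore(&ifs->state_lock, flags);
		} else {
			folio_mark_uptodate(folio);
		}
	}

The patch under discussion would drop that lock/unlock when the caller
already holds the folio lock (case B1), leaving only the read completion
path taking the state_lock (case B2).)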
>
> Yes, I read that before I commented that I think it's a bad idea.
> And then provided a method where we don't need to care about this at
> all.
>>
>>>
>>> Making this change also misses the elephant in the room: the
>>> buffered write path still needs the ifs->state_lock to update the
>>> dirty bitmap. Hence we're effectively changing the serialisation
>>> mechanism for only one of the two ifs state bitmaps that the
>>> buffered write path has to update.
>>>
>>> Indeed, we can't get rid of the ifs->state_lock from the dirty range
>>> updates because iomap_dirty_folio() can be called without the folio
>>> being locked through folio_mark_dirty() calling the ->dirty_folio()
>>> aop.
>>>
>>
>> Sorry, I don't understand why folio_mark_dirty() could be called without
>> the folio lock (isn't that supposed to be a bug)? IIUC, all file-backed
>> folios must be locked before being marked dirty. Are there any exceptions
>> or am I missing something?
>
> Yes: reading the code I pointed you at.
>
> /**
> * folio_mark_dirty - Mark a folio as being modified.
> * @folio: The folio.
> *
> * The folio may not be truncated while this function is running.
> * Holding the folio lock is sufficient to prevent truncation, but some
> * callers cannot acquire a sleeping lock. These callers instead hold
> * the page table lock for a page table which contains at least one page
> * in this folio. Truncation will block on the page table lock as it
> * unmaps pages before removing the folio from its mapping.
> *
> * Return: True if the folio was newly dirtied, false if it was already dirty.
> */
>
> So, yes, ->dirty_folio() can indeed be called without the folio
> being locked and it is not a bug.
Ah, right, I missed the comment on this function. It means that there are
some special callers that hold the page table lock instead of the folio
lock; is that pte_alloc_map_lock?

I checked all the filesystem-related callers and didn't find any real
caller that marks a folio dirty without holding the folio lock in a way
that could affect the current filesystems using the iomap framework, so
it's just a potential possibility in the future, am I right?
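
(For context, a rough sketch of the path being discussed, based on recent
fs/iomap/buffered-io.c; details may vary by version. ->dirty_folio() updates
the dirty bitmap under the state_lock precisely because it cannot assume the
folio lock is held:

	static void ifs_set_range_dirty(struct folio *folio,
			struct iomap_folio_state *ifs, size_t off, size_t len)
	{
		struct inode *inode = folio->mapping->host;
		unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
		unsigned int first_blk = (off >> inode->i_blkbits);
		unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
		unsigned int nr_blks = last_blk - first_blk + 1;
		unsigned long flags;

		spin_lock_irqsave(&ifs->state_lock, flags);
		/* dirty bits sit after the uptodate bits in ifs->state */
		bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
		spin_unlock_irqrestore(&ifs->state_lock, flags);
	}

	bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
	{
		struct inode *inode = mapping->host;

		ifs_alloc(inode, folio, 0);
		iomap_set_range_dirty(folio, 0, folio_size(folio));
		return filemap_dirty_folio(mapping, folio);
	})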
>
> Hence we have to serialise ->dirty_folio against both
> __iomap_write_begin() dirtying the folio and iomap_writepage_map()
> clearing the dirty range.
>
Both __iomap_write_begin() and iomap_writepage_map() are under the folio
lock now (locked in iomap_get_folio() and writeback_get_folio()), so is
there anything special about this case?
> And that means we always need to take the ifs->state_lock in
> __iomap_write_begin() when we have an ifs attached to the folio.
> Hence it is a) not correct and b) makes no sense to try to do
> uptodate bitmap updates without it held...
>
>>> IOWs, getting rid of the state lock from the uptodate range
>>> changes does not actually get rid of it from the buffered IO path.
>>> We still have to take it to update the dirty range, and so there's
>>> an obvious way to optimise the state lock usage without changing any
>>> of the bitmap access serialisation behaviour, i.e. we combine the
>>> uptodate and dirty range updates in __iomap_write_end() into a
>>> single lock context such as:
>>>
>>> static void iomap_set_range_dirty_uptodate(struct folio *folio,
>>> 		size_t off, size_t len)
>>> {
>>> 	struct iomap_folio_state *ifs = folio->private;
>>> 	struct inode *inode;
>>> 	unsigned int blks_per_folio;
>>> 	unsigned int first_blk;
>>> 	unsigned int last_blk;
>>> 	unsigned int nr_blks;
>>> 	unsigned long flags;
>>>
>>> 	if (!ifs)
>>> 		return;
>>>
>>> 	inode = folio->mapping->host;
>>> 	blks_per_folio = i_blocks_per_folio(inode, folio);
>>> 	first_blk = (off >> inode->i_blkbits);
>>> 	last_blk = (off + len - 1) >> inode->i_blkbits;
>>> 	nr_blks = last_blk - first_blk + 1;
>>>
>>> 	spin_lock_irqsave(&ifs->state_lock, flags);
>>> 	/* uptodate bits live in the first half of ifs->state */
>>> 	bitmap_set(ifs->state, first_blk, nr_blks);
>>> 	/* dirty bits follow, offset by blks_per_folio */
>>> 	bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
>>> 	spin_unlock_irqrestore(&ifs->state_lock, flags);
>>> }
>>>
>>> This means we calculate the bitmap offsets only once, we take the
>>> state lock only once, and we don't do anything if there is no
>>> sub-folio state.
>>>
>>> If we then fix the __iomap_write_begin() code as Willy pointed out
>>> to elide the erroneous uptodate range update, then we end up only
>>> taking the state lock once per buffered write instead of 3 times per
>>> write.
>>>
>>> This patch only reduces it to twice per buffered write, so doing the
>>> above should provide even better performance without needing to
>>> change the underlying serialisation mechanism at all.
>>>
>>
>> Thanks for the suggestion. I've thought about this solution too, but I
>> didn't think we needed the state_lock when setting the ifs dirty bits,
>> since the folio lock should work, so I changed my mind and planned to
>> drop all the ifs state_lock usage in the write path (please see patch 6).
>> Please let me know if I'm wrong.
>
> Whether it works or not is irrelevant: it is badly designed code
> that you have proposed. We can achieve the same result without
> changing the locking rules for the bitmap data via a small amount of
> refactoring, and that is a much better solution than creating
> complex and subtle locking rules for the object.
>
> "But it works" doesn't mean the code is robust, maintainable code.
>
> So, good optimisation, but NACK in this form. Please rework it to
> only take the ifs->state_lock once for both bitmap updates in the
> __iomap_write_end() path.
>
OK, sure, this looks reasonable to me. I will revise it as you suggested
and check the performance gain again in my next iteration.
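
(Just to sketch the direction — this is a hypothetical illustration, not the
actual follow-up patch, and the real __iomap_write_end() signature may
differ — the write-end path could then set both bitmaps with a single
state_lock round trip:

	/* Hypothetical sketch of the rework, not the posted patch. */
	static bool __iomap_write_end(struct inode *inode, loff_t pos,
			size_t len, size_t copied, struct folio *folio)
	{
		flush_dcache_folio(folio);

		/*
		 * A short copy into a non-uptodate folio still needs the
		 * existing retry handling in the caller.
		 */
		if (unlikely(copied < len && !folio_test_uptodate(folio)))
			return false;

		/*
		 * One state_lock acquisition covers both the uptodate and
		 * the dirty bitmap updates.
		 */
		iomap_set_range_dirty_uptodate(folio,
				offset_in_folio(folio, pos), copied);
		filemap_dirty_folio(inode->i_mapping, folio);
		return true;
	})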
Thanks,
Yi.