Message-ID: <20210526100027.GA30369@quack2.suse.cz>
Date: Wed, 26 May 2021 12:00:27 +0200
From: Jan Kara <jack@...e.cz>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Jan Kara <jack@...e.cz>, linux-fsdevel@...r.kernel.org,
Christoph Hellwig <hch@...radead.org>,
Dave Chinner <david@...morbit.com>, ceph-devel@...r.kernel.org,
Chao Yu <yuchao0@...wei.com>,
Damien Le Moal <damien.lemoal@....com>,
"Darrick J. Wong" <darrick.wong@...cle.com>,
Jaegeuk Kim <jaegeuk@...nel.org>,
Jeff Layton <jlayton@...nel.org>,
Johannes Thumshirn <jth@...nel.org>,
linux-cifs@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-f2fs-devel@...ts.sourceforge.net, linux-mm@...ck.org,
linux-xfs@...r.kernel.org, Miklos Szeredi <miklos@...redi.hu>,
Steve French <sfrench@...ba.org>, Ted Tso <tytso@....edu>,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH 03/13] mm: Protect operations adding pages to page cache
with invalidate_lock
On Tue 25-05-21 14:01:49, Darrick J. Wong wrote:
> On Tue, May 25, 2021 at 03:50:40PM +0200, Jan Kara wrote:
> > Currently, serializing operations such as page fault, read, or readahead
> > against hole punching is rather difficult. The basic race scheme is
> > like:
> >
> > fallocate(FALLOC_FL_PUNCH_HOLE)              read / fault / ..
> >   truncate_inode_pages_range()
> >                                                <create pages in page
> >                                                 cache here>
> >   <update fs block mapping and free blocks>
> >
> > Now the problem is in this way read / page fault / readahead can
> > instantiate pages in page cache with potentially stale data (if blocks
> > get quickly reused). Avoiding this race is not simple - page locks do
> > not work because we want to make sure there are *no* pages in given
> > range. inode->i_rwsem does not work because page fault happens under
> > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> > the performance for mixed read-write workloads suffer.
> >
> > So create a new rw_semaphore in the address_space - invalidate_lock -
> > that protects adding of pages to page cache for page faults / reads /
> > readahead.
> >
> > Signed-off-by: Jan Kara <jack@...e.cz>
> > ---
> >  Documentation/filesystems/locking.rst | 64 ++++++++++++++++++--------
> >  fs/inode.c                            |  2 +
> >  include/linux/fs.h                    |  6 +++
> >  mm/filemap.c                          | 65 ++++++++++++++++++++++-----
> >  mm/readahead.c                        |  2 +
> >  mm/rmap.c                             | 37 +++++++--------
> >  mm/truncate.c                         |  3 +-
> >  7 files changed, 129 insertions(+), 50 deletions(-)
> >
> > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> > index 4ed2b22bd0a8..af425bef55d3 100644
> > --- a/Documentation/filesystems/locking.rst
> > +++ b/Documentation/filesystems/locking.rst
> > @@ -271,19 +271,19 @@ prototypes::
> > locking rules:
> > All except set_page_dirty and freepage may block
> >
> > -======================  ======================== =========
> > -ops                     PageLocked(page)         i_rwsem
> > -======================  ======================== =========
> > +======================  ======================== =========  ===============
> > +ops                     PageLocked(page)         i_rwsem    invalidate_lock
> > +======================  ======================== =========  ===============
> >  writepage:              yes, unlocks (see below)
> > -readpage:               yes, unlocks
> > +readpage:               yes, unlocks                        shared
> >  writepages:
> >  set_page_dirty          no
> > -readahead:              yes, unlocks
> > -readpages:              no
> > +readahead:              yes, unlocks                        shared
> > +readpages:              no                                  shared
> >  write_begin:            locks the page           exclusive
> >  write_end:              yes, unlocks             exclusive
> >  bmap:
> > -invalidatepage:         yes
> > +invalidatepage:         yes                                 exclusive
> >  releasepage:            yes
> >  freepage:               yes
> >  direct_IO:
> > @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
> > ->invalidatepage() is called when the filesystem must attempt to drop
> > some or all of the buffers from the page when it is being truncated. It
> > returns zero on success. If ->invalidatepage is zero, the kernel uses
> > -block_invalidatepage() instead.
> > +block_invalidatepage() instead. The filesystem should exclusively acquire
>
> s/should/must/ ? It's not really optional to lock out invalidations
> anymore now that the page cache synchronizes on invalidate_lock, right?

Right, updated.

> > +invalidate_lock before invalidating page cache in truncate / hole punch path
> > +(and thus calling into ->invalidatepage) to block races between page cache
> > +invalidation and page cache filling functions (fault, read, ...).
> >
> > ->releasepage() is called when the kernel is about to try to drop the
> > buffers from the page in preparation for freeing it. It returns zero to
> > @@ -573,6 +576,27 @@ in sys_read() and friends.
> > the lease within the individual filesystem to record the result of the
> > operation
> >
> > +->fallocate implementation must be really careful to maintain page cache
> > +consistency when punching holes or performing other operations that invalidate
> > +page cache contents. Usually the filesystem needs to call
> > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > +However the filesystem usually also needs to update its internal (and on disk)
> > +view of file offset -> disk block mapping. Until this update is finished, the
> > +filesystem needs to block page faults and reads from reloading now-stale page
> > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > +and acquires it in shared mode in paths loading pages from disk
> > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > +responsible for taking this lock in its fallocate implementation and generally
> > +whenever the page cache contents needs to be invalidated because a block is
> > +moving from under a page.
> > +
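
For illustration only, the ordering described in the paragraph above can be
replayed in a small userspace toy model. This is a sketch, not kernel code:
every toy_* name, struct toy_file, and the try-lock behaviour (a real reader
would sleep on the rw_semaphore rather than give up) are assumptions made up
for the example. Only invalidate_lock and truncate_inode_pages_range() come
from the patch, and they appear here just as labels.

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 4	/* toy file: four blocks, one page each */

/* Toy stand-in for a rw_semaphore; this is NOT the kernel API. */
struct toy_rwsem {
	int readers;
	bool writer;
};

static bool toy_down_read_trylock(struct toy_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
	s->readers--;
}

static bool toy_down_write_trylock(struct toy_rwsem *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *s)
{
	s->writer = false;
}

struct toy_file {
	struct toy_rwsem invalidate_lock;
	char disk[NBLOCKS];	/* data stored in each block */
	bool mapped[NBLOCKS];	/* offset -> block mapping still exists */
	bool cached[NBLOCKS];	/* page present in the page cache */
	char cache[NBLOCKS];	/* contents of the cached page */
};

/* Fill path (read / fault / readahead): takes the lock shared. */
static bool toy_fill_page(struct toy_file *f, int idx, bool use_lock)
{
	/* A real reader would sleep on the lock; the toy just gives up. */
	if (use_lock && !toy_down_read_trylock(&f->invalidate_lock))
		return false;
	if (!f->cached[idx]) {
		f->cache[idx] = f->mapped[idx] ? f->disk[idx] : 0;
		f->cached[idx] = true;
	}
	if (use_lock)
		toy_up_read(&f->invalidate_lock);
	return true;
}

/* First half of hole punch: take the lock exclusive, drop the pages. */
static void toy_punch_begin(struct toy_file *f, int idx, bool use_lock)
{
	if (use_lock) {
		bool ok = toy_down_write_trylock(&f->invalidate_lock);

		assert(ok);
		(void)ok;
	}
	f->cached[idx] = false;	/* models truncate_inode_pages_range() */
}

/* Second half: update the block mapping, then release the lock. */
static void toy_punch_end(struct toy_file *f, int idx, bool use_lock)
{
	f->mapped[idx] = false;	/* models mapping update + block free */
	if (use_lock)
		toy_up_write(&f->invalidate_lock);
}

/* Replay the racy interleaving; return the cached byte for block 1. */
static char toy_replay(bool use_lock)
{
	struct toy_file f = {
		.disk = { 'A', 'A', 'A', 'A' },
		.mapped = { true, true, true, true },
	};

	toy_punch_begin(&f, 1, use_lock);
	toy_fill_page(&f, 1, use_lock);	/* racing read hits the window */
	toy_punch_end(&f, 1, use_lock);
	toy_fill_page(&f, 1, use_lock);	/* later read of the hole */
	return f.cache[1];
}
```

In this model toy_replay(false) leaves the old byte 'A' cached over the
punched block, while toy_replay(true) ends with 0 in the cache because the
racing fill is refused until the mapping update has finished.
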
> > +->copy_file_range and ->remap_file_range implementations need to serialize
> > +against modifications of file data while the operation is running. For
> > +blocking changes through write(2) and similar operations inode->i_rwsem can be
> > +used. For blocking changes through memory mapping, the filesystem can use
> > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > +implementation.
>
> Once this patch lands, will there be any filesystems that can skip
> taking invalidate_lock in ->page_mkwrite and not have problems? Now
> that the address_space has an invalidation lock, everyone is strongly
> incentivized to use it unless they have yet another layer of locks to do
> more or less the same thing, right?

Well, I assume btrfs will want to keep its special extent tree locking, so
strictly speaking invalidate_lock is not necessary for it. Also,
filesystems supporting only read, write, mmap, and truncate (such as udf,
reiserfs, ...) do not really need invalidate_lock (in fact, they usually
don't bother with any page_mkwrite helper). So there are going to be
exceptions. I want to add invalidate_lock locking around truncate handling
for these filesystems as well, to make the locking rules simpler and to be
able to add assertions into VFS helpers. I didn't plan to do this for
.page_mkwrite, as there it might actually hurt performance noticeably.
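
As a rough illustration of what an assertion in a truncate helper could look
like, here is a userspace toy sketch. It is not the kernel implementation:
struct toy_mapping, toy_truncate_pages() and the -1 return value (standing in
for a fired assertion) are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a rw_semaphore; this is NOT the kernel API. */
struct toy_rwsem {
	int readers;
	bool writer;
};

/* Toy stand-in for an address_space with its invalidate_lock. */
struct toy_mapping {
	struct toy_rwsem invalidate_lock;
	int nr_pages;
};

/* Hypothetical check: is the lock held in exclusive mode? */
static bool toy_held_write(const struct toy_rwsem *s)
{
	return s->writer;
}

/*
 * Hypothetical truncate helper. It refuses to drop pages unless the
 * caller holds invalidate_lock exclusively; -1 marks the spot where a
 * kernel helper would trigger its assertion instead.
 */
static int toy_truncate_pages(struct toy_mapping *m)
{
	if (!toy_held_write(&m->invalidate_lock))
		return -1;
	m->nr_pages = 0;
	return 0;
}
```

A check like this only works if every filesystem takes the lock around
truncate, which is why the simple filesystems would need the locking too.
Nothing comparable is sketched for .page_mkwrite, matching the plan above.
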

								Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR