[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230109172500.bd4z2incticapm7x@quack3>
Date: Mon, 9 Jan 2023 18:25:00 +0100
From: Jan Kara <jack@...e.cz>
To: David Howells <dhowells@...hat.com>
Cc: Jens Axboe <axboe@...nel.dk>, Al Viro <viro@...iv.linux.org.uk>,
Christoph Hellwig <hch@....de>,
Matthew Wilcox <willy@...radead.org>,
Logan Gunthorpe <logang@...tatee.com>,
Christoph Hellwig <hch@...radead.org>,
Jeff Layton <jlayton@...nel.org>,
linux-fsdevel@...r.kernel.org, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 7/7] iov_iter, block: Make bio structs pin pages
rather than ref'ing if appropriate
On Mon 09-01-23 09:43:24, David Howells wrote:
> Jens Axboe <axboe@...nel.dk> wrote:
>
> > > A field, bi_cleanup_mode, is added to the bio struct that gets set by
> > > iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> > > necessary. FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page(). Other
> > > flags could also be used in future.
> > >
> > > Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> > > indicate that attached pages are ref'd by default. Cloning sets it to 0.
> > > __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> > > indicates.
> >
> > What's the motivation for this change?
>
> DIO reads in most filesystems and, I think, the block layer are currently
> broken with respect to concurrent fork in the same process because they take
> refs (FOLL_GET) on the pages involved which causes the CoW mechanism to
> malfunction, leading (I think) the parent process to not see the result of the
> DIO. IIRC, the pages undergoing DIO get forcibly copied by fork - and the
> copies given to the parent. Instead, DIO reads should be pinning the pages
> (FOLL_PIN). Maybe Willy can weigh in on this?
>
> Further, getting refs on pages in, say, a KVEC iterator is the wrong
> thing to do as the kvec may point to things that shouldn't be ref'd
> (vmap'd or vmalloc'd regions, for example). Instead, the in-kernel
> caller should do what it needs to do to keep hold of the memory and the
> DIO should not take a ref at all.
Yes, plus there is also a problem if user sets up a DIO read into a buffer
backed by memory mapped file, then these mapped pages can be cleaned by
writeback while the DIO read is running causing checksum failures or
DIF/DIX failures. Also once the writeback is done, the filesystem currently
thinks it controls all paths modifying page data and thus can happily go on
deduplicating file blocks or do similar stuff although pages are
concurrently modified by DIO read possibly causing data corruption. See [1]
for more details why filesystems have a problem with this. So filesystems
really need DIO reads to use FOLL_PIN instead of FOLL_GET and consequently
we need to pass information to bio completion function how page references
should be dropped.
Honza
[1] https://lore.kernel.org/all/20180103100430.GE4911@quack2.suse.cz/
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists