linux-ext4 - Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YhlkcYjozFmt3Kl4@mit.edu>
Date:   Fri, 25 Feb 2022 18:21:21 -0500
From:   "Theodore Ts'o" <tytso@....edu>
To:     John Hubbard <jhubbard@...dia.com>
Cc:     Eric Biggers <ebiggers@...nel.org>,
        Lee Jones <lee.jones@...aro.org>, linux-ext4@...r.kernel.org,
        Christoph Hellwig <hch@....de>,
        Dave Chinner <dchinner@...hat.com>,
        Goldwyn Rodrigues <rgoldwyn@...e.com>,
        "Darrick J . Wong" <darrick.wong@...cle.com>,
        Bob Peterson <rpeterso@...hat.com>,
        Damien Le Moal <damien.lemoal@....com>,
        Andreas Gruenbacher <agruenba@...hat.com>,
        Ritesh Harjani <riteshh@...ux.ibm.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Johannes Thumshirn <jth@...nel.org>, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, cluster-devel@...hat.com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages
 without asking ext4 first

On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> On 2/25/22 13:23, Theodore Ts'o wrote:
> > [un]pin_user_pages_remote is dirtying pages without properly warning
> > the file system in advance.  This was noted by Jan Kara in 2018[1] and
> 
> In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> was actually that dio_bio_complete() was calling set_page_dirty_lock()
> on pages that were not (any longer) set up for that.

Fair enough, there are two problems that are getting conflated here,
and that's my bad.  The problem which Jan pointed out is one where the
Direct I/O read path triggered a page fault, so page_mkwrite() was
actually called.  So in this case, the file system was actually
notified, and the page was marked dirty after the file system was
notified.  But then the DIO read was racing with the page cleaner,
which would call writepage(), and then clear the page, and then remove
the buffer_heads.  Then dio_bio_complete() would call set_page_dirty()
a second time, and that's what would trigger the BUG.

But in the syzbot reproducer, it's a different problem.  In this case,
process_vm_writev() calling [un]pin_user_pages_remote(), and
page_mkwrite() is never getting called.  So there is no need to race
with the page cleaner, and so the BUG triggers much more reliably.

> > more recently has resulted in bug reports by Syzbot in various Android
> > kernels[2].
> > 
> > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
> 
> Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> call into mm/gup.c, but that merely refactored things. The callers are
> all over the kernel, and those callers are what need changing in order
> to fix this.

>From my perspective, the bug is calling set_page_dirty() without first
calling the file system's page_mkwrite().  This is necessary since the
file system needs to allocate file system data blocks in preparation
for a future writeback.

Now, calling page_mkwrite() by itself is not enough, since the moment
you make the page dirty, the page cleaner could go ahead and call
writepage() behind your back and clean it.  In actual practice, with a
Direct I/O read request racing with writeback, this is race was quite
hard to hit, because the that would imply that the background
writepage() call would have to complete ahead of the synchronous read
request, and the block layer generally prioritizes synchronous reads
ahead of background write requests.  So in practice, this race was
***very*** hard to hit.  Jan may have reported it in 2018, but I don't
think I've ever seen it happen myself.

For process_vm_writev() this is a case where user pages are pinned and
then released in short order, so I suspect that race with the page
cleaner would also be very hard to hit.  But we could completely
remove the potential for the race, and also make things kinder for
f2fs and btrfs's compressed file write support, by making things work
much like the write(2) system call.  Imagine if we had a
"pin_user_pages_local()" which calls write_begin(), and a
"unpin_user_pages_local()" which calls write_end(), and the
presumption with the "[un]pin_user_pages_local" API is that you don't
hold the pinned pages for very long --- say, not across a system call
boundary, and then it would work the same way the write(2) system call
works does except that in the case of process_vm_writev(2) the pages
are identified by another process's address space where they happen to
be mapped.

This obviously doesn't work when pinning pages for remote DMA, because
in that case the time between pin_user_pages_remote() and
unpin_user_pages_remote() could be a long, long time, so that means we
can't use using write_begin/write_end; we'd need to call page_mkwrite()
when the pages are first pinned and then somehow prevent the page
cleaner from touching a dirty page which is pinned for use by the
remote DMA.

Does that make sense?

							- Ted