Message-ID: <20200917182314.GU5449@casper.infradead.org>
Date:   Thu, 17 Sep 2020 19:23:14 +0100
From:   Matthew Wilcox <willy@...radead.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Michael Larabel <Michael@...haellarabel.com>,
        Matthieu Baerts <matthieu.baerts@...sares.net>,
        Amir Goldstein <amir73il@...il.com>,
        Ted Ts'o <tytso@...gle.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Jan Kara <jack@...e.cz>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: Kernel Benchmarking

On Thu, Sep 17, 2020 at 10:51:41AM -0700, Linus Torvalds wrote:
> It does strike me that if the main source of contention comes from
> that "we need to check that the mapping is still valid as we insert
> the page into the page tables", then the page lock really isn't the
> obvious lock to use.
> 
> It would be much more natural to use the mapping->i_mmap_rwsem, I feel.
> 
> Willy? Your "just check for uptodate without any lock" patch itself
> feels wrong. That's what we do for plain reads, but the difference is
> that a read is a one-time event and a race is fine: we get valid data,
> it's just that it's only valid *concurrently* with the truncate or
> hole-punching event (ie either all zeroes or old data is fine).
> 
> The reason faulting a page in is different from a read is that if you
> then map in a stale page, it might have had the correct contents at
> the time of the fault, but it will not have the correct contents going
> forward.
> 
> So a page-in requires fundamentally stronger locking than a read()
> does, because of how the page-in causes that "future lifetime" of the
> page, in ways a read() event does not.

Yes, I agree, mmap is granting future access to a page in a
way that read is not.  So we need to be sure that any concurrent
truncate/hole-punch/collapse-range/invalidate-for-directio/migration
(henceforth page-killer) doesn't allow a page that's about to be recycled
to be added to the page tables.

What I was going for was that every page-killer currently does something
like this:

        if (page_mapped(page))
                unmap_mapping_pages(mapping, page->index, nr, false);

so as long as we mark the page as being no-new-maps-allowed before
the page-killer checks page_mapped(), and the map-page path checks
that the page isn't no-new-maps-allowed after incrementing the mapcount
(ie after making page_mapped() return true), then we'll never see
something we shouldn't in the page tables -- either
it will show up in the page tables right before the page-killer calls
unmap_mapping_pages() (in which case you'll get the old contents), or
it won't show up in the page tables.

I'd actually want to wrap all that into:

static inline void page_kill_mappings(struct page *page)
{
	struct address_space *mapping = page->mapping;

	/* No new mappings may be created once Uptodate is clear */
	ClearPageUptodate(page);
	/* Order the Uptodate clear before the page_mapped() check */
	smp_mb__after_atomic();
	if (!page_mapped(page))
		return;
	unmap_mapping_pages(mapping, page->index, compound_nr(page), false);
}

but that's just syntax.
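
Just to illustrate the pairing (a sketch only -- page_map_allowed()
doesn't exist, the name is made up), the map-page side would be the
mirror image: bump the mapcount first, then re-check Uptodate, and
back out if a page-killer got in first:

/* Hypothetical fault-side counterpart; called with the pte lock held */
static inline bool page_map_allowed(struct page *page)
{
	/* Make page_mapped() true before looking at Uptodate */
	page_add_file_rmap(page, false);
	smp_mb__after_atomic();
	if (PageUptodate(page))
		return true;
	/* Raced with a page-killer; undo and let the caller retry */
	page_remove_rmap(page, false);
	return false;
}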

I'm pretty sure that patch I sent out doesn't handle page faults on
disappearing pages correctly; it needs to retry so it can instantiate
a new page in the page cache.  And as Jan pointed out, it didn't handle
the page migration case.  But that wasn't really the point of the patch.

> But truncation that does page cache removal already requires that
> i_mmap_rwsem, and in fact the VM already very much uses that (ie when
> walking the page mapping).
> 
> The other alternative might be just the mapping->private_lock. It's
> not a reader-writer lock, but if we don't need to sleep (and I don't
> think the final "check ->mapping" can sleep anyway since it has to be
> done together with the page table lock), a spinlock would be fine.

I'm not a huge fan of taking file-wide locks for something that has
a naturally finer granularity.  It tends to bite us when weirdos with
giant databases mmap the whole thing and then whine about contention on
the rwsem during page faults.

But you're right that unmap_mapping_range() already takes this lock for
removing pages from the page table, so it would be reasonable to take
it for read when adding pages to the page table.  Something like taking
the i_mmap_lock_read(file->f_mapping) in filemap_fault, then adding a
new VM_FAULT_I_MMAP_LOCKED bit so that do_read_fault() and friends add:

	if (ret & VM_FAULT_I_MMAP_LOCKED)
		i_mmap_unlock_read(vmf->vma->vm_file->f_mapping);
	else
		unlock_page(page);
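
To make that a bit more concrete -- purely a sketch, the helper name is
invented, VM_FAULT_I_MMAP_LOCKED is the new bit proposed above, and I'm
hand-waving the refcount/retry details -- the tail of filemap_fault()
could do something like:

static vm_fault_t filemap_fault_finish(struct vm_fault *vmf,
		struct page *page, vm_fault_t ret)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;

	i_mmap_lock_read(mapping);
	if (unlikely(page->mapping != mapping || !PageUptodate(page))) {
		/* Raced with a page-killer; back out and retry the fault */
		i_mmap_unlock_read(mapping);
		put_page(page);
		return VM_FAULT_RETRY;
	}
	vmf->page = page;
	/* Caller drops the rwsem instead of unlocking the page */
	return ret | VM_FAULT_I_MMAP_LOCKED;
}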

... want me to turn that into a real patch?
