Message-ID: <20210209220056.GD103365@xz-x1>
Date: Tue, 9 Feb 2021 17:00:56 -0500
From: Peter Xu <peterx@...hat.com>
To: Mike Kravetz <mike.kravetz@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Jerome Glisse <jglisse@...hat.com>,
"Kirill A . Shutemov" <kirill@...temov.name>,
Hugh Dickins <hughd@...gle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Nadav Amit <nadav.amit@...il.com>
Subject: Re: [PATCH RFC 00/30] userfaultfd-wp: Support shmem and hugetlbfs
On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Huge & Mike,
> >>>
> >>> Would any of you have comment/concerns on the high-level design of this series?
> >>>
> >>> It would be great to know, especially any major objections, before moving
> >>> on to a non-RFC version.
> >>
> >> My apologies for not looking at this sooner. Even now, I have only taken
> >> a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> >> hugetlbfs. As you can imagine, there is some overlap in handling of wp ptes
> >> set for soft dirty. In addition, pmd sharing must be disabled for soft dirty
> >> as here and in Axel's uffd minor fault code.
> >
> > Interesting to know that we'll reach and need something common from different
> > directions, especially when they all mostly happen at the same time. :)
> >
> > Is there a real "BUG" that you mentioned? I'd be glad to read about it if
> > there is a link or something.
> >
>
> Sorry, I was referring to a bugzilla bug, not a BUG(). Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added. A thread
> mostly from me is at:
> lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@...cle.com
> I am close to sending out a RFC, but keep getting distracted.
Thanks. Indeed I see no reason not to have hugetlb supported for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.
>
> >> No objections to the overall approach based on my quick look.
> >
> > Thanks for having a look.
> >
> > So for hugetlb one major thing is indeed about the pmd sharing part, which
> > seems that we've got very good consensus on.
>
> Yes
>
> > The other thing that I'd love to get some comment would be a shared topic with
> > shmem in that: for a file-backed memory type, uffd-wp needs a consolidated way
> > to record wr-protect information even if the pgtable entries were flushed.
> > That comes from a fundamental difference between anonymous and file-backed
> > memory in that anonymous pages keep all info in the pgtable entry, but
> > file-backed memory is not, e.g., pgtable entries can be dropped at any time as
> > long as page cache is there.
>
> Sorry, but I can not recall this difference for hugetlb pages. What operations
> lead to flushing of pagetable entries? It would need to be something other
> than unmap as it seems we want to lose the information in unmap IIUC.
For hugetlbfs I know of two cases.
One is exactly the huge pmd sharing mentioned above, where we drop the
pgtable entries for a specific process while the page cache still exists.
The other is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() is called
before remove_inode_hugepages(). For uffd-wp, there is a very small window
in which a wr-protected huge page can be written after the pgtable entry is
flushed but before the page is finally dropped in remove_inode_hugepages().
In some apps that could cause data loss.
>
> > I then went to look at soft-dirty regarding this issue, and there's actually a
> > paragraph about it:
> >
> > While in most cases tracking memory changes by #PF-s is more than enough
> > there is still a scenario when we can lose soft dirty bits -- a task
> > unmaps a previously mapped memory region and then maps a new one at
> > exactly the same place. When unmap is called, the kernel internally
> > clears PTE values including soft dirty bits. To notify user space
> > application about such memory region renewal the kernel always marks
> > new memory regions (and expanded regions) as soft dirty.
> >
> > I feel like it just means soft-dirty currently allows false positives: we could
> > have set the soft dirty bit even if the page is clean. And that's what this
> > series wanted to avoid: it uses a new concept called "swap special pte" to
> > persist that information even for file-backed memory. That all goes toward
> > avoiding those false positives.
>
> Yes, I have seen this with soft dirty. It really does not seem right. When
> you first create a mapping, even before faulting in anything the vma is marked
> VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.
Right, that seems not optimal. It is understandable, though: dirty info is
tolerant of false positives, so soft-dirty could sidestep the very issue that
uffd-wp wants to solve in this series. It would be great to know whether the
current approach in this series works for removing those false positives.
>
> To be honest, I am not sure you want to try and carry per-process/per-mapping
> wp information in the file.
What this series does is try to persist that information in the pgtable
entries, rather than in the file (or page cache). Frankly I can't say whether
that's optimal either, so I'm always open to comments. So far I think it's a
valid solution, but it's always possible that I missed something important.
> In the comment about soft dirty above, it seems
> reasonable that unmapping would clear all soft dirty information. Also,
> unmapping would clear any uffd state/information.
Right, unmap should always mean "drop all information in the ptes". It's in
the patch below that we tried to treat it differently:
https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2
A quick summary of that patch: only when we unmap or truncate the hugetlbfs
file do we call hugetlb_vmdelete_list() with ZAP_FLAG_DROP_FILE_UFFD_WP
(which means we drop all the information, including the uffd-wp bit).
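To make that summary concrete, a rough pseudocode sketch of the zap-time
decision (flag name from the commit linked above; control flow simplified
and not the literal kernel code):

```
zap_hugetlb_range(vma, range, details):
    for each pte in range:
        if pte is the uffd-wp "swap special pte":
            if details.zap_flags has ZAP_FLAG_DROP_FILE_UFFD_WP:
                clear pte      # unmap/truncate: wp info goes with the file
            else:
                keep pte       # plain pgtable zap: wp info persists
        else:
            zap pte as usual
```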
Thanks,
--
Peter Xu