Date:   Mon, 9 May 2022 11:35:53 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Alistair Popple <apopple@...dia.com>
Subject: Re: [PATCH] mm: Avoid unnecessary page fault retries on shared
 memory types

Hi, Johannes,

On Fri, May 06, 2022 at 04:27:06PM -0400, Johannes Weiner wrote:
> On Thu, May 05, 2022 at 05:17:48PM -0400, Peter Xu wrote:
> > I observed that for shared file-backed page faults, we're very likely to
> > retry one more time on the first write fault upon a missing page.  That's
> > because we need to release the mmap lock for dirty rate limiting with
> > balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> > 
> > Then after that throttling we return VM_FAULT_RETRY.
> > 
> > We did that probably because VM_FAULT_RETRY is the only way we can
> > return to the fault handler while telling it we've released the mmap lock.
> > 
> > However that's not ideal, because the fault very likely does not need to
> > be retried at all: the pgtable was already installed before the
> > throttling, so the follow-up fault (including taking the mmap read lock,
> > walking the pgtable, etc.) is in most cases unnecessary.
> > 
> > This not only slows down shared file-backed page faults, but also adds
> > mmap lock contention which in most cases is not needed at all.
> > 
> > To observe this, write to some shmem page and look at the "pgfault" value
> > in /proc/vmstat: we should expect 2 counts for each shmem write, simply
> > because we retried and the "pgfault" vm event captures both faults.
> > 
> > To make this more efficient, add a new VM_FAULT_COMPLETED return code to
> > indicate that we've completed the whole fault and released the lock.  It's
> > also a hint that we very possibly won't need another immediate fault on
> > this page, because we've just completed it.
> > 
> > This patch provides a ~12% perf boost on my aarch64 test VM, running a
> > simple program that sequentially dirties a 400MB mmap()ed shmem file
> > (lower is better):
> > 
> >   Before: 650980.20 (+-1.94%)
> >   After:  569396.40 (+-1.38%)
> > 
> > I believe it could help more than that.
> > 
> > We need some special care for GUP and the s390 pgfault handler (for the
> > gmap code before returning from pgfault); the rest of the changes in the
> > page fault handlers should be relatively straightforward.
> > 
> > Another thing to mention is that mm_account_fault() accounts this new
> > fault as a generic fault, unlike VM_FAULT_RETRY.
> > 
> > I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> > not handle VM_FAULT_RETRY even in the existing code, so I'm keeping them
> > as-is.
> > 
> > Signed-off-by: Peter Xu <peterx@...hat.com>
> 
> The change makes sense to me, but the unlock/retry signaling is
> tricky...
> 
> > @@ -1227,6 +1247,18 @@ int fixup_user_fault(struct mm_struct *mm,
> >  		return -EINTR;
> >  
> >  	ret = handle_mm_fault(vma, address, fault_flags, NULL);
> > +
> > +	if (ret & VM_FAULT_COMPLETED) {
> > +		/*
> > +		 * NOTE: it's a pity that we need to retake the lock here
> > +		 * to pair with the unlock() in the callers. Ideally we
> > +		 * could tell the callers so they do not need to unlock.
> > +		 */
> > +		mmap_read_lock(mm);
> > +		*unlocked = true;
> > +		return 0;
> > +	}
> 
> unlocked can be NULL inside the function, yet you assume it's non-NULL
> here. This is okay because COMPLETED can only be returned if ALLOW_RETRY
> is set, and when ALLOW_RETRY is set unlocked must be non-NULL. It's
> correct but not very obvious.
> 
> It might be cleaner to have separate flags for ALLOW_RETRY and
> ALLOW_UNLOCK, with corresponding VM_FAULT_RETRY and VM_FAULT_UNLOCKED?
> Even if not all combinations are used.

I can do that, but note that I don't see a major difference even if we
introduce ALLOW_UNLOCK.  IOW, afaict whenever we have ALLOW_RETRY we'll
probably also have ALLOW_UNLOCK, and vice versa.  If that's the case, it
kind of defeats the benefit of introducing a new flag.

What I want to achieve with this patch is to provide a variant of
VM_FAULT_RETRY.  I did it because VM_FAULT_RETRY actually carries more
than one meaning:

  (1) It means we have released the mmap_sem read lock, AND,
  (2) It requires the caller to retry the fault one more time.

Here the tricky part is that (2) depends on (1), so when we have (2) we
must also have (1).

However it does not hold the other way round: we could have (1) without
(2), which is exactly the case in fault_dirty_shared_page() right now,
where we only want (1) but have to use (1+2) because we don't have a
retval that contains (1) alone.

So I see the new VM_FAULT_COMPLETED as a variant of VM_FAULT_RETRY, just
with condition (2) removed.
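
For illustration, an arch page fault handler consuming the new retval would
look roughly like below - a condensed sketch only, skipping the signal and
error paths (the exact hunk differs per architecture):

retry:
	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	/* ... vma checks, access checks ... */
	fault = handle_mm_fault(vma, address, flags, regs);

	/*
	 * (1) only: the mmap lock was already released and the fault
	 * was fully handled, so there's nothing left to do - no retry.
	 */
	if (fault & VM_FAULT_COMPLETED)
		return;

	/* (1)+(2): the lock was already released; fault once more. */
	if (fault & VM_FAULT_RETRY) {
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}

	/* Neither: we still hold the lock and must drop it ourselves. */
	mmap_read_unlock(mm);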

> 
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2942,7 +2942,7 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
> >  		balance_dirty_pages_ratelimited(mapping);
> >  		if (fpin) {
> >  			fput(fpin);
> > -			return VM_FAULT_RETRY;
> > +			return VM_FAULT_COMPLETED;
> 
> There is one oddity in this now.
> 
> It completes the fault and no longer triggers a retry. Yet it's still
> using maybe_unlock_mmap_for_io() and subject to retry limiting. This
> means that if the fault already retried once, this code won't drop the
> mmap_sem to call balance_dirty_pages() - even though it safely could
> and should do so, without risking endless retries.

This sounds like a separate change to me.  It's about whether we should
consider using maybe_unlock_mmap_for_io() even for page faults that have
already retried once.

It looks good to me, but I don't know enough about the rate limiting code
path to provide anything more useful.  I do see quite a few other callers
of maybe_unlock_mmap_for_io() outside of this call site, so we'd want to
analyze all of them before drawing a conclusion, IMHO.

What I can say is that if one day we choose to do so, then as long as fpin
is set I think we can still safely return VM_FAULT_COMPLETED here, provided
the assumptions stay the same: we have released the mmap_sem, and we have
finished the current page fault processing - which is exactly the hint to
be delivered upward to the callers.
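
To spell out that invariant, the tail of fault_dirty_shared_page() with
this patch applied would look roughly like below (condensed; the dirtying
and file_update_time() parts above it are unchanged):

	if (mapping && (dirtied || page_mkwrite)) {
		struct file *fpin;

		/* Drops the mmap lock only if a retry is still allowed. */
		fpin = maybe_unlock_mmap_for_io(vmf, NULL);
		balance_dirty_pages_ratelimited(mapping);
		if (fpin) {
			fput(fpin);
			/*
			 * The pgtable entry was installed before the
			 * throttling and the mmap lock is gone, so the
			 * fault is fully handled - no retry needed.
			 */
			return VM_FAULT_COMPLETED;
		}
	}

	/* fpin was NULL: the lock is still held; the caller unlocks. */
	return 0;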

Thanks,

-- 
Peter Xu
