linux-kernel - Re: [PATCH 19/25] mm/s390: Use mm_fault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200617164413.GG76766@xz-x1>
Date:   Wed, 17 Jun 2020 12:44:13 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Christian Borntraeger <borntraeger@...ibm.com>
Cc:     Alexander Gordeev <agordeev@...ux.ibm.com>,
        linux-kernel@...r.kernel.org,
        Gerald Schaefer <gerald.schaefer@...ibm.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>, linux-s390@...r.kernel.org
Subject: Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()

On Wed, Jun 17, 2020 at 06:14:52PM +0200, Christian Borntraeger wrote:
> 
> 
> On 17.06.20 18:06, Peter Xu wrote:
> > Hi, Christian,
> > 
> > On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 16.06.20 18:35, Peter Xu wrote:
> >>> Hi, Alexander,
> >>>
> >>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>>>  	if (unlikely(fault & VM_FAULT_ERROR))
> >>>>>  		goto out_up;
> >>>>>
> >>>>> -	/*
> >>>>> -	 * Major/minor page fault accounting is only done on the
> >>>>> -	 * initial attempt. If we go through a retry, it is extremely
> >>>>> -	 * likely that the page will be found in page cache at that point.
> >>>>> -	 */
> >>>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>>> -		if (fault & VM_FAULT_MAJOR) {
> >>>>> -			tsk->maj_flt++;
> >>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>>>> -				      regs, address);
> >>>>> -		} else {
> >>>>> -			tsk->min_flt++;
> >>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>>>> -				      regs, address);
> >>>>> -		}
> >>>>>  		if (fault & VM_FAULT_RETRY) {
> >>>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> > 
> > [1]
> > 
> >>>>
> >>>> Seems like the call to mm_fault_accounting() will be missed if
> >>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >>>> jumps to "out_up"...
> >>>
> >>> This is true as a functional change.  However that also means that we've got a
> >>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> >>> than handled correctly (for instance, due to some try_lock failed during the
> >>> fault process).
> >>>
> >>> To me, that case should not be counted as a page fault at all?  Or we might get
> >>> the same duplicated accounting when the page fault retried from a higher stack.
> >>>
> >>> Thanks
> >>
> >> This case below (the one with the gmap) is the KVM case for doing a so called
> >> pseudo page fault to our guests. (we notify our guests about major host page
> >> faults and let it reschedule to something else instead of halting the vcpu).
> >> This is being resolved with either gup or fixup_user_fault asynchronously by
> >> KVM code (this can also be sync when the guest does not match some conditions)
> >> We do not change the counters in that code as far as I can tell so we should
> >> continue to do it here.
> >>
> >> (see arch/s390/kvm/kvm-s390.c
> >> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> >> {
> >> [...]
> >>         } else if (current->thread.gmap_pfault) {
> >>                 trace_kvm_s390_major_guest_pfault(vcpu);
> >>                 current->thread.gmap_pfault = 0;
> >>                 if (kvm_arch_setup_async_pf(vcpu))
> >>                         return 0;
> >>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> >>         }
> > 
> > Please correct me if I'm wrong... but I still think what this patch does is the
> > right thing to do.
> > 
> > Note again that IMHO when reached [1] above it means the page fault is not
> > handled correctly so we need to fallback to KVM async page fault, then we
> > shouldn't increment the accountings until it's finally handled correctly. That
> > final accounting should be done in the async pf path in gup code where the page
> > fault is handled:
> > 
> >   kvm_arch_fault_in_page
> >     gmap_fault
> >       fixup_user_fault
> > 
> > Where in fixup_user_fault() we have:
> > 
> > 	if (tsk) {
> > 		if (major)
> > 			tsk->maj_flt++;
> > 		else
> > 			tsk->min_flt++;
> > 	}
> > 
> 
> Right that case does work. Its the case where we do not inject a pseudo pagefault and
> instead fall back to synchronous fault-in.
> What is about the other case:
> 
> kvm_setup_async_pf
> 	->workqueue
> 		async_pf_execute
> 			get_user_pages_remote
> 
> Does get_user_pages_remote do the accounting as well? I cant see that.
> 

It should, with:

  get_user_pages_remote
    __get_user_pages_remote
      __get_user_pages_locked
        __get_user_pages
          faultin_page

Where the accounting is done in faultin_page().

We probably need to also move the accounting in faultin_page() to be after the
retry check too, but that should be irrelevant to this patch.

Thanks!

-- 
Peter Xu