Date:   Fri, 12 Oct 2018 09:29:10 +0200
From:   Juergen Gross <jgross@...e.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     joel@...lfernandes.org, kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>, kernel-team@...roid.com,
        Minchan Kim <minchan@...gle.com>,
        Hugh Dickins <hughd@...gle.com>, lokeshgidra@...gle.com,
        Andrew Morton <akpm@...ux-foundation.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Kate Stewart <kstewart@...uxfoundation.org>,
        pombredanne@...b.com, Thomas Gleixner <tglx@...utronix.de>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Radim Krčmář <rkrcmar@...hat.com>,
        kvm@...r.kernel.org
Subject: Re: [PATCH] mm: Speed up mremap on large regions

On 12/10/2018 07:34, Jann Horn wrote:
> On Fri, Oct 12, 2018 at 7:29 AM Juergen Gross <jgross@...e.com> wrote:
>> On 12/10/2018 05:21, Jann Horn wrote:
>>> +cc xen maintainers and kvm folks
>>>
>>> On Fri, Oct 12, 2018 at 4:40 AM Joel Fernandes (Google)
>>> <joel@...lfernandes.org> wrote:
>>>> Android needs to mremap large regions of memory during memory management
>>>> related operations. The mremap system call can be really slow if THP is
>>>> not enabled. The bottleneck is move_page_tables, which is copying each
>>>> pte at a time, and can be really slow across a large map. Turning on THP
>>>> may not be a viable option, and is not for us. This patch speeds up the
>>>> performance for non-THP systems by copying at the PMD level when possible.
>>> [...]
>>>> +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>>>> +                 unsigned long new_addr, unsigned long old_end,
>>>> +                 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
>>>> +{
>>> [...]
>>>> +       /*
>>>> +        * We don't have to worry about the ordering of src and dst
>>>> +        * ptlocks because exclusive mmap_sem prevents deadlock.
>>>> +        */
>>>> +       old_ptl = pmd_lock(vma->vm_mm, old_pmd);
>>>> +       if (old_ptl) {
>>>> +               pmd_t pmd;
>>>> +
>>>> +               new_ptl = pmd_lockptr(mm, new_pmd);
>>>> +               if (new_ptl != old_ptl)
>>>> +                       spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>>>> +
>>>> +               /* Clear the pmd */
>>>> +               pmd = *old_pmd;
>>>> +               pmd_clear(old_pmd);
>>>> +
>>>> +               VM_BUG_ON(!pmd_none(*new_pmd));
>>>> +
>>>> +               /* Set the new pmd */
>>>> +               set_pmd_at(mm, new_addr, new_pmd, pmd);
>>>> +               if (new_ptl != old_ptl)
>>>> +                       spin_unlock(new_ptl);
>>>> +               spin_unlock(old_ptl);
>>>
>>> How does this interact with Xen PV? From a quick look at the Xen PV
>>> integration code in xen_alloc_ptpage(), it looks to me as if, in a
>>> config that doesn't use split ptlocks, this is going to temporarily
>>> drop Xen's type count for the page to zero, causing Xen to de-validate
>>> and then re-validate the L1 pagetable; if you first set the new pmd
>>> before clearing the old one, that wouldn't happen. I don't know how
>>> this interacts with shadow paging implementations.
>>
>> No, this isn't an issue. As the L1 pagetable isn't being released it
>> will stay pinned, so there will be no need to revalidate it.
> 
> Where exactly is the L1 pagetable pinned? xen_alloc_ptpage() does:
> 
>         if (static_branch_likely(&xen_struct_pages_ready))
>             SetPagePinned(page);

This marks the pagetable as to be pinned, so that it gets pinned via

xen_activate_mm()
  xen_pgd_pin()
    __xen_pgd_pin()
      __xen_pgd_walk()
        xen_pin_page()
          xen_do_pin()
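
For reference, xen_do_pin() at the end of that chain just queues the
matching MMUEXT_PIN_* hypercall; roughly (a sketch from memory of
arch/x86/xen/mmu_pv.c, details may differ between kernel versions):

static void xen_do_pin(unsigned level, unsigned long pfn)
{
	struct mmuext_op op;

	op.cmd = level;			/* e.g. MMUEXT_PIN_L1_TABLE */
	op.arg1.mfn = pfn_to_mfn(pfn);

	xen_extend_mmuext_op(&op);	/* queued, issued as a multicall */
}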

> 
>         if (!PageHighMem(page)) {
>             xen_mc_batch();
> 
>             __set_pfn_prot(pfn, PAGE_KERNEL_RO);
> 
>             if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
>                 __pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
> 
>             xen_mc_issue(PARAVIRT_LAZY_MMU);
>         } else {
>             /* make sure there are no stray mappings of
>                this page */
>             kmap_flush_unused();
>         }
> 
> which means that if USE_SPLIT_PTE_PTLOCKS is false, the table doesn't
> get pinned and only stays typed as long as it is referenced by an L2
> table, right?

If the pagetable has been allocated after activation of the address
space, it does indeed seem not to be pinned yet. IMO this is not how it
is meant to work, but most kernel configs for 32-bit PV guests probably
have NR_CPUS >= 4, so they do use split ptlocks.
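
(For reference, USE_SPLIT_PTE_PTLOCKS is, if I remember correctly,
defined in include/linux/mm_types.h as

#define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

with SPLIT_PTLOCK_CPUS defaulting to 4 in mm/Kconfig, so any config
with NR_CPUS >= 4 ends up with split ptlocks.)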

In fact this seems to be a bug: at deactivation of the address space
the kernel will try to unpin the pagetable, and the hypervisor will
issue a warning if it was built with debug messages enabled. The same
applies to suspend()/resume() cycles.
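
The unpin side at deactivation should be roughly the mirror image of
the pin chain above (from memory, the exact path may differ):

xen_exit_mmap()
  xen_pgd_unpin()
    __xen_pgd_unpin()
      __xen_pgd_walk()
        xen_unpin_page()
          xen_do_pin()	/* MMUEXT_UNPIN_TABLE */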


Juergen
