linux-kernel - Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set

Message-ID: <20160223225825.GA3651@node.shutemov.name>
Date:	Wed, 24 Feb 2016 01:58:26 +0300
From:	"Kirill A. Shutemov" <kirill@...temov.name>
To:	Dave Hansen <dave.hansen@...el.com>
Cc:	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Hugh Dickins <hughd@...gle.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	Christoph Lameter <cl@...two.org>,
	Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
	Jerome Marchand <jmarchan@...hat.com>,
	Yang Shi <yang.shi@...aro.org>,
	Sasha Levin <sasha.levin@...cle.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCHv2 08/28] mm: postpone page table allocation until
 do_set_pte()

On Tue, Feb 16, 2016 at 09:38:48AM -0800, Dave Hansen wrote:
> Sorry, fat-fingered the send on the last one.
> 
> On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
> >>> +	if (unlikely(pmd_none(*fe->pmd) &&
> >>> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
> >>> +		return VM_FAULT_OOM;
> >>
> >> Should we just move this pmd_none() check in to __pte_alloc()?  You do
> >> this same-style check at least twice.
> > 
> > We have it there. The check here is speculative to avoid taking ptl.
> 
> OK, that's a performance optimization.  Why shouldn't all callers of
> __pte_alloc() get the same optimization?

I've sent a patch for this.
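
For reference, the "speculative check, then re-check under the lock" pattern discussed above looks roughly like this in userspace C. It is only a sketch under stated assumptions, not the kernel code: struct slot, slot_alloc() and the pthread mutex are hypothetical stand-ins for the pmd entry, __pte_alloc() and the ptl.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pmd slot: NULL plays the role of pmd_none(). */
struct slot {
	pthread_mutex_t lock;	/* stand-in for the page table lock */
	_Atomic(void *) table;	/* stand-in for the pmd entry */
};

/* Returns 0 on success, -1 on allocation failure (think VM_FAULT_OOM). */
static int slot_alloc(struct slot *s)
{
	void *new;

	assert(s != NULL);

	/* Speculative check without the lock: nothing to do if populated. */
	if (atomic_load_explicit(&s->table, memory_order_acquire))
		return 0;

	new = calloc(1, 4096);
	if (!new)
		return -1;

	pthread_mutex_lock(&s->lock);
	/* Authoritative check: the slot may have been filled in the meantime. */
	if (!atomic_load_explicit(&s->table, memory_order_relaxed)) {
		atomic_store_explicit(&s->table, new, memory_order_release);
		new = NULL;	/* ownership moved into the slot */
	}
	pthread_mutex_unlock(&s->lock);

	free(new);	/* lost the race: drop our copy; free(NULL) is a no-op */
	return 0;
}
```

The unlocked check only saves taking the lock in the common case. Correctness still rests on the check made with the lock held.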

> >>> +	/* If an huge pmd materialized from under us just retry later */
> >>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> >>> +		return 0;
> >>
> >> Nit: please stop sprinkling unlikely() everywhere.  Is there some
> >> concrete benefit to doing it here?  I really doubt the compiler needs
> >> help putting the code for "return 0" out-of-line.
> >>
> >> Why is it important to abort here?  Is this a small-page-only path?
> > 
> > This unlikely() was moved from __handle_mm_fault(). I didn't put much
> > consideration in it.
> 
> OK, but separately from the unlikely()...  Why is it important to jump
> out of this code when we see a pmd_trans_huge() pmd?

The code below works on the pte level, so it expects the pmd to point to a
page table.

And the page fault has most likely been resolved already anyway.

> >>> +static int pte_alloc_one_map(struct fault_env *fe)
> >>> +{
> >>> +	struct vm_area_struct *vma = fe->vma;
> >>> +
> >>> +	if (!pmd_none(*fe->pmd))
> >>> +		goto map_pte;
> >>
> >> So the calling convention here is...?  It looks like this has to be
> >> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
> >> pte page.  This can never be called on a huge pmd.  Right?
> > 
> > It's not under ptl, so pmd can be filled under us. There's huge pmd check in
> > 'map_pte' goto path.
> 
> OK, could we add some comments on that?  We expect to be called to
> ______, but if there is a race, we might also have to handle ______, etc...?

Ok.

> >>> +	if (fe->prealloc_pte) {
> >>> +		smp_wmb(); /* See comment in __pte_alloc() */
> >>
> >> Are we trying to make *this* cpu's write visible, or to see the write
> >> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
> >> smp_rmb() what we want for that?
> > 
> > See 362a61ad6119.
> 
> That patch explains that anyone allocating and initializing a page table
> page must ensure that all CPUs can see the initialization writes
> *before* the page can be linked into the page tables.  __pte_alloc()
> performs a smp_wmb() to ensure that other processors can see its writes.
> 
> That still doesn't answer my question though.  What does this barrier
> do?  What does it make visible to this processor?  __pte_alloc() already
> made its initialization visible, so what's the purpose *here*?

We don't call __pte_alloc() to allocate the page table for ->prealloc_pte,
we call pte_alloc_one(), which doesn't have the barrier.
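
To illustrate the ordering requirement, here is a rough userspace sketch using C11 atomics. All names (struct table, table_alloc(), table_publish(), table_lookup()) are hypothetical. The release fence is only an analogue of smp_wmb(), and the acquire load on the reader side is an assumption of this sketch; the kernel relies on its own rules for page table walkers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define NR_ENTRIES 512

/* Hypothetical stand-ins: a "page table" and the slot that links it in. */
struct table { unsigned long entries[NR_ENTRIES]; };
static _Atomic(struct table *) slot;

/* Plain allocation and initialisation, with no ordering guarantee at all. */
static struct table *table_alloc(void)
{
	struct table *t = malloc(sizeof(*t));

	if (t)
		memset(t, 0, sizeof(*t));
	return t;
}

/* Whoever links the table in has to order the init stores itself. */
static void table_publish(struct table *t)
{
	assert(t != NULL);
	/* Write barrier: init stores become visible before the pointer does. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&slot, t, memory_order_relaxed);
}

/* A walker that sees the pointer may rely on the entries being initialised. */
static unsigned long table_lookup(unsigned int idx)
{
	struct table *t = atomic_load_explicit(&slot, memory_order_acquire);

	return t ? t->entries[idx % NR_ENTRIES] : ~0UL;
}
```

If the allocator does not include the barrier, as with table_alloc() here, the barrier has to sit at the point where the table is linked in.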

> >>> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
> >>> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
> >>> +		spin_unlock(fe->ptl);
> >>> +		fe->prealloc_pte = 0;
> >>> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
> >>> +					fe->address))) {
> >>> +		return VM_FAULT_OOM;
> >>> +	}
> >>> +map_pte:
> >>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> >>> +		return VM_FAULT_NOPAGE;
> >>
> >> I think I need a refresher on the locking rules.  pte_offset_map*() is
> >> unsafe to call on a huge pmd.  What in this context makes it impossible
> >> for the pmd to get promoted after the check?
> > 
> > Do you mean what stops a pte page table from being collapsed into a huge pmd?
> > The answer is mmap_sem. Collapse code takes the lock for write to be able to
> > retract the page table.
> 
> What I learned in this set is that pte_offset_map_lock() is dangerous to
> call unless THPs have been excluded somehow from the PMD it's being
> called on.
> 
> What I'm looking for is something to make sure that the context has been
> thought through and is thoroughly THP-free.
> 
> It sounds like you've thought through all the cases, but your thoughts
> aren't clear from the way the code is laid out currently.

Actually, I've discovered a race while looking into this code. Andrea has
fixed it in __handle_mm_fault() and I will move the comment here.

> >>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
> >>>   *
> >>>   * Target users are page handler itself and implementations of
> >>>   * vm_ops->map_pages.
> >>>   */
> >>> -void do_set_pte(struct fault_env *fe, struct page *page)
> >>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> >>> +		struct page *page)
> >>>  {
> >>>  	struct vm_area_struct *vma = fe->vma;
> >>>  	bool write = fe->flags & FAULT_FLAG_WRITE;
> >>>  	pte_t entry;
> >>>  
> >>> +	if (!fe->pte) {
> >>> +		int ret = pte_alloc_one_map(fe);
> >>> +		if (ret)
> >>> +			return ret;
> >>> +	}
> >>> +
> >>> +	if (!pte_none(*fe->pte))
> >>> +		return VM_FAULT_NOPAGE;
> >>
> >> Oh, you've got to add another pte_none() check because you're deferring
> >> the acquisition of the ptl lock?
> > 
> > Yes, we need to re-check once ptl is taken.
> 
> Another good comment to add, I think. :)

Ok.

> >>> -	/* Check if it makes any sense to call ->map_pages */
> >>> -	fe->address = start_addr;
> >>> -	while (!pte_none(*fe->pte)) {
> >>> -		if (++start_pgoff > end_pgoff)
> >>> -			goto out;
> >>> -		fe->address += PAGE_SIZE;
> >>> -		if (fe->address >= fe->vma->vm_end)
> >>> -			goto out;
> >>> -		fe->pte++;
> >>> +	if (pmd_none(*fe->pmd))
> >>> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
> >>> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> >>> +	if (fe->prealloc_pte) {
> >>> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
> >>> +		fe->prealloc_pte = 0;
> >>>  	}
> >>> +	if (!fe->pte)
> >>> +		goto out;
> >>
> >> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
> >> present?
> > 
> > Huge pmd is mapped.
> > 
> > Comment added.
> 
> Huh, so in _some_ contexts, !fe->pte means that we've got a huge pmd.  I
> don't remember seeing that spelled out in the structure comments.

I'll change it to "if (pmd_trans_huge(*fe->pmd))".

> >>> +	if (unlikely(pmd_none(*fe->pmd))) {
> >>> +		/*
> >>> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
> >>> +		 * want to allocate huge page, and if we expose page table
> >>> +		 * for an instant, it will be difficult to retract from
> >>> +		 * concurrent faults and from rmap lookups.
> >>> +		 */
> >>> +	} else {
> >>> +		/*
> >>> +		 * A regular pmd is established and it can't morph into a huge
> >>> +		 * pmd from under us anymore at this point because we hold the
> >>> +		 * mmap_sem read mode and khugepaged takes it in write mode.
> >>> +		 * So now it's safe to run pte_offset_map().
> >>> +		 */
> >>> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
> >>> +
> >>> +		entry = *fe->pte;
> >>> +		barrier();
> >>
> >> Barrier because....?
> 
> Did you miss a response here, Kirill?

The comment below is about this barrier.
Isn't it sufficient?
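
Roughly, the idea is the following. This is a userspace sketch with hypothetical names (struct wide_pte, snapshot_or_null()), and compiler_barrier() uses GCC/Clang inline asm syntax as a stand-in for the kernel's barrier().

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier (GCC/Clang syntax); it emits no CPU instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical two-word entry, wider than the machine word on a 32-bit CPU. */
struct wide_pte { unsigned int lo, hi; };

static int wide_pte_none(struct wide_pte e)
{
	return e.lo == 0 && e.hi == 0;
}

/*
 * Copy the entry once, then keep the compiler from re-reading *ptep for the
 * checks that follow. The copy may still be torn by a concurrent writer, so
 * the result is only a hint and must be re-checked with the lock held.
 */
static struct wide_pte *snapshot_or_null(struct wide_pte *ptep,
					 struct wide_pte *snap)
{
	assert(ptep != NULL && snap != NULL);
	*snap = *ptep;
	compiler_barrier();
	return wide_pte_none(*snap) ? NULL : ptep;
}
```

The barrier does not make the read atomic. It only guarantees that the later ifs all look at the same local copy.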

> >>> +		if (pte_none(entry)) {
> >>> +			pte_unmap(fe->pte);
> >>> +			fe->pte = NULL;
> >>> +		}
> >>> +	}
> >>> +
> >>>  	/*
> >>>  	 * some architectures can have larger ptes than wordsize,
> >>>  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
> >>>  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
> >>> -	 * The code below just needs a consistent view for the ifs and
> >>> +	 * The code above just needs a consistent view for the ifs and
> >>>  	 * we later double check anyway with the ptl lock held. So here
> >>>  	 * a barrier will do.
> >>>  	 */
> >>
> >> Looks like the barrier got moved, but not the comment.
> > 
> > Moved.
> > 
> >> Man, that's a lot of code.
> > 
> > Yeah. I don't see a sensible way to split it. :-/
> 
> Can you do the "postpone allocation" parts without adding additional THP
> code?  Or does the postponement just add all of the extra THP-handling
> spots?

I'll check. But I wouldn't expect moving the THP handling out of the commit
to make it much smaller.

-- 
 Kirill A. Shutemov