linux-kernel - Re: [PATCH v4 21/66] mm/mmap: Change do_brk_flags() to expand existing VMA and add do_brk

Message-ID: <20220119030311.qblhmf6yk6l7sl6b@revolver>
Date:   Wed, 19 Jan 2022 03:03:19 +0000
From:   Liam Howlett <liam.howlett@...cle.com>
To:     Vlastimil Babka <vbabka@...e.cz>
CC:     "maple-tree@...ts.infradead.org" <maple-tree@...ts.infradead.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Song Liu <songliubraving@...com>,
        Davidlohr Bueso <dave@...olabs.net>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Laurent Dufour <ldufour@...ux.ibm.com>,
        David Rientjes <rientjes@...gle.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Rik van Riel <riel@...riel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Michel Lespinasse <walken.cr@...il.com>,
        Jerome Glisse <jglisse@...hat.com>,
        Minchan Kim <minchan@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Rom Lemarchand <romlem@...gle.com>
Subject: Re: [PATCH v4 21/66] mm/mmap: Change do_brk_flags() to expand
 existing VMA and add do_brk_munmap()

* Vlastimil Babka <vbabka@...e.cz> [220113 07:59]:
> On 12/1/21 15:29, Liam Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
> > 
> > Avoid allocating a new VMA when a vma modification can occur.  When a
> > brk() can expand or contract a VMA, then the single store operation will
> > only modify one index of the maple tree instead of causing a node to
> > split or coalesce.  This avoids unnecessary allocations/frees of maple
> > tree nodes and VMAs.
> > 
> > Use the advanced API for the maple tree to avoid unnecessary walks of
> > the tree.
> > 
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@...cle.com>
> > ---
> >  mm/mmap.c | 258 +++++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 207 insertions(+), 51 deletions(-)
> > 
> > +static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
> > +			unsigned long addr, unsigned long len,
> > +			unsigned long flags)
> >  {
> >  	struct mm_struct *mm = current->mm;
> > -	struct vm_area_struct *vma, *prev;
> > -	pgoff_t pgoff = addr >> PAGE_SHIFT;
> > +	struct vm_area_struct *prev = NULL;
> >  	int error;
> >  	unsigned long mapped_addr;
> >  	validate_mm_mt(mm);
> > @@ -2740,11 +2854,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
> >  	if (error)
> >  		return error;
> >  
> > -	/* Clear old maps, set up prev and uf */
> > -	if (munmap_vma_range(mm, addr, len, &prev, uf))
> > -		return -ENOMEM;
> > -
> > -	/* Check against address space limits *after* clearing old maps... */
> > +	/* Check against address space limits by the changed size */
> 
> Can that cause spurious ENOMEM because now the check assumes 'len' worth of
> purely new pages and no reuse?


I don't think so?  Unless I am missing something, nothing can exist in
this range to begin with.  The brk syscall checks to ensure there is
enough room, and the other two users are the ELF and a.out loaders -
could either of those two map over parts of themselves on load?  The
munmap_vma_range() call seemed to be there primarily to set up for an rb
insert (set prev, rb_link, rb_parent) in the current code, sort of like
how get_unmapped_area() with MAP_FIXED appears to be used to sanitize
the addr and len.


> 
> >  	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
> >  		return -ENOMEM;
> >  
> > @@ -2754,28 +2864,57 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
> >  	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> >  		return -ENOMEM;
> >  
> > -	/* Can we just expand an old private anonymous mapping? */
> > -	vma = vma_merge(mm, prev, addr, addr + len, flags,
> > -			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
> > -	if (vma)
> > -		goto out;
> > +	mas->last = addr + len - 1;
> > +	if (vma) {
> > +		/* Expand the existing vma if possible; almost never a singular
> > +		 * list, so this will almost always fail. */
> >  
> > -	/*
> > -	 * create a vma struct for an anonymous mapping
> > -	 */
> > -	vma = vm_area_alloc(mm);
> > -	if (!vma) {
> > -		vm_unacct_memory(len >> PAGE_SHIFT);
> > -		return -ENOMEM;
> > +		if ((!vma->anon_vma ||
> > +		     list_is_singular(&vma->anon_vma_chain)) &&
> 
> Hmm I feel uneasy about this part that mimics what vma_merge() does. Looks
> like something e.g. we can easily forget to adjust when changing vma_merge()
> itself.

vma_merge() is overly heavy for what brk() is doing.  I get what you are
saying about it potentially being missed, but I think brk is already in
the 'potentially overlooked' category as it stands today.  Honestly, one
fewer user of vma_merge() (and thus fewer __vma_adjust() users) seems
like a win to me.

> Is this optimization worth the trouble given the comment above "so this will
> almost always fail"?

vma_merge() walks the tree for next and next->next and does a lot of
extra checks before arriving at the conclusion that this will fail.
Maybe 'almost always fail' is too strong a wording; on boot of my VM, 63
expands happen out of 517, so 87.8% fail this test.
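
To make the "fail faster" point concrete, here is a rough user-space
model of the check the patch does before expanding in place.  It is only
a sketch: struct toy_vma, TOY_VM_SOFTDIRTY and toy_can_expand() are
made-up names for illustration, the flag value is arbitrary, and the
real code works on a vm_area_struct with mmap_lock held.  The point is
that the decision needs only the one vma, with no lookup of next or
next->next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for VM_SOFTDIRTY; the value is illustrative only. */
#define TOY_VM_SOFTDIRTY 0x08000000UL

/* Hypothetical, heavily reduced model of a VMA. */
struct toy_vma {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	unsigned long flags;
	bool has_anon_vma;	/* models vma->anon_vma != NULL */
	size_t anon_chain_len;	/* models the length of anon_vma_chain */
};

/*
 * Models the condition quoted from the patch:
 *   (!vma->anon_vma || list_is_singular(&vma->anon_vma_chain)) &&
 *   ((vma->vm_flags & ~VM_SOFTDIRTY) == flags)
 * Only the candidate vma is inspected.
 */
static bool toy_can_expand(const struct toy_vma *vma, unsigned long flags)
{
	if (!vma)
		return false;
	if (vma->has_anon_vma && vma->anon_chain_len != 1)
		return false;
	return (vma->flags & ~TOY_VM_SOFTDIRTY) == flags;
}

/* Usage sketch with arbitrary flag values. */
static void toy_expand_selfcheck(void)
{
	struct toy_vma v = { 0x1000, 0x2000, 0x73UL | TOY_VM_SOFTDIRTY,
			     false, 0 };

	assert(toy_can_expand(&v, 0x73UL));	/* soft-dirty bit is ignored */
	assert(!toy_can_expand(&v, 0x71UL));	/* flags differ: new vma */

	v.has_anon_vma = true;
	v.anon_chain_len = 2;
	assert(!toy_can_expand(&v, 0x73UL));	/* chain not singular */
}
```

Anything that fails this test falls through to the allocate-and-store
path, which is the common case in the numbers above.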

> 
> > +		     ((vma->vm_flags & ~VM_SOFTDIRTY) == flags)){
> > +			mas->index = vma->vm_start;
> > +
> > +			vma_adjust_trans_huge(vma, addr, addr + len, 0);
> > +			if (vma->anon_vma) {
> > +				anon_vma_lock_write(vma->anon_vma);
> > +				anon_vma_interval_tree_pre_update_vma(vma);
> > +			}
> > +			vma->vm_end = addr + len;
> > +			vma->vm_flags |= VM_SOFTDIRTY;
> > +			if (mas_store_gfp(mas, vma, GFP_KERNEL))
> > +				goto mas_mod_fail;
> > +
> > +			if (vma->anon_vma) {
> > +				anon_vma_interval_tree_post_update_vma(vma);
> > +				anon_vma_unlock_write(vma->anon_vma);
> > +			}
> > +			khugepaged_enter_vma_merge(vma, flags);
> > +			goto out;
> > +		}
> > +		prev = vma;
> >  	}
> > +	mas->index = addr;
> > +	mas_walk(mas);
> > +
> > +	/* create a vma struct for an anonymous mapping */
> > +	vma = vm_area_alloc(mm);
> > +	if (!vma)
> > +		goto vma_alloc_fail;
> >  
> >  	vma_set_anonymous(vma);
> >  	vma->vm_start = addr;
> >  	vma->vm_end = addr + len;
> > -	vma->vm_pgoff = pgoff;
> > +	vma->vm_pgoff = addr >> PAGE_SHIFT;
> >  	vma->vm_flags = flags;
> >  	vma->vm_page_prot = vm_get_page_prot(flags);
> > -	vma_link(mm, vma, prev);
> > +	if (vma_mas_store(vma, mas))
> > +		goto mas_store_fail;
> > +
> > +	if (!prev)
> > +		prev = mas_prev(mas, 0);
> > +
> > +	__vma_link_list(mm, vma, prev);
> > +	mm->map_count++;
> >  out:
> >  	perf_event_mmap(vma);
> >  	mm->total_vm += len >> PAGE_SHIFT;
> > @@ -2785,15 +2924,31 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
> >  	vma->vm_flags |= VM_SOFTDIRTY;
> >  	validate_mm_mt(mm);
> >  	return 0;
> > +
> > +mas_store_fail:
> > +	vm_area_free(vma);
> > +vma_alloc_fail:
> > +	vm_unacct_memory(len >> PAGE_SHIFT);
> > +	return -ENOMEM;
> > +
> > +mas_mod_fail:
> > +	vma->vm_end = addr;
> > +	if (vma->anon_vma) {
> > +		anon_vma_interval_tree_post_update_vma(vma);
> > +		anon_vma_unlock_write(vma->anon_vma);
> > +	}
> > +	return -ENOMEM;
> > +
> >  }
> >  
> >  int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
> >  {
> >  	struct mm_struct *mm = current->mm;
> > +	struct vm_area_struct *vma = NULL;
> >  	unsigned long len;
> >  	int ret;
> >  	bool populate;
> > -	LIST_HEAD(uf);
> > +	MA_STATE(mas, &mm->mm_mt, addr, addr);
> >  
> >  	len = PAGE_ALIGN(request);
> >  	if (len < request)
> > @@ -2804,10 +2959,11 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
> >  	if (mmap_write_lock_killable(mm))
> >  		return -EINTR;
> >  
> > -	ret = do_brk_flags(addr, len, flags, &uf);
> > +	// This vma left intentionally blank.
> 
> This comment is using unintentionally bad syntax (// vs /* */)

Ha!  Thanks!

> 
> Also if we leave it blank it means this path won't ever expand an existing
> vma, while previously it could succeed the vma_merge, no? Or all callers of
> vm_brk_flags() in a scenario where there's no expand anyway? Maybe just have
> a more verbose comment...

Is it possible that other code paths lead to a vma merge?  From what I
can tell, the other entry points are for the BSS or after loading a
binary anyway.  I guess the next vma could be anon and have matching
flags, but I think they will all have a vma->vm_file.  In fact, if I
change do_brk_flags() to check !vma->vm_file and pass through the vma in
the case of vma->vm_end == addr, then it works - but there is no merging
from this code path that I can see on boot.  If you think this is
necessary, I can add it in, but I don't think it's needed.


> 
> > +	mas_walk(&mas);
> > +	ret = do_brk_flags(&mas, vma, addr, len, flags);
> >  	populate = ((mm->def_flags & VM_LOCKED) != 0);
> >  	mmap_write_unlock(mm);
> > -	userfaultfd_unmap_complete(mm, &uf);
> 
> Looks like this part is removed completely from vm_brk_flags() paths?
> 

When I removed the call to munmap_vma_range(), I dropped the
userfaultfd_unmap_complete() call here as well, since nothing populates
the uf list any more.


> OK it seems the whole patch makes some assumption that vm_brk_flags() never
> has to unmap a pre-existing area, and in the brk() syscall this is now
> delegated to do_brk_munmap(), and do_brk_flags() loses the support. While it
> might be safe, it should be discussed in the patch that vm_brk_flags()
> didn't actually need to support the unmap part, because x y z. And best if
> there are some DEBUG_VM based assertions supporting that.

Yes, I do believe this to be true.  I don't think do_brk_flags() needs
to unmap.  If you look at the brk() syscall, it actually ensures there
is enough room for the expansion + vm_start_gap(next), so that one is
probably safe.  brk() already depends on do_brk_munmap() doing the
unmap.  The other callers are the ELF and a.out loaders, which also
don't appear to use the functionality.  I thought the
munmap_vma_range() call was only for setting up for insertion into the
rbtree.
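
As a sketch of the kind of assertion being asked for, the invariant is
simply "nothing is mapped in [addr, addr + len) when do_brk_flags() is
entered".  The user-space model below shows that check in isolation.
struct toy_range and toy_range_is_free() are hypothetical names, and the
linear scan stands in for whatever range lookup the kernel side would
use; in the kernel this would presumably be wrapped in one of the
VM_BUG_ON()/VM_WARN_ON() style macros so it only costs anything with
DEBUG_VM enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one mapped range, [start, end). */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Returns true when [addr, addr + len) overlaps none of the n ranges.
 * Two half-open ranges overlap when each starts before the other ends.
 */
static bool toy_range_is_free(const struct toy_range *map, size_t n,
			      unsigned long addr, unsigned long len)
{
	unsigned long end = addr + len;
	size_t i;

	if (end < addr)		/* wrapped around: treat as not free */
		return false;

	for (i = 0; i < n; i++) {
		if (map[i].start < end && addr < map[i].end)
			return false;
	}
	return true;
}

/* Usage sketch: growing right after an existing mapping. */
static void toy_range_selfcheck(void)
{
	const struct toy_range map[] = {
		{ 0x1000, 0x3000 },	/* e.g. the current brk area */
		{ 0x8000, 0x9000 },	/* some later mapping */
	};

	assert(toy_range_is_free(map, 2, 0x3000, 0x1000));  /* abuts only */
	assert(!toy_range_is_free(map, 2, 0x2000, 0x2000)); /* overlaps first */
	assert(!toy_range_is_free(map, 2, 0x7000, 0x2000)); /* hits second */
}
```

If that invariant ever fired for the loader callers, it would show that
the unmap support was needed after all.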

> 
> But then again, is the optimized scenario happening often enough to warrant it?

Well, 12.2% use the optimization to the fullest; the rest fail faster.
I am really after the faster failure optimization here.  I especially do
not like the fact that vma_merge() gets the next vma and the next->next
vma prior to seeing if it can be merged.  I get why, but is there really
going to be an anon vma with the right flags, no file, etc. often enough
to try this?  In fact, it's not possible at all to need next->next when
we unmap the area first.  Out of the 8 cases in vma_merge(), only 1, 2,
and 3 are possible.  Of the 3 possibilities, I am suggesting that 2 is
really the only one we should check as 1 and 3 are so unlikely.


> 
> >  	if (populate && !ret)
> >  		mm_populate(addr, len);
> >  	return ret;
>