linux-kernel - Re: [PATCH v3 2/8] binder: concurrent page installation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xpzec7tqe43sykvqtgrlh3furu7vn2nrnkjmv7odzy7ywd4lf6@hlawbgapxcfk>
Date: Fri, 8 Nov 2024 15:41:41 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Carlos Llamas <cmllamas@...gle.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Arve Hjønnevåg <arve@...roid.com>,
        Todd Kjos <tkjos@...roid.com>, Martijn Coenen <maco@...roid.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Christian Brauner <brauner@...nel.org>,
        Suren Baghdasaryan <surenb@...gle.com>, linux-kernel@...r.kernel.org,
        kernel-team@...roid.com, David Hildenbrand <david@...hat.com>,
        Barry Song <v-songbaohua@...o.com>
Subject: Re: [PATCH v3 2/8] binder: concurrent page installation

* Carlos Llamas <cmllamas@...gle.com> [241108 14:11]:
> Allow multiple callers to install pages simultaneously by downgrading
> the mmap_sem to non-exclusive mode.

Since we actually allow downgrading of a lock, it would be more clear to
say that you are changing from a read lock to a write lock.  I was
searching for the lock downgrade in this patch :)

>Races to the same PTE are handled
> using get_user_pages_remote() to retrieve the already installed page.
> This method significantly reduces contention in the mmap semaphore.
> 
> To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid
> operating on an isolated VMA. In addition, zap_page_range_single() is
> called under the alloc->mutex to avoid racing with the shrinker.
> 
> Many thanks to Barry Song who posted a similar approach [1].
> 
> Link: https://lore.kernel.org/all/20240902225009.34576-1-21cnbao@gmail.com/ [1]
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Barry Song <v-songbaohua@...o.com>
> Cc: Suren Baghdasaryan <surenb@...gle.com>
> Cc: Liam R. Howlett <Liam.Howlett@...cle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@...gle.com>
> Signed-off-by: Carlos Llamas <cmllamas@...gle.com>
> ---
>  drivers/android/binder_alloc.c | 64 +++++++++++++++++++++-------------
>  1 file changed, 40 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index 7241bf4a3ff2..2ab520c285b3 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -221,26 +221,14 @@ static int binder_install_single_page(struct binder_alloc *alloc,
>  				      struct binder_lru_page *lru_page,
>  				      unsigned long addr)
>  {
> +	struct vm_area_struct *vma;
>  	struct page *page;
> -	int ret = 0;
> +	long npages;
> +	int ret;
>  
>  	if (!mmget_not_zero(alloc->mm))
>  		return -ESRCH;
>  
> -	/*
> -	 * Protected with mmap_sem in write mode as multiple tasks
> -	 * might race to install the same page.
> -	 */
> -	mmap_write_lock(alloc->mm);
> -	if (binder_get_installed_page(lru_page))
> -		goto out;
> -
> -	if (!alloc->vma) {
> -		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> -		ret = -ESRCH;
> -		goto out;
> -	}
> -
>  	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
>  	if (!page) {
>  		pr_err("%d: failed to allocate page\n", alloc->pid);
> @@ -248,19 +236,47 @@ static int binder_install_single_page(struct binder_alloc *alloc,
>  		goto out;
>  	}
>  
> -	ret = vm_insert_page(alloc->vma, addr, page);
> -	if (ret) {
> +	mmap_read_lock(alloc->mm);

Ah, I debate bring this up, but you can do this without the mmap read
lock.  You can use rcu and the per-vma locking like in the page fault
path.  If you look at how that is done using the lock_vma_under_rcu()
(mm/memory.c) then you can see that pages are installed today without
the mmap locking (on some architectures - which includes x86_64 and
arm64).

userfaultfd had to do something like this as well, and created some
abstraction to facilitate either mmap or rcu, based on the arch support.
Going to rcu really helped performance there [1].  There was also a
chance of the vma being missed, so it is checked again under the mmap
read lock, but realistically that never happens and exists to ensure
correctness.

You also mention the shrinker and using the alloc->mutex, well zapping
pages can also be done under the vma specific lock - if that helps?

Freeing page tables is different though, that needs more locking, but I
don't think this is an issue for you.

[1]. https://lore.kernel.org/all/20240215182756.3448972-1-lokeshgidra@google.com/

> +	vma = vma_lookup(alloc->mm, addr);

Thank you for not saving a pointer anymore.

> +	if (!vma || vma != alloc->vma) {
> +		__free_page(page);
> +		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> +		ret = -ESRCH;
> +		goto unlock;
> +	}
> +
> +	ret = vm_insert_page(vma, addr, page);
> +	switch (ret) {
> +	case -EBUSY:
> +		/*
> +		 * EBUSY is ok. Someone installed the pte first but the
> +		 * lru_page->page_ptr has not been updated yet. Discard
> +		 * our page and look up the one already installed.
> +		 */
> +		ret = 0;
> +		__free_page(page);
> +		npages = get_user_pages_remote(alloc->mm, addr, 1, 0, &page, NULL);
> +		if (npages <= 0) {
> +			pr_err("%d: failed to find page at offset %lx\n",
> +			       alloc->pid, addr - alloc->buffer);
> +			ret = -ESRCH;
> +			break;
> +		}
> +		fallthrough;
> +	case 0:
> +		/* Mark page installation complete and safe to use */
> +		binder_set_installed_page(lru_page, page);
> +		break;
> +	default:
> +		__free_page(page);
>  		pr_err("%d: %s failed to insert page at offset %lx with %d\n",
>  		       alloc->pid, __func__, addr - alloc->buffer, ret);
> -		__free_page(page);
>  		ret = -ENOMEM;
> -		goto out;
> +		break;
>  	}
> -
> -	/* Mark page installation complete and safe to use */
> -	binder_set_installed_page(lru_page, page);
> +unlock:
> +	mmap_read_unlock(alloc->mm);
>  out:
> -	mmap_write_unlock(alloc->mm);
>  	mmput_async(alloc->mm);
>  	return ret;
>  }
> @@ -1091,7 +1107,6 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
>  	trace_binder_unmap_kernel_end(alloc, index);
>  
>  	list_lru_isolate(lru, item);
> -	mutex_unlock(&alloc->mutex);
>  	spin_unlock(lock);
>  
>  	if (vma) {
> @@ -1102,6 +1117,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
>  		trace_binder_unmap_user_end(alloc, index);
>  	}
>  
> +	mutex_unlock(&alloc->mutex);
>  	mmap_read_unlock(mm);
>  	mmput_async(mm);
>  	__free_page(page_to_free);
> -- 
> 2.47.0.277.g8800431eea-goog
>