lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <95e9d80e-6c19-4a1f-9c21-307006858dff@suse.cz>
Date: Fri, 10 Jan 2025 15:34:26 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: Suren Baghdasaryan <surenb@...gle.com>, akpm@...ux-foundation.org
Cc: peterz@...radead.org, willy@...radead.org, liam.howlett@...cle.com,
 lorenzo.stoakes@...cle.com, mhocko@...e.com, hannes@...xchg.org,
 mjguzik@...il.com, oliver.sang@...el.com, mgorman@...hsingularity.net,
 david@...hat.com, peterx@...hat.com, oleg@...hat.com, dave@...olabs.net,
 paulmck@...nel.org, brauner@...nel.org, dhowells@...hat.com,
 hdanton@...a.com, hughd@...gle.com, lokeshgidra@...gle.com,
 minchan@...gle.com, jannh@...gle.com, shakeel.butt@...ux.dev,
 souravpanda@...gle.com, pasha.tatashin@...een.com, klarasmodin@...il.com,
 richard.weiyang@...il.com, corbet@....net, linux-doc@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a
 reference count

On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore and the vma->detached flag with
> a refcount (vm_refcnt).
> When vma is in detached state, vm_refcnt is 0 and only a call to
> vma_mark_attached() can take it out of this state. Note that unlike
> before, now we enforce both vma_mark_attached() and vma_mark_detached()
> to be done only after vma has been write-locked. vma_mark_attached()
> changes vm_refcnt to 1 to indicate that it has been attached to the vma
> tree. When a reader takes read lock, it increments vm_refcnt, unless the
> top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> a writer. When writer takes write lock, it sets the top usable bit to
> indicate its presence. If there are readers, writer will wait using newly
> introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> mode first, there can be only one writer at a time. The last reader to
> release the lock will signal the writer to wake up.
> refcount might overflow if there are many competing readers, in which case
> read-locking will fail. Readers are expected to handle such failures.
> In summary:
> 1. all readers increment the vm_refcnt;
> 2. writer sets top usable (writer) bit of vm_refcnt;
> 3. readers cannot increment the vm_refcnt if the writer bit is set;
> 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> 5. vm_refcnt overflow is handled by the readers.
> 
> Suggested-by: Peter Zijlstra <peterz@...radead.org>
> Suggested-by: Matthew Wilcox <willy@...radead.org>
> Signed-off-by: Suren Baghdasaryan <surenb@...gle.com>

Reviewed-by: Vlastimil Babka <vbabka@...e.cz>

But think there's a problem that will manifest after patch 15.
Also I don't feel qualified enough about the lockdep parts though
(although I think I spotted another issue with those, below) so best if
PeterZ can review those.
Some nits below too.

> +
> +static inline void vma_refcount_put(struct vm_area_struct *vma)
> +{
> +	int oldcnt;
> +
> +	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);

Shouldn't we rwsem_release always? And also shouldn't it precede the
refcount operation itself?

> +		if (is_vma_writer_only(oldcnt - 1))
> +			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

Hmm hmm we should maybe read the vm_mm pointer before dropping the
refcount? In case this races in a way that is_vma_writer_only tests true
but the writer meanwhile finishes and frees the vma. It's safe now but
not after making the cache SLAB_TYPESAFE_BY_RCU ?

> +	}
> +}
> +

>  static inline void vma_end_read(struct vm_area_struct *vma)
>  {
>  	rcu_read_lock(); /* keeps vma alive till the end of up_read */

This should refer to vma_refcount_put(). But after fixing it I think we
could stop doing this altogether? It will no longer keep vma "alive"
with SLAB_TYPESAFE_BY_RCU.

> -	up_read(&vma->vm_lock.lock);
> +	vma_refcount_put(vma);
>  	rcu_read_unlock();
>  }
>  

<snip>

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
>  #endif
>  
>  #ifdef CONFIG_PER_VMA_LOCK
> +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> +{
> +	/*
> +	 * If vma is detached then only vma_mark_attached() can raise the
> +	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> +	 */
> +	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> +		return false;
> +
> +	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> +	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> +		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> +		   TASK_UNINTERRUPTIBLE);
> +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> +
> +	return true;
> +}
> +
> +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> +{
> +	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> +	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +}
> +
>  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
>  {
> -	down_write(&vma->vm_lock.lock);
> +	bool locked;
> +
> +	/*
> +	 * __vma_enter_locked() returns false immediately if the vma is not
> +	 * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> +	 * indicating that vma is attached with no readers.
> +	 */
> +	locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);

Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
__vma_enter_locked() itself as it's the one adding it in the first place.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ