lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJuCfpE+eza4eZ9Hckvuj2MarLwaXoTFTNtUdw375Ugo=dUfvg@mail.gmail.com>
Date: Fri, 10 Jan 2025 08:47:18 -0800
From: Suren Baghdasaryan <surenb@...gle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: akpm@...ux-foundation.org, peterz@...radead.org, willy@...radead.org, 
	liam.howlett@...cle.com, lorenzo.stoakes@...cle.com, mhocko@...e.com, 
	hannes@...xchg.org, mjguzik@...il.com, oliver.sang@...el.com, 
	mgorman@...hsingularity.net, david@...hat.com, peterx@...hat.com, 
	oleg@...hat.com, dave@...olabs.net, paulmck@...nel.org, brauner@...nel.org, 
	dhowells@...hat.com, hdanton@...a.com, hughd@...gle.com, 
	lokeshgidra@...gle.com, minchan@...gle.com, jannh@...gle.com, 
	shakeel.butt@...ux.dev, souravpanda@...gle.com, pasha.tatashin@...een.com, 
	klarasmodin@...il.com, richard.weiyang@...il.com, corbet@....net, 
	linux-doc@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	kernel-team@...roid.com
Subject: Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a
 reference count

On Fri, Jan 10, 2025 at 7:56 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka <vbabka@...e.cz> wrote:
> >
> > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > > 1. Readers never wait. They try to take the vma_lock and fall back to
> > > mmap_lock if that fails.
> > > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > > because writers first take mmap_lock in write mode.
> > > Because of these requirements, full rw_semaphore functionality is not
> > > needed and we can replace rw_semaphore and the vma->detached flag with
> > > a refcount (vm_refcnt).
> > > When vma is in detached state, vm_refcnt is 0 and only a call to
> > > vma_mark_attached() can take it out of this state. Note that unlike
> > > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > > to be done only after vma has been write-locked. vma_mark_attached()
> > > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > > a writer. When writer takes write lock, it sets the top usable bit to
> > > indicate its presence. If there are readers, writer will wait using newly
> > > introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> > > mode first, there can be only one writer at a time. The last reader to
> > > release the lock will signal the writer to wake up.
> > > refcount might overflow if there are many competing readers, in which case
> > > read-locking will fail. Readers are expected to handle such failures.
> > > In summary:
> > > 1. all readers increment the vm_refcnt;
> > > 2. writer sets top usable (writer) bit of vm_refcnt;
> > > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > > 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> > > to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> > > 5. vm_refcnt overflow is handled by the readers.
> > >
> > > Suggested-by: Peter Zijlstra <peterz@...radead.org>
> > > Suggested-by: Matthew Wilcox <willy@...radead.org>
> > > Signed-off-by: Suren Baghdasaryan <surenb@...gle.com>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@...e.cz>
> >
> > But think there's a problem that will manifest after patch 15.
> > Also I don't feel qualified enough about the lockdep parts though
> > (although I think I spotted another issue with those, below) so best if
> > PeterZ can review those.
> > Some nits below too.
> >
> > > +
> > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > +{
> > > +     int oldcnt;
> > > +
> > > +     if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > > +             rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >
> > Shouldn't we rwsem_release always? And also shouldn't it precede the
> > refcount operation itself?
>
> Yes. Hillf pointed to the same issue. It will be fixed in the next version.
>
> >
> > > +             if (is_vma_writer_only(oldcnt - 1))
> > > +                     rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > Hmm hmm we should maybe read the vm_mm pointer before dropping the
> > refcount? In case this races in a way that is_vma_writer_only tests true
> > but the writer meanwhile finishes and frees the vma. It's safe now but
> > not after making the cache SLAB_TYPESAFE_BY_RCU ?
>
> Hmm. But if is_vma_writer_only() is true that means the writed is
> blocked and is waiting for the reader to drop the vm_refcnt. IOW, it
> won't proceed and free the vma until the reader calls
> rcuwait_wake_up(). Your suggested change is trivial and I can do it
> but I want to make sure I'm not missing something. Am I?

Ok, after thinking some more, I think the race you might be referring
to is this:

writer                                           reader

    __vma_enter_locked
        refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
                                                    vma_refcount_put
                                                       __refcount_dec_and_test()
                                                           if
(is_vma_writer_only())
        rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, ...)
    __vma_exit_locked
        refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
    free the vma

rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

I think it's possible and your suggestion of storing the mm before
doing __refcount_dec_and_test() should work. Thanks for pointing this
out! I'll fix it in the next version.

>
> >
> > > +     }
> > > +}
> > > +
> >
> > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > >  {
> > >       rcu_read_lock(); /* keeps vma alive till the end of up_read */
> >
> > This should refer to vma_refcount_put(). But after fixing it I think we
> > could stop doing this altogether? It will no longer keep vma "alive"
> > with SLAB_TYPESAFE_BY_RCU.
>
> Yeah, I think the comment along with rcu_read_lock()/rcu_read_unlock()
> here can be safely removed.
>
> >
> > > -     up_read(&vma->vm_lock.lock);
> > > +     vma_refcount_put(vma);
> > >       rcu_read_unlock();
> > >  }
> > >
> >
> > <snip>
> >
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > >  #endif
> > >
> > >  #ifdef CONFIG_PER_VMA_LOCK
> > > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > > +{
> > > +     /*
> > > +      * If vma is detached then only vma_mark_attached() can raise the
> > > +      * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > > +      */
> > > +     if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > > +             return false;
> > > +
> > > +     rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > +     rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > > +                refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > > +                TASK_UNINTERRUPTIBLE);
> > > +     lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > > +
> > > +     return true;
> > > +}
> > > +
> > > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > > +{
> > > +     *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > > +     rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > +}
> > > +
> > >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > >  {
> > > -     down_write(&vma->vm_lock.lock);
> > > +     bool locked;
> > > +
> > > +     /*
> > > +      * __vma_enter_locked() returns false immediately if the vma is not
> > > +      * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> > > +      * indicating that vma is attached with no readers.
> > > +      */
> > > +     locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
> >
> > Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> > below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> > __vma_enter_locked() itself as it's the one adding it in the first place.
>
> Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> and inside __vma_enter_locked() we do:
>
> unsigned int tgt_refcnt = VMA_LOCK_OFFSET + vma_attached ? 1 : 0;
>
> Is that better?
>
> >

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ