lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c7c8158d-e360-4a29-9e48-f2b45b9ca147@lucifer.local>
Date: Fri, 23 Jan 2026 17:59:38 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...nel.org>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
        Michal Hocko <mhocko@...e.com>, Shakeel Butt <shakeel.butt@...ux.dev>,
        Jann Horn <jannh@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
        Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        Will Deacon <will@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
        Waiman Long <longman@...hat.com>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Clark Williams <clrkwllms@...nel.org>,
        Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH RESEND v3 07/10] mm/vma: introduce helper struct + thread
 through exclusive lock fns

On Thu, Jan 22, 2026 at 01:41:39PM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 22, 2026 at 5:03 AM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > It is confusing to have __vma_enter_exclusive_locked() return 0, 1 or an
> > error (but only when waiting for readers in TASK_KILLABLE state), and
> > having the return value be stored in a stack variable called 'locked' is
> > further confusion.
> >
> > More generally, we are doing a lock of rather finnicky things during the
> > acquisition of a state in which readers are excluded and moving out of this
> > state, including tracking whether we are detached or not or whether an
> > error occurred.
> >
> > We are implementing logic in __vma_enter_exclusive_locked() that
> > effectively acts as if 'if one caller calls us do X, if another then do Y',
> > which is very confusing from a control flow perspective.
> >
> > Introducing the shared helper object state helps us avoid this, as we can
> > now handle the 'an error arose but we're detached' condition correctly in
> > both callers - a warning if not detaching, and treating the situation as if
> > no error arose in the case of a VMA detaching.
> >
> > This also acts to help document what's going on and allows us to add some
> > more logical debug asserts.
> >
> > Also update vma_mark_detached() to add a guard clause for the likely
> > 'already detached' state (given we hold the mmap write lock), and add a
> > comment about ephemeral VMA read lock reference count increments to clarify
> > why we are entering/exiting an exclusive locked state here.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > ---
> >  mm/mmap_lock.c | 144 +++++++++++++++++++++++++++++++------------------
> >  1 file changed, 91 insertions(+), 53 deletions(-)
> >
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > index f73221174a8b..75166a43ffa4 100644
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -46,20 +46,40 @@ EXPORT_SYMBOL(__mmap_lock_do_trace_released);
> >  #ifdef CONFIG_MMU
> >  #ifdef CONFIG_PER_VMA_LOCK
> >
> > +/* State shared across __vma_[enter, exit]_exclusive_locked(). */
> > +struct vma_exclude_readers_state {
> > +       /* Input parameters. */
> > +       struct vm_area_struct *vma;
> > +       int state; /* TASK_KILLABLE or TASK_UNINTERRUPTIBLE. */
> > +       bool detaching;
> > +
> Are these:
>             /* Output parameters. */
> ?

Yurp.

Oh you added the comment :) well clearly I should have, will do so.

> > +       bool detached;
> > +       bool exclusive; /* Are we exclusively locked? */
> > +};
> > +
> >  /*
> >   * Now that all readers have been evicted, mark the VMA as being out of the
> >   * 'exclude readers' state.
> >   *
> >   * Returns true if the VMA is now detached, otherwise false.
> >   */
> > -static bool __must_check __vma_exit_exclusive_locked(struct vm_area_struct *vma)
> > +static void __vma_exit_exclusive_locked(struct vma_exclude_readers_state *ves)
> >  {
> > -       bool detached;
> > +       struct vm_area_struct *vma = ves->vma;
> > +
> > +       VM_WARN_ON_ONCE(ves->detached);
> > +       VM_WARN_ON_ONCE(!ves->exclusive);
> >
> > -       detached = refcount_sub_and_test(VM_REFCNT_EXCLUDE_READERS_FLAG,
> > -                                        &vma->vm_refcnt);
> > +       ves->detached = refcount_sub_and_test(VM_REFCNT_EXCLUDE_READERS_FLAG,
> > +                                             &vma->vm_refcnt);
> >         __vma_lockdep_release_exclusive(vma);
> > -       return detached;
> > +}
> > +
> > +static unsigned int get_target_refcnt(struct vma_exclude_readers_state *ves)
> > +{
> > +       const unsigned int tgt = ves->detaching ? 0 : 1;
> > +
> > +       return tgt | VM_REFCNT_EXCLUDE_READERS_FLAG;
> >  }
> >
> >  /*
> > @@ -69,30 +89,31 @@ static bool __must_check __vma_exit_exclusive_locked(struct vm_area_struct *vma)
> >   * Note that this function pairs with vma_refcount_put() which will wake up this
> >   * thread when it detects that the last reader has released its lock.
> >   *
> > - * The state parameter ought to be set to TASK_UNINTERRUPTIBLE in cases where we
> > - * wish the thread to sleep uninterruptibly or TASK_KILLABLE if a fatal signal
> > - * is permitted to kill it.
> > + * The ves->state parameter ought to be set to TASK_UNINTERRUPTIBLE in cases
> > + * where we wish the thread to sleep uninterruptibly or TASK_KILLABLE if a fatal
> > + * signal is permitted to kill it.
> >   *
> > - * The function will return 0 immediately if the VMA is detached, and 1 once the
> > - * VMA has evicted all readers, leaving the VMA exclusively locked.
> > + * The function sets the ves->locked parameter to true if an exclusive lock was
>
> s/ves->locked/ves->exclusive
>
> > + * acquired, or false if the VMA was detached or an error arose on wait.
> >   *
> > - * If the function returns 1, the caller is required to invoke
> > - * __vma_exit_exclusive_locked() once the exclusive state is no longer required.
> > + * If the function indicates an exclusive lock was acquired via ves->exclusive
> > + * (or equivalently, returning 0 with !ves->detached),
>
> I would remove the mention of that equivalence because with this
> change, return 0 simply indicates that the operation was successful
> and should not be used to infer any additional states. To get specific
> state the caller should use proper individual ves fields. Using return
> value for anything else defeats the whole purpose of this cleanup.

OK I'll remove the equivalency comment.

>
> > the caller is required to
> > + * invoke __vma_exit_exclusive_locked() once the exclusive state is no longer
> > + * required.
> >   *
> > - * If state is set to something other than TASK_UNINTERRUPTIBLE, the function
> > - * may also return -EINTR to indicate a fatal signal was received while waiting.
> > + * If ves->state is set to something other than TASK_UNINTERRUPTIBLE, the
> > + * function may also return -EINTR to indicate a fatal signal was received while
> > + * waiting.
> >   */
> > -static int __vma_enter_exclusive_locked(struct vm_area_struct *vma,
> > -               bool detaching, int state)
> > +static int __vma_enter_exclusive_locked(struct vma_exclude_readers_state *ves)
> >  {
> > -       int err;
> > -       unsigned int tgt_refcnt = VM_REFCNT_EXCLUDE_READERS_FLAG;
> > +       struct vm_area_struct *vma = ves->vma;
> > +       unsigned int tgt_refcnt = get_target_refcnt(ves);
> > +       int err = 0;
> >
> >         mmap_assert_write_locked(vma->vm_mm);
> > -
> > -       /* Additional refcnt if the vma is attached. */
> > -       if (!detaching)
> > -               tgt_refcnt++;
> > +       VM_WARN_ON_ONCE(ves->detached);
> > +       VM_WARN_ON_ONCE(ves->exclusive);
>
> Aren't these output parameters? If so, why do we stipulate their
> initial values instead of setting them appropriately?

I guess it was just to ensure correctly set up but yes it's a bit weird, will
remove.

>
> >
> >         /*
> >          * If vma is detached then only vma_mark_attached() can raise the
> > @@ -101,37 +122,39 @@ static int __vma_enter_exclusive_locked(struct vm_area_struct *vma,
> >          * See the comment describing the vm_area_struct->vm_refcnt field for
> >          * details of possible refcnt values.
> >          */
> > -       if (!refcount_add_not_zero(VM_REFCNT_EXCLUDE_READERS_FLAG, &vma->vm_refcnt))
> > +       if (!refcount_add_not_zero(VM_REFCNT_EXCLUDE_READERS_FLAG, &vma->vm_refcnt)) {
> > +               ves->detached = true;
> >                 return 0;
> > +       }
> >
> >         __vma_lockdep_acquire_exclusive(vma);
> >         err = rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> >                    refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > -                  state);
> > +                  ves->state);
> >         if (err) {
> > -               if (__vma_exit_exclusive_locked(vma)) {
> > -                       /*
> > -                        * The wait failed, but the last reader went away
> > -                        * as well. Tell the caller the VMA is detached.
> > -                        */
> > -                       WARN_ON_ONCE(!detaching);
> > -                       err = 0;
> > -               }
> > +               __vma_exit_exclusive_locked(ves);
> >                 return err;
>
> Nice! We preserve both error and detached state information.

:)

>
> >         }
> > -       __vma_lockdep_stat_mark_acquired(vma);
> >
> > -       return 1;
> > +       __vma_lockdep_stat_mark_acquired(vma);
> > +       ves->exclusive = true;
> > +       return 0;
> >  }
> >
> >  int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
> >                 int state)
> >  {
> > -       int locked;
> > +       int err;
> > +       struct vma_exclude_readers_state ves = {
> > +               .vma = vma,
> > +               .state = state,
> > +       };
> >
> > -       locked = __vma_enter_exclusive_locked(vma, false, state);
> > -       if (locked < 0)
> > -               return locked;
> > +       err = __vma_enter_exclusive_locked(&ves);
> > +       if (err) {
> > +               WARN_ON_ONCE(ves.detached);
>
> I believe the above WARN_ON_ONCE() should stay inside of
> __vma_enter_exclusive_locked(). Its correctness depends on the
> implementation details of __vma_enter_exclusive_locked(). More

Well this was kind of horrible in the original implementation, as you are
literally telling the function whether you are detaching or not, and only doing
this assert if you were not.

That kind of 'if the caller is X do A, if the caller is Y do B' is really a code
smell, you should have X do the thing.

> specifically, it is only correct because
> __vma_enter_exclusive_locked() returns 0 if the VMA is detached, even
> if there was a pending SIGKILL.

Well it's a documented aspect of the function that we return 0 immediately on
detached state so I'm not sure that is an implementation detail?

I significantly prefer having that here vs. 'if not detaching then assert if
detached' for people to scratch their heads over in the function.

I think this detail is incorrect anyway, because:

	if (err) {
		if (__vma_exit_exclusive_locked(vma)) {
			/*
			 * The wait failed, but the last reader went away
			 * as well. Tell the caller the VMA is detached.
			 */
			 WARN_ON_ONCE(!detaching);
			 err = 0;
		}
		...
	}

Implies - hey we're fine with err not being zero AND detaching right? In which
case reset the error?

Except when detaching we set TASK_UNINTERRUPTIBLE? Which surely means we never
seen an error?

Or do we?

Either way it's something we handle differently based on _caller_. So it doesn't
belong in the function at all.

It's certainly logic that's highly confusing and needs to be handled
differently.

>
> > +               return err;
> > +       }
> >
> >         /*
> >          * We should use WRITE_ONCE() here because we can have concurrent reads
> > @@ -141,9 +164,11 @@ int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
> >          */
> >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> >
> > -       /* vma should remain attached. */
> > -       if (locked)
> > -               WARN_ON_ONCE(__vma_exit_exclusive_locked(vma));
> > +       if (!ves.detached) {
>
> Strictly speaking the above check should be checking ves->exclusive
> instead of !ves.detached. What you have is technically correct but
> it's again related to that comment:
>
> "If the function indicates an exclusive lock was acquired via
> ves->exclusive (or equivalently, returning 0 with !ves->detached), the
> caller is required to invoke __vma_exit_exclusive_locked() once the
> exclusive state is no longer required."
>
> So, here you are using returning 0 with !ves->detached as an
> indication that the VMA was successfully locked. I think it's less
> confusing if we use the field dedicated for that purpose.

OK changed.

>
> > +               __vma_exit_exclusive_locked(&ves);
> > +               /* VMA should remain attached. */
> > +               WARN_ON_ONCE(ves.detached);
> > +       }
> >
> >         return 0;
> >  }
> > @@ -151,7 +176,12 @@ EXPORT_SYMBOL_GPL(__vma_start_write);
> >
> >  void vma_mark_detached(struct vm_area_struct *vma)
> >  {
> > -       bool detached;
> > +       struct vma_exclude_readers_state ves = {
> > +               .vma = vma,
> > +               .state = TASK_UNINTERRUPTIBLE,
> > +               .detaching = true,
> > +       };
> > +       int err;
> >
> >         vma_assert_write_locked(vma);
> >         vma_assert_attached(vma);
> > @@ -160,18 +190,26 @@ void vma_mark_detached(struct vm_area_struct *vma)
> >          * See the comment describing the vm_area_struct->vm_refcnt field for
> >          * details of possible refcnt values.
> >          */
> > -       detached = __vma_refcount_put(vma, NULL);
> > -       if (unlikely(!detached)) {
> > -               /* Wait until vma is detached with no readers. */
> > -               if (__vma_enter_exclusive_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
> > -                       /*
> > -                        * Once this is complete, no readers can increment the
> > -                        * reference count, and the VMA is marked detached.
> > -                        */
> > -                       detached = __vma_exit_exclusive_locked(vma);
> > -                       WARN_ON_ONCE(!detached);
> > -               }
> > +       if (likely(__vma_refcount_put(vma, NULL)))
> > +               return;
> > +
> > +       /*
> > +        * Wait until the VMA is detached with no readers. Since we hold the VMA
> > +        * write lock, the only read locks that might be present are those from
> > +        * threads trying to acquire the read lock and incrementing the
> > +        * reference count before realising the write lock is held and
> > +        * decrementing it.
> > +        */
> > +       err = __vma_enter_exclusive_locked(&ves);
> > +       if (!err && !ves.detached) {
>
> Same here, we should be checking ves->exclusive to decide if
> __vma_exit_exclusive_locked() should be called or not.

Ack, changed.

>
> > +               /*
> > +                * Once this is complete, no readers can increment the
> > +                * reference count, and the VMA is marked detached.
> > +                */
> > +               __vma_exit_exclusive_locked(&ves);
> >         }
> > +       /* If an error arose but we were detached anyway, we don't care. */
> > +       WARN_ON_ONCE(!ves.detached);
> >  }
> >
> >  /*
> > --
> > 2.52.0

Cheers, Lorenzo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ