Message-ID: <CAJuCfpEyh64aHhBFkFt6_u3f9sk1RXBd4Xxy80voMsyy7Unpsg@mail.gmail.com>
Date: Thu, 31 Jul 2025 07:06:41 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: akpm@...ux-foundation.org, jannh@...gle.com, Liam.Howlett@...cle.com, 
	vbabka@...e.cz, pfalcato@...e.de, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] mm: change vma_start_read() to drop RCU lock on failure

On Thu, Jul 31, 2025 at 4:49 AM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
>
> So this patch is broken :P

Ugh, sorry about that. I must have had lockdep disabled when testing this...
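
(For the record, the RCU_LOCKDEP_WARN() added below compiles away entirely
unless lockdep's RCU checking is built in, so a debug config along these
lines is needed to catch this class of bug; these are the standard debug
Kconfig symbols, nothing patch-specific:

	CONFIG_PROVE_LOCKING=y       # lockdep, also enables CONFIG_PROVE_RCU
	CONFIG_DEBUG_ATOMIC_SLEEP=y  # catch sleeping in atomic/RCU contexts
)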

>
> Am getting:
>
> [    2.002807] ------------[ cut here ]------------
> [    2.003014] Voluntary context switch within RCU read-side critical section!
> [    2.003022] WARNING: CPU: 1 PID: 202 at kernel/rcu/tree_plugin.h:332 rcu_note_context_switch+0x506/0x580
> [    2.003643] Modules linked in:
> [    2.003765] CPU: 1 UID: 0 PID: 202 Comm: dhcpcd Not tainted 6.16.0-rc5+ #41 PREEMPT(voluntary)
> [    2.004103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-1-1 04/01/2014
> [    2.004460] RIP: 0010:rcu_note_context_switch+0x506/0x580
> [    2.004669] Code: 00 00 00 0f 85 f5 fd ff ff 49 89 90 a8 00 00 00 e9 e9 fd ff ff c6 05 86 69 90 01 01 90 48 c7 c7 98 c3 90 b8 e8 cb b4 f5 ff 90 <0f> 0b 90 90 e9 38 fb ff ff 48 8b 7d 20 48 89 3c 24 e8 64 26 d5 00
> [    2.005382] RSP: 0018:ffffb36b40607aa8 EFLAGS: 00010082
> [    2.005585] RAX: 000000000000003f RBX: ffff9c044128ad00 RCX: 0000000000000027
> [    2.005866] RDX: ffff9c0577c97f48 RSI: 0000000000000001 RDI: ffff9c0577c97f40
> [    2.006136] RBP: ffff9c0577ca9f80 R08: 40000000ffffe1f7 R09: 0000000000000000
> [    2.006411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [    2.006692] R13: ffff9c04fb0423d0 R14: ffffffffb82e2600 R15: ffff9c044128ad00
> [    2.006968] FS:  00007fd12e7d9740(0000) GS:ffff9c05be614000(0000) knlGS:0000000000000000
> [    2.007281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.007498] CR2: 00007ffe2f0d2798 CR3: 00000001bb8b1000 CR4: 0000000000750ef0
> [    2.007770] PKRU: 55555554
> [    2.007880] Call Trace:
> [    2.007985]  <TASK>
> [    2.008076]  __schedule+0x94/0xee0
> [    2.008212]  ? __pfx_bit_wait_io+0x10/0x10
> [    2.008370]  schedule+0x22/0xd0
> [    2.008517]  io_schedule+0x41/0x60
> [    2.008653]  bit_wait_io+0xc/0x60
> [    2.008783]  __wait_on_bit+0x25/0x90
> [    2.008925]  out_of_line_wait_on_bit+0x85/0x90
> [    2.009104]  ? __pfx_wake_bit_function+0x10/0x10
> [    2.009288]  __ext4_find_entry+0x2b2/0x470
> [    2.009449]  ? __d_alloc+0x117/0x1d0
> [    2.009591]  ext4_lookup+0x6b/0x1f0
> [    2.009733]  path_openat+0x895/0x1030
> [    2.009880]  do_filp_open+0xc3/0x150
> [    2.010021]  ? do_anonymous_page+0x5b1/0xae0
> [    2.010195]  do_sys_openat2+0x76/0xc0
> [    2.010339]  __x64_sys_openat+0x4f/0x70
> [    2.010490]  do_syscall_64+0xa4/0x260
> [    2.010638]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [    2.010840] RIP: 0033:0x7fd12e0a2006
> [    2.010984] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
>
> and
>
> [   23.004231] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   23.004464] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P202/6:b..l
> [   23.004736] rcu:     (detected by 2, t=21002 jiffies, g=-663, q=940 ncpus=4)
> [   23.004992] task:dhcpcd          state:S stack:0     pid:202   tgid:202   ppid:196    task_flags:0x400140 flags:0x00004002
> [   23.005416] Call Trace:
> [   23.005515]  <TASK>
> [   23.005603]  __schedule+0x3ca/0xee0
> [   23.005754]  schedule+0x22/0xd0
> [   23.005878]  schedule_hrtimeout_range_clock+0xf2/0x100
> [   23.006075]  poll_schedule_timeout.constprop.0+0x32/0x80
> [   23.006281]  do_sys_poll+0x3bb/0x550
> [   23.006424]  ? __pfx_pollwake+0x10/0x10
> [   23.006573]  ? __pfx_pollwake+0x10/0x10
> [   23.006712]  ? __pfx_pollwake+0x10/0x10
> [   23.006848]  __x64_sys_ppoll+0xc9/0x160
> [   23.006993]  do_syscall_64+0xa4/0x260
> [   23.007140]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   23.007339] RIP: 0033:0x7fd12e0a2006
> [   23.007483] RSP: 002b:00007ffe2f0f28e0 EFLAGS: 00000202 ORIG_RAX: 000000000000010f
> [   23.007770] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd12e0a2006
> [   23.008035] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 000055abb0c5ae20
> [   23.008309] RBP: 00007ffe2f0f2900 R08: 0000000000000008 R09: 0000000000000000
> [   23.008588] R10: 00007ffe2f0f2c80 R11: 0000000000000202 R12: 000055abb0c3cd80
> [   23.008869] R13: 00007ffe2f0f2c80 R14: 000055ab9297b5c0 R15: 0000000000000000
> [   23.009141]  </TASK>
>
> Here.
>
> I identify the bug below.
>
> On Wed, Jul 30, 2025 at 06:34:04PM -0700, Suren Baghdasaryan wrote:
> > vma_start_read() can drop and reacquire the RCU lock in certain failure
> > cases. It is not apparent to the caller that the RCU read-side critical
> > section it started might be interrupted when vma_start_read() fails to
> > lock the vma. This can become a source of subtle bugs, so to prevent it
> > we change the locking rules for vma_start_read() to drop the RCU read
> > lock upon failure. This way it is more obvious that RCU-protected
> > objects are unsafe after vma locking fails.
> >
> > Suggested-by: Vlastimil Babka <vbabka@...e.cz>
> > Signed-off-by: Suren Baghdasaryan <surenb@...gle.com>
> > ---
> >  mm/mmap_lock.c | 76 +++++++++++++++++++++++++++++---------------------
> >  1 file changed, 44 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > index 10826f347a9f..0129db8f652f 100644
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -136,15 +136,21 @@ void vma_mark_detached(struct vm_area_struct *vma)
> >   * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
> >   * detached.
> >   *
> > - * WARNING! The vma passed to this function cannot be used if the function
> > - * fails to lock it because in certain cases RCU lock is dropped and then
> > - * reacquired. Once RCU lock is dropped the vma can be concurently freed.
> > + * WARNING! On entrance to this function RCU read lock should be held and it
> > + * is released if function fails to lock the vma, therefore vma passed to this
> > + * function cannot be used if the function fails to lock it.
> > + * When vma is successfully locked, RCU read lock is kept intact and RCU read
> > + * session is not interrupted. This is important when locking is done while
> > + * walking the vma tree under RCU using vma_iterator because if the RCU lock
> > + * is dropped, the iterator becomes invalid.
> >   */
>
> I feel like this is a bit of a wall of noise; can we add a clearly
> separated note like:
>
>         ...
>         *
>
>         * IMPORTANT: RCU lock must be held upon entering the function, but
>         *            upon error IT IS RELEASED. The caller must handle this
>         *            correctly.
>         */

Are you suggesting to replace my comment or amend it with this one? I
think the answer is "replace" but I want to confirm.
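
Either way, the caller-side contract we want to document is roughly the
following (a sketch only; IS_ERR_OR_NULL() stands in for the separate
NULL vs -EAGAIN handling):

	rcu_read_lock();
	vma = vma_start_read(mm, vma);
	if (IS_ERR_OR_NULL(vma)) {
		/* vma_start_read() has already dropped the RCU read lock */
		return NULL;
	}
	/* Success: still inside the same RCU read-side critical section. */
	...
	rcu_read_unlock();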

>
> >  static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >                                                   struct vm_area_struct *vma)
> >  {
>
> I was thinking we could split this out into a wrapper __vma_start_read()
> function, but then the stability check wouldn't really fit properly, so
> never mind :)
>
> > +     struct mm_struct *other_mm;
> >       int oldcnt;
> >
> > +     RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
>
> Good to add this.
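
Right. Worth noting it only does anything under CONFIG_PROVE_RCU; the
definition in rcupdate.h is roughly (paraphrased, with the one-shot
__warned logic omitted):

	#define RCU_LOCKDEP_WARN(c, s)						\
		do {								\
			if (debug_lockdep_rcu_enabled() && (c))			\
				lockdep_rcu_suspicious(__FILE__, __LINE__, s);	\
		} while (0)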
>
> >       /*
> >        * Check before locking. A race might cause false locked result.
> >        * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> > @@ -152,8 +158,10 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >        * we don't rely on for anything - the mm_lock_seq read against which we
> >        * need ordering is below.
> >        */
> > -     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
> > -             return NULL;
> > +     if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence)) {
> > +             vma = NULL;
> > +             goto err;
> > +     }
> >
> >       /*
> >        * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
> > @@ -164,7 +172,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >       if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
> >                                                             VMA_REF_LIMIT))) {
> >               /* return EAGAIN if vma got detached from under us */
> > -             return oldcnt ? NULL : ERR_PTR(-EAGAIN);
> > +             vma = oldcnt ? NULL : ERR_PTR(-EAGAIN);
> > +             goto err;
> >       }
> >
> >       rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
> > @@ -175,23 +184,8 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >        * is dropped and before rcuwait_wake_up(mm) is called. Grab it before
> >        * releasing vma->vm_refcnt.
> >        */
>
> I feel like this comment above should be moved below to where the 'action' is.

Ack.

>
> > -     if (unlikely(vma->vm_mm != mm)) {
> > -             /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > -             struct mm_struct *other_mm = vma->vm_mm;
> > -
> > -             /*
> > -              * __mmdrop() is a heavy operation and we don't need RCU
> > -              * protection here. Release RCU lock during these operations.
> > -              * We reinstate the RCU read lock as the caller expects it to
> > -              * be held when this function returns even on error.
> > -              */
> > -             rcu_read_unlock();
> > -             mmgrab(other_mm);
> > -             vma_refcount_put(vma);
> > -             mmdrop(other_mm);
> > -             rcu_read_lock();
> > -             return NULL;
> > -     }
> > +     if (unlikely(vma->vm_mm != mm))
> > +             goto err_unstable;
> >
> >       /*
> >        * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
> > @@ -206,10 +200,26 @@ static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
> >        */
> >       if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&mm->mm_lock_seq))) {
> >               vma_refcount_put(vma);
> > -             return NULL;
> > +             vma = NULL;
> > +             goto err;
> >       }
> >
> >       return vma;
> > +err:
> > +     rcu_read_unlock();
> > +
> > +     return vma;
> > +err_unstable:
>
> Move comment above here I think.

Got it.
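
So the tail of vma_start_read() would end up looking roughly like this,
with the comment sitting right at the action (sketch, comment body
abbreviated):

	err_unstable:
		/*
		 * [the "... Grab it before releasing vma->vm_refcnt" comment
		 *  from above moves here]
		 */
		/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
		other_mm = vma->vm_mm;

		/* __mmdrop() is a heavy operation, do it after dropping RCU lock. */
		rcu_read_unlock();
		mmgrab(other_mm);
		vma_refcount_put(vma);
		mmdrop(other_mm);

		return NULL;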

>
> > +     /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
> > +     other_mm = vma->vm_mm;
> > +
> > +     /* __mmdrop() is a heavy operation, do it after dropping RCU lock. */
> > +     rcu_read_unlock();
> > +     mmgrab(other_mm);
> > +     vma_refcount_put(vma);
> > +     mmdrop(other_mm);
> > +
> > +     return NULL;
> >  }
> >
> >  /*
> > @@ -223,8 +233,8 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       MA_STATE(mas, &mm->mm_mt, address, address);
> >       struct vm_area_struct *vma;
> >
> > -     rcu_read_lock();
> >  retry:
> > +     rcu_read_lock();
> >       vma = mas_walk(&mas);
> >       if (!vma)
> >               goto inval;
>                 ^
>                 |---- this is incorrect, you took the RCU read lock above, but you don't unlock... :)
>
> You can fix easily with:
>
>         if (!vma) {
>                 rcu_read_unlock();
>                 goto inval;
>         }
>
> Which fixes the issue locally for me.

Yes, I overlooked that here we did not yet attempt to lock the vma. Will fix.
Thanks!
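
For v2 the entry path of lock_vma_under_rcu() will then look roughly like
this (sketch, event accounting elided):

	retry:
		rcu_read_lock();
		vma = mas_walk(&mas);
		if (!vma) {
			/* No lock attempt was made, so we still hold the RCU lock. */
			rcu_read_unlock();
			goto inval;
		}

		vma = vma_start_read(mm, vma);
		if (IS_ERR_OR_NULL(vma)) {
			/* vma_start_read() has already dropped the RCU read lock. */
			if (PTR_ERR(vma) == -EAGAIN)
				goto retry;
			goto inval;
		}

		/* The vma is locked and stable; RCU protection is not needed. */
		rcu_read_unlock();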

>
> > @@ -241,6 +251,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >               /* Failed to lock the VMA */
> >               goto inval;
> >       }
> > +
> > +     rcu_read_unlock();
> > +
> >       /*
> >        * At this point, we have a stable reference to a VMA: The VMA is
> >        * locked and we know it hasn't already been isolated.
> > @@ -249,16 +262,14 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >        */
> >
> >       /* Check if the vma we locked is the right one. */
> > -     if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > -             goto inval_end_read;
> > +     if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +             vma_end_read(vma);
> > +             goto inval;
> > +     }
> >
> > -     rcu_read_unlock();
> >       return vma;
> >
> > -inval_end_read:
> > -     vma_end_read(vma);
> >  inval:
> > -     rcu_read_unlock();
> >       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> >       return NULL;
> >  }
> > @@ -313,6 +324,7 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> >                */
> >               if (PTR_ERR(vma) == -EAGAIN) {
> >                       /* reset to search from the last address */
> > +                     rcu_read_lock();
> >                       vma_iter_set(vmi, from_addr);
> >                       goto retry;
> >               }
> > @@ -342,9 +354,9 @@ struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> >       return vma;
> >
> >  fallback_unlock:
> > +     rcu_read_unlock();
> >       vma_end_read(vma);
> >  fallback:
> > -     rcu_read_unlock();
> >       vma = lock_next_vma_under_mmap_lock(mm, vmi, from_addr);
> >       rcu_read_lock();
> >       /* Reinitialize the iterator after re-entering rcu read section */
> > --
> > 2.50.1.552.g942d659e1b-goog
> >
