linux-kernel - Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJuCfpH8zsboafV1UWufYhbVXN-yKgMOKm=vr2vBYAPNmPtrvw@mail.gmail.com>
Date: Tue, 15 Jul 2025 13:13:36 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	akpm@...ux-foundation.org, david@...hat.com, peterx@...hat.com, 
	jannh@...gle.com, hannes@...xchg.org, mhocko@...nel.org, paulmck@...nel.org, 
	shuah@...nel.org, adobriyan@...il.com, brauner@...nel.org, 
	josef@...icpanda.com, yebin10@...wei.com, linux@...ssschuh.net, 
	willy@...radead.org, osalvador@...e.de, andrii@...nel.org, 
	ryan.roberts@....com, christophe.leroy@...roup.eu, tjmercier@...gle.com, 
	kaleshsingh@...gle.com, aha310510@...il.com, linux-kernel@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, 
	linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock

On Tue, Jul 15, 2025 at 1:16 AM Vlastimil Babka <vbabka@...e.cz> wrote:
>
> On 7/10/25 19:02, Suren Baghdasaryan wrote:
> > On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >>
> >> On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >> >
> >> > On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> >> > >
> >> > > * Suren Baghdasaryan <surenb@...gle.com> [250709 11:06]:
> >> > > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@...e.cz> wrote:
> >> > > > >
> >> > > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> >> > > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@...e.cz> wrote:
> >> > > > > >>
> >> > > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> >> > > > > >> >>> +     rcu_read_unlock();
> >> > > > > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> >> > > > > >> >>> +     rcu_read_lock();
> >> > > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> >> > > > > >> >> we lock under mmap lock.
> >> > > > > >> > Correct.
> >> > > > > >>
> >> > > > > >> I wonder if it's really necessary? Can't it be done just inside
> >> > > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> >> > > > > >>
> >> > > > > >> Even if we later manage to extend this approach to smaps and employ rcu
> >> > > > > >> locking to traverse the page tables, I'd think it's best to separate and
> >> > > > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> >> > > > > >> avoid too long time under the lock.
> >> > > > > >
> >> > > > > > I thought we would need to be in the same rcu read section while
> >> > > > > > traversing the maple tree using vma_next() but now looking at it,
> >> > > > > > maybe we can indeed enter only while finding and locking the next
> >> > > > > > vma...
> >> > > > > > Liam, would that work? I see struct ma_state containing a node field.
> >> > > > > > Can it be freed from under us if we find a vma, exit rcu read section
> >> > > > > > then re-enter rcu and use the same iterator to find the next vma?
> >> > > > >
> >> > > > > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> >> > > > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> >> > > > > the issue or is it why we see the syzbot reports?) then I guess in the code
> >> > > > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> >> > > > > because although the iterator was used briefly under mmap_lock protection,
> >> > > > > that was then unlocked and there can be a race before the rcu_read_lock().
> >> > > >
> >> > > > Quite true. So, let's wait for Liam's confirmation and based on his
> >> > > > answer I'll change the patch by either reducing the rcu read section
> >> > > > or adding the missing vma_iter_init() after we switch to mmap_lock.
> >> > >
> >> > > You need to either be under rcu or mmap lock to ensure the node in the
> >> > > maple state hasn't been freed (and potentially, reallocated).
> >> > >
> >> > > So in this case, in the higher level, we can hold the rcu read lock for
> >> > > a series of walks and avoid re-walking the tree then the performance
> >> > > would be better.
> >> >
> >> > Got it. Thanks for confirming!
> >> >
> >> > >
> >> > > When we return to userspace, then we should drop the rcu read lock and
> >> > > will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
> >> > > this was being done (through vma_iter_init()), but syzbot seems to
> >> > > indicate a path that was missed?
> >> >
> >> > We do that in m_start()/m_stop() by calling
> >> > lock_vma_range()/unlock_vma_range() but I think I have two problems
> >> > here:
> >> > 1. As Vlastimil mentioned I do not reset the iterator when falling
> >> > back to mmap_lock and exiting and then re-entering rcu read section;
> >> > 2. I do not reset the iterator after exiting rcu read section in
> >> > m_stop() and re-entering it in m_start(), so the later call to
> >> > lock_next_vma() might be using an iterator with a node that was freed
> >> > (and possibly reallocated).
> >> >
> >> > >
> >> > > This is the same thing that needed to be done previously with the mmap
> >> > > lock, but now under the rcu lock.
> >> > >
> >> > > I'm not sure how to mitigate the issue with the page table, maybe we
> >> > > guess on the number of vmas that we were doing for 4k blocks of output
> >> > > and just drop/reacquire then.  Probably a problem for another day
> >> > > anyways.
> >> > >
> >> > > Also, I think you can also change the vma_iter_init() to vma_iter_set(),
> >> > > which is slightly less code under the hood.  Vlastimil asked about this
> >> > > and it's probably a better choice.
> >> >
> >> > Ack.
> >> > I'll update my series with these fixes and all comments I received so
> >> > far, will run the reproducers to confirm no issues and repost them
> >> > later today.
> >>
> >> I have the patchset ready but would like to test it some more. Will
> >> post it tomorrow.
> >
> > Ok, I found a couple of issues using the syzbot reproducer [1] (which
> > is awesome BTW!):
> > 1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
> > after the last check, otherwise the lock is considered taken on
> > vma->vm_refcnt overflow;
>
> I think it's fine because if the last check fails there's a
> vma_refcount_put() that includes rwsem_release(), no?

Ah, yes, you are right. This is fine. Obviously trying to figure out
the issue right before a flight is not a good idea :)

>
> > 2. query_matching_vma() is missing unlock_vma() call when it does
> > "goto next_vma;" and re-issues query_vma_find_by_addr(). The previous
> > vma is left locked;
> >
> > [1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
> > [2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
> >
> > After these fixes it's much harder to fail but I still get one more
> > error copied below. I will continue the investigation and will hold
> > off reposting until this is fixed. That will be next week since I'll
> > be out of town the rest of this week.
> >
> > Andrew, could you please remove this patchset from mm-unstable for now
> > until I fix the issue and re-post the new version?
>
> Andrew can you do that please? We keep getting new syzbot reports.
>
> > The error I got after these fixes is:
>
> I suspect the root cause is the ioctls are not serialized against each other
> (probably not even against read()) and yet we treat m->private as safe to
> work on. Now we have various fields that are dangerous to race on - for
> example locked_vma and iter races would explain a lot of this.
>
> I suspect as long as we used purely seq_file workflow, it did the right
> thing for us wrt serialization, but the ioctl addition violates that. We
> should rather recheck even the code before this series, if dangerous ioctl
> vs read() races are possible. And the ioctl implementation should be
> refactored to use an own per-ioctl-call private context, not the seq_file's
> per-file-open context.

Huh, I completely failed to consider this. In hindsight it is quite
obvious... Thanks Vlastimil, I owe you a beer or two.