Message-ID: <CAJuCfpH0HzM97exh92mpkuimxaen2Qh+tj_tZ=QBHQfi-3ejLQ@mail.gmail.com>
Date: Thu, 10 Jul 2025 10:02:32 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, akpm@...ux-foundation.org,
david@...hat.com, peterx@...hat.com, jannh@...gle.com, hannes@...xchg.org,
mhocko@...nel.org, paulmck@...nel.org, shuah@...nel.org, adobriyan@...il.com,
brauner@...nel.org, josef@...icpanda.com, yebin10@...wei.com,
linux@...ssschuh.net, willy@...radead.org, osalvador@...e.de,
andrii@...nel.org, ryan.roberts@....com, christophe.leroy@...roup.eu,
tjmercier@...gle.com, kaleshsingh@...gle.com, aha310510@...il.com,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >
> > On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@...gle.com> [250709 11:06]:
> > > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@...e.cz> wrote:
> > > > >
> > > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@...e.cz> wrote:
> > > > > >>
> > > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> > > > > >> >>> + rcu_read_unlock();
> > > > > >> >>> + vma = lock_vma_under_mmap_lock(mm, iter, address);
> > > > > >> >>> + rcu_read_lock();
> > > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> > > > > >> >> we lock under mmap lock.
> > > > > >> > Correct.
> > > > > >>
> > > > > >> I wonder if it's really necessary? Can't it be done just inside
> > > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> > > > > >>
> > > > > >> Even if we later manage to extend this approach to smaps and employ rcu
> > > > > >> locking to traverse the page tables, I'd think it's best to separate and
> > > > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> > > > > >> avoid too long time under the lock.
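To make the suggestion concrete, the fine-grained variant would look
roughly like this (a sketch only; lock_next_vma() matches this series'
naming, but the body, and the trylock_and_pin() step in particular, are
placeholders rather than the actual patch):

  struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
                                       struct vma_iterator *iter,
                                       unsigned long address)
  {
          struct vm_area_struct *vma;

          rcu_read_lock();
          /*
           * The iterator may have gone stale while we were outside an
           * rcu read section, so reposition it before walking.
           */
          vma_iter_set(iter, address);
          vma = vma_next(iter);
          if (vma && !trylock_and_pin(vma))       /* placeholder step */
                  vma = NULL;
          rcu_read_unlock();
          return vma;
  }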
> > > > > >
> > > > > > I thought we would need to stay in the same rcu read section
> > > > > > while traversing the maple tree using vma_next(), but now
> > > > > > looking at it, maybe we can indeed enter it only while finding
> > > > > > and locking the next vma...
> > > > > > Liam, would that work? I see struct ma_state containing a node
> > > > > > field. Can it be freed from under us if we find a vma, exit the
> > > > > > rcu read section, then re-enter rcu and use the same iterator to
> > > > > > find the next vma?
> > > > >
> > > > > If the rcu protection needs to be contiguous, and patch 8 avoids the issue by
> > > > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> > > > > the issue or is it why we see the syzbot reports?) then I guess in the code
> > > > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> > > > > because although the iterator was used briefly under mmap_lock protection,
> > > > > that was then unlocked and there can be a race before the rcu_read_lock().
> > > >
> > > > Quite true. So, let's wait for Liam's confirmation and based on his
> > > > answer I'll change the patch by either reducing the rcu read section
> > > > or adding the missing vma_iter_init() after we switch to mmap_lock.
> > >
> > > You need to either be under rcu or mmap lock to ensure the node in the
> > > maple state hasn't been freed (and potentially, reallocated).
> > >
> > > So in this case, at the higher level, we can hold the rcu read lock
> > > for a series of walks and avoid re-walking the tree; then the
> > > performance would be better.
> >
> > Got it. Thanks for confirming!
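So rather than the fine-grained variant, the callers keep one rcu read
section spanning a whole batch of walks, roughly this shape (a sketch;
emit_entry() and the batch bound are hypothetical, and lock_next_vma()
does no rcu lock/unlock of its own in this variant):

  rcu_read_lock();
  while (n_emitted < batch_size) {
          vma = lock_next_vma(mm, &iter, address);
          if (!vma)
                  break;
          emit_entry(m, vma);             /* hypothetical output step */
          address = vma->vm_end;
          n_emitted++;
  }
  rcu_read_unlock();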
> >
> > >
> > > When we return to userspace, we should drop the rcu read lock and
> > > will need to call vma_iter_set()/vma_iter_invalidate() on return. I
> > > thought this was being done (through vma_iter_init()), but syzbot
> > > seems to indicate a path that was missed?
> >
> > We do that in m_start()/m_stop() by calling
> > lock_vma_range()/unlock_vma_range() but I think I have two problems
> > here:
> > 1. As Vlastimil mentioned, I do not reset the iterator when falling
> > back to mmap_lock, exiting, and then re-entering the rcu read section;
> > 2. I do not reset the iterator after exiting the rcu read section in
> > m_stop() and re-entering it in m_start(), so the later call to
> > lock_next_vma() might be using an iterator with a node that was freed
> > (and possibly reallocated).
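A sketch of the fix for both problems above (the unlock/relock lines are
quoted from the patch; the vma_iter_set() calls are what would be added,
and last_address is a stand-in for the saved resume position):

  /* 1. in the mmap_lock fallback path, reset after re-entering rcu: */
  rcu_read_unlock();
  vma = lock_vma_under_mmap_lock(mm, iter, address);
  rcu_read_lock();
  vma_iter_set(iter, address);    /* iterator node may have been freed */

  /* 2. in m_start(), whenever we re-enter the rcu read section that
   * m_stop() dropped: */
  rcu_read_lock();
  vma_iter_set(iter, last_address);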
> >
> > >
> > > This is the same thing that needed to be done previously with the mmap
> > > lock, but now under the rcu lock.
> > >
> > > I'm not sure how to mitigate the issue with the page tables; maybe
> > > we guess at a number of vmas, like we were doing for 4k blocks of
> > > output, and just drop/reacquire at that point. Probably a problem
> > > for another day anyway.
> > >
> > > Also, I think you can change the vma_iter_init() to vma_iter_set(),
> > > which is slightly less code under the hood. Vlastimil asked about this
> > > and it's probably a better choice.
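For context, the difference between the two, paraphrasing the helpers
(as defined around include/linux/mm.h; details may vary by version):

  static inline void vma_iter_init(struct vma_iterator *vmi,
                                   struct mm_struct *mm, unsigned long addr)
  {
          mas_init(&vmi->mas, &mm->mm_mt, addr);  /* full re-init, tree too */
  }

  static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
  {
          mas_set(&vmi->mas, addr);       /* just resets the position */
  }

Either way the next walk re-descends from the root; vma_iter_set()
simply skips rewriting fields that haven't changed.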
> >
> > Ack.
> > I'll update my series with these fixes and all the comments I received
> > so far, run the reproducers to confirm there are no issues, and repost
> > later today.
>
> I have the patchset ready but would like to test it some more. Will
> post it tomorrow.
Ok, I found a couple of issues using the syzbot reproducer [1] (which
is awesome BTW!):
1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
after the last check; otherwise lockdep considers the lock taken when
we bail out on vma->vm_refcnt overflow;
2. query_matching_vma() is missing an unlock_vma() call when it does
"goto next_vma;" and re-issues query_vma_find_by_addr(), leaving the
previous vma locked. Rough sketches of both fixes follow the links
below.
[1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
[2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
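Roughly what the two fixes look like (sketches rather than the actual
diffs; the overflow check and the unlock_vma() arguments are
placeholders):

  /* 1. in vma_start_read(): annotate only after the final check
   * succeeds, so a vm_refcnt overflow bailout does not leave lockdep
   * thinking the vma lock is held: */
  if (unlikely(overflowed))               /* placeholder for the last check */
          return NULL;
  rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);

  /* 2. in query_matching_vma(): drop the previously locked vma before
   * retrying the walk: */
          unlock_vma(...);                /* this call was missing */
          goto next_vma;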
After these fixes the failure is much harder to reproduce, but I still
get one more error, copied below. I will continue the investigation and
will hold off reposting until this is fixed. That will be next week,
since I'll be out of town for the rest of this week.
Andrew, could you please remove this patchset from mm-unstable for now
until I fix the issue and re-post the new version?
The error I got after these fixes is:
[   56.342886]
[   56.342910] ------------[ cut here ]------------
[   56.342934] WARNING: CPU: 46 PID: 5701 at lib/maple_tree.c:4734 mas_next_slot+0x552/0x840
[   56.344691] ================================================
[   56.344695] WARNING: lock held when returning to user space!
[   56.344698] 6.16.0-rc5-00321-g31d640f7b07c-dirty #379 Not tainted
[   56.344702] ------------------------------------------------
[   56.344704] syzbot_repro1/5700 is leaving the kernel with locks still held!
[   56.344715] 1 lock held by syzbot_repro1/5700:
[   56.344720]  #0: ffff93a8c2cea788 (vm_lock){++++}-{0:0}, at: get_next_vma+0x91/0xe0
[   56.349286] Modules linked in:
[ 56.377452]
[   56.377929] CPU: 46 UID: 0 PID: 5701 Comm: syzbot_repro1 Not tainted 6.16.0-rc5-00321-g31d640f7b07c-dirty #379 PREEMPT(voluntary)
[   56.381592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   56.384664] RIP: 0010:mas_next_slot+0x552/0x840
[   56.386097] Code: 43 38 83 e8 02 83 f8 01 77 c4 e9 e5 fa ff ff 48 8b 43 28 e9 83 fb ff ff 49 8b 06 30 c0 49 39 c6 74 be c7 43 38 05 00 00 00 90 <0f> 0b 90 48 c7 04 24 00 001
[ 56.392303] RSP: 0018:ffffa01188217cd8 EFLAGS: 00010206
[ 56.393928] RAX: ffff93a8c2724e00 RBX: ffff93a8c1af2898 RCX: 1ffff27519183e61
[ 56.396300] RDX: ffff93a8c2724e0e RSI: 0000000000000000 RDI: ffff93a8c1af2898
[ 56.398515] RBP: 00007ffd83c2ffff R08: 00000000ffffffff R09: ffff93a8c8c1f308
[ 56.400722] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
[ 56.402935] R13: 0000000000000001 R14: ffff93a8c8c1f300 R15: ffff93a8c8c1f308
[   56.405222] FS: 00007ff71a3946c0(0000) GS:ffff93b83a9af000(0000) knlGS:0000000000000000
[ 56.408236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 56.409994] CR2: 00007ff71a393bb0 CR3: 0000000102f84000 CR4: 0000000000750ef0
[ 56.412367] PKRU: 55555554
[ 56.413231] Call Trace:
[ 56.413955] <TASK>
[ 56.414672] mas_find+0x5c/0x1c0
[ 56.415713] lock_next_vma+0x41/0x4d0
[ 56.416869] get_next_vma+0x91/0xe0
[ 56.417954] do_procmap_query+0x249/0xa90
[ 56.419310] ? do_procmap_query+0x1b8/0xa90
[ 56.420591] procfs_procmap_ioctl+0x20/0x40
[ 56.421896] __x64_sys_ioctl+0x8e/0xe0
[ 56.423514] do_syscall_64+0xbb/0x360
[ 56.424715] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 56.426296] RIP: 0033:0x41a7e9
[   56.427254] Code: c0 79 93 eb d5 48 8d 7c 1d 00 eb 99 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c38
[   56.432893] RSP: 002b:00007ff71a3941f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 56.435222] RAX: ffffffffffffffda RBX: 00007ff71a394cdc RCX: 000000000041a7e9
[ 56.437475] RDX: 0000200000000180 RSI: 00000000c0686611 RDI: 0000000000000003
[ 56.440084] RBP: 00007ff71a394220 R08: 00007ff71a3946c0 R09: 00007ff71a394210
[ 56.442345] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffd0
[ 56.444545] R13: 0000000000000000 R14: 00007ffd83c4fe30 R15: 00007ffd83c4ff18
[ 56.446732] </TASK>
[ 56.447436] Kernel panic - not syncing: kernel: panic_on_warn set ...
[   56.449433] CPU: 46 UID: 0 PID: 5701 Comm: syzbot_repro1 Not tainted 6.16.0-rc5-00321-g31d640f7b07c-dirty #379 PREEMPT(voluntary)
[   56.453043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 56.456564] Call Trace:
[ 56.457340] <TASK>
[ 56.457969] dump_stack_lvl+0x5d/0x80
[ 56.459188] ? mas_next_slot+0x510/0x840
[ 56.460388] panic+0x11a/0x2ce
[ 56.461353] ? mas_next_slot+0x552/0x840
[ 56.462564] check_panic_on_warn.cold+0xf/0x1e
[ 56.463960] __warn.cold+0xc3/0x153
[ 56.465043] ? mas_next_slot+0x552/0x840
[ 56.466367] report_bug+0xff/0x140
[ 56.467441] ? mas_next_slot+0x552/0x840
[ 56.468993] handle_bug+0x163/0x1e0
[ 56.470236] exc_invalid_op+0x17/0x70
[ 56.471366] asm_exc_invalid_op+0x1a/0x20
[   56.474053] Code: 43 38 83 e8 02 83 f8 01 77 c4 e9 e5 fa ff ff 48 8b 43 28 e9 83 fb ff ff 49 8b 06 30 c0 49 39 c6 74 be c7 43 38 05 00 00 00 90 <0f> 0b 90 48 c7 04 24 00 001
[ 56.479861] RSP: 0018:ffffa01188217cd8 EFLAGS: 00010206
[ 56.481501] RAX: ffff93a8c2724e00 RBX: ffff93a8c1af2898 RCX: 1ffff27519183e61
[ 56.483665] RDX: ffff93a8c2724e0e RSI: 0000000000000000 RDI: ffff93a8c1af2898
[ 56.486323] RBP: 00007ffd83c2ffff R08: 00000000ffffffff R09: ffff93a8c8c1f308
[ 56.488491] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
[ 56.490634] R13: 0000000000000001 R14: ffff93a8c8c1f300 R15: ffff93a8c8c1f308
[ 56.492856] mas_find+0x5c/0x1c0
[ 56.493888] lock_next_vma+0x41/0x4d0
[ 56.495022] get_next_vma+0x91/0xe0
[ 56.496204] do_procmap_query+0x249/0xa90
[ 56.497515] ? do_procmap_query+0x1b8/0xa90
[ 56.499232] procfs_procmap_ioctl+0x20/0x40
[ 56.500500] __x64_sys_ioctl+0x8e/0xe0
[ 56.501682] do_syscall_64+0xbb/0x360
[ 56.502845] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 56.504408] RIP: 0033:0x41a7e9
[   56.505362] Code: c0 79 93 eb d5 48 8d 7c 1d 00 eb 99 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c38
[   56.511066] RSP: 002b:00007ff71a3941f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 56.513523] RAX: ffffffffffffffda RBX: 00007ff71a394cdc RCX: 000000000041a7e9
[ 56.515977] RDX: 0000200000000180 RSI: 00000000c0686611 RDI: 0000000000000003
[ 56.518440] RBP: 00007ff71a394220 R08: 00007ff71a3946c0 R09: 00007ff71a394210
[ 56.520643] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffd0
[ 56.522884] R13: 0000000000000000 R14: 00007ffd83c4fe30 R15: 00007ffd83c4ff18
[ 56.525170] </TASK>
[   56.527859] Kernel Offset: 0x1a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 56.531106] Rebooting in 86400 seconds..
>
> > Thanks,
> > Suren.
> >
> > >
> > > Thanks,
> > > Liam
> > >