Message-ID: <CAEf4BzYBdG95Zhi0M0CDTHAU6V9sF+kGSLB+346dq0Aa4Timmg@mail.gmail.com>
Date: Thu, 24 Apr 2025 09:03:17 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>, akpm@...ux-foundation.org, 
	lorenzo.stoakes@...cle.com, david@...hat.com, vbabka@...e.cz, 
	peterx@...hat.com, jannh@...gle.com, hannes@...xchg.org, mhocko@...nel.org, 
	paulmck@...nel.org, shuah@...nel.org, adobriyan@...il.com, brauner@...nel.org, 
	josef@...icpanda.com, yebin10@...wei.com, linux@...ssschuh.net, 
	willy@...radead.org, osalvador@...e.de, andrii@...nel.org, 
	ryan.roberts@....com, christophe.leroy@...roup.eu, 
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-mm@...ck.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> >
> > * Andrii Nakryiko <andrii.nakryiko@...il.com> [250423 18:06]:
> > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > >
> > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > <andrii.nakryiko@...il.com> wrote:
> > > > >
> > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > > > >
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > >
> > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > even for reads, and still record this seq counter. And then detect
> > > > > whether there were any modifications in between. Why does this change
> > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > >
> > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > to finish what it's doing and then we continue by recording a new
> > > > sequence counter value and call mmap_read_unlock. This is what
> > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > dance we do in the get_vma_snapshot(). I think that should work.
> > >
> > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> >
> > We can check this counter once per 4k block and maintain the same
> > 'tearing' that exists today, instead of checking per-vma.  Not that
> > anyone said they had an issue with changing it, but since we're on
> > this road anyway I thought I'd point out where we could end up.
>
> We would need to run that check on the last call to show_map(), right
> before seq_file detects the overflow and flushes the page. On
> contention we would also be throwing away more prepared data (up to a
> page worth of records) vs only the last record. All in all, I'm not
> convinced this is worth doing unless the increased chance of data
> tearing is identified as a problem.
>

Yep, I agree, with filling out 4K of data we run into much higher
chances of conflict, IMO. Not worth it, I'd say.

> >
> > I am concerned about live locking in either scenario, but I haven't
> > looked too deep into this pattern.
> >
> > I also don't love (as usual) the lack of ensured forward progress.
>
> Hmm. Maybe we should add a retry limit on
> mmap_lock_speculate_try_begin() and once the limit is hit we just take
> the mmap_read_lock and proceed with it? That would prevent a
> hyperactive writer from blocking the reader's forward progress
> indefinitely.

Came here to say the same. I'd add a small number of retries (3-5?)
and then fall back to the read-locked approach. The main challenge is
keeping all this logic nicely isolated from the main VMA
search/printing logic.

For a similar pattern in uprobes, we don't even bother to retry; we
just fall back to mmap_read_lock and proceed, under the assumption
that contention will be very rare and thus unimportant from an overall
performance perspective.
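
To illustrate the shape I have in mind, here is a rough userspace
sketch of the bounded-retry pattern (all names are stand-ins, not the
real mm APIs: speculate_try_begin()/speculate_retry() mimic
mmap_lock_speculate_try_begin()/mmap_lock_speculate_retry(), a mutex
stands in for mmap_lock, and an even/odd counter plays mm_wr_seq):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_RETRIES 4	/* small cap, then fall back to the lock */

struct mm_sim {
	atomic_uint	wr_seq;	/* even = quiescent, odd = writer active */
	pthread_mutex_t	lock;	/* stand-in for mmap_lock */
	int		data;	/* stand-in for the vma contents we copy */
};

/* Mimics mmap_lock_speculate_try_begin(): succeed only if no writer. */
static bool speculate_try_begin(struct mm_sim *mm, unsigned int *seq)
{
	*seq = atomic_load_explicit(&mm->wr_seq, memory_order_acquire);
	return (*seq & 1) == 0;
}

/* Mimics mmap_lock_speculate_retry(): true if a write intervened. */
static bool speculate_retry(struct mm_sim *mm, unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&mm->wr_seq,
				    memory_order_relaxed) != seq;
}

/*
 * Bounded speculative read: a hyperactive writer can make us retry at
 * most MAX_RETRIES times before we take the lock, guaranteeing the
 * reader's forward progress.
 */
static int read_data(struct mm_sim *mm)
{
	unsigned int seq;
	int copy;

	for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
		if (!speculate_try_begin(mm, &seq))
			continue;	/* writer active, try again */
		copy = mm->data;	/* snapshot; may be torn */
		if (!speculate_retry(mm, seq))
			return copy;	/* snapshot was stable */
	}

	/* Fallback, like taking mmap_read_lock() in the real code. */
	pthread_mutex_lock(&mm->lock);
	copy = mm->data;
	pthread_mutex_unlock(&mm->lock);
	return copy;
}
```

The point of the cap is only forward progress; the common, uncontended
path never touches the lock at all.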

>
> >
> > It seems like we have four cases for the vm area state now:
> > 1. we want to read a stable vma or set of vmas (per-vma locking)
> > 2. we want to read a stable mm state for reading (the very short named
> > mmap_lock_speculate_try_begin)
>
> and we don't mind retrying on contention. This one should be done
> under RCU protection.
>
> > 3. we ensure a stable vma/mm state for reading (mmap read lock)
> > 4. we are writing - get out of my way (mmap write lock).
>
> I wouldn't call #2 a vma state. It's more of a use case where we want
> to read a vma under RCU (valid, but it can change from under us) and
> then retry if it might have been modified.
>
> >
> > Cheers,
> > Liam
> >
