Message-ID: <by4pd6zomtvo64vjddthqu3ps2n7fqzaeqttinmy5nzttxjjd6@ch2uxmy2bgks>
Date: Thu, 24 Apr 2025 12:42:44 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Suren Baghdasaryan <surenb@...gle.com>, akpm@...ux-foundation.org,
        lorenzo.stoakes@...cle.com, david@...hat.com, vbabka@...e.cz,
        peterx@...hat.com, jannh@...gle.com, hannes@...xchg.org,
        mhocko@...nel.org, paulmck@...nel.org, shuah@...nel.org,
        adobriyan@...il.com, brauner@...nel.org, josef@...icpanda.com,
        yebin10@...wei.com, linux@...ssschuh.net, willy@...radead.org,
        osalvador@...e.de, andrii@...nel.org, ryan.roberts@....com,
        christophe.leroy@...roup.eu, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

* Andrii Nakryiko <andrii.nakryiko@...il.com> [250424 12:04]:
> On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> > >
> > > * Andrii Nakryiko <andrii.nakryiko@...il.com> [250423 18:06]:
> > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > > >
> > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@...il.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > > > > >
> > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > the mmap_lock for reading during such contention, we do that momentarily,
> > > > > > > only to record the new mm_wr_seq counter value. This change is designed to reduce
> > > > > >
> > > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > whether there were any modifications in between. Why does this change
> > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > >
> > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > > to finish what it's doing and then we continue by recording a new
> > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > try mmap_lock_speculate_try_begin() and, only if it fails, do the same
> > > > > dance we do in get_vma_snapshot(). I think that should work.
> > > >
> > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > >
> > > We can check this counter once per 4k block and maintain the same
> > > 'tearing' that exists today instead of per-vma.  Not that anyone said
> > > they had an issue with changing it, but since we're on this road anyway
> > > I thought I'd point out where we could end up.
> >
> > We would need to run that check on the last call to show_map() right
> > before seq_file detects the overflow and flushes the page. On
> > contention we will also be throwing away more prepared data (up to a
> > page worth of records) vs only the last record. All in all I'm not
> > convinced this is worth doing unless the increased chance of data
> > tearing is identified as a problem.
> >
> 
> Yep, I agree, with filling out 4K of data we run into much higher
> chances of conflict, IMO. Not worth it, I'd say.

Sounds good.

If this is an issue we do have a path forward still.  Although it's less
desirable.

> 
> > >
> > > I am concerned about live locking in either scenario, but I haven't
> > > looked too deep into this pattern.
> > >
> > > I also don't love (as usual) the lack of ensured forward progress.
> >
> > Hmm. Maybe we should add a retry limit on
> > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > the mmap_read_lock and proceed with it? That would prevent a
> > hyperactive writer from blocking the reader's forward progress
> > indefinitely.
> 
> Came here to say the same. I'd add a small number of retries (3-5?)
> and then fallback to the read-locked approach. The main challenge is
> to keep all this logic nicely isolated from the main VMA
> search/printing logic.
> 
> For a similar pattern in uprobes, we don't even bother to retry, we
> just fallback to mmap_read_lock and proceed, under the assumption that
> this is going to be very rare and thus not important from the overall
> performance perspective.
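
The bounded-retry scheme discussed above can be modeled in plain
userspace C.  This is only an illustrative toy, not kernel code:
speculate_try_begin()/speculate_retry() here are seqcount-style
stand-ins for the real mmap_lock_speculate_try_begin() and
mmap_lock_speculate_retry(), and MAX_SPECULATIVE_RETRIES is the
hypothetical small bound (3-5) suggested in the discussion:

```c
#include <stdbool.h>

#define MAX_SPECULATIVE_RETRIES 4  /* hypothetical bound, per the 3-5 suggested */

/* Toy stand-in for mm->mm_wr_seq: odd means a writer is mid-update. */
static unsigned int wr_seq;

static bool speculate_try_begin(unsigned int *seq)
{
	*seq = wr_seq;
	return (*seq & 1) == 0;		/* fail while a writer is active */
}

static bool speculate_retry(unsigned int seq)
{
	return wr_seq != seq;		/* true: a writer ran, data may be torn */
}

static void read_record(void)
{
	/* copy out the vma record; contents may be torn until validated */
}

/* Returns true if the lockless read succeeded, false if we had to
 * fall back to the (modeled) read lock. */
static bool read_with_bounded_retries(void)
{
	unsigned int seq;
	int i;

	for (i = 0; i < MAX_SPECULATIVE_RETRIES; i++) {
		if (!speculate_try_begin(&seq))
			continue;	/* writer active, try again */
		read_record();
		if (!speculate_retry(seq))
			return true;	/* snapshot is consistent, done */
	}

	/* Retries exhausted: fall back to taking the read lock so a
	 * hyperactive writer cannot stall this reader indefinitely. */
	read_record();
	return false;
}
```

The point of the bound is exactly the forward-progress guarantee raised
earlier in the thread: the speculative path is a fast path, and after a
few failures the reader degrades to the old read-locked behaviour
instead of spinning.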

In this problem space we are dealing with a herd of readers: a writer
delays the readers, and the line of waiting readers keeps growing, right?

Assuming there is a backup caused by a writer, then I don't know if the
retry is going to do anything more than heat the data centre.

The readers that take the read lock will get the data.  The others, who
arrive while the lock is held for reading, can try the lockless path,
but their run time will most likely extend beyond that of the
lock-holding readers, so they will probably be interrupted by the
writer.

We can predict that these new readers will also fail to make it through
in time, just as the earlier ones did.  They will then take the lock
themselves and grow the line of readers further.

Does that make sense?

Thanks,
Liam


