lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJuCfpFAmTfmTpd_8gJ1KYAT2ujiJKCjrJjEfp2pjrfqZzr+gg@mail.gmail.com>
Date: Sun, 21 Jan 2024 23:23:47 -0800
From: Suren Baghdasaryan <surenb@...gle.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: akpm@...ux-foundation.org, viro@...iv.linux.org.uk, brauner@...nel.org, 
	jack@...e.cz, dchinner@...hat.com, casey@...aufler-ca.com, 
	ben.wolsieffer@...ring.com, paulmck@...nel.org, david@...hat.com, 
	avagin@...gle.com, usama.anjum@...labora.com, peterx@...hat.com, 
	hughd@...gle.com, ryan.roberts@....com, wangkefeng.wang@...wei.com, 
	Liam.Howlett@...cle.com, yuzhao@...gle.com, axelrasmussen@...gle.com, 
	lstoakes@...il.com, talumbau@...gle.com, willy@...radead.org, 
	mgorman@...hsingularity.net, jhubbard@...dia.com, vishal.moola@...il.com, 
	mathieu.desnoyers@...icios.com, dhowells@...hat.com, jgg@...pe.ca, 
	sidhartha.kumar@...cle.com, andriy.shevchenko@...ux.intel.com, 
	yangxingui@...wei.com, keescook@...omium.org, linux-kernel@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, kernel-team@...roid.com
Subject: Re: [RFC 0/3] reading proc/pid/maps under RCU

On Thu, Jan 18, 2024 at 9:58 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan <surenb@...glecom> wrote:
> >
> > On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@...e.cz> wrote:
> > >
> > > On 1/16/24 15:42, Vlastimil Babka wrote:
> > > > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> > > >
> > > > Hi,
> > > >
> > > >> The issue this patchset is trying to address is mmap_lock contention when
> > > >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> > > >> priority task from making updated to the address space. The contention is
> > > >> due to the mmap_lock being held for read when reading proc/pid/maps.
> > > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> > > >> locks make VMA access RCU-safe. this provides an opportunity for lock-less
> > > >> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> > > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> > > >> RCU-safe;
> > > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> > > >> page boundaries only, does not get worse.
> > > >
> > > > Hm I thought we were to only choose this more complicated in case additional
> > > > tearing becomes a problem, and at first assume that if software can deal
> > > > with page boundary tearing, it can deal with sub-page tearing too?
> >
> > Hi Vlastimil,
> > Thanks for the feedback!
> > Yes, originally I thought we wouldn't be able to avoid additional
> > tearing without a big change but then realized it's not that hard, so
> > I tried to keep the change in behavior transparent to the userspace.
>
> In the absence of other feedback I'm going to implement and post the
> originally envisioned approach: remove validation step and avoid any
> possibility of blocking but allowing for sub-page tearing. Will use
> Matthew's rwsem_wait() to deal with possible inconsistent maple_tree
> state.

I posted v1 at
https://lore.kernel.org/all/20240122071324.2099712-1-surenb@google.com/
In the RFC I used mm_struct.mm_lock_seq to detect if mm is being
changed but then I realized that won't work. mm_struct.mm_lock_seq is
incremented after mm is changed and right before mmap_lock is
write-unlocked. Instead I need a counter that changes once we
write-lock mmap_lock and before any mm changes. So the new patchset
introduces a separate counter to detect possible mm changes. In
addition, I could not use rwsem_wait() and instead had to take
mmap_lock for read to wait for the writer to finish and then record
the new counter while holding mmap_lock for read. That prevents
concurrent mm changes while we are recording the new counter value.

> Thanks,
> Suren.
>
> >
> > > >
> > > >> The patchset deals with these issues but there is a downside which I would
> > > >> like to get input on:
> > > >> This change introduces unfairness towards the reader of proc/pid/maps,
> > > >> which can be blocked by an overly active/malicious address space modifyer.
> > > >
> > > > So this is a consequence of the validate() operation, right? We could avoid
> > > > this if we allowed sub-page tearing.
> >
> > Yes, if we don't care about sub-page tearing then we could get rid of
> > validate step and this issue with updaters blocking the reader would
> > go away. If we choose that direction there will be one more issue to
> > fix, namely the maple_tree temporary inconsistent state when a VMA is
> > replaced with another one and we might observe NULL there. We might be
> > able to use Matthew's rwsem_wait() to deal with that issue.
> >
> > > >
> > > >> A couple of ways I though we can address this issue are:
> > > >> 1. After several lock-less retries (or some time limit) to fall back to
> > > >> taking mmap_lock.
> > > >> 2. Employ lock-less reading only if the reader has low priority,
> > > >> indicating that blocking it is not critical.
> > > >> 3. Introducing a separate procfs file which publishes the same data in
> > > >> lock-less manner.
> > >
> > > Oh and if this option 3 becomes necessary, then such new file shouldn't
> > > validate() either, and whoever wants to avoid the reader contention and
> > > converts their monitoring to the new file will have to account for this
> > > possible extra tearing from the start. So I would suggest trying to change
> > > the existing file with no validate() first, and if existing userspace gets
> > > broken, employ option 3. This would mean no validate() in either case?
> >
> > Yes but I was trying to avoid introducing additional file which
> > publishes the same content in a slightly different way. We will have
> > to explain when userspace should use one vs the other and that would
> > require going into low level implementation details, I think. Don't
> > know if that's acceptable/preferable.
> > Thanks,
> > Suren.
> >
> > >
> > > >> I imagine a combination of these approaches can also be employed.
> > > >> I would like to get feedback on this from the Linux community.
> > > >>
> > > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> > > >> can be replaced with more efficiend rwsem_wait() proposed by Matthew
> > > >> in [1].
> > > >>
> > > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> > > >>
> > > >> Suren Baghdasaryan (3):
> > > >>   mm: make vm_area_struct anon_name field RCU-safe
> > > >>   seq_file: add validate() operation to seq_operations
> > > >>   mm/maps: read proc/pid/maps under RCU
> > > >>
> > > >>  fs/proc/internal.h        |   3 +
> > > >>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
> > > >>  fs/seq_file.c             |  24 ++++++-
> > > >>  include/linux/mm_inline.h |  10 ++-
> > > >>  include/linux/mm_types.h  |   3 +-
> > > >>  include/linux/seq_file.h  |   1 +
> > > >>  mm/madvise.c              |  30 +++++++--
> > > >>  7 files changed, 181 insertions(+), 20 deletions(-)
> > > >>
> > > >
> > >

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ