Message-ID: <CAJuCfpHfnG8b4_RkkGhu+HveF-K_7o9UVGdToVuUCf-qD05Q4Q@mail.gmail.com>
Date:   Thu, 28 Oct 2021 15:08:36 -0700
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     akpm@...ux-foundation.org
Cc:     Alexey Alexandrov <aalexand@...gle.com>, ccross@...gle.com,
        sumit.semwal@...aro.org, mhocko@...e.com, dave.hansen@...el.com,
        keescook@...omium.org, willy@...radead.org,
        kirill.shutemov@...ux.intel.com, vbabka@...e.cz,
        hannes@...xchg.org, corbet@....net, viro@...iv.linux.org.uk,
        rdunlap@...radead.org, kaleshsingh@...gle.com, peterx@...hat.com,
        rppt@...nel.org, peterz@...radead.org, catalin.marinas@....com,
        vincenzo.frascino@....com, chinwen.chang@...iatek.com,
        axelrasmussen@...gle.com, aarcange@...hat.com, jannh@...gle.com,
        apopple@...dia.com, jhubbard@...dia.com, yuzhao@...gle.com,
        will@...nel.org, fenghua.yu@...el.com, thunder.leizhen@...wei.com,
        hughd@...gle.com, feng.tang@...el.com, jgg@...pe.ca, guro@...com,
        tglx@...utronix.de, krisman@...labora.com, chris.hyser@...cle.com,
        pcc@...gle.com, ebiederm@...ssion.com, axboe@...nel.dk,
        legion@...nel.org, eb@...ix.com, gorcunov@...il.com, pavel@....cz,
        songmuchun@...edance.com, viresh.kumar@...aro.org,
        thomascedeno@...gle.com, sashal@...nel.org, cxfcosmos@...il.com,
        linux@...musvillemoes.dk, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-mm@...ck.org, kernel-team@...roid.com
Subject: Re: [PATCH v11 2/3] mm: add a field to store names for private
 anonymous memory

On Wed, Oct 27, 2021 at 1:01 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Wed, Oct 27, 2021 at 11:35 AM Alexey Alexandrov <aalexand@...gle.com> wrote:
> >
> > > On Oct 19, 2021, at 2:55 PM, Suren Baghdasaryan <surenb@...gle.com> wrote:
> > >
> > > From: Colin Cross <ccross@...gle.com>
> > >
> > > > In many userspace applications, and especially in VM-based applications
> > > > like those Android uses heavily, there are multiple different allocators in use.
> > > At a minimum there is libc malloc and the stack, and in many cases there
> > > are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> > > multiple VM heaps (one for small objects, one for big objects, etc.).
> > > Each of these layers usually has its own tools to inspect its usage;
> > > malloc by compiling a debug version, the VM through heap inspection tools,
> > > and for direct syscalls there is usually no way to track them.
> > >
> > > On Android we heavily use a set of tools that use an extended version of
> > > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> > > in userspace and slice their usage by process, shared (COW) vs.  unique
> > > mappings, backing, etc.  This can account for real physical memory usage
> > > even in cases like fork without exec (which Android uses heavily to share
> > > as many private COW pages as possible between processes), Kernel SamePage
> > > Merging, and clean zero pages.  It produces a measurement of the pages
> > > that only exist in that process (USS, for unique), and a measurement of
> > > the physical memory usage of that process with the cost of shared pages
> > > being evenly split between processes that share them (PSS).
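
The USS and PSS definitions above can be sketched as simple arithmetic. This is only an illustration: `struct page_share`, `uss_bytes` and `pss_bytes` are hypothetical names, and the per-page share counts are assumed to have already been gathered by a pagemap-walking tool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page record: how many processes map this physical page. */
struct page_share {
    unsigned int mapcount; /* always >= 1 for a page the process maps */
};

/* USS: bytes in pages that exist only in this process. */
static size_t uss_bytes(const struct page_share *pages, size_t n,
                        size_t page_size)
{
    size_t total = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (pages[i].mapcount == 1)
            total += page_size;
    return total;
}

/* PSS: each page's cost is split evenly between the processes sharing it. */
static size_t pss_bytes(const struct page_share *pages, size_t n,
                        size_t page_size)
{
    size_t total = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        assert(pages[i].mapcount >= 1);
        /* Integer division truncates; good enough for a sketch. */
        total += page_size / pages[i].mapcount;
    }
    return total;
}
```

For example, with 4096-byte pages, two private pages plus one page shared by two processes and one shared by four give a USS of 8192 bytes and a PSS of 11264 bytes.
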
> > >
> > > If all anonymous memory is indistinguishable then figuring out the real
> > > physical memory usage (PSS) of each heap requires either a pagemap walking
> > > tool that can understand the heap debugging of every layer, or for every
> > > layer's heap debugging tools to implement the pagemap walking logic, in
> > > which case it is hard to get a consistent view of memory across the whole
> > > system.
> > >
> > > Tracking the information in userspace leads to all sorts of problems.
> > > It either needs to be stored inside the process, which means every
> > > process has to have an API to export its current heap information upon
> > > request, or it has to be stored externally in a filesystem that
> > > somebody needs to clean up on crashes.  It needs to be readable while
> > > the process is still running, so it has to have some sort of
> > > synchronization with every layer of userspace.  Efficiently tracking
> > > the ranges requires reimplementing something like the kernel vma
> > > trees, and linking to it from every layer of userspace.  It requires
> > > more memory, more syscalls, more runtime cost, and more complexity to
> > > separately track regions that the kernel is already tracking.
> > >
> > > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> > > userspace-provided name for anonymous vmas.  The names of named anonymous
> > > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
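
A procfs parser could pick up the new `[anon:<name>]` field roughly as follows. This is a minimal sketch, not part of the patch: `extract_anon_name` is a hypothetical helper, and it relies on the rule stated in the description that names cannot contain '[' or ']'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the name from a "[anon:<name>]" field of one /proc/pid/maps line
 * into out. Returns 1 on success, 0 if the line has no such field or the
 * name does not fit in out.
 */
static int extract_anon_name(const char *line, char *out, size_t outsz)
{
    static const char prefix[] = "[anon:";
    const char *start, *end;
    size_t len;

    assert(line != NULL && out != NULL && outsz > 0);

    start = strstr(line, prefix);
    if (start == NULL)
        return 0;
    start += sizeof(prefix) - 1;

    /* Names may contain spaces but never ']', so the first ']' ends it. */
    end = strchr(start, ']');
    if (end == NULL)
        return 0;

    len = (size_t)(end - start);
    if (len >= outsz)
        return 0;

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Lines for other mappings, such as `[heap]` or file-backed regions, make the helper return 0, so existing parsing paths can stay unchanged.
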
> > >
> > > Userspace can set the name for a region of memory by calling
> > > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> > > > Setting the name to NULL clears it. The name length limit is 80 bytes
> > > > including the NUL terminator, and the name is checked to contain only
> > > > printable ASCII characters (including space), except '[',']','\','$' and
> > > > '`'. ASCII strings are used so that vmas have descriptive identifiers which
> > > > can be understood by the users reading /proc/pid/maps or /proc/pid/smaps.
> > > Names can be standardized for a given system and they can include some
> > > variable parts such as the name of the allocator or a library, tid of
> > > the thread using it, etc.
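
From userspace, the prctl() call described above might be wrapped like this. It is a sketch under assumptions: `name_anon_region` and `map_named_anon` are hypothetical helpers, and the fallback constant values are what I understand the uapi header to define, so check them against your kernel headers. On a kernel built without CONFIG_ANON_VMA_NAME the prctl() call is expected to fail, which the second helper treats as non-fatal.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Assumed values, for libc headers that predate the feature. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Name (or, with name == NULL, un-name) an anonymous region. */
static int name_anon_region(void *addr, size_t len, const char *name)
{
    assert(addr != NULL);
    return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                 (unsigned long)addr, (unsigned long)len,
                 (unsigned long)name);
}

/* Map private anonymous memory and try to name it; naming is best effort. */
static void *map_named_anon(size_t len, const char *name)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    /* Ignore failure: the mapping is still usable without a name. */
    (void)name_anon_region(p, len, name);
    return p;
}
```

An allocator could call `map_named_anon(size, "my allocator")` wherever it currently calls mmap() directly, and the region would then show up under that name in /proc/pid/maps on kernels that support the feature.
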
> > >
> > > The name is stored in a pointer in the shared union in vm_area_struct
> > > > that points to a null terminated string. Anonymous vmas that have the same
> > > > name (equivalent strings) and are otherwise mergeable will be merged.
> > > The name pointers are not shared between vmas even if they contain the
> > > same name. The name pointer is stored in a union with fields that are
> > > only used on file-backed mappings, so it does not increase memory usage.
> > >
> > > CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
> > > feature. It keeps the feature disabled by default to prevent any
> > > additional memory overhead and to avoid confusing procfs parsers on
> > > systems which are not ready to support named anonymous vmas.
> > >
> > > The patch is based on the original patch developed by Colin Cross, more
> > > specifically on its latest version [1] posted upstream by Sumit Semwal.
> > > It used a userspace pointer to store vma names. In that design, name
> > > pointers could be shared between vmas. However during the last upstreaming
> > > > attempt, Kees Cook raised concerns [2] about this approach and suggested
> > > > copying the name into kernel memory space, performing validity checks [3],
> > > > and storing it as a string referenced from vm_area_struct.
> > > > One big concern is fork() performance, because fork() would need to strdup
> > > > anonymous vma names. Dave Hansen suggested experimenting with the worst-case
> > > > scenario of forking a process with 64k vmas having the longest possible names
> > > [4]. I ran this experiment on an ARM64 Android device and recorded a
> > > worst-case regression of almost 40% when forking such a process. This
> > > regression is addressed in the followup patch which replaces the pointer
> > > to a name with a refcounted structure that allows sharing the name pointer
> > > between vmas of the same name. Instead of duplicating the string during
> > > fork() or when splitting a vma it increments the refcount.
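
The refcounting idea can be illustrated with a userspace analogue. This is not the code from the followup patch: `struct shared_name` and its helpers are hypothetical, and the plain counter is not thread-safe, whereas kernel code would need an atomic reference count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared name: one allocation, many referencing vmas. */
struct shared_name {
    unsigned int refcount;
    char name[]; /* NUL-terminated copy of the name */
};

/* Allocate a new name with a single reference; NULL on allocation failure. */
static struct shared_name *shared_name_new(const char *name)
{
    size_t len = strlen(name) + 1;
    struct shared_name *sn = malloc(sizeof(*sn) + len);

    if (sn == NULL)
        return NULL;
    sn->refcount = 1;
    memcpy(sn->name, name, len);
    return sn;
}

/* What fork() or a vma split would do instead of duplicating the string. */
static struct shared_name *shared_name_get(struct shared_name *sn)
{
    assert(sn != NULL && sn->refcount > 0);
    sn->refcount++;
    return sn;
}

/* Drop one reference; the string is freed with the last one. */
static void shared_name_put(struct shared_name *sn)
{
    if (sn != NULL && --sn->refcount == 0)
        free(sn);
}
```

Taking a reference costs one increment regardless of name length, which is why sharing avoids the per-vma string copy that caused the fork() regression.
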
> > >
> > > [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
> > > [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> > > [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> > > [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
> > >
> > > Changes for prctl(2) manual page (in the options section):
> > >
> > > PR_SET_VMA
> > >       Sets an attribute specified in arg2 for virtual memory areas
> > >       starting from the address specified in arg3 and spanning the
> > >       size specified  in arg4. arg5 specifies the value of the attribute
> > >       to be set. Note that assigning an attribute to a virtual memory
> > >       area might prevent it from being merged with adjacent virtual
> > >       memory areas due to the difference in that attribute's value.
> > >
> > >       Currently, arg2 must be one of:
> > >
> > >       PR_SET_VMA_ANON_NAME
> > >               Set a name for anonymous virtual memory areas. arg5 should
> > >               be a pointer to a null-terminated string containing the
> > >               name. The name length including null byte cannot exceed
> > >               80 bytes. If arg5 is NULL, the name of the appropriate
> > >               anonymous virtual memory areas will be reset. The name
> > >               can contain only printable ascii characters (including
> > >                space), except '[',']','\','$' and '`'.
> > >
> > >                This feature is available only if the kernel is built with
> > >                the CONFIG_ANON_VMA_NAME option enabled.
> >
> > For what it’s worth, it’s definitely interesting to see this going upstream.
> > In particular, we would use it for high-level grouping of the data in
> > production profiling when proper symbolization is not available:
> >
> > * JVM could associate a name with the memory regions it uses for the JIT
> >   code so that Linux perf data are associated with a high level name like
> >   "Java JIT" even if the proper Java JIT profiling is not enabled.
> > * Similar for other JIT engines like v8 - they could annotate the memory
> >   regions they manage and use as well.
> > * Traditional memory allocators like tcmalloc can use this as well so
> >   that the associated name is used in data access profiling via Linux perf.
>
> Hi Alexey,
> Thanks for providing your feedback! Nice to hear that this can be
> useful outside of Android.

Folks, it has been almost two weeks since I posted this v11 patchset.
Is there anything else I can do to advance it towards merging?
