linux-kernel - Re: [PATCH v11 2/3] mm: add a field to store names for private anonymous memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJuCfpE-fR+M_funJ4Kd+gMK9q0QHyOUD7YK0ES6En4y7E1tjg@mail.gmail.com>
Date:   Wed, 27 Oct 2021 13:01:11 -0700
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     Alexey Alexandrov <aalexand@...gle.com>
Cc:     akpm@...ux-foundation.org, ccross@...gle.com,
        sumit.semwal@...aro.org, mhocko@...e.com, dave.hansen@...el.com,
        keescook@...omium.org, willy@...radead.org,
        kirill.shutemov@...ux.intel.com, vbabka@...e.cz,
        hannes@...xchg.org, corbet@....net, viro@...iv.linux.org.uk,
        rdunlap@...radead.org, kaleshsingh@...gle.com, peterx@...hat.com,
        rppt@...nel.org, peterz@...radead.org, catalin.marinas@....com,
        vincenzo.frascino@....com, chinwen.chang@...iatek.com,
        axelrasmussen@...gle.com, aarcange@...hat.com, jannh@...gle.com,
        apopple@...dia.com, jhubbard@...dia.com, yuzhao@...gle.com,
        will@...nel.org, fenghua.yu@...el.com, thunder.leizhen@...wei.com,
        hughd@...gle.com, feng.tang@...el.com, jgg@...pe.ca, guro@...com,
        tglx@...utronix.de, krisman@...labora.com, chris.hyser@...cle.com,
        pcc@...gle.com, ebiederm@...ssion.com, axboe@...nel.dk,
        legion@...nel.org, eb@...ix.com, gorcunov@...il.com, pavel@....cz,
        songmuchun@...edance.com, viresh.kumar@...aro.org,
        thomascedeno@...gle.com, sashal@...nel.org, cxfcosmos@...il.com,
        linux@...musvillemoes.dk, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-mm@...ck.org, kernel-team@...roid.com
Subject: Re: [PATCH v11 2/3] mm: add a field to store names for private
 anonymous memory

On Wed, Oct 27, 2021 at 11:35 AM Alexey Alexandrov <aalexand@...gle.com> wrote:
>
> > On Oct 19, 2021, at 2:55 PM, Suren Baghdasaryan <surenb@...gle.com> wrote:
> >
> > From: Colin Cross <ccross@...gle.com>
> >
> > In many userspace applications, and especially in VM based applications
> > like Android uses heavily, there are multiple different allocators in use.
> > At a minimum there is libc malloc and the stack, and in many cases there
> > are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> > multiple VM heaps (one for small objects, one for big objects, etc.).
> > Each of these layers usually has its own tools to inspect its usage;
> > malloc by compiling a debug version, the VM through heap inspection tools,
> > and for direct syscalls there is usually no way to track them.
> >
> > On Android we heavily use a set of tools that use an extended version of
> > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> > in userspace and slice their usage by process, shared (COW) vs.  unique
> > mappings, backing, etc.  This can account for real physical memory usage
> > even in cases like fork without exec (which Android uses heavily to share
> > as many private COW pages as possible between processes), Kernel SamePage
> > Merging, and clean zero pages.  It produces a measurement of the pages
> > that only exist in that process (USS, for unique), and a measurement of
> > the physical memory usage of that process with the cost of shared pages
> > being evenly split between processes that share them (PSS).
> >
> > If all anonymous memory is indistinguishable then figuring out the real
> > physical memory usage (PSS) of each heap requires either a pagemap walking
> > tool that can understand the heap debugging of every layer, or for every
> > layer's heap debugging tools to implement the pagemap walking logic, in
> > which case it is hard to get a consistent view of memory across the whole
> > system.
> >
> > Tracking the information in userspace leads to all sorts of problems.
> > It either needs to be stored inside the process, which means every
> > process has to have an API to export its current heap information upon
> > request, or it has to be stored externally in a filesystem that
> > somebody needs to clean up on crashes.  It needs to be readable while
> > the process is still running, so it has to have some sort of
> > synchronization with every layer of userspace.  Efficiently tracking
> > the ranges requires reimplementing something like the kernel vma
> > trees, and linking to it from every layer of userspace.  It requires
> > more memory, more syscalls, more runtime cost, and more complexity to
> > separately track regions that the kernel is already tracking.
> >
> > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> > userspace-provided name for anonymous vmas.  The names of named anonymous
> > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> >
> > Userspace can set the name for a region of memory by calling
> > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> > Setting the name to NULL clears it. The name length limit is 80 bytes
> > including NUL-terminator and is checked to contain only printable ascii
> > characters (including space), except '[',']','\','$' and '`'. Ascii
> > strings are being used to have a descriptive identifiers for vmas, which
> > can be understood by the users reading /proc/pid/maps or /proc/pid/smaps.
> > Names can be standardized for a given system and they can include some
> > variable parts such as the name of the allocator or a library, tid of
> > the thread using it, etc.
> >
> > The name is stored in a pointer in the shared union in vm_area_struct
> > that points to a null terminated string. Anonymous vmas with the same
> > name (equivalent strings) and are otherwise mergeable will be merged.
> > The name pointers are not shared between vmas even if they contain the
> > same name. The name pointer is stored in a union with fields that are
> > only used on file-backed mappings, so it does not increase memory usage.
> >
> > CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
> > feature. It keeps the feature disabled by default to prevent any
> > additional memory overhead and to avoid confusing procfs parsers on
> > systems which are not ready to support named anonymous vmas.
> >
> > The patch is based on the original patch developed by Colin Cross, more
> > specifically on its latest version [1] posted upstream by Sumit Semwal.
> > It used a userspace pointer to store vma names. In that design, name
> > pointers could be shared between vmas. However during the last upstreaming
> > attempt, Kees Cook raised concerns [2] about this approach and suggested
> > to copy the name into kernel memory space, perform validity checks [3]
> > and store as a string referenced from vm_area_struct.
> > One big concern is about fork() performance which would need to strdup
> > anonymous vma names. Dave Hansen suggested experimenting with worst-case
> > scenario of forking a process with 64k vmas having longest possible names
> > [4]. I ran this experiment on an ARM64 Android device and recorded a
> > worst-case regression of almost 40% when forking such a process. This
> > regression is addressed in the followup patch which replaces the pointer
> > to a name with a refcounted structure that allows sharing the name pointer
> > between vmas of the same name. Instead of duplicating the string during
> > fork() or when splitting a vma it increments the refcount.
> >
> > [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
> > [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> > [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> > [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
> >
> > Changes for prctl(2) manual page (in the options section):
> >
> > PR_SET_VMA
> >       Sets an attribute specified in arg2 for virtual memory areas
> >       starting from the address specified in arg3 and spanning the
> >       size specified  in arg4. arg5 specifies the value of the attribute
> >       to be set. Note that assigning an attribute to a virtual memory
> >       area might prevent it from being merged with adjacent virtual
> >       memory areas due to the difference in that attribute's value.
> >
> >       Currently, arg2 must be one of:
> >
> >       PR_SET_VMA_ANON_NAME
> >               Set a name for anonymous virtual memory areas. arg5 should
> >               be a pointer to a null-terminated string containing the
> >               name. The name length including null byte cannot exceed
> >               80 bytes. If arg5 is NULL, the name of the appropriate
> >               anonymous virtual memory areas will be reset. The name
> >               can contain only printable ascii characters (including
> >                space), except '[',']','\','$' and '`'.
> >
> >                This feature is available only if the kernel is built with
> >                the CONFIG_ANON_VMA_NAME option enabled.
>
> For what it’s worth, it’s definitely interesting to see this going upstream.
> In particular, we would use it for high-level grouping of the data in
> production profiling when proper symbolization is not available:
>
> * JVM could associate a name with the memory regions it uses for the JIT
>   code so that Linux perf data are associated with a high level name like
>   "Java JIT" even if the proper Java JIT profiling is not enabled.
> * Similar for other JIT engines like v8 - they could annotate the memory
>   regions they manage and use as well.
> * Traditional memory allocators like tcmalloc can use this as well so
>   that the associated name is used in data access profiling via Linux perf.

Hi Alexey,
Thanks for providing your feedback! Nice to hear that this can be
useful outside of Android.