linux-kernel - Re: [PATCH v10 2/3] mm: add a field to store names for private anonymous memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20211001160830.700c36b32b736478000b3420@linux-foundation.org>
Date:   Fri, 1 Oct 2021 16:08:30 -0700
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     Suren Baghdasaryan <surenb@...gle.com>
Cc:     ccross@...gle.com, sumit.semwal@...aro.org, mhocko@...e.com,
        dave.hansen@...el.com, keescook@...omium.org, willy@...radead.org,
        kirill.shutemov@...ux.intel.com, vbabka@...e.cz,
        hannes@...xchg.org, corbet@....net, viro@...iv.linux.org.uk,
        rdunlap@...radead.org, kaleshsingh@...gle.com, peterx@...hat.com,
        rppt@...nel.org, peterz@...radead.org, catalin.marinas@....com,
        vincenzo.frascino@....com, chinwen.chang@...iatek.com,
        axelrasmussen@...gle.com, aarcange@...hat.com, jannh@...gle.com,
        apopple@...dia.com, jhubbard@...dia.com, yuzhao@...gle.com,
        will@...nel.org, fenghua.yu@...el.com, thunder.leizhen@...wei.com,
        hughd@...gle.com, feng.tang@...el.com, jgg@...pe.ca, guro@...com,
        tglx@...utronix.de, krisman@...labora.com, chris.hyser@...cle.com,
        pcc@...gle.com, ebiederm@...ssion.com, axboe@...nel.dk,
        legion@...nel.org, eb@...ix.com, gorcunov@...il.com, pavel@....cz,
        songmuchun@...edance.com, viresh.kumar@...aro.org,
        thomascedeno@...gle.com, sashal@...nel.org, cxfcosmos@...il.com,
        linux@...musvillemoes.dk, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-mm@...ck.org, kernel-team@...roid.com
Subject: Re: [PATCH v10 2/3] mm: add a field to store names for private
 anonymous memory

On Fri,  1 Oct 2021 13:56:56 -0700 Suren Baghdasaryan <surenb@...gle.com> wrote:

> From: Colin Cross <ccross@...gle.com>
> 
> In many userspace applications, and especially in VM based applications
> like Android uses heavily, there are multiple different allocators in use.
>  At a minimum there is libc malloc and the stack, and in many cases there
> are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> multiple VM heaps (one for small objects, one for big objects, etc.).
> Each of these layers usually has its own tools to inspect its usage;
> malloc by compiling a debug version, the VM through heap inspection tools,
> and for direct syscalls there is usually no way to track them.
> 
> On Android we heavily use a set of tools that use an extended version of
> the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> in userspace and slice their usage by process, shared (COW) vs.  unique
> mappings, backing, etc.  This can account for real physical memory usage
> even in cases like fork without exec (which Android uses heavily to share
> as many private COW pages as possible between processes), Kernel SamePage
> Merging, and clean zero pages.  It produces a measurement of the pages
> that only exist in that process (USS, for unique), and a measurement of
> the physical memory usage of that process with the cost of shared pages
> being evenly split between processes that share them (PSS).
> 
> If all anonymous memory is indistinguishable then figuring out the real
> physical memory usage (PSS) of each heap requires either a pagemap walking
> tool that can understand the heap debugging of every layer, or for every
> layer's heap debugging tools to implement the pagemap walking logic, in
> which case it is hard to get a consistent view of memory across the whole
> system.
> 
> Tracking the information in userspace leads to all sorts of problems.
> It either needs to be stored inside the process, which means every
> process has to have an API to export its current heap information upon
> request, or it has to be stored externally in a filesystem that
> somebody needs to clean up on crashes.  It needs to be readable while
> the process is still running, so it has to have some sort of
> synchronization with every layer of userspace.  Efficiently tracking
> the ranges requires reimplementing something like the kernel vma
> trees, and linking to it from every layer of userspace.  It requires
> more memory, more syscalls, more runtime cost, and more complexity to
> separately track regions that the kernel is already tracking.
> 
> This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> userspace-provided name for anonymous vmas.  The names of named anonymous
> vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> 
> Userspace can set the name for a region of memory by calling
> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);

So this can cause a vma to be split, if [start,len] doesn't exactly
describe an existing vma?  If so, is this at all useful?  If not then
`len' isn't needed - just pass in some address within an existing vma?

> Setting the name to NULL clears it. The name length limit is 80 bytes
> including NUL-terminator and is checked to contain only printable ascii
> characters (including space), except '[',']','\','$' and '`'.
> 
> The name is stored in a pointer in the shared union in vm_area_struct
> that points to a null terminated string. Anonymous vmas with the same
> name (equivalent strings) and are otherwise mergeable will be merged.

So this can prevent vma merging if used incorrectly (or maliciously -
can't think how)?  What are the potential impacts of this?

> The name pointers are not shared between vmas even if they contain the
> same name. The name pointer is stored in a union with fields that are
> only used on file-backed mappings, so it does not increase memory usage.
> 
> The patch is based on the original patch developed by Colin Cross, more
> specifically on its latest version [1] posted upstream by Sumit Semwal.
> It used a userspace pointer to store vma names. In that design, name
> pointers could be shared between vmas. However during the last upstreaming
> attempt, Kees Cook raised concerns [2] about this approach and suggested
> to copy the name into kernel memory space, perform validity checks [3]
> and store as a string referenced from vm_area_struct.
> One big concern is about fork() performance which would need to strdup
> anonymous vma names. Dave Hansen suggested experimenting with worst-case
> scenario of forking a process with 64k vmas having longest possible names
> [4]. I ran this experiment on an ARM64 Android device and recorded a
> worst-case regression of almost 40% when forking such a process. This
> regression is addressed in the followup patch which replaces the pointer
> to a name with a refcounted structure that allows sharing the name pointer
> between vmas of the same name. Instead of duplicating the string during
> fork() or when splitting a vma it increments the refcount.

Generally, the patch adds a bunch of code which a lot of users won't
want.  Did we bust a gut to reduce this impact?  Was a standalone
config setting considered?

And what would be the impact of making this feature optional?  Is a
proliferation of formats in /proc/pid/maps going to make userspace
parsers harder to develop and test?

I do think that saying "The names of named anonymous vmas are shown in
/proc/pid/maps and /proc/pid/smaps as [anon:<name>]." is a bit thin. 
Please provide sample output so we can consider these things better.

What are the risks that existing parsers will be broken by such changes?