linux-kernel - Re: [RFC PATCH 0/5] madvise MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <63a7404c-e4f6-a82e-257b-217585b0277f@oracle.com>
Date:   Thu, 30 Jul 2020 13:35:51 -0400
From:   Steven Sistare <steven.sistare@...cle.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Anthony Yznaga <anthony.yznaga@...cle.com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-arch@...r.kernel.org, mhocko@...nel.org,
        tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, x86@...nel.org,
        hpa@...or.com, viro@...iv.linux.org.uk, akpm@...ux-foundation.org,
        arnd@...db.de, ebiederm@...ssion.com, keescook@...omium.org,
        gerg@...ux-m68k.org, ktkhai@...tuozzo.com,
        christian.brauner@...ntu.com, peterz@...radead.org,
        esyr@...hat.com, jgg@...pe.ca, christian@...lner.me,
        areber@...hat.com, cyphar@...har.com
Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC

On 7/30/2020 1:12 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
>> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
>>> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>>>> This patchset adds support for preserving an anonymous memory range across
>>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>>> memory segment, is to ensure it is mapped at the same virtual address in
>>>> the new process as it was in the old one.  An intended use for this is to
>>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>>> vfio mappings and their associated kernel data structures can remain valid.
>>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>>> anonymous memory can be updated.
>>>
>>> I just realised that something else I'm working on might be a suitable
>>> alternative to this.  Apologies for not realising it sooner.
>>>
>>> http://www.wil.cx/~willy/linux/sileby.html
>>>
>>> To use this, you'd mshare() the anonymous memory range, essentially
>>> detaching the VMA from the current process's mm_struct and reparenting
>>> it to this new mm_struct, which has an fd referencing it.
>>>
>>> Then you call exec(), and the exec'ed task gets to call mmap() on that
>>> new fd to attach the memory range to its own address space.
>>>
>>> Presto!
>>
>> To be suitable for the qemu use case, we need a guarantee that the same VA range
>> is available in the new process, with nothing else mapped there.  From your spec,
>> it sounds like the new process could do a series of unrelated mmap's which could
>> overlap the desired va range before the silby mmap(fd) is performed??
> 
> That could happen.  eg libc might get its text segment mapped there
> randomly.  I believe Khalid was working on a solution for reserving
> memory ranges.

mshare + VA reservation is another possible solution.

Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

>> Also, we need to support updating legacy processes that already created anon segments.
>> We inject code that calls MADV_DOEXEC for such segments.
> 
> Yes, I was assuming you'd inject code that called mshare().

OK, mshare works on existing memory and builds a new vma.

> Actually, since you're injecting code, why do you need the kernel to
> be involved?  You can mmap the new executable and any libraries it depends
> upon, set up a new stack and jump to the main() entry point, all without
> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> be in userspace and you can probably steal / reuse code from ld.so (I'm
> not familiar with the details of how setting up an executable is done).

Duplicating all the work that the kernel and loader do to exec a process would
be error prone, require ongoing maintenance, and be redundant.  Better to define 
a small kernel extension and leave exec to the kernel.

- Steve