Message-ID: <e4bf2e19-adc2-ad5e-f516-e8014500456d@oracle.com>
Date:   Mon, 3 Aug 2020 16:03:59 -0400
From:   Steven Sistare <steven.sistare@...cle.com>
To:     James Bottomley <James.Bottomley@...senPartnership.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Matthew Wilcox <willy@...radead.org>,
        Anthony Yznaga <anthony.yznaga@...cle.com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-arch@...r.kernel.org, mhocko@...nel.org,
        tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, x86@...nel.org,
        hpa@...or.com, viro@...iv.linux.org.uk, akpm@...ux-foundation.org,
        arnd@...db.de, keescook@...omium.org, gerg@...ux-m68k.org,
        ktkhai@...tuozzo.com, christian.brauner@...ntu.com,
        peterz@...radead.org, esyr@...hat.com, jgg@...pe.ca,
        christian@...lner.me, areber@...hat.com, cyphar@...har.com
Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC

On 8/3/2020 11:42 AM, James Bottomley wrote:
> On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> [...]
>> Why would live migration between one qemu process and another
>> qemu process on the same machine not work for this use case?
>>
>> Just reusing live migration would seem to be the simplest path of
>> all, as the code is already implemented.  Further if something goes
>> wrong with the live migration you can fallback to the existing
>> process.  With exec there is no fallback if the new version does not
>> properly support the handoff protocol of the old version.
> 
> Actually, could I ask this another way: the other patch set you sent to
> the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> kexec using zero copy for extremely fast save/restore.  The original
> idea was to use this as part of a CRIU based snapshot, kexec to new
> system, restore.  However, why can't you do a local snapshot, restart
> qemu, restore using the PKRAM capsule to achieve exactly the same as
> MADV_DOEXEC does but using a system that's easy to reason about?  It
> may be slightly slower, but I think we're still talking milliseconds.

Hi James, good to hear from you.  PKRAM or SysV shm could be used for
a restart in that manner, but it would support SR-IOV guests only if the
guest exports an agent that supports suspend-to-ram, and if all guest
drivers support the suspend-to-ram method.  I have done this using a Linux
guest and the qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec.
With MADV_DOEXEC, the pause time is 100 - 200 msec.  The pause time grows to a
handful of seconds if the guest uses an nvme drive, because CC.SHN takes so long
to persist metadata to stable storage.

We could instead pass vfio descriptors from the old process to a third-party escrow
process and pass them back to the new qemu process, but the shm that vfio has
already registered must be remapped at the same VA as in the previous process, and
there is no interface to guarantee that.  MAP_FIXED blows away existing mappings
and breaks the app.  MAP_FIXED_NOREPLACE respects existing mappings but then cannot
map the shm, which also breaks the app.  Adding a feature that reserves VAs would
fix that; we have experimented with one.  Fixing the vfio kernel implementation to
not depend on the original VA base would also work, but I don't know how
doable/difficult that would be.
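To illustrate the remap problem, here is a minimal userspace sketch (not the actual
qemu/vfio code; fd, old_va, and len are hypothetical stand-ins for state that would
be handed from the old qemu to the new one):

#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

/* Try to re-establish a shared mapping at the VA it had in the old process. */
static void *remap_at_old_va(int fd, void *old_va, size_t len)
{
	/* MAP_FIXED_NOREPLACE (Linux >= 4.17) refuses to clobber an existing
	 * mapping and fails with EEXIST instead. */
	void *p = mmap(old_va, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0);
	if (p == MAP_FAILED) {
		/* If the heap, the stack, or a DSO already landed on old_va we
		 * get EEXIST.  Retrying with MAP_FIXED would succeed but would
		 * silently unmap whatever lives there -- either way the app
		 * breaks, which is the problem described above. */
		fprintf(stderr, "remap at %p failed: %s\n",
			old_va, strerror(errno));
		return NULL;
	}
	return p;
}

A VA-reservation mechanism would let the new process fence off old_va early, before
the heap or any DSO can land there, which is the kind of feature mentioned above.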

Both solutions would require a qemu instance to be stopped and relaunched using shm
as guest ram, and its guest rebooted, so they do not let us update legacy
already-running instances that use anon memory.  That problem solves itself once we
get these RFEs into Linux and qemu and users eventually shut down the legacy
instances, but that takes years and we need to do it sooner.

- Steve
