linux-kernel - Re: [PATCH 0/5] x86: Impplement support for unaccepted memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <82b8836f-a467-e5ff-08f3-704a85b9faa0@intel.com>
Date:   Tue, 10 Aug 2021 08:51:01 -0700
From:   Dave Hansen <dave.hansen@...el.com>
To:     "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Cc:     "Kirill A. Shutemov" <kirill@...temov.name>,
        Borislav Petkov <bp@...en8.de>,
        Andy Lutomirski <luto@...nel.org>,
        Sean Christopherson <seanjc@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Joerg Roedel <jroedel@...e.de>,
        Andi Kleen <ak@...ux.intel.com>,
        Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        David Rientjes <rientjes@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Tom Lendacky <thomas.lendacky@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Varad Gautam <varad.gautam@...e.com>,
        Dario Faggioli <dfaggioli@...e.com>, x86@...nel.org,
        linux-mm@...ck.org, linux-coco@...ts.linux.dev,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/5] x86: Impplement support for unaccepted memory

On 8/10/21 8:15 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
>> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces concept of memory acceptance:
>>> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>>> requiring memory to be accepted before it can be used by the guest.
>>> Accepting happens via a protocol specific for the Virtrual Machine
>>> platform.
>>>
>>> Accepting memory is costly and it makes VMM allocate memory for the
>>> accepted guest physical address range. We don't want to accept all memory
>>> upfront.
>>
>> This could use a bit more explanation.  Any VM is likely to *eventually*
>> touch all its memory, so it's not like a VMM has a long-term advantage
>> by delaying this.
>>
>> So, it must have to do with resource use at boot.  Is this to help boot
>> times?
> 
> Yes, boot time is main motivation.
> 
> But I'm going also to look at long-term VM behaviour with the fixed memory
> footprint. I think if a workload allocate/free memory within the same
> amount we can keep memory beyond the size unaccepted. Few tweaks likely
> will be required such as disabling page shuffling on free to keep
> unaccepted memory at the tail of free list. More investigation needed.

OK, so this is predicated on the idea that a guest will not use all of
its assigned RAM and that the host will put that RAM to good use
elsewhere.  Right?

That's undercut by a a few of factors:
1. Many real-world cloud providers do not overcommit RAM.  If the guest
   does not use the RAM, it goes to waste.  (Yes, there are providers
   that overcommit, but we're talking generally about places where this
   proposal is useful).
2. Long-term, RAM fills up with page cache in many scenarios

So, this is really only beneficial for long-term host physical memory
use if:
1. The host is overcommitting
and
2. The guest never uses all of its RAM

Seeing as TDX doesn't support swap and can't coexist with persistent
memory, the only recourse for folks overcommitting TDX VMs when the run
out of RAM is to kill VMs.

I can't imagine that a lot of folks are going to do this.

In other words, I buy the boot speed argument.  But, I don't buy the
"this saves memory long term" argument at all.

>> I had expected this series, but I also expected it to be connected to
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
>> this problem is different and demands a totally orthogonal solution?
>>
>> For instance, what prevents us from declaring: "Memory is accepted at
>> the time that its 'struct page' is initialized" ?  Then, we use all the
>> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
> 
> That was my first thought too and I tried it just to realize that it is
> not what we want. If we would accept page on page struct init it means we
> would make host allocate all memory assigned to the guest on boot even if
> guest actually use small portion of it.
> 
> Also deferred page init only allows to scale memory accept across multiple
> CPUs, but doesn't allow to get to userspace before we done with it. See
> wait_for_completion(&pgdat_init_all_done_comp).

That's good information.  It's a refinement of the "I want to boot
faster" requirement.  What you want is not just going _faster_, but
being able to run userspace before full acceptance has completed.

Would you be able to quantify how fast TDX page acceptance is?  Are we
talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
for a feature which making a performance claim.

Let's say we have a 128GB VM.  How much does faster does this approach
reach userspace than if all memory was accepted up front?  How much
memory _could_ have been accepted at the point userspace starts running?