linux-kernel - Re: [PATCH v3 00/21] TDX host kernel support

Message-ID: <CAPcyv4gEwjnNE9cWb_KLZ6C7-UxKdUMZKFPF+LAJ4L1SjByisw@mail.gmail.com>
Date:   Thu, 28 Apr 2022 20:04:19 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Kai Huang <kai.huang@...el.com>
Cc:     Dave Hansen <dave.hansen@...el.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        KVM list <kvm@...r.kernel.org>,
        Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        "Brown, Len" <len.brown@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        Rafael J Wysocki <rafael.j.wysocki@...el.com>,
        Reinette Chatre <reinette.chatre@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Andi Kleen <ak@...ux.intel.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [PATCH v3 00/21] TDX host kernel support

On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@...el.com> wrote:
>
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > >
> > > > I thought we could document this, saying that this code can only work
> > > > on TDX machines that don't have the above capabilities (SPR for now).  We
> > > > can change the code and update the documentation when we add support for
> > > > those features in the future.
> > > >
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine supports those features.
> > > >
> > > > I'll think about design solutions if the above doesn't look good to you.
> > >
> > > No, it doesn't look good to me.
> > >
> > > You can't just say:
> > >
> > >     /*
> > >      * This code will eat puppies if used on systems with hotplug.
> > >      */
> > >
> > > and merrily await the puppy bloodbath.
> > >
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > >
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > >
> > > Please, please read this again ^^
> >
> > OK.  I'll think about solutions and come back later.
> > >
>
> Hi Dave,
>
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.
>
> The first approach is simple.  We just block memory from being added as system
> RAM managed by the page allocator when the platform supports TDX [1].  It seems we
> can add an arch-specific check to __add_memory_resource() and reject the new
> memory resource if the platform supports TDX.  __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed(), so this prevents both adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

What if the memory being added *is* TDX capable? What if someone
wanted to manage a memory range as soft-reserved and move it back and
forth between the core-mm and device access? That should be perfectly
acceptable as long as the memory is TDX capable.
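
To make that concrete, here is a minimal userspace sketch (not kernel
code) of a hotplug check that accepts a range when it is TDX capable
instead of rejecting everything. The names struct mem_range,
range_is_tdx_capable() and may_add_memory() are hypothetical, and the
list of TDX-capable ranges is assumed to be handed in by the caller;
how the kernel would actually obtain it is not addressed here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open physical address range: [start, end). */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return true if @r lies entirely inside one of the @nr ranges in
 * @capable.  Simplification for this sketch: a range that spans two
 * adjacent capable ranges is treated as not capable.
 */
static bool range_is_tdx_capable(const struct mem_range *r,
				 const struct mem_range *capable, size_t nr)
{
	size_t i;

	if (r->start >= r->end)		/* empty or inverted range */
		return false;

	for (i = 0; i < nr; i++) {
		if (r->start >= capable[i].start && r->end <= capable[i].end)
			return true;
	}
	return false;
}

/*
 * Hypothetical policy hook: on a platform without TDX everything is
 * allowed; with TDX, only TDX-capable memory may be added.
 */
static bool may_add_memory(bool platform_has_tdx, const struct mem_range *r,
			   const struct mem_range *capable, size_t nr)
{
	if (!platform_has_tdx)
		return true;
	return range_is_tdx_capable(r, capable, nr);
}
```

With something like this, a soft-reserved range that sits inside a
TDX-capable region could move in and out of the core-mm freely, while
only ranges outside every capable region would be refused.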

> The second approach is more complicated.  Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status.  This is feasible because, with the latest public P-SEAMLDR spec, we can get
> CMR from the P-SEAMLDR SEAMCALL [3].  So we can detect P-SEAMLDR and get CMR info
> during kernel boot.  And in __add_memory_resource() we do the check below:
>
>         tdx_init_disable();     /* similar to cpu_hotplug_disable() */
>         if (tdx_module_initialized())
>                 /* reject memory hotplug */
>         else if (new_memory_resource NOT in CMRs)
>                 /* reject memory hotplug */
>         else
>                 /* allow memory hotplug */
>         tdx_init_enable();      /* similar to cpu_hotplug_enable() */
>
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex.  If the TDX module initialization is already ongoing, then it
> waits until it completes.
>
> This should work better for future platforms, but would require non-trivial
> additional code, as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using SEAMCALL.  A side advantage is that with VMXON in the core-kernel we can
> shut down the TDX module in kexec().
>
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?

This still sounds like it is trying to solve symptoms and not the root
problem. Why must the core-mm never have non-TDX memory when VMs are
fine to operate with either core-mm pages or memory from other sources
like hugetlbfs and device-dax?