Message-ID: <791e259b-3a57-487d-81ca-9d83f83ad685@arm.com>
Date: Fri, 1 Aug 2025 10:30:35 +0100
From: Suzuki K Poulose <suzuki.poulose@....com>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>, linux-coco@...ts.linux.dev,
 kvmarm@...ts.linux.dev, linux-pci@...r.kernel.org,
 linux-kernel@...r.kernel.org, aik@....com, lukas@...ner.de,
 Samuel Ortiz <sameo@...osinc.com>, Xu Yilun <yilun.xu@...ux.intel.com>,
 Steven Price <steven.price@....com>,
 Catalin Marinas <catalin.marinas@....com>, Marc Zyngier <maz@...nel.org>,
 Will Deacon <will@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>
Subject: Re: [RFC PATCH v1 04/38] tsm: Support DMA Allocation from private
 memory

On 31/07/2025 17:44, Jason Gunthorpe wrote:
> On Thu, Jul 31, 2025 at 02:48:23PM +0100, Suzuki K Poulose wrote:
>> On 31/07/2025 13:17, Jason Gunthorpe wrote:
>>> On Wed, Jul 30, 2025 at 11:09:35AM +0100, Suzuki K Poulose wrote:
>>>>>> It is unclear whether devices would need to perform DMA to shared
>>>>>> (unencrypted) memory while operating in this mode, as TLPs with T=1
>>>>>> are generally expected to target private memory.
>>>>>
>>>>> PCI SIG supports it, kernel should support it.
>>>>
>>>> ACK. On Arm CCA, the device can access shared IPA, with T=1 transaction
>>>> as long as the mapping is active in the Stage2 managed by RMM.
>>>
>>> Right, I expect that the T=0 SMMU S2 translation is a perfect subset of
>>> the T=1 S2 RMM translation. At most, pages that are not available to
>>> T=0 should be removed when making the subset.
>>
>> Yes, this is what the VMM is supposed to do today, see [0] & [1].
> 
> Okay great!
> 
>>> I'm not sure what the plan is here on ARM though, do you expect to
>>> pre-load the entire T=0 SMMU S2 with the shared IPA aliases and rely
>>> on the GPT for protection or will the hypervisor dynamically change
>>> the T=0 SMMU S2 after each shared/private change? Same question for
>>
>> Yes, share/private transitions do go all the way back to the VMM, and it
>> is supposed to make the necessary changes to the SMMU S2 (as in [1]).
> 
> Okay, it works, but also why?
> 
>  From a hypervisor perspective when using VFIO I'd like the guestmemfd
> to fix all the physical memory immediately, so the entire physical map
> is fixed and known. Backed by 1G huge pages most likely.
> 
> Is there a reason not to just dump that into the T=0 SMMU using 1G
> huge pages and never touch it again? The GPT provides protection?

That is possible, once we get guest_memfd mmap support merged upstream,
and the GPT does provide protection. The only caveat is whether guest_memfd
supports this at all, i.e., shared->private transitions while a shared
mapping is still in place (though that mapping is only in the SMMU, not in
the host CPU page tables).
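
For reference, a rough VMM-side sketch of what that "map it once and never
touch it again" setup could look like, assuming guest_memfd gains mmap
support and the VFIO device is attached through an iommufd IOAS. All names
and values below are illustrative, not taken from this series:

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/iommufd.h>

/*
 * Map the whole guest DRAM alias into the T=0 SMMU S2 (via the iommufd
 * IOAS) once at VM setup and never touch it again, relying on the GPT
 * to block T=0 access to granules that are currently private.
 */
static void map_dram_once(int iommufd, uint32_t ioas_id, int guest_memfd,
			  uint64_t dram_size, uint64_t dram_ipa_base)
{
	/* Requires guest_memfd mmap support (not merged yet). */
	void *alias = mmap(NULL, dram_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, guest_memfd, 0);
	if (alias == MAP_FAILED)
		err(1, "mmap(guest_memfd)");

	struct iommu_ioas_map map = {
		.size    = sizeof(map),
		.flags   = IOMMU_IOAS_MAP_FIXED_IOVA |
			   IOMMU_IOAS_MAP_READABLE |
			   IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.user_va = (uintptr_t)alias,
		.length  = dram_size,
		.iova    = dram_ipa_base,	/* base of the shared IPA alias */
	};

	if (ioctl(iommufd, IOMMU_IOAS_MAP, &map))
		err(1, "IOMMU_IOAS_MAP");
}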


> 
> Sure sounds appealing..
> 
>> As for the RMM S2, the current plan is to re-use the CPU S2 managed
>> by RMM.
> 
> Yes, but my question is if the CPU will be prepopulated
>   
>> Actually it is. But it might solve the problem for confidential VMs,
>> where the S2 mapping is kind of pinned.
> 
> Not kind of pinned, it is pinned in the hypervisor..
>   
>> Population of the S2 is a bit tricky for CVMs, as there are restrictions
>> due to:
>>    1) Pre-boot measurements
>>    2) Restrictions on modifying the S2 (at least on CCA).
> 
> I haven't dug into any of this, but I'd challenge you to try to make
> it run fast if the guestmemfd has a full fixed address map in 1G pages
> and could just dump them into the RMM efficiently once during boot.
> 
> Perhaps there are ways to optimize the measurements for huge amounts
> of zero'd memory.

There is: the VMM can choose not to "measure" the zeroed pages.


>> Filling in the S2 with already-populated contents is complicated for CCA
>> (costly, but not impossible). The easier way is for the Realm to
>> fault in the pages before they are used for DMA (and S2 mappings can be
>> pinned by the hyp by default). Hence that suggestion.
> 
> I guess, but it's weird, kinda slow, and the RMM can never unfault them..
> 
> How will you reconstruct the 1G huge pages in the S2 if you are only
> populating on faults? Can you really fault the entire 1G page? If so
> why can't it be prepopulated?

It is tricky to prepopulate the 1G page, as parts of the page may already
be "populated" with contents. We can recreate the 1G block mapping by
"FOLD"ing the leaf-level tables all the way up to 1G, after the
mappings are created. We have to do that anyway for CCA.
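
Roughly, the folding goes bottom-up. A minimal sketch, where cca_rtt_fold()
and reclaim_rtt_granule() are hypothetical wrappers around the RMM spec's
RMI_RTT_FOLD and granule reclaim; the exact ABI is per Arm DEN0137:

/*
 * Illustrative only: once every 4K leaf under a 1G-aligned IPA range is
 * mapped, collapse the RTTs bottom-up so the range ends up as a single
 * 1G block entry in the Realm's S2.
 *
 * cca_rtt_fold(realm, ipa, level, &rtt_pa) is assumed to fold the
 * homogeneous RTT covering @ipa at @level into a block entry in its
 * parent, returning the now-unused RTT granule for the host to reclaim.
 */
static int fold_to_1g_block(struct realm *realm, unsigned long ipa)
{
	unsigned long addr, rtt_pa;
	int ret;

	/* Fold each level-3 RTT into a 2M block at level 2 ... */
	for (addr = ipa; addr < ipa + SZ_1G; addr += SZ_2M) {
		ret = cca_rtt_fold(realm, addr, 3, &rtt_pa);
		if (ret)
			return ret;
		reclaim_rtt_granule(rtt_pa);
	}

	/* ... then fold the level-2 RTT into a 1G block at level 1. */
	ret = cca_rtt_fold(realm, ipa, 2, &rtt_pa);
	if (!ret)
		reclaim_rtt_granule(rtt_pa);

	return ret;
}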

I think we can go ahead with the VMM pre-populating the entire DRAM
and keeping it pinned for DA. Rather than doing this from the vfio
kernel side, it could be done by the VMM, as it has better knowledge
of the populated contents and can map the rest as "unmeasured" zeros.
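
i.e., something along these lines at Realm build time. This is illustrative
VMM-side pseudocode, not code from this series; populate_measured() /
populate_unmeasured() and fold_all_to_1g_blocks() are hypothetical helpers
that would end up, per the RMM spec, as measured vs. unmeasured data
creation (e.g. RMI_DATA_CREATE with or without content measurement):

/*
 * Illustrative only: the VMM walks its view of guest DRAM once at Realm
 * build time.  Ranges that carry real content (kernel, DTB, initrd, ...)
 * are populated and measured; everything else is populated as unmeasured
 * zeros, so the Realm Initial Measurement does not have to cover
 * gigabytes of zero pages.
 */
static void populate_realm_dram(struct vm *vm)
{
	for (size_t i = 0; i < vm->nr_dram_ranges; i++) {
		struct dram_range *r = &vm->dram_ranges[i];

		if (r->has_payload)
			/* Measured: contributes to the initial measurement. */
			populate_measured(vm, r->ipa, r->size, r->data);
		else
			/* Unmeasured zeros: populated and pinned, but not measured. */
			populate_unmeasured(vm, r->ipa, r->size);
	}

	/* With everything mapped, fold the leaf tables back up to 1G blocks. */
	fold_all_to_1g_blocks(vm);
}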

Suzuki
