linux-kernel - Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it

Message-ID: <465ce78b-d023-40e6-b066-5e4a01e266b6@redhat.com>
Date: Fri, 11 Oct 2024 14:04:12 +0200
From: David Hildenbrand <david@...hat.com>
To: Fares Mehanna <faresx@...zon.de>
Cc: akpm@...ux-foundation.org, ardb@...nel.org, arnd@...db.de,
 bhelgaas@...gle.com, broonie@...nel.org, catalin.marinas@....com,
 james.morse@....com, javierm@...hat.com, jean-philippe@...aro.org,
 joey.gouly@....com, kristina.martsenko@....com, kvmarm@...ts.linux.dev,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, mark.rutland@....com, maz@...nel.org,
 nh-open-source@...zon.com, oliver.upton@...ux.dev, ptosi@...gle.com,
 rdunlap@...radead.org, rkagan@...zon.de, rppt@...nel.org,
 shikemeng@...weicloud.com, suzuki.poulose@....com, tabba@...gle.com,
 will@...nel.org, yuzenghui@...wei.com
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use
 it

On 10.10.24 17:52, Fares Mehanna wrote:
>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>> current and future speculation-based cross-process attacks.  We still believe
>>> this is a nice thing to have.
>>>
>>> However, in the time passed since that post Linux mm has grown quite a few new
>>> goodies, so we'd like to explore possibilities to implement this functionality
>>> with less effort and churn leveraging the now available facilities.
>>>
>>> An RFC was posted a few months back [2] to show the proof of concept and a
>>> simple test driver.
>>>
>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>> piggy-backing on memfd_secret(), using regular user addresses but pinning the
>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>> directly accessible from kernel.
>>> In addition to that we are submitting 5 patches to use the secret memory to hide
>>> the vCPU gp-regs and fp-regs on arm64 VHE systems.
>>
>> I'm a bit lost on what exactly we want to achieve. The point where we
>> start flipping user/supervisor flags confuses me :)
>>
>> With secretmem, you'd get memory allocated that
>> (a) Is accessible by user space -- mapped into user space.
>> (b) Is inaccessible by kernel space -- not mapped into the direct map
>> (c) GUP will fail, but copy_from / copy_to user will work.
>>
>>
>> Another way, without secretmem, would be to consider these "secrets"
>> kernel allocations that can be mapped into user space using mmap() of a
>> special fd. That is, they wouldn't have their origin in secretmem, but
>> in KVM as a kernel allocation. It could be achieved by using VM_MIXEDMAP
>> with vm_insert_pages(), manually removing them from the directmap.
>>
>> But, I am not sure who is supposed to access what. Let's explore the
>> requirements. I assume we want:
>>
>> (a) Pages accessible by user space -- mapped into user space.
>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>> (c) GUP to fail (no direct map).
>> (d) copy_from / copy_to user to fail?
>>
>> And on top of that, some way to access these pages on demand from kernel
>> space? (temporary CPU-local mapping?)
>>
>> Or how would the kernel make use of these allocations?
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
> 
> Hi David,

Hi Fares!

> 
> Thanks for taking a look at the patches!
> 
> We're trying to allocate kernel memory that is accessible to the kernel, but
> only when the context of the owning process is loaded.
> 
> So this is kernel memory that is not needed to operate the kernel itself; it
> is used to store and process data on behalf of a process. The requirement for
> this memory is that it is never touched unless the process is scheduled on
> this core; any other access will crash the kernel.
> 
> So this memory should only be directly readable and writable by the kernel, but
> only when the process context is loaded. The memory shouldn't be readable or
> writable by the owner process at all.
> 
> This is basically done by removing those pages from the kernel linear mapping
> and attaching them only to the process's mm_struct. So during context
> switching the kernel loses access to the secret memory of the process being
> scheduled out and gains access to the secret memory of the new process.
> 
> This generally protects against speculation attacks, and against cases where
> another process manages to trick the kernel into leaking data from memory. In
> that case the kernel will crash if it tries to access another process's
> secret memory.
> 
> Since this memory is special in the sense that it is kernel memory but only
> makes sense in the context of the owner process, I tried in this patch series
> to explore the possibility of reusing memfd_secret() to allocate this memory
> in the user virtual address space, manage it in a VMA, and flip the
> permissions while keeping control of the mapping exclusively with the kernel.
> 
> Right now it is:
> (a) Pages not accessible by user space -- even though they are mapped into user
>      space, the PTEs are marked for kernel usage.

Ah, that is the detail I was missing, now I see what you are trying to 
achieve, thanks!

It is a bit architecture-specific, because ... imagine architectures 
that have separate kernel+user space page table hierarchies, and not a 
simple PTE flag to change access permissions between kernel/user space.

IIRC s390 is one such architecture that uses separate page tables for 
the user-space + kernel-space portions.

> (b) Pages accessible by kernel space -- even though they are not mapped into the
>      direct map, the PTEs in uvaddr are marked for kernel usage.
> (c) copy_from / copy_to user won't fail -- because it is in the user range, but
>      this can be fixed by allocating a specific range in user vaddr for this
>      feature and checking against that range there.
> (d) The secret memory vaddr is guessable by the owner process -- that can also
>      be fixed by allocating a bigger chunk of user vaddr for this feature and
>      randomly placing the secret memory there.
> (e) Mapping is off-limits to the owner process by marking the VMA as locked,
>      sealed and special.

Okay, so in this RFC you are jumping through quite some hoops to have a 
kernel allocation unmapped from the direct map but mapped into a 
per-process page table only accessible by kernel space. :)

So you really don't want this mapped into user space at all 
(consequently, no GUP, no access, no copy_from_user ...). In this RFC 
it's mapped but turned inaccessible by flipping the "kernel vs. user" 
switch.
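
To make that "flip" concrete, here is a toy user-space model of it. It is
only a sketch under stated assumptions: the toy_* names are invented for
illustration and are not kernel APIs, the bit positions are borrowed from the
x86-64 PTE layout (bit 0 = present, bit 2 = user/supervisor), and protections
such as SMAP/PAN are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only, not kernel code. Bit positions follow x86-64 PTEs. */
#define TOY_PTE_PRESENT (UINT64_C(1) << 0)
#define TOY_PTE_USER    (UINT64_C(1) << 2)

enum toy_mode { TOY_KERNEL, TOY_USER };

/* Would an access issued in 'mode' through this entry succeed? */
bool toy_access_ok(uint64_t pte, enum toy_mode mode)
{
    if (!(pte & TOY_PTE_PRESENT))
        return false;                     /* unmapped: faults for everyone */
    if (mode == TOY_USER)
        return (pte & TOY_PTE_USER) != 0; /* user needs the user bit */
    return true;                          /* supervisor access (SMAP/PAN ignored) */
}

/* The "flip": keep the mapping at its user address, drop user permission. */
uint64_t toy_make_kernel_only(uint64_t pte)
{
    return pte & ~TOY_PTE_USER;
}
```

Starting from a present, user-accessible entry, toy_make_kernel_only() leaves
the mapping in place at its user address, but a user-mode access now faults
while a kernel-mode access still succeeds.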

> 
> The other alternative (implemented in the first submission) is to track those
> allocations in a non-shared kernel PGD per process, then handle creating,
> forking and context-switching this PGD.

That sounds like a better approach. So we would remove the pages from 
the shared kernel direct map and map them into a separate kernel portion 
of the per-MM page tables?

Can you envision that would also work with architectures like s390x? I 
assume we would not only need the per-MM user space page table 
hierarchy, but also a per-MM kernel space page table hierarchy, into 
which we also map the common/shared-among-all-processes kernel space 
page tables (e.g., directmap).
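
A toy model of that layout, purely to illustrate the idea: each MM gets its
own top-level kernel table whose slots all point at the shared kernel
mappings, except for one private slot. Everything here is hypothetical (the
toy_* names, the slot count, the slot labels); it is not how any architecture
actually lays out its page tables.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KERNEL_SLOTS 4
#define TOY_SECRET_SLOT  3   /* hypothetical: the one non-shared slot */

/* Per-MM top-level kernel table; entries are just labels in this model. */
struct toy_mm {
    const char *kernel_pgd[TOY_KERNEL_SLOTS];
};

/* Mappings common to all processes (e.g. the direct map). */
static const char *toy_shared[TOY_KERNEL_SLOTS] = {
    "directmap", "vmalloc", "modules", NULL
};

/* The MM whose tables are currently "loaded" on this CPU. */
static const struct toy_mm *toy_current;

/* Share every slot with all other MMs, except the private secret slot. */
void toy_mm_init(struct toy_mm *mm, const char *secret)
{
    for (size_t i = 0; i < TOY_KERNEL_SLOTS; i++)
        mm->kernel_pgd[i] = toy_shared[i];
    mm->kernel_pgd[TOY_SECRET_SLOT] = secret;
}

/* Context switch: from now on lookups go through the next MM's table. */
void toy_switch_mm(const struct toy_mm *next)
{
    toy_current = next;
}

/* What the kernel can reach through 'slot' right now (NULL = would fault). */
const char *toy_kernel_lookup(size_t slot)
{
    assert(slot < TOY_KERNEL_SLOTS);
    return toy_current ? toy_current->kernel_pgd[slot] : NULL;
}
```

After toy_switch_mm() to another MM, the shared slots resolve to the same
entries as before, while the secret slot of the previous MM is simply no
longer reachable.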

> 
> What I like about the memfd_secret() approach is its simplicity and that it
> is arch-agnostic; what I don't like is the increased attack surface from
> using VMAs to track those allocations.

Yes, but memfd_secret() was really designed for user space to hold 
secrets. But I can see how you came to this solution.
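
For reference, the user-space usage that memfd_secret() was designed for
looks roughly like the sketch below. Assumptions: a Linux system whose kernel
provides memfd_secret() (it may be missing, disabled or blocked by a sandbox,
which the helper reports as "unavailable"); the fallback syscall number 447 is
the value used on x86-64 and arm64 when the libc headers do not define it; and
secretmem_roundtrip() is a made-up helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447 /* assumption: value on x86-64 and arm64 */
#endif

/*
 * Returns  1 if a secret page was created, written and read back,
 *          0 if secretmem is unavailable in this environment,
 *         -1 if the page was mapped but the data did not round-trip.
 */
int secretmem_roundtrip(void)
{
    static const char msg[] = "top secret";
    long page = sysconf(_SC_PAGESIZE);
    int fd, ret;
    char *p;

    fd = (int)syscall(SYS_memfd_secret, 0U);
    if (fd < 0)
        return 0;                       /* ENOSYS, EPERM, ... */

    if (page <= 0 || ftruncate(fd, page) != 0) {
        close(fd);
        return 0;
    }

    /* secretmem only supports shared mappings */
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping keeps the page alive */
    if (p == MAP_FAILED)
        return 0;

    /* User space can use the page like ordinary memory. */
    memcpy(p, msg, sizeof(msg));
    ret = memcmp(p, msg, sizeof(msg)) == 0 ? 1 : -1;

    munmap(p, (size_t)page);
    return ret;
}
```

The point of the sketch is the direction of access: user space reads and
writes the page directly, which is the opposite of what this RFC wants.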

> 
> I'm thinking of working on a PoC to implement the first approach of using a
> non-shared kernel PGD for secret memory allocations on arm64. This includes
> adding a kernel page table per process in which all PGDs are shared except
> one, which will be used for mapping secret allocations, and handling fork and
> context switching (TTBR1 switching(?)) correctly for the secret memory PGD.
> 
> What do you think? I'd really appreciate opinions and possible ways forward.

Naive question: does arm64 rather resemble the s390x model or the x86-64 
model?

-- 
Cheers,

David / dhildenb