linux-kernel - Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5f9ba14a-909b-4b49-b1de-3dc98b31aee0@redhat.com>
Date: Fri, 18 Oct 2024 20:52:43 +0200
From: David Hildenbrand <david@...hat.com>
To: Fares Mehanna <faresx@...zon.de>
Cc: akpm@...ux-foundation.org, ardb@...nel.org, arnd@...db.de,
 bhelgaas@...gle.com, broonie@...nel.org, catalin.marinas@....com,
 james.morse@....com, javierm@...hat.com, jean-philippe@...aro.org,
 joey.gouly@....com, kristina.martsenko@....com, kvmarm@...ts.linux.dev,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, mark.rutland@....com, maz@...nel.org, mediou@...zon.de,
 nh-open-source@...zon.com, oliver.upton@...ux.dev, ptosi@...gle.com,
 rdunlap@...radead.org, rkagan@...zon.de, rppt@...nel.org,
 shikemeng@...weicloud.com, suzuki.poulose@....com, tabba@...gle.com,
 will@...nel.org, yuzenghui@...wei.com
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use
 it

On 11.10.24 16:25, Fares Mehanna wrote:
>>>
>>>
>>>> On 11. Oct 2024, at 14:36, Mediouni, Mohamed <mediou@...zon.de> wrote:
>>>>
>>>>
>>>>
>>>>> On 11. Oct 2024, at 14:04, David Hildenbrand <david@...hat.com> wrote:
>>>>>
>>>>> On 10.10.24 17:52, Fares Mehanna wrote:
>>>>>>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>>>>>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>>>>>>> current and future speculation-based cross-process attacks.  We still believe
>>>>>>>> this is a nice thing to have.
>>>>>>>>
>>>>>>>> However, in the time passed since that post Linux mm has grown quite a few new
>>>>>>>> goodies, so we'd like to explore possibilities to implement this functionality
>>>>>>>> with less effort and churn leveraging the now available facilities.
>>>>>>>>
>>>>>>>> An RFC was posted few months back [2] to show the proof of concept and a simple
>>>>>>>> test driver.
>>>>>>>>
>>>>>>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>>>>>>> piggy-backing on memfd_secret(), using regular user addresses but pinning the
>>>>>>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>>>>>>> directly accessible from kernel.
>>>>>>>> In addition to that we are submitting 5 patches to use the secret memory to hide
>>>>>>>> the vCPU gp-regs and fp-regs on arm64 VHE systems.
>>>>>>>
>>>>>>> I'm a bit lost on what exactly we want to achieve. The point where we
>>>>>>> start flipping user/supervisor flags confuses me :)
>>>>>>>
>>>>>>> With secretmem, you'd get memory allocated that
>>>>>>> (a) Is accessible by user space -- mapped into user space.
>>>>>>> (b) Is inaccessible by kernel space -- not mapped into the direct map
>>>>>>> (c) GUP will fail, but copy_from / copy_to user will work.
>>>>>>>
>>>>>>>
>>>>>>> Another way, without secretmem, would be to consider these "secrets"
>>>>>>> kernel allocations that can be mapped into user space using mmap() of a
>>>>>>> special fd. That is, they wouldn't have their origin in secretmem, but
>>>>>>> in KVM as a kernel allocation. It could be achieved by using VM_MIXEDMAP
>>>>>>> with vm_insert_pages(), manually removing them from the directmap.
>>>>>>>
>>>>>>> But, I am not sure who is supposed to access what. Let's explore the
>>>>>>> requirements. I assume we want:
>>>>>>>
>>>>>>> (a) Pages accessible by user space -- mapped into user space.
>>>>>>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>>>>>>> (c) GUP to fail (no direct map).
>>>>>>> (d) copy_from / copy_to user to fail?
>>>>>>>
>>>>>>> And on top of that, some way to access these pages on demand from kernel
>>>>>>> space? (temporary CPU-local mapping?)
>>>>>>>
>>>>>>> Or how would the kernel make use of these allocations?
>>>>>>>
>>>>>>> -- 
>>>>>>> Cheers,
>>>>>>>
>>>>>>> David / dhildenb
>>>>>> Hi David,
>>>>>
>>>>> Hi Fares!
>>>>>
>>>>>> Thanks for taking a look at the patches!
>>>>>> We're trying to allocate a kernel memory that is accessible to the kernel but
>>>>>> only when the context of the process is loaded.
>>>>>> So this is a kernel memory that is not needed to operate the kernel itself, it
>>>>>> is to store & process data on behalf of a process. The requirement for this
>>>>>> memory is that it would never be touched unless the process is scheduled on this
>>>>>> core. otherwise any other access will crash the kernel.
>>>>>> So this memory should only be directly readable and writable by the kernel, but
>>>>>> only when the process context is loaded. The memory shouldn't be readable or
>>>>>> writable by the owner process at all.
>>>>>> This is basically done by removing those pages from kernel linear address and
>>>>>> attaching them only in the process mm_struct. So during context switching the
>>>>>> kernel loses access to the secret memory scheduled out and gain access to the
>>>>>> new process secret memory.
>>>>>> This generally protects against speculation attacks, and if other process managed
>>>>>> to trick the kernel to leak data from memory. In this case the kernel will crash
>>>>>> if it tries to access other processes secret memory.
>>>>>> Since this memory is special in the sense that it is kernel memory but only make
>>>>>> sense in the term of the owner process, I tried in this patch series to explore
>>>>>> the possibility of reusing memfd_secret() to allocate this memory in user virtual
>>>>>> address space, manage it in a VMA, flipping the permissions while keeping the
>>>>>> control of the mapping exclusively with the kernel.
>>>>>> Right now it is:
>>>>>> (a) Pages not accessible by user space -- even though they are mapped into user
>>>>>>      space, the PTEs are marked for kernel usage.
>>>>>
>>>>> Ah, that is the detail I was missing, now I see what you are trying to achieve, thanks!
>>>>>
>>>>> It is a bit architecture specific, because ... imagine architectures that have separate kernel+user space page table hierarchies, and not a simple PTE flag
>> to change access permissions between kernel/user space.
>>>>>
>>>>> IIRC s390 is one such architecture that uses separate page tables for the user-space + kernel-space portions.
>>>>>
>>>>>> (b) Pages accessible by kernel space -- even though they are not mapped into the
>>>>>>      direct map, the PTEs in uvaddr are marked for kernel usage.
>>>>>> (c) copy_from / copy_to user won't fail -- because it is in the user range, but
>>>>>>      this can be fixed by allocating specific range in user vaddr to this feature
>>>>>>      and check against this range there.
>>>>>> (d) The secret memory vaddr is guessable by the owner process -- that can also
>>>>>>      be fixed by allocating bigger chunk of user vaddr for this feature and
>>>>>>      randomly placing the secret memory there.
>>>>>> (e) Mapping is off-limits to the owner process by marking the VMA as locked,
>>>>>>      sealed and special.
>>>>>
>>>>> Okay, so in this RFC you are jumping through quite some hoops to have a kernel allocation unmapped from the direct map but mapped into a per-process page
>> table only accessible by kernel space. :)
>>>>>
>>>>> So you really don't want this mapped into user space at all (consequently, no GUP, no access, no copy_from_user ...). In this RFC it's mapped but turned
>> inaccessible by flipping the "kernel vs. user" switch.
>>>>>
>>>>>> Other alternative (that was implemented in the first submission) is to track those
>>>>>> allocations in a non-shared kernel PGD per process, then handle creating, forking
>>>>>> and context-switching this PGD.
>>>>>
>>>>> That sounds like a better approach. So we would remove the pages from the shared kernel direct map and map them into a separate kernel-portion in the per-MM
>> page tables?
>>>>>
>>>>> Can you envision that would also work with architectures like s390x? I assume we would not only need the per-MM user space page table hierarchy, but also a
>> per-MM kernel space page table hierarchy, into which we also map the common/shared-among-all-processes kernel space page tables (e.g., directmap).
>>>> Yes, that’s also applicable to arm64. There’s currently no separate per-mm user space page hierarchy there.
>>> typo, read kernel
>>
>>
>> Okay, thanks. So going into that direction makes more sense.
>>
>> I do wonder if we really have to deal with fork() ... if the primary
>> users don't really have meaning in the forked child (e.g., just like
>> fork() with KVM IIRC) we might just get away by "losing" these
>> allocations in the child process.
>>
>> Happy to learn why fork() must be supported.
> 
> It really depends on the use cases of the kernel secret allocation, but in my
> mind a troubling scenario:
> 1. Process A had a resource X.
> 2. Kernel decided to keep some data related to resource X in process A secret
>     memory.
> 3. Process A decided to fork, now process B share the resource X.
> 4. Process B started using resource X. <-- This will crash the kernel as the
>     used kernel page table on process B has no mapping for the secret memory used
>     in resource X.
> 
> I haven't tried to trigger this crash myself though.
> 

Right, and if we can rule out any users that are supposed to work after 
fork(), we can just disregard that in the first version.

I never played with this, but let's assume you make use of these 
mm-local allocations in KVM context.

What would happens if you fork() with a KVM fd and try accessing that fd 
from the other process using ioctls? I recall that KVM will not be 
"duplicated".

What would happen if you send that fd over to a completely different 
process and try accessing that fd from the other process using ioctls?

Of course, question being: if you have MM-local allocations in both 
cases and there is suddenly a different MM ... assuming that both cases 
are even possible (if they are not possible, great! :) ).

I think I am supposed to know if these things are possible or not and 
what would happen, but it's late Friday and my brain is begging for some 
Weekend :D

> I didn't think in depth about this issue yet, but I need to because duplicating
> the secret memory mappings in the new forked process is easy (To give kernel
> access on the secret memory), but tearing them down across all forked processes
> is a bit complicated (To clean stale mappings on parent/child processes). Right
> now tearing down the mapping will only happen on mm_struct which allocated the
> secret memory.

If an allocation is MM-local, I would assume that fork() would 
*duplicate* that allocation (leaving CoW out of the picture :D ), but 
that's where the fun begins (see above regarding my confusion about KVM 
and fork() behavior ... ).

-- 
Cheers,

David / dhildenb