Message-ID: <f042dcf3-10b9-4b58-9c98-5b83910ab188@asahilina.net>
Date: Fri, 7 Feb 2025 04:27:03 +0900
From: Asahi Lina <lina@...hilina.net>
To: David Hildenbrand <david@...hat.com>, Zi Yan <ziy@...dia.com>
Cc: Miguel Ojeda <ojeda@...nel.org>, Alex Gaynor <alex.gaynor@...il.com>,
 Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
 Björn Roy Baron <bjorn3_gh@...tonmail.com>,
 Benno Lossin <benno.lossin@...ton.me>,
 Andreas Hindborg <a.hindborg@...nel.org>, Alice Ryhl <aliceryhl@...gle.com>,
 Trevor Gross <tmgross@...ch.edu>, Jann Horn <jannh@...gle.com>,
 Matthew Wilcox <willy@...radead.org>, Paolo Bonzini <pbonzini@...hat.com>,
 Danilo Krummrich <dakr@...nel.org>, Wedson Almeida Filho
 <wedsonaf@...il.com>, Valentin Obst <kernel@...entinobst.de>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 airlied@...hat.com, Abdiel Janulgue <abdiel.janulgue@...il.com>,
 rust-for-linux@...r.kernel.org, linux-kernel@...r.kernel.org,
 asahi@...ts.linux.dev, Oscar Salvador <osalvador@...e.de>,
 Muchun Song <muchun.song@...ux.dev>
Subject: Re: [PATCH 0/6] rust: page: Support borrowing `struct page` and
 physaddr conversion



On 2/7/25 4:18 AM, Asahi Lina wrote:
> 
> 
> On 2/7/25 2:58 AM, David Hildenbrand wrote:
>> On 04.02.25 22:06, Asahi Lina wrote:
>>>
>>>
>>> On 2/5/25 5:10 AM, David Hildenbrand wrote:
>>>> On 04.02.25 18:59, Asahi Lina wrote:
>>>>> On 2/4/25 11:38 PM, David Hildenbrand wrote:
>>>>>>>>> If the answer is "no" then that's fine. It's still an unsafe
>>>>>>>>> function
>>>>>>>>> and we need to document in the safety section that it should
>>>>>>>>> only be
>>>>>>>>> used for memory that is either known to be allocated and pinned and
>>>>>>>>> will
>>>>>>>>> not be freed while the `struct page` is borrowed, or memory that is
>>>>>>>>> reserved and not owned by the buddy allocator, so in practice
>>>>>>>>> correct
>>>>>>>>> use would not be racy with memory hot-remove anyway.
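
For concreteness, the sort of safety contract being described could be
spelled out roughly like this (standalone sketch; `Page`, `phys_to_page()`
and `borrow_phys_unchecked()` are made-up stand-ins, not names from this
series):

    // Stand-ins for the real kernel types; only the shape of the safety
    // contract matters here.
    #[repr(transparent)]
    pub struct Page(core::cell::UnsafeCell<[u8; 64]>);

    fn phys_to_page(paddr: usize) -> *const Page {
        // Stand-in for the kernel's physaddr -> struct page conversion.
        paddr as *const Page
    }

    /// # Safety
    ///
    /// For the whole lifetime `'a`, the memory at `paddr` must be either
    /// (a) allocated and pinned, and guaranteed not to be freed while
    /// the returned borrow is live, or (b) reserved memory not owned by
    /// the buddy allocator, so the lookup cannot race with hot-remove.
    pub unsafe fn borrow_phys_unchecked<'a>(paddr: usize) -> &'a Page {
        // SAFETY: the caller promises (a) or (b) above.
        unsafe { &*phys_to_page(paddr) }
    }
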
>>>>>>>>>
>>>>>>>>> This is already the case for the drm/asahi use case, where the pfns
>>>>>>>>> looked up will only ever be one of:
>>>>>>>>>
>>>>>>>>> - GEM objects that are mapped to the GPU and whose physical
>>>>>>>>> pages are
>>>>>>>>> therefore pinned (and the VM is locked while this happens so the
>>>>>>>>> objects
>>>>>>>>> cannot become unpinned out from under the running code),
>>>>>>>>
>>>>>>>> How exactly are these pages pinned/obtained?
>>>>>>>
>>>>>>> Under the hood it's shmem. For pinning, it winds up at
>>>>>>> `drm_gem_get_pages()`, which I think does a
>>>>>>> `shmem_read_folio_gfp()` on
>>>>>>> a mapping set as unevictable.
>>>>>>
>>>>>> Thanks. So we grab another folio reference via
>>>>>> shmem_read_folio_gfp()->shmem_get_folio_gfp().
>>>>>>
>>>>>> Hm, I wonder if we might end up holding folios residing in
>>>>>> ZONE_MOVABLE/MIGRATE_CMA longer than we should.
>>>>>>
>>>>>> Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and
>>>>>> makes
>>>>>> sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.
>>>>>>
>>>>>> But that's a different discussion, just pointing it out, maybe I'm
>>>>>> missing something :)
>>>>>
>>>>> I think this is a little over my head. Though I only just realized that
>>>>> we seem to be keeping the GEM objects pinned forever, even after unmap,
>>>>> in the drm-shmem core API (I see no drm-shmem entry point that would
>>>>> allow the sgt to be freed and its corresponding pages ref to be
>>>>> dropped,
>>>>> other than a purge of purgeable objects or final destruction of the
>>>>> object). I'll poke around since this feels wrong, I thought we were
>>>>> supposed to be able to have shrinker support for swapping out whole GPU
>>>>> VMs in the modern GPU MM model, but I guess there's no
>>>>> implementation of
>>>>> that for gem-shmem drivers yet...?
>>>>
>>>> I recall that shrinker as well, ... or at least a discussion around it.
>>>>
>>>> [...]
>>>>
>>>>>>
>>>>>> If it's only for crash dumps etc. that might even be opt-in, it makes
>>>>>> the whole thing a lot less scary. Maybe this could be opt-in
>>>>>> somewhere,
>>>>>> to "unlock" this interface? Just an idea.
>>>>>
>>>>> Just to make sure we're on the same page, I don't think there's
>>>>> anything
>>>>> to unlock in the Rust abstraction side (this series). At the end of the
>>>>> day, if nothing else, the unchecked interface (which the regular
>>>>> non-crash page table management code uses for performance) will let you
>>>>> use any pfn you want, it's up to documentation and human review to
>>>>> specify how it should be used by drivers. What Rust gives us here is
>>>>> the
>>>>> mandatory `unsafe {}`, so any attempts to use this API will necessarily
>>>>> stick out during review as potentially dangerous code that needs extra
>>>>> scrutiny.
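
And to illustrate the review-visibility point: any call site has to carry
its justification right next to a mandatory `unsafe` block, something like
this (a made-up caller, building on the sketch further up in this mail):

    fn dump_pte_target(paddr: usize) -> &'static Page {
        // SAFETY: in this (hypothetical) caller, `paddr` came from a GPU
        // page table whose backing GEM object is pinned and whose VM is
        // locked, so the page cannot be freed or migrated while the
        // borrow is live.
        unsafe { borrow_phys_unchecked(paddr) }
    }
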
>>>>>
>>>>> For the client driver itself, I could gate the devcoredump stuff behind
>>>>> a module parameter or something... but I don't think it's really worth
>>>>> it. We don't have a way to reboot the firmware or recover from this
>>>>> condition (platform limitations), so end users are stuck rebooting to
>>>>> get back a usable machine anyway. If something goes wrong in the
>>>>> crashdump code and the machine oopses or locks up worse... it doesn't
>>>>> really make much of a difference for normal end users. I don't think
>>>>> this will ever really happen given the constraints I described, but if
>>>>> somehow it does (some other bug somewhere?), well... the machine was
>>>>> already in an unrecoverable state anyway.
>>>>>
>>>>> It would be nice to have userspace tooling deployed by default that
>>>>> saves off the devcoredump somewhere, so we can have a chance at
>>>>> debugging hard-to-hit firmware crashes... if it's opt-in, it would only
>>>>> really be useful for developers and CI machines.
>>>>
>>>> Is this something that possibly kdump can save or analyze? Because that
>>>> is our default "oops, kernel crashed, let's dump the old content so we
>>>> can analyze it" mechanism on production systems.
>>>
>>> kdump does not work on Apple ARM systems because kexec is broken and
>>> cannot be fully fixed, due to multiple platform/firmware limitations. A
>>> very limited version of kexec might work well enough for kdump, but I
>>> don't think anyone has looked into making that work yet...
>>>
>>>> but ... I am not familiar with devcoredump. So I don't know when/how it
>>>> runs, and if the source system is still alive (and remains alive --  in
>>>> contrast to a kernel crash).
>>>
>>> Devcoredump just makes the dump available via /sys so it can be
>>> collected by the user. The system is still alive, the GPU is just dead
>>> and all future GPU job submissions fail. You can still SSH in or (at
>>> least in theory, if enough moving parts are graceful about it) VT-switch
>>> to a TTY. The display controller is not part of the GPU, it is separate
>>> hardware.
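
(For reference, collecting a dump really is just reading a file under
sysfs; roughly the following, assuming the standard devcoredump class
device layout, with the devcd index depending on how many dumps have been
produced since boot:)

    use std::fs;

    fn main() -> std::io::Result<()> {
        // devcoredump exposes each pending dump as a class device with a
        // `data` attribute; reading it is all a collector needs to do.
        let data = fs::read("/sys/class/devcoredump/devcd1/data")?;
        fs::write("gpu-coredump.bin", &data)?;
        Ok(())
    }
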
>>
>>
>> Thanks for all the details (and sorry for the delay, I'm on PTO until
>> Monday ... :)
>>
>> (regarding the other mail) Adding that stuff to rust just so we have a
>> devcoredump that ideally wouldn't exist is a bit unfortunate.
>>
>> So I'm curious: we do have /proc/kcore, where we do all of the required
>> filtering, only allowing for reading memory that is online, not
>> hwpoisoned etc.
>>
>> makedumpfile already supports /proc/kcore.
>>
>> Would it be possible to avoid devcoredump completely, either by dumping
>> /proc/kcore directly or by having a user-space script that walks the page
>> tables to dump the content purely based on /proc/kcore?
>>
>> If relevant memory ranges are inaccessible from /proc/kcore, we could
>> look into exposing them.
> 
> I'm not sure that's a good idea... the dump code runs when the GPU
> crashes, and makes copies of all the memory pages into newly allocated
> pages (this is around 16MB for a typical dump, and if allocation fails
> we just bail and clean up). Then userspace can read the coredump at its
> leisure. AIUI, this is exactly the intended use case of devcoredump. It
> also means that anyone can grab a core dump with just a `cp`, without
> needing any bespoke tools.
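
Roughly this shape, for what it's worth (made-up names throughout; a plain
Vec with try_reserve stands in for the kernel-side allocation):

    struct DumpRegion {
        paddr: usize,
        len: usize,
    }

    const PAGE_SIZE: usize = 4096;

    fn read_page(_paddr: usize) -> [u8; PAGE_SIZE] {
        // Stand-in for borrowing the page and copying out its contents.
        [0u8; PAGE_SIZE]
    }

    /// Copy every page of every region into freshly allocated memory.
    /// If allocation fails, return None so the caller can simply clean
    /// up instead of emitting a partial snapshot.
    fn snapshot(regions: &[DumpRegion]) -> Option<Vec<u8>> {
        let total: usize = regions.iter().map(|r| r.len).sum();
        let mut out = Vec::new();
        out.try_reserve_exact(total).ok()?;
        for r in regions {
            let mut off = 0;
            while off < r.len {
                out.extend_from_slice(&read_page(r.paddr + off));
                off += PAGE_SIZE;
            }
        }
        Some(out)
    }
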
> 
> After the snapshot is taken, the kernel will complete (fail) all GPU
> jobs, which means much of the shared memory will be freed and some
> structures will change contents. If we defer the coredump to userspace,
> then it would not be able to capture the state of all relevant memory
> exactly at the crash time, which could be very confusing.
> 
> In theory I could change the allocators to not free or touch anything
> after a crash, and add guards to any mutations in the driver to avoid
> any changes after a crash... but that feels a lot more brittle and
> error-prone than just taking the core dump at the right time.
> 

If the arbitrary page lookups are that big a problem, I think I would
rather just memremap all the bootloader-mapped firmware areas, hook
into all the allocators to provide a backdoor into the backing objects,
and just piece everything together by mapping page addresses to those.
It would be a bunch of extra code and scaffolding in the driver, and
require device tree and bootloader changes to link up the GPU node to
its firmware nodes, but it's still better than trying to do it all from
userspace IMO...
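
Something like this, conceptually (made-up types; the real thing would hold
memremapped firmware ranges plus references into the allocators' backing
objects):

    // Resolve a physical address against a fixed set of known regions
    // instead of borrowing an arbitrary pfn.
    struct Region {
        phys_start: usize,
        len: usize,
        // Would be the memremapped mapping or the backing object in the
        // real driver; a base pointer stands in here.
        virt_base: *const u8,
    }

    struct RegionMap {
        regions: Vec<Region>,
    }

    impl RegionMap {
        /// Return a pointer into one of the known regions, or None if
        /// the address is not covered (the dump code would then skip it
        /// rather than touch an arbitrary page).
        fn lookup(&self, paddr: usize) -> Option<*const u8> {
            self.regions.iter().find_map(|r| {
                let off = paddr.checked_sub(r.phys_start)?;
                (off < r.len).then(|| unsafe { r.virt_base.add(off) })
            })
        }
    }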

~~ Lina

