[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wi5pLoe3szxLREQGGJuWU0w_POK9Sv6717UH3b7OvvfjQ@mail.gmail.com>
Date: Mon, 31 Mar 2025 12:12:08 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
Masami Hiramatsu <mhiramat@...nel.org>, Mark Rutland <mark.rutland@....com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Andrew Morton <akpm@...ux-foundation.org>,
Vincent Donnefort <vdonnefort@...gle.com>, Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Kees Cook <kees@...nel.org>, Tony Luck <tony.luck@...el.com>,
"Guilherme G. Piccoli" <gpiccoli@...lia.com>, linux-hardening@...r.kernel.org
Subject: Re: [PATCH v2 1/2] tracing: ring-buffer: Have the ring buffer code do
the vmap of physical memory
On Mon, 31 Mar 2025 at 10:38, Steven Rostedt <rostedt@...dmis.org> wrote:
>
> I just did this to be robust in case what was passed in was not aligned. In
> all my use cases, it is.
I really think that ALIGN() is absolutely the opposite of robust.
It silently just makes incorrect values generate incorrect code.
Robust is dealing correctly with incorrect values, not making it
incorrect data just create even more incorrect data.
> OK, so I did copy this from fs/pstore/ram_core.c as this does basically the
> same thing as pstore. And it looks like pstore should be updated too.
Please just stop copying code from random drivers or filesystems.
Christ.
I've said this before: core kernel code has higher quality
requirements than random drivers (or even random filesystems).
You can't copy crazy incorrect snippets from random sources and put it
in core kernel code.
> This is due to what the "reserve_mem" kernel command line feature gives
> back. It reserves physical memory that is not part of the kernel memory
> allocator (it also works with memmap ie. memmap=12M$0x284500000, which also
> requires physical memory addresses that are not part of the memory
> allocator).
Yeah, so in that case turning those into "struct page *" is
horrendously wrong, because you've literally taken it away from the
memory management.
If it's not part of the kernel memory allocator, it does not
necessarily have a "struct page *" associated with it. Using a pointer
to 'struct page' and passing it off is just fundamentally more than a
bug: it's seriously and deeply broken.
It's probably much more obviously broken if the physical addresses
came from a PCI device, and this all just happens to work because it's
actually real RAM and we ended up having a 'struct page []' for it for
some random reason.
But the generral rule is that if all you have physical RAM addresses,
you can use them as phys_addr_t (or turn them into pfn's, which is
just the page index of the physical address).
I think that *completely* bogus code worked only entirely due to luck
- the 'struct page *' was never actually dereferenced, and it got
turned back into a pfn and then a page table entry by the vmap code
using purely address arithmetic (page_to_phys() and page_to_pfn()).
So it probably only ever used pointer calculations, although some of
those actually can end up depending on dereferencing 'struct page' (to
find the base zone of the page, to find the mapping).
Add even with just pointer games, that should actually have triggered
a warning in vmap_pages_pte_range() from this:
if (WARN_ON(!pfn_valid(page_to_pfn(page))))
which I suspect means that we actually *do* end up having 'struct
page' backing for those..
Looking closer...
I think what happened is that reserve_mem() actually does end up
giving you pages that have the 'struct page' backing store even if
they aren't mapped. Which makes the 'struct page' pointer stuff work,
and those pages end up working as 'reserved pages'.
And I suspect that happens exactly because we had users that mis-used
these 'struct page *' things, so it might even be intentional.
Or it might just be because that memory *has* been in the E280 memory
map before it was reserved, and the 'strict page' arrays may have been
sized for the original E280 information, not the "after stealing"
information.
I didn't go look into that early memory reservation code, but I would
not be surprised if there have been fixups to work around bugs like
this. I haven't had to look at it in ages, but maybe the bootmem code
knows that bootmem allocations then might be given back to the VM and
need to have those 'struct page' backing store things
Or - and this is also entirely possible - maybe you were just very
lucky indeed because the code to allocate the 'struct page' regions
ends up intentionally over-allocating and rounding things up.
Because of how the buddy allocator works, each 'struct page' array
needs to be rounded up to the maximum page allocation, so I think we
always allocate the page arrays of a memory zone up to at least
PAGE_SIZE << NR_PAGE_ORDERS boundaries (so 8MB-aligned, I think). So
there's fundamnetally some slop in there.
Anyway, if you allocated them as physical addresses outside of the
regular MM code, you should not use 'struct page' in *any* form. You
really can't just turn any random physical address into a page, unless
it came from the page allocator.
And yes, I'm not at all surprised that we'd have other mis-users of
this. We have some very historical code that depends on reserved pages
going back all the way to linux-0.01 I suspect, because things like
the original VGA code knew that the physical addresses were in the
BIOS hole region that was backed by 'struct page'.
Linus
Powered by blists - more mailing lists