Message-ID: <87o75y428z.fsf@nvdebian.thelocal>
Date: Mon, 12 Aug 2024 17:41:52 +1000
From: Alistair Popple <apopple@...dia.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Dan Williams <dan.j.williams@...el.com>, dave.hansen@...ux.intel.com,
luto@...nel.org, peterz@...radead.org, max8rr8@...il.com,
linux-kernel@...r.kernel.org, x86@...nel.org, jhubbard@...dia.com,
	Kees Cook <keescook@...omium.org>
Subject: Re: [PATCH 1/1] x86/ioremap: Use is_vmalloc_addr in iounmap

Thomas Gleixner <tglx@...utronix.de> writes:
> On Thu, Aug 08 2024 at 20:55, Dan Williams wrote:
>> Alistair Popple wrote:
>>> Dan Williams <dan.j.williams@...el.com> writes:
>>> > Thomas Gleixner wrote:
>>> >> Wait. MAX_PHYSMEM_BITS is either 46 (4-level) or 52 (5-level), which is
>>> >> fully covered by the identity map space.
>>> >>
>>> >> So even if the search starts from top of that space, how do we end up
>>> >> with high_memory > VMALLOC_START?
>>> >>
>>> >> That does not make any sense at all
>>> >
>>> > Max or Alistair, can you provide more details of how private memory
>>> > spills over into the VMALLOC space on these platforms?
>>>
>>> Well I was hoping pleading ignorance on x86 memory maps would get me out
>>> of having to look too deeply :-) But alas...
>>>
>>> It appears the problem originates in KASLR which can cause the VMALLOC
>>> region to overlap with the top of the linear map.
>>>
>>> > I too would have thought that MAX_PHYSMEM_BITS protects against this?
>>>
>>> Me too, until about an hour ago. As noted above,
>>> request_free_mem_region() allocates downwards from
>>> (1 << MAX_PHYSMEM_BITS) - 1. Therefore VMALLOC_START needs to be
>>> greater than PAGE_OFFSET + (1 << MAX_PHYSMEM_BITS) - 1. However, the
>>> default configuration for KASLR as set by
>>> RANDOMIZE_MEMORY_PHYSICAL_PADDING only provides 10TB above whatever
>>> max_pfn is set to at boot time (and even then only if
>>> CONFIG_MEMORY_HOTPLUG is enabled).
>
> Duh.
>
>>> Obviously ZONE_DEVICE memory ends up being way above that and crosses
>>> into the VMALLOC region.
>
> So we need to go through all usage sites of MAX_PHYSMEM_BITS and fix
> them up...
>
>>> @@ -2277,6 +2277,7 @@ config RANDOMIZE_MEMORY_PHYSICAL_PADDING
>>> depends on RANDOMIZE_MEMORY
>>> default "0xa" if MEMORY_HOTPLUG
>>> default "0x0"
>>> + range 0x40 0x40 if GET_FREE_REGION
>>> range 0x1 0x40 if MEMORY_HOTPLUG
>>> range 0x0 0x40
>>> help
>
> That reduces the entropy to the minimum and papers over the problem with
> 4-level paging, but it won't solve the problem on systems with 5-level
> paging.

Actually, I've never observed this problem on systems with 5-level
page tables, although I haven't tested extensively there. I don't know
of any reason we wouldn't hit it there though, so I guess with the
increased entropy I've just been lucky.
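
FWIW a boot-time sanity check along these lines (a completely untested
sketch, and the function is my own invention) would at least make the
overlap obvious whenever it does happen:

/* Warn if the top of the physical address space, once mapped through
 * the direct map, lands inside the vmalloc area. */
static int __init check_direct_map_overlap(void)
{
	u64 top = PAGE_OFFSET + (1ULL << MAX_PHYSMEM_BITS) - 1;

	if (top >= VMALLOC_START)
		pr_warn("direct map top %#llx overlaps VMALLOC_START %#lx\n",
			top, VMALLOC_START);
	return 0;
}
late_initcall(check_direct_map_overlap);
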
> There the 64T of padding are just not cutting it. MAX_PHYSMEM_BITS is 52
> for 5 level ....
>
>> @@ -1824,10 +1824,11 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size,
>> resource_size_t align, unsigned long flags)
>> {
>> if (flags & GFR_DESCENDING) {
>> + u64 kaslr_pad = (u64)CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING << 40;
>> resource_size_t end;
>>
>> end = min_t(resource_size_t, base->end,
>> - (1ULL << MAX_PHYSMEM_BITS) - 1);
>> + (1ULL << MAX_PHYSMEM_BITS) - kaslr_pad - 1);
>> return end - size + 1;
>> }
>
> This needs a fixup of gfr_continue() too, but it does not work at
> all. The size of the direct map is calculated as:
>
> memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
>
> That size is used to calculate the address above which the vmalloc area
> is placed.
>
> So assume 6T RAM installed + 10T space for hotplug memory. That means
> the vmalloc area can start right above 16T. But 64T - 10T = 54T which is
> slightly larger than 16T :)
>
> I came up with the uncompiled below. I fixed up the obvious places, but
> did not go through all usage sites of MAX_PHYSMEM_BITS yet.

I looked at more of the MAX_PHYSMEM_BITS usage today but nothing else
stood out as obviously wrong.
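
(For the record that was just eyeballing the output of something like
"git grep -n MAX_PHYSMEM_BITS arch/ include/ kernel/ mm/", so I may
well have missed something.)
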
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -187,6 +187,8 @@ extern unsigned int ptrs_per_p4d;
> #define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN)
> #endif /* CONFIG_KMSAN */
>
> +#define DIRECT_MAP_END (VMALLOC_START - 1)

If I'm understanding the KASLR implementation correctly then this
doesn't seem quite right - the entropy means there is a hole between
the end of the direct map (ie. the range PAGE_OFFSET to
PAGE_OFFSET + kaslr_regions[0].size_tb) and VMALLOC_START, which
shouldn't be used.

In practice hotplugging memory into that range probably works, but it
seems like it would set us up for future bugs. It also means memory
hotplug could fail intermittently based on the per-boot entropy.

For example, with CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING set to 10T,
one could imagine hotplugging 16T mostly working, except on those
boots where KASLR happens to place VMALLOC_START close to the end of
the direct map.

Therefore, to keep memory hotplug deterministic, I think it would be
better to define DIRECT_MAP_END as a variable that KASLR sets/updates.
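
Something like this is what I had in mind (a completely untested
sketch, and the variable name is purely illustrative):

/* Maximum physical address covered by the direct map. Defaults to
 * the MAX_PHYSMEM_BITS limit and is lowered by KASLR once it has
 * sized the direct map region. */
unsigned long direct_map_end __ro_after_init =
	(1UL << MAX_PHYSMEM_BITS) - 1;

/* ... and at the end of kernel_randomize_memory(), once
 * kaslr_regions[0].size_tb is final: */
	direct_map_end = (kaslr_regions[0].size_tb << TB_SHIFT) - 1;
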
> +
> #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
> /* The module sections ends with the start of the fixmap */
> #ifndef CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -97,6 +97,10 @@ extern const int mmap_rnd_compat_bits_ma
> extern int mmap_rnd_compat_bits __read_mostly;
> #endif
>
> +#ifndef DIRECT_MAP_END
> +# define DIRECT_MAP_END ((1UL << MAX_PHYSMEM_BITS) - 1)
> +#endif
> +
> #include <asm/page.h>
> #include <asm/processor.h>
>
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1826,8 +1826,7 @@ static resource_size_t gfr_start(struct
> if (flags & GFR_DESCENDING) {
> resource_size_t end;
>
> - end = min_t(resource_size_t, base->end,
> - (1ULL << MAX_PHYSMEM_BITS) - 1);
> + end = min_t(resource_size_t, base->end, DIRECT_MAP_END);

This does not look right to me - isn't DIRECT_MAP_END a virtual
address, whereas the resource ranges are physical addresses? Ie. I
think this should be:

+ end = min_t(resource_size_t, base->end, __pa(DIRECT_MAP_END));

The same applies to the rest of the DIRECT_MAP_END users here. Perhaps
it would be better to define this as DIRECT_MAP_SIZE and calculate it
based off PAGE_OFFSET instead?
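
Something like this perhaps (untested):

+/* Size of the direct map in bytes; physical address users can then
+ * use DIRECT_MAP_SIZE - 1 as their limit directly. */
+#define DIRECT_MAP_SIZE (VMALLOC_START - PAGE_OFFSET)

and then in gfr_start():

+ end = min_t(resource_size_t, base->end, DIRECT_MAP_SIZE - 1);
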
> return end - size + 1;
> }
>
> @@ -1844,8 +1843,7 @@ static bool gfr_continue(struct resource
> * @size did not wrap 0.
> */
> return addr > addr - size &&
> - addr <= min_t(resource_size_t, base->end,
> - (1ULL << MAX_PHYSMEM_BITS) - 1);
> + addr <= min_t(resource_size_t, base->end, DIRECT_MAP_END);
> }
>
> static resource_size_t gfr_next(resource_size_t addr, resource_size_t size,
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1681,7 +1681,7 @@ struct range __weak arch_get_mappable_ra
>
> struct range mhp_get_pluggable_range(bool need_mapping)
> {
> - const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1;
> + const u64 max_phys = DIRECT_MAP_END;
> struct range mhp_range;
>
> if (need_mapping) {
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -129,7 +129,7 @@ static inline int sparse_early_nid(struc
> static void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
> unsigned long *end_pfn)
> {
> - unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> + unsigned long max_sparsemem_pfn = (DIRECT_MAP_END + 1) >> PAGE_SHIFT;
>
> /*
> * Sanity checks - do not allow an architecture to pass