Message-ID: <Zh6KNglOu8mpTPHE@kernel.org>
Date: Tue, 16 Apr 2024 17:24:54 +0300
From: Mike Rapoport <rppt@...nel.org>
To: Björn Töpel <bjorn@...nel.org>
Cc: Christian Brauner <brauner@...nel.org>, Nam Cao <namcao@...utronix.de>,
	Andreas Dilger <adilger@...ger.ca>,
	Al Viro <viro@...iv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	Jan Kara <jack@...e.cz>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-riscv@...ts.infradead.org, Theodore Ts'o <tytso@....edu>,
	Ext4 Developers List <linux-ext4@...r.kernel.org>,
	Conor Dooley <conor@...nel.org>,
	"Matthew Wilcox (Oracle)" <willy@...radead.org>,
	Anders Roxell <anders.roxell@...aro.org>,
	Alexandre Ghiti <alex@...ti.fr>
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Hi,

On Tue, Apr 16, 2024 at 01:02:20PM +0200, Björn Töpel wrote:
> Christian Brauner <brauner@...nel.org> writes:
> 
> > [Adding Mike who's knowledgeable in this area]
> 
> >> > Further, it seems like riscv32 indeed inserts a page like that to the
> >> > buddy allocator, when the memblock is free'd:
> >> > 
> >> >   | [<c024961c>] __free_one_page+0x2a4/0x3ea
> >> >   | [<c024a448>] __free_pages_ok+0x158/0x3cc
> >> >   | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> >> >   | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> >> >   | [<c0c17676>] memblock_free_all+0x1ee/0x278
> >> >   | [<c0c050b0>] mem_init+0x10/0xa4
> >> >   | [<c0c1447c>] mm_core_init+0x11a/0x2da
> >> >   | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> >> > 
> >> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> >> > lucky (unlucky?) that page was used for the page cache.
> >> 
> >> I just educated myself about memory mapping last night, so the below
> >> may be complete nonsense. Take it with a grain of salt.
> >> 
> >> In riscv's setup_bootmem(), we have this line:
> >> 	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> >> 
> >> I think this is the root cause: max_low_pfn indicates the last page
> >> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> >> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> >> to the last page, we get the reported problem.
> >> 
> >> There seems to be some code to make sure the last page is not used
> >> (the call to memblock_set_current_limit() right above this line). It is
> >> unclear to me why this still lets the problem slip through.
> >> 
> >> The fix is simple: never let max_low_pfn gets mapped to the last page.
> >> The below patch fixes the problem for me. But I am not entirely sure if
> >> this is the correct fix, further investigation needed.
> >> 
> >> Best regards,
> >> Nam
> >> 
> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> index fa34cf55037b..17cab0a52726 100644
> >> --- a/arch/riscv/mm/init.c
> >> +++ b/arch/riscv/mm/init.c
> >> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> >>  	}
> >>  
> >>  	min_low_pfn = PFN_UP(phys_ram_base);
> >> -	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> >> +	max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> >> +	max_pfn = PFN_DOWN(phys_ram_end);
> >>  	high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
> >>  
> >>  	dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
> 
> Yeah, AFAIU memblock_set_current_limit() only limits the allocation from
> memblock. The "forbidden" page (PA 0xc03ff000 VA 0xfffff000) will still
> be allowed in the zone.
> 
> I think your patch requires memblock_set_current_limit() is
> unconditionally called, which currently is not done.
> 
> The hack I tried was (which seems to work):
> 
> --
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fe8e159394d8..3a1f25d41794 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
>          */
>         if (!IS_ENABLED(CONFIG_64BIT)) {
>                 max_mapped_addr = __pa(~(ulong)0);
> -               if (max_mapped_addr == (phys_ram_end - 1))
> +               if (max_mapped_addr == (phys_ram_end - 1)) {
>                         memblock_set_current_limit(max_mapped_addr - 4096);
> +                       phys_ram_end -= 4096;
> +               }
>         }

You can just memblock_reserve() the last page of the first gigabyte, e.g.

	if (!IS_ENABLED(CONFIG_64BIT))
		memblock_reserve(SZ_1G - PAGE_SIZE, PAGE_SIZE);

The page will still be mapped, but it will never make it to the page
allocator.

The nice thing about it is that memblock lets you reserve regions that are
not necessarily populated, so there's no need to check where the actual RAM
ends.
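
To illustrate the idea outside the kernel, here is a rough user-space sketch.
It is a toy model, not memblock itself: struct toy_range, toy_overlaps() and
toy_free_all() are hypothetical stand-ins, and the addresses used with it are
only examples. It shows that a reserved page is skipped when memory is handed
to the allocator, and that the same reservation is harmless when RAM ends
below it.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical stand-in for a memblock region: physical base and size. */
struct toy_range {
	uint32_t base;
	uint32_t size;
};

/* Return 1 if the page starting at 'page' lies inside range 'r'. */
static int toy_overlaps(const struct toy_range *r, uint32_t page)
{
	/* 64-bit arithmetic so that base + size cannot wrap around. */
	uint64_t start = r->base;
	uint64_t end = (uint64_t)r->base + r->size;

	return page >= start && page < end;
}

/*
 * Toy analogue of releasing boot memory to the page allocator: walk every
 * page of 'ram', skip pages covered by 'reserved', and count the rest.
 * '*last_freed' receives the address of the last page that was released.
 */
static unsigned int toy_free_all(const struct toy_range *ram,
				 const struct toy_range *reserved,
				 uint32_t *last_freed)
{
	uint64_t end = (uint64_t)ram->base + ram->size;
	unsigned int freed = 0;
	uint64_t pa;

	assert(ram->base % TOY_PAGE_SIZE == 0);

	for (pa = ram->base; pa < end; pa += TOY_PAGE_SIZE) {
		if (toy_overlaps(reserved, (uint32_t)pa))
			continue;	/* reserved: never reaches the allocator */
		*last_freed = (uint32_t)pa;
		freed++;
	}

	return freed;
}
```

For example, take the page at PA 0xc03ff000 quoted above as the reserved
range, and assume (purely for illustration) 4 MiB of RAM starting at
0xc0000000. toy_free_all() then releases every page except that one. If RAM
is instead assumed to end at 0xc0200000, the reservation covers nothing that
is populated and all pages are released, with no check of where RAM ends.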

>  
>         min_low_pfn = PFN_UP(phys_ram_base);
> --
> 
> I'd really like to see an actual MM person (Mike or Alex?) have some
> input here, and not simply my pasta-on-wall approach. ;-)
> 
> 
> Björn

-- 
Sincerely yours,
Mike.
