linux-kernel - Re: next/master boot bisection: next-20190215 on beaglebone-black

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <21d138a5-13e4-9e83-d7fe-e0639a8d180a@collabora.com>
Date:   Thu, 7 Mar 2019 09:16:20 +0000
From:   Guillaume Tucker <guillaume.tucker@...labora.com>
To:     Mike Rapoport <rppt@...ux.ibm.com>
Cc:     Dan Williams <dan.j.williams@...el.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>,
        Mark Brown <broonie@...nel.org>,
        Tomeu Vizoso <tomeu.vizoso@...labora.com>,
        Matt Hart <matthew.hart@...aro.org>,
        Stephen Rothwell <sfr@...b.auug.org.au>, khilman@...libre.com,
        enric.balletbo@...labora.com, Nicholas Piggin <npiggin@...il.com>,
        Dominik Brodowski <linux@...inikbrodowski.net>,
        Masahiro Yamada <yamada.masahiro@...ionext.com>,
        Kees Cook <keescook@...omium.org>,
        Adrian Reber <adrian@...as.de>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Linux MM <linux-mm@...ck.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Richard Guy Briggs <rgb@...hat.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>, info@...nelci.org
Subject: Re: next/master boot bisection: next-20190215 on beaglebone-black

On 06/03/2019 14:05, Mike Rapoport wrote:
> On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
>> On 01/03/2019 23:23, Dan Williams wrote:
>>> On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
>>> <guillaume.tucker@...labora.com> wrote:
>>>
>>> Is there an early-printk facility that can be turned on to see how far
>>> we get in the boot?
>>
>> Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
>> earlyprintk in the command line.  Here's the result, with the
>> commit cherry picked on top of next-20190304:
>>
>>   https://lava.collabora.co.uk/scheduler/job/1526326
>>
>> [    1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
>> [    1.396718] Unable to handle kernel paging request at virtual address 77bb4003
>> [    1.404203] pgd = (ptrval)
>> [    1.406971] [77bb4003] *pgd=00000000
>> [    1.410650] Internal error: Oops: 5 [#1] ARM
>> [...]
>> [    1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
>> [    1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)
>>
>> It's always failing at that point in the code.  Also when
>> enabling "debug" on the kernel command line, the issue goes
>> away (exact same binaries etc..):
>>
>>   https://lava.collabora.co.uk/scheduler/job/1526327
>>
>> For the record, here's the branch I've been using:
>>
>>   https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug
>>
>> The board otherwise boots fine with next-20190304 (SMP=n), and
>> also with the patch applied but the shuffle configs set to n.
>>
>>> Were there any boot *successes* on ARM with shuffling enabled? I.e.
>>> clues about what's different about the specific memory setup for
>>> beagle-bone-black.
>>
>> Looking at the KernelCI results from next-20190215, it looks like
>> only the BeagleBone Black with SMP=n failed to boot:
>>
>>   https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/
>>
>> Of course that's not all the ARM boards that exist out there, but
>> it's a fairly large coverage already.
>>
>> As the kernel panic always seems to originate in ti-sysc.c,
>> there's a chance it's only visible on that platform...  I'm doing
>> a KernelCI run now with my test branch to double check that,
>> it'll take a few hours so I'll send an update later if I get
>> anything useful out of it.

Here's the result, there were a couple of failures but some were
due to infrastructure errors (nyan-big) and I'm not sure about
what was the problem with the meson boards:

  https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indicator that the shuffle config is causing
any issue on any other platform than the BeagleBone Black.

>> In the meantime, I'm happy to try out other things with more
>> debug configs turned on or any potential fixes someone might
>> have.
> 
> ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
> failure has something to do with it...
> 
> Guillaume, can you try this patch:

Sure, it doesn't seem to be fixing the problem though:

  https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic really is about.  I'll see if I can spend a bit more
time on it this week, unless there's any BeagleBone expert
available to help or if someone has another fix to try out.

Guillaume

> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 3ce1248..4a04aac 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -58,7 +58,8 @@ module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
>   * For two pages to be swapped in the shuffle, they must be free (on a
>   * 'free_area' lru), have the same order, and have the same migratetype.
>   */
> -static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order,
> +						  struct zone *z)
>  {
>  	struct page *page;
>  
> @@ -80,6 +81,9 @@ static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
>  	if (!PageBuddy(page))
>  		return NULL;
>  
> +	if (!memmap_valid_within(pfn, page, z))
> +		return NULL;
> +
>  	/*
>  	 * ...is the page on the same list as the page we will
>  	 * shuffle it with?
> @@ -123,7 +127,7 @@ void __meminit __shuffle_zone(struct zone *z)
>  		 * page_j randomly selected in the span @zone_start_pfn to
>  		 * @spanned_pages.
>  		 */
> -		page_i = shuffle_valid_page(i, order);
> +		page_i = shuffle_valid_page(i, order, z);
>  		if (!page_i)
>  			continue;
>  
> @@ -137,7 +141,7 @@ void __meminit __shuffle_zone(struct zone *z)
>  			j = z->zone_start_pfn +
>  				ALIGN_DOWN(get_random_long() % z->spanned_pages,
>  						order_pages);
> -			page_j = shuffle_valid_page(j, order);
> +			page_j = shuffle_valid_page(j, order, z);
>  			if (page_j && page_j != page_i)
>  				break;
>  		}
>  
>