Message-ID: <77d412b4-cdd6-ea86-d7fd-adb3af8970d9@gmail.com>
Date:   Fri, 28 May 2021 09:48:14 -0700
From:   Florian Fainelli <f.fainelli@...il.com>
To:     Maxime Ripard <maxime@...no.tech>,
        Florian Fainelli <f.fainelli@...il.com>
Cc:     Doug Berger <opendmb@...il.com>,
        bcm-kernel-feedback-list@...adcom.com,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        Nicolas Saenz Julienne <nsaenz@...nel.org>
Subject: Re: Kernel Panic in skb_release_data using genet

On 5/28/21 9:32 AM, Maxime Ripard wrote:
> Hi Florian,
> 
> On Fri, May 28, 2021 at 09:21:27AM -0700, Florian Fainelli wrote:
>> On 5/24/21 8:37 AM, Florian Fainelli wrote:
>>>
>>>
>>> On 5/24/2021 8:13 AM, Maxime Ripard wrote:
>>>> Hi Florian,
>>>>
>>>> On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
>>>>> Hi Maxime,
>>>>>
>>>>> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
>>>>>> Hi Doug, Florian,
>>>>>>
>>>>>> I've been running a RaspberryPi4 with a mainline kernel for a while,
>>>>>> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
>>>>>> I'm getting a kernel panic around the time init is started.
>>>>>>
>>>>>> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
>>>>>> KASAN enabled and got this, which looks related:
>>>>>
>>>>> Is there a known good version that could be used for bisection, or
>>>>> did you just start running this test, with no reference point?
>>>>
>>>> I've had this issue for over a year and never (I think?) got a good
>>>> version, so while it might be a regression, it's not a recent one.
>>>
>>> OK, this helps and does not really help.
>>>
>>>>
>>>>> How stable in terms of clocking is the configuration that you are using?
>>>>> I could try to fire up a similar test on a Pi4 at home, or use one of
>>>>> our 72112 systems which is the closest we have to a Pi4 and see if that
>>>>> happens there as well.
>>>>
>>>> I'm not really sure about the clocking. Is there any clock you want to
>>>> look at in particular?
>>>
>>> ARM, DDR, AXI, anything that could cause some memory corruption to occur
>>> essentially. GENET clocks are fairly fixed, you have a 250MHz clock and
>>> a 125MHz clock feeding the data path.
>>>
>>>>
>>>> My setup is fairly simple: the firmware and kernel are loaded over TFTP
>>>> and the rootfs is mounted over NFS, and the crash always occurs around
>>>> init start, so I guess when it actually starts to transmit a decent
>>>> amount of data?
>>>
>>> Do you reproduce this problem with KASAN disabled, and do you
>>> eventually get a crash pointing back to the same location?
>>>
>>> I have a suspicion that this is all Pi4 specific because we regularly
>>> run the GENET driver through various kernel versions (4.9, 5.4 and 5.10
>>> and mainline) and did not run into that.
>>
>> I have not had time to get a set-up to reproduce what you are seeing,
>> could you share your .config meanwhile? Thanks
> 
> Sorry, I didn't have the time to check how the clocks were behaving.
> 
> You'll find attached my config.txt file and .config
> 
> I'm booting the board entirely from TFTP (which might introduce some
> issues in the "handoff" from the bootloader to the kernel). You'll find
> a guide here:
> 
> https://www.raspberrypi.org/documentation/hardware/raspberrypi/bootmodes/net_tutorial.md

That is also how I boot my Pi4 at home, and I suspect you are right. If
the VPU does not shut down GENET's DMA and leaves buffer addresses in
the on-chip descriptors that point to an address space that Linux
manages totally differently, then we can have a serious problem and end
up with memory corruption when the ring is being reclaimed. I will run a
few experiments to test that theory. There may be a solution using the
SW_INIT reset controller to do a full reset of the controller before
handing it over to the Linux driver.
-- 
Florian
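
The theory in the message above (DMA left running across the bootloader
handoff, with descriptors still pointing at memory the kernel now owns)
can be illustrated with a toy model. This is only a sketch under stated
assumptions: ToyNic, RxDesc, and kernel_memory_intact_after_boot are
hypothetical names invented for the illustration, the "memory" is a
64-byte array, and nothing here reflects the real GENET register layout,
the bcmgenet driver, or the SW_INIT reset controller interface.

```python
from dataclasses import dataclass

MEM_SIZE = 64            # toy "physical memory", in bytes (assumption)
PKT = b"\xAA" * 8        # payload the simulated NIC writes per packet


@dataclass
class RxDesc:
    addr: int = 0        # buffer address programmed into the descriptor
    valid: bool = False  # whether the descriptor points at a usable buffer


class ToyNic:
    """Hypothetical DMA-capable NIC; not a model of GENET itself."""

    def __init__(self, ring_size=2):
        self.ring = [RxDesc() for _ in range(ring_size)]
        self.dma_enabled = False
        self.next = 0

    def program(self, addrs):
        """Point the ring at the given buffers and start DMA."""
        for desc, addr in zip(self.ring, addrs):
            desc.addr, desc.valid = addr, True
        self.next = 0
        self.dma_enabled = True

    def reset(self):
        """Models a full controller reset: DMA off, descriptors cleared."""
        self.dma_enabled = False
        for desc in self.ring:
            desc.addr, desc.valid = 0, False
        self.next = 0

    def receive(self, mem):
        """A packet arrives: write it wherever the next descriptor points."""
        if not self.dma_enabled:
            return False
        desc = self.ring[self.next]
        if not desc.valid:
            return False
        mem[desc.addr:desc.addr + len(PKT)] = PKT
        self.next = (self.next + 1) % len(self.ring)
        return True


def kernel_memory_intact_after_boot(reset_before_handoff):
    """Return True if kernel-owned memory survives a late packet."""
    mem = bytearray(MEM_SIZE)
    nic = ToyNic()

    # Bootloader stage: network boot uses RX buffers at addresses 0 and 8.
    nic.program([0, 8])

    # Handoff: in this model the bootloader never stops DMA on its own.
    if reset_before_handoff:
        nic.reset()

    # The kernel now reuses the same addresses for its own data.
    kernel_data = b"KERNEL-OWNED-MEM"
    mem[0:len(kernel_data)] = kernel_data

    # A packet arrives before the driver has set up its own ring.
    nic.receive(mem)
    return bytes(mem[0:len(kernel_data)]) == kernel_data


print("no reset before handoff, memory intact:",
      kernel_memory_intact_after_boot(False))
print("reset before handoff, memory intact:",
      kernel_memory_intact_after_boot(True))
```

In this model, skipping the reset lets the stale descriptor overwrite
the first 8 bytes of kernel-owned memory, while resetting first leaves
it untouched. Whether the real hardware behaves this way is exactly what
the experiments mentioned above would need to confirm.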