linux-kernel - Re: Kernel Panic in skb_release

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210625125906.gj45zykbemh5zzhw@gilmour>
Date:   Fri, 25 Jun 2021 14:59:06 +0200
From:   Maxime Ripard <maxime@...no.tech>
To:     Florian Fainelli <f.fainelli@...il.com>
Cc:     nicolas saenz julienne <nsaenz@...nel.org>,
        Doug Berger <opendmb@...il.com>,
        bcm-kernel-feedback-list@...adcom.com,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

Sorry for the late reply

On Thu, Jun 10, 2021 at 02:33:17PM -0700, Florian Fainelli wrote:
> On 6/2/2021 6:28 AM, Maxime Ripard wrote:
> > On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
> >> On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
> >>>> That is also how I boot my Pi4 at home, and I suspect you are right, if
> >>>> the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> >>>> the on-chip descriptors that point to an address space that is managed
> >>>> totally differently by Linux, then we can have a serious problem and
> >>>> create some memory corruption when the ring is being reclaimed. I will
> >>>> run a few experiments to test that theory and there may be a solution
> >>>> using the SW_INIT reset controller to have a big reset of the controller
> >>>> before handing it over to the Linux driver.
> >>>
> >>> Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
> >>> that the TX or RX DMA have been left running during the hand over from
> >>> the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
> >>> as much as possible the differences between your set-up and my set-up
> >>> but so far have not been able to reproduce the crash in booting from NFS
> >>> repeatedly, I will try again.
> >>
> >> FWIW I can reproduce the error too. That said it's rather hard to reproduce,
> >> something in the order of 1 failure every 20 tries.
> > 
> > Yeah, it looks like it's only from a cold boot and comes in "bursts",
> > where you would get like 5 in a row and be done with it for a while.
> 
> Here are two patches that you could try exclusive from one another
> 
> 1) Limit GENET to a single queue
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index fcca023f22e5..e400c12e6868 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct
> net_device *dev, bool new_carrier)
>         return 0;
>  }
> 
> +static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff
> *skb,
> +                                struct net_device *sb_dev)
> +{
> +       return 0;
> +}
> +
>  static const struct net_device_ops bcmgenet_netdev_ops = {
>         .ndo_open               = bcmgenet_open,
>         .ndo_stop               = bcmgenet_close,
> @@ -3666,6 +3672,7 @@ static const struct net_device_ops
> bcmgenet_netdev_ops = {
>  #endif
>         .ndo_get_stats          = bcmgenet_get_stats,
>         .ndo_change_carrier     = bcmgenet_change_carrier,
> +       .ndo_select_queue       = bcmgenet_select_queue,
>  };
> 
>  /* Array of GENET hardware parameters/characteristics */
> 
> 2) Ensure that all TX/RX queues are disabled upon DMA initialization
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index fcca023f22e5..7f8a5996fbbb 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct
> bcmgenet_priv *priv,
>  /* Returns a reusable dma control register value */
>  static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
>  {
> +       unsigned int i;
>         u32 reg;
>         u32 dma_ctrl;
> 
>         /* disable DMA */
>         dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> +       for (i = 0; i < priv->hw_params->tx_queues; i++)
> +               dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
>         reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
>         reg &= ~dma_ctrl;
>         bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
> 
> +       dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> +       for (i = 0; i < priv->hw_params->rx_queues; i++)
> +               dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
>         reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
>         reg &= ~dma_ctrl;
>         bcmgenet_rdma_writel(priv, reg, DMA_CTRL);

I had a bunch of issues popping up today so I took the occasion to test
those patches. The first one doesn't change anything, I still had the
crash occurring with it. With the second applied (in addition), it seems
like it's fixed. I'll keep testing and will let you know.

Maxime

Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)