linux-kernel - Re: [PATCH v2] tg3: Disable tg3 device on system reboot to avoid triggering AER

Message-ID: <CACKFLi==iTvmSKwAcaFFpwCDNd+EQ9N1vS7+o4DRiTHdScMs4w@mail.gmail.com>
Date:   Fri, 26 Aug 2022 09:38:51 -0700
From:   Michael Chan <michael.chan@...adcom.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Kai-Heng Feng <kai.heng.feng@...onical.com>,
        Siva Reddy Kallam <siva.kallam@...adcom.com>,
        Prashant Sreedharan <prashant@...adcom.com>,
        Michael Chan <mchan@...adcom.com>,
        Josef Bacik <josef@...icpanda.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] tg3: Disable tg3 device on system reboot to avoid
 triggering AER

On Fri, Aug 26, 2022 at 9:19 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Aug 25, 2022 at 5:25 PM Kai-Heng Feng
> <kai.heng.feng@...onical.com> wrote:
> >
> > Commit d60cd06331a3 ("PM: ACPI: reboot: Use S5 for reboot") caused a
> > reboot hang on one Dell server, so the commit was reverted.
> >
> > Someone managed to collect the AER log and it's caused by MSI:
> > [ 148.762067] ACPI: Preparing to enter system sleep state S5
> > [ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> > [ 148.803731] {1}[Hardware Error]: event severity: recoverable
> > [ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
> > [ 148.816088] {1}[Hardware Error]: section_type: PCIe error
> > [ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
> > [ 148.829026] {1}[Hardware Error]: version: 3.0
> > [ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
> > [ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
> > [ 148.847309] {1}[Hardware Error]: slot: 0
> > [ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
> > [ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
> > [ 148.865145] {1}[Hardware Error]: class_code: 020000
> > [ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
> > [ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
> > [ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
> > [ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
> > [ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
> > [ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
> > [ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
> > [ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
> >
> > The MSI is probably raised by incoming packets, so power down the device
> > and disable bus mastering to stop the traffic, as a user confirmed this
> > approach works.
> >
> > In addition to that, be extra safe and cancel reset task if it's running.
> >
> > Cc: Josef Bacik <josef@...icpanda.com>
> > Link: https://lore.kernel.org/all/b8db79e6857c41dab4ef08bdf826ea7c47e3bafc.1615947283.git.josef@toxicpanda.com/
> > BugLink: https://bugs.launchpad.net/bugs/1917471
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@...onical.com>
> > ---
> > v2:
> >  - Move tg3_reset_task_cancel() outside of rtnl_lock() to prevent
> >    deadlock.
> >
>
> It seems tg3_reset_task_cancel() is already called while rtnl is held/owned.
> Should we worry about that ?

In this shutdown code path, if we cancel it before rtnl_lock(), the
TG3_FLAG_RESET_TASK_PENDING flag will be cleared and we will not try
to cancel it again later when rtnl_lock() is held.
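
To illustrate why the later cancel becomes a no-op, here is a minimal
userspace sketch of a cancel helper guarded by a test-and-clear flag. It is
a model and not the tg3 code: struct fake_tg3, fake_reset_task_cancel() and
the sync_cancels counter are hypothetical names, and a C11 atomic_exchange()
stands in for the kernel's atomic bit helpers and the actual wait on the
work item.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real struct tg3. */
struct fake_tg3 {
	atomic_bool reset_task_pending; /* models TG3_FLAG_RESET_TASK_PENDING */
	int sync_cancels;               /* times we would have waited on the work */
};

/*
 * Model of a cancel helper guarded by a pending flag: only the caller that
 * observes the flag set (and clears it) goes on to wait for the work item.
 * Returns true if this call performed the simulated synchronous cancel.
 */
bool fake_reset_task_cancel(struct fake_tg3 *tp)
{
	if (atomic_exchange(&tp->reset_task_pending, false)) {
		tp->sync_cancels++; /* stands in for waiting on the work item */
		return true;
	}
	return false; /* flag already clear: nothing to wait for */
}
```

With the flag initially set, the first call clears it and does the simulated
wait; a second call sees the flag clear and returns immediately, which is the
behaviour described above for the cancel issued later under rtnl_lock().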

But I agree that calling tg3_reset_task_cancel() under rtnl_lock can
potentially deadlock if the reset_task is scheduled and we wait for it
to finish.  That logic should be fixed separately.
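
The ordering concern can be sketched as a toy, single-threaded state model.
It assumes the reset work needs the same lock as the shutdown path before it
can finish; the struct and function names below are hypothetical and only
model "who holds the lock" and "is the work running", they do not use any
kernel API.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these names come from the tg3 driver. */
struct shutdown_model {
	bool lock_held_by_caller; /* shutdown path currently holds the lock */
	bool work_running;        /* reset work started, wants the same lock */
};

/*
 * The modelled work item must take the lock before it can finish, so it can
 * only complete while the shutdown path is not holding that lock.
 */
bool work_can_finish(const struct shutdown_model *m)
{
	return !m->work_running || !m->lock_held_by_caller;
}

/*
 * A synchronous cancel waits for running work to finish. Returns true if the
 * wait ends, false if caller and work would block on each other forever.
 */
bool cancel_sync(struct shutdown_model *m)
{
	if (!work_can_finish(m))
		return false;        /* caller holds the lock the work needs */
	m->work_running = false; /* work ran to completion */
	return true;
}

/* Ordering used in the v2 patch: cancel first, then take the lock. */
bool shutdown_cancel_then_lock(struct shutdown_model *m)
{
	bool ok = cancel_sync(m);
	m->lock_held_by_caller = true;
	return ok;
}

/* Ordering being questioned: take the lock, then cancel under it. */
bool shutdown_lock_then_cancel(struct shutdown_model *m)
{
	m->lock_held_by_caller = true;
	return cancel_sync(m);
}
```

In this model, shutdown_cancel_then_lock() always completes, while
shutdown_lock_then_cancel() fails exactly when the work is already running,
which is the case that can deadlock.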

>
> >  drivers/net/ethernet/broadcom/tg3.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> > index db1e9d810b416..89889d8150da1 100644
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -18076,16 +18076,20 @@ static void tg3_shutdown(struct pci_dev *pdev)
> >         struct net_device *dev = pci_get_drvdata(pdev);
> >         struct tg3 *tp = netdev_priv(dev);
> >
> > +       tg3_reset_task_cancel(tp);
> > +
> >         rtnl_lock();
> > +
> >         netif_device_detach(dev);
> >
> >         if (netif_running(dev))
> >                 dev_close(dev);
> >
> > -       if (system_state == SYSTEM_POWER_OFF)
> > -               tg3_power_down(tp);
> > +       tg3_power_down(tp);
> >
> >         rtnl_unlock();
> > +
> > +       pci_disable_device(pdev);
> >  }
> >
> >  /**
> > --
> > 2.36.1
> >
