Message-ID: <CAM0EoM=DUtju91y_0zsyyJJ+bPxTRAAWyBA_1tM+RwY8VXbbRw@mail.gmail.com>
Date: Fri, 10 Nov 2023 08:11:00 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: davem@...emloft.net, netdev@...r.kernel.org, edumazet@...gle.com,
	pabeni@...hat.com, syzbot+d55372214aff0faa1f1f@...kaller.appspotmail.com,
	xiyou.wangcong@...il.com, jiri@...nulli.us
Subject: Re: [RFC net-next] net: don't dump stack on queue timeout

On Wed, Nov 8, 2023 at 7:09 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> The top syzbot report for networking (#14 for the entire kernel)
> is the queue timeout splat. We kept it around for a long time,
> because in real life it provides pretty strong signal that
> something is wrong with the driver or the device.
>
> Removing it is also likely to break monitoring for those who
> track it as a kernel warning.
>
> Nevertheless, WARN()ings are best suited for catching kernel
> programming bugs. If a Tx queue gets starved due to a pause
> storm, priority configuration, or other weirdness - that's
> obviously a problem, but not a problem we can fix at
> the kernel level.
>
> Bite the bullet and convert the WARN() to a print.
>
> Before:
>
> NETDEV WATCHDOG: eni1np1 (netdevsim): transmit queue 0 timed out 1975 ms
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x39e/0x3b0
> [... completely pointless stack trace of a timer follows ...]
>
> Now:
>
> netdevsim netdevsim1 eni1np1: NETDEV WATCHDOG: CPU: 0: transmit queue 0 timed out 1769 ms
>
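For anyone whose monitoring keys on the old warning text, here is a rough userspace sketch of matching both message formats quoted above. It is only an illustration built from the two sample lines in this patch description: the struct `wd_event` and the function `parse_watchdog_line()` are hypothetical names, not kernel or library APIs, and real log lines may carry timestamps or other prefixes that this has not been tested against.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one watchdog log line. */
struct wd_event {
	char ifname[32];	/* interface name, e.g. "eni1np1" */
	unsigned int queue;	/* Tx queue index */
	unsigned int ms;	/* reported timeout in milliseconds */
	int cpu;		/* CPU number, -1 if the line is in the old format */
};

/* Returns 1 and fills *ev if the line is a watchdog message, 0 otherwise. */
int parse_watchdog_line(const char *line, struct wd_event *ev)
{
	static const char marker[] = "NETDEV WATCHDOG: ";
	const char *p, *tail;

	assert(line != NULL && ev != NULL);

	p = strstr(line, marker);
	if (!p)
		return 0;
	tail = p + sizeof(marker) - 1;

	/* New: "<drv> <bus> <ifname>: NETDEV WATCHDOG: CPU: N: transmit queue Q timed out T ms" */
	if (sscanf(tail, "CPU: %d: transmit queue %u timed out %u ms",
		   &ev->cpu, &ev->queue, &ev->ms) == 3) {
		const char *end = p, *start;
		size_t len;

		/* Step back over the ": " separating ifname from the marker. */
		while (end > line && (end[-1] == ' ' || end[-1] == ':'))
			end--;
		/* The interface name is the last space-delimited token. */
		start = end;
		while (start > line && start[-1] != ' ')
			start--;
		len = (size_t)(end - start);
		if (len >= sizeof(ev->ifname))
			len = sizeof(ev->ifname) - 1;
		memcpy(ev->ifname, start, len);
		ev->ifname[len] = '\0';
		return 1;
	}

	/* Old: "NETDEV WATCHDOG: <ifname> (<driver>): transmit queue Q timed out T ms" */
	if (sscanf(tail, "%31s (%*[^)]): transmit queue %u timed out %u ms",
		   ev->ifname, &ev->queue, &ev->ms) == 3) {
		ev->cpu = -1;
		return 1;
	}

	return 0;
}
```

A log watcher could call `parse_watchdog_line()` on each line it reads and raise the same alert for either format, which would keep existing alerting working across the WARN-to-print change.
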
> Alternatively we could mark the drivers which syzbot has
> learned to abuse as "print-instead-of-WARN" selectively.
>
> Reported-by: syzbot+d55372214aff0faa1f1f@...kaller.appspotmail.com
> Signed-off-by: Jakub Kicinski <kuba@...nel.org>

Reviewed-by: Jamal Hadi Salim <jhs@...atatu.com>

cheers,
jamal

> ---
> CC: jhs@...atatu.com
> CC: xiyou.wangcong@...il.com
> CC: jiri@...nulli.us
> ---
> net/sched/sch_generic.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 4195a4bc26ca..8dd0e5925342 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>
>  			if (unlikely(timedout_ms)) {
>  				trace_net_dev_xmit_timeout(dev, i);
> -				WARN_ONCE(1, "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out %u ms\n",
> -					  dev->name, netdev_drivername(dev), i, timedout_ms);
> +				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
> +					    raw_smp_processor_id(),
> +					    i, timedout_ms);
>  				netif_freeze_queues(dev);
>  				dev->netdev_ops->ndo_tx_timeout(dev, i);
>  				netif_unfreeze_queues(dev);
> --
> 2.41.0
>