netdev - Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout


Message-ID: <20200412192336.GD334007@unreal>
Date:   Sun, 12 Apr 2020 22:23:36 +0300
From:   Leon Romanovsky <leon@...nel.org>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     "David S. Miller" <davem@...emloft.net>,
        Arjan van de Ven <arjan@...ux.intel.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Jiri Pirko <jiri@...nulli.us>, netdev@...r.kernel.org
Subject: Re: [PATCH net v1] net/sched: Don't print dump stack in event of
 transmission timeout

On Sun, Apr 12, 2020 at 11:59:13AM -0700, Jakub Kicinski wrote:
> On Sun, 12 Apr 2020 09:08:54 +0300 Leon Romanovsky wrote:
> > Hi Dave,
> >
> > This is a new version of previously sent v0 [1] with change in print error
> > level as was suggested by Jakub and Cong. I'm asking you to reevaluate
> > your previous decision [2] given the fact that this is user triggered
> > bug and very similar scenario was committed by Linus "fs/filesystems.c:
> > downgrade user-reachable WARN_ONCE() to pr_warn_once()" a couple of days
> > ago [3].
> >
> > [1] https://lore.kernel.org/netdev/20200402152336.538433-1-leon@kernel.org
> > [2] https://lore.kernel.org/netdev/20200402.180218.940555077368617365.davem@davemloft.net
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=26c5d78c976ca298e59a56f6101a97b618ba3539
>
> How is it user triggerable? If there's a IB-specific reason maybe ib
> netdev should stop implementing ndo_tx_timeout.

It happens when the device is extremely overloaded with traffic:
internally, the HW degrades its performance (a HW bug), which causes
the TX timeouts and the WARN_ON splat.

We don't want to stop implementing ndo_tx_timeout, because it works
right most (if not all) of the time.

If it is very important, I will dig into internal bug reports to look
for possible reproduction scenarios, but from what I have seen so far,
it is a statistical failure.

And it is not IB-specific, but mlx4-specific.

Thanks