Date: Mon, 13 Apr 2020 08:03:57 +0300
From: Leon Romanovsky <leon@...nel.org>
To: David Miller <davem@...emloft.net>
Cc: kuba@...nel.org, arjan@...ux.intel.com, xiyou.wangcong@...il.com,
jhs@...atatu.com, jiri@...nulli.us, netdev@...r.kernel.org
Subject: Re: [PATCH net v1] net/sched: Don't print dump stack in event of
transmission timeout
On Sun, Apr 12, 2020 at 09:19:25PM -0700, David Miller wrote:
>
> This is caused by a device "overwhelmed with traffic"? Sounds like
> normal operation to me.
>
> That's a bug, and the driver handling the device with this problem
> should adjust how it implements TX timeouts to accommodate this.
From the internal bug description, hope that it makes sense.
-----
A timeout may occur if the number of reported bytes is higher than the queue limit.
In this case, the kernel stops the queue and reopens it only after getting a
completion.
During debugging we saw that in some situations the driver gets a **delayed completion**:
completions arrive after **1 min**, so the amount of queued bytes exceeds the
DQL max size.
As a result, after watchdog_timeo expires, the kernel calls the driver's timeout function,
which prints the timeout to dmesg.
After debugging the issue with FW to find the root cause of the delayed completions,
we understood that since the IB and the TCP traffic run at the same service level (SL),
the same scheduling queue arbitrates between all the QPs. In this case, if one of the IB QPs
gets stuck because of congestion, all other QPs (including the TCP QPs) will be stuck until
the stuck QP is released.
-----
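For illustration only, the stop/completion/watchdog interaction described above can be
modelled with a toy Python sketch. It is not the kernel's DQL code: the TxQueue class, its
fields and the fixed limit_bytes are made-up names and simplifications (the real DQL adjusts
its limit dynamically), and the numbers are arbitrary. Only watchdog_timeo is a name taken
from the description.

```python
from dataclasses import dataclass


@dataclass
class TxQueue:
    """Toy model of a TX queue with a byte limit and a watchdog (hypothetical)."""
    limit_bytes: int        # stand-in for the DQL limit; fixed here, adaptive in reality
    watchdog_timeo: float   # seconds the queue may stay stopped after the last transmit
    in_flight: int = 0      # bytes handed to the device and not yet completed
    stopped: bool = False
    last_tx: float = 0.0    # time of the last accepted transmit

    def send(self, now: float, nbytes: int) -> bool:
        """Queue nbytes for transmit; refuse if the queue is stopped."""
        if self.stopped:
            return False
        self.in_flight += nbytes
        self.last_tx = now
        # Stop the queue once the outstanding bytes reach the limit.
        if self.in_flight >= self.limit_bytes:
            self.stopped = True
        return True

    def complete(self, nbytes: int) -> None:
        """A completion arrived; reopen the queue if we are back under the limit."""
        self.in_flight -= nbytes
        if self.in_flight < self.limit_bytes:
            self.stopped = False

    def watchdog_expired(self, now: float) -> bool:
        """True if the queue has been stopped for longer than watchdog_timeo."""
        return self.stopped and (now - self.last_tx) > self.watchdog_timeo


q = TxQueue(limit_bytes=3000, watchdog_timeo=5.0)
q.send(0.0, 1500)
q.send(0.1, 1500)                 # reaches the limit, queue is stopped
print(q.send(0.2, 1500))          # refused while stopped
print(q.watchdog_expired(4.0))    # completion is late, but not late enough yet
print(q.watchdog_expired(60.0))   # completion delayed by about a minute
q.complete(3000)                  # the delayed completion finally arrives
print(q.stopped)
```

In this model a completion that arrives within watchdog_timeo reopens the queue silently,
while one that is delayed by about a minute leaves the queue stopped long enough for the
watchdog check to fire, which is the situation in the bug description.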
User separates traffic to different SLs.
Thanks