Message-Id: <EC72134F-DF4F-4CC5-90B8-D2DC037AAAD0@gmail.com>
Date: Tue, 24 Sep 2024 14:23:55 +0800
From: Miao Wang <shankerwangmiao@...il.com>
To: Jijie Shao <shaojijie@...wei.com>
Cc: netdev@...r.kernel.org,
Shengqi Chen <harry-chen@...look.com>,
Yuxiang Zhang <zz593141477@...il.com>,
Jiajie Chen <jiegec@...com>,
Mirror Admin Tuna <mirroradmin@...a.tsinghua.edu.cn>,
Salil Mehta <salil.mehta@...wei.com>
Subject: Re: [BUG Report] hns3: tx_timeout on high memory pressure
> On 24 Sep 2024, at 10:06, Jijie Shao <shaojijie@...wei.com> wrote:
>
>
> on 2024/9/24 0:12, Miao Wang wrote:
>>> On 23 Sep 2024, at 21:37, Jijie Shao <shaojijie@...wei.com> wrote:
>>>
>>>
>>>> our analysis. I wonder how I can verify the scheduling of NAPI.
>>> You can use napi trace to verify it:
>>> echo 1 > /sys/kernel/debug/tracing/events/napi/napi_poll/enable
>>> cat /sys/kernel/debug/tracing/trace
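(For reference, one way to capture these events to a file for offline analysis,
as a sketch assuming the usual tracefs layout; not necessarily how the attached
log was produced. trace_pipe is a consuming read, so events are streamed out
instead of being overwritten in the ring buffer:

  echo > /sys/kernel/debug/tracing/trace                  # clear old events
  echo 1 > /sys/kernel/debug/tracing/events/napi/napi_poll/enable
  cat /sys/kernel/debug/tracing/trace_pipe > napi_trace.log &
)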
>> I managed to reproduce the issue. Attached are the dmesg and the trace log.
>> It seems that in the trace log, napi_poll() keeps being called.
>>
>> My reproduction environment is a smart git HTTP server running on a Taishan
>> server, with nginx as the frontend, git-http-backend as the CGI program, and
>> a FastCGI wrapper connecting the two. The served git repo is linux.git. On
>> the server, I first start some programs that take up about 70% of the system
>> memory. Then, from several other hosts acting as git clients, I start as many
>> git clone processes as possible, about 2000 in total, all pointing at the git
>> server. The git processes forked on the server side take up all the remaining
>> memory and cause OOM on the server.
>>
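Regarding the programs taking up about 70% of memory: for illustration only,
such pressure could be generated with something like stress-ng (a sketch, not
necessarily the exact programs used in the reproduction):

  # one worker dirtying roughly 70% of total memory and keeping it resident
  stress-ng --vm 1 --vm-bytes 70% --vm-keep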
> Hi Miao,
> Thanks for the reproduction. I checked the dmesg and the napi trace log.
> We can see that the first tx timeout occurred at time [19555675.553853], and
> the napi poll seems to keep being called all the time. To be exact, the trace
> log covers all the napi contexts, and we can differentiate the napis by the
> address of the napi struct.
>
> Since we can't directly map the queue id to the napi poll, I just searched
> several poll records for them.
> We can see that some napi polls haven't been called for more than 5s,
> exceeding the tx timeout interval. That may be caused by the CPU being busy,
> or by there being no tx traffic for the queue during that time.
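(Below is a rough sketch of how those gaps could be extracted from the trace
log per napi struct address, assuming the usual napi_poll line layout of
"... <timestamp>: napi_poll: napi poll on napi struct <addr> for device ...";
I have not verified it against the exact format of the attached log:

  awk '/napi_poll:/ {
      # the timestamp is the field just before "napi_poll:", minus its trailing ":"
      for (i = 1; i <= NF; i++) {
          if ($i == "napi_poll:") { ts = $(i-1); sub(/:$/, "", ts) }
          if ($i == "struct")     { napi = $(i+1) }
      }
      # track the largest gap between consecutive polls of the same napi struct
      if (napi in last && ts - last[napi] > gap[napi]) gap[napi] = ts - last[napi]
      last[napi] = ts
  }
  END { for (n in gap) printf "%s max gap %.3f s\n", n, gap[n] }' trace.log | sort -k4 -rn
)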
Thanks for your analysis. There are network interfaces driven by hns3, tuntap,
bonding, vxlan and veth on that machine. As far as I can tell from the source
code, only hns3 and veth use the napi interfaces, but I cannot find a way to
map the struct address to a specific interface.
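One thing I may try: if the napi_poll trace lines on this kernel carry a
"for device <ifname>" part (I believe recent kernels print it, but I have not
checked this one), the mapping could be pulled out of the trace directly:

  # list each napi struct address together with the interface it polls
  grep -o 'napi struct [0-9a-fx]* for device [^ ]*' trace.log | sort -u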
It is also strange why the stall happens, since I've never seen stalls like
this on x86 machines with the i40e driver. Currently, I lack the testing
resources to check whether this only happens with the hns3 driver or whether it
can also happen with other drivers on the arm64 platform. It would be nice if
you could also reproduce this symptom.
Cheers,
Miao Wang