Date: Mon, 23 Oct 2023 15:59:09 +0200
From: Michal Soltys <msoltyspl@...dex.pl>
To: netdev@...r.kernel.org
Cc: Rafał Golcz <rgl@...k.pl>, Piotr Przybylski <ppr@...k.pl>
Subject: [QUESTION] potential issue - unusual drops on XL710 (40gbit) cards
 with ksoftirqd hogging one of cpus near 100%

Hi,

A while ago we noticed some unusual RX drops during the busier parts of 
the day (nowhere near any hardware limits) on our production edge 
servers. More details on their usage below.

First the hardware in question:

"older" servers:
Huawei FusionServer RH1288 V3 / 40x Intel(R) Xeon(R) CPU E5-2640 v4

"newer" servers:
Huawei FusionServer Pro 1288H V5 / 40x Intel(R) Xeon(R) Gold 5115

In both cases the servers have 512 GB of RAM and use two XL710 40GbE 
cards in an 802.3ad bond (the traffic is very well spread out).

Network card details:

Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)

Driver info as reported by ethtool is the same for both server types:

driver: i40e
firmware-version: 8.60 0x8000bd5f 1.3140.0
or
firmware-version: 8.60 0x8000bd85 1.3140.0

These are running Ubuntu 20.04.6 LTS server with 5.15 kernels (they 
differ by minor version, but by now the issue has happened on most of 
them).

The servers do content delivery work, mostly sending data, primarily 
from the page cache. At the busiest periods the outbound traffic 
approaches roughly ~50gbit per server across the 2 bonded network 
cards. Inbound traffic is a fraction of that, reaching maybe 1gbit on 
average.

The traffic is handled via OpenResty (nginx) with additional tr/edge 
logic written in Lua. When everything is fine, we see:

- outbound 30-50gbit spread across both NICs
- inbound 500mbit-1gbit
- NET_RX softirqs averaging ~20k/s per cpu
- NET_TX softirqs averaging 5-10/s per cpu (one way to sample these 
rates is sketched after this list)
- no packet drops
- cpu usage around ~10%-20% per core
- RAM used by nginx processes and the rest of the system: up to around 15 GB
- the rest of the RAM in practice used as page cache
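
For reference, a minimal sketch of one way the per-CPU NET_RX/NET_TX 
rates quoted above can be obtained, by sampling /proc/softirqs over a 
short interval (the 5 second interval below is arbitrary):

#!/usr/bin/env python3
# Minimal sketch: sample /proc/softirqs twice and print per-CPU
# NET_RX / NET_TX rates (events per second). The interval is arbitrary.
import time

INTERVAL = 5  # seconds between samples

def read_softirqs():
    # /proc/softirqs: first line is the CPU header, then "NAME: c0 c1 ..."
    with open("/proc/softirqs") as f:
        lines = f.read().splitlines()
    counts = {}
    for line in lines[1:]:
        name, _, values = line.partition(":")
        counts[name.strip()] = [int(v) for v in values.split()]
    return counts

before = read_softirqs()
time.sleep(INTERVAL)
after = read_softirqs()

for irq in ("NET_RX", "NET_TX"):
    rates = [(a - b) / INTERVAL for a, b in zip(after[irq], before[irq])]
    print(irq, " ".join(f"{r:8.0f}" for r in rates))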

Sometimes (once every few days, on a random one of those servers) we 
see a weird anomaly during the busy hours:

- lasts around 10-15 minutes, starts suddenly and ends suddenly as well
- on one of the cpus we get the following anomalies:
   - NET_RX softirqs drop to ~1k/s
   - NET_TX softirqs rise to ~500-1k/s
   - ksoftirqd hogs that particular cpu at >90% usage
- significant packet drop on the inbound side - roughly 10-20% of 
incoming packets (a sketch of one way to see where these drops are 
accounted follows this list)
- lots of nginx context switches
- aggressively reclaimed page cache - up to ~200 GB of memory is 
reclaimed and immediately starts filling up again with the data 
normally served by those servers
- the actual memory used by nginx/userland rises slightly, by ~1 GB, 
while this happens
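
To see where those RX drops are accounted per CPU, here is a minimal 
sketch that dumps the softnet counters: the second and third columns of 
/proc/net/softnet_stat are backlog drops and time_squeeze (net_rx_action 
running out of budget/time), which should stand out on the CPU that 
ksoftirqd is hogging:

#!/usr/bin/env python3
# Minimal sketch: print per-CPU softnet counters. Columns in
# /proc/net/softnet_stat are hex; the first three are packets processed,
# packets dropped (backlog full) and time_squeeze (net_rx_action ran
# out of budget/time).
with open("/proc/net/softnet_stat") as f:
    for cpu, line in enumerate(f):
        cols = [int(x, 16) for x in line.split()]
        processed, dropped, squeezed = cols[0], cols[1], cols[2]
        print(f"cpu{cpu}: processed={processed} dropped={dropped} "
              f"time_squeeze={squeezed}")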

Things we know so far:

- none of the network cards ever reaches its theoretical capacity, as 
the traffic is well spread across them - when the issue happens it's 
around 20-25gbit per card
- we are not saturating the inter-socket QPI links
- the problem starts and stops pretty much suddenly
- the TX side remains free of drop issues (per-NIC drop counters can be 
checked with the ethtool -S diff sketched after this list)
- this has been happening since December 2022, but it's hard to 
pinpoint the trigger at this point
- we have system-wide perf dumps from the period when it happens (see 
the link at the end)
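
To narrow down where the NIC itself accounts any lost packets, a 
minimal sketch of an ethtool -S diff we could capture during an event 
(the interface name below is only a placeholder):

#!/usr/bin/env python3
# Minimal sketch: diff `ethtool -S` counters over a short window and
# print the ones that moved and look drop/error related.
import subprocess
import time

IFACE = "enp59s0f0"   # placeholder interface name
WINDOW = 10           # seconds

def nic_stats(iface):
    out = subprocess.run(["ethtool", "-S", iface], capture_output=True,
                         text=True, check=True).stdout
    stats = {}
    for line in out.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            pass  # skip non-numeric lines such as the header
    return stats

before = nic_stats(IFACE)
time.sleep(WINDOW)
after = nic_stats(IFACE)

for name, new in sorted(after.items()):
    delta = new - before.get(name, 0)
    if delta and any(k in name for k in ("drop", "discard", "miss", "error")):
        print(f"{name}: +{delta} over {WINDOW}s")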

Sorry for the somewhat chaotic writeup. At this point we are out of 
ideas on how to debug this further (and what data to provide to 
pinpoint the issue).

- is this perhaps a known issue with kernels around 5.15 and/or these 
network cards and/or their driver?
- any pointers as to what else (besides kernel/XL710/driver) could be 
the issue?
- any ideas on how to debug it further?
- the system-wide perf dumps mentioned above are available if they 
would be useful for further analysis; any assistance would be greatly 
appreciated

Link to aforementioned perf dump:
https://drive.google.com/file/d/11qFgRP-r03Oj42V_fAgQBp2ebJ1d4YBW/view

From a quick check it looks like we spend a lot of time in the RX path 
in __tcp_push_pending_frames().
