Message-Id: <20240503022549.49852-1-jdamato@fastly.com>
Date: Fri,  3 May 2024 02:25:48 +0000
From: Joe Damato <jdamato@...tly.com>
To: linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org,
	tariqt@...dia.com,
	saeedm@...dia.com
Cc: gal@...dia.com,
	nalramli@...tly.com,
	Joe Damato <jdamato@...tly.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	Leon Romanovsky <leon@...nel.org>,
	linux-rdma@...r.kernel.org (open list:MELLANOX MLX5 core VPI driver),
	Paolo Abeni <pabeni@...hat.com>
Subject: [PATCH net-next 0/1] mlx5: Add netdev-genl queue stats

Hi:

This is only 1 patch, so I know a cover letter isn't necessary, but it
seems there are a few things to mention.

This change adds support for the per-queue netdev-genl stats API to mlx5.
It appears to output stats as expected:

./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \
         --dump qstats-get --json '{"scope": "queue"}'

..snip
 {'ifindex': 7,
  'queue-id': 28,
  'queue-type': 'tx',
  'tx-bytes': 399462,
  'tx-packets': 3311},
..snip
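
As a rough illustration of how per-queue records like the one above can be
rolled up into per-interface totals for comparison, here is a minimal Python
sketch. It is not part of cli.py: sum_qstats is a hypothetical helper, and
only the queue 28 record comes from the real dump; the second record is
invented sample data.

```python
from collections import defaultdict

# Keys that identify a queue rather than count anything.
ID_KEYS = ("ifindex", "queue-id", "queue-type")

def sum_qstats(records):
    """Sum every counter per ifindex across all per-queue records."""
    totals = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for key, value in rec.items():
            if key in ID_KEYS:
                continue
            totals[rec["ifindex"]][key] += value
    return {ifindex: dict(counters) for ifindex, counters in totals.items()}

records = [
    # Taken from the dump above.
    {"ifindex": 7, "queue-id": 28, "queue-type": "tx",
     "tx-bytes": 399462, "tx-packets": 3311},
    # Hypothetical second queue, for illustration only.
    {"ifindex": 7, "queue-id": 29, "queue-type": "tx",
     "tx-bytes": 1000, "tx-packets": 10},
]

totals = sum_qstats(records)
print(totals)
```

Totals built this way are what one would line up against the interface-wide
rtnl counters.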

I've tried to use the tooling suggested to verify that the per queue
stats match the rtnl stats by doing this:

  NETIF=eth0 tools/testing/selftests/drivers/net/stats.py

The tool reports a failure:

  # Exception| Exception: Qstats are lower, fetched later
  not ok 3 stats.pkt_byte_sum

The other tests all pass (including stats.qstat_by_ifindex).

This appears to mean that the netdev-genl queue stats have lower numbers
than the rtnl stats even though the rtnl stats are fetched first. I
added some debugging and found that both rx and tx bytes and packets are
slightly lower.

The only explanations I can think of for this are:

1. tx_ptp_opened and rx_ptp_opened are both true, in which case
   mlx5e_fold_sw_stats64 adds bytes and packets to the rtnl struct and
   might account for the difference. I skip this case in my
   implementation, so that could certainly explain it.
2. Maybe I'm just misunderstanding how stats aggregation works in mlx5,
   and that's why the numbers are slightly off?
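
To make explanation 1 concrete, here is a toy Python model. It is not driver
code: it only assumes that the rtnl total folds in PTP queue counters when
PTP is opened, while the per-queue sum skips them. All names and numbers are
invented for illustration.

```python
def rtnl_total(channel_pkts, ptp_pkts, ptp_opened):
    """Interface-wide total: regular channels plus PTP when it is opened."""
    total = sum(channel_pkts)
    if ptp_opened:
        total += ptp_pkts
    return total

def qstats_total(channel_pkts):
    """Per-queue sum that skips the PTP queues entirely."""
    return sum(channel_pkts)

channel_pkts = [3311, 1200, 50]  # hypothetical per-channel packet counts
ptp_pkts = 7                     # hypothetical packets seen on PTP queues

rtnl = rtnl_total(channel_pkts, ptp_pkts, ptp_opened=True)
qstats = qstats_total(channel_pkts)
gap = rtnl - qstats
print(f"rtnl={rtnl} qstats={qstats} gap={gap}")
```

In this model the qstats sum trails the rtnl total by exactly the PTP traffic,
and the gap disappears when ptp_opened is false.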

It appears that the driver queues stats updates on a workqueue, and that
these updates run periodically:

 0. The driver occasionally calls queue_work on the update_stats_work
    workqueue.
 1. This eventually calls MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw),
    in drivers/net/ethernet/mellanox/mlx5/core/en_stats.c, which appears
    to begin by memsetting to zero the internal stats struct where stats
    are aggregated. This would mean, I think, that the get_base_stats
    netdev-genl API implementation that I have is correct: simply set
    everything to 0. Otherwise we'd end up double counting in the
    netdev-genl RX and TX handlers.
 2. Next, each of the stats helpers is called to collect stats into the
    freshly zeroed internal struct (for example:
    mlx5e_stats_grp_sw_update_stats_rq_stats).
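
The sequence above can be sketched as a toy Python model. Again, this is not
driver code, and the class and field names are hypothetical. It assumes the
per-ring counters are never reset and that each update zeroes the aggregate
before re-summing, and it shows why a non-zero base would then double count.

```python
class ToySwStats:
    """Toy model of a zero-then-aggregate software stats update."""

    def __init__(self, n_rings):
        self.ring_pkts = [0] * n_rings  # per-ring counters, never reset
        self.agg_pkts = 0               # aggregate, rebuilt on every update

    def update_stats(self):
        self.agg_pkts = 0               # stands in for the memset in step 1
        for pkts in self.ring_pkts:     # stands in for the helpers in step 2
            self.agg_pkts += pkts

    def qstats_total(self, base=0):
        """Base stats plus the sum of the per-queue stats."""
        return base + sum(self.ring_pkts)

stats = ToySwStats(n_rings=2)
stats.ring_pkts[0] += 5
stats.ring_pkts[1] += 3
stats.update_stats()
first = stats.agg_pkts

stats.ring_pkts[0] += 2
stats.update_stats()
second = stats.agg_pkts

with_zero_base = stats.qstats_total(base=0)
with_stale_base = stats.qstats_total(base=first)
print(first, second, with_zero_base, with_stale_base)
```

With a zero base the per-queue sum matches the rebuilt aggregate; seeding the
base with an earlier aggregate counts the first 8 packets twice.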

That seems to be how stats are aggregated, which would suggest that if I
simply do what I'm doing in this change, the numbers should line up.

But they don't, and it's either because of PTP or because I am
misunderstanding or doing something wrong.

Maybe the MLNX folks can offer a hint?

Thanks,
Joe

Joe Damato (1):
  net/mlx5e: Add per queue netdev-genl stats

 .../net/ethernet/mellanox/mlx5/core/en_main.c | 68 +++++++++++++++++++
 1 file changed, 68 insertions(+)

-- 
2.25.1