Message-ID: <c38966ab-4a3c-4a72-a3c1-5c0301408609@kernel.org>
Date: Fri, 5 Dec 2025 17:23:46 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Florian Westphal <fw@...len.de>
Cc: netfilter-devel@...r.kernel.org, Pablo Neira Ayuso <pablo@...filter.org>,
netdev@...r.kernel.org, phil@....cc, Eric Dumazet <eric.dumazet@...il.com>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, kernel-team@...udflare.com,
mfleming@...udflare.com, matt@...dmodwrite.com, nwood@...udflare.com,
aforster@...udflare.com
Subject: Re: [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account
for nth-match
On 27/11/2025 15.40, Florian Westphal wrote:
> Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>> The iptables statistic nth mode is documented to match one packet every nth
>> packets. When packets get GRO/GSO aggregated before traversing the statistic
>> nth match, they get accounted as a single packet.
>>
>> This patch takes into account the number of packet frags a GRO/GSO packet
>> contains for the xt_statistic match.
>
> I doubt we can do this upstream. Two reasons that come to mind, in no
> particular order:
>
> 1. It introduces asymmetry. All matches use "skb == packet" design.
> Examples that come to mind: xt_limit, xt_length.
> 2. This adds a compat issue with nftables:
> iptables-translate -A INPUT -m statistic --mode nth --packet 0 --every 10
> nft 'add rule ip filter INPUT numgen inc mod 10 0 counter'
>
> 'numgen' increments a counter for every skb, i.e. reg := i++;.
> But, after this patch -m statistics doesn't work this way anymore
> and the two rules no longer do the same thing.
At some point we do want to move from iptables to nftables, so we need
to come up with a solution for nft as well.

I wonder what nft building block we would need for something like this,
e.g. based on the number of GRO/GSO segments. We could have an inc_segs
and a greater-than-and-reset (gt_limit / gt_mod) with a counter limit
modifier that wraps increments. Like the C code, it matches when we
overrun 'nth.every' and resets the counter. The existing iptables code
resets it to zero, while the new code takes into account how much we
overshot.
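To make the reset-with-overshoot idea concrete, here is a minimal sketch
(not the kernel code; `NthState`, `sample` and the field names are my own
illustration): the counter advances by the number of GSO segments in the
skb, and on a match the overshoot is carried over instead of resetting to
zero, so large GSO packets are not under-counted over time.

```rust
// Hypothetical sketch of an nth-style match that accounts for
// GRO/GSO segment counts. Names are assumptions, not kernel code.
struct NthState {
    count: u64, // running segment count
    every: u64, // match roughly once per `every` segments
}

impl NthState {
    fn sample(&mut self, nr_segs: u64) -> bool {
        self.count += nr_segs;
        if self.count >= self.every {
            // Carry the overshoot instead of resetting to zero,
            // so the long-run match rate stays at segs/every.
            self.count -= self.every;
            true
        } else {
            false
        }
    }
}
```

With every=10 and a stream of 4-segment skbs, this matches 2 skbs out of
every 5 (20 segments), i.e. the documented 1-in-10 rate in segment terms.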
> But even if we'd ignore this or add a flag to control behavior, I don't
> see how this could be implemented in nft.
>
> And last but not least, I'm not sure the premise is correct.
> Yes, when you think of 'packet sampling', then we don't 'match'
> often enough for gro/gso case.
>
> However, when doing '-m statistic ... -j LOG' (or whatever), then the
> entire GSO superpacket is logged, i.e. several 'packets' got matched
> at once.
>
> So the existing algorithm works correctly even when considering
> aggregation because on average the correct amount of segments gets
> matched (logged).
>
No, this is where the "on average" assumption fails. Packets with many
segments get statistically under-represented. I'm told that people
noticed the bandwidth estimate based on sampled packets was too far off
(compared with other correlating stats, like interface stats).
I'm told (Cc Nick Wood) that the statistically correct way to handle the
--probability setting would be a Bernoulli trial[1] (a "binomial
experiment"). This is how our userspace code (which sees all GRO/GSO
packets) does statistical sampling based on the number of segments (to
get the correct statistical probability):
The Rust code does this:

    let probability = 1.0 / sample_interval as f64;
    let adjusted_probability =
        nr_packets * probability * (1.0 - probability).powf(nr_packets - 1.0);
[1] https://en.wikipedia.org/wiki/Bernoulli_trial
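For reference, n*p*(1-p)^(n-1) is the binomial probability of exactly one
success in n per-segment Bernoulli trials, which is why the per-skb
probability gets scaled up for multi-segment skbs. A small self-contained
check of the quoted formula (function name is mine):

```rust
// Illustrative check of the adjusted-probability formula quoted
// above: P(exactly one success in n Bernoulli trials with p=1/interval).
fn adjusted_probability(nr_packets: f64, sample_interval: f64) -> f64 {
    let p = 1.0 / sample_interval;
    nr_packets * p * (1.0 - p).powf(nr_packets - 1.0)
}
```

E.g. with a 1-in-10 sample interval, a 1-segment skb keeps probability
0.1, while a 4-segment skb is sampled with probability 4*0.1*0.9^3 =
0.2916 rather than 0.1, compensating for the aggregation.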
We could (also) update the kernel code for --probability to do this, but
as you can see the Rust code uses floating-point calculations.
It was easier to change the nth code (and easier for me to reason about)
than to convert the formula to an integer approximation (given that we
don't have floating-point math in the kernel).
> With this proposed new algo, we can now match 100% of skbs / aggregated
> segments, even for something like '--every 10'. And that seems fishy to
> me.
>
> As far as I understood its only 'more correct' in your case because the
> logging backend picks one individual segment out from the NFLOG'd
> superpacket.
>
> But if it would NOT do that, then you now sample (almost) all segments
> seen on wire. Did I misunderstand something here?
See above explanation about Bernoulli trial[1].
--Jesper