Message-ID: <CACrpuLQGj70xCi8wDH4HeKzkA=d-9+eOYkkQ47M2Tw8MA65kzQ@mail.gmail.com>
Date: Mon, 8 Dec 2025 10:37:48 +0000
From: Nick Wood <nwood@...udflare.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Florian Westphal <fw@...len.de>, netfilter-devel@...r.kernel.org,
Pablo Neira Ayuso <pablo@...filter.org>, netdev@...r.kernel.org, phil@....cc,
Eric Dumazet <eric.dumazet@...il.com>, "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, kernel-team@...udflare.com,
mfleming@...udflare.com, matt@...dmodwrite.com, aforster@...udflare.com
Subject: Re: [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account
for nth-match
On Fri, 5 Dec 2025 at 16:23, Jesper Dangaard Brouer <hawk@...nel.org> wrote:
> > So the existing algorithm works correctly even when considering
> > aggregation because on average the correct amount of segments gets
> > matched (logged).
> >
>
> No, this is where the "on average" assumption fails. Packets with many
> segments get statistically under-represented. As far as I'm told, people
> noticed that the bandwidth estimate based on sampled packets was too
> far off (compared with other correlating stats like interface stats).
On-average is technically correct here; the issue is more subtle than
simple undercounting. To see why, consider a 50pps flow (without
GRO/GSO) with a sampling rate of 1/100. With 1-in-k sampling we take a
sample every other second, and to estimate the total flow we multiply
by the sampling interval, so we get a sawtooth-looking wave that
alternates between 100pps and 0pps each second. This isn't 'correct'
either; it's an estimate, as we'd expect, and the absolute error
alternates between +50 and -50 at the same frequency.
Now let's aggregate those 50pps into a single GSO super-packet every
second, so that 1-in-k sampling becomes all-or-nothing. For 99 seconds
out of every hundred we sample no packets - this is the under-sampling
you're referencing. But for 1 second in 100 we sample the whole
GRO/GSO packet and capture 50 samples, so our pps estimate for that
instant spikes to 5,000pps - for an instant the absolute error jumps
by two orders of magnitude (the 'true' rate is 50pps, remember).
If you average the single spike into the 99 seconds of under-sampling,
the figures do 'average out', but for very short flows this
amplification of error makes it impossible to accurately estimate the
true packet rate.
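A toy simulation of the effect described above (plain Python, purely
illustrative - the function and its parameters are mine, not anything
from xt_statistic):

```python
import random

def estimate_series(seconds, pps, sample_interval, gso=False, seed=0):
    """Per-second pps estimates from 1-in-k sampling (toy model)."""
    rng = random.Random(seed)
    estimates = []
    counter = 0
    for _ in range(seconds):
        sampled = 0
        if gso:
            # All pps segments arrive as one GSO super-packet per second;
            # the 1-in-k decision is all-or-nothing for the whole aggregate.
            if rng.random() < 1.0 / sample_interval:
                sampled = pps
        else:
            # pps individual packets through a deterministic 1-in-k counter.
            for _ in range(pps):
                counter += 1
                if counter == sample_interval:
                    counter = 0
                    sampled += 1
        # Scale the sample count back up by the sampling interval.
        estimates.append(sampled * sample_interval)
    return estimates

no_gso = estimate_series(200, 50, 100)
with_gso = estimate_series(200, 50, 100, gso=True)
print(max(no_gso))    # sawtooth peak: 100pps, error bounded at +/-50
print(set(with_gso))  # each second is either 0 or a 5,000pps spike
```

Over a long run both series average toward the true 50pps, but the GSO
series gets there via rare 5,000pps spikes rather than a bounded sawtooth.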
>
> I'm told (Cc Nick Wood) the statistically correct way with --probability
> setting would be doing a Bernoulli trial[1] or a "binomial experiment".
> This is how our userspace code (that gets all GRO/GSO packets) does
> statistical sampling based on the number of segments (to get the correct
> statistical probability):
>
> The Rust code does this:
> let probability = 1.0 / sample_interval as f64;
> let adjusted_probability = nr_packets * probability
>     * (1.0 - probability).powf(nr_packets - 1.0);
>
> [1] https://en.wikipedia.org/wiki/Bernoulli_trial
>
> We could (also) update the kernel code for --probability to do this, but
> as you can see the Rust code uses floating point calculations.
>
> It was easier to change the nth code (and easier for me to reason about)
> than dealing with converting the formula to use an integer
> approximation (given we don't have floating point calc in kernel).
With s = integer sample rate (i.e. s = 100 if we're sampling 1/100)
and n = nr_packets:

	u32 sample_threshold = ((1ULL << 32) / s / s) * (n * (s - (n - 1)));

	if (get_random_u32() < sample_threshold) {
		sample_single_subpacket();
	}
This is an equivalent integer calculation for a Bernoulli-trial
treatment.* It undersamples by about 1 part in 100k for s=100 and n=50,
which is good enough for most purposes. The error is smaller for
smaller n; for smaller s it may warrant an additional (cubic) term in n.
*I'm a mathematician, not a kernel developer
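As a standalone sanity check (plain Python, not kernel code; the
function names are mine), the integer threshold can be compared against
the floating-point formula quoted above. The integer form drops the
higher-order terms of (1-p)^(n-1), so it sits at or below the
floating-point value; how far below depends on s and n:

```python
def bernoulli_probability(s, n):
    """Floating-point form quoted above: n*p*(1-p)^(n-1) with p = 1/s."""
    p = 1.0 / s
    return n * p * (1.0 - p) ** (n - 1)

def integer_threshold(s, n):
    """Integer approximation: compare a random u32 against this value."""
    return (2**32 // s // s) * (n * (s - (n - 1)))

s, n = 100, 50
approx = integer_threshold(s, n) / 2**32  # back to a probability
exact = bernoulli_probability(s, n)
print(f"approx={approx:.6f} exact={exact:.6f}")
```

For n=1 the two agree to within the rounding lost in the integer
divisions, which is the degenerate (non-aggregated) case.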
> > With this proposed new algo, we can now match 100% of skbs / aggregated
> > segments, even for something like '--every 10'. And that seems fishy to
> > me.
> >
> > As far as I understood its only 'more correct' in your case because the
> > logging backend picks one individual segment out from the NFLOG'd
> > superpacket.
> >
> > But if it would NOT do that, then you now sample (almost) all segments
> > seen on wire. Did I misunderstand something here?
>
> See above explanation about Bernoulli trial[1].
>
> --Jesper
>