netdev - Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook

Message-ID: <5f57d4702cb4d_10343208ab@john-XPS-13-9370.notmuch>
Date:   Tue, 08 Sep 2020 11:58:56 -0700
From:   John Fastabend <john.fastabend@...il.com>
To:     Lukas Wunner <lukas@...ner.de>,
        John Fastabend <john.fastabend@...il.com>
Cc:     Pablo Neira Ayuso <pablo@...filter.org>,
        Jozsef Kadlecsik <kadlec@...filter.org>,
        Florian Westphal <fw@...len.de>,
        netfilter-devel@...r.kernel.org, coreteam@...filter.org,
        netdev@...r.kernel.org, Daniel Borkmann <daniel@...earbox.net>,
        Alexei Starovoitov <ast@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Thomas Graf <tgraf@...g.ch>, Laura Garcia <nevola@...il.com>,
        David Miller <davem@...emloft.net>
Subject: Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook

Lukas Wunner wrote:
> On Wed, Sep 02, 2020 at 10:00:32PM -0700, John Fastabend wrote:
> > Lukas Wunner wrote:
> > > * Before:       4730418pps 2270Mb/sec (2270600640bps)
> > > * After:        4759206pps 2284Mb/sec (2284418880bps)
> > 
> > I used a 10Gbps ixgbe nic to measure the performance after the dummy
> > device hung on me for some reason. I'll try to investigate what happened
> > later. It was unrelated to these patches though.
> > 
> > But, with 10Gbps NIC and doing a pktgen benchmark with and without
> > the patches applied I didn't see any measurable differences. Both
> > cases reached 14Mpps.
> 
> Hm, I strongly suspect you may have measured performance of the NIC and
> that you'd get different before/after numbers with the dummy device.

OK, I tried again on the dummy device.
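
As a sanity check on the quoted pktgen figures, the pps and bps values
are consistent with 60 bytes counted per packet (bps / pps / 8 = 60).
The 60-byte size is inferred from the numbers, not stated anywhere in
this thread, and the names below are only illustrative:

```python
# Assumption: 60 bytes per packet, inferred from the quoted pps/bps pairs.
FRAME_BYTES = 60


def pps_to_bps(pps, frame_bytes=FRAME_BYTES):
    """Convert packets per second to bits per second."""
    return pps * frame_bytes * 8


# (pps, bps) pairs exactly as quoted from the commit message.
quoted = {
    "before": (4730418, 2270600640),
    "after": (4759206, 2284418880),
    "before + tc": (4063912, 1950677760),
    "after + tc": (4007728, 1923709440),
}

for name, (pps, bps) in quoted.items():
    status = "consistent" if pps_to_bps(pps) == bps else "mismatch"
    print(f"{name}: {pps} pps -> {pps_to_bps(pps)} bps ({status})")
```

So the Mb/sec column is just the pps column rescaled, and either one
can be used when comparing runs.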

> 
> 
> > > * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> > > * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> > 
> > Same here before/after aggregate appears to be the same. Even the
> > numbers above show a 1.2% degradation. Just curious is the above
> > from a single run or averaged over multiple runs or something
> > else? Seems like noise to me.
> 
> I performed at least 3 runs, but included just a single number in
> the commit message for brevity.  That number is intended to show
> where the numbers settled:
> 
> Before:           2257 2270 2270           Mb/sec
> After:            2282 2283 2284 2282      Mb/sec
> 
> Before + tc:      1941 1950 1951           Mb/sec
> After  + tc:      1923 1923 1919 1920 1919 Mb/sec
> 
> After + nft:      1782 1783 1782 1781      Mb/sec
> After + nft + tc: 1574 1566 1566           Mb/sec
> 
> So there's certainly some noise but still a speedup is clearly
> visible if neither tc nor nft is used, and a slight degradation
> if tc is used.

After running multiple times there does seem to be some small performance
improvement from adding noinline there. It's small though (maybe 1-2%?)
and I can't detect it on anything but the dummy device.

But the degradation with clsact is also caused by this noinline.
If I add the noinline directly into the existing code I see the
same impact: likely some small performance improvement for the
no-clsact case, but a very real degradation in the clsact case,
presumably due to having to do a call now. I didn't collect perf
output, I just did the simple test.
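
To put rough numbers on that, here is a small sketch that averages the
runs quoted above and computes the relative change. The run values are
copied from Lukas's table; the dictionary keys and the helper name are
just illustrative, and a plain mean over 3-5 runs is an assumption about
how one would summarise them, not a proper statistical treatment:

```python
from statistics import mean

# Mb/sec per run, copied from the quoted table.
runs = {
    "before": [2257, 2270, 2270],
    "after": [2282, 2283, 2284, 2282],
    "before_tc": [1941, 1950, 1951],
    "after_tc": [1923, 1923, 1919, 1920, 1919],
}


def pct_change(old, new):
    """Relative change of the mean of `new` versus the mean of `old`, in percent."""
    return (mean(new) - mean(old)) / mean(old) * 100.0


no_tc = pct_change(runs["before"], runs["after"])
with_tc = pct_change(runs["before_tc"], runs["after_tc"])

print(f"no tc:   {no_tc:+.2f}%")
print(f"with tc: {with_tc:+.2f}%")
```

On the quoted single-thread numbers this comes out to a gain of a bit
under 1% without tc and a loss of roughly 1.4% with tc, so the two
effects are of similar size and point in opposite directions.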

One piece I don't fully understand yet is that on a single-thread test
the degradation is small, but as the number of threads increases
the degradation increases. With a single thread it's 1-2%, but it
creeps up to about 5% with 16 cores. Can you confirm this?

> 
> 
> > I did see something odd where it appeared fairness between threads
> > was slightly worse. I don't have any explanation for this? Did
> > you have a chance to run the test with -t >1?
> 
> Sorry, no, I only tested with a single thread on an otherwise idle
> machine.

We also need to consider the case with many threads, or at least
become convinced it's not going to change with thread count. What
I see is that the degradation creeps up as the number of cores
increases.

> 
> 
> > Do you have plans to address the performance degradation? Otherwise
> > if I was building some new components its unclear why we would
> > choose the slower option over the tc hook. The two suggested
> > use cases security policy and DSR sound like new features, any
> > reason to not just use existing infrastructure?
> > 
> > Is the use case primarily legacy things already running in
> > nft infrastructure? I guess if you have code running now
> > moving it to this hook is faster and even if its 10% slower
> > than it could be that may be better than a rewrite?
> 
> nft and tc are orthogonal, i.e. filtering/mangling versus queueing.
> However tc has gained the ability to filter packets as well, hence
> there's some overlap in functionality.  Naturally tc does not allow
> the full glory of nft filtering/mangling options as Laura has stated,
> hence the need to add nft in the egress path.
> 
> 
> > huh? Its stated in the commit message with no reason for why it might
> > be the case and I can't reproduce it.
> 
> The reason as stated in the commit message is that cache pressure is
> apparently reduced with the tc handling moved out of the hotpath,
> an effect that Eric Dumazet had previously observed for the ingress path:
> 
> https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/

OK, it seems possible that we are hitting an icache miss. To really
confirm this, though, I would want to look at icache statistics. Without
them it feels likely, but it's difficult to tell for sure.

This noinline change is subtle and buried in another patch. If you really
want to noinline that function, pull it out of the series and push it as
its own patch. I am against it because it appears to be directly degrading
performance for my use case while only providing a small (if measurable at
all) gain in the normal case. But at least if it's submitted as its own
patch we can debate the merits. We would need performance data for some
real devices (veth, NIC, and the dummy device), also across many threads,
to get a good handle on it. Perf data would also help us understand what's
happening. My preference would be to also nail down the icache stats so we
can be sure this noinline improvement in the non-clsact case is fully
understood.

> 
> Thanks a lot for taking the time to give these patches a whirl.
> 
> Lukas