netdev - Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAywjhQ+KBTaqQ=jtOtpx9+82ToOid5n06+NdqLX_iDhH7SQcA@mail.gmail.com>
Date: Thu, 6 Feb 2025 19:13:00 -0800
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Joe Damato <jdamato@...tly.com>, Samiullah Khawaja <skhawaja@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, 
	netdev@...r.kernel.org
Subject: Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll

On Thu, Feb 6, 2025 at 2:48 PM Joe Damato <jdamato@...tly.com> wrote:
>
> On Thu, Feb 06, 2025 at 02:06:14PM -0800, Samiullah Khawaja wrote:
> > On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@...tly.com> wrote:
> > >
> > > On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> > > > On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > > > > Extend the already existing support of threaded napi poll to do continuous
> > > > > busy polling.
> > > >
> > > > [...]
> > > >
> > > > Overall, +1 to everything Martin said in his response. I think I'd
> > > > like to try to reproduce this myself to better understand the stated
> > > > numbers below.
> > > >
> > > > IMHO: the cover letter needs more details.
> > > >
> > > > >
> > > > > Setup:
> > > > >
> > > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > > - IRQ affinity and coalascing is common for both experiments.
> > > >
> > > > As Martin suggested, a lot more detail here would be helpful.
> > >
> > > Just to give you a sense of the questions I ran into while trying to
> > > reproduce this just now:
> > >
> > > - What is the base SHA? You should use --base when using git
> > >   format-patch. I assumed the latest net-next SHA and applied the
> > >   patches to that.
> > Yes that is true. I will use --base when I do it next. Thanks for the
> > suggestion.
> > >
> > > - Which C3 instance type? I chose c3-highcpu-192-metal, but I could
> > >   have chosen c3-standard-192-metal, apparently. No idea what
> > >   difference this makes on the results, if any.
> > The tricky part is that the c3 instance variant that I am using for
> > dev is probably not available publicly.
>
> Can you use a publicly available c3 instance type instead? Maybe you
> can't, and if so you should probably mention that it's an internal
> c3 image and can't be reproduced on the public c3's because of XYZ
> in the cover letter.
>
> > It is a variant of c3-standard-192-metal but we had to enable
> > AF_XDP on it to make it stable to be able to run onload. You will
> > have to reproduce this on a platform available to you with AF_XDP
> > as suggested in the onload github I shared. This is the problem
> > with providing an installer/runner script and also system
> > configuration. My configuration would not be best for your
> > platform, but you should certainly be able to reproduce the
> > relative improvements in latency using the different busypolling
> > schemes (busy_read/busy_poll vs threaded napi busy poll) I
> > mentioned in the cover letter.
>
> Sorry, I still don't agree. Just because your configuration won't
> work for me out of the box is, IMHO, totally unrelated to what
> Martin and I are asking for.
>
> I won't speak for Martin, but I am saying that the configuration you
> are using should be thoroughly documented so that I can at least
> understand it and how I might reproduce it.
I provided all the relevant configuration I used that you can apply on
your platform. Later also provided the IRQ routing and thread affinity
as Martin asked, but as you can see it is pretty opaque and irrelevant
to the experiment I am doing and it also depends on the platform you
use. But I can add those to the cover letter next time. Specifically I
will add the following plus the things that I already provided,
- interrupt routing
- any thread affinity
- number of queues (already added)
- threaded napi enablement (with affinity)
>
> > >
> > > - Was "tier 1 networking" enabled? I enabled it. No idea if it
> > >   matters or not. I assume not, since it would be internal
> > >   networking within the GCP VPC of my instances and not real egress?
> > >
> > > - What version of onload was used? Which SHA or release tag?
> > v9.0, I agree this should be added to the cover letter.
>
> To the list of things to add to the cover letter:
>   - What OS and version are you using?
>   - How many runs of neper? It seems like it was just 1 run. Is that
>     sufficient? I'd argue you need to re-run the experiment many
>     times, with different message sizes, queue counts, etc and
>     compute some statistical analysis of the results.
>
> > >
> > > - I have no idea where to put CPU affinity for the 1 TX/RX queue, I
> > >   assume CPU 2 based on your other message.
> > Yes I replied to Martin with this information, but like I said it
> > certainly depends on your platform and hence didn't include it in the
> > cover letter. Since I don't know what/where core 2 looks like on your
> > platform.
>
> You keep referring to "your platform" and I'm still confused.
>
> Don't you think it's important to properly document _your setup_,
> including mentioning that core 2 is used for the IRQ and the
> onload+neper runs on other cores? Maybe I missed it in the cover
> letter, but that details seems pretty important for analysis,
> wouldn't you agree?
Respectfully I think here you are again confusing things, napi
threaded polling is running in a separate core (2). And the cover
letter clearly states the following about the experiment.
```
Here with NAPI threaded busy poll in a separate core, we are able to
consistently poll the NAPI to keep latency to absolute minimum.
```
>
> Even if my computer is different, there should be enough detail for
> me to form a mental model of what you are doing so that I can think
> through it, understand the data, and, if I want to, try to reproduce
> it.
I agree to this 100% and I will fill in the interrupt routing and
other affinity info so it gives you a mental model, that is I am doing
a comparison between sharing a core between application processing and
napi processing vs doing napi processing in dedicated cores. I want to
focus on the premise of the problem/use case I am trying to solve. I
mentioned this in the cover letter, but it seems you are looking for
specifics however irrelevant they might be to your platform. I will
put those in the next iteration.
>
> > >
> > > - The neper commands provided seem to be the server side since there
> > >   is no -c mentioned. What is the neper client side command?
> > Same command with -c
> > >
> > > - What do the environment variables set for onload+neper mean?
> > >
> > > ...
> > >
> > > Do you follow what I'm getting at here? The cover lacks a tremendous
> > > amount of detail that makes reproducing the setup you are using
> > > unnecessarily difficult.
> > >
> > > Do you agree that I should be able to read the cover letter and, if
> > > so desired, go off and reproduce the setup and get similar results?
> > Yes you should be able to that, but there are micro details of your
> > platform and configuration that I have no way of knowing and suggest
> > configurations. I have certainly pointed out the relevant environment
> > and special configurations (netdev queues sizes, sysctls, irq defer,
> > neper command and onload environment variables) that I did for each
> > test case in my experiment. Beyond that I have no way of providing you
> > an internal C3 platform or providing system configuration for your
> > platform.
>
> I'm going to let the thread rest at this point; I just think we are
> talking past each other here and it just doesn't feel productive.
>
> You are saying that your platform and configuration are not publicly
> available, there are too many "micro details", and that you can't
> suggest a configuration for my computer, which is sure to be wildly
> different.
>
> I keep repeating this, but I'll repeat it more explicitly: I'm not
> asking you to suggest a configuration to me for my computer.
>
> I'm asking you to thoroughly, rigorously, and verbosely document
> what _exactly_ *your setup* is, precisely how it is configured, and
> all the versions and SHAs of everything so that if I want to try to
> reproduce it I have enough information in order to do so.
>
> With your cover letter as it stands presently: this is not possible.
>
> Surely, if I can run neper and get a latency measurement, I can run
> a script that you wrote which measures CPU cycles using a publicly
> available tool, right? Then at least we are both measuring CPU
> consumption the same way and can compare latency vs CPU tradeoff
> results based on the same measurement.
I am not considering the CPU/Latency tradeoff since my use case
requires consistent latency. This is very apparent when the core is
shared between application processing and napi processing and it is
pretty intuitive. I think affinity info and the mental model you are
looking for would probably make this more apparent.
>
> By providing better documentation, you make it more likely that
> other people will try to reproduce your results. The more people who
> reproduce your results, the stronger the argument to get this merged
> becomes.