Message-ID: <Z6WC7rysltAFwhJI@LQ3V64L9R2>
Date: Thu, 6 Feb 2025 19:50:06 -0800
From: Joe Damato <jdamato@...tly.com>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>,
"David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
almasrymina@...gle.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
On Thu, Feb 06, 2025 at 07:13:00PM -0800, Samiullah Khawaja wrote:
> On Thu, Feb 6, 2025 at 2:48 PM Joe Damato <jdamato@...tly.com> wrote:
> >
> > On Thu, Feb 06, 2025 at 02:06:14PM -0800, Samiullah Khawaja wrote:
> > > On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@...tly.com> wrote:
> > > >
> > > > On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> > > > > On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > > > > > Extend the already existing support of threaded napi poll to do continuous
> > > > > > busy polling.
> > > > >
> > > > > [...]
> > > > >
> > > > > Overall, +1 to everything Martin said in his response. I think I'd
> > > > > like to try to reproduce this myself to better understand the stated
> > > > > numbers below.
> > > > >
> > > > > IMHO: the cover letter needs more details.
> > > > >
> > > > > >
> > > > > > Setup:
> > > > > >
> > > > > > - Running on Google C3 VMs with the idpf driver, with the following configuration.
> > > > > > - IRQ affinity and coalescing are common to both experiments.
> > > > >
> > > > > As Martin suggested, a lot more detail here would be helpful.
> > > >
> > > > Just to give you a sense of the questions I ran into while trying to
> > > > reproduce this just now:
> > > >
> > > > - What is the base SHA? You should use --base when using git
> > > > format-patch. I assumed the latest net-next SHA and applied the
> > > > patches to that.
> > > Yes that is true. I will use --base when I do it next. Thanks for the
> > > suggestion.
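For anyone following along, a minimal sketch of the --base flow, using a throwaway repo in place of net-next (repo paths, names, and commit messages here are illustrative):

```shell
set -e
# Build a tiny two-commit history standing in for a net-next checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'base commit'
echo change > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'net: example change'
# --base embeds a "base-commit:" trailer in the generated patch file,
# recording the SHA the series applies on top of.
patchfile=$(git format-patch --base=HEAD~1 -1)
# Show the recorded base SHA.
grep '^base-commit:' "$patchfile"
```

With the trailer in place, a reviewer can check out exactly that SHA before applying the series instead of guessing at the latest net-next tip.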
> > > >
> > > > - Which C3 instance type? I chose c3-highcpu-192-metal, but I could
> > > > have chosen c3-standard-192-metal, apparently. No idea what
> > > > difference this makes on the results, if any.
> > > The tricky part is that the c3 instance variant that I am using for
> > > dev is probably not available publicly.
> >
> > Can you use a publicly available c3 instance type instead? Maybe you
> > can't, and if so you should probably mention that it's an internal
> > c3 image and can't be reproduced on the public c3's because of XYZ
> > in the cover letter.
> >
> > > It is a variant of c3-standard-192-metal, but we had to enable
> > > AF_XDP on it to make it stable enough to run onload. You will
> > > have to reproduce this on a platform available to you with AF_XDP,
> > > as suggested in the onload github I shared. This is the problem
> > > with providing an installer/runner script and system
> > > configuration. My configuration would not be best for your
> > > platform, but you should certainly be able to reproduce the
> > > relative improvements in latency using the different busypolling
> > > schemes (busy_read/busy_poll vs threaded napi busy poll) I
> > > mentioned in the cover letter.
> >
> > Sorry, I still don't agree. Just because your configuration won't
> > work for me out of the box is, IMHO, totally unrelated to what
> > Martin and I are asking for.
> >
> > I won't speak for Martin, but I am saying that the configuration you
> > are using should be thoroughly documented so that I can at least
> > understand it and how I might reproduce it.
> I provided all the relevant configuration I used that you can apply on
> your platform.
Sorry, but that is not true -- both because, as you say below, the
IRQ routing and thread affinity were only provided later, and
because you agree below that you didn't include version numbers or
git SHAs.
> Later I also provided the IRQ routing and thread affinity, as
> Martin asked, but as you can see it is pretty opaque and irrelevant
> to the experiment I am doing, and it also depends on the platform
> you use.
[...]
> > > >
> > > > - I have no idea where to put CPU affinity for the 1 TX/RX queue, I
> > > > assume CPU 2 based on your other message.
> > > Yes, I replied to Martin with this information, but like I said it
> > > certainly depends on your platform, and hence I didn't include it in
> > > the cover letter, since I don't know what or where core 2 is on your
> > > platform.
> >
> > You keep referring to "your platform" and I'm still confused.
> >
> > Don't you think it's important to properly document _your setup_,
> > including mentioning that core 2 is used for the IRQ and the
> > onload+neper runs on other cores? Maybe I missed it in the cover
> > letter, but that detail seems pretty important for analysis,
> > wouldn't you agree?
> Respectfully, I think you are again confusing things here: NAPI
> threaded polling is running on a separate core (2). And the cover
> letter clearly states the following about the experiment:
> ```
> Here with NAPI threaded busy poll in a separate core, we are able to
> consistently poll the NAPI to keep latency to absolute minimum.
> ```
Is core 2 mentioned anywhere in the cover letter? Is there a
description of the core layout? Is core 2 NUMA local to the NIC? Are
the cores where neper+onload run NUMA local?
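To be concrete about what I mean by documenting this: the mask math below is exact (/proc/irq/<N>/smp_affinity is a hex bitmask of allowed CPUs), while the IRQ number, interface name, and thread pattern are placeholders for whatever the real setup uses:

```shell
# Compute the smp_affinity hex mask for a single core (core 2 -> 0x4).
core=2
mask=0
mask=$(( mask | (1 << core) ))
printf 'smp_affinity mask for core %d: %x\n' "$core" "$mask"
# On a real system (root required), this is where it would be applied
# and where the NUMA layout being asked about would be read:
#   echo 4 > /proc/irq/$IRQ/smp_affinity       # pin the queue IRQ to core 2
#   cat /sys/class/net/eth0/device/numa_node   # NUMA node of the NIC
#   taskset -cp 2 <napi-thread-pid>            # pin the NAPI thread to core 2
```

This is exactly the sort of bookkeeping that, written down once in the cover letter, lets someone else map "core 2" onto their own topology.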
I'm probably "again confusing things" and this is all clearly
explained in the cover letter, I bet.
> >
> > Even if my computer is different, there should be enough detail for
> > me to form a mental model of what you are doing so that I can think
> > through it, understand the data, and, if I want to, try to reproduce
> > it.
> I agree with this 100%, and I will fill in the interrupt routing and
> other affinity info so it gives you a mental model. That is, I am
> comparing sharing a core between application processing and NAPI
> processing versus doing NAPI processing on dedicated cores. I want
> to focus on the premise of the problem/use case I am trying to
> solve. I mentioned this in the cover letter, but it seems you are
> looking for specifics, however irrelevant they might be to your
> platform. I will put those in the next iteration.
The specifics are necessary for two reasons:
1. So I can understand what you claim to be measuring
2. So I can attempt to reproduce it
How odd to call this "irrelevant"; it's probably one of the most
relevant things required to understand and analyze the impact of the
proposed change.
> >
> > > >
> > > > - The neper commands provided seem to be the server side since there
> > > > is no -c mentioned. What is the neper client side command?
> > > Same command with -c
> > > >
> > > > - What do the environment variables set for onload+neper mean?
> > > >
> > > > ...
> > > >
> > > > Do you follow what I'm getting at here? The cover letter lacks a
> > > > tremendous amount of detail, which makes reproducing the setup you
> > > > are using unnecessarily difficult.
> > > >
> > > > Do you agree that I should be able to read the cover letter and, if
> > > > so desired, go off and reproduce the setup and get similar results?
> > > Yes, you should be able to do that, but there are micro details of
> > > your platform and configuration that I have no way of knowing, so I
> > > cannot suggest configurations for them. I have certainly pointed out
> > > the relevant environment and special configurations (netdev queue
> > > sizes, sysctls, IRQ defer, the neper command, and onload environment
> > > variables) that I used for each test case in my experiment. Beyond
> > > that, I have no way of providing you an internal C3 platform or a
> > > system configuration for your platform.
> >
> > I'm going to let the thread rest at this point; I just think we are
> > talking past each other here and it just doesn't feel productive.
> >
> > You are saying that your platform and configuration are not publicly
> > available, there are too many "micro details", and that you can't
> > suggest a configuration for my computer, which is sure to be wildly
> > different.
> >
> > I keep repeating this, but I'll repeat it more explicitly: I'm not
> > asking you to suggest a configuration to me for my computer.
> >
> > I'm asking you to thoroughly, rigorously, and verbosely document
> > what _exactly_ *your setup* is, precisely how it is configured, and
> > all the versions and SHAs of everything so that if I want to try to
> > reproduce it I have enough information in order to do so.
> >
> > With your cover letter as it stands presently: this is not possible.
> >
> > Surely, if I can run neper and get a latency measurement, I can run
> > a script that you wrote which measures CPU cycles using a publicly
> > available tool, right? Then at least we are both measuring CPU
> > consumption the same way and can compare latency vs CPU tradeoff
> > results based on the same measurement.
> I am not considering the CPU/latency tradeoff, since my use case
> requires consistent latency. This is very apparent when the core is
> shared between application processing and NAPI processing, and it is
> pretty intuitive. I think the affinity info and the mental model you
> are looking for would probably make this more apparent.
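For concreteness on the CPU measurement I proposed above: even without a dedicated tool, /proc/stat deltas give a comparable per-core busy/idle number. cpu0 below is a stand-in for whichever core the NAPI thread is pinned to (cpu2 in the setup under discussion):

```shell
# Estimate per-core CPU consumption over one second from /proc/stat:
# fields 2-4 are user/nice/system ticks, field 5 is idle ticks.
core=cpu0
s1=$(awk -v c="$core" '$1==c {print $2+$3+$4, $5}' /proc/stat)
sleep 1
s2=$(awk -v c="$core" '$1==c {print $2+$3+$4, $5}' /proc/stat)
busy=$(( ${s2%% *} - ${s1%% *} ))   # busy ticks accrued in the interval
idle=$(( ${s2##* } - ${s1##* } ))   # idle ticks accrued in the interval
echo "core=$core busy_ticks=$busy idle_ticks=$idle"
```

A dedicated busy-polling core should show near-zero idle ticks here, which is the CPU cost side of the latency numbers in the cover letter.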
FYI: I will strongly object to any future submission of this work
that does not include a rigorous, thorough, verbose, reproducible,
and clear analysis of the test system setup, test cases, and
trade-offs introduced by this change.
I'm not a maintainer though, so maybe my strong objections won't
impact this being merged.