Message-ID: <Ztl6HvkMzu9-7CQJ@LQ3V64L9R2>
Date: Thu, 5 Sep 2024 11:30:06 +0200
From: Joe Damato <jdamato@...tly.com>
To: Stanislav Fomichev <sdf@...ichev.me>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
	edumazet@...gle.com, amritha.nambiar@...el.com,
	sridhar.samudrala@...el.com, bjorn@...osinc.com, hch@...radead.org,
	willy@...radead.org, willemdebruijn.kernel@...il.com,
	skhawaja@...gle.com, Martin Karsten <mkarsten@...terloo.ca>,
	Donald Hunter <donald.hunter@...il.com>,
	"David S. Miller" <davem@...emloft.net>,
	Paolo Abeni <pabeni@...hat.com>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
	Daniel Jurgens <danielj@...dia.com>,
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI
 config values

On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> On 09/02, Joe Damato wrote:
> > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > 
> > > > > We should pause and think here how configuring NAPI params should
> > > > > behave. NAPI instances are ephemeral: if you close and open the
> > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > 
> > > > > This is not how the sysfs API behaves; the sysfs settings on the device
> > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > 
> > > > > I guess we can either document this, and move on (which may be fine,
> > > > > you have more practical experience than me). Or we can add an internal
> > > > > concept of a "channel" (which, if you squint, is perhaps what
> > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > net_device and store such config there. For simplicity of matching
> > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > If driver wants to do something more fancy we can add a variant of
> > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > 
> > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > I work with unfortunate drivers...  
> > > > 
> > > > Thanks for pointing this out. I think this is an important case to
> > > > consider. Here's how I'm thinking about it.
> > > > 
> > > > There are two cases:
> > > > 
> > > > 1) The sysfs setting is used by existing/legacy apps: if the NAPIs are
> > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > in patches 1 and 3 should take care of that case, preserving how sysfs
> > > > works today, I believe. I think we are good on this case?
> > > 
> > > Agreed.
> > > 
> > > > 2) apps using netlink to set various custom settings. This seems
> > > > like a case where a future extension can be made to add a notifier
> > > > for NAPI changes (like the netdevice notifier?).
> > > 
> > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > 
> > > > If you think this is a good idea, then we'd do something like:
> > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > >   2. In the future (not part of this series) a NAPI notifier is
> > > >      added
> > > >   3. User apps can then listen for NAPI create/delete events
> > > >      and update settings when a NAPI is created. It would be
> > > >      helpful, I think, for user apps to know about NAPI
> > > >      create/delete events in general because it means NAPI IDs are
> > > >      changing.
> > > > 
> > > > One could argue:
> > > > 
> > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > >   at this level probably care about NAPI IDs changing (as it affects
> > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > >   (and not per HW queue), the argument could be that user apps need
> > > >   to setup NAPIs when they are created and settings do not persist
> > > >   between NAPIs with different IDs even if associated with the same
> > > >   HW queue.
> > > 
> > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > place. I would venture a guess that the person who added the IDs was
> > > working with NICs which have stable NAPI instances once the device is
> > > opened. This is, unfortunately, not universally the case.
> > > 
> > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > on an open device. The closer we get to the queue API, the more dynamic
> > > the whole setup will become (read: the more often reconfigurations will
> > > happen).
> > >
> > 
> > [...]
> > 
> > > > I think you have much more practical experience when it comes to
> > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > limited driver experience.
> > > > 
> > > > My two goals with this series are:
> > > >   1. Make it possible to set these values per NAPI
> > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > >      parameter through the code path carved in this series
> > > > 
> > > > So, I'm happy to proceed with this series as you prefer whether
> > > > that's documentation or "napi_storage"; I think you are probably the
> > > > best person to answer this question :)
> > > 
> > > How do you feel about making this configuration opt-in / require driver
> > > changes? What I'm thinking is that having a new "netif_napi_add()"
> > > variant (or perhaps an extended netif_napi_set_irq()) take an extra
> > > "index" parameter would make the whole thing much simpler.
> > 
> > What about extending netif_queue_set_napi instead? That function
> > takes a napi and a queue index.
> > 
> > Locally I kind of hacked up something simple that:
> >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> >   - Modifies netif_queue_set_napi to:
> >      if (napi)
> >        napi->storage = dev->napi_storage[queue_index];
> > 
> > I think I'm still missing the bit about the
> > max(rx_queues,tx_queues), though :(
> > 
> > > Index would basically be an integer 0..n, where n is the number of
> > > IRQs configured for the driver. The index of a NAPI instance would
> > > likely match the queue ID of the queue the NAPI serves.
> > 
> > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > are NAPIs for which no IRQ is allocated ~someday~ ?
> > 
> > It seems (though I could totally be wrong) that netif_queue_set_napi
> > can be called and will work, creating the association even without an
> > IRQ allocated.
> > 
> > I guess the issue is mostly the queue index question above: combined
> > rx/tx vs drivers having different numbers of rx and tx queues.
> > 
> > > We can then allocate an array of "napi_configs" in net_device -
> > > like we allocate queues; the array size would be max(num_rx_queues,
> > > num_tx_queues). We just need to store a couple of ints so it will
> > > be tiny compared to queue structs, anyway.
> > > 
> > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > than the ephemeral NAPI ID. It can apply the config to all live
> > > NAPI instances with that index (of which there really should only 
> > > be one, unless the driver is mid-reconfiguration somehow, but even that
> > > won't cause issues, we can give multiple instances the same settings)
> > > and also store the user config in the array in net_device.
> > > 
> > > When a new NAPI instance is associated with a NAPI index, it should get
> > > all the config associated with that index applied.
> > > 
> > > Thoughts? Does that make sense, and if so do you think it's an
> > > over-complication?
> > 
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.
> 
> What if instead of an extra storage index in UAPI, we make napi_id persistent?
> Then we can keep using napi_id as a user-facing number for the configuration.
> 
> Having a stable napi_id would also be super useful for the epoll setup so you
> don't have to match old/invalid ids to the new ones on device reset.

Up to now, for prototyping purposes, the way I've been dealing with this is
by using a SO_ATTACH_REUSEPORT_CBPF program like this:

struct sock_filter code[] = {
    /* A = skb->queue_mapping */
    { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
    /* A = A % n */
    { BPF_ALU | BPF_MOD, 0, 0, n },
    /* return A */
    { BPF_RET | BPF_A, 0, 0, 0 },
};

with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
even if the NAPI IDs change the filter still distributes connections from the
same "queue_mapping" to the same thread.

Since epoll busy poll is based on NAPI ID (and not queue_mapping), this will
probably cause some issues if the NAPI ID changes, because the NAPI ID
associated with the epoll context will suddenly change, meaning the "old" NAPI
won't be busy polled. This might be fine, because if that happens the old NAPI
is being disabled anyway?

At any rate, the user program doesn't "need" to do anything when the NAPI ID
changes... unless it has a more complicated eBPF program that relies on NAPI
IDs ;)
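
For reference, about the only place a program like that touches NAPI IDs
directly is something like the sketch below (the function name is a
placeholder), which is also how it would even notice an ID changing:

#include <stdint.h>
#include <sys/socket.h>

/* After accept(), ask which NAPI instance delivered this connection.
 * This is the value epoll busy poll keys off of, so if the NAPIs are
 * torn down and recreated the value returned here changes, too.
 */
static uint32_t get_incoming_napi_id(int connfd)
{
    uint32_t napi_id = 0;
    socklen_t len = sizeof(napi_id);

    if (getsockopt(connfd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len))
        return 0;

    return napi_id;
}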

> In the code, we can keep the same idea with napi_storage in netdev and
> ask drivers to provide storage id, but keep that id internal.
> 
> The only complication with that is napi_hash_add/napi_hash_del that
> happen in netif_napi_add_weight. So for the devices that allocate
> new napi before removing the old ones (most devices?), we'd have to add
> some new netif_napi_takeover(old_napi, new_napi) to remove the
> old napi_id from the hash and reuse it in the new one.
> 
> So for mlx5, the flow would look like the following:
> 
> - mlx5e_safe_switch_params
>   - mlx5e_open_channels
>     - netif_napi_add(new_napi)
>       - adds napi with 'ephemeral' napi id
>   - mlx5e_switch_priv_channels
>     - mlx5e_deactivate_priv_channels
>       - napi_disable(old_napi)
>       - netif_napi_del(old_napi) - this frees the old napi_id
>   - mlx5e_activate_priv_channels
>     - mlx5e_activate_channels
>       - mlx5e_activate_channel
>         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> 	  - if napi is not hashed - safe to reuse?
> 	- napi_enable
> 
> This is a bit ugly because we still have random napi ids during reset, but
> it's not super complicated implementation-wise. We can eventually improve
> the above by splitting netif_napi_add_weight into two steps: allocate and
> activate (to do the napi_id allocation & hashing). Thoughts?
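
To make the takeover idea a bit more concrete, here's a very rough sketch of
what such a helper could look like, assuming it lives in net/core/dev.c next
to the existing napi hash helpers. Nothing like this exists today: struct
napi_storage, its napi_id field, and the exact locking are all assumptions.

/* Hypothetical: inherit the napi_id recorded in the per-index storage
 * slot instead of allocating a fresh one. Only done if no live NAPI is
 * currently hashed under that id, i.e. the old instance has already
 * been unhashed by netif_napi_del().
 */
static void netif_napi_takeover(struct napi_struct *napi,
                                struct napi_storage *storage)
{
        unsigned int old_id = READ_ONCE(storage->napi_id);

        if (!old_id)
                return;         /* first time through, nothing to inherit */

        spin_lock(&napi_hash_lock);
        if (!napi_by_id(old_id)) {
                napi->napi_id = old_id;
                hlist_add_head_rcu(&napi->napi_hash_node,
                                   &napi_hash[old_id % HASH_SIZE(napi_hash)]);
        }
        spin_unlock(&napi_hash_lock);
}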
