[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID:
<PH0PR12MB54810986E221222786F83ED2DCC62@PH0PR12MB5481.namprd12.prod.outlook.com>
Date: Mon, 10 Jun 2024 13:01:09 +0000
From: Parav Pandit <parav@...dia.com>
To: Greg KH <gregkh@...uxfoundation.org>, Shay Drori <shayd@...dia.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "pabeni@...hat.com"
<pabeni@...hat.com>, "davem@...emloft.net" <davem@...emloft.net>,
"kuba@...nel.org" <kuba@...nel.org>, "edumazet@...gle.com"
<edumazet@...gle.com>, "david.m.ertman@...el.com" <david.m.ertman@...el.com>,
"rafael@...nel.org" <rafael@...nel.org>, "ira.weiny@...el.com"
<ira.weiny@...el.com>, "linux-rdma@...r.kernel.org"
<linux-rdma@...r.kernel.org>, "leon@...nel.org" <leon@...nel.org>, Tariq
Toukan <tariqt@...dia.com>
Subject: RE: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
Hi Greg,
> From: Parav Pandit
> Sent: Thursday, June 6, 2024 8:13 AM
>
> Hi Greg,
>
>
> > From: Parav Pandit
> > Sent: Monday, June 3, 2024 12:41 PM
> >
> > Hi Greg,
> >
> > An emergency has turned up for Shay. So, he is going to be out of
> > office for few days.
> > I will attempt to answer below questions and spin v6 based on your
> > guidance.
> >
> > > From: Greg KH <gregkh@...uxfoundation.org>
> > > Sent: Thursday, May 30, 2024 8:21 PM
> > >
> > > On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
> > > >
> > > >
> > > > On 28/05/2024 20:57, Greg KH wrote:
> > > > > External email: Use caution opening links or attachments
> > > > >
> > > > >
> > > > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > > > > Today, PCI PFs and VFs, which are anchored on the PCI bus,
> > > > > > display their IRQ information in the
> > > > > > <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions
> > > > > > (SFs) are similar to PFs and VFs and these SFs are anchored on
> > > > > > the auxiliary bus. However, these PCI SFs lack such IRQ
> > > > > > information on the auxiliary bus, leaving users without
> > > > > > visibility into which IRQs are used by the SFs. This absence
> > > > > > makes it impossible to debug situations and to understand the
> > > > > > source
> > of interrupts/SFs for performance tuning and debug.
> > > > >
> > > > > Wait, again, this feels wrong. You should be able to walk back
> > > > > up the tree see the irq for the device, and vf, right? Why
> > > > > would the value be different down in the aux device?
> > > >
> > > >
> > > > you are correct, the IRQs of the aux device are subset of the IRQs
> > > > of the parent device. more detailed answer bellow.
> > >
> > > But isn't that already shown in /proc/interrupts? Why show it here as
> well?
> > >
> > Because /proc/interrupts cannot distinguish which irqs are for which SFs.
> > More below.
> >
> > > > > Does the msi irq somehow not actually show anywhere for the real
> > > > > pci
> > > device in sysfs at all today?
> > > > >
> > > > > What does sysfs look like today exactly for this information?
> > > >
> > > >
> > > > The IRQ information of all the children SFs of a PF is found in
> > > > the PF device as one single list.
> > > > There is no sane way to distinguish which IRQ is used by which SFs.
> > > > For example, in a machine with a single child SF of the PF 00:0b.0
> > > > we currently have the bellow:
> > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
> > > > 58
> > > >
> > > > the above are IRQs of both the PF and its child SF. from here we
> > > > cannot tell which IRQ is used by the child SF.
> > >
> > > Does that matter?
> > Yes it matters. A user needs to see if a given IRQ is for SF or a PF
> > and if it is for SF, is it for which SF.
> > It needs to know this to set/modify the irq affinity via
> > /proc/irq/<irq_num>/affinity_files.
> >
> > > Seriously, what is userspace going to do with that that it can not
> > > already determine from /proc/interrupts/ or /proc/irq/?
> > >
> > User space cannot determine if a given IRQ is for SF or not from
> > /proc/interrupts and /proc/irq either.
> > Let me explain with example.
> > It cannot figure out for two reasons.
> >
> > 1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below.
> > mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0.
> > Format is: "purpose, event q number @ device_name".
> > Purpose = mlx5_compl (completions)
> > Event q number = 62
> > Device = pci:0000:06.00.0.
> >
> > Can we do same for SFs?
> > For example, as,
> > mlx5_compl62@...iliary:mlx5_core.sf.2
> >
> > Will it work when SFs share the IRQ? No. it won't be accurate name
> > when a vector is shared among many SFs.
> > Also updating IRQ names dynamically to arbitrary long name won't be
> > any good either.
> >
> > 2. mlx5 do not use IRQF_SHARED.
> > This is the main reason due to which, /proc/irq/<irq_num>/devices_list
> > is missing SFs auxiliary device entries.
> >
> > We cannot share the NIC interrupt with any other arbitrary system
> > device because nic SF needs to have reasonably predictable latency.
> >
> > Shared interrupts also complicate the mlx5 driver flow even further as
> > it can fire right away.
> > And we want to avoid the already complex path further, with device
> > recovery flow in this area.
> >
> > Secondly, a while back also when team played with IRQF_SHARED there
> > was 8% or so perf drop too.
> > So mlx5 uses IRQ sharing among its own interrupt generators.
> > This is uniformly across PF, VF, SFs.
> > It also has certain good thresholds where it avoids sharing the IRQ
> > and allocates new dynamic one.
> > Thanks to Thomas's work.
> >
> > It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ
> > subsystem can offer APIs to restrict IRQs sharing to certain devices.
> > Sounds too complicated for the sake of showing SFs affinity.
> >
> > > I know it's /proc and not /sys but duplicating information might not
> > > be needed as I think you already have this information.
> >
> > Like I explained above, we don’t have that info. I think IRQF_SHARED
> > probably was missing piece to you previously.
> > PCI duplicates this info and widely used by irqbalance [1] which reads
> > from /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>.
> >
> > The useful part of irq files under sys is, a use space can directly
> > see the irq files under the desired device directly.
> > It would be extended for auxiliary bus SFs.
> >
> > [1] https://github.com/Irqbalance/irqbalance
> >
> > > If not, what is missing in /proc/irq today for this?
> > >
> > If IRQF_SHARED is used, nothing is missing. But we don’t find a good
> > way to use it.
> >
> > > > > And what about /proc/irq/ and /proc/interrupts/ doesn't that
> > > > > show you the needed information? Why are aux devices somehow
> "special"
> > > here?
> > > >
> > > >
> > > > /proc/interrupts interrupt name is some random driver choice.
> > >
> > > That's the same name you use here.
> > >
> > Yes, like I explained above. We can come up with some name for
> > /proc/interrupts.
> > But it won't tell which all SFs are sharing it, for example 64 SFs
> > sharing a vector won't be visible in some giant irq name in /proc/interrupts.
> >
> > So, to summarize, there are three places.
> >
> > 1. /proc/interrupts.
> > Limitations:
> > a. One line per interrupt does not fit the case.
> > b. interrupt name also not enough to show sharing SFs.
> >
> > 2. /proc/irq/<irq_number>/{device_list}
> > Limited only to IRQF_SHARED users.
> >
> > 3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name>
> > Pros:
> > a. Matches with pci PF and VFs
> > b. will be usable by irqbalance in similar way as PCI bus c. works
> > regardless of IRQF_SHARED used or not used.
> > d. Does not require complicated IRQ subsystem changes to limit
> > IRQF_SHARING to only a specific device type.
> >
> > > > There is
> > > > no sane naming convention and some small few bytes of upper limit
> > > > of what the IRQ name.
> > >
> > > That's up to the driver to name, so wouldn't it be the same here?
> > >
> > > > > > Additionally, the SFs are multifunctional devices supporting
> > > > > > RDMA, network devices, clocks, and more, similar to their peer
> > > > > > PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ
> > > > > > information available at the bus/device level.
> > > > >
> > > > > But it should be as part of the pci device, as that's where that
> > > > > information lives and is "bound" to, not the aux device on its own.
> > > >
> > > >
> > > > Auxiliary bus level SF device is using that IRQ too. So it is
> > > > additionally shown at auxiliary device level too.
> > >
> > > Your aux bus devices are using that, it's not generic to ALL aux bus devices.
> > > Along those lines, why not just put this info in the aux bus devices
> > > that your driver is bound to?
> > >
> > mlx5 driver uses the generic aux bus for SFs.
> > mlx5 driver can create some sysfs files directly into this SF device,
> > however we strongly want to avoid doing such vendor specific files
> > based on our experience in netdev and rdma subsystems.
> >
> > There are two main reasons to not put in mlx5 driver code.
> >
> > 1. In past several times it happened that a drivers tend to abuse this
> > sysfs interface to put any sort of debug information or other things
> > and it becomes hard to get rid of.
> > An example of this is fw_pages in
> > /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹ I cleaned a lot
> > in rdma now to avoid such abuse.
> > Want to avoid the same for the auxiliary bus too.
> >
> > 2. Intel sf driver is just getting started up. If mlx5 driver shows
> > "irqs", and intel driver will show this as "foo_irqs", the user will
> > not have same experience across two drivers.
> > For example, irqbalance [1] will not be able to work for these two vendors.
> >
> > So, it would be wiser to use aux bus extension for SF to show the
> > directory, like how pci bus does.
> > Vendor driver is just feeding as what irq is used by semi-virtual bus.
> >
> > > > > > To overcome the above limitations, this short series extends
> > > > > > the auxiliary bus to display IRQ information in sysfs, similar
> > > > > > to that of PFs and VFs.
> > > > >
> > > > > Again, examples of what it looks like today, and what it will
> > > > > look like with this patch set is needed in order to justify why
> > > > > this really is needed as it seems that the information should
> > > > > already be there
> > > for you.
> > > >
> > > >
> > > > full example:
> > > > in a machine with a single child SF of the PF 00:0b.0 we currently
> > > > have the bellow:
> > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
> > > > 58
> > > >
> > > > with this patch-set we will also have:
> > > > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> > > > 50 51 52 53 54 55 56 57 58
> > > >
> > > > which means that IRQs 50-58, which are shown on the PF "msi_irqs"
> > > > are used by the child SF.
> > >
> > > Please document this in the changelog if you wish to persist with it.
> > >
> > Sure. Its already part of the commit log of patch 2.
> > Will add the same example in cover letter too.
> >
> > > > > > It adds an 'irqs' directory under the auxiliary device and
> > > > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI
> > > > > > SF auxiliary devices share the IRQ with other SFs, a detail
> > > > > > that is also not available to the users. Consequently, this
> > > > > > <irq_num> file indicates whether the IRQ is 'exclusive' or 'shared'. This
> 'irqs'
> > > > > > directory extenstion is optional, i.e. only for PCI SFs the
> > > > > > sysfs irq
> > > information is optionally exposed.
> > > > >
> > > > > Why does userspace care about "shared" or not? What can they do
> > > > > with that, and why?
> > > >
> > > >
> > > > If IRQ is shared, userspace needs to take it into account when
> > > > looking at IRQ data and counters such as interrupts/sec.
> > > > Also, If IRQ is shared, user better not mess with the irq affinity
> > > > of such irq it as it can affect multiple SFs.
> > >
> > > But the logic of "shared" is at the irq level, not at the aux bus level.
> > > That's where that should be shown, not in a random bus subsystem,
> > > otherwise you would have to look across ALL busses to properly
> > > determine if an irq is shared or not, right?
> > >
> > Yes, that is a good point. Showing shared or exclusive irq in
> > /sys/irq/<irq_num>/device_name file is right think to do.
> > I agree with you.
> >
>
> > With that, are you ok, if we just add the
> > /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs
> > used by this SF?
> > And let user space deal to figure out sharing or exclusive.
> >
> > This will eliminate the global xarray in the patch-1 and "shared", "exclusive"
> > strings part of all the refcount code.
> >
>
> Did you get a chance to review above?
> Are you ok to add the sysfs files by the SF auxiliary bus driver without the
> "shared"/"exclusive" complexity?
Can you please suggest? Are you ok with this approach?
Powered by blists - more mailing lists