[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<PH0PR12MB5481E91EE2E4C00C84750B95DCFF2@PH0PR12MB5481.namprd12.prod.outlook.com>
Date: Mon, 3 Jun 2024 07:10:38 +0000
From: Parav Pandit <parav@...dia.com>
To: Greg KH <gregkh@...uxfoundation.org>, Shay Drori <shayd@...dia.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "pabeni@...hat.com"
<pabeni@...hat.com>, "davem@...emloft.net" <davem@...emloft.net>,
"kuba@...nel.org" <kuba@...nel.org>, "edumazet@...gle.com"
<edumazet@...gle.com>, "david.m.ertman@...el.com" <david.m.ertman@...el.com>,
"rafael@...nel.org" <rafael@...nel.org>, "ira.weiny@...el.com"
<ira.weiny@...el.com>, "linux-rdma@...r.kernel.org"
<linux-rdma@...r.kernel.org>, "leon@...nel.org" <leon@...nel.org>, Tariq
Toukan <tariqt@...dia.com>
Subject: RE: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
Hi Greg,
An emergency has turned up for Shay. So, he is going to be out of office for few days.
I will attempt to answer below questions and spin v6 based on your guidance.
> From: Greg KH <gregkh@...uxfoundation.org>
> Sent: Thursday, May 30, 2024 8:21 PM
>
> On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
> >
> >
> > On 28/05/2024 20:57, Greg KH wrote:
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > > Today, PCI PFs and VFs, which are anchored on the PCI bus, display
> > > > their IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs
> > > > files. PCI subfunctions (SFs) are similar to PFs and VFs and these
> > > > SFs are anchored on the auxiliary bus. However, these PCI SFs lack
> > > > such IRQ information on the auxiliary bus, leaving users without
> > > > visibility into which IRQs are used by the SFs. This absence makes
> > > > it impossible to debug situations and to understand the source of
> > > > interrupts/SFs for performance tuning and debug.
> > >
> > > Wait, again, this feels wrong. You should be able to walk back up
> > > the tree see the irq for the device, and vf, right? Why would the
> > > value be different down in the aux device?
> >
> >
> > you are correct, the IRQs of the aux device are subset of the IRQs of
> > the parent device. more detailed answer bellow.
>
> But isn't that already shown in /proc/interrupts? Why show it here as well?
>
Because /proc/interrupts cannot distinguish which irqs are for which SFs.
More below.
> > > Does the msi irq somehow not actually show anywhere for the real pci
> device in sysfs at all today?
> > >
> > > What does sysfs look like today exactly for this information?
> >
> >
> > The IRQ information of all the children SFs of a PF is found in the PF
> > device as one single list.
> > There is no sane way to distinguish which IRQ is used by which SFs.
> > For example, in a machine with a single child SF of the PF 00:0b.0 we
> > currently have the bellow:
> > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
> >
> > the above are IRQs of both the PF and its child SF. from here we
> > cannot tell which IRQ is used by the child SF.
>
> Does that matter?
Yes it matters. A user needs to see if a given IRQ is for SF or a PF and if it is for SF, is it for which SF.
It needs to know this to set/modify the irq affinity via /proc/irq/<irq_num>/affinity_files.
> Seriously, what is userspace going to do with that that it
> can not already determine from /proc/interrupts/ or /proc/irq/?
>
User space cannot determine if a given IRQ is for SF or not from /proc/interrupts and /proc/irq either.
Let me explain with example.
It cannot figure out for two reasons.
1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below.
mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0.
Format is: "purpose, event q number @ device_name".
Purpose = mlx5_compl (completions)
Event q number = 62
Device = pci:0000:06.00.0.
Can we do same for SFs?
For example, as,
mlx5_compl62@...iliary:mlx5_core.sf.2
Will it work when SFs share the IRQ? No. it won't be accurate name when a vector is shared among many SFs.
Also updating IRQ names dynamically to arbitrary long name won't be any good either.
2. mlx5 do not use IRQF_SHARED.
This is the main reason due to which, /proc/irq/<irq_num>/devices_list is missing SFs auxiliary device entries.
We cannot share the NIC interrupt with any other arbitrary system device because nic SF needs to have reasonably predictable latency.
Shared interrupts also complicate the mlx5 driver flow even further as it can fire right away.
And we want to avoid the already complex path further, with device recovery flow in this area.
Secondly, a while back also when team played with IRQF_SHARED there was 8% or so perf drop too.
So mlx5 uses IRQ sharing among its own interrupt generators.
This is uniformly across PF, VF, SFs.
It also has certain good thresholds where it avoids sharing the IRQ and allocates new dynamic one.
Thanks to Thomas's work.
It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ subsystem can offer APIs to restrict IRQs sharing to certain devices.
Sounds too complicated for the sake of showing SFs affinity.
> I know it's /proc and not /sys but duplicating information might not be
> needed as I think you already have this information.
Like I explained above, we don’t have that info. I think IRQF_SHARED probably was missing piece to you previously.
PCI duplicates this info and widely used by irqbalance [1] which reads from /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>.
The useful part of irq files under sys is, a use space can directly see the irq files under the desired device directly.
It would be extended for auxiliary bus SFs.
[1] https://github.com/Irqbalance/irqbalance
> If not, what is missing in /proc/irq today for this?
>
If IRQF_SHARED is used, nothing is missing. But we don’t find a good way to use it.
> > > And what about /proc/irq/ and /proc/interrupts/ doesn't that show
> > > you the needed information? Why are aux devices somehow "special"
> here?
> >
> >
> > /proc/interrupts interrupt name is some random driver choice.
>
> That's the same name you use here.
>
Yes, like I explained above. We can come up with some name for /proc/interrupts.
But it won't tell which all SFs are sharing it, for example 64 SFs sharing a vector won't be visible in some giant irq name in /proc/interrupts.
So, to summarize, there are three places.
1. /proc/interrupts.
Limitations:
a. One line per interrupt does not fit the case.
b. interrupt name also not enough to show sharing SFs.
2. /proc/irq/<irq_number>/{device_list}
Limited only to IRQF_SHARED users.
3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name>
Pros:
a. Matches with pci PF and VFs
b. will be usable by irqbalance in similar way as PCI bus
c. works regardless of IRQF_SHARED used or not used.
d. Does not require complicated IRQ subsystem changes to limit IRQF_SHARING to only a specific device type.
> > There is
> > no sane naming convention and some small few bytes of upper limit of
> > what the IRQ name.
>
> That's up to the driver to name, so wouldn't it be the same here?
>
> > > > Additionally, the SFs are multifunctional devices supporting RDMA,
> > > > network devices, clocks, and more, similar to their peer PCI PFs
> > > > and VFs. Therefore, it is desirable to have SFs' IRQ information
> > > > available at the bus/device level.
> > >
> > > But it should be as part of the pci device, as that's where that
> > > information lives and is "bound" to, not the aux device on its own.
> >
> >
> > Auxiliary bus level SF device is using that IRQ too. So it is
> > additionally shown at auxiliary device level too.
>
> Your aux bus devices are using that, it's not generic to ALL aux bus devices.
> Along those lines, why not just put this info in the aux bus devices that your
> driver is bound to?
>
mlx5 driver uses the generic aux bus for SFs.
mlx5 driver can create some sysfs files directly into this SF device, however we strongly want to avoid doing
such vendor specific files based on our experience in netdev and rdma subsystems.
There are two main reasons to not put in mlx5 driver code.
1. In past several times it happened that a drivers tend to abuse this sysfs interface to put any
sort of debug information or other things and it becomes hard to get rid of.
An example of this is fw_pages in /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹
I cleaned a lot in rdma now to avoid such abuse.
Want to avoid the same for the auxiliary bus too.
2. Intel sf driver is just getting started up. If mlx5 driver shows "irqs", and intel driver will show this as
"foo_irqs", the user will not have same experience across two drivers.
For example, irqbalance [1] will not be able to work for these two vendors.
So, it would be wiser to use aux bus extension for SF to show the directory, like how pci bus does.
Vendor driver is just feeding as what irq is used by semi-virtual bus.
> > > > To overcome the above limitations, this short series extends the
> > > > auxiliary bus to display IRQ information in sysfs, similar to that
> > > > of PFs and VFs.
> > >
> > > Again, examples of what it looks like today, and what it will look
> > > like with this patch set is needed in order to justify why this
> > > really is needed as it seems that the information should already be there
> for you.
> >
> >
> > full example:
> > in a machine with a single child SF of the PF 00:0b.0 we currently
> > have the bellow:
> > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
> >
> > with this patch-set we will also have:
> > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> > 50 51 52 53 54 55 56 57 58
> >
> > which means that IRQs 50-58, which are shown on the PF "msi_irqs" are
> > used by the child SF.
>
> Please document this in the changelog if you wish to persist with it.
>
Sure. Its already part of the commit log of patch 2.
Will add the same example in cover letter too.
> > > > It adds an 'irqs' directory under the auxiliary device and
> > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI SF
> > > > auxiliary devices share the IRQ with other SFs, a detail that is
> > > > also not available to the users. Consequently, this <irq_num> file
> > > > indicates whether the IRQ is 'exclusive' or 'shared'. This 'irqs'
> > > > directory extenstion is optional, i.e. only for PCI SFs the sysfs irq
> information is optionally exposed.
> > >
> > > Why does userspace care about "shared" or not? What can they do
> > > with that, and why?
> >
> >
> > If IRQ is shared, userspace needs to take it into account when looking
> > at IRQ data and counters such as interrupts/sec.
> > Also, If IRQ is shared, user better not mess with the irq affinity of
> > such irq it as it can affect multiple SFs.
>
> But the logic of "shared" is at the irq level, not at the aux bus level.
> That's where that should be shown, not in a random bus subsystem,
> otherwise you would have to look across ALL busses to properly determine if
> an irq is shared or not, right?
>
Yes, that is a good point. Showing shared or exclusive irq in
/sys/irq/<irq_num>/device_name file is right think to do.
I agree with you.
With that, are you ok, if we just add the /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs used by this SF?
And let user space deal to figure out sharing or exclusive.
This will eliminate the global xarray in the patch-1 and "shared", "exclusive" strings part of all the refcount code.
> thanks,
>
> greg k-h
Powered by blists - more mailing lists