[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <5446492849dc4b7083b7a1f50a587d92@ausx13mps321.AMER.DELL.COM>
Date: Mon, 8 Oct 2018 21:09:46 +0000
From: <Alex_Gagniuc@...lteam.com>
To: <helgaas@...nel.org>
Cc: <mr.nuke.me@...il.com>, <linux-pci@...r.kernel.org>,
<bhelgaas@...gle.com>, <keith.busch@...el.com>,
<Austin.Bolen@...l.com>, <Shyam.Iyer@...l.com>,
<ariel.elior@...ium.com>, <everest-linux-l2@...ium.com>,
<davem@...emloft.net>, <michael.chan@...adcom.com>,
<ganeshgr@...lsio.com>, <jeffrey.t.kirsher@...el.com>,
<tariqt@...lanox.com>, <saeedm@...lanox.com>, <leon@...nel.org>,
<jakub.kicinski@...ronome.com>, <dirk.vandermerwe@...ronome.com>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<intel-wired-lan@...ts.osuosl.org>, <linux-rdma@...r.kernel.org>,
<oss-drivers@...ronome.com>, <stephen@...workplumber.org>,
<mj@....cz>, <alex.williamson@...hat.com>
Subject: Re: [PATCH 1/9] PCI: sysfs: Export available PCIe bandwidth
On 10/04/2018 03:45 PM, Bjorn Helgaas wrote:
>
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for sensitive information.
>
>
> [+cc Alex (VC mentioned below)]
>
> On Wed, Oct 03, 2018 at 10:00:19PM +0000, Alex_Gagniuc@...lteam.com wrote:
>> On 10/03/2018 04:31 PM, Bjorn Helgaas wrote:
>>> On Mon, Sep 03, 2018 at 01:02:28PM -0500, Alexandru Gagniuc wrote:
>>>> For certain bandwidth-critical devices (e.g. multi-port network cards)
>>>> it is useful to know the available bandwidth to the root complex. This
>>>> information is only available via the system log, which doesn't
>>>> account for link degradation after probing.
>>>>
>>>> With a sysfs attribute, we can computes the bandwidth on-demand, and
>>>> will detect degraded links.
>>>>
>>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@...il.com>
>>>> ---
>>>> drivers/pci/pci-sysfs.c | 13 +++++++++++++
>>>> 1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>> index 9ecfe13157c0..6658e927b1f5 100644
>>>> --- a/drivers/pci/pci-sysfs.c
>>>> +++ b/drivers/pci/pci-sysfs.c
>>>> @@ -218,6 +218,18 @@ static ssize_t current_link_width_show(struct device *dev,
>>>> }
>>>> static DEVICE_ATTR_RO(current_link_width);
>>>>
>>>> +static ssize_t available_bandwidth_show(struct device *dev,
>>>> + struct device_attribute *attr, char *buf)
>>>> +{
>>>> + struct pci_dev *pci_dev = to_pci_dev(dev);
>>>> + u32 bw_avail;
>>>> +
>>>> + bw_avail = pcie_bandwidth_available(pci_dev, NULL, NULL, NULL);
>>>> +
>>>> + return sprintf(buf, "%u.%03u Gb/s\n", bw_avail / 1000, bw_avail % 1000);
>>>> +}
>>>> +static DEVICE_ATTR_RO(available_bandwidth);
>>>
>>> Help me understand this. We already have these sysfs attributes:
>>>
>>> max_link_speed # eg, 16 GT/s
>>> max_link_width # eg, 8
>>> current_link_speed # eg, 16 GT/s
>>> current_link_width # eg, 8
>>>
>>> so I think the raw materials are already exposed.
>>>
>>> The benefits I see for this new file are that
>>>
>>> - pcie_bandwidth_available() does the work of traversing up the
>>> tree, doing the computations (link width * speed, reduced by
>>> encoding overhead), and finding the minimum, and
>>>
>>> - it re-traverses the path every time we look at it, while the
>>> boot-time check is a one-time event.
>>>
>>> In principle this could all be done in user space with the attributes
>>> that are already exported. There's some precedent for things like
>>> this in lspci, e.g., "NUMA node" [1], and lspci might even be a more
>>> user-friendly place for users to look for this, as opposed to
>>> searching through sysfs.
>>
>> Parsing the endpoint to root port bandwidth is, in principle, possible
>> from userspace. It's just that in practice it's very clumsy to do, and,
>> as you pointed out, not that reliable.
>
> I don't remember the reliability issue. Was that in a previous
> conversation? AFAICT, using current_link_speed and current_link_width
> should be as reliable as pcie_bandwidth_available() because they're
> both based on reading PCI_EXP_LNKSTA.
>
> This patch exposes only the available bandwidth, not the limiting
> device, link speed, or link width. Maybe that extra information isn't
> important in this context. Of course, it's easy to derive using
> current_link_speed and current_link_width, but if we do that, there's
> really no benefit to adding a new file.
>
> Linux doesn't really support Virtual Channels (VC) yet, and I don't
> know much about it, but it seems like Quality-of-Service features like
> VC could affect this idea of available bandwidth.
>
> Since we already expose the necessary information, and we might throw
> additional things like VC into the mix, I'm a little hesitant about
> adding things to sysfs because they're very difficult to change later.
I understand your concerns with VC and crazy PCIe outliers.
'available_bandwidth' is meant to mean physical bandwidth, rather than
"what comcast gives you". I see now the possibility for confusion.
My motivation is to save drivers from the hassle of having to call
pcie_print_link_status(), and prevent the rare duplicate syslog messages.
I prefer re-using the kernel logic over doing it over again from
userspace, but I won't insist over your doubts.
Alex
>> I understand it's not information that all users would jump in the air
>> to know. However, it was important enough for certain use cases, that
>> the kernel already has a very reliable way to calculate it.
>>
>> It seems to me that the most elegant way is to let the kernel tell us,
>> since the kernel already has this facility. To quote one of the texts
>> under Documentation/, it is an elegant way to "avoid reinventing kernel
>> wheels in userspace".
>>
>> Alex
>>
>>> [1] https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/commit/?id=90ec4a6d0ae8
>>>
>>>> static ssize_t secondary_bus_number_show(struct device *dev,
>>>> struct device_attribute *attr,
>>>> char *buf)
>>>> @@ -786,6 +798,7 @@ static struct attribute *pcie_dev_attrs[] = {
>>>> &dev_attr_current_link_width.attr,
>>>> &dev_attr_max_link_width.attr,
>>>> &dev_attr_max_link_speed.attr,
>>>> + &dev_attr_available_bandwidth.attr,
>>>> NULL,
>>>> };
>>>>
>>>> --
>>>> 2.17.1
>>>>
>>>
>>
>
Powered by blists - more mailing lists