Date:   Mon, 6 Aug 2018 17:49:05 -0700
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Eran Ben Elisha <eranbe@...lanox.com>
Cc:     David Miller <davem@...emloft.net>, saeedm@...lanox.com,
        netdev@...r.kernel.org, jiri@...lanox.com,
        alexander.duyck@...il.com, helgaas@...nel.org
Subject: Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink
 updates 2018-07-31

On Mon, 6 Aug 2018 16:01:25 +0300, Eran Ben Elisha wrote:
> >> Hi Dave,
> >> I would like to re-state that this feature was not meant to be a generic
> >> one. This feature was added in order to resolve an HW bug which exists
> >> in a small portion of our devices.
> > 
> > Would you mind describing the HW bug in more detail?  To an outside
> > reviewer it really looks like you're adding a feature.  What are you
> > working around?  Is the lack of full AQM on the PCIe side of the chip
> > considered a bug?  
> 
> In a multi-function environment, there is an issue with per-function
> buffer allocation which may lead to starvation.

Multi-function?  I thought you have a PF per uplink on all mlx5
silicon.  Does the problem occur in single-host scenarios as well?
What if, with a single function, the host is too slow taking packets
off the RX ring?  Can the problem occur then?  Would this feature help
in such a scenario as well?

> There is a HW workaround (WA) to mitigate this starvation by
> identifying this state and applying an early drop/mark.

If I understand you correctly you presently have one shared buffer with
no way to place limits (quotas) on how much of it can be consumed by
traffic for a single PF.  It remains unclear why this has not been a
problem for you until now.

To avoid the starvation you are adding AQM, which is a *feature* that
may help you avoid queue build-up in the NIC.  But even if you could
place quotas, why would you not expose the AQM scheme?  It looks very
useful.
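
To illustrate, per-PF quotas on a shared buffer are pretty much what
devlink-sb models already.  A rough sketch, with made-up device, pool
and port numbers:

  # carve a dynamic-threshold pool out of the shared buffer
  devlink sb pool set pci/0000:03:00.0 sb 0 pool 0 \
        size 1048576 thtype dynamic
  # cap how much of that pool a single function's port may consume
  devlink sb port pool set pci/0000:03:00.0/1 sb 0 pool 0 th 9

Whether mlx5 can implement those hooks for this HW is of course a
separate question.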

> >> Those params will be used only on the current HW and won't be in
> >> use for our future devices.
> > 
> > I'm glad that is your plan today, however, customers may get used to
> > the simple interface you're adding now.  This means the API you are
> > adding is effectively becoming an API other drivers may need to
> > implement to keep compatibility with someone's proprietary
> > orchestration.  
> 
> This issue was refactored, so there is no need to have this WA at all
> in future NICs, and I don't believe we will end up in the situation
> you are describing.  It is unlikely that other vendors will face the
> same issue and have to support such a param; it was born out of a bug
> and not a feature which others may follow.

Sure, other vendors may have buffer quotas configurable by e.g.
devlink-sb.  But the AQM you are adding is a feature which is
potentially already supported by others.
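
E.g. RED/ECN is already expressible through the existing qdisc
interface, and some switch ASIC drivers offload it.  Something along
these lines (the parameter values are arbitrary, and whether a given
NIC offloads it is driver-dependent):

  # classic RED with ECN marking instead of drops
  # ('adaptive' would give Adaptive RED behaviour)
  tc qdisc add dev eth0 root handle 1: red \
        limit 400000 min 30000 max 90000 avpkt 1000 \
        burst 55 probability 0.1 bandwidth 10Mbit ecn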

> >> During the discussions, several alternatives were offered by
> >> various members of the community.  These alternatives include TC
> >> and enhancements to PCI configuration tools.
> >>
> >> Regarding the TC, from my perspective, this is not an option as:
> >> 1) The HW mechanism handles multiple functions and therefore
> >> cannot be configured as a regular TC qdisc
> > 
> > Could you elaborate?  What are the multiple functions?  You seem to
> > be adding a knob to enable ECN marking and a knob for choosing
> > between some predefined slopes.  
> 
> PSB.  The slopes are dynamic and enabled dynamically.
> Indeed, we are adding a very specific knob for a very non-standard,
> specific issue, which can be used in addition to standard ECN marking.
> 
> > In what way would your solution not behave like a RED offload?  
> 
> Existing algorithms (RED, PIE, etc.) are static and configurable.
> Our HW WA is dynamic (dynamic slope), self-adjusting, and
> auto-enabled.

You mean like Adaptive RED?  The lack of documentation is making this
conversation harder to have.  What's "dynamic" and what's "aggressive"?
Those are not antonyms.  Are the parameters to the algorithm
configurable?

Will you not want to expose the actual threshold and adjustment values
so the customers can tweak them on their own depending on the workload?
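
If those were modeled as proper devlink params, tweaking would be
trivial; e.g. (the param name below is hypothetical, I'm only sketching
the interface):

  # expose the drop/mark threshold as a runtime-settable knob
  devlink dev param set pci/0000:03:00.0 \
        name congestion_threshold value 75 cmode runtime
  devlink dev param show pci/0000:03:00.0 name congestion_threshold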

> > With TC offload you'd also get a well-defined set of statistics, I
> > presume right now you're planning on adding a set of ethtool -S
> > counters?
> >   
> >> 2) No PF + representors modeling can be applied here; this is a
> >> multi-host environment where one host is not aware of the other
> >> hosts, and each is running on its own PCI/driver.  It is a device
> >> working-mode configuration.
> > 
> > Yes, the multi-host part makes it less pleasant.  But this is a
> > problem we have to tackle separately, at some point.  It's not the
> > center of attention here.
> 
> Agreed; however, the multi-host part makes it non-transparent if we
> choose a solution which is not based on direct vendor configuration.
> This will lead to a bad user experience.

In my experience multi-host is not a major issue in practice.  And
switchdev mode gives some visibility into the statistics of other
hosts etc., which people appreciate.
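
FWIW flipping the eswitch over is a one-liner, and the representor
netdevs then carry standard per-netdev statistics:

  devlink dev eswitch set pci/0000:03:00.0 mode switchdev
  ip -s link show    # representors show up with their counters here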
