Date:   Wed, 23 May 2018 02:23:14 -0700
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Huy Nguyen <huyn@...lanox.com>
Cc:     Saeed Mahameed <saeedm@...lanox.com>,
        "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        Jiri Pirko <jiri@...nulli.us>,
        Or Gerlitz <gerlitz.or@...il.com>,
        Parav Pandit <parav@...lanox.com>,
        Ido Schimmel <idosch@...lanox.com>
Subject: Re: [net-next 1/6] net/dcb: Add dcbnl buffer attribute

On Tue, 22 May 2018 20:01:21 -0500, Huy Nguyen wrote:
> On 5/22/2018 1:32 PM, Jakub Kicinski wrote:
> > On Tue, 22 May 2018 10:36:17 -0500, Huy Nguyen wrote:  
> >> On 5/22/2018 12:20 AM, Jakub Kicinski wrote:  
> >>> On Mon, 21 May 2018 14:04:57 -0700, Saeed Mahameed wrote:  
> >>>> From: Huy Nguyen <huyn@...lanox.com>
> >>>>
> >>>> In this patch, we add a dcbnl buffer attribute to allow the user to
> >>>> change the NIC's buffer configuration, such as the priority-to-buffer
> >>>> mapping and the size of each individual buffer.
> >>>>
> >>>> This attribute, combined with the pfc attribute, allows an advanced
> >>>> user to fine-tune the QoS settings for a specific priority queue. For
> >>>> example, the user can give a dedicated buffer to one or more
> >>>> priorities, or give a large buffer to certain priorities.
> >>>>
> >>>> We present a use case scenario where the dcbnl buffer attribute,
> >>>> configured by an advanced user, helps reduce the latency of messages
> >>>> of different sizes.
> >>>>
> >>>> Scenario description:
> >>>> On ConnectX-5, we run latency-sensitive traffic with small/medium
> >>>> message sizes ranging from 64B to 256KB, and bandwidth-sensitive
> >>>> traffic with large message sizes of 512KB and 1MB. We group the
> >>>> small, medium, and large message sizes into their own PFC-enabled
> >>>> priorities as follows.
> >>>>     Priorities 1 & 2 (64B, 256B and 1KB)
> >>>>     Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
> >>>>     Priorities 5 & 6 (512KB and 1MB)
> >>>>
> >>>> By default, ConnectX-5 maps all PFC-enabled priorities to a single
> >>>> lossless fixed buffer sized at 50% of the total available buffer
> >>>> space. The other 50% is assigned to the lossy buffer. Using the dcbnl
> >>>> buffer attribute, we create three equally sized lossless buffers,
> >>>> each with 25% of the total available buffer space. Thus, the lossy
> >>>> buffer size is reduced to 25%. The priority-to-lossless-buffer
> >>>> mappings are set as follows.
> >>>>     Priorities 1 & 2 on lossless buffer #1
> >>>>     Priorities 3 & 4 on lossless buffer #2
> >>>>     Priorities 5 & 6 on lossless buffer #3
> >>>>
> >>>> We observe the following latency improvements for small and medium
> >>>> message sizes. Please note that bandwidth for the large message sizes
> >>>> is reduced, but the total bandwidth remains the same.
> >>>>     256B message size (42% latency reduction)
> >>>>     4K message size (21% latency reduction)
> >>>>     64K message size (16% latency reduction)
> >>>>
> >>>> Signed-off-by: Huy Nguyen <huyn@...lanox.com>
> >>>> Signed-off-by: Saeed Mahameed <saeedm@...lanox.com>  
> >>> On a cursory look this bears a lot of resemblance to devlink shared
> >>> buffer configuration ABI.  Did you look into using that?
> >>>
> >>> Just to be clear devlink shared buffer ABIs don't require representors
> >>> and "switchdev mode".
> >> [HQN] Dear Jakub, there are several reasons why the devlink shared
> >> buffer ABI cannot be used:
> >> 1. The devlink shared buffer ABI is written based on the switch CLI;
> >> you can find out more from this link:
> >> https://community.mellanox.com/docs/DOC-2558.
> > The devlink API accommodates the requirements of both simpler
> > (SwitchX2?) and more advanced schemes (present in Spectrum).  The
> > simpler/basic static threshold configuration is exactly what you are
> > doing here, AFAIU.
> [HQN] The devlink API is tailored specifically for switches.

I hope that is not true, since we (Netronome) are trying to use it for
NIC configuration, too.  We should generalize the API if need be.

> We don't configure thresholds explicitly; it is done via PFC. Once PFC
> is enabled on a priority, the threshold is set up based on our
> proprietary formula, which was tested rigorously for performance.

Are you referring to XOFF/XON thresholds?  I don't think the "threshold
type" in the devlink API implies we are setting XON/XOFF thresholds
directly :S  If PFC is enabled we may be setting them indirectly,
obviously.

My understanding is that for the static threshold type the size
parameter specifies the max amount of memory a given pool can consume.
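
For reference, here is roughly what the driver-side shared buffer pool
hooks look like (a simplified sketch from memory of
include/net/devlink.h; the exact signatures may differ):

	/* Simplified sketch of the devlink shared buffer pool ops a
	 * driver provides in its struct devlink_ops.  "size" is the
	 * pool size in bytes; the threshold type selects static vs.
	 * dynamic (alpha) accounting for consumers bound to the pool.
	 */
	int (*sb_pool_get)(struct devlink *devlink, unsigned int sb_index,
			   u16 pool_index,
			   struct devlink_sb_pool_info *pool_info);
	int (*sb_pool_set)(struct devlink *devlink, unsigned int sb_index,
			   u16 pool_index, u32 size,
			   enum devlink_sb_threshold_type threshold_type);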

> >> 2. The dcbnl interfaces have been used for QoS settings.  
> > QoS settings != shared buffer configuration.  
> [HQN] I think we have different definitions of "shared buffer".
> Please refer to the switch CLI link below.
> It explains in detail what "shared buffer" means in a switch.
> Our NIC does not support a "shared buffer".
> https://community.mellanox.com/docs/DOC-2591

Yes, we must have different definitions of "shared buffer" :)  That
link, however, didn't clarify much for me...  In mlx5 you seem to have a
buffer which is shared between priorities, even if it's not what would
be referred to as a shared buffer in the switch context.

> >> In a NIC, the buffer configuration is tied to priority (ETS, PFC).
> > Some customers use DCB; a lot (most?) of them don't.  I don't think
> > the "this is a logical extension of a commonly used API" argument
> > really stands here.
> [HQN] DCBNL is being actively used. The whole point of this patch
> is to tie the buffer configuration to IEEE's priorities and IEEE's PFC
> configuration.
>
> The ambitious future is to have the switch configure the NIC's buffer
> size and buffer mapping via TLV packets and this DCBNL interface,
> but we won't go that far here.

I think I can understand the motivation, and I think it's a nice thing
to expose!  The only questions are: does it really belong in DCBNL, and
can an existing API be used?
 
From the patch description it seems like your default setup is a shared
buffer split 50% (lossy) / 50% (all prios), and the example you give
changes that to 25% (lossy) / 25% x 3 prio groups.

With the existing devlink API, could this be modelled as three ingress
pools with 2 TCs bound to each?
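
I'm thinking of something along these lines (the device name, sizes and
thresholds below are made up, and I'm typing the syntax from memory):

    # carve three static ingress pools, one per priority group
    devlink sb pool set pci/0000:01:00.0 pool 0 size 4194304 thtype static
    devlink sb pool set pci/0000:01:00.0 pool 1 size 4194304 thtype static
    devlink sb pool set pci/0000:01:00.0 pool 2 size 4194304 thtype static
    # bind two TCs to each pool; likewise tc 3/4 -> pool 1, tc 5/6 -> pool 2
    devlink sb tc bind set pci/0000:01:00.0/0 tc 1 type ingress pool 0 th 4194304
    devlink sb tc bind set pci/0000:01:00.0/0 tc 2 type ingress pool 0 th 4194304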

> >> The buffer configuration is not tied to a port as in a switch.
> > It's tied to a port and TCs; you just have one port but still have 8
> > TCs, exactly like a switch...
> [HQN] No. Our buffers are tied to priorities, not to TCs.

Right, that is a valid point.  Although TCs can be mapped to
priorities.  Some switches may tie buffers to priorities, too.  So
perhaps it's worth extending devlink?

> >> 3. Shared buffer, alpha, and threshold are switch-specific terms.
> > IDK how talking about alpha is relevant; it's just one threshold
> > type the API supports.  As for shared buffer and threshold, I don't
> > know if these are switch terms (or how a "switch" differs from a
> > "NIC" at that level) - I personally find carving a shared buffer
> > into pools very intuitive.
> [HQN] Yes, I understand your point too. The NIC's buffer shares some 
> characteristics with the switch's buffer settings. 

Yes, and if it's not a perfect match we can extend it.

> But this DCB buffer setting is meant to improve performance and work
> together with the PFC setting. We would like to keep all the QoS
> settings under DCB netlink, as they were designed to be this way.

DCBNL seems to carry standard-based information, which this is not.
mlxsw supports DCBNL; will it also support this buffer configuration
mechanism?

> > Could you give examples of commands/configs one can use with your
> > new ABI?  
> [HQN] The plan is to add the support to lldptool once the kernel code
> is accepted. To test the kernel code, I am using small Python scripts
> that work on top of the netlink library. The format, similar to other
> lldptool options, will be like this:
>     priority2buffer: 0,2,5,7,1,2,3,6 maps priorities 0,1,2,3,4,5,6,7
>     to buffers 0,2,5,7,1,2,3,6
>     buffer_size: 87296,87296,0,87296,0,0,0,0 sets the receive buffer
>     size for buffers 0,1,2,3,4,5,6,7 respectively
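
To make the discussion concrete - IIUC the attribute payload is
essentially a fixed-size struct along these lines (my paraphrase of the
description above; the field names are guesses, not the patch's actual
UAPI):

	/* hypothetical sketch of the proposed buffer attribute payload;
	 * 8 priorities are mapped onto 8 receive buffers
	 */
	struct dcbnl_buffer {
		__u8	prio2buffer[8];	/* priority -> buffer index */
		__u32	buffer_size[8];	/* per-buffer size in bytes */
	};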
> >    How does one query the total size of the buffer to be carved?  
> [HQN] This is not necessary. If the total size is too big, an error
> will be returned via the DCB netlink interface.

Right, I'm not saying it's a bug :)  It's just nice when the user can
be told the total size without having to probe for it :)
