Date:   Fri, 3 Aug 2018 21:59:09 -0700
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Ido Schimmel <idosch@...sch.org>
Cc:     Eran Ben Elisha <eranbe@...lanox.com>,
        David Miller <davem@...emloft.net>, saeedm@...lanox.com,
        netdev@...r.kernel.org, jiri@...lanox.com,
        alexander.duyck@...il.com, helgaas@...nel.org
Subject: Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink
 updates 2018-07-31

On Fri, 3 Aug 2018 19:41:50 +0300, Ido Schimmel wrote:
> On Thu, Aug 02, 2018 at 03:53:15PM -0700, Jakub Kicinski wrote:
> > No one is requesting full RED offload here...  If someone sets the
> > parameters you can't support, you simply won't offload them.  And
> > ignore the parameters which only make sense in software terms.  Look
> > at the docs for mlxsw:
> > 
> > https://github.com/Mellanox/mlxsw/wiki/Queues-Management#offloading-red
> > 
> > It says "not offloaded" in a number of places.
> >   
> ...
> > It's generally preferable to implement a subset of an existing,
> > well-defined API than to create vendor knobs, hence hardly a misuse.
> 
> Sorry for derailing the discussion, but you mentioned some points that
> have been bothering me for a while.
> 
> I think we didn't do a very good job with buffer management and this is
> exactly why you see some parameters marked as "not offloaded". Take the
> "limit" (queue size) for example. It's configured via devlink-sb, by
> setting a quota on the number of bytes that can be queued for the port
> and TC (queue) that RED manages. See:
> 
> https://github.com/Mellanox/mlxsw/wiki/Quality-of-Service#pool-binding

FWIW I was implementing a very similar thing for the NFP a while back:
devlink-sb to configure per-port limits, plus RED offload.  I believe we
have some more qdisc offloads, but those are out-of-tree/for appliances.
"Switchdev mode" + qdisc offloads work quite well.  For RED I think
we also don't offload the limit.
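
Concretely, the per-port/per-TC byte quota you describe gets set up
along these lines (devlink-sb syntax; the device handle, pool numbers
and thresholds below are purely illustrative):

  # carve out a static egress pool and give port 1 / TC 1 a byte quota
  # in it -- this is what effectively caps the RED-managed queue:
  devlink sb pool set pci/0000:03:00.0 sb 0 pool 7 size 2097152 thtype static
  devlink sb port pool set pci/0000:03:00.0/1 sb 0 pool 7 th 1048576
  devlink sb tc bind set pci/0000:03:00.0/1 sb 0 tc 1 type egress pool 7 th 524288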

> It would have been much better and more user-friendly not to ignore this
> parameter and to have users configure the limit using existing interfaces
> (tc), instead of creating a discrepancy between the software and
> hardware data paths by configuring the hardware directly via devlink-sb.
>
> I believe devlink-sb is mainly the result of Linux's shortcomings in
> this area and our lack of perspective back then. While the qdisc layer
> (Linux's shared buffers) works for end hosts, it requires enhancements
> (mainly on ingress) for switches (physical/virtual) that forward
> packets.

I could definitely agree with you.  But there is another way to look at
this.  Memory in ASICs is fundamentally more precious.  If the problem
was never solved for Linux (placing constraints on the number of
packets in the system by ingress port), maybe it's just not important
for software stacks?  Qdiscs are focused on egress.  Perhaps a better
software equivalent to Shared Buffers would be Jesper's Buffer Pools?

With Buffer Pools, the concern that a pre-configured and pinned pool of
DMA-mapped pages will start growing and eat all of the host's memory is
more real.  To me that's closer.  If we develop XDP-based fastpaths
with DMA pools shared between devices - that's much more like an ASIC's
SB.

In my view we don't offload the limit not because we configure it via
a different API, but because the limit assumes there is an abundance of
memory and the queue has to be capped.  The limit expresses how much
queue build-up is okay, while the SB config is strictly a resource
quota.  In practice the quota is always a lot lower than the user's
desired limit, so we don't even bother with the limit.
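
To put a rough number on it, a software-path RED config might look like
the line below (parameters made up, loosely following the mlxsw wiki
examples); min/max/probability are what gets offloaded, while the 1 MB
limit would only ever matter for queue build-up in host memory, and the
shared-buffer quota above is a fraction of it:

  # RED thresholds get offloaded; "limit" caps the software queue only:
  tc qdisc replace dev swp1 root handle 1: red limit 1000000 \
          min 200000 max 400000 avpkt 1000 burst 267 probability 0.1 ecn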

> For example, switches (I'm familiar with Mellanox ASICs, but I assume
> the concept is similar in other ASICs) have ingress buffers where
> packets are stored while going through the pipeline. Once out of the
> pipeline, you know from which port and queue the packet should egress.
> If you have both lossless and lossy traffic in your network, you
> probably want to classify it into different ingress buffers and mark the
> buffers where the lossless traffic is stored as such, so that PFC frames
> are emitted above a certain threshold.
> 
> This is currently configured using dcbnl, but it lacks a software model,
> which means that packets forwarded by the kernel don't get the
> same treatment (e.g., skb priority isn't set). It also means that when
> you want to limit the number of packets that are queued *from* a certain
> port and ingress buffer, you resort to tools such as devlink-sb that end
> up colliding with existing tools (tc).

Extending DCB further into the kernel on ingress does not seem
impossible.  Maybe the AVB/industrial folks will tackle that at some
point?
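
For reference, today that classification gets pushed straight down to
the device, e.g. (assuming the iproute2 dcb tool's buffer/pfc syntax;
the priority and buffer numbers are illustrative):

  # steer priority 3 into its own ingress buffer and make that buffer
  # lossless - nothing equivalent happens on the software data path:
  dcb buffer set dev swp1 prio-buffer all:0 3:1
  dcb pfc set dev swp1 prio-pfc all:off 3:on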

> I was thinking (not too much...) about modelling the above using ingress
> qdiscs. They don't do any queueing, but rather accounting. Once the
> egress qdisc dequeues the packet, you give credit back to the ingress
> qdisc from which the packet came. I believe that modelling these
> buffers using the qdisc layer is the right abstraction.

Interesting.  My concern would be mapping the packet back to its ingress
port to free the right qdisc credit.  The MM direction, like Buffer
Pools, seems more viable to a layman like me.  But ingress qdiscs sound
worth exploring.

> Would appreciate hearing your thoughts on the above.

Thanks a lot for your response, you've certainly given me things to
think about over the weekend :)  There are a lot of cool things we can
build if we keep moving forward :)
