Message-ID: <CAKgT0UdyApTnBOr9rdsdoh4==orpxNEuhkLngX2MuEhOzqdh0w@mail.gmail.com>
Date:   Fri, 4 Feb 2022 07:46:43 -0800
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     Eric Dumazet <edumazet@...gle.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        netdev <netdev@...r.kernel.org>, Coco Li <lixiaoyan@...gle.com>
Subject: Re: [PATCH net-next 09/15] net: increase MAX_SKB_FRAGS

On Fri, Feb 4, 2022 at 2:18 AM David Laight <David.Laight@...lab.com> wrote:
>
> From: Alexander Duyck
> > Sent: 03 February 2022 17:57
> ...
> > > > So a big issue I see with this patch is the potential queueing issues
> > > > it may introduce on Tx queues. I suspect it will cause a number of
> > > > performance regressions and deadlocks as it will change the Tx queueing
> > > > behavior for many NICs.
> > > >
> > > > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > > > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > > > needs to stop. With this change the value for igb for instance is
> > > > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > > > such the minimum Tx descriptor threshold for the driver would need to
> > > > be updated beyond 80 otherwise it is likely to deadlock the first time
> > > > it has to pause.
> > >
> > > Are these limits hard coded in Intel drivers and firmware, or do you
> > > think this can be changed ?
> >
> > This is all code in the drivers. Most drivers have them as the logic
> > is used to avoid having to return NETDEV_TX_BUSY. Basically the
> > assumption is there is a 1:1 correlation between descriptors and
> > individual frags. So most drivers would need to increase the size of
> > their Tx descriptor rings if they were optimized for a lower value.
>
> Maybe the drivers can be a little less conservative about the number
> of fragments they expect in the next message?
> There is little point requiring 49 free descriptors when the workload
> never has more than 2 or 3 fragments.
>
> Clearly you don't want to re-enable things unless there are enough
> descriptors for an skb that has generated NETDEV_TX_BUSY, but the
> current logic of 'trying to never actually return NETDEV_TX_BUSY'
> is probably over cautious.

The problem is that NETDEV_TX_BUSY can cause all sorts of issues in
terms of the flow of packets. Basically, when you start having to push
packets back from the device to the qdisc, you essentially create
head-of-line-blocking scenarios, which make things like traffic
shaping that much more difficult.
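
To make the thresholds concrete, the stop/wake pattern most of these
drivers follow looks roughly like the sketch below (the helper and
struct names are simplified, not the literal igb code). The per-packet
reservation is derived from MAX_SKB_FRAGS plus a few extra descriptors
for the head and context, which is where the jump from 21 to 49, and
the 98 wake threshold, comes from.

#define DESC_NEEDED        (MAX_SKB_FRAGS + 4)  /* frags + head + context */
#define TX_WAKE_THRESHOLD  (2 * DESC_NEEDED)

static int maybe_stop_tx(struct tx_ring *ring, u16 needed)
{
        if (likely(desc_unused(ring) >= needed))
                return 0;

        netif_stop_subqueue(ring->netdev, ring->queue_index);

        /* Make the stop visible before re-checking; a Tx IRQ may have
         * freed descriptors after the first check.
         */
        smp_mb();

        if (desc_unused(ring) < needed)
                return -EBUSY;

        /* Enough room freed up after all, restart the queue. */
        netif_start_subqueue(ring->netdev, ring->queue_index);
        return 0;
}

The Tx cleanup path then only restarts a stopped queue once at least
TX_WAKE_THRESHOLD descriptors are free again, which is why the minimum
ring size has to stay above that threshold.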

> Does Linux allow skb to have a lot of short fragments?
> If dma_map isn't cheap (probably anything with an iommu or non-coherent
> memory) then copying/merging short fragments into a pre-mapped
> buffer can easily be faster.

I know Linux skbs can have a lot of short fragments. The i40e, for
instance, has a workaround for cases where more than 8 fragments are
needed to transmit a single frame; see __i40e_chk_linearize().
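
For reference, the shape of that workaround is roughly the following
(heavily simplified; the real __i40e_chk_linearize() also has to
account for TSO segment boundaries, and the helper name here is made
up):

#include <linux/skbuff.h>

#define HW_MAX_DESCS_PER_FRAME  8   /* i40e per-frame hardware limit */

static int chk_linearize(struct sk_buff *skb)
{
        /* Worst case is one descriptor per frag plus one for the head. */
        if (skb_shinfo(skb)->nr_frags + 1 <= HW_MAX_DESCS_PER_FRAME)
                return 0;

        /* Too many pieces for one frame: copy everything into a single
         * linear buffer so it fits the hardware's per-packet budget.
         */
        return skb_linearize(skb);
}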

> Many years ago we found it was worth copying anything under 1k on
> a sparc mbus+sbus system.
> I don't think Linux can generate what I've seen elsewhere - the mac
> driver being asked to transmit something with 1000+ one-byte fragments!
>
>         David

Linux cannot generate the 1000+ fragments, mainly because it is
limited by MAX_SKB_FRAGS. However, as I pointed out above, it isn't
uncommon to see an skb composed of a number of smaller fragments.

That said, I don't know that we really need to rewrite the
NETDEV_TX_BUSY handling in the drivers. It would just be a matter of
reserving more memory in the descriptor rings: in the case of igb, for
instance, the count needed to unblock a Tx queue would be going from
42 to 98, while the current minimum ring size is 80. So it would just
be a matter of raising that minimum so the ring cannot be configured
into a deadlock.
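
Just to spell the deadlock out with igb's numbers:

        stop threshold (DESC_NEEDED)       =  49
        wake threshold (2 * DESC_NEEDED)   =  98
        current minimum ring size          =  80  <  98

A ring configured at the minimum could stop the queue and then never
see enough free descriptors to wake it again, so the minimum would
have to move above 98.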

Ultimately that is the trade-off with this approach. We are increasing
the memory footprint of the drivers and skbs in order to allow for
more buffering per skb and so increase throughput. I wonder if it
wouldn't make sense to just make MAX_SKB_FRAGS a driver-level setting
like gso_max_size, so that low-end NICs out there don't have to
reserve a ton of memory for fragments they will never use.
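
As a purely hypothetical sketch of what that could look like (neither
the field nor the helper below exists today; the names just mirror the
gso_max_size convention):

static inline void netif_set_max_skb_frags(struct net_device *dev,
                                           unsigned int max_frags)
{
        /* Clamp to the global compile-time maximum. */
        dev->max_skb_frags = min_t(unsigned int, max_frags, MAX_SKB_FRAGS);
}

A low-end NIC could then advertise a small limit at probe time, e.g.
netif_set_max_skb_frags(netdev, 17), and keep sizing its Tx rings for
that smaller fragment count, while the stack would cap nr_frags at
dev->max_skb_frags when building skbs for that device.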

- Alex
