Date:   Wed, 23 Aug 2017 16:04:36 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Florian Fainelli <f.fainelli@...il.com>
Cc:     netdev@...r.kernel.org, edumazet@...il.com, pabeni@...hat.com,
        willemb@...gle.com, davem@...emloft.net
Subject: Re: UDP sockets oddities

On Wed, 2017-08-23 at 15:49 -0700, Florian Fainelli wrote:
> On 08/23/2017 03:26 PM, Eric Dumazet wrote:
> > On Wed, 2017-08-23 at 13:02 -0700, Florian Fainelli wrote:
> >> Hi,
> >>
> >> On Broadcom STB chips using bcmsysport.c and bcm_sf2.c we have an out of
> >> band HW mechanism (not using per-flow pause frames) where the integrated
> >> network switch can backpressure the CPU Ethernet controller, which
> >> translates into completing TX packet interrupts at the appropriate pace,
> >> so flow control gets applied end-to-end from the host CPU port towards
> >> any downstream port. At least that is the premise, and it works
> >> reasonably well.
> >>
> >> This has a few drawbacks in that each of the bcmsysport TX queues needs
> >> to be semi-statically mapped to its switch port output queue so that the
> >> switch can calculate buffer occupancy and report congestion status,
> >> which prompted this email [1], but that is tangential and is a policy,
> >> not a mechanism, issue.
> >>
> >> [1]: https://www.spinics.net/lists/netdev/msg448153.html
> >>
> >> This is useful when your CPU / integrated switch links up at 1Gbits/sec
> >> internally, and tries to push 1Gbits/sec worth of UDP traffic to, e.g.,
> >> a downstream port linking at 100Mbits/sec, which could happen depending on
> >> what you have connected to this device.
> >>
> >> Now the problem that I am facing, is the following:
> >>
> >> - net.core.wmem_default = 160KB (default value)
> >> - using iperf -b 800M -u towards an iperf UDP server with the physical
> >> link to that server established at 100Mbits/sec
> >> - iperf does synchronous write(2) AFAICT so this gives it flow control
> >> - using the default duration of 10s, you can barely see any packet loss
> >> from one run to another
> >> - the longer the run, the more packet loss you are going to see,
> >> usually ~0.15% tops (commands to reproduce below)
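> >>
> >> Roughly, the reproduction looks like this (iperf2 syntax; the server
> >> address and the longer -t duration are just examples):
> >>
> >>   # UDP sink on the host behind the 100Mbits/sec link
> >>   iperf -s -u
> >>   # UDP sender on the STB: default 10s run, or e.g. -t 300 for longer
> >>   iperf -c 192.168.1.100 -u -b 800M -t 300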
> >>
> >> The transmit flow looks like this:
> >>
> >> gphy (net/dsa/slave.c::dsa_slave_xmit, IFF_NO_QUEUE device)
> >>  -> eth0 (drivers/net/ethernet/broadcom/bcmsysport.c, "regular" network
> >> device)
> >>
> >> I can clearly see that the network stack pushed N UDP packets (the Udp
> >> and Ip counters in /proc/net/snmp concur), however what the driver
> >> transmitted and what the switch transmitted is N - M, and M matches the
> >> packet loss reported by the UDP server. I don't measure any
> >> SndbufErrors, which does not make sense yet.
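> >>
> >> For reference, the comparison is done with something like this (the
> >> ethtool statistics names vary per driver, so the grep is only a
> >> sketch):
> >>
> >>   grep Udp /proc/net/snmp        # OutDatagrams and SndbufErrors
> >>   netstat -su                    # same counters, human readable
> >>   ethtool -S eth0 | grep -i tx   # what the driver actually sent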
> >>
> >> If I reduce the default socket write buffer size to, say, 10x less than
> >> 160KB, i.e. 16KB, then I either don't see any packet loss at
> >> 100Mbits/sec for 5 minutes or more, or just very little, down to
> >> 0.001%. Now if I repeat the experiment with the physical link at
> >> 10Mbits/sec, same thing: the 16KB wmem_default setting is no longer
> >> enough and we need to lower the socket write buffer size again.
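> >>
> >> For the record, the knob being changed is (value in bytes, 16384 being
> >> the 16KB mentioned above):
> >>
> >>   sysctl -w net.core.wmem_default=16384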
> >>
> >> So what I am wondering is:
> >>
> >> - do I have an obvious flow control problem in my network driver that
> >> usually does not lead to packet loss, but may sometimes happen?
> >>
> >> - why would lowering the socket write buffer size appear to mask or
> >> solve this problem?
> >>
> >> I can consistently reproduce this across several kernel versions, 4.1,
> >> 4.9 and latest net-next and therefore can also test patches.
> >>
> >> Thanks for reading thus far!
> > 
> > Have you checked qdisc counters? Maybe drops happen there.
> 
> CONFIG_NET_SCHED is actually disabled in this kernel configuration.
> 
> But even with that enabled, I don't see any drops being reported at the
> qdisc level, see below.
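>
> (Checked with something along the lines of "tc -s qdisc show dev eth0"
> and "tc -s qdisc show dev gphy", looking at the dropped/overlimits
> counters.)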

You might try

perf record -a -g -e skb:kfree_skb sleep 10
perf report
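
and, if needed, "perf script" afterwards to dump the individual
kfree_skb call stacks; with -g they should show whether the skbs are
freed in the driver, in the queueing layer or at the socket level.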

