Date:   Wed, 23 Aug 2017 17:03:28 -0700
From:   Florian Fainelli <f.fainelli@...il.com>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     netdev@...r.kernel.org, edumazet@...il.com, pabeni@...hat.com,
        willemb@...gle.com, davem@...emloft.net
Subject: Re: UDP sockets oddities

On 08/23/2017 04:04 PM, Eric Dumazet wrote:
> On Wed, 2017-08-23 at 15:49 -0700, Florian Fainelli wrote:
>> On 08/23/2017 03:26 PM, Eric Dumazet wrote:
>>> On Wed, 2017-08-23 at 13:02 -0700, Florian Fainelli wrote:
>>>> Hi,
>>>>
>>>> On Broadcom STB chips using bcmsysport.c and bcm_sf2.c we have an
>>>> out-of-band HW mechanism (not using per-flow pause frames) where the
>>>> integrated network switch can backpressure the CPU Ethernet controller,
>>>> which translates into completing TX packet interrupts at the appropriate
>>>> pace, and therefore flow control gets applied end-to-end from the host
>>>> CPU port towards any downstream port. At least that is the premise, and
>>>> this works reasonably well.
>>>>
>>>> This has a few drawbacks in that each of the bcmsysport TX queues needs
>>>> to be semi-statically mapped to its switch port output queue so that the
>>>> switch can calculate buffer occupancy and report congestion status,
>>>> which prompted this email [1], but that is tangential and is a policy
>>>> issue, not a mechanism issue.
>>>>
>>>> [1]: https://www.spinics.net/lists/netdev/msg448153.html
>>>>
>>>> This is useful when your CPU / integrated switch links up at 1Gbits/sec
>>>> internally, and tries to push 1Gbits/sec worth of UDP traffic to, e.g.,
>>>> a
>>>> downstream port linking at 100Mbits/sec, which could happen depending on
>>>> what you have connected to this device.
>>>>
>>>> Now the problem that I am facing, is the following:
>>>>
>>>> - net.core.wmem_default = 160KB (default value)
>>>> - using iperf -b 800M -u towards an iperf UDP server with the physical
>>>> link to that server established at 100Mbits/sec
>>>> - iperf does synchronous write(2) AFAICT so this gives it flow control
>>>> - using the default duration of 10s, you can barely see any packet loss
>>>> from one run to another
>>>> - the longer the run, the more packet loss you are going to see,
>>>> usually in the range of ~0.15% tops
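
For reference, the reproduction boils down to roughly this (the server
address is the one used in the runs below, the longer duration is just
an example):

# iperf -c 192.168.1.23 -u -b 800M          <- default 10s run, barely any loss
# iperf -c 192.168.1.23 -u -b 800M -t 300   <- longer run, more loss (up to ~0.15%)
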
>>>>
>>>> The transmit flow looks like this:
>>>>
>>>> gphy (net/dsa/slave.c::dsa_slave_xmit, IFF_NO_QUEUE device)
>>>>  -> eth0 (drivers/net/ethernet/broadcom/bcmsysport.c, "regular" network
>>>> device)
>>>>
>>>> I can clearly see that the network stack pushed N UDP packets (the Udp
>>>> and Ip counters in /proc/net/snmp concur), however what the driver
>>>> transmitted and what the switch transmitted is N - M, and the missing M
>>>> packets match the packet loss reported by the UDP server. I don't
>>>> measure any SndbufErrors, which does not make sense to me yet.
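
For completeness, I am reading those counters roughly like this
(SndbufErrors is a field on the Udp line of /proc/net/snmp; the driver
and switch counters come from ethtool, assuming both expose TX/drop
statistics):

# grep -E '^(Ip|Udp):' /proc/net/snmp
# ethtool -S eth0 | grep -iE 'tx|drop'
# ethtool -S gphy | grep -iE 'tx|drop'
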
>>>>
>>>> If I reduce the default socket write buffer size to, say, 10x less than
>>>> 160KB (16KB), then I either don't see any packet loss at 100Mbits/sec
>>>> for 5 minutes or more, or just very very little, down to 0.001%. Now if
>>>> I repeat the experiment with the physical link at 10Mbits/sec, the same
>>>> thing happens: the 16KB wmem_default setting is no longer enough and the
>>>> socket write buffer size needs to be lowered again.
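
That is, something along the lines of:

# sysctl -w net.core.wmem_default=16384
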
>>>>
>>>> So what I am wondering is:
>>>>
>>>> - do I have an obvious flow control problem in my network driver that
>>>> usually does not lead to packet loss, but may sometimes show up?
>>>>
>>>> - why would lowering the socket write buffer size appear to mask or
>>>> solve this problem?
>>>>
>>>> I can consistently reproduce this across several kernel versions, 4.1,
>>>> 4.9 and latest net-next and therefore can also test patches.
>>>>
>>>> Thanks for reading thus far!
>>>
>>> Have you checked qdisc counters ? Maybe drops happen there.
>>
>> CONFIG_NET_SCHED is actually disabled in this kernel configuration.
>>
>> But even with that enabled, I don't see any drops being reported at the
>> qdisc level, see below.
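
With CONFIG_NET_SCHED enabled, this is roughly what I checked:

# tc -s qdisc show dev eth0
# tc -s qdisc show dev gphy
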
> 
> You might try
> 
> perf record -a -g -e skb:kfree_skb sleep 10
> perf report

Attached is the perf report --stdio output of:

# perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to 192.168.1.23, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size:  160 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
[  4] Sent 488633 datagrams
[  4] Server Report:
[  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
(0.061%)
[  4]  0.0-74.4 sec  58 datagrams received out-of-order

It measured 488644 events, which is greater than the number of packets
transmitted by iperf, but I only count 8 non-IP packets being sent
(protocol=2054 -> ETH_P_ARP), so I am not sure what the other 4 were and
why they are not accounted for.
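
FWIW, the counts above come from something along the lines of:

# perf script | grep -c 'skb:kfree_skb'
# perf script | grep -c 'protocol=2048'
# perf script | grep -c 'protocol=2054'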

Almost all events came from net_tx_action() except 2 that came from
net/core/neighbour.c::neigh_probe() and 65 that came from
arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?

I did eventually run what you suggested:

# perf record -a -g -e skb:kfree_skb sleep 10
# perf report --stdio
# To display the perf.data header info, please use
--header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3  of event 'skb:kfree_skb'
# Event count (approx.): 3
#
# Children      Self  Trace output
# ........  ........  ....................................................
#
    33.33%    33.33%  skbaddr=0xef207240 protocol=2048 location=0xc087f684
            |
            ---0xe9fa0360
               kfree_skb

    33.33%    33.33%  skbaddr=0xef207300 protocol=4 location=0xc086cdac
            |
            ---0xe9fa0360
               kfree_skb

    33.33%    33.33%  skbaddr=0xef207b40 protocol=4 location=0xc086cdac
            |
            ---0xe9fa0360
               kfree_skb
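
For reference, the location= addresses can be mapped back to the freeing
call sites via /proc/kallsyms (quick and dirty, assuming kptr_restrict is
not hiding the addresses), e.g.:

# awk -v a=c087f684 '$1 <= a { s = $3 } END { print s }' /proc/kallsyms
# awk -v a=c086cdac '$1 <= a { s = $3 } END { print s }' /proc/kallsyms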

Florian

View attachment "log" of type "text/plain" (427253 bytes)
