Message-ID: <847d24f9-9bcc-6de3-fd58-7414c22eebeb@candelatech.com>
Date: Tue, 23 Jan 2018 14:06:24 -0800
From: Ben Greear <greearb@...delatech.com>
To: Josh Hunt <joshhunt00@...il.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
netdev <netdev@...r.kernel.org>
Subject: Re: TCP many-connection regression between 4.7 and 4.13 kernels.

On 01/22/2018 10:46 AM, Josh Hunt wrote:
> On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear <greearb@...delatech.com> wrote:
>> On 01/22/2018 10:16 AM, Eric Dumazet wrote:
>>>
>>> On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
>>>>
>>>> My test case is to have 6 processes each create 5000 TCP IPv4 connections
>>>> to each other on a system with 16GB RAM and send slow-speed data. This
>>>> works fine on a 4.7 kernel, but will not work at all on a 4.13. The 4.13
>>>> first complains about running out of tcp memory, but even after forcing
>>>> those values higher, the max connections we can get is around 15k.
>>>>
>>>> Both kernels have my out-of-tree patches applied, so it is possible it is
>>>> my fault at this point.
>>>>
>>>> Any suggestions as to what this might be caused by, or if it is fixed in
>>>> more recent kernels?
>>>>
>>>> I will start bisecting in the meantime...
>>>>
>>>
>>> Hi Ben
>>>
>>> Unfortunately I have no idea.
>>>
>>> Are you using loopback flows, or have I misunderstood you?
>>>
>>> How can loopback connections be slow-speed?
>>>
>>
>> I am sending to self, but over external network interfaces, by using
>> routing tables and rules and such.
>>
>> On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
>> connections. In this case, I have a pair of 10G ports doing 15k, and then
>> I try to start 5k on two of the 1G ports....
>>
>> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
>> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
>> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
>> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
>> Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_s...es: 1
>> Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
>>
>
> Ben
>
> We had an interface doing this and grabbing these commits resolved it for us:
>
> 4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
> 19110cfbb34d e1000e: Separate signaling for link check/link up
> d3509f8bc7b0 e1000e: Fix return value test
> 65a29da1f5fd e1000e: Fix wrong comment related to link detection
> c4c40e51f9c3 e1000e: Fix error path in link detection
>
> They are in the LTS kernels now, but I don't believe they were when we
> first hit this problem.

Thanks a lot for the suggestions. I can confirm that these patches, applied to
my 4.13.16+ tree, do indeed seem to fix the problem.

Thanks,
Ben

>
> Josh
>
--
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc http://www.candelatech.com
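
The link flapping in the quoted e1000e log could be counted per interface with
something like the following. This is only a minimal sketch: it assumes each
kernel log entry is on a single line in the format quoted above, and the
function name count_link_flaps and both regular expressions are made up for
illustration, not taken from any tool mentioned in this thread.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this thread (an assumption:
# other kernel versions or drivers may word these messages differently).
LINK_DOWN = re.compile(r"e1000e: (\S+) NIC Link is Down")
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (\S+) \(e1000e\)")


def count_link_flaps(lines):
    """Return (link_down_counts, watchdog_counts), each keyed by interface."""
    downs = Counter()
    timeouts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            downs[match.group(1)] += 1
            continue
        match = WATCHDOG.search(line)
        if match:
            timeouts[match.group(1)] += 1
    return downs, timeouts
```

Any iterable of strings works as input, for example an open log file. The
location of the kernel log depends on the distribution. Comparing the counts
before and after applying the e1000e commits listed above would give a rough
indication of whether the flapping has stopped.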