Message-ID: <68580479-c66e-41e3-b869-b9f98e348f01@linux.ibm.com>
Date: Wed, 18 Oct 2023 21:43:40 +0200
From: Wenjia Zhang <wenjia@...ux.ibm.com>
To: Wen Gu <guwen@...ux.alibaba.com>,
Niklas Schnelle <schnelle@...ux.ibm.com>, kgraul@...ux.ibm.com,
jaka@...ux.ibm.com, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com
Cc: wintera@...ux.ibm.com, gbayer@...ux.ibm.com, pasic@...ux.ibm.com,
alibuda@...ux.alibaba.com, tonylu@...ux.alibaba.com,
dust.li@...ux.alibaba.com, linux-s390@...r.kernel.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM
extension and loopback-ism
On 17.10.23 05:49, Wen Gu wrote:
>
>
> On 2023/10/8 15:19, Wen Gu wrote:
>>
>>
>> On 2023/10/5 16:21, Niklas Schnelle wrote:
>>
>>>
>>> Hi Wen Gu,
>>>
>>> I've been trying out your series with iperf3, qperf, and uperf on
>>> s390x. I'm using network namespaces with a ConnectX VF from the same
>>> card in each namespace for the initial TCP/IP connection, i.e.
>>> initially it goes out to a real NIC even if that can switch
>>> internally. All of these look great for streaming workloads, both in
>>> terms of performance and stability. With a Connect-Request-Response
>>> workload and uperf, however, I've run into issues. The test
>>> configuration I use is as follows:
>>>
>>> Client Command:
>>>
>>> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
>>>
>>> Server Command:
>>>
>>> # ip netns exec server smc_run uperf -s &> /dev/null
>>>
>>> Uperf tcp_crr.xml:
>>>
>>> <?xml version="1.0"?>
>>> <profile name="TCP_CRR">
>>>   <group nthreads="12">
>>>     <transaction duration="120">
>>>       <flowop type="connect"
>>>               options="remotehost=$host protocol=tcp" />
>>>       <flowop type="write" options="size=200"/>
>>>       <flowop type="read" options="size=1000"/>
>>>       <flowop type="disconnect" />
>>>     </transaction>
>>>   </group>
>>> </profile>
>>>
>>> The workload first runs fine, but after about 4 GB of data has been
>>> transferred it fails with "Connection refused" and "Connection reset
>>> by peer" errors. The failure is not permanent, however: after
>>> restarting both the uperf server and client, the streaming workloads
>>> run fine again. So I suspect something gets stuck in either the
>>> client or the server sockets. The same workload runs fine with plain
>>> TCP/IP, of course.
>>>
>>> Thanks,
>>> Niklas
>>>
>>>
>>
>> Hi Niklas,
>>
>> Thank you very much for the test. With the test example you provided,
>> I've reproduced the issue in my VM. Moreover, the test sometimes
>> complains with 'Error saying goodbye with <ip>'.
>>
>> I'll figure out what's going on here.
>>
>> Thanks!
>> Wen Gu
>
> I think there is a common issue for both SMC-R and SMC-D. I can also
> reproduce 'connection reset by peer' and 'Error saying goodbye with <ip>'
> when using SMC-R under the same test conditions. They occur at the end of
> the test.
>
> When the uperf test time ends, some signals are sent. At this point there
> are usually still some SMC connections in the middle of the CLC handshake.
> I catch some -EINTR (-4) on the client side and -ECONNRESET (-104) on the
> server side returned from smc_clc_wait_msg (correspondingly, the handshake
> error counters also increase), and TCP RST packets are sent to terminate
> the CLC TCP connection (clcsock).
>
> I am not sure whether this should be considered by design or a bug in SMC.
> From the application's perspective, the connection reset behavior only
> happens when using SMC.
>
> @Wenjia, could you please take a look at this?
>
> Thanks,
> Wen Gu
Hi Wen,

Do you mean a bug in smc_clc_wait_msg()?
If yes, I cannot see any problem in smc_clc_wait_msg() itself. From your
description, it looks to me as if the server receives the CLC_PROPOSAL
message but finds nothing in it, while the client keeps waiting for the
CLC_ACCEPT message from the server until the wait loop is broken out.
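
To make the suspected sequence a bit more concrete, here is a minimal,
purely illustrative sketch (not the actual code in net/smc/smc_clc.c;
the function name and the simplifications are mine) of the receive
pattern you describe: the side blocked in the CLC wait is interrupted
by uperf's end-of-test signal and returns -EINTR, and aborting the
handshake then tears down the clcsock, so the peer's pending CLC wait
fails with -ECONNRESET.

#include <linux/net.h>
#include <linux/sched/signal.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* clc_wait_sketch(): illustrative only, not smc_clc_wait_msg() itself. */
static int clc_wait_sketch(struct socket *clcsock, void *buf, size_t buflen)
{
        struct kvec vec = { .iov_base = buf, .iov_len = buflen };
        struct msghdr msg = { .msg_flags = MSG_WAITALL };
        int rc;

        /* Blocks until a CLC message (e.g. CLC_ACCEPT) arrives, the peer
         * resets the connection, or a signal is delivered to the task.
         */
        rc = kernel_recvmsg(clcsock, &msg, &vec, 1, buflen, msg.msg_flags);
        if (rc < 0 && signal_pending(current))
                return -EINTR;  /* handshake aborted; releasing the clcsock
                                 * sends a RST, and the peer's CLC wait then
                                 * returns -ECONNRESET */
        return rc;
}

If that matches what you observe, the question is really whether aborting
an in-flight CLC handshake on a signal should surface to the application
as a connection reset, rather than a problem inside smc_clc_wait_msg()
itself.
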
Thanks,
Wenjia