netdev - Re: [PATCH net-next] net/smc: Optimize the search method of reused buf


Message-ID: <fa7dc8fc-fc6a-5ee1-94a2-b4ad62624834@huawei.com>
Date: Sat, 2 Nov 2024 14:43:52 +0800
From: Li Qiang <liqiang64@...wei.com>
To: <dust.li@...ux.alibaba.com>, <wenjia@...ux.ibm.com>, <jaka@...ux.ibm.com>,
	<alibuda@...ux.alibaba.com>, <tonylu@...ux.alibaba.com>,
	<guwen@...ux.alibaba.com>
CC: <linux-s390@...r.kernel.org>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <luanjianhai@...wei.com>,
	<zhangxuzhou4@...wei.com>, <dengguangxing@...wei.com>,
	<gaochao24@...wei.com>, <kuba@...nel.org>
Subject: Re: [PATCH net-next] net/smc: Optimize the search method of reused
 buf_desc



On 2024/11/1 18:52, Dust Li wrote:
> On 2024-11-01 16:23:42, liqiang wrote:
>> connections based on redis-benchmark (test in smc loopback-ism mode):
> 
> I think you can run a wrk/nginx test with short-lived connections.
> For example:
> 
> ```
> # client
> wrk -H "Connection: close" http://$serverIp
> 
> # server
> nginx
> ```

I tested with nginx. The test commands are:
# server
smc_run nginx

# client
smc_run wrk -t <2,4,8,16,32,64> -c 200 -H "Connection: close" http://127.0.0.1

Requests/sec
--------+---------------+---------------+
req/s	| without patch	| apply patch	|
--------+---------------+---------------+
-t 2	|6924.18	|7456.54	|
--------+---------------+---------------+
-t 4	|8731.68	|9660.33	|
--------+---------------+---------------+
-t 8	|11363.22	|13802.08	|
--------+---------------+---------------+
-t 16	|12040.12	|18666.69	|
--------+---------------+---------------+
-t 32	|11460.82	|17017.28	|
--------+---------------+---------------+
-t 64	|11018.65	|14974.80	|
--------+---------------+---------------+

Transfer/sec
--------+---------------+---------------+
trans/s	| without patch	| apply patch	|
--------+---------------+---------------+
-t 2	|24.72MB	|26.62MB	|
--------+---------------+---------------+
-t 4	|31.18MB	|34.49MB	|
--------+---------------+---------------+
-t 8	|40.57MB	|49.28MB	|
--------+---------------+---------------+
-t 16	|42.99MB	|66.65MB	|
--------+---------------+---------------+
-t 32	|40.92MB	|60.76MB	|
--------+---------------+---------------+
-t 64	|39.34MB	|53.47MB	|
--------+---------------+---------------+
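
To make the tables easier to read, the relative gain can be computed from the
Requests/sec figures. The snippet below is only a small sketch: the helper
names (gain_percent, print_gains) are made up for illustration, and the numbers
are copied from the table above. The gain is largest at -t 16, at about 55%.

```c
#include <stdio.h>

/* Relative gain of `after` over `before`, in percent. */
static double gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Requests/sec figures copied from the table above. */
static const int threads[] = { 2, 4, 8, 16, 32, 64 };
static const double req_before[] = {
	6924.18, 8731.68, 11363.22, 12040.12, 11460.82, 11018.65
};
static const double req_after[] = {
	7456.54, 9660.33, 13802.08, 18666.69, 17017.28, 14974.80
};

/* Print one line per wrk thread count with the gain from the patch. */
static void print_gains(void)
{
	for (size_t i = 0; i < sizeof(threads) / sizeof(threads[0]); i++)
		printf("-t %-2d  %+.1f%%\n", threads[i],
		       gain_percent(req_before[i], req_after[i]));
}
```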

> 
>>
>>    1. On the current version:
>>        [x.832733] smc_buf_get_slot cost:602 ns, walk 10 buf_descs
>>        [x.832860] smc_buf_get_slot cost:329 ns, walk 12 buf_descs
>>        [x.832999] smc_buf_get_slot cost:479 ns, walk 17 buf_descs
>>        [x.833157] smc_buf_get_slot cost:679 ns, walk 13 buf_descs
>>        ...
>>        [x.045240] smc_buf_get_slot cost:5528 ns, walk 196 buf_descs
>>        [x.045389] smc_buf_get_slot cost:4721 ns, walk 197 buf_descs
>>        [x.045537] smc_buf_get_slot cost:4075 ns, walk 198 buf_descs
>>        [x.046010] smc_buf_get_slot cost:6476 ns, walk 199 buf_descs
>>
>>    2. Apply this patch:
>>        [x.180857] smc_buf_get_slot_free cost:75 ns
>>        [x.181001] smc_buf_get_slot_free cost:147 ns
>>        [x.181128] smc_buf_get_slot_free cost:97 ns
>>        [x.181282] smc_buf_get_slot_free cost:132 ns
>>        [x.181451] smc_buf_get_slot_free cost:74 ns
>>
>> It can be seen from the data that it takes about 5~6us to traverse 200 
> 
> Based on your data, I'm afraid the short-lived connection
> test won't show much benefit, since the time to complete an
> SMC-R connection should be several orders of magnitude larger
> than 100ns.

Sorry, I didn't explain my test data well before.

The main optimized functions of this patch are as follows:

```
struct smc_buf_desc *smc_buf_get_slot(...)
{
        struct smc_buf_desc *buf_slot;

        down_read(lock);
        list_for_each_entry(buf_slot, buf_list, list) {
                if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
                        up_read(lock);
                        return buf_slot;
                }
        }
        up_read(lock);
        return NULL;
}
```
The data above is the time spent in this function.
If the system currently has 200 active links, then while
establishing a new SMC connection this function
must traverse all 200 of them, which takes 5~6us.
With 1,000 active links it takes about 30us.
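
The linear cost can be reproduced in userspace. The following is only a rough
sketch, not the kernel code: struct scan_desc, get_slot_scan() and
demo_scan_walk() are hypothetical names, C11 atomics stand in for cmpxchg(),
and the rwsem is left out. It shows that every in-use descriptor ahead of the
first free one has to be visited.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct smc_buf_desc: only what the scan needs. */
struct scan_desc {
	struct scan_desc *next;
	atomic_int used;	/* 0 = free, 1 = in use */
};

/*
 * Walk the list and claim the first free descriptor with a
 * compare-and-swap. *walked reports how many descriptors were visited.
 */
static struct scan_desc *get_slot_scan(struct scan_desc *head, size_t *walked)
{
	size_t n = 0;

	for (struct scan_desc *b = head; b; b = b->next) {
		int expected = 0;

		n++;
		if (atomic_compare_exchange_strong(&b->used, &expected, 1)) {
			*walked = n;
			return b;
		}
	}
	*walked = n;
	return NULL;
}

/* Usage: 200 descriptors in use, followed by a single free one. */
static size_t demo_scan_walk(void)
{
	static struct scan_desc d[201];
	size_t walked = 0;

	for (int i = 0; i < 201; i++) {
		d[i].next = (i < 200) ? &d[i + 1] : NULL;
		atomic_init(&d[i].used, (i < 200) ? 1 : 0);
	}
	get_slot_scan(&d[0], &walked);
	return walked;	/* every descriptor up to the free one is visited */
}
```

In this setup the walk count grows with the number of in-use descriptors,
which matches the "walk N buf_descs" lines in the log above.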

After optimization, this function takes on the order of 100ns,
regardless of the number of active links.

Moreover, the lock has been removed, which is friendly to multi-threaded
parallel scenarios.

The optimized code is as follows:

```
static struct smc_buf_desc *smc_buf_get_slot_free(struct llist_head *buf_llist)
{
        struct smc_buf_desc *buf_free;
        struct llist_node *llnode;

        if (llist_empty(buf_llist))
                return NULL;
        // a lock-less linked list does not need a lock
        llnode = llist_del_first(buf_llist);
        buf_free = llist_entry(llnode, struct smc_buf_desc, llist);
        WRITE_ONCE(buf_free->used, 1);
        return buf_free;
}
```
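
For readers without a kernel tree at hand, the idea of a free list with
constant-time removal can be sketched in userspace with C11 atomics. This is
an illustration under stated assumptions, not the patch itself: struct
free_node, struct free_list, struct free_desc and the three functions are
hypothetical names, and the pop side assumes a single consumer at a time.

```c
#include <stdatomic.h>
#include <stddef.h>

struct free_node {
	struct free_node *next;
};

struct free_list {
	_Atomic(struct free_node *) first;
};

/* Hypothetical descriptor with the list node embedded in it. */
struct free_desc {
	struct free_node node;
	atomic_int used;	/* 0 = free, 1 = in use */
};

/* Return a descriptor to the free list; safe with concurrent pushers. */
static void free_list_push(struct free_list *l, struct free_desc *b)
{
	struct free_node *old = atomic_load(&l->first);

	atomic_store(&b->used, 0);
	do {
		b->node.next = old;
	} while (!atomic_compare_exchange_weak(&l->first, &old, &b->node));
}

/*
 * Take the first free descriptor in constant time.
 * Assumption: only one thread pops at a time; concurrent poppers
 * would need extra protection against the ABA problem.
 */
static struct free_desc *free_list_pop(struct free_list *l)
{
	struct free_node *old = atomic_load(&l->first);
	struct free_desc *b;

	while (old && !atomic_compare_exchange_weak(&l->first, &old, old->next))
		;
	if (!old)
		return NULL;
	b = (struct free_desc *)((char *)old - offsetof(struct free_desc, node));
	atomic_store(&b->used, 1);
	return b;
}

/* Usage: push three descriptors, then pop until the list is empty. */
static int demo_free_list(void)
{
	static struct free_desc d[3];
	struct free_list l;
	int popped = 0;

	atomic_init(&l.first, NULL);
	for (int i = 0; i < 3; i++) {
		atomic_init(&d[i].used, 1);
		free_list_push(&l, &d[i]);
	}
	while (free_list_pop(&l))
		popped++;
	return popped;
}
```

The pop cost here does not depend on how many descriptors are in use, since
only the head of the free list is touched. The comments in the kernel's
include/linux/llist.h describe the cases in which llist_del_first() still
needs external serialization.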

-- 
Cheers,
Li Qiang