Message-ID: <20241104081304.GB54400@linux.alibaba.com>
Date: Mon, 4 Nov 2024 16:13:04 +0800
From: Dust Li <dust.li@...ux.alibaba.com>
To: Li Qiang <liqiang64@...wei.com>, wenjia@...ux.ibm.com,
jaka@...ux.ibm.com, alibuda@...ux.alibaba.com,
tonylu@...ux.alibaba.com, guwen@...ux.alibaba.com
Cc: linux-s390@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, luanjianhai@...wei.com,
zhangxuzhou4@...wei.com, dengguangxing@...wei.com,
gaochao24@...wei.com, kuba@...nel.org
Subject: Re: [PATCH net-next] net/smc: Optimize the search method of reused
buf_desc
On 2024-11-02 14:43:52, Li Qiang wrote:
>
>
>On 2024/11/1 18:52, Dust Li wrote:
>> On 2024-11-01 16:23:42, liqiang wrote:
>>> connections based on redis-benchmark (test in smc loopback-ism mode):
>>
>> I think you can run a wrk/nginx test with short-lived connections.
>> For example:
>>
>> ```
>> # client
>> wrk -H "Connection: close" http://$serverIp
>>
>> # server
>> nginx
>> ```
>
>I tested with nginx; the test commands are:
># server
>smc_run nginx
>
># client
>smc_run wrk -t <2,4,8,16,32,64> -c 200 -H "Connection: close" http://127.0.0.1
>
>Requests/sec
>--------+---------------+---------------+
>req/s | without patch | apply patch |
>--------+---------------+---------------+
>-t 2 |6924.18 |7456.54 |
>--------+---------------+---------------+
>-t 4 |8731.68 |9660.33 |
>--------+---------------+---------------+
>-t 8 |11363.22 |13802.08 |
>--------+---------------+---------------+
>-t 16 |12040.12 |18666.69 |
>--------+---------------+---------------+
>-t 32 |11460.82 |17017.28 |
>--------+---------------+---------------+
>-t 64 |11018.65 |14974.80 |
>--------+---------------+---------------+
>
>Transfer/sec
>--------+---------------+---------------+
>trans/s | without patch | apply patch |
>--------+---------------+---------------+
>-t 2 |24.72MB |26.62MB |
>--------+---------------+---------------+
>-t 4 |31.18MB |34.49MB |
>--------+---------------+---------------+
>-t 8 |40.57MB |49.28MB |
>--------+---------------+---------------+
>-t 16 |42.99MB |66.65MB |
>--------+---------------+---------------+
>-t 32 |40.92MB |60.76MB |
>--------+---------------+---------------+
>-t 64 |39.34MB |53.47MB |
>--------+---------------+---------------+
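
To make the Requests/sec numbers easier to compare, the relative gain per thread count can be computed directly from the table. The following is only a small sketch; the helper names (gain_percent, print_gains) are made up for illustration and the figures are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Requests/sec figures copied from the table above. */
static const int threads[] = { 2, 4, 8, 16, 32, 64 };
static const double without_patch[] = {
	6924.18, 8731.68, 11363.22, 12040.12, 11460.82, 11018.65
};
static const double with_patch[] = {
	7456.54, 9660.33, 13802.08, 18666.69, 17017.28, 14974.80
};

/* Relative gain of 'after' over 'before', in percent. */
static double gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print one line per wrk thread count. */
static void print_gains(void)
{
	size_t n = sizeof(threads) / sizeof(threads[0]);

	for (size_t i = 0; i < n; i++)
		printf("-t %-2d  %+.1f%%\n", threads[i],
		       gain_percent(without_patch[i], with_patch[i]));
}
```

Calling print_gains() from a main() shows the gain growing from roughly 8% at -t 2 to roughly 55% at -t 16, then dropping again at -t 32 and -t 64.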
>
>>
>>>
>>> 1. On the current version:
>>> [x.832733] smc_buf_get_slot cost:602 ns, walk 10 buf_descs
>>> [x.832860] smc_buf_get_slot cost:329 ns, walk 12 buf_descs
>>> [x.832999] smc_buf_get_slot cost:479 ns, walk 17 buf_descs
>>> [x.833157] smc_buf_get_slot cost:679 ns, walk 13 buf_descs
>>> ...
>>> [x.045240] smc_buf_get_slot cost:5528 ns, walk 196 buf_descs
>>> [x.045389] smc_buf_get_slot cost:4721 ns, walk 197 buf_descs
>>> [x.045537] smc_buf_get_slot cost:4075 ns, walk 198 buf_descs
>>> [x.046010] smc_buf_get_slot cost:6476 ns, walk 199 buf_descs
>>>
>>> 2. Apply this patch:
>>> [x.180857] smc_buf_get_slot_free cost:75 ns
>>> [x.181001] smc_buf_get_slot_free cost:147 ns
>>> [x.181128] smc_buf_get_slot_free cost:97 ns
>>> [x.181282] smc_buf_get_slot_free cost:132 ns
>>> [x.181451] smc_buf_get_slot_free cost:74 ns
>>>
>>> It can be seen from the data that it takes about 5~6us to traverse 200
>>
>> Based on your data, I'm afraid the short-lived connection
>> test won't show much benefit, since the time to complete an
>> SMC-R connection should be several orders of magnitude larger
>> than 100ns.
>
>Sorry, I didn't explain my test data well before.
>
>The main function optimized by this patch is as follows:
>
>```
>struct smc_buf_desc *smc_buf_get_slot(...)
>{
>	struct smc_buf_desc *buf_slot;
>	down_read(lock);
>	list_for_each_entry(buf_slot, buf_list, list) {
>		if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
>			up_read(lock);
>			return buf_slot;
>		}
>	}
>	up_read(lock);
>	return NULL;
>}
>```
>The data above is the time spent in this function.
>If the current system has 200 active links, then while
>establishing a new SMC connection, this function
>must traverse all 200 active links, which takes 5~6us.
>If there are already 1,000 active links, it takes about 30us.
>
>After optimization, this function takes <100ns, independent
>of the number of active links.
>
>Moreover, the lock has been removed, which is friendly to multi-threaded
>parallel scenarios.
>
>The optimized code is as follows:
>
>```
>static struct smc_buf_desc *smc_buf_get_slot_free(struct llist_head *buf_llist)
>{
>	struct smc_buf_desc *buf_free;
>	struct llist_node *llnode;
>
>	if (llist_empty(buf_llist))
>		return NULL;
>	// lock-less link list don't need an lock
	^^^ The kernel uses /* */ for comments.
>	llnode = llist_del_first(buf_llist);
>	buf_free = llist_entry(llnode, struct smc_buf_desc, llist);
If two CPUs both pass the llist_empty() check, only one of them can get
llnode; shouldn't the other one get NULL?
>	WRITE_ONCE(buf_free->used, 1);
>	return buf_free;
>}
>```
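
To illustrate the concern about the llist_empty() check, here is a minimal userspace sketch, not kernel code. It uses C11 atomics as a stand-in for llist, and every name in it (struct node, struct free_list, free_list_add, free_list_del_first, buf_get_slot_free) is hypothetical. The point is only that emptiness is decided from the result of the pop itself, so a caller that loses the race for the last node returns NULL instead of dereferencing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's llist or SMC types. */
struct node {
	struct node *next;
};

struct buf_desc {
	int used;
	struct node llist;	/* link in the free list */
};

struct free_list {
	_Atomic(struct node *) first;
};

/* Push a node onto the head of the free list. */
static void free_list_add(struct free_list *fl, struct node *n)
{
	struct node *old = atomic_load(&fl->first);

	assert(n != NULL);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&fl->first, &old, n));
}

/*
 * Pop the first node, or return NULL if the list is empty.
 * Simplified for illustration: with several concurrent consumers a
 * naive pop like this is exposed to the ABA problem. The kernel's
 * llist documentation discusses when llist_del_first() callers
 * need a lock.
 */
static struct node *free_list_del_first(struct free_list *fl)
{
	struct node *old = atomic_load(&fl->first);

	while (old != NULL &&
	       !atomic_compare_exchange_weak(&fl->first, &old, old->next))
		;	/* a failed CAS refreshed 'old'; retry */
	return old;
}

/* Take a free buf_desc; no separate "is it empty?" check beforehand. */
static struct buf_desc *buf_get_slot_free(struct free_list *fl)
{
	struct node *n = free_list_del_first(fl);
	struct buf_desc *buf;

	if (n == NULL)	/* empty, or another caller took the last node */
		return NULL;

	buf = (struct buf_desc *)((char *)n - offsetof(struct buf_desc, llist));
	buf->used = 1;
	return buf;
}
```

In the patch itself, the equivalent change would presumably be to check llnode for NULL after llist_del_first() rather than relying on the earlier llist_empty() test.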
>
>--
>Cheers,
>Li Qiang