[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a4d10f9e-ab8f-ffad-5eea-48316c436f41@intel.com>
Date: Wed, 7 Sep 2022 17:39:47 +0800
From: "Sun, Jiebin" <jiebin.sun@...el.com>
To: Tim Chen <tim.c.chen@...ux.intel.com>,
Shakeel Butt <shakeelb@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, vasily.averin@...ux.dev,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Christoph Lameter <cl@...ux.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Alexey Gladkov <legion@...nel.org>,
Manfred Spraul <manfred@...orfullife.com>,
alexander.mikhalitsyn@...tuozzo.com, Linux MM <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
"Chen, Tim C" <tim.c.chen@...el.com>,
Feng Tang <feng.tang@...el.com>,
Huang Ying <ying.huang@...el.com>, tianyou.li@...el.com,
wangyang.guo@...el.com, jiebin.sun@...el.com
Subject: Re: [PATCH] ipc/msg.c: mitigate the lock contention with percpu
counter
On 9/7/2022 2:44 AM, Tim Chen wrote:
> On Fri, 2022-09-02 at 09:27 -0700, Shakeel Butt wrote:
>> On Fri, Sep 2, 2022 at 12:04 AM Jiebin Sun <jiebin.sun@...el.com> wrote:
>>> The msg_bytes and msg_hdrs atomic counters are frequently
>>> updated when IPC msg queue is in heavy use, causing heavy
>>> cache bounce and overhead. Change them to percpu_counters
>>> greatly improve the performance. Since there is one unique
>>> ipc namespace, additional memory cost is minimal. Reading
>>> of the count done in msgctl call, which is infrequent. So
>>> the need to sum up the counts in each CPU is infrequent.
>>>
>>> Apply the patch and test the pts/stress-ng-1.4.0
>>> -- system v message passing (160 threads).
>>>
>>> Score gain: 3.38x
>>>
>>> CPU: ICX 8380 x 2 sockets
>>> Core number: 40 x 2 physical cores
>>> Benchmark: pts/stress-ng-1.4.0
>>> -- system v message passing (160 threads)
>>>
>>> Signed-off-by: Jiebin Sun <jiebin.sun@...el.com>
>> [...]
>>> +void percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
>>> +{
>>> + this_cpu_add(*fbc->counters, amount);
>>> +}
>>> +EXPORT_SYMBOL(percpu_counter_add_local);
>> Why not percpu_counter_add()? This may drift the fbc->count more than
>> batch*nr_cpus. I am assuming that is not the issue for you as you
>> always do an expensive sum in the slow path. As Andrew asked, this
>> should be a separate patch.
> In the IPC case, the read is always done with the accurate read using
> percpu_counter_sum() gathering all the counts and
> never with percpu_counter_read() that only read global count.
> So Jiebin was not worry about accuracy.
>
> However, the counter is s64 and the local per cpu counter is S32.
> So the counter size has shrunk if we only keep the count in local per
> cpu counter, which can overflow a lot sooner and is not okay.
>
> Jiebin, can you try to use percpu_counter_add_batch, but using a large
> batch size. That should achieve what you want without needing
> to create a percpu_counter_add_local() function, and also the overflow
> problem.
>
> Tim
>
I have sent out the patch v4 which use percpu_counter_add_batch. If we use
a tuned large batch size (1024), the performance gain is 3.17x (patch v4)
vs 3.38x (patch v3) previously in stress-ng -- message. It still has
significant performance improvement and also good balance between
performance gain and overflow issue.
Jiebin
>
Powered by blists - more mailing lists