linux-kernel - Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <16af80a2-30a0-06da-9fd1-8d5101398fb9@redhat.com>
Date:   Mon, 12 Apr 2021 15:30:04 -0400
From:   Waiman Long <llong@...hat.com>
To:     Roman Gushchin <guro@...com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
        Pekka Enberg <penberg@...nel.org>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org,
        cgroups@...r.kernel.org, linux-mm@...ck.org,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        Alex Shi <alex.shi@...ux.alibaba.com>,
        Chris Down <chris@...isdown.name>,
        Yafang Shao <laoar.shao@...il.com>,
        Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        Wei Yang <richard.weiyang@...il.com>,
        Masayoshi Mizuma <msys.mizuma@...il.com>
Subject: Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

On 4/12/21 2:22 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per object byte charging,
>> charging and vmstat data update happen only when new slab pages are
>> allocated or freed. Now they are done with every kmem_cache_alloc()
>> and kmem_cache_free(). This causes additional overhead for workloads
>> that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache byte charge for a specific
>> obj_cgroup to reduce that overhead. To further reducing it, this patch
>> makes the vmstat data cached in the memcg_stock_pcp structure as well
>> until it accumulates a page size worth of update or when other cached
>> data change.
> The idea makes total sense to me and also gives a hope to remove
> byte-sized vmstats in the long-term.
>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this
>> patch applied, it was found that about 17% (946796 out of 5515184) of the
>> time when __mod_obj_stock_state() is called leads to an actual call to
>> mod_objcg_state() after initial boot. When doing parallel kernel build,
>> the figure was about 16% (21894614 out of 139780628). So caching the
>> vmstat data reduces the number of calls to mod_objcg_state() by more
>> than 80%.
>>
>> Signed-off-by: Waiman Long <longman@...hat.com>
>> ---
>>   mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
>>   mm/slab.h       | 26 +++++++----------
>>   2 files changed, 79 insertions(+), 25 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b19100c68aa0..539c3b632e47 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
>>   
>>   #ifdef CONFIG_MEMCG_KMEM
>>   	struct obj_cgroup *cached_objcg;
>> +	struct pglist_data *cached_pgdat;
>>   	unsigned int nr_bytes;
>> +	int vmstat_idx;
>> +	int vmstat_bytes;
>>   #endif
> Because vmstat_idx can realistically take only 3 values (slab_reclaimable,
> slab_unreclaimable and percpu), I wonder if it's better to have
> vmstat_bytes[3] and save a bit more on the reduced number of flushes?
> It must be an often case when a complex (reclaimable) kernel object has
> non-reclaimable parts (e.g. kmallocs) or percpu counters.
> If the difference will be too small, maybe the current form is better.

I have thought about that too. However, that will make the code more 
complex. So I decided to cache just one for now. We can certainly play 
around with caching more in a later patch.

Cheers,
Longman