Message-ID: <77174872-b823-3d29-1a9f-d0a9a19c3157@huawei.com>
Date: Thu, 1 Dec 2022 10:22:31 +0800
From: Yongqiang Liu <liuyongqiang13@...wei.com>
To: Yang Shi <shy828301@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
<aarcange@...hat.com>, <hughd@...gle.com>, <mgorman@...e.de>,
<mhocko@...e.cz>, <cl@...two.org>, <zokeefe@...gle.com>,
<rientjes@...gle.com>, Matthew Wilcox <willy@...radead.org>,
<peterx@...hat.com>,
"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@...wei.com>,
"zhangxiaoxu (A)" <zhangxiaoxu5@...wei.com>,
<kirill.shutemov@...ux.intel.com>, Lu Jialin <lujialin4@...wei.com>
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with
THP enabled
On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@...wei.com> wrote:
>>
>> On 2022/11/29 4:01, Yang Shi wrote:
>>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@...wei.com> wrote:
>>>> Hi,
>>>>
>>>> We use mm_counter to track how much physical memory a process uses. Meanwhile,
>>>> the page_counter of a memcg counts how much physical memory a cgroup
>>>> uses.
>>>> If a cgroup contains only one process, the two look almost the same. But with
>>>> THP enabled, memory.usage_in_bytes in the memcg is sometimes twice the Rss
>>>> in /proc/[pid]/smaps_rollup, or more, as follows:
>>>>
>>>> [root@...alhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>>> 1080930304
>>>> [root@...alhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>>> 1290
>>>> [root@...alhost sda]# cat /proc/1290/smaps_rollup
>>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
>>>> Rss: 500648 kB
>>>> Pss: 498337 kB
>>>> Shared_Clean: 2732 kB
>>>> Shared_Dirty: 0 kB
>>>> Private_Clean: 364 kB
>>>> Private_Dirty: 497552 kB
>>>> Referenced: 500648 kB
>>>> Anonymous: 492016 kB
>>>> LazyFree: 0 kB
>>>> AnonHugePages: 129024 kB
>>>> ShmemPmdMapped: 0 kB
>>>> Shared_Hugetlb: 0 kB
>>>> Private_Hugetlb: 0 kB
>>>> Swap: 0 kB
>>>> SwapPss: 0 kB
>>>> Locked: 0 kB
>>>> THPeligible: 0
>>>>
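To make the size of the gap concrete, here is a minimal arithmetic sketch using only the figures quoted above. The helper name parse_kb_fields is hypothetical and the sample text is a trimmed copy of the smaps_rollup output; nothing here reads a live system.

```python
# Figures copied from the report above; only three smaps_rollup fields are kept.
SMAPS_ROLLUP_SAMPLE = """\
Rss: 500648 kB
Anonymous: 492016 kB
AnonHugePages: 129024 kB
"""


def parse_kb_fields(text):
    """Return {field name: size in bytes} for lines shaped like 'Name: <n> kB'."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            fields[name.strip()] = int(parts[0]) * 1024
    return fields


usage_in_bytes = 1080930304  # memory.usage_in_bytes from the report
rss_bytes = parse_kb_fields(SMAPS_ROLLUP_SAMPLE)["Rss"]

gap_bytes = usage_in_bytes - rss_bytes   # charged to the memcg but not in Rss
ratio = usage_in_bytes / rss_bytes       # how many times larger the charge is

print(f"rss={rss_bytes} usage={usage_in_bytes} gap={gap_bytes} ratio={ratio:.2f}")
```

With these numbers the memcg charge is a little more than twice the process Rss, which matches the "twice or more" observation.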
>>>> I found that the difference arises because __split_huge_pmd decreases
>>>> the mm_counter, but the page_counter in the memcg is not decreased while the
>>>> refcount of the head page is nonzero. The call paths are as follows:
>>>>
>>>> do_madvise
>>>>   madvise_dontneed_free
>>>>     zap_page_range
>>>>       unmap_single_vma
>>>>         zap_pud_range
>>>>           zap_pmd_range
>>>>             __split_huge_pmd
>>>>               __split_huge_pmd_locked
>>>>                 __mod_lruvec_page_state
>>>>             zap_pte_range
>>>>               add_mm_rss_vec
>>>>                 add_mm_counter -> decrease the mm_counter
>>>>       tlb_finish_mmu
>>>>         arch_tlb_finish_mmu
>>>>           tlb_flush_mmu_free
>>>>             free_pages_and_swap_cache
>>>>               release_pages
>>>>                 folio_put_testzero(page) -> not zero, skip
>>>>                   continue;
>>>>                 __folio_put_large
>>>>                   free_transhuge_page
>>>>                     free_compound_page
>>>>                       mem_cgroup_uncharge
>>>>                         page_counter_uncharge -> decrease the page_counter
>>>>
>>>> The node_page_stat shown in meminfo was also decreased. __split_huge_pmd
>>>> seems to free no physical memory unless the whole THP is freed. I am
>>>> confused about which one reflects the true physical memory used by a
>>>> process.
>>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>>> is called on part of the mapping, the huge PMD is split, but the
>>> THP itself will not be split until memory pressure is hit (global
>>> or memcg limit). So the unmapped subpages are not actually freed
>>> until that point. The mm counter is decreased by the zapping,
>>> but the physical pages are not yet freed and uncharged from the
>>> memcg.
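The behaviour described above can be pictured with a toy model. This is an illustrative sketch only, not kernel code: the class ToyThp, its fields and its method names are all hypothetical, and 512 subpages per THP assumes 2 MiB huge pages with 4 KiB base pages.

```python
from dataclasses import dataclass

SUBPAGES_PER_THP = 512  # assumption: 2 MiB THP made of 4 KiB base pages


@dataclass
class ToyThp:
    """Toy model of one anonymous THP: per-process RSS vs. memcg charge."""
    mapped: int = SUBPAGES_PER_THP    # subpages still mapped (mm counter view)
    charged: int = SUBPAGES_PER_THP   # subpages still charged (page_counter view)
    split: bool = False               # has the compound page itself been split?

    def madvise_dontneed(self, nr_pages):
        """Zap nr_pages subpages: RSS drops at once, the charge may lag."""
        nr_pages = min(nr_pages, self.mapped)
        self.mapped -= nr_pages
        if self.split or self.mapped == 0:
            # Pages are independent, or the last reference to the whole THP
            # is gone, so the unmapped memory is freed and uncharged.
            self.charged = self.mapped
        # Otherwise the partially mapped THP only waits for a deferred split.

    def reclaim_pressure(self):
        """Model the deferred split running under memory pressure."""
        if not self.split and self.mapped < SUBPAGES_PER_THP:
            self.split = True
            self.charged = self.mapped


thp = ToyThp()
thp.madvise_dontneed(256)
after_zap = (thp.mapped, thp.charged)
thp.reclaim_pressure()
after_pressure = (thp.mapped, thp.charged)
print(after_zap, after_pressure)
```

In this model, zapping half of the THP halves the mapped count immediately while the charge stays at the full 512 subpages; only the modelled pressure event brings the two back in line.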
>> Thanks!
>>
>> I don't know how much memory a real workload will use, so I measured
>> max_usage_in_bytes of the memcg with THP disabled, then set limit_in_bytes
>> slightly higher with THP enabled, which triggered an OOM (it actually used
>> about 100M more with THP enabled). Another test case, for which I know the
>> amount of memory it uses, did not trigger an OOM with a suitable memcg
>> limit, and I saw the THP split when memory usage hit the limit.
>>
>>
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean "workingset" used by some 3rd party k8s monitoring tools?
> I recall that depends on what monitoring tools you use, for example,
> some monitoring use active_anon + active_file.
Yes. I notice that k8s uses a parent pod that sets a memcg limit covering all
child pods, and the workingset monitor watches the root memcg.
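As a rough illustration of the active_anon + active_file style of estimate mentioned above, here is a small sketch that parses a cgroup v1 memory.stat excerpt. The sample values are made up, and the helper names parse_memory_stat and active_working_set are hypothetical; real monitoring tools may compute their working set differently.

```python
# Hypothetical memory.stat excerpt (cgroup v1 layout); the values are made up.
MEMORY_STAT_SAMPLE = """\
cache 104857600
rss 524288000
active_anon 419430400
inactive_anon 104857600
active_file 62914560
inactive_file 41943040
"""


def parse_memory_stat(text):
    """Return {key: value in bytes} for lines shaped like '<key> <number>'."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        value = value.strip()
        if value.isdigit():
            stats[key] = int(value)
    return stats


def active_working_set(stats):
    """One possible working-set estimate: active_anon + active_file."""
    return stats["active_anon"] + stats["active_file"]


stats = parse_memory_stat(MEMORY_STAT_SAMPLE)
working_set = active_working_set(stats)
print(f"working set estimate: {working_set} bytes")
```

Any estimate built from memcg counters like this still sees what is charged to the memcg, so it can differ from the per-process Rss for the reason discussed in this thread.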
>
>> the memory workload, but the anon_thp on the deferred list, still charged
>> to the memcg, will make it look higher than it actually is. And it seems the
> Yes, but the deferred split shrinker should handle this quite gracefully.
>
>> container will be killed without oom...
> If you have some userspace daemons which monitor the memory usage by
> rss, and try to behave smarter to kill the container by looking at rss
> solely, you may kill the container prematurely.
Thanks.
>
>> Is it suitable to add meminfo of a deferred split list of THP?
> We could, but I can't think of how it would be used to improve the
> use case. Any more thoughts?
In the current k8s scenario, I think the container will not be killed if the
parent pod's memcg limit is set correctly.
Maybe meminfo together with a split interface would help users release
memory in advance.
>>>> Kind regards,
>>>>
>>>> Yongqiang Liu
>>>>
>>>>