linux-kernel - Re: [QUESTION] memcg page_counter seems broken in MADV

Message-ID: <77174872-b823-3d29-1a9f-d0a9a19c3157@huawei.com>
Date:   Thu, 1 Dec 2022 10:22:31 +0800
From:   Yongqiang Liu <liuyongqiang13@...wei.com>
To:     Yang Shi <shy828301@...il.com>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        <aarcange@...hat.com>, <hughd@...gle.com>, <mgorman@...e.de>,
        <mhocko@...e.cz>, <cl@...two.org>, <zokeefe@...gle.com>,
        <rientjes@...gle.com>, Matthew Wilcox <willy@...radead.org>,
        <peterx@...hat.com>,
        "Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@...wei.com>,
        "zhangxiaoxu (A)" <zhangxiaoxu5@...wei.com>,
        <kirill.shutemov@...ux.intel.com>, Lu Jialin <lujialin4@...wei.com>
Subject: Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with
 THP enabled


On 2022/11/30 1:23, Yang Shi wrote:
> On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@...wei.com> wrote:
>>
>> On 2022/11/29 4:01, Yang Shi wrote:
>>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@...wei.com> wrote:
>>>> Hi,
>>>>
>>>> We use mm_counter to track how much physical memory a process uses.
>>>> Meanwhile, the page_counter of a memcg counts how much physical memory
>>>> a cgroup uses.
>>>> If a cgroup contains only one process, the two look almost the same.
>>>> But with THP enabled, memory.usage_in_bytes in the memcg is sometimes
>>>> twice the Rss in /proc/[pid]/smaps_rollup, or more, as follows:
>>>>
>>>> [root@...alhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>>> 1080930304
>>>> [root@...alhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>>> 1290
>>>> [root@...alhost sda]# cat /proc/1290/smaps_rollup
>>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0
>>>> [rollup]
>>>> Rss:              500648 kB
>>>> Pss:              498337 kB
>>>> Shared_Clean:       2732 kB
>>>> Shared_Dirty:          0 kB
>>>> Private_Clean:       364 kB
>>>> Private_Dirty:    497552 kB
>>>> Referenced:       500648 kB
>>>> Anonymous:        492016 kB
>>>> LazyFree:              0 kB
>>>> AnonHugePages:    129024 kB
>>>> ShmemPmdMapped:        0 kB
>>>> Shared_Hugetlb:        0 kB
>>>> Private_Hugetlb:       0 kB
>>>> Swap:                  0 kB
>>>> SwapPss:               0 kB
>>>> Locked:                0 kB
>>>> THPeligible:    0
>>>>
>>>> I found that the difference arises because __split_huge_pmd decreases
>>>> the mm_counter, but the page_counter in the memcg is not decreased while
>>>> the refcount of the head page is non-zero. The call path is as follows:
>>>>
>>>> do_madvise
>>>>      madvise_dontneed_free
>>>>        zap_page_range
>>>>          unmap_single_vma
>>>>            zap_pud_range
>>>>              zap_pmd_range
>>>>                __split_huge_pmd
>>>>                  __split_huge_pmd_locked
>>>>                    __mod_lruvec_page_state
>>>>                zap_pte_range
>>>>                   add_mm_rss_vec
>>>>                      add_mm_counter                    -> decrease the mm_counter
>>>>          tlb_finish_mmu
>>>>            arch_tlb_finish_mmu
>>>>              tlb_flush_mmu_free
>>>>                free_pages_and_swap_cache
>>>>                  release_pages
>>>>                    folio_put_testzero(page)            -> not zero, skip
>>>>                      continue;
>>>>                    __folio_put_large
>>>>                      free_transhuge_page
>>>>                        free_compound_page
>>>>                          mem_cgroup_uncharge
>>>>                            page_counter_uncharge        -> decrease the page_counter
>>>>
>>>> node_page_stat, which is shown in meminfo, was also decreased.
>>>> __split_huge_pmd seems to free no physical memory unless the whole THP
>>>> is freed. I am confused about which one is the true physical memory
>>>> usage of a process.
>>> This should be caused by the deferred split of THP. When MADV_DONTNEED
>>> is called on part of the mapping, the huge PMD is split, but the
>>> THP itself will not be split until memory pressure is hit (global
>>> or memcg limit). So the unmapped subpages are not actually freed
>>> until that point. The mm counter is therefore decreased by the zapping,
>>> but the physical pages are not yet freed and uncharged from the
>>> memcg.
>> Thanks!
>>
>> I don't know how much memory a real workload will cost, so I just
>> measured the max_usage_in_bytes of the memcg with THP disabled and added
>> a little more for the limit_in_bytes of the memcg with THP enabled, which
>> triggered an OOM (it actually cost 100M more with THP enabled). Another
>> test case, for which I know the amount of memory it will cost, doesn't
>> trigger an OOM with a suitable memcg limit, and I see the THP split when
>> the memory usage hits the limit.
>>
>>
>> I have another concern: k8s usually uses (rss - files) to estimate
> Do you mean the "workingset" used by some 3rd party k8s monitoring tools?
> I recall that depends on which monitoring tools you use; for example,
> some monitoring tools use active_anon + active_file.

Yes, I noticed that k8s uses a parent pod that sets a memcg limit covering
all child pods, and the workingset monitor watches the root memcg.
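
To make the two estimates discussed here concrete, below is a minimal
sketch of how they might be computed from the contents of a cgroup v1
memory.stat file. The function names are hypothetical, the sample numbers
are made up, and reading "(rss - files)" as usage minus the file LRU pages
is an assumption, not something any particular monitoring tool is confirmed
to do.

```python
def parse_memory_stat(text):
    """Parse memory.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def active_estimate(stats):
    """Estimate based on the active LRU lists: active_anon + active_file."""
    return stats.get("active_anon", 0) + stats.get("active_file", 0)


def usage_minus_file(usage_bytes, stats):
    """Assumed reading of "(rss - files)": usage minus file LRU pages."""
    file_bytes = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    return usage_bytes - file_bytes


# Made-up sample in the memory.stat format, for illustration only.
sample = (
    "cache 1000\n"
    "rss 5000\n"
    "inactive_anon 2000\n"
    "active_anon 3000\n"
    "inactive_file 400\n"
    "active_file 600\n"
)
stats = parse_memory_stat(sample)
print(active_estimate(stats), usage_minus_file(6000, stats))
```

Neither estimate knows about subpages that were zapped but are still
charged through a THP on the deferred split list, so both can stay high
after MADV_DONTNEED.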

>
>> the memory workload, but the anon_thp on the deferred list that is still
>> charged to the memcg will make it look higher than it actually is. And it
>> seems the
> Yes, but the deferred split shrinker should handle this quite gracefully.
>
>> container will be killed without oom...
> If you have some userspace daemons which monitor the memory usage by
> rss, and try to behave smarter to kill the container by looking at rss
> solely, you may kill the container prematurely.
Thanks.
>
>> Is it suitable to add meminfo of a deferred split list of THP?
> We could, but I can't think of how it would be used to improve the
> use case. Any more thoughts?

In the current k8s scenario, I think the container will not be killed as
long as the parent pod's memcg limit is set correctly.

Maybe a meminfo entry together with a split interface would help users
release memory in advance.
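
As a rough illustration of the gap discussed in this thread, the sketch
below subtracts the Rss reported in smaps_rollup from
memory.usage_in_bytes. The helper names are hypothetical. The result is
only a rough indication, not an exact measure of what is held by partially
unmapped THPs, since memcg usage also includes page cache and other
charged memory that Rss does not cover.

```python
def parse_rss_kb(smaps_rollup_text):
    """Return the Rss value (in kB) from the text of /proc/<pid>/smaps_rollup."""
    for line in smaps_rollup_text.splitlines():
        if line.startswith("Rss:"):
            return int(line.split()[1])
    raise ValueError("no Rss line found")


def usage_rss_gap_bytes(usage_in_bytes, smaps_rollup_text):
    """Bytes charged to the memcg beyond the process Rss."""
    return usage_in_bytes - parse_rss_kb(smaps_rollup_text) * 1024


# Values taken from the original report quoted in this thread.
rollup = "Rss:              500648 kB\nPss:              498337 kB\n"
gap = usage_rss_gap_bytes(1080930304, rollup)
print(gap, "bytes charged beyond Rss")
```

In practice the two inputs would be read from the memcg's
memory.usage_in_bytes file and from /proc/<pid>/smaps_rollup for the
process in that cgroup.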

>>>> Kind regards,
>>>>
>>>> Yongqiang Liu
>>>>
>>>>
>>> .
> .