Date:   Thu, 30 May 2019 11:22:21 +0800
From:   Yang Shi <yang.shi@...ux.alibaba.com>
To:     David Rientjes <rientjes@...gle.com>
Cc:     ktkhai@...tuozzo.com, hannes@...xchg.org, mhocko@...e.com,
        kirill.shutemov@...ux.intel.com, hughd@...gle.com,
        shakeelb@...gle.com, Andrew Morton <akpm@...ux-foundation.org>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] Make deferred split shrinker memcg aware



On 5/30/19 5:07 AM, David Rientjes wrote:
> On Wed, 29 May 2019, Yang Shi wrote:
>
>>> Right, we've also encountered this.  I talked to Kirill about it a week or
>>> so ago, and the suggestion was to split all compound pages on the
>>> deferred split queues in the presence of any memory pressure.
>>>
>>> That breaks cgroup isolation and perhaps unfairly penalizes workloads that
>>> are running attached to other memcg hierarchies that are not under
>>> pressure because their compound pages are now split as a side effect.
>>> There is a benefit to keeping these compound pages around while not under
>>> memory pressure if all pages are subsequently mapped again.
>> Yes, I do agree. I tried other approaches too; it sounds like making the
>> deferred split queue per-memcg is the optimal one.
>>
> The approach we went with was to track the actual counts of compound
> pages on the deferred split queue for each pgdat for each memcg, and then
> invoke the shrinker for memcg reclaim and iterate those not charged to the
> hierarchy under reclaim.  That's suboptimal and was a stopgap measure
> under time pressure: it's refreshing to see the optimal method being
> pursued, thanks!

We did exactly the same thing as a temporary hotfix.
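
For the archives, the shrinker side of the per-memcg approach boils down to 
something like the below. This is only a sketch against the ~v5.1 shrinker 
API: the count/scan bodies are placeholders, not the real mm/huge_memory.c 
implementation.

/* Hypothetical sketch: a memcg-aware deferred split shrinker.  With
 * SHRINKER_MEMCG_AWARE set, vmscan passes the memcg under reclaim in
 * sc->memcg, so splitting can be confined to that hierarchy. */
#include <linux/shrinker.h>
#include <linux/memcontrol.h>

static unsigned long deferred_split_count(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        /* Placeholder: return the deferred split queue length for
         * sc->memcg (and for sc->nid with SHRINKER_NUMA_AWARE). */
        return 0;
}

static unsigned long deferred_split_scan(struct shrinker *shrink,
                                         struct shrink_control *sc)
{
        /* Placeholder: split up to sc->nr_to_scan THPs queued for
         * sc->memcg only, preserving isolation between hierarchies. */
        return 0;
}

static struct shrinker deferred_split_shrinker = {
        .count_objects  = deferred_split_count,
        .scan_objects   = deferred_split_scan,
        .seeks          = DEFAULT_SEEKS,
        .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};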

>
>>> I'm curious if your internal applications team is also asking for
>>> statistics on how much memory can be freed if the deferred split queues
>>> can be shrunk?  We have applications that monitor their own memory usage
>> No, but this reminds me: the THPs on the deferred split queue should be
>> accounted as available memory too.
>>
> Right, and we have also seen this for users of MADV_FREE who see both
> increased rss and memcg usage and don't realize that the memory is freed
> under pressure.  I'm thinking that we need some kind of MemAvailable for
> memcg hierarchies to be the authoritative source of what can be reclaimed
> under pressure.

That sounds useful. We also need to know the available memory at memcg 
scope in our containers.
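
In the meantime, it can only be approximated from userspace. A minimal 
sketch, assuming a cgroup v1 hierarchy and a hypothetical group name 
"mygroup"; it only counts file-backed LRU pages, so MADV_FREE memory and 
queued deferred-split THPs are exactly what it misses, which is the point:

/* Userspace sketch: crude per-memcg "available" estimate from cgroup v1
 * memory.stat.  total_inactive_file and total_active_file are standard
 * v1 fields, reported in bytes; "mygroup" is a hypothetical group. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *path = "/sys/fs/cgroup/memory/mygroup/memory.stat";
        char key[64];
        unsigned long long val, avail = 0;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return 1;
        }
        while (fscanf(f, "%63s %llu", key, &val) == 2) {
                if (!strcmp(key, "total_inactive_file") ||
                    !strcmp(key, "total_active_file"))
                        avail += val;   /* reclaimable page cache */
        }
        fclose(f);
        printf("approx available: %llu bytes\n", avail);
        return 0;
}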

>
>>> through memcg stats or usage and proactively try to reduce that usage when
>>> it is growing too large.  The deferred split queues have significantly
>>> increased both memcg usage and rss when they've upgraded kernels.
>>>
>>> How are your applications monitoring how much memory from deferred split
>>> queues can be freed under memory pressure?  Any thoughts on providing it
>>> as a memcg stat?
>> I don't think they have such monitoring. I saw that rss_huge was abnormal
>> in memcg stat even after the application was killed by the OOM killer, so I
>> realized the deferred split queue may play a role here.
>>
> Exactly the same in my case :)  We were likely looking at the exact same
> issue at the same time.

Yes, it seems so. :-)

>> The memcg stat doesn't have counters for available memory the way the
>> global vmstat does. It may be better to have such statistics, or to extend
>> reclaimable "slab" to shrinkable/reclaimable "memory".
>>
> Have you considered following how NR_ANON_MAPPED is tracked for each pgdat
> and using that as an indicator of when to modify a memcg stat to track
> the amount of memory on a compound page?  I think this would be necessary
> for userspace to know what their true memory usage is.

No, I haven't. Do you mean subtracting MADV_FREE and deferred-split THPs 
from NR_ANON_MAPPED? It looks like they have already been subtracted from 
NR_ANON_MAPPED when the rmap is removed.
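
I.e. something like the following, heavily condensed (the function name and 
body are placeholders, loosely following ~v5.1 mm/rmap.c, not the real 
implementation):

/* Hypothetical sketch of the accounting in question: when the last PMD
 * mapping of an anon THP goes away, rmap teardown already subtracts the
 * compound page from NR_ANON_MAPPED and queues it for deferred split. */
#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/vmstat.h>

static void thp_rmap_teardown(struct page *head, int nr_unmapped)
{
        /* NR_ANON_MAPPED drops here, at unmap time, so a THP sitting on
         * the deferred split queue is no longer "mapped" ... */
        mod_node_page_state(page_pgdat(head), NR_ANON_MAPPED, -nr_unmapped);

        /* ... yet its memory is not freed until the shrinker splits it,
         * which is the window the proposed stat would make visible. */
        deferred_split_huge_page(head);
}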

