linux-kernel - Re: [RFC 0/4] Introduce unbalance proactive reclaim

Date:   Wed, 15 Nov 2023 10:11:56 +0800
From:   Huan Yang <link@...o.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     "Huang, Ying" <ying.huang@...el.com>, Tejun Heo <tj@...nel.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>,
        Matthew Wilcox <willy@...radead.org>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Peter Xu <peterx@...hat.com>,
        "Vishal Moola (Oracle)" <vishal.moola@...il.com>,
        Yosry Ahmed <yosryahmed@...gle.com>,
        Liu Shixin <liushixin2@...wei.com>,
        Hugh Dickins <hughd@...gle.com>, cgroups@...r.kernel.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, opensource.kernel@...o.com
Subject: Re: [RFC 0/4] Introduce unbalance proactive reclaim


On 2023/11/14 21:03, Michal Hocko wrote:
> On Tue 14-11-23 20:37:07, Huan Yang wrote:
>> On 2023/11/14 18:04, Michal Hocko wrote:
>>> On Mon 13-11-23 09:54:55, Huan Yang wrote:
>>>> On 2023/11/10 20:32, Michal Hocko wrote:
>>>>> On Fri 10-11-23 14:21:17, Huan Yang wrote:
>>>>> [...]
>>>>>>> BTW: how do you know the number of pages to be reclaimed proactively in
>>>>>>> memcg proactive reclaiming based solution?
>>>>>> One point here is that we are not sure how long the frozen application
>>>>>> will be opened, it could be 10 minutes, an hour, or even days.  So we
>>>>>> need to predict and try, gradually reclaim anonymous pages in
>>>>>> proportion, preferably based on the LRU algorithm.  For example, if
>>>>>> the application has been frozen for 10 minutes, reclaim 5% of
>>>>>> anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%.  It is even more
>>>>>> complicated as it requires adding a mechanism for predicting failure
>>>>>> penalties.
>>>>> Why would you make your reclaiming decisions based on time rather than
>>>>> the actual memory demand? I can see how a pro-active reclaim could make
>>>>> head room for an unexpected memory pressure but applying more pressure
>>>>> just because of inactivity sounds rather dubious to me TBH. Why cannot
>>>>> you simply wait for the external memory pressure (e.g. from kswapd) to
>>>>> deal with that based on the demand?
>>>> Because the current kswapd and direct memory reclamation are a passive
>>>> memory reclamation based on the watermark, and in the event of triggering
>>>> these reclamation scenarios, the smoothness of the phone application cannot
>>>> be guaranteed.
>>> OK, so you are worried about latencies on spike memory usage.
>>>
>>>> (We often observe that when the above reclamation is triggered, there
>>>> is a delay in the application startup, usually accompanied by block
>>>> I/O, and some concurrency issues caused by lock design.)
>>> Does that mean you do not have enough head room for kswapd to keep with
>> Yes, but if we set the high watermark a little higher, the power
>> consumption will be very high.  We usually observe that kswapd runs
>> frequently.  Even if we have set a low kswapd watermark, kswapd CPU
>> usage can still be high in some extreme scenarios (for example, when
>> starting a large application that needs to acquire a large amount of
>> memory in a short period of time).  However, we will not discuss it in
>> detail here; the reasons are quite complex, and we have not yet sorted
>> out a complete understanding of them.
> This is definitely worth investigating further before resorting to
> proposing a new interface. If the kswapd consumes CPU cycles
> unproductively then we should look into why.
Yes, this is my current research objective.
>
> If there is a big peak memory demand then that surely requires CPU
> capacity for the memory reclaim. The work has to be done, whether that
> is in kswapd or the pro-active reclaimer context. I can imagine the
> latter one could be invoked with a better timing in mind but that is not
> a trivial thing to do. There are examples where this could be driven by
> PSI feedback loop but from what you have mentioned earlier you are doing
> an idle time based reclaim. Anyway, this is mostly a tuning related
> discussion. I wanted to learn more about what you are trying to achieve
> and so far it seems to me you are trying to work around some issues and
> a) we would like to learn about those issues and b) a new interface is
> unlikely a good fit to paper over a suboptimal behavior.
Our current research goal is to find a workable dynamic balance between the
latency cost of passive memory reclaim and the application deaths caused by
actively killing processes.

The current strategy is to use proactive memory reclaim to intervene in
this process. As mentioned earlier, by proactively reclaiming anonymous
pages that are deemed safe to reclaim, we can increase the currently
available memory, avoid lag when starting new applications, and prevent
the death of resident applications.
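
As a rough illustration of the time-based schedule quoted earlier in this
thread (10 minutes: 5%, 30 minutes: 25%, 1 hour: 75%, 1 day: 100% of
anonymous pages), a minimal Python sketch might look like the following.
The names SCHEDULE, anon_reclaim_percent and reclaim_target_bytes are
hypothetical, and treating the percentages as step thresholds applied to
the total anonymous size is an assumption; the thread does not specify
that, and the failure-penalty prediction is not modelled at all.

```python
# Thresholds taken from the thread: (minimum frozen seconds, percent of
# anonymous pages to reclaim). Ordered from longest to shortest so the
# first match wins. Step behaviour between thresholds is an assumption.
SCHEDULE = [
    (24 * 3600, 100),  # frozen for a day or more
    (3600, 75),        # frozen for an hour or more
    (30 * 60, 25),     # frozen for 30 minutes or more
    (10 * 60, 5),      # frozen for 10 minutes or more
]


def anon_reclaim_percent(frozen_seconds):
    """Return the percent of anonymous memory to target for reclaim."""
    for threshold, percent in SCHEDULE:
        if frozen_seconds >= threshold:
            return percent
    return 0  # not frozen long enough: leave the application alone


def reclaim_target_bytes(anon_bytes, frozen_seconds):
    """Return how many bytes of anonymous memory the policy would target."""
    return anon_bytes * anon_reclaim_percent(frozen_seconds) // 100
```

For example, reclaim_target_bytes(1000, 600) gives 50, while an
application frozen for less than 10 minutes gets a target of 0.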

Through the previous discussions, it seems that we have reached a consensus
that although the proposed proactive reclaim interface can achieve this
goal, it is not the best approach. Using madvise (MADV_PAGEOUT) reuses an
existing mechanism to achieve this goal and also lets us decide whether to
reclaim based on the characteristics of the anonymous VMA, especially the
anon_vma name that has been set.

Therefore, I will also push for internal research on this approach.
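
To make the MADV_PAGEOUT direction concrete, here is a minimal Python
sketch (Python 3.8+ on Linux assumed) that asks the kernel to page out a
private anonymous mapping. The helper names make_private_anon and
page_out are hypothetical. The fallback value 21 for MADV_PAGEOUT comes
from the Linux uapi headers and is used only if the Python build does not
export the constant. Whether anything is actually reclaimed depends on
the kernel version and on swap or zram being configured.

```python
import mmap

# MADV_PAGEOUT is 21 in the generic Linux uapi headers (assumption: the
# target architecture uses the generic value). Prefer the constant from
# the mmap module if this Python build provides it.
MADV_PAGEOUT = getattr(mmap, "MADV_PAGEOUT", 21)


def make_private_anon(length):
    """Create a private anonymous mapping (MAP_PRIVATE | MAP_ANONYMOUS)."""
    return mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def page_out(region):
    """Hint the kernel to reclaim the pages backing `region`.

    Returns True if the kernel accepted the hint and False if madvise()
    rejected it (for example on a kernel without MADV_PAGEOUT support).
    The contents of the mapping are preserved either way.
    """
    try:
        region.madvise(MADV_PAGEOUT)
        return True
    except (OSError, ValueError):
        return False
```

A caller would create a region, write to it so the pages are populated,
and then call page_out(region). For a frozen application the hint would
have to come from another process, for example via process_madvise(2),
which is not shown here.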
>
>>> This would suggest that MADV_PAGEOUT is really what you are looking
>>> for.
>> Yes, I agree, especially to avoid reclaiming shared anonymous pages.
>>
>> However, I did some shallow research and found that MADV_PAGEOUT does
>> not reclaim pages with mapcount != 1. Our applications are usually
>> composed of multiple processes, and some anonymous pages are shared
>> among them. When the application is frozen, the memory that is only
>> shared among the processes within the application should be released,
>> but MADV_PAGEOUT seems not to be suitable for this scenario? (If I
>> misunderstood anything, please correct me.)
> Hmm, OK it seems that we are hitting some terminology problems. The
> discussion was about private memory so far (essentially MAP_PRIVATE)
> now you are talking about a shared anonymous memory. That would imply
> shmem and that is indeed not supported by MADV_PAGEOUT. The reason for
> that is that this poses a security risk for time based attacks. I can
> imagine, though, that we could extend the behavior to support shared
> mappings if they do not cross a security boundary (e.g. mapped by the
> same user). This would require some analysis though.
OK, thanks. I have checked with our internal team and found that this part
of the memory usage is not particularly large.
>   
>> In addition, I still have doubts that this approach will consume a lot
>> of strategy resources, but it is worth studying.
>>> If you really aim at compressing a specific type of memory then
>>> tweaking reclaim to achieve that sounds like a shortcut because
>>> madvise based solution is more involved. But that is not a solid
>>> justification for adding a new interface.
>> Yes, but this RFC is just adding an additional configuration option to
>> the proactive reclaim interface. And in the reclaim path, prioritize
>> processing these requests with reclaim tendencies. However, using
>> `unlikely` judgment should not have much impact.
> Just adding a configuration option means a user interface contract
> that needs to be maintained forever. Our future reclaim algorithm might
> change (and in fact it has already changed quite a bit with MGLRU) and
> explicit request for LRU type specific reclaim might not even have any
> sense. See that point?
Yes, I get it.  This also means that if the reclaim algorithm changes, the
current implementation of tendencies will need to be modified accordingly,
which carries a maintenance cost.
If the current implementation of tendencies cannot prove its necessity, it
needs deeper research first.
This solution may be the simpler way for me to achieve our internal goals,
but it may not be the best solution. So MADV_PAGEOUT is worth researching.

This conversation was very beneficial for me.
Thank you all very much.
>
-- 
Thanks,
Huan Yang