Date:   Wed, 9 Jan 2019 17:47:41 -0800
From:   Yang Shi <yang.shi@...ux.alibaba.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     mhocko@...e.com, shakeelb@...gle.com, akpm@...ux-foundation.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when
 offlining



On 1/9/19 2:51 PM, Johannes Weiner wrote:
> On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
>> On 1/9/19 1:23 PM, Johannes Weiner wrote:
>>> On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
>>>> As I mentioned above, if we know that some page caches from some memcgs
>>>> are referenced only once and are unlikely to be shared, why keep them
>>>> around just to increase memory pressure?
>>> It's just not clear to me that your scenarios are generic enough to
>>> justify adding two interfaces that we have to maintain forever, and
>>> that they couldn't be solved with existing mechanisms.
>>>
>>> Please explain:
>>>
>>> - Unmapped clean page cache isn't expensive to reclaim, certainly
>>>     cheaper than the IO involved in new application startup. How could
>>>     recycling clean cache be a prohibitive part of workload warmup?
>> It is not about recycling. Those page caches might be referenced by the
>> memcg just once, and then nobody touches them until memory pressure is
>> hit. And they might not be accessed again any time soon.
> I meant recycling the page frames, not the cache in them. So the new
> workload as it starts up needs to take those pages from the LRU list
> instead of just the allocator freelist. While that's obviously not the
> same cost, it's not clear why the difference would be prohibitive to
> application startup especially since app startup tends to be dominated
> by things like IO to fault in executables etc.

I'm a little bit confused here. Even though those page frames are not 
reclaimed by force_empty, they would be reclaimed by kswapd later once 
memory pressure is hit. Some use cases may prefer having the pages 
recycled before kswapd kicks them off the LRU, but for other use cases 
avoiding memory pressure might outweigh page frame recycling.
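
One synchronous way to do this today, with cgroup v1, is the 
memory.force_empty knob driven by hand right before removal. A rough 
sketch, assuming the v1 memory controller is mounted at 
/sys/fs/cgroup/memory and a group named cg1:

# reclaim as much as possible from the group, then remove it; the write
# is synchronous and can take a long time for a large group
echo 0 > /sys/fs/cgroup/memory/cg1/memory.force_empty
rmdir /sys/fs/cgroup/memory/cg1

The series under discussion defers that reclaim to the offline path 
instead.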

>
>>> - Why you couldn't set memory.high or memory.max to 0 after the
>>>     application quits and before you call rmdir on the cgroup
>> I recall I explained this in the review email for the first version. Setting
>> memory.high or memory.max to 0 would trigger direct reclaim, which may stall
>> the offlining of the memcg. But we have "restarting the same-name job" logic
>> in our use case (I'm not quite sure why they do so). Basically, it means
>> creating a memcg with the exact same name right after the old one is deleted,
>> but possibly with a different limit or other settings. The creation has to
>> wait until rmdir is done.
> This really needs a fix on your end. We cannot add new cgroup control
> files because you cannot handle a delayed release in the cgroupfs
> namespace while you're reclaiming associated memory. A simple serial
> number would fix this.
>
> Whether others have asked for this knob or not, these patches should
> come with a solid case in the cover letter and changelogs that explain
> why this ABI is necessary to solve a generic cgroup usecase. But it
> sounds to me that setting the limit to 0 once the group is empty would
> meet the functional requirement (use fork() if you don't want to wait)
> of what you are trying to do.
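
As an aside, if I understand the serial-number idea correctly, it would 
look roughly like the below; the mount point, the job names and $JOB_PID 
are just placeholders, assuming a unified v2 mount at /sys/fs/cgroup:

n=7                                    # old instance's sequence number
                                       # (made up; kept by the job manager)
old=/sys/fs/cgroup/job1-$n
new=/sys/fs/cgroup/job1-$((n + 1))
( echo 0 > "$old"/memory.max; rmdir "$old" ) &   # drain and remove the old
                                                 # instance in the background
mkdir "$new"                           # never blocked: the name is new, so
                                       # the old group's delayed release is
                                       # invisible here
echo "$JOB_PID" > "$new"/cgroup.procs  # restarted job attaches right away

i.e. mkdir of the new group never has to wait for the old group's rmdir.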

As for setting the limit to 0, do you mean doing something like the below:

echo 0 > cg1/memory.max &
rmdir cg1 &
mkdir cg1 &

But the latency is still there. Even though memcg creation (mkdir) can be 
done very quickly by using fork(), the latency would delay subsequent 
operations, e.g. attaching tasks (echo PID > cg1/cgroup.procs). When we 
calculate the time consumption of a container deployment, we count from 
mkdir until the job is actually launched.
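
Roughly, the window we measure looks like this (just a sketch; the path 
and the launch command are placeholders):

t0=$(date +%s%N)
mkdir /sys/fs/cgroup/cg1                    # in our flow this cannot succeed
                                            # until the old cg1 is fully gone
echo $$ > /sys/fs/cgroup/cg1/cgroup.procs   # attach, delayed by the wait above
start-the-job &                             # placeholder for the actual launch
t1=$(date +%s%N)
echo "deploy latency: $(( (t1 - t0) / 1000000 )) ms"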

So, without deferring force_empty to the offline kworker, we still suffer 
from the latency.

Am I missing anything?

Thanks,
Yang

>
> I don't think the new interface bar is met here.
