linux-kernel - Re: [RESEND PATCH 1/2 -mm] mm: account lazy free pages separately

Message-ID: <20190814110850.GT17933@dhcp22.suse.cz>
Date:   Wed, 14 Aug 2019 13:08:50 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Yang Shi <yang.shi@...ux.alibaba.com>
Cc:     kirill.shutemov@...ux.intel.com, hannes@...xchg.org,
        vbabka@...e.cz, rientjes@...gle.com, akpm@...ux-foundation.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RESEND PATCH 1/2 -mm] mm: account lazy free pages separately

On Mon 12-08-19 10:00:17, Yang Shi wrote:
> 
> 
> On 8/12/19 2:34 AM, Michal Hocko wrote:
> > On Fri 09-08-19 16:54:43, Yang Shi wrote:
> > > 
> > > On 8/9/19 11:26 AM, Yang Shi wrote:
> > > > 
> > > > On 8/9/19 11:02 AM, Michal Hocko wrote:
> > [...]
> > > > > I have to study the code some more but is there any reason why those
> > > > > pages are not accounted as proper THPs anymore? Sure they are partially
> > > > > unmapped but they are still THPs so why can't we keep them accounted
> > > > > like that? Having a new counter to reflect that sounds like papering
> > > > > over the problem to me. But as I've said I might be missing something
> > > > > important here.
> > > > I think we could keep those pages accounted for NR_ANON_THPS since they
> > > > are still THP although they are unmapped as you mentioned if we just
> > > > want to fix the improper accounting.
> > > By double checking what NR_ANON_THPS really means,
> > > Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
> > > into userspace page tables". Then it makes some sense to dec NR_ANON_THPS
> > > when removing rmap even though they are still THPs.
> > > 
> > > I don't think we would like to change the definition, if so a new counter
> > > may make more sense.
> > Yes, changing NR_ANON_THPS semantics sounds like a bad idea. Let
> > me check whether I understand the problem. So we have some THPs in
> > limbo waiting to be split and their unmapped parts to be freed,
> > right? I can see that page_remove_anon_compound_rmap does correctly
> > decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
> > anybody. LRU pages seem to be accounted properly as well.  As you've
> > said NR_ANON_THPS reflects the number of THPs mapped and that should be
> > reflecting the reality already IIUC.
> > 
> > So the only problem seems to be that deferred THP might aggregate a lot
> > of immediately freeable memory (if none of the subpages are mapped) and
> > that can confuse MemAvailable because it doesn't know about the fact.
> > Has a skewed counter resulted in user-observable behavior/failures?
> 
> No. But the skewed counter may make a big difference for a large-scale
> cluster. MemAvailable is an important factor for a cluster scheduler when
> determining capacity.

But MemAvailable is a very rough estimate. Is relying on it really a
good measure? I mean there is a lot of reclaimable memory that is not
reflected there (some fs-internal data structures, networking buffers,
etc.)

[...]

> > accounting the full THP correct? What if subpages are still mapped?
> 
> "Deferred split" definitely doesn't mean they are free. When memory pressure
> is hit, they would be split, then the unmapped normal pages would be freed.
> So, when calculating MemAvailable, they are not accounted 100%, but like
> "available += lazyfree - min(lazyfree / 2, wmark_low)", just like how page
> cache is accounted.

Then this is even more dubious IMHO.
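
For reference, the discount in the quoted formula can be written out as
a small helper. This is only an illustrative sketch of the arithmetic
"pool - min(pool / 2, wmark_low)" as quoted above, not kernel code: the
function name discounted_pool() and its plain long parameters are
hypothetical, and the inputs are assumed to be page counts.

```c
#include <assert.h>

/*
 * Hypothetical helper, not a kernel function: returns the share of a
 * reclaimable pool that the quoted formula would add to the estimate.
 * The pool is reduced by min(pool / 2, wmark_low), so at most half of
 * it is held back, and never more than the low watermark.
 */
static long discounted_pool(long pool, long wmark_low)
{
	long held_back;

	/* Assumption: both values are non-negative page counts. */
	assert(pool >= 0 && wmark_low >= 0);

	held_back = pool / 2 < wmark_low ? pool / 2 : wmark_low;
	return pool - held_back;
}
```

With 512 lazy free pages and a low watermark of 128 pages, 128 pages are
held back and 384 are counted. With only 100 lazy free pages, half of
them (50) are counted. Whether any of those pages can really be freed
still depends on the subpage mapcounts at split time, which this
arithmetic does not look at.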

> We could get a more accurate count, i.e. by checking each subpage's
> mapcount when accounting, but it may change before the shrinker starts
> scanning. So we just use a ballpark estimate, trading some accuracy for
> lower complexity.

I do not see much point in fixing up one particular counter when there
is a whole lot that is not even considered. I would rather live with the
fact that MemAvailable is only a very rough estimate than play
whack-a-mole with every memory consumer that is freeable directly or
indirectly via memory reclaim. Because this is likely to always be
subtly broken, and only visible under very specific workloads, there is
no way to test for it.
-- 
Michal Hocko
SUSE Labs