Message-ID: <alpine.LSU.2.11.2006111409280.10801@eggly.anvils>
Date:   Thu, 11 Jun 2020 15:09:54 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Alex Shi <alex.shi@...ux.alibaba.com>
cc:     Hugh Dickins <hughd@...gle.com>, akpm@...ux-foundation.org,
        mgorman@...hsingularity.net, tj@...nel.org,
        khlebnikov@...dex-team.ru, daniel.m.jordan@...cle.com,
        yang.shi@...ux.alibaba.com, willy@...radead.org,
        hannes@...xchg.org, lkp@...el.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
        shakeelb@...gle.com, iamjoonsoo.kim@....com,
        richard.weiyang@...il.com
Subject: Re: [PATCH v11 00/16] per memcg lru lock

On Thu, 11 Jun 2020, Alex Shi wrote:
> > On 2020/6/10 11:22 AM, Hugh Dickins wrote:
> > On Mon, 8 Jun 2020, Alex Shi wrote:
> >> On 2020/6/8 12:15 PM, Hugh Dickins wrote:
> >>>>  24 files changed, 487 insertions(+), 312 deletions(-)
> >>> Hi Alex,
> >>>
> >>> I didn't get to try v10 at all, waited until Johannes's preparatory
> >>> memcg swap cleanup was in mmotm; but I have spent a while thrashing
> >>> this v11, and can happily report that it is much better than v9 etc:
> >>> I believe this memcg lru_lock work will soon be ready for v5.9.
> >>>
> >>> I've not yet found any flaw at the swapping end, but fixes are needed
> >>> for isolate_migratepages_block() and mem_cgroup_move_account(): I've
> >>> got a series of 4 fix patches to send you (I guess two to fold into
> >>> existing patches of yours, and two to keep as separate from me).
> >>>
> >>> I haven't yet written the patch descriptions, will return to that
> >>> tomorrow.  I expect you will be preparing a v12 rebased on v5.8-rc1
> >>> or v5.8-rc2, and will be able to include these fixes in that.
> >>
> >> I am very glad to get your help on this feature!
> >>
> >> and am looking forward to your fixes tomorrow. :)
> >>
> >> Thanks a lot!
> >> Alex
> > 
> > Sorry, Alex, the news is not so good today.
> > 
> > You'll have noticed I sent nothing yesterday. That's because I got
> > stuck on my second patch: could not quite convince myself that it
> > was safe.
> 
> Hi Hugh,
> 
> Thanks a lot for your help and effort! I really appreciate it.
> 
> > 
> > I keep hinting at these patches, and I can't complete their writeups
> > until I'm convinced; but to give you a better idea of what they do:
> > 
> > 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().
> 
> I guess I understand this, now that mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
> has been removed.

No, I already assumed you had backed that out: these are fixes beyond that.

> 
> > 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
> > 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
> > 4. Adds lruvec lock protection in mem_cgroup_move_account().
> 
> Sorry, I can't follow you on the above issues.

Indeed, more explanation needed: coming.

> Anyway, I will send out a new patchset with the first issue fixed,
> and then let's discuss based on it.

Sigh. I wish you had waited for me to send you fixes, or waited for an
identifiable tag like 5.8-rc1.  Andrew has been very hard at work with
mm patches to Linus, but it looks like there are still "data_race" mods
to come before -rc1, which may stop your v12 from applying cleanly.

> 
> > 
> > In the second, I was using rcu_read_lock() instead of trylock_page()
> > (like in my own patchset), but could not quite be sure of the case when
> > PageSwapCache gets set at the wrong moment. Gave up for the night, and
> > in the morning abandoned that, instead just shifting the call to
> > __isolate_lru_page_prepare() after the get_page_unless_zero(),
> > where that trylock_page() becomes safe (no danger of stomping on page
> > flags while page is being freed or newly allocated to another owner).
> 
> Sorry, I don't understand the problem with trylock_page here. Could you
> describe it as a race?

Races, yes. Look, I'll send you now patches 1 and 2: at least with those
in it should be safe for you and others to test compaction (if 5.8-rc1
turns out well: I think so much has gone in that it might have unrelated
problems, and often the -rc2 is much more stable).
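
For concreteness, the shape of that reordering in
isolate_migratepages_block() is roughly as follows (a sketch of the idea
only, not the actual patch; the error-return convention of
__isolate_lru_page_prepare() is assumed here):

	/*
	 * Sketch: take a reference first, so the page cannot be freed and
	 * reallocated to another owner while its flags are inspected; only
	 * then call __isolate_lru_page_prepare(), where trylock_page() is
	 * now safe.
	 */
	if (unlikely(!get_page_unless_zero(page)))
		goto isolate_fail;

	if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
		goto isolate_fail_put;	/* drop the reference taken above */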

But no point in sending 3 and 4 at this point, since ...

> 
> > 
> > I thought that a very safe change, but best to do some test runs with
> > it in before finalizing. And was then unpleasantly surprised to hit a
> > VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
> > lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
> > pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
> > Then similar but < rotate_reclaimable_page after 8 hours on another.
> > 
> > Only seen once before: that's what drove me to add patch 4 (with 3 to
> > revert the locking before it): somehow, when adding the lruvec locking
> > there, I just took it for granted that your patchset would have the
> > appropriate locking (or TestClearPageLRU magic) at the other end.
> > 
> > But apparently not. And I'm beginning to think that TestClearPageLRU
> > was just to distract the audience from the lack of proper locking.
> > 
> > I have certainly not concluded that yet, but I'm having to think about
> > an area of the code which I'd imagined you had under control (and I'm
> > puzzled why my testing has found it so very hard to hit). If we're
> > lucky, I'll find that pagevec_move_tail is a special case, and
> > nothing much else needs changing; but I doubt that will be so.

... shows that your locking primitives are not yet good enough
to handle the case when tasks are moved between memcgs with
move_charge_at_immigrate set.  "bin/cg m" in the tests I sent,
but today I'm changing its "seconds=60" to "seconds=1" in hope
of speeding up the reproduction.

Ah, good, two machines crashed in 1.5 hours: but I don't need to
examine the crashes, now that it's obvious there's no protection -
please, think about rotate_reclaimable_page() (there will be more
cases, but in practice that seems easiest to hit, so focus on that)
and how it is not protected from mem_cgroup_move_account().
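
One plausible interleaving behind that VM_BUG_ON (a reconstruction for
illustration, not taken from the crash reports):

	CPU A: rotate_reclaimable_page()      CPU B: mem_cgroup_move_account()
	  lruvec = lruvec of page->mem_cgroup
	                                        page->mem_cgroup = to
	                                        (no lruvec lock held,
	                                         PageLRU still set)
	  spin_lock_irqsave(&lruvec->lru_lock)
	  VM_BUG_ON_PAGE(lruvec_memcg(lruvec)
	                 != page->mem_cgroup)   /* now mismatched: fires */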

I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
(8 years ago it was unsuitable, but a lot has changed for the better
since then); otherwise it's back to what I've been doing all along,
taking the likely lruvec lock, and checking under that lock whether
we have the right lock (as your lruvec_memcg_debug() does), retrying
if not. Which may be more efficient than involving lock_page_memcg().
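
A minimal sketch of that lock-recheck-retry idea (assuming lru_lock has
moved into struct lruvec as in this series; the helpers used are
illustrative, not the final API):

	struct lruvec *lruvec;
retry:
	/* Lock the lruvec the page currently appears to belong to ... */
	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
	spin_lock_irq(&lruvec->lru_lock);
	/* ... then recheck, now that the lock is held. */
	if (unlikely(lruvec_memcg(lruvec) != page->mem_cgroup)) {
		spin_unlock_irq(&lruvec->lru_lock);
		goto retry;
	}
	/*
	 * With fix 4 also taking this lock in mem_cgroup_move_account(),
	 * the page's memcg is now stable until we unlock.
	 */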

But I guess still worth sending my first two patches, since most of us
use move_charge_at_immigrate only for... testing move_charge_at_immigrate.
Whereas compaction bugs can hit any of us at any time.

> > 
> > There's one other unexplained and unfixed bug I've seen several times
> > while exercising mem_cgroup_move_account(): refcount_warn_saturate()
> > from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
> > I'll be glad if that goes away when the lruvec locking is fixed,
> > but don't understand the connection. And it's quite possible that
> > this refcounting bug has nothing to do with your changes: I have
> > not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
> > but I didn't really try long enough to be sure.

I got one of those quite quickly too after setting "cg m"'s seconds=1.
I think the best thing I can do while thinking and researching, is
give 5.7-rc7-mm1 a run on that machine with the speeded up moving,
to see whether or not that refcount bug reproduces.

> > 
> > (I should also warn, that I'm surprised by the amount of change
> > 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)
> 
> Yes, that is a bit complex. I have tried the mlock cases in the selftests
> along with your swap&build case. They were all fine across 300 runs.

Good, thanks.

Hugh
