lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 5 Feb 2014 10:28:21 -0500
From:	Johannes Weiner <hannes@...xchg.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	linux-mm@...ck.org
Subject: Re: [PATCH -v2 4/6] memcg: make sure that memcg is not offline when
 charging

On Wed, Feb 05, 2014 at 02:38:34PM +0100, Michal Hocko wrote:
> On Tue 04-02-14 11:29:39, Johannes Weiner wrote:
> [...]
> > Maybe we should remove the XXX if it makes you think we should change
> > the current situation by any means necessary.  This patch is not an
> > improvement.
> >
> > I put the XXX there so that we one day maybe refactor the code in a
> > clean fashion where try_get_mem_cgroup_from_whatever() is in the same
> > rcu section as the first charge attempt.  On failure, reclaim, and do
> > the lookup again.
> 
> I wouldn't be opposed to such a cleanup. It is not that simple, though.
>
> > Also, this problem only exists on swapin, where the memcg is looked up
> > from an auxilliary data structure and not the current task, so maybe
> > that would be an angle to look for a clean solution.
> 
> I am not so sure about that. Task could have been moved to a different
> group basically anytime it was outside of rcu_read_lock section (which
> means most of the time). And so the group might get removed and we are
> in the very same situation.
> 
> > Either way, the problem is currently fixed 
> 
> OK, my understanding (and my ack was based on that) was that we needed
> a simple and safe fix for the stable trees and we would have something
> more appropriate later on. Preventing from the race sounds like a more
> appropriate and a better technical solution to me. So I would rather ask
> why to keep a workaround in place. Does it add any risk?
> Especially when we basically abuse the 2 stage cgroup removal. All the
> charges should be cleared out after css_offline.

I thought more about this and talked to Tejun as well.  He told me
that the rcu grace period between disabling tryget and calling
css_offline() is currently an implementation detail of the refcounter
that css uses, but it's not a guarantee.  So my initial idea of
reworking memcg to do css_tryget() and res_counter_charge() in the
same rcu section is no longer enough to synchronize against offlining.
We can forget about that.

On the other hand, memcg holds a css reference only while an actual
controller reference is being established (res_counter_charge), then
drops it.  This means that once css_tryget() is disabled, we only need
to wait for the css refcounter to hit 0 to know for sure that no new
charges can show up and reparent_charges() is safe to run, right?

Well, css_free() is the callback invoked when the ref counter hits 0,
and that is a guarantee.  From a memcg perspective, it's the right
place to do reparenting, not css_offline().

Here is the only exception to the above: swapout records maintain
permanent css references, so they prevent css_free() from running.
For that reason alone we should run one optimistic reparenting in
css_offline() to make sure one swap record does not pin gigabytes of
pages in an offlined cgroup, which is unreachable for reclaim.  But
the reparenting for *correctness* is in css_free(), not css_offline().

We should be changing the comments.  The code is already correct.

> > Unless the alternative solution is inherent in a clean rework of the
> > code to match cgroup core lifetime management, I don't see any reason
> > to move away from the status quo.
> 
> To be honest this sounds like a weak reasoning to refuse a real fix
> which replaces a workaround.
> 
> This is a second attempt to fix the actual race that you are dismissing
> which is really surprising to me. Especially when the workaround is an
> ugly hack.

IMO it was always functionally correct, just something that could have
been done cleaner from a design POV.  That's why I refused every
alternative solution that made the code worse instead of better.

But looks like it also makes perfect sense from a design POV, so
it's all moot now.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ