linux-kernel - Re: [RFC 0/4] memcg: Low-limit reclaim

Message-ID: <877g90cy6j.wl%klamm@yandex-team.ru>
Date:	Wed, 12 Feb 2014 16:28:36 +0400
From:	Roman Gushchin <klamm@...dex-team.ru>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Roman Gushchin <klamm@...dex-team.ru>, linux-mm@...ck.org,
	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Ying Han <yinghan@...gle.com>, Hugh Dickins <hughd@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	Greg Thelen <gthelen@...gle.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Tejun Heo <tj@...nel.org>
Subject: Re: [RFC 0/4] memcg: Low-limit reclaim

Hi, Michal!

Sorry for the late reply.

At Wed, 29 Jan 2014 19:22:59 +0100,
Michal Hocko wrote:
> > As you can remember, I've proposed to introduce low limits about a year ago.
> > 
> > We had a small discussion at that time: http://marc.info/?t=136195226600004 .
> 
> yes I remember that discussion and vaguely remember the proposed
> approach. I really wanted to prevent from introduction of a new knob but
> things evolved differently than I planned since then and it turned out
> that the new knob is unavoidable. That's why I came up with this approach
> which is quite different from yours AFAIR.
>  
> > Since that time we intensively use low limits in our production
> > (on thousands of machines). So, I'm very interested to merge this
> > functionality into upstream.
> 
> Have you tried to use this implementation? Would this work as well?
> My very vague recollection of your patch is that it didn't cover both
> global and target reclaims and it didn't fit into the reclaim very
> naturally, as it used its own scaling method. I will have to refresh my
> memory though.

IMHO, the main problem with your implementation is the following:
the number of reclaimed pages is not limited at all
if a cgroup is over its low memory limit. So a significant number
of pages can be reclaimed even if the memory usage is only slightly
(e.g. one page) above the low limit.

In my case, this problem is solved by scaling the number of scanned pages.
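
A minimal user-space sketch of that scaling, mirroring
mem_cgroup_low_limit_scale() from the patch below. It takes plain byte
counts instead of a struct mem_cgroup, and it assumes DEF_PRIORITY is 12;
the function name low_limit_scale() is only for illustration.

```c
#include <stdint.h>

/* Assumption: matches the kernel's DEF_PRIORITY value of 12. */
#define DEF_PRIORITY 12

/*
 * Illustrative model of mem_cgroup_low_limit_scale().
 * Returns a right-shift in [0, DEF_PRIORITY - 2] that is applied
 * to the number of pages to scan: the closer usage is to the low
 * limit, the larger the shift and the fewer pages are scanned.
 */
static unsigned int low_limit_scale(uint64_t usage, uint64_t low_limit)
{
	unsigned int i;

	/* Low limit not set: no scaling. */
	if (!low_limit)
		return 0;

	/* Below the low limit: maximum scaling. */
	if (usage < low_limit)
		return DEF_PRIORITY - 2;

	/* Compare the excess with usage/8, usage/16, usage/32, ... */
	for (i = 0; i < DEF_PRIORITY - 2; i++)
		if (usage - low_limit > (usage >> (i + 3)))
			break;

	return i;
}
```

With a 1 GiB low limit, a usage of 2 GiB gives a shift of 0 (no
slowdown), while a usage of 1 GiB plus one 4 KiB page gives the maximum
shift of 10, so the scan target is divided by 1024.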

I think an ideal solution is to limit the number of reclaimed pages by
the low limit excess value. This would allow me to drop my scaling code
but keep the strict semantics of the low limit under memory pressure.
The main problem here is how to balance scanning pressure between
cgroups and LRUs.

Maybe we should calculate the number of pages to scan in an LRU based on
the low limit excess value instead of the number of pages in that LRU...
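
As a rough illustration of the "limit by excess" idea only: the sketch
below caps a per-cgroup scan target at the number of pages by which usage
exceeds the low limit. Both helpers and SKETCH_PAGE_SHIFT are
hypothetical names (4 KiB pages are assumed), and the sketch does not
address how the capped amount would be split between LRUs.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper: pages by which usage exceeds the low limit. */
static uint64_t low_limit_excess_pages(uint64_t usage, uint64_t low_limit)
{
	if (usage <= low_limit)
		return 0;
	return (usage - low_limit) >> SKETCH_PAGE_SHIFT;
}

/*
 * Hypothetical helper: never ask for more pages than the excess,
 * so reclaim cannot push the cgroup below its low limit.
 */
static uint64_t cap_scan_by_excess(uint64_t scan, uint64_t usage,
				   uint64_t low_limit)
{
	uint64_t excess = low_limit_excess_pages(usage, low_limit);

	return scan < excess ? scan : excess;
}
```

Under these assumptions, a cgroup that is one page above its low limit
would have at most one page scanned, and a cgroup at or below the limit
would have none.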

> > In my experience, low limits also require some changes in memcg page accounting
> > policy. For instance, an application in a protected cgroup should have a guarantee
> > that its filecache belongs to its cgroup and is therefore protected by the low
> > limit. If the filecache was created by another application in another cgroup,
> > this may not be the case. I've solved this problem by implementing optional page
> > reaccounting on pagefaults and reads/writes.
> 
> Memory sharing is a separate issue and we should discuss that
> separately. 
> 
> > I can prepare my current version of patchset, if someone is interested.
> 
> Sure, having something to compare with is always valuable.

----
Subject: [PATCH] memcg: low limits for memory cgroups

Low limits for a memory cgroup can be used to limit memory pressure on it.
If the memory usage of a cgroup is under its low limit, it will not be
affected by global reclaim. If it approaches its low limit from above,
the reclaim speed drops exponentially.

Low limits don't affect soft reclaim.
Also, it's possible that a cgroup with memory usage under low limit
will be reclaimed slowly on very low scanning priorities.
---
 include/linux/memcontrol.h  |  7 ++++++
 include/linux/res_counter.h | 17 +++++++++++++
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 60 +++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                 |  9 +++++++
 5 files changed, 95 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index abd0113..3905e95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -231,6 +231,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -427,6 +429,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 201a697..7a16c2a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,6 +40,10 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the secured guaranteed minimal limit of resource
+	 */
+	unsigned long long low_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -88,6 +92,7 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_LIMIT,
 };
 
 /*
@@ -224,4 +229,16 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_low_limit(struct res_counter *cnt,
+			   unsigned long long low_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_limit = low_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4aa8a30..c57daf9 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -135,6 +135,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_LIMIT:
+		return &counter->low_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd..d24b768 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1883,6 +1883,46 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			 NULL, "Memory cgroup out of memory");
 }
 
+/*
+ * If a cgroup is under its low limit or close enough to it,
+ * decrease the speed of page scanning.
+ *
+ * mem_cgroup_low_limit_scale() returns a number
+ * from range [0, DEF_PRIORITY - 2], which is used
+ * in the reclaim code as a scanning priority modifier.
+ *
+ * If the low limit is not set, it returns 0.
+ *
+ * usage - low_limit > usage / 8  => 0
+ * usage - low_limit > usage / 16 => 1
+ * usage - low_limit > usage / 32 => 2
+ * ...
+ * usage - low_limit > usage / 2^DEF_PRIORITY => DEF_PRIORITY - 3
+ * otherwise (including usage < low_limit) => DEF_PRIORITY - 2
+ *
+ */
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+	unsigned long long low_limit;
+	unsigned long long usage;
+	unsigned int i;
+
+	low_limit = res_counter_read_u64(&memcg->res, RES_LOW_LIMIT);
+	if (!low_limit)
+		return 0;
+
+	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+
+	if (usage < low_limit)
+		return DEF_PRIORITY - 2;
+
+	for (i = 0; i < DEF_PRIORITY - 2; i++)
+		if (usage - low_limit > (usage >> (i + 3)))
+			break;
+
+	return i;
+}
+
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 					gfp_t gfp_mask,
 					unsigned long flags)
@@ -5318,6 +5358,20 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		else
 			ret = -EINVAL;
 		break;
+	case RES_LOW_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, low limits (like soft limits, see above)
+		 * are hard to implement in terms of semantics;
+		 * for now, we support low limits only for control without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_low_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -6243,6 +6297,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "low_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LOW_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
 		.name = "soft_limit_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
 		.write_string = mem_cgroup_write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..1d4eaac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,6 +83,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* If memcg is under its low limit, do not scan it aggressively */
+	int low_limit_scale;
+
 	/*
 	 * The memory cgroup that hit its limit and as a result is the
 	 * primary target of this reclaim invocation.
@@ -2003,6 +2006,10 @@ out:
 			/* Look ma, no brain */
 			BUG();
 		}
+
+		if (sc->low_limit_scale)
+			scan >>= sc->low_limit_scale;
+
 		nr[lru] = scan;
 	}
 }
@@ -2206,6 +2213,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
+			sc->low_limit_scale = mem_cgroup_low_limit_scale(memcg);
 			shrink_lruvec(lruvec, sc);
 
 			/*
@@ -2640,6 +2648,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.may_swap = !noswap,
 		.order = 0,
 		.priority = 0,
+		.low_limit_scale = 0,
 		.target_mem_cgroup = memcg,
 	};
 	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-- 
1.8.5.3
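
To illustrate how the extra shift in get_scan_count() interacts with the
scanning priority, here is a simplified user-space model. It assumes the
base scan target is lru_size >> priority and ignores the anon/file
balancing done in the real function; scan_target() is a hypothetical
name.

```c
#include <stdint.h>

/*
 * Simplified model of the per-LRU scan target after this patch:
 * the usual (lru_size >> priority) is shifted right once more
 * by the low limit scale.
 */
static uint64_t scan_target(uint64_t lru_size, int priority,
			    unsigned int low_limit_scale)
{
	uint64_t scan = lru_size >> priority;

	if (low_limit_scale)
		scan >>= low_limit_scale;

	return scan;
}
```

For an LRU of 2^20 pages, an unprotected cgroup gets a target of 256
pages at priority 12. With the maximum scale of 10, the target is 0 at
priority 12 and becomes nonzero only as the priority drops: 256 pages at
priority 2 and 1024 pages at priority 0. This is the slow reclaim at
very low scanning priorities mentioned in the changelog.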
