Message-ID: <20200130125455.GV24244@dhcp22.suse.cz>
Date:   Thu, 30 Jan 2020 13:54:55 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Roman Gushchin <guro@...com>, Tejun Heo <tj@...nel.org>,
        linux-mm@...ck.org, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH v2 2/3] mm: memcontrol: clean up and document effective
 low/min calculations

On Thu 19-12-19 15:07:17, Johannes Weiner wrote:
> The effective protection of any given cgroup is a somewhat complicated
> construct that depends on the ancestor's configuration, siblings'
> configurations, as well as current memory utilization in all these
> groups. It's done this way to satisfy hierarchical delegation
> requirements while also making the configuration semantics flexible
> and expressive in complex real life scenarios.
> 
> Unfortunately, all the rules and requirements are sparsely documented,
> and the code is a little too clever in merging different scenarios
> into a single min() expression. This makes it hard to reason about the
> implementation and avoid breaking semantics when making changes to it.
> 
> This patch documents each semantic rule individually and splits out
> the handling of the overcommit case from the regular case.
> 
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>

It took me a while to double-check that the new code is indeed
equivalent, but the result is much easier to grasp.
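
FWIW, a quick userspace sketch of the new helper run against the
example tree documented below reproduces the expected 1.3G/0.6G split
(the main() harness, min_ul() and the 0.1G units are made up here;
only effective_protection() mirrors the patch):

#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Same logic as the patch's effective_protection(). */
static unsigned long effective_protection(unsigned long usage,
					  unsigned long setting,
					  unsigned long parent_effective,
					  unsigned long siblings_protected)
{
	unsigned long protected = min_ul(usage, setting);

	/* Overcommitted level: hand out shares of the parent's budget. */
	if (siblings_protected > parent_effective)
		return protected * parent_effective / siblings_protected;
	return protected;
}

int main(void)
{
	/* B, C, D, E from the example; {memory.low, memory.current} in 0.1G units */
	struct { const char *name; unsigned long low, usage; } c[] = {
		{ "B", 30, 20 }, { "C", 10, 20 }, { "D", 0, 20 }, { "E", 100, 0 },
	};
	unsigned long parent_elow = 20;	/* A is at the first level, so elow == A/memory.low = 2G */
	unsigned long siblings = 0;
	int i;

	/* children_low_usage, as tracked by propagate_protected_usage() */
	for (i = 0; i < 4; i++)
		siblings += min_ul(c[i].usage, c[i].low);

	for (i = 0; i < 4; i++)
		printf("%s: elow = %.1fG\n", c[i].name,
		       effective_protection(c[i].usage, c[i].low,
					    parent_elow, siblings) / 10.0);
	return 0;
}

which prints elow = 1.3G for B, 0.6G for C and 0 for D and E,
matching the distribution documented under rule 3.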

Acked-by: Michal Hocko <mhocko@...e.com>

> ---
>  mm/memcontrol.c | 190 ++++++++++++++++++++++++++----------------------
>  1 file changed, 104 insertions(+), 86 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 874a0b00f89b..9c771c4d6339 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6204,65 +6204,55 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.early_init = 0,
>  };
>  
> -/**
> - * mem_cgroup_protected - check if memory consumption is in the normal range
> - * @root: the top ancestor of the sub-tree being checked
> - * @memcg: the memory cgroup to check
> - *
> - * WARNING: This function is not stateless! It can only be used as part
> - *          of a top-down tree iteration, not for isolated queries.
> - *
> - * Returns one of the following:
> - *   MEMCG_PROT_NONE: cgroup memory is not protected
> - *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
> - *     an unprotected supply of reclaimable memory from other cgroups.
> - *   MEMCG_PROT_MIN: cgroup memory is protected
> - *
> - * @root is exclusive; it is never protected when looked at directly
> +/*
> + * This function calculates an individual cgroup's effective
> + * protection which is derived from its own memory.min/low, its
> + * parent's and siblings' settings, as well as the actual memory
> + * distribution in the tree.
>   *
> - * To provide a proper hierarchical behavior, effective memory.min/low values
> - * are used. Below is the description of how effective memory.low is calculated.
> - * Effective memory.min values is calculated in the same way.
> + * The following rules apply to the effective protection values:
>   *
> - * Effective memory.low is always equal or less than the original memory.low.
> - * If there is no memory.low overcommittment (which is always true for
> - * top-level memory cgroups), these two values are equal.
> - * Otherwise, it's a part of parent's effective memory.low,
> - * calculated as a cgroup's memory.low usage divided by sum of sibling's
> - * memory.low usages, where memory.low usage is the size of actually
> - * protected memory.
> + * 1. At the first level of reclaim, effective protection is equal to
> + *    the declared protection in memory.min and memory.low.
>   *
> - *                                             low_usage
> - * elow = min( memory.low, parent->elow * ------------------ ),
> - *                                        siblings_low_usage
> + * 2. To enable safe delegation of the protection configuration, at
> + *    subsequent levels the effective protection is capped to the
> + *    parent's effective protection.
>   *
> - * low_usage = min(memory.low, memory.current)
> + * 3. To make complex and dynamic subtrees easier to configure, the
> + *    user is allowed to overcommit the declared protection at a given
> + *    level. If that is the case, the parent's effective protection is
> + *    distributed to the children in proportion to how much protection
> + *    they have declared and how much of it they are utilizing.
>   *
> + *    This makes distribution proportional, but also work-conserving:
> + *    if one cgroup claims much more protection than it uses memory,
> + *    the unused remainder is available to its siblings.
>   *
> - * Such definition of the effective memory.low provides the expected
> - * hierarchical behavior: parent's memory.low value is limiting
> - * children, unprotected memory is reclaimed first and cgroups,
> - * which are not using their guarantee do not affect actual memory
> - * distribution.
> + *    Consider the following example tree:
>   *
> - * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
> + *        A      A/memory.low = 2G, A/memory.current = 6G
> + *       //\\
> + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> + *               C/memory.low = 1G  C/memory.current = 2G
> + *               D/memory.low = 0   D/memory.current = 2G
> + *               E/memory.low = 10G E/memory.current = 0
>   *
> - *     A      A/memory.low = 2G, A/memory.current = 6G
> - *    //\\
> - *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
> - *            C/memory.low = 1G  C/memory.current = 2G
> - *            D/memory.low = 0   D/memory.current = 2G
> - *            E/memory.low = 10G E/memory.current = 0
> + *    and memory pressure is applied, the following memory
> + *    distribution is expected (approximately*):
>   *
> - * and the memory pressure is applied, the following memory distribution
> - * is expected (approximately):
> + *      A/memory.current = 2G
> + *      B/memory.current = 1.3G
> + *      C/memory.current = 0.6G
> + *      D/memory.current = 0
> + *      E/memory.current = 0
>   *
> - *     A/memory.current = 2G
> + *    *assuming equal allocation rate and reclaimability
>   *
> - *     B/memory.current = 1.3G
> - *     C/memory.current = 0.6G
> - *     D/memory.current = 0
> - *     E/memory.current = 0
> + * 4. Conversely, when the declared protection is undercommitted at a
> + *    given level, the distribution of the larger parental protection
> + *    budget is NOT proportional. A cgroup's protection from a sibling
> + *    is capped to its own memory.min/low setting.
>   *
>   * These calculations require constant tracking of the actual low usages
>   * (see propagate_protected_usage()), as well as recursive calculation of
> @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   * for next usage. This part is intentionally racy, but it's ok,
>   * as memory.low is a best-effort mechanism.
>   */
> +static unsigned long effective_protection(unsigned long usage,
> +					  unsigned long setting,
> +					  unsigned long parent_effective,
> +					  unsigned long siblings_protected)
> +{
> +	unsigned long protected;
> +
> +	protected = min(usage, setting);
> +	/*
> +	 * If all cgroups at this level combined claim and use more
> +	 * protection than what the parent affords them, distribute
> +	 * shares in proportion to utilization.
> +	 *
> +	 * We are using actual utilization rather than the statically
> +	 * claimed protection in order to be work-conserving: claimed
> +	 * but unused protection is available to siblings that would
> +	 * otherwise get a smaller chunk than what they claimed.
> +	 */
> +	if (siblings_protected > parent_effective)
> +		return protected * parent_effective / siblings_protected;
> +
> +	/*
> +	 * Ok, utilized protection of all children is within what the
> +	 * parent affords them, so we know whatever this child claims
> +	 * and utilizes is effectively protected.
> +	 *
> +	 * If there is unprotected usage beyond this value, reclaim
> +	 * will apply pressure in proportion to that amount.
> +	 *
> +	 * If there is unutilized protection, the cgroup will be fully
> +	 * shielded from reclaim, but we do return a smaller value for
> +	 * protection than what the group could enjoy in theory. This
> +	 * is okay. With the overcommit distribution above, effective
> +	 * protection is always dependent on how memory is actually
> +	 * consumed among the siblings anyway.
> +	 */
> +	return protected;
> +}
> +
> +/**
> + * mem_cgroup_protected - check if memory consumption is in the normal range
> + * @root: the top ancestor of the sub-tree being checked
> + * @memcg: the memory cgroup to check
> + *
> + * WARNING: This function is not stateless! It can only be used as part
> + *          of a top-down tree iteration, not for isolated queries.
> + *
> + * Returns one of the following:
> + *   MEMCG_PROT_NONE: cgroup memory is not protected
> + *   MEMCG_PROT_LOW: cgroup memory is protected as long as there is
> + *     an unprotected supply of reclaimable memory from other cgroups.
> + *   MEMCG_PROT_MIN: cgroup memory is protected
> + */
>  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  						struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup *parent;
> -	unsigned long emin, parent_emin;
> -	unsigned long elow, parent_elow;
>  	unsigned long usage;
>  
>  	if (mem_cgroup_disabled())
> @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  	if (!usage)
>  		return MEMCG_PROT_NONE;
>  
> -	emin = memcg->memory.min;
> -	elow = memcg->memory.low;
> -
>  	parent = parent_mem_cgroup(memcg);
>  	/* No parent means a non-hierarchical mode on v1 memcg */
>  	if (!parent)
>  		return MEMCG_PROT_NONE;
>  
> -	if (parent == root)
> -		goto exit;
> -
> -	parent_emin = READ_ONCE(parent->memory.emin);
> -	emin = min(emin, parent_emin);
> -	if (emin && parent_emin) {
> -		unsigned long min_usage, siblings_min_usage;
> -
> -		min_usage = min(usage, memcg->memory.min);
> -		siblings_min_usage = atomic_long_read(
> -			&parent->memory.children_min_usage);
> -
> -		if (min_usage && siblings_min_usage)
> -			emin = min(emin, parent_emin * min_usage /
> -				   siblings_min_usage);
> +	if (parent == root) {
> +		memcg->memory.emin = memcg->memory.min;
> +		memcg->memory.elow = memcg->memory.low;
> +		goto out;
>  	}
>  
> -	parent_elow = READ_ONCE(parent->memory.elow);
> -	elow = min(elow, parent_elow);
> -	if (elow && parent_elow) {
> -		unsigned long low_usage, siblings_low_usage;
> -
> -		low_usage = min(usage, memcg->memory.low);
> -		siblings_low_usage = atomic_long_read(
> -			&parent->memory.children_low_usage);
> +	memcg->memory.emin = effective_protection(usage,
> +			memcg->memory.min, READ_ONCE(parent->memory.emin),
> +			atomic_long_read(&parent->memory.children_min_usage));
>  
> -		if (low_usage && siblings_low_usage)
> -			elow = min(elow, parent_elow * low_usage /
> -				   siblings_low_usage);
> -	}
> -
> -exit:
> -	memcg->memory.emin = emin;
> -	memcg->memory.elow = elow;
> +	memcg->memory.elow = effective_protection(usage,
> +			memcg->memory.low, READ_ONCE(parent->memory.elow),
> +			atomic_long_read(&parent->memory.children_low_usage));
>  
> -	if (usage <= emin)
> +out:
> +	if (usage <= memcg->memory.emin)
>  		return MEMCG_PROT_MIN;
> -	else if (usage <= elow)
> +	else if (usage <= memcg->memory.elow)
>  		return MEMCG_PROT_LOW;
>  	else
>  		return MEMCG_PROT_NONE;
> -- 
> 2.24.1
> 
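To illustrate rule 4 with made-up numbers: if A/memory.low = 10G and
A has only B (memory.low = 3G, memory.current = 2G) and C
(memory.low = 1G, memory.current = 2G) below it, children_low_usage
is 3G, well under A's effective 10G, so elow(B) = min(2G, 3G) = 2G
and elow(C) = min(2G, 1G) = 1G. The unused 7G of A's budget is not
redistributed to the children beyond their own settings.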

-- 
Michal Hocko
SUSE Labs
