linux-kernel - Re: [0/2] memrlimit improve error handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0806242240200.6804@blonde.site>
Date:	Wed, 25 Jun 2008 00:11:50 +0100 (BST)
From:	Hugh Dickins <hugh@...itas.com>
To:	Balbir Singh <balbir@...ux.vnet.ibm.com>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	YAMAMOTO Takashi <yamamoto@...inux.co.jp>,
	Paul Menage <menage@...gle.com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Subject: Re: [0/2] memrlimit improve error handling

Hi Balbir,

Below I've a few comments on this particular patch, then report other
problems I've seen with memrlimits configured into my 2.6.25-rc5-mm3
kernel - I've not tried _using_ them.  My own opinion would be that
-mm already contains enough breakage, we don't need memrlimits yet.

On Fri, 20 Jun 2008, Balbir Singh wrote:
> 
> memrlimit cgroup does not handle error cases after may_expand_vm(). This
> BUG was reported by Kamezawa, with the test case below to reproduce it
> 
> This patch adds better handling support to fix the reported problem.
> 
> Reported-By: kamezawa.hiroyu@...fujitsu.com
> 
> Signed-off-by: Balbir Singh <balbir@...ux.vnet.ibm.com>

I wrote a similar patch when I saw Kame's report, because I was
interested in getting that accounting right.  Comparing the two,
a couple of notes on the details...

In mm/mmap.c:
> @@ -1982,6 +1992,7 @@ unsigned long do_brk(unsigned long addr,
>  	struct rb_node ** rb_link, * rb_parent;
>  	pgoff_t pgoff = addr >> PAGE_SHIFT;
>  	int error;
> +	int ret = -ENOMEM;

I think we don't want int error returned from some parts of do_brk()
and int ret returned from other parts: please use int error throughout.
It would probably be nicer to add int error rather than int ret in
acct_stack_growth() too, but that doesn't matter much.

In mm/mremap.c:
> @@ -256,6 +257,7 @@ unsigned long do_mremap(unsigned long ad
>  	struct vm_area_struct *vma;
>  	unsigned long ret = -EINVAL;
>  	unsigned long charged = 0;
> +	int vm_expanded = 0;
>  
>  	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>  		goto out;
> @@ -349,6 +351,7 @@ unsigned long do_mremap(unsigned long ad
>  		goto out;
>  	}
>  
> +	vm_expanded = 1;
>  	if (vma->vm_flags & VM_ACCOUNT) {
>  		charged = (new_len - old_len) >> PAGE_SHIFT;
>  		if (security_vm_enough_memory(charged))
> @@ -411,6 +414,9 @@ out:
>  	if (ret & ~PAGE_MASK)
>  		vm_unacct_memory(charged);
>  out_nc:
> +	if (vm_expanded)
> +		memrlimit_cgroup_uncharge_as(mm,
> +				(new_len - old_len) >> PAGE_SHIFT);
>  	return ret;
>  }

See how vm_unacct_memory(charged) is only called if (ret & ~PAGE_MASK)?
If ret is a valid address being returned, we do not want to uncharge.
So I believe you need to do likewise with your uncharge_as().

And please handle them both in the same way: either follow the same
"charged" style as is being used for vm_unacct_memory, rather than a
boolean; or convert vm_unacct_memory over to use your boolean style:
but it's unhelpful to have the two using different techniques.

In kernel/fork.c:
nothing, but when I had a quick look there, again the error handling
appeared to be broken e.g. if allocate_mm fails, where's the uncharge?
But be careful: there's a particular point where enough of the new mm
is set up that it gets torn down by normal exit_mmap.

And while looking at copy_mm, do I see an extra down_write and
up_write of mmap_sem, just to guard some memrlimit charging?
Don't add overhead when memrlimits are CONFIGed off; but can't
it be moved into dup_mm() where mmap_sem is already down_write?

Please go through all your charges, again, to double check
that you've got the uncharging right in the error cases.

I was interested in these because I find that after running kernel
build on tmpfs swapping loads in a 700M memcg (but you may well hit
other rc5-mm3 bugs if you try that, I've fixes to send out) for some
hours on x86_64, my shutdown hits the kernel/res_counter.c:49
res_counter_uncharge_locked() WARN_ON(counter->usage < val) several
times, called from memrlimit_cgroup_uncharge_as called from exit_mmap.

I have no idea what the cause is; but I've not seen it on i386,
and I'm not seeing it after shorter runs.  It could even be some
error introduced by other patches in what I'm testing: so don't
worry too much about it yet, but please keep an eye out for it.

(If I'd thought harder, I'd have been less interested in the
charge_as leaks: those WARN_ONs come when too much is being
uncharged, not when too little.)

But the issue which worries me most, because it's mislocking
which affects anyone with CONFIG_MM_OWNER=y, is seen when I
have CONFIG_DEBUG_SPINLOCK_SLEEP=y: 

 BUG: sleeping function called from invalid context at kernel/rwsem.c:48
 in_atomic():1, irqs_disabled():0
 1 lock held by blogd/830:
  #0:  (tasklist_lock){..--}, at: [<78125869>] mm_update_next_owner+0x1dd/0x1fa
 Pid: 830, comm: blogd Not tainted 2.6.26-rc5-mm2 #2
  [<78120ea2>] __might_sleep+0xed/0xf2
  [<78367947>] down_write+0x13/0x43
  [<781257b6>] mm_update_next_owner+0x12a/0x1fa
  [<7812595a>] exit_mm+0xd4/0xe5
  [<78125f9d>] do_exit+0x1e7/0x2bc
  [<781260fb>] do_group_exit+0x61/0x8a
  [<78126134>] sys_exit_group+0x10/0x13
  [<78102dd9>] sysenter_past_esp+0x6a/0xa5

Your memrlimit-cgroup-mm-owner-callback-changes-to-add-task-info.patch
thinks it can acquire mmap_sem while it has read_lock(&tasklist_lock).
Sorry, no: you'll have to rework that somehow.

(In passing, I'll add that I'm not a great fan of these memrlimits:
to me it's loony to be charging people for virtual address space,
it's _virtual_, and process A can have as much as it likes without
affecting process B in any way.  You're following the lead of RLIMIT_AS,
but I've always thought RLIMIT_AS a lame attempt to move into the mmap
decade, after RLIMIT_DATA and RLIMIT_STACK no longer made sense.

Taking Alan Cox's Committed_AS as a limited resource charged per mm makes
much more sense to me: but yes, it's not perfect, and it is a lot harder
to get its accounting right, and to maintain that down the line.  Okay,
you've gone for the easier option of tracking total_vm, getting that
right is a more achievable target.  And I accept that I may be too
pessimistic about it: total_vm may often enough give a rough
approximation to something else worth limiting.)

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/