Message-ID: <73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local>
Date: Sat, 24 Aug 2024 17:26:46 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Zhiguo Jiang <justinjiang@...o.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, oe-lkp@...ts.linux.dev, lkp@...el.com,
        opensource.kernel@...o.com
Subject: Re: [PATCH v2] vma remove the unneeded avc bound with non-CoWed folio

On Fri, Aug 23, 2024 at 11:02:06PM GMT, Zhiguo Jiang wrote:
> After CoWed by do_wp_page, the vma established a new mapping relationship
> with the CoWed folio instead of the non-CoWed folio. However, regarding
> the situation where vma->anon_vma and the non-CoWed folio's anon_vma are
> not same, the avc binding relationship between them will no longer be
> needed, so it is issue for the avc binding relationship still existing
> between them.
>
> This patch will remove the avc binding relationship between vma and the
> non-CoWed folio's anon_vma, which each has their own independent
> anon_vma. It can also alleviates rmap overhead simultaneously.
>
> Signed-off-by: Zhiguo Jiang <justinjiang@...o.com>


NACK (until fixed). This is broken (see below).


I'm not seeing any numbers here to back up why we want to make changes to
this incredibly sensitive code.

Also, anon_vma logic is very complicated and confusing; this commit message
feels about 3 paragraphs too light.

Under what circumstances will vma->anon_vma be different from
folio_anon_vma(non_cowed_folio)? etc.

Confusing topics strongly require explanations that help (somewhat)
compensate. This is one of them.

> ---
>
> -v2:
>  * Solve the kernel test robot noticed "WARNING"
>    Reported-by: kernel test robot <oliver.sang@...el.com>
>    Closes: https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com

It doesn't.

Saw a bunch of warning output in dmesg when running in qemu, bisected it to
this commit. The below assert is being fired (did you build this kernel
with CONFIG_DEBUG_VM?):

	VM_WARN_ON(anon_vma->num_children);

From what I saw, these appear to all be cases where anon_vma->num_children == 0...


[    1.905603] ------------[ cut here ]------------
[    1.905604] WARNING: CPU: 2 PID: 231 at mm/rmap.c:443 unlink_anon_vmas+0x181/0x1c0
[    1.905605] Modules linked in:
[    1.905605] CPU: 2 UID: 1000 PID: 231 Comm: zsh Tainted: G        W          6.11.0-rc4+ #49
[    1.905606] Tainted: [W]=WARN
[    1.905606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    1.905607] RIP: 0010:unlink_anon_vmas+0x181/0x1c0
[    1.905608] Code: 48 83 7f 40 00 75 1c f0 ff 4f 30 75 ab e8 d7 fd ff ff eb a4 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 90 0f 0b 90 eb de 90 <0f> 0b 90 eb d1 90 0f 0b 90 48 83 c7 08 e8 4d 7c ea ff e9 fc fe ff
[    1.905608] RSP: 0018:ffffc90000547cb0 EFLAGS: 00010286
[    1.905609] RAX: ffff88817b265390 RBX: ffff88817b265380 RCX: ffff88817b2cb790
[    1.905609] RDX: ffff88817b265380 RSI: ffff88817b2cb790 RDI: ffff888179e08888
[    1.905610] RBP: dead000000000122 R08: 000000000000000c R09: 0000000000000010
[    1.905610] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88817b2cb790
[    1.905611] R13: dead000000000100 R14: ffff88817b2cb780 R15: ffff888179e08888
[    1.905613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.905613] CR2: 0000555bc5d97390 CR3: 000000017c12c000 CR4: 0000000000750ef0
[    1.905614] PKRU: 55555554
[    1.905614] Call Trace:
[    1.905614]  <TASK>
[    1.905615]  ? unlink_anon_vmas+0x181/0x1c0
[    1.905615]  ? __warn.cold+0x8e/0xe8
[    1.905616]  ? unlink_anon_vmas+0x181/0x1c0
[    1.905617]  ? report_bug+0xff/0x140
[    1.905618]  ? handle_bug+0x3b/0x70
[    1.905619]  ? exc_invalid_op+0x17/0x70
[    1.905620]  ? asm_exc_invalid_op+0x1a/0x20
[    1.905621]  ? unlink_anon_vmas+0x181/0x1c0
[    1.905622]  free_pgtables+0x11f/0x250
[    1.905622]  exit_mmap+0x15e/0x380
[    1.905624]  mmput+0x54/0x110
[    1.905625]  do_exit+0x27e/0xa10
[    1.905626]  ? __x64_sys_close+0x37/0x80
[    1.905626]  do_group_exit+0x2b/0x80
[    1.905628]  __x64_sys_exit_group+0x13/0x20
[    1.905629]  x64_sys_call+0x14af/0x14b0
[    1.905630]  do_syscall_64+0x9e/0x1a0
[    1.905630]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    1.905631] RIP: 0033:0x7f4416ae33ad
[    1.905632] Code: Unable to access opcode bytes at 0x7f4416ae3383.
[    1.905633] RAX: ffffffffffffffda RBX: 00007f4416d5e3c0 RCX: 00007f4416ae33ad
[    1.905633] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000000
[    1.905633] RBP: 0000555b8eed1378 R08: 0000000000000000 R09: 0000000000000007
[    1.905634] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[    1.905634] R13: 0000000000000000 R14: 00007ffe7dbe9190 R15: 00007ffe7dbe9110
[    1.905635]  </TASK>
[    1.905635] ---[ end trace 0000000000000000 ]---
[    1.905638] ------------[ cut here ]------------


>  * Update comments to more accurately describe this patch.
>
> -v1:
>  https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/
>
>  include/linux/rmap.h |  1 +
>  mm/memory.c          |  8 +++++++
>  mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 62 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 91b5935e8485..8607d28a3146
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
>  	folio_remove_rmap_ptes(folio, page, 1, vma)
>  void folio_remove_rmap_pmd(struct folio *, struct page *,
>  		struct vm_area_struct *);
> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
>
>  void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>  		unsigned long address, rmap_t flags);
> diff --git a/mm/memory.c b/mm/memory.c
> index 93c0c25433d0..4c89cb1cb73e
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  			 * old page will be flushed before it can be reused.
>  			 */
>  			folio_remove_rmap_pte(old_folio, vmf->page, vma);
> +
> +			/*
> +			 * If the new_folio's anon_vma is different from the
> +			 * old_folio's anon_vma, the avc binding relationship
> +			 * between vma and the old_folio's anon_vma is removed,
> +			 * avoiding rmap redundant overhead.

What overhead? Worth spelling out, for instance if this is about
unnecessarily traversing avcs.

> +			 */
> +			folio_remove_anon_avc(old_folio, vma);
>  		}
>
>  		/* Free the old page.. */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1103a536e474..56fc16fcf2a9
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1522,6 +1522,59 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
>  #endif
>  }
>
> +/**
> + * folio_remove_anon_avc - remove the avc binding relationship between
> + * folio and vma with different anon_vmas.
> + * @folio:	The folio with anon_vma to remove the binded avc from
> + * @vma:	The vm area to remove the binded avc with folio's anon_vma
> + *
> + * The caller is currently used for CoWed scene.

Strange turn of phrase.

> + */
> +void folio_remove_anon_avc(struct folio *folio,

I think this should be 'oldfolio'. You're not looking at the copied folio,
but the unCoW'd original folio.

> +		struct vm_area_struct *vma)
> +{
> +	struct anon_vma *anon_vma = folio_anon_vma(folio);
> +	pgoff_t pgoff_start, pgoff_end;
> +	struct anon_vma_chain *avc;
> +
> +	/*
> +	 * Ensure that the vma's anon_vma and the folio's
> +	 * anon_vma exist and are not same.
> +	 */
> +	if (!folio_test_anon(folio) || unlikely(!anon_vma) ||

The folio_test_anon() is already implied by folio_anon_vma() != NULL and
doesn't preclude KSM.
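
To make the point concrete, here is a minimal userland sketch of how the
low bits of folio->mapping tag the folio type. It is a toy model based on my
reading of the PAGE_MAPPING_ANON / PAGE_MAPPING_KSM scheme around v6.11, not
kernel code; every MODEL_* and model_* name is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, NOT kernel code: the low two bits of the mapping word tag it. */
#define MODEL_MAPPING_ANON    0x1UL
#define MODEL_MAPPING_MOVABLE 0x2UL
#define MODEL_MAPPING_KSM     (MODEL_MAPPING_ANON | MODEL_MAPPING_MOVABLE)
#define MODEL_MAPPING_FLAGS   (MODEL_MAPPING_ANON | MODEL_MAPPING_MOVABLE)

struct model_folio {
	uintptr_t mapping;	/* pointer value with tag bits OR'd in */
};

/* True whenever the ANON bit is set, which includes KSM folios. */
static inline bool model_folio_test_anon(const struct model_folio *f)
{
	return (f->mapping & MODEL_MAPPING_ANON) != 0;
}

/* Non-NULL only for plain anon folios; KSM and file folios yield NULL. */
static inline void *model_folio_anon_vma(const struct model_folio *f)
{
	if ((f->mapping & MODEL_MAPPING_FLAGS) != MODEL_MAPPING_ANON)
		return NULL;
	return (void *)(f->mapping - MODEL_MAPPING_ANON);
}
```

Under this model a non-NULL model_folio_anon_vma() result already implies
model_folio_test_anon(), whereas model_folio_test_anon() alone also returns
true for a KSM-tagged mapping.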

> +	    anon_vma == vma->anon_vma)
> +		return;

This is all super confusing: the 'parent' is actually anon_vma
(oldfolio's), and the newly created 'child' anon_vma is vma->anon_vma. You
should probably rename each accordingly.


> +
> +	pgoff_start = folio_pgoff(folio);
> +	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
> +
> +	if (!anon_vma_trylock_write(anon_vma))
> +		return;
> +
> +	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
> +			pgoff_start, pgoff_end) {
> +		/*
> +		 * Find the avc associated with vma from the folio's
> +		 * anon_vma and remove it.
> +		 */

This is a meaningless comment.

This should be something like 'anon_vma_chain objects bind VMAs and
anon_vma's. Find the avc binding the unCoW'd folio's anon_vma to the new
VMA, and remove it, as it is now redundant.'

> +		if (avc->vma == vma) {

In testing I found that a lot of the time this isn't found at all... is
that expected?

> +			anon_vma_interval_tree_remove(avc, &anon_vma->rb_root);
> +			/*
> +			 * When removing the avc with anon_vma that is
> +			 * different from the parent anon_vma from parent
> +			 * anon_vma->rb_root, the parent num_children
> +			 * count value is needed to reduce one.
> +			 */

This is a really confusing comment. You're not explaining the 'why', you're
just essentially asserting that you need to do this, and clearly this is
broken.

> +			anon_vma->num_children--;

So we know this is broken to start due to VM_WARN_ON() failures.

As per above dmesg analysis, sometimes this is zero, so you're
underflowing. We definitely need a:

	VM_WARN_ON(anon_vma->num_children == 0);

At least.
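
As a rough userland illustration of the underflow (a toy sketch, not kernel
code; struct toy_anon_vma and both helpers are hypothetical, and the only
thing borrowed from the kernel is that num_children is an unsigned long):

```c
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for the one field that matters here. */
struct toy_anon_vma {
	unsigned long num_children;
};

/* Mirrors a bare decrement: from zero this wraps to ULONG_MAX. */
static inline void toy_put_child_unchecked(struct toy_anon_vma *av)
{
	av->num_children--;
}

/*
 * Guarded variant: refuses to decrement from zero and tells the caller,
 * which is where a warning would be raised.
 */
static inline bool toy_put_child_checked(struct toy_anon_vma *av)
{
	if (av->num_children == 0)
		return false;
	av->num_children--;
	return true;
}
```

After the unchecked decrement from zero the counter is non-zero (ULONG_MAX),
which is consistent with a later "counter must be zero" check firing at
teardown.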

But also the naming is broken here too, anon_vma is actually the parent
(oldfolio's) anon_vma...


This is also just not correct on any level - the anon_vma->num_children
field indicates how many child anon_vma objects point at it via
anon_vma->parent, NOT avc.

You're removing an avc, not disconnecting an anon_vma.

So it seems to me you should have logic to remove the avc AND logic to
disconnect vma->anon_vma from (parent) anon_vma if it points to it.
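
A toy userland model of the two separate pieces of bookkeeping, purely to
illustrate the distinction. All names here are hypothetical, and it ignores
locking, the root anon_vma, refcounts and the interval tree entirely; setting
the child's parent to NULL on disconnect is an assumption of the model, not a
claim about what the kernel should do.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy anon_vma: children point at their parent, parent counts them. */
struct model_anon_vma {
	struct model_anon_vma *parent;
	unsigned long num_children;
};

/* Toy avc: binds one VMA (just an id here) to one anon_vma. */
struct model_avc {
	struct model_anon_vma *anon_vma;
	int vma_id;
	bool linked;
};

/* Establish the parent link; this is what num_children counts. */
static inline void model_attach_child(struct model_anon_vma *parent,
				      struct model_anon_vma *child)
{
	child->parent = parent;
	parent->num_children++;
}

/* Dropping an avc removes a VMA binding only; num_children is untouched. */
static inline void model_remove_avc(struct model_avc *avc)
{
	avc->linked = false;
}

/* Only severing the parent link is allowed to change num_children. */
static inline bool model_disconnect_child(struct model_anon_vma *child)
{
	struct model_anon_vma *parent = child->parent;

	if (!parent || parent->num_children == 0)
		return false;
	parent->num_children--;
	child->parent = NULL;
	return true;
}
```

In this model, removing the avc and disconnecting the child are independent
operations, so a decrement tied to avc removal alone would leave the counter
and the parent pointers out of sync.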

You'll need to be careful about locking when you do that too, as anon_vmas
take their lock on the root anon_vma, but in isolating the child anon_vma
you'd lose this lock.

I've tried to write code to fix this but haven't been able to yet, this is
fiddly stuff.

(I think this might have seemed to work at some point in testing because
unlink_anon_vmas() uses the avc list to determine what to unlink, rather
than looking at individual anon_vmas, but still.)

> +
> +			list_del(&avc->same_vma);
> +			anon_vma_chain_free(avc);
> +			break;
> +		}
> +	}
> +	anon_vma_unlock_write(anon_vma);
> +}
> +
>  static __always_inline void __folio_remove_rmap(struct folio *folio,
>  		struct page *page, int nr_pages, struct vm_area_struct *vma,
>  		enum rmap_level level)
> --
> 2.39.0
>

Again I question the value of this change. Are we REALLY seeing a big
problem due to unneeded avc's hanging around? This is very sensitive,
fiddly, confusing code, do we REALLY want to be playing with it?

It'd be good to get some tests, though unless you move this to vma.c with
its userland testing (probably a good idea actually, as Andrew suggested)
this might be tricky.

NACK until the issues are fixed and the approach at least seems more
correct.
