linux-kernel - Re: [PATCH] Revert "mm/vunmap: add cond_resched() in vunmap_pmd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201116175323.GB3805951@google.com>
Date:   Mon, 16 Nov 2020 09:53:23 -0800
From:   Minchan Kim <minchan@...nel.org>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     linux-kernel@...r.kernel.org, linux-mm <linux-mm@...ck.org>,
        "Aneesh Kumar K . V" <aneesh.kumar@...ux.ibm.com>,
        Harish Sriram <harish@...ux.ibm.com>, stable@...r.kernel.org
Subject: Re: [PATCH] Revert "mm/vunmap: add cond_resched() in
 vunmap_pmd_range"

On Fri, Nov 13, 2020 at 08:25:29AM -0800, Minchan Kim wrote:
> On Thu, Nov 12, 2020 at 02:49:19PM -0800, Andrew Morton wrote:
> > On Thu, 12 Nov 2020 12:01:01 -0800 Minchan Kim <minchan@...nel.org> wrote:
> > 
> > > 
> > > On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
> > > > Hi Andrew,
> > > > 
> > > > On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
> > > > > On Thu,  5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@...nel.org> wrote:
> > > > > 
> > > > > > This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
> > > > > > 
> > > > > > While I was doing zram testing, I found sometimes decompression failed
> > > > > > since the compression buffer was corrupted. With investigation,
> > > > > > I found below commit calls cond_resched unconditionally so it could
> > > > > > make a problem in atomic context if the task is reschedule.
> > > > > > 
> > > > > > Revert the original commit for now.
> > >
> > > How should we proceed this problem?
> > >
> > 
> > (top-posting repaired - please don't).
> > 
> > Well, we don't want to reintroduce the softlockup reports which
> > e47110e90584a2 fixed, and we certainly don't want to do that on behalf
> > of code which is using the unmap_kernel_range() interface incorrectly.
> > 
> > So I suggest either
> > 
> > a) make zs_unmap_object() stop doing the unmapping from atomic context or
> 
> It's not easy since the path already hold several spin_locks as well as
> per-cpu context. I could pursuit the direction but it takes several
> steps to change entire locking scheme in the zsmalloc, which will
> take time(we couldn't leave zsmalloc broken until then) and hard to
> land on stable tree.
> 
> > 
> > b) figure out whether the vmalloc unmap code is *truly* unsafe from
> >    atomic context - perhaps it is only unsafe from interrupt context,
> >    in which case we can rework the vmalloc.c checks to reflect this, or
> 
> I don't get the point. I assume your suggestion would be "let's make the
> vunmap code atomic context safe" but how could it solve this problem?
>      
> The point from e47110e90584a2 was softlockup could be triggered if
> vunamp deal with large mapping so need *explict reschedule* point
> for CONFIG_PREEMPT_VOLUNTARY. However, CONFIG_PREEMPT_VOLUNTARY
> doesn't consider peempt count so even though we could make vunamp
> atomic safe to make a call under spin_lock:
> 
> spin_lock(&A);
> vunmap
>   vunmap_pmd_range
>     cond_resched <- bang
>  
> Below options would have same problem, too.
> Let me know if I misunderstand something.
> 
> > 
> > c) make the vfree code callable from all contexts.  Which is by far
> >    the preferred solution, but may be tough.
> > 
> > 
> > Or maybe not so tough - if the only problem in the vmalloc code is the
> > use of mutex_trylock() from irqs then it may be as simple as switching
> > to old-style semaphores and using down_trylock(), which I think is
> > irq-safe.
> > 
> > However old-style semaphores are deprecated.  A hackyish approach might
> > be to use an rwsem always in down_write mode and use
> > down_write_trylock(), which I think is also callable from interrrupt
> > context.
> > 
> > But I have a feeling that there are other reasons why vfree shouldn't
> > be called from atomic context, apart from the mutex_trylock-in-irq
> > issue.

How about this?

>From 0733bc468d0072147c2ecf998d7cc3af2234bc87 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@...nel.org>
Date: Mon, 16 Nov 2020 09:38:40 -0800
Subject: [PATCH] mm: unmap_kernel_range_atomic

unmap_kernel_range had been atomic operation and zsmalloc has used
it in atomic context in zs_unmap_object.
However, ("e47110e90584, mm/vunmap: add cond_resched() in vunmap_pmd_range")
changed it into non-atomic operation via adding cond_resched.
It causes zram decompresion failure by corrupting compressed buffer
in atomic context.

This patch introduces unmap_kernel_range_atomic which works for
only range less than PMD_SIZE to prevent cond_resched call.

Signed-off-by: Minchan Kim <minchan@...nel.org>
---
 include/linux/vmalloc.h |  2 ++
 mm/vmalloc.c            | 23 +++++++++++++++++++++--
 mm/zsmalloc.c           |  2 +-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 938eaf9517e2..36b1ecc2d014 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -180,6 +180,7 @@ int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
 		struct page **pages);
 extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
 extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void unmap_kernel_range_atomic(unsigned long addr, unsigned long size);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
@@ -200,6 +201,7 @@ unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
 {
 }
 #define unmap_kernel_range unmap_kernel_range_noflush
+#define unmap_kernel_range_atomic unmap_kernel_range_noflush
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 }
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d7075ad340aa..714e5425dc45 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -88,6 +88,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd_t *pmd;
 	unsigned long next;
 	int cleared;
+	bool check_resched = (end - addr) > PMD_SIZE;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -102,8 +103,8 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next, mask);
-
-		cond_resched();
+		if (check_resched)
+			cond_resched();
 	} while (pmd++, addr = next, addr != end);
 }
 
@@ -2024,6 +2025,24 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
 	flush_tlb_kernel_range(addr, end);
 }
 
+/**
+ * unmap_kernel_range_atomic - unmap kernel VM area and flush cache and TLB
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Similar to unmap_kernel_range_noflush() but it's atomic. @size should be
+ * less than PMD_SIZE.
+ */
+void unmap_kernel_range_atomic(unsigned long addr, unsigned long size)
+{
+	unsigned long end = addr + size;
+
+	flush_cache_vunmap(addr, end);
+	WARN_ON(size > PMD_SIZE);
+	unmap_kernel_range_noflush(addr, size);
+	flush_tlb_kernel_range(addr, end);
+}
+
 static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 	struct vmap_area *va, unsigned long flags, const void *caller)
 {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 662ee420706f..9decc7634852 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1154,7 +1154,7 @@ static inline void __zs_unmap_object(struct mapping_area *area,
 {
 	unsigned long addr = (unsigned long)area->vm_addr;
 
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
+	unmap_kernel_range_atomic(addr, PAGE_SIZE * 2);
 }
 
 #else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */
-- 
2.29.2.299.gdc1121823c-goog