linux-kernel - Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

Message-ID: <alpine.LSU.2.11.2012281925420.1028@eggly.anvils>
Date:   Mon, 28 Dec 2020 20:35:06 -0800 (PST)
From:   Hugh Dickins <hughd@...gle.com>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Will Deacon <will@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Jan Kara <jack@...e.cz>, Minchan Kim <minchan@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vinayak Menon <vinmenon@...eaurora.org>,
        Android Kernel Team <kernel-team@...roid.com>
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries
 when prefaulting

Got it at last, sorry it's taken so long.

On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@...temov.name> wrote:
> > > >
> > > > So far I have only found one more pin leak and an always-true check.
> > > > I don't see how it can lead to a crash or corruption. Still looking.

Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.

> > > 
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > > 
> > > No, I think the problem is the
> > > 
> > >         pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > 
> > > at the end of filemap_map_pages().
> > > 
> > > Why?
> > > 
> > > Because we've been updating vmf->pte as we go along:
> > > 
> > >                 vmf->pte += xas.xa_index - last_pgoff;
> > > 
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points past the edge of the page directory.
> > 
> > Well, if that's true we have a bigger problem: we set up a pte entry
> > without the relevant PTL.
> > 
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.

Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.
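
As a rough illustration of why a clamped window cannot step past the
end of a PTE table, here is a userspace sketch. It is not the kernel's
do_fault_around(): the 4 KiB page size, the 512 entries per table,
struct window and clamp_window() are all assumptions made up for the
example.

```c
#include <stdint.h>

#define PAGE_SHIFT   12u   /* assumption: 4 KiB pages */
#define PTRS_PER_PTE 512u  /* assumption: 512 entries per PTE table */

/* Hypothetical helper type: an inclusive range of page numbers. */
struct window {
	uint64_t first_page;
	uint64_t last_page;
};

/*
 * Build a naturally aligned window of nr_pages around fault_addr, then
 * clamp it to the PTE table that contains fault_addr.
 * nr_pages must be a power of two.
 */
static struct window clamp_window(uint64_t fault_addr, uint64_t nr_pages)
{
	uint64_t page = fault_addr >> PAGE_SHIFT;
	uint64_t table_first = page & ~(uint64_t)(PTRS_PER_PTE - 1);
	uint64_t table_last = table_first + PTRS_PER_PTE - 1;
	struct window w;

	w.first_page = page & ~(nr_pages - 1);
	w.last_page = w.first_page + nr_pages - 1;

	/* Never start before, or end after, the faulting page's table. */
	if (w.first_page < table_first)
		w.first_page = table_first;
	if (w.last_page > table_last)
		w.last_page = table_last;
	return w;
}
```

With both ends clamped like this, a pte pointer that walks from
first_page to last_page stays inside one table, which is the property
the loop relies on.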

> > 
> > It made me look at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> 
> Ugh.. Wrong place. Need to sleep.
> 
> I'll look into your idea tomorrow.
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
>  	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	rcu_read_unlock();
> +	vmf->address = address;
> +	vmf->pte = NULL;
>  	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>  
>  	return ret;
> -- 

And that made no (noticeable) difference either.  But at last
I realized, it's absolutely on the right track, but missing the
couple of early returns at the head of filemap_map_pages(): add

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
 
 	rcu_read_lock();
 	head = first_map_page(vmf, &xas, end_pgoff);
-	if (!head) {
-		rcu_read_unlock();
-		return 0;
-	}
+	if (!head)
+		goto out;
 
 	if (filemap_map_pmd(vmf, head)) {
-		rcu_read_unlock();
-		return VM_FAULT_NOPAGE;
+		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
 		put_page(head);
 	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
 	rcu_read_unlock();
 	vmf->address = address;
-	vmf->pte = NULL;
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
--

and then the corruption is fixed.  It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.
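
The shape of that fix, stripped of all mm detail, is a single exit
label that every return path goes through, so the caller-visible
address is always restored. The sketch below is a toy userspace model
under stated assumptions: struct fault_ctx, map_around() and the
MAP_* values are hypothetical names, not kernel interfaces, and the
counter only stands in for rcu_read_lock()/rcu_read_unlock().

```c
/* Hypothetical stand-in for the fault descriptor, not struct vm_fault. */
struct fault_ctx {
	unsigned long address; /* holds a scratch value on entry */
	int read_locks;        /* models rcu_read_lock() nesting */
};

enum { MAP_OK = 0, MAP_NOPAGE = 1 };

/*
 * Every path, including the two early exits, leaves through "out",
 * so ctx->address is reset to fault_address before returning.
 */
static int map_around(struct fault_ctx *ctx, unsigned long fault_address,
		      int have_page, int huge_mapped)
{
	int ret = MAP_OK;

	ctx->read_locks++;
	if (!have_page)
		goto out;           /* nothing to map: still restore */

	if (huge_mapped) {
		ret = MAP_NOPAGE;
		goto out;           /* handled elsewhere: still restore */
	}

	ctx->address += 4096;       /* models walking the window */
out:
	ctx->read_locks--;
	ctx->address = fault_address;
	return ret;
}
```

With plain "return" statements in the two early branches instead of
"goto out", the model would hand back whatever scratch value was in
ctx->address, which is the kind of stale state described above.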

You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).

So, the mystery is solved; but I don't think any of these patches
should be applied.  Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and has given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.

I've grown rather jaded, and am questioning the value of the rework:
I don't think I want to look at or test another for a week or so.

Hugh