linux-kernel - Re: pipe/page fault oddness.

Date:	Wed, 1 Oct 2014 01:19:00 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
cc:	Dave Jones <davej@...hat.com>, Al Viro <viro@...iv.linux.org.uk>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Rik van Riel <riel@...hat.com>,
	Ingo Molnar <mingo@...hat.com>,
	Michel Lespinasse <walken@...gle.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Hugh Dickins <hughd@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Sasha Levin <sasha.levin@...cle.com>
Subject: Re: pipe/page fault oddness.

On Tue, 30 Sep 2014, Linus Torvalds wrote:
> On Tue, Sep 30, 2014 at 11:20 AM, Dave Jones <davej@...hat.com> wrote:
> >
> > page_fault_kernel:    address=__per_cpu_end ip=copy_page_to_iter error_code=0x2
> 
> Interesting. "error_code" in particular. The value "2" means that the
> CPU thinks that the page is not present (bit zero is clear).
> 
> (That "address" is useless - it's tried to turn a user address into a
> kernel symbol, and the percpu symbols are zero-based, so it picks the
> last of them. The "ip" is useless too, since it doesn't give the
> offset)
> 
> So the CPU thinks it's a write to a not-present page, which means that
> _PAGE_PRESENT bit is clear.
> 
> Now the *kernel* thinks a page is present not just if _PAGE_PRESENT is
> set, but also if _PAGE_PROTNONE or _PAGE_NUMA are set. Sadly, your
> trace is not very useful, because inlining has caused pretty much all
> the cases to be in "handle_mm_fault()", so the trace doesn't really
> tell which path this all takes.
> 
> But we can still do *some* analysis on the trace: do_wp_page()
> shouldn't have been inlined, so it would have shown up in the trace if
> it had been called. So I think we can be pretty confident that the
> ptep_set_access_flags() we see is the one from handle_pte_fault().
> 
> And if that is the case, then we know that "pte_present()" is indeed
> true as far as the kernel is concerned. So with _PAGE_PRESENT not being
> set (based on the error code), we know that _PAGE_PROTNONE must be
> set, otherwise we'd have triggered the pte_numa() check and exited
> through do_numa_page().
> 
> So it smells like we have a PROT_NONE VM area (at least the page table
> entries imply that). But "access_error()" should have flagged that (it
> checks "vma->vm_flags & VM_WRITE"). How do we have a page table entry
> marked _PAGE_PROTNONE, but VM_WRITE set in the vma?
> 
> Or, possibly, we have some confusion about the page tables themselves
> (corruption, wrong %cr3 value, whatever), explaining why the CPU
> thinks one thing, but our software page table walker thinks another.
> 
> I'm not seeing how this all happens. But I'm adding Kirill to the cc,
> since he might see something I missed, and he touched some of this
> code last ("tag, you're it").
> 
> Kirill: the thread is on lkml, but basically it boils down to the
> second byte write in fault_in_pages_writeable() faulting forever,
> despite handle_mm_fault() apparently thinking that everything is fine.
> 
> Also adding Hugh Dickins, just because the more people who know this
> code that are involved, the better.

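For reference, the error_code reasoning quoted above can be sketched
in C. This is a minimal userspace illustration, not the kernel's code:
the PF_ERR_* names, struct pf_decoded and pf_decode() are made up for
this sketch. Only the bit positions (bit 0 = protection violation,
bit 1 = write, bit 2 = user mode) come from the x86 page-fault error
code layout.

```c
#include <stdbool.h>

/* x86 page-fault error code bits (names are hypothetical). */
enum {
	PF_ERR_PROT  = 1u << 0,	/* 0: page not present, 1: protection fault */
	PF_ERR_WRITE = 1u << 1,	/* 0: read access,      1: write access */
	PF_ERR_USER  = 1u << 2,	/* 0: kernel mode,      1: user mode */
};

/* Hypothetical helper type: the three facts used in the analysis. */
struct pf_decoded {
	bool not_present;	/* CPU saw the present bit clear */
	bool write;		/* faulting access was a write */
	bool user_mode;		/* fault was raised from user mode */
};

/* Split a raw error code into the fields discussed in the thread. */
static struct pf_decoded pf_decode(unsigned long error_code)
{
	struct pf_decoded d;

	d.not_present = !(error_code & PF_ERR_PROT);
	d.write       = (error_code & PF_ERR_WRITE) != 0;
	d.user_mode   = (error_code & PF_ERR_USER) != 0;
	return d;
}
```

Feeding it the error_code=0x2 from Dave's trace gives not_present and
write both true and user_mode false: a kernel-mode write to a pte that
the CPU sees as not present.
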
I've tried, but failed to explain it.

I think it's likely related to the VM_BUG_ON(!(val & _PAGE_PRESENT))
which linux-next has in pte_mknuma(), which Sasha Levin first reported
hitting in https://lkml.org/lkml/2014/8/26/869 (a resumption of the
"mm: BUG in unmap_page_range" thread, though its subject bug is fixed).

Mel and I gave it a lot of thought, but that too remains unexplained.
Sasha could reproduce it fairly easily on linux-next, but could not
reproduce it on 3.17-rc4 (plus the VM_BUG_ON); maybe Dave is doing
something different enough to get it on 3.17-rc7.

I say they're likely related because both could be explained if
there's some way in which a PROTNONE pte can get left behind after
the vma has been mprotected back from PROT_NONE to read-writable.
But we cannot see how (even when racing with page migration).

Irrelevance follows...

There *appears* to be a risk of hitting the VM_BUG_ON, or with no
VM_BUG_ON (as in 3.17-rc) pte_mknuma proceeding to add _PAGE_NUMA
to _PAGE_PROTNONE - making the pte then fail the pte_numa test,
but pass the pte_special test, hence fail the vm_normal_page test:
when coming from change_prot_numa serving MPOL_MF_LAZY for mbind.
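
The bit arithmetic behind that can be modelled in a few lines of C.
This is a sketch under stated assumptions, not kernel source: the
M_* values and model_*() predicates are invented for illustration,
and the sharing of one bit between NUMA and SPECIAL is assumed here
because the reasoning above depends on it. Check the real definitions
in the tree you are debugging.

```c
/* Model flag bits; the values are assumptions for this sketch. */
#define M_PRESENT	(1ul << 0)
#define M_PROTNONE	(1ul << 8)
#define M_NUMA		(1ul << 9)
#define M_SPECIAL	M_NUMA	/* assumption: NUMA and SPECIAL share a bit */

/* Model of a "NUMA hinting pte" test: NUMA set, nothing else. */
static int model_pte_numa(unsigned long v)
{
	return (v & (M_NUMA | M_PROTNONE | M_PRESENT)) == M_NUMA;
}

/* Model of a "special pte" test: SPECIAL set on a present/protnone pte. */
static int model_pte_special(unsigned long v)
{
	return (v & M_SPECIAL) && (v & (M_PRESENT | M_PROTNONE));
}

/* Model of adding the NUMA bit with no VM_BUG_ON to stop it. */
static unsigned long model_pte_mknuma(unsigned long v)
{
	return (v | M_NUMA) & ~M_PRESENT;
}
```

In this model, model_pte_mknuma(M_PROTNONE) yields a value for which
model_pte_numa() is false but model_pte_special() is true, which is
the confusion described above.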

However, that would still not explain Dave's endless refaulting;
though I was reminded to send you a patch to fix it, except that
when I came to test the fix, I could not produce the problem, and
eventually discovered a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP
and MPOL_MF_LAZY from userspace for now") - that call to
change_prot_numa is still just dead code, so we're still safe from
its use on PROT_NONE areas (which task_numa_work carefully avoids).

Some time wasted on that, but I learnt a valuable debugging technique:
#undef EINVAL
#define EINVAL __LINE__

Hugh