[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4E24FA51.70602@gmail.com>
Date: Tue, 19 Jul 2011 11:30:25 +0800
From: Shan Hai <haishan.bai@...il.com>
To: Benjamin Herrenschmidt <benh@...nel.crashing.org>
CC: Peter Zijlstra <peterz@...radead.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>, paulus@...ba.org,
tglx@...utronix.de, walken@...gle.com, dhowells@...hat.com,
cmetcalf@...era.com, tony.luck@...el.com,
akpm@...ux-foundation.org, linuxppc-dev@...ts.ozlabs.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/18/2011 03:36 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-07-18 at 15:26 +0800, Shan Hai wrote:
>> I am sorry I hadn't tried your newer patch, I tried it but it still
>> could not work in my test environment, I will dig into and tell you
>> why that failed later.
> Ok, please let me know what you find !
>
Have not been finding out the reason why failed,
I tried the following based on your code,
(1)
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
{
struct mm_struct *mm = current->mm;
int ret;
+ int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;
down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, (unsigned long)uaddr,
- 1, 1, 0, NULL, NULL);
+ ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+ flags, NULL, NULL, NULL);
up_read(&mm->mmap_sem);
return ret < 0 ? ret : 0;
(2)
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..f7ba26e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
...
+
+ if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+ handle_pte_sw_young_dirty(vma, address, ptep,
+ FAULT_FLAG_WRITE);
...
And everything lookes good, but still couldn't work, need more
investigation.
>> Yep, I know holding lots of ifdef's everywhere is not so good,
>> but if we have some other way(I don't know how till now) to
>> figure out the arch has the need to fixup up the write permission
>> we could eradicate the ugly ifdef's here.
>>
>> I think the handle_mm_fault could do all dirty/young tracking,
>> because the purpose of making follow_page return NULL to
>> its caller is that want to the handle_mm_fault to be called
>> on write permission protection fault.
> I see your point. Rather than factoring the fixup code out, we could
> force gup to call handle_mm_fault()... that makes sense.
>
> However, I don't think we should special case archs. There's plenty of
> cases where we don't care about this fixup even on archs that do SW
> tracking of dirty and young. For example when gup is using for
> subsequent DMA.
>
> Only the (rare ?) cases where it's used as a mean to fixup a failing
> "atomic" user access are relevant.
>
> So I believe we should still pass an explicit flag to __get_user_pages()
> as I propose to activate that behaviour.
>
How about the following one?
the write permission fixup behaviour is triggered explicitly by
the trouble making parts like futex as you suggested.
In this way, the follow_page() mimics exactly how the MMU
faults on atomic access to the user pages, and we could handle
the fault by already existing handle_mm_fault which also do
the dirty/young tracking properly.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..8a76694 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *,
unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split
them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_FIXFAULT 0x200 /* fixup after a fault (PTE
dirty/young upd) */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
{
struct mm_struct *mm = current->mm;
int ret;
+ int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;
down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, (unsigned long)uaddr,
- 1, 1, 0, NULL, NULL);
+ ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+ flags, NULL, NULL, NULL);
up_read(&mm->mmap_sem);
return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..5682501 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct
*vma, unsigned long address,
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
+ int fix_write_permission = 0;
page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
if (!IS_ERR(page)) {
@@ -1519,6 +1520,9 @@ split_fallthrough:
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
set_page_dirty(page);
+
+ if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+ fix_write_permission = 1;
/*
* pte_mkyoung() would be more correct here, but atomic care
* is needed to avoid losing the dirty bit: it is easier to use
@@ -1551,7 +1555,7 @@ split_fallthrough:
unlock:
pte_unmap_unlock(ptep, ptl);
out:
- return page;
+ return (fix_write_permission) ? NULL : page;
bad_page:
pte_unmap_unlock(ptep, ptl);
> At this point, since we have isolated the special case callers, I think
> we are pretty much in a situation where there's no point trying to
> optimize the x86 case more, it's a fairly slow path anyway, and so no
> ifdef should be needed (and x86 already #define out the TLB flush for
> spurious faults in handle_pte_fault today).
>
> We don't even need to change follow_page()... we just don't call it the
> first time around.
>
> I'll cook up another patch later but first we need to find out why the
> one you have doesn't work. There might be another problem lurking (or I
> just made a stupid mistake).
>
> BTW. Can you give me some details about how you reproduce the problem ?
> I should setup something on a booke machine here to verify things.
>
> Cheers,
> Ben.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists