Message-ID: <AANLkTi=A+cNYm4gvCksw7+LfT0tx6JnXLv3YYuf9M0YB@mail.gmail.com>
Date: Thu, 3 Feb 2011 22:00:12 +0000
From: Catalin Marinas <catalin.marinas@....com>
To: Russell King - ARM Linux <linux@....linux.org.uk>
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 09/19] ARM: LPAE: Page table maintenance for the 3-level format
On 3 February 2011 17:56, Russell King - ARM Linux
<linux@....linux.org.uk> wrote:
> On Mon, Jan 24, 2011 at 05:55:51PM +0000, Catalin Marinas wrote:
>> The patch also introduces the L_PGD_SWAPPER flag to mark pgd entries
>> pointing to pmd tables pre-allocated in the swapper_pg_dir and avoid
>> trying to free them at run-time. This flag is 0 with the classic page
>> table format.
>
> This shouldn't be necessary.
I tried hard to find a simple way around this but couldn't, so any
suggestion is welcome. Basically we have two situations where
pgd_alloc/pgd_free are called: (1) new user mm and (2) identity
mapping. As long as we allocate a PMD for the modules/pkmap mappings,
we need to make sure it is freed (more on why this allocation is
needed below).
For (1), we can (safely?) assume that we always have a vma in the same
1GB range as MODULES_VADDR. I suspect the stack always ends up at the
top of TASK_SIZE.
For (2), there is no guarantee that this PMD is freed, so we need to
free it explicitly in pgd_free().
But we can't simply try to free the previously allocated PMD
corresponding to MODULES_VADDR. There is a situation where the user
page tables have been cleared and we get an abort for modules/pkmap. We
then copy (safely, as it is only used temporarily) the corresponding
pgd_k entry (1GB) into the soon-to-be-freed pgd. At this point
pgd_free() would try to free the PMD from swapper_pg_dir and that's
not possible.
The L_PGD_SWAPPER flag also comes in handy when setting up identity
mappings. Since the top PGD entries (starting with PAGE_OFFSET >>
PGDIR_SHIFT) are copied by pgd_alloc from swapper_pg_dir, we don't
want the init pgd to be corrupted when PHYS_OFFSET > PAGE_OFFSET.
Hence we check L_PGD_SWAPPER and allocate another PMD if necessary.
But at some point we need to free such a PMD and can't blindly try to
free the swapper_pg_dir pages.
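To make the two cases above concrete, here is a rough user-space model
of the freeing logic. It is only a sketch, not the kernel code: the
struct, the model_* function names, the flag's bit position and the
64-byte "tables" are all made up for illustration. The loop mirrors the
one in the pgd_free() hunk quoted further down.

```c
#include <assert.h>
#include <stdlib.h>

#define PTRS_PER_PGD	4		/* LPAE: four entries of 1GB each */
#define L_PGD_SWAPPER	(1u << 0)	/* hypothetical bit, model only */

/* Simplified pgd entry: pointer to a PMD table plus software flags. */
struct model_pgd {
	void *pmd;			/* NULL plays the role of pgd_none() */
	unsigned int flags;
};

/* Stand-in for a PMD table pre-allocated in swapper_pg_dir. */
static char swapper_pmd[64];

/* Mirrors the pgd_free() loop; returns how many PMD tables were freed. */
int model_pgd_free(struct model_pgd *pgd_base)
{
	struct model_pgd *pgd;
	int freed = 0;

	assert(pgd_base != NULL);
	for (pgd = pgd_base; pgd < pgd_base + PTRS_PER_PGD; pgd++) {
		if (!pgd->pmd)
			continue;
		if (pgd->flags & L_PGD_SWAPPER)
			continue;	/* owned by swapper_pg_dir, leave it */
		free(pgd->pmd);
		pgd->pmd = NULL;
		freed++;
	}
	return freed;
}

/*
 * User page tables already cleared, then a modules/pkmap abort copies
 * the swapper entry into the dying pgd.
 */
int model_fault_then_free(void)
{
	struct model_pgd swapper_entry = { swapper_pmd, L_PGD_SWAPPER };
	struct model_pgd user[PTRS_PER_PGD] = { { NULL, 0 } };

	user[2].pmd = malloc(64);	/* PMD allocated at pgd_alloc() time */
	if (!user[2].pmd)
		return -1;
	free(user[2].pmd);		/* released when user tables are cleared */
	user[2].pmd = NULL;
	user[2] = swapper_entry;	/* abort handler copies the pgd_k entry */
	return model_pgd_free(user);	/* must not try to free swapper_pmd */
}

/* Identity mapping: a private PMD sits next to a copied swapper entry. */
int model_identity_then_free(void)
{
	struct model_pgd ident[PTRS_PER_PGD] = { { NULL, 0 } };

	ident[3].pmd = swapper_pmd;	/* copied from swapper_pg_dir */
	ident[3].flags = L_PGD_SWAPPER;
	ident[2].pmd = malloc(64);	/* private PMD, nobody else frees it */
	if (!ident[2].pmd)
		return -1;
	return model_pgd_free(ident);
}
```

In this model model_fault_then_free() frees nothing and
model_identity_then_free() frees exactly one table. Without the flag
check, the first case would end up calling free() on the static
swapper table.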
>> diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
>> index 709244c..003587d 100644
>> --- a/arch/arm/mm/pgd.c
>> +++ b/arch/arm/mm/pgd.c
>> @@ -10,6 +10,7 @@
>> #include <linux/mm.h>
>> #include <linux/gfp.h>
>> #include <linux/highmem.h>
>> +#include <linux/slab.h>
>>
>> #include <asm/pgalloc.h>
>> #include <asm/page.h>
>> @@ -17,6 +18,14 @@
>>
>> #include "mm.h"
>>
>> +#ifdef CONFIG_ARM_LPAE
>> +#define __pgd_alloc() kmalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL)
>> +#define __pgd_free(pgd) kfree(pgd)
>> +#else
>> +#define __pgd_alloc() (pgd_t *)__get_free_pages(GFP_KERNEL, 2)
>> +#define __pgd_free(pgd) free_pages((unsigned long)pgd, 2)
>> +#endif
>> +
>> /*
>> * need to get a 16k page for level 1
>> */
>> @@ -26,7 +35,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>> pmd_t *new_pmd, *init_pmd;
>> pte_t *new_pte, *init_pte;
>>
>> - new_pgd = (pgd_t *)__get_free_pages(GFP_KERNEL, 2);
>> + new_pgd = __pgd_alloc();
>> if (!new_pgd)
>> goto no_pgd;
>>
>> @@ -41,12 +50,21 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>
>> clean_dcache_area(new_pgd, PTRS_PER_PGD * sizeof(pgd_t));
>>
>> +#ifdef CONFIG_ARM_LPAE
>> + /*
>> + * Allocate PMD table for modules and pkmap mappings.
>> + */
>> + new_pmd = pmd_alloc(mm, new_pgd + pgd_index(MODULES_VADDR), 0);
>> + if (!new_pmd)
>> + goto no_pmd;
>
> This should be a copy of the same page tables found in swapper_pg_dir -
> that's what the memcpy() above is doing.
The memcpy() above only copies between 1 and 3 entries from pgd_k
(corresponding to 1 to 3GB of kernel space). It doesn't copy the entry
corresponding to the 1GB below PAGE_OFFSET that would be used by
modules. We need to allocate a new PMD for that.
The problem with the current memory map is that one PGD entry covers
1GB and the one corresponding to MODULES_VADDR is shared between user
and kernel. An alternative would be to move the kernel a bit higher
(and allow MODULES_VADDR at a 1GB boundary). The PAGE_OFFSET would be
something like 3GB + 16M, though I'm not sure what other implications
this would have.
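The index arithmetic behind the last two paragraphs can be sketched as
follows. This is a stand-alone illustration, not kernel code: it
assumes PGDIR_SHIFT is 30 (one entry per 1GB), four pgd entries, and
that MODULES_VADDR sits 16M below PAGE_OFFSET. The two helper names at
the bottom are made up for the example.

```c
#include <stdint.h>

#define PGDIR_SHIFT	30		/* assumed: one pgd entry covers 1GB */
#define PTRS_PER_PGD	4		/* assumed: 4 entries cover 4GB */
#define SZ_16M		0x01000000u	/* assumed modules area size */

/* Index of the pgd entry covering a 32-bit virtual address. */
static unsigned int pgd_index(uint32_t addr)
{
	return addr >> PGDIR_SHIFT;
}

/* Number of top pgd entries the memcpy() takes from swapper_pg_dir. */
unsigned int kernel_pgd_entries(uint32_t page_offset)
{
	return PTRS_PER_PGD - pgd_index(page_offset);
}

/*
 * Non-zero if MODULES_VADDR (assumed PAGE_OFFSET - 16M) lands in a pgd
 * entry below the one holding PAGE_OFFSET, i.e. in an entry that the
 * memcpy() does not copy and that also covers user addresses.
 */
int modules_below_kernel_pgd(uint32_t page_offset)
{
	uint32_t modules_vaddr = page_offset - SZ_16M;

	return pgd_index(modules_vaddr) < pgd_index(page_offset);
}
```

With PAGE_OFFSET at 3GB (0xC0000000), one entry is copied and the
modules area falls into entry 2, below the kernel's entry 3. With
PAGE_OFFSET at 3GB + 16M (0xC1000000), MODULES_VADDR becomes
0xC0000000 and both land in entry 3.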
Yet another alternative, which I don't like at all, is to pretend that
we only have 2 levels of page tables and always allocate 4 PMD pages +
1 PGD.
>> +#endif
>> +
>> if (!vectors_high()) {
>> /*
>> * On ARM, first page must always be allocated since it
>> * contains the machine vectors.
>> */
>> - new_pmd = pmd_alloc(mm, new_pgd, 0);
>> + new_pmd = pmd_alloc(mm, new_pgd + pgd_index(0), 0);
>
> However, the first pmd table, and the first pte table only need to be
> present for the reason stated in the comment, and these need to be
> allocated.
The above change is harmless, I just added it for correctness.
>> if (!new_pmd)
>> goto no_pmd;
>>
>> @@ -66,7 +84,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>> no_pte:
>> pmd_free(mm, new_pmd);
>> no_pmd:
>> - free_pages((unsigned long)new_pgd, 2);
>> + __pgd_free(new_pgd);
>> no_pgd:
>> return NULL;
>> }
>> @@ -80,20 +98,36 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd_base)
>> if (!pgd_base)
>> return;
>>
>> - pgd = pgd_base + pgd_index(0);
>> - if (pgd_none_or_clear_bad(pgd))
>> - goto no_pgd;
>> + if (!vectors_high()) {
>
> No, that's wrong. As FIRST_USER_ADDRESS is nonzero, the first pmd and
> pte table will remain allocated in spite of free_pgtables(), so this
> results in a memory leak.
I agree (and I replied to my own post earlier today), we found the
leak in testing. It is safe to remove this hunk (I had a thought that
it may trigger a bad pmd because of the identity mapping but that's
cleared already via identity_mapping_del()).
>> + pgd = pgd_base + pgd_index(0);
>> + if (pgd_none_or_clear_bad(pgd))
>> + goto no_pgd;
>>
>> - pmd = pmd_offset(pgd, 0);
>> - if (pmd_none_or_clear_bad(pmd))
>> - goto no_pmd;
>> + pmd = pmd_offset(pgd, 0);
>> + if (pmd_none_or_clear_bad(pmd))
>> + goto no_pmd;
>>
>> - pte = pmd_pgtable(*pmd);
>> - pmd_clear(pmd);
>> - pte_free(mm, pte);
>> + pte = pmd_pgtable(*pmd);
>> + pmd_clear(pmd);
>> + pte_free(mm, pte);
>> no_pmd:
>> - pgd_clear(pgd);
>> - pmd_free(mm, pmd);
>> + pgd_clear(pgd);
>> + pmd_free(mm, pmd);
>> + }
>> no_pgd:
>> - free_pages((unsigned long) pgd_base, 2);
>> +#ifdef CONFIG_ARM_LPAE
>> + /*
>> + * Free modules/pkmap or identity pmd tables.
>> + */
>> + for (pgd = pgd_base; pgd < pgd_base + PTRS_PER_PGD; pgd++) {
>> + if (pgd_none_or_clear_bad(pgd))
>> + continue;
>> + if (pgd_val(*pgd) & L_PGD_SWAPPER)
>> + continue;
>> + pmd = pmd_offset(pgd, 0);
>> + pgd_clear(pgd);
>> + pmd_free(mm, pmd);
>> + }
>> +#endif
>
> And as kernel mappings in the pgd above TASK_SIZE are supposed to be
> identical across all page tables, this shouldn't be necessary.
For tasks yes, but what about the identity mapping allocations? We
could change the name of pgd_alloc() and add another parameter to
distinguish between these two scenarios.
--
Catalin