Message-ID: <33d6cb6b-834b-f9b8-df28-b15243994f9b@loongson.cn>
Date: Tue, 22 Oct 2024 09:39:27 +0800
From: maobibo <maobibo@...ngson.cn>
To: Huacai Chen <chenhuacai@...nel.org>
Cc: wuruiyang@...ngson.cn, Andrey Ryabinin <ryabinin.a.a@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, Barry Song <baohua@...nel.org>,
loongarch@...ts.linux.dev, linux-kernel@...r.kernel.org,
kasan-dev@...glegroups.com, linux-mm@...ck.org
Subject: Re: [PATCH v2 1/3] LoongArch: Set initial pte entry with PAGE_GLOBAL
for kernel space
On 2024/10/21 6:13 PM, Huacai Chen wrote:
> On Mon, Oct 21, 2024 at 9:23 AM maobibo <maobibo@...ngson.cn> wrote:
>>
>>
>>
>> On 2024/10/18 2:32 PM, Huacai Chen wrote:
>>> On Fri, Oct 18, 2024 at 2:23 PM maobibo <maobibo@...ngson.cn> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/10/18 12:23 PM, Huacai Chen wrote:
>>>>> On Fri, Oct 18, 2024 at 12:16 PM maobibo <maobibo@...ngson.cn> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/10/18 12:11 PM, Huacai Chen wrote:
>>>>>>> On Fri, Oct 18, 2024 at 11:44 AM maobibo <maobibo@...ngson.cn> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/10/18 11:14 AM, Huacai Chen wrote:
>>>>>>>>> Hi, Bibo,
>>>>>>>>>
>>>>>>>>> I applied this patch but dropped the arch/loongarch/mm/kasan_init.c part:
>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson.git/commit/?h=loongarch-next&id=15832255e84494853f543b4c70ced50afc403067
>>>>>>>>>
>>>>>>>>> This is because kernel_pte_init() should operate on page-table pages,
>>>>>>>>> not on data pages. You have already handled page-table pages in
>>>>>>>>> mm/kasan/init.c, and if we don't drop the modification of data pages
>>>>>>>>> in arch/loongarch/mm/kasan_init.c, the kernel fails to boot when KASAN
>>>>>>>>> is enabled.
>>>>>>>>>
>>>>>>>> static inline void set_pte(pte_t *ptep, pte_t pteval)
>>>>>>>> {
>>>>>>>> WRITE_ONCE(*ptep, pteval);
>>>>>>>> -
>>>>>>>> - if (pte_val(pteval) & _PAGE_GLOBAL) {
>>>>>>>> - pte_t *buddy = ptep_buddy(ptep);
>>>>>>>> - /*
>>>>>>>> - * Make sure the buddy is global too (if it's !none,
>>>>>>>> - * it better already be global)
>>>>>>>> - */
>>>>>>>> - if (pte_none(ptep_get(buddy))) {
>>>>>>>> -#ifdef CONFIG_SMP
>>>>>>>> - /*
>>>>>>>> - * For SMP, multiple CPUs can race, so we need
>>>>>>>> - * to do this atomically.
>>>>>>>> - */
>>>>>>>> - __asm__ __volatile__(
>>>>>>>> - __AMOR "$zero, %[global], %[buddy] \n"
>>>>>>>> - : [buddy] "+ZB" (buddy->pte)
>>>>>>>> - : [global] "r" (_PAGE_GLOBAL)
>>>>>>>> - : "memory");
>>>>>>>> -
>>>>>>>> - DBAR(0b11000); /* o_wrw = 0b11000 */
>>>>>>>> -#else /* !CONFIG_SMP */
>>>>>>>> - WRITE_ONCE(*buddy, __pte(pte_val(ptep_get(buddy)) | _PAGE_GLOBAL));
>>>>>>>> -#endif /* CONFIG_SMP */
>>>>>>>> - }
>>>>>>>> - }
>>>>>>>> + DBAR(0b11000); /* o_wrw = 0b11000 */
>>>>>>>> }
>>>>>>>>
>>>>>>>> No, please hold on. This issue has existed for about twenty years. Do we
>>>>>>>> need to be in such a hurry now?
>>>>>>>>
>>>>>>>> Why is DBAR(0b11000) added in set_pte()?
>>>>>>> It existed before; it is not added by this patch. The reason is explained in
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.12-rc3&id=f93f67d06b1023313ef1662eac490e29c025c030
>>>>>> Why would speculative accesses cause spurious page faults in kernel space
>>>>>> with PTW enabled? Speculative accesses exist everywhere, and they do not
>>>>>> cause spurious page faults.
>>>>> This was confirmed by Ruiyang Wu, and even if DBAR(0b11000) is wrong,
>>>>> that is another patch's mistake, not this one's. This one just keeps the
>>>>> old behavior.
>>>>> +CC Ruiyang Wu here.
>>>> Also from Ruiyang Wu: the information is that speculative accesses may
>>>> insert stale TLB entries, but they do not raise a page fault exception.
>>>>
>>>> So adding a barrier in set_pte() does not prevent speculative accesses.
>>>> And you wrote the patch here, yet do not know the actual reason?
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.12-rc3&id=f93f67d06b1023313ef1662eac490e29c025c030
>>> I have CCed Ruiyang; he can judge whether the description is correct.
>>
>> There are some problems with adding a barrier in set_pte():
>>
>> 1. The issue exists only with the HW PTW enabled and only for the kernel
>> address space, is that right? Also, adding a barrier in set_pte() may be
>> too heavy compared to doing this in flush_cache_vmap().
> So adding a barrier in set_pte() may not be the best solution for
> performance, but you cannot say it is a wrong solution. And yes, we
> only need to care about the kernel space, which is also the old behavior
> before this patch, so set_pte() should be:
>
> static inline void set_pte(pte_t *ptep, pte_t pteval)
> {
>         WRITE_ONCE(*ptep, pteval);
> #ifdef CONFIG_SMP
>         if (pte_val(pteval) & _PAGE_GLOBAL)
cpu_has_ptw also seems to be needed here, if this is only for the hardware page table walker.
>                 DBAR(0b11000); /* o_wrw = 0b11000 */
> #endif
> }
>
> Putting a dbar unconditionally in set_pte() was my mistake; I'm sorry for that.
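
To make the suggestion about cpu_has_ptw concrete, here is a minimal user-space sketch of the condition, not kernel code. Everything prefixed with model_ or MODEL_ is a hypothetical stand-in: model_cpu_has_ptw plays the role of cpu_has_ptw, a C11 fence plays the role of DBAR(0b11000), the bit position chosen for MODEL_PAGE_GLOBAL is only illustrative, and the CONFIG_SMP guard is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for _PAGE_GLOBAL; the bit position is an assumption. */
#define MODEL_PAGE_GLOBAL (UINT64_C(1) << 6)

typedef struct { _Atomic uint64_t pte; } model_pte_t;

static bool model_cpu_has_ptw = true;  /* stand-in for cpu_has_ptw */
static unsigned long model_barriers;   /* how many barriers were issued */

/* Stand-in for DBAR(0b11000): a full fence, counted so it can be observed. */
static void model_dbar(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	model_barriers++;
}

/*
 * Model of the proposed set_pte(): store the PTE, then issue the barrier
 * only for global (kernel) mappings and only when a hardware page table
 * walker is present.
 */
static void model_set_pte(model_pte_t *ptep, uint64_t pteval)
{
	assert(ptep != NULL);
	atomic_store_explicit(&ptep->pte, pteval, memory_order_relaxed);

	if (model_cpu_has_ptw && (pteval & MODEL_PAGE_GLOBAL))
		model_dbar();
}
```

In this model, user-space PTEs and machines without a hardware walker never pay for the barrier; whether that is safe on real hardware is exactly the open question in this thread.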
>
>>
>> 2. LoongArch differs from other architectures in that one TLB entry
>> covers two pages. If two consecutive pages are mapped and then accessed
>> one after the other, the second access takes a page fault. For example:
>>     addr1 = percpu_alloc(pagesize);
>>     val1 = *(int *)addr1;
>>     // With page table walk, addr1 is present and addr2 is pte_none
>>     // TLB entry includes valid pte for addr1, invalid pte for addr2
>>     addr2 = percpu_alloc(pagesize); // will not flush tlb the first time
>>     val2 = *(int *)addr2;
>>     // With page table walk, addr1 is present and addr2 is present also
>>     // TLB entry includes valid pte for addr1, invalid pte for addr2
>> So there will be a page fault when accessing address addr2.
>>
>> The same problem exists for the user address space. By the way, with
>> HW prefetching, a negative side effect of prefetching is that TLB entries
>> get added. So there is a potential page fault whenever memory is
>> allocated and accessed for the first time.
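
The buddy-TLB scenario in clause 2 can be reproduced with a small user-space model. This is only a sketch under simplifying assumptions: a single-entry TLB, a refill that snapshots both the even and the odd PTE of a pair, and no hardware page table walker or fault handler. All model_ and MODEL_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_VALID UINT64_C(1)  /* illustrative valid bit */
#define MODEL_NPTES      8            /* tiny page table: PTEs 2k and 2k+1 are buddies */

struct model_tlb_entry {
	bool     present;  /* entry has been filled */
	unsigned pair;     /* which even/odd pair it caches */
	uint64_t pte[2];   /* snapshot of both PTEs taken at refill time */
};

static uint64_t model_page_table[MODEL_NPTES];
static struct model_tlb_entry model_tlb;  /* single-entry TLB */

/* A TLB miss loads both halves of the pair, whether or not they are valid. */
static void model_tlb_refill(unsigned vpn)
{
	unsigned pair = vpn / 2;

	model_tlb.present = true;
	model_tlb.pair = pair;
	model_tlb.pte[0] = model_page_table[2 * pair];
	model_tlb.pte[1] = model_page_table[2 * pair + 1];
}

static void model_tlb_invalidate(void)
{
	model_tlb.present = false;
}

/* Returns true if the access hits a valid translation, false on a fault. */
static bool model_access(unsigned vpn)
{
	assert(vpn < MODEL_NPTES);
	if (!model_tlb.present || model_tlb.pair != vpn / 2)
		model_tlb_refill(vpn);

	return (model_tlb.pte[vpn & 1] & MODEL_PAGE_VALID) != 0;
}
```

Mapping page 0, touching it, then mapping page 1 and touching it makes model_access(1) fail even though the page table is correct, because the cached pair still holds the old invalid half. After model_tlb_invalidate() the same access succeeds, which is why such a fault is spurious.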
> As discussed internally, there may be three problems related to
> speculative access: 1) a load/store after set_pte() is performed
> ahead of it, which can be prevented by dbar; 2) an instruction fetch
> after set_pte() is performed ahead of it, which can be prevented by
> ibar; 3) the buddy TLB problem you described here, which, if I understand
> Ruiyang's explanation correctly, can only be handled by the
> filter in do_page_fault().
>
> From experiments, without the patch "LoongArch: Improve hardware page
> table walker", there are about 80 spurious page faults during
> boot, and the count increases continually during stress tests. After that
> patch, which adds a dbar to set_pte(), we can no longer observe spurious
> page faults. Of course this doesn't mean 2) and 3) don't exist, but
Good experimental result. Could you share the code used for counting page
faults, and the test cases?
> we can at least say 1) is the main case. On this basis, in "LoongArch:
> Improve hardware page table walker" we use a relatively cheap dbar
> (compared to ibar) to prevent the main case, and add a filter to
> handle 2) and 3). Such a solution is reasonable.
>
>
>>
>> 3. For speculative execution, if it is a user address, there is an eret on
>> return from the syscall, and eret rolls back all speculatively executed
>> instructions. So it is only a problem for speculative execution. And how
>> can we verify whether it is a speculative-execution problem or the problem
>> in clause 2?
> As described above, if spurious page faults still exist after adding
> dbar to set_pte(), it may be a problem of clause 2 (case 3 in my
> description), otherwise it is not a problem of clause 2.
>
> Finally, this patch itself is attempting to solve the concurrency
> problem with _PAGE_GLOBAL, so adding pte_alloc_one_kernel() and
> removing the buddy stuff in set_pte() are what it needs. However, it
> shouldn't touch the logic of dbar in set_pte(), whether "LoongArch:
> Improve hardware page table walker" is right or wrong.
Yes, I agree. We can discuss the set_pte() issue later. Keeping this patch
simple, so that it only solves the concurrency problem, is OK:
https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson.git/diff/mm/kasan/init.c?h=loongarch-next&id=15832255e84494853f543b4c70ced50afc403067
Regards
Bibo Mao
>
>
> Huacai
>
>>
>> Regards
>> Bibo Mao
>>
>>
>>>
>>> Huacai
>>>
>>>>
>>>> Bibo Mao
>>>>>
>>>>> Huacai
>>>>>
>>>>>>
>>>>>> Obviously you do not know it, and you wrote a wrong patch.
>>>>>>
>>>>>>>
>>>>>>> Huacai
>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bibo Mao
>>>>>>>>> Huacai
>>>>>>>>>
>>>>>>>>> On Mon, Oct 14, 2024 at 11:59 AM Bibo Mao <maobibo@...ngson.cn> wrote:
>>>>>>>>>>
>>>>>>>>>> Unlike general architectures, LoongArch has two pages in one TLB
>>>>>>>>>> entry. For kernel space, both pte entries must have the PAGE_GLOBAL
>>>>>>>>>> bit set, otherwise HW treats the entry as a non-global TLB entry, and
>>>>>>>>>> there are potential problems if a TLB entry for kernel space is not
>>>>>>>>>> global. For example, local_flush_tlb_kernel_range(), which only
>>>>>>>>>> flushes TLB entries with the global bit set, fails to flush it.
>>>>>>>>>>
>>>>>>>>>> The new function kernel_pte_init() initializes a pte table when it
>>>>>>>>>> is created for the kernel address space, so that the default initial
>>>>>>>>>> pte value is PAGE_GLOBAL rather than zero.
>>>>>>>>>>
>>>>>>>>>> Kernel address space areas, including the fixmap, percpu, vmalloc,
>>>>>>>>>> kasan and vmemmap areas, set the default pte entry with PAGE_GLOBAL set.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Bibo Mao <maobibo@...ngson.cn>
>>>>>>>>>> ---
>>>>>>>>>> arch/loongarch/include/asm/pgalloc.h | 13 +++++++++++++
>>>>>>>>>> arch/loongarch/include/asm/pgtable.h | 1 +
>>>>>>>>>> arch/loongarch/mm/init.c | 4 +++-
>>>>>>>>>> arch/loongarch/mm/kasan_init.c | 4 +++-
>>>>>>>>>> arch/loongarch/mm/pgtable.c | 22 ++++++++++++++++++++++
>>>>>>>>>> include/linux/mm.h | 1 +
>>>>>>>>>> mm/kasan/init.c | 8 +++++++-
>>>>>>>>>> mm/sparse-vmemmap.c | 5 +++++
>>>>>>>>>> 8 files changed, 55 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
>>>>>>>>>> index 4e2d6b7ca2ee..b2698c03dc2c 100644
>>>>>>>>>> --- a/arch/loongarch/include/asm/pgalloc.h
>>>>>>>>>> +++ b/arch/loongarch/include/asm/pgalloc.h
>>>>>>>>>> @@ -10,8 +10,21 @@
>>>>>>>>>>
>>>>>>>>>> #define __HAVE_ARCH_PMD_ALLOC_ONE
>>>>>>>>>> #define __HAVE_ARCH_PUD_ALLOC_ONE
>>>>>>>>>> +#define __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL
>>>>>>>>>> #include <asm-generic/pgalloc.h>
>>>>>>>>>>
>>>>>>>>>> +static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
>>>>>>>>>> +{
>>>>>>>>>> + pte_t *pte;
>>>>>>>>>> +
>>>>>>>>>> + pte = (pte_t *) __get_free_page(GFP_KERNEL);
>>>>>>>>>> + if (!pte)
>>>>>>>>>> + return NULL;
>>>>>>>>>> +
>>>>>>>>>> + kernel_pte_init(pte);
>>>>>>>>>> + return pte;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> static inline void pmd_populate_kernel(struct mm_struct *mm,
>>>>>>>>>> pmd_t *pmd, pte_t *pte)
>>>>>>>>>> {
>>>>>>>>>> diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
>>>>>>>>>> index 9965f52ef65b..22e3a8f96213 100644
>>>>>>>>>> --- a/arch/loongarch/include/asm/pgtable.h
>>>>>>>>>> +++ b/arch/loongarch/include/asm/pgtable.h
>>>>>>>>>> @@ -269,6 +269,7 @@ extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pm
>>>>>>>>>> extern void pgd_init(void *addr);
>>>>>>>>>> extern void pud_init(void *addr);
>>>>>>>>>> extern void pmd_init(void *addr);
>>>>>>>>>> +extern void kernel_pte_init(void *addr);
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>> * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
>>>>>>>>>> diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
>>>>>>>>>> index 8a87a482c8f4..9f26e933a8a3 100644
>>>>>>>>>> --- a/arch/loongarch/mm/init.c
>>>>>>>>>> +++ b/arch/loongarch/mm/init.c
>>>>>>>>>> @@ -198,9 +198,11 @@ pte_t * __init populate_kernel_pte(unsigned long addr)
>>>>>>>>>> if (!pmd_present(pmdp_get(pmd))) {
>>>>>>>>>> pte_t *pte;
>>>>>>>>>>
>>>>>>>>>> - pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>>>>>>> + pte = memblock_alloc_raw(PAGE_SIZE, PAGE_SIZE);
>>>>>>>>>> if (!pte)
>>>>>>>>>> panic("%s: Failed to allocate memory\n", __func__);
>>>>>>>>>> +
>>>>>>>>>> + kernel_pte_init(pte);
>>>>>>>>>> pmd_populate_kernel(&init_mm, pmd, pte);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/loongarch/mm/kasan_init.c b/arch/loongarch/mm/kasan_init.c
>>>>>>>>>> index 427d6b1aec09..34988573b0d5 100644
>>>>>>>>>> --- a/arch/loongarch/mm/kasan_init.c
>>>>>>>>>> +++ b/arch/loongarch/mm/kasan_init.c
>>>>>>>>>> @@ -152,6 +152,8 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
>>>>>>>>>> phys_addr_t page_phys = early ?
>>>>>>>>>> __pa_symbol(kasan_early_shadow_page)
>>>>>>>>>> : kasan_alloc_zeroed_page(node);
>>>>>>>>>> + if (!early)
>>>>>>>>>> + kernel_pte_init(__va(page_phys));
>>>>>>>>>> next = addr + PAGE_SIZE;
>>>>>>>>>> set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
>>>>>>>>>> } while (ptep++, addr = next, addr != end && __pte_none(early, ptep_get(ptep)));
>>>>>>>>>> @@ -287,7 +289,7 @@ void __init kasan_init(void)
>>>>>>>>>> set_pte(&kasan_early_shadow_pte[i],
>>>>>>>>>> pfn_pte(__phys_to_pfn(__pa_symbol(kasan_early_shadow_page)), PAGE_KERNEL_RO));
>>>>>>>>>>
>>>>>>>>>> - memset(kasan_early_shadow_page, 0, PAGE_SIZE);
>>>>>>>>>> + kernel_pte_init(kasan_early_shadow_page);
>>>>>>>>>> csr_write64(__pa_symbol(swapper_pg_dir), LOONGARCH_CSR_PGDH);
>>>>>>>>>> local_flush_tlb_all();
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/loongarch/mm/pgtable.c b/arch/loongarch/mm/pgtable.c
>>>>>>>>>> index eb6a29b491a7..228ffc1db0a3 100644
>>>>>>>>>> --- a/arch/loongarch/mm/pgtable.c
>>>>>>>>>> +++ b/arch/loongarch/mm/pgtable.c
>>>>>>>>>> @@ -38,6 +38,28 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>>>>>>>>> }
>>>>>>>>>> EXPORT_SYMBOL_GPL(pgd_alloc);
>>>>>>>>>>
>>>>>>>>>> +void kernel_pte_init(void *addr)
>>>>>>>>>> +{
>>>>>>>>>> + unsigned long *p, *end;
>>>>>>>>>> + unsigned long entry;
>>>>>>>>>> +
>>>>>>>>>> + entry = (unsigned long)_PAGE_GLOBAL;
>>>>>>>>>> + p = (unsigned long *)addr;
>>>>>>>>>> + end = p + PTRS_PER_PTE;
>>>>>>>>>> +
>>>>>>>>>> + do {
>>>>>>>>>> + p[0] = entry;
>>>>>>>>>> + p[1] = entry;
>>>>>>>>>> + p[2] = entry;
>>>>>>>>>> + p[3] = entry;
>>>>>>>>>> + p[4] = entry;
>>>>>>>>>> + p += 8;
>>>>>>>>>> + p[-3] = entry;
>>>>>>>>>> + p[-2] = entry;
>>>>>>>>>> + p[-1] = entry;
>>>>>>>>>> + } while (p != end);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> void pgd_init(void *addr)
>>>>>>>>>> {
>>>>>>>>>> unsigned long *p, *end;
>>>>>>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>>>>>>> index ecf63d2b0582..6909fe059a2c 100644
>>>>>>>>>> --- a/include/linux/mm.h
>>>>>>>>>> +++ b/include/linux/mm.h
>>>>>>>>>> @@ -3818,6 +3818,7 @@ void *sparse_buffer_alloc(unsigned long size);
>>>>>>>>>> struct page * __populate_section_memmap(unsigned long pfn,
>>>>>>>>>> unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
>>>>>>>>>> struct dev_pagemap *pgmap);
>>>>>>>>>> +void kernel_pte_init(void *addr);
>>>>>>>>>> void pmd_init(void *addr);
>>>>>>>>>> void pud_init(void *addr);
>>>>>>>>>> pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>>>>>>>>> diff --git a/mm/kasan/init.c b/mm/kasan/init.c
>>>>>>>>>> index 89895f38f722..ac607c306292 100644
>>>>>>>>>> --- a/mm/kasan/init.c
>>>>>>>>>> +++ b/mm/kasan/init.c
>>>>>>>>>> @@ -106,6 +106,10 @@ static void __ref zero_pte_populate(pmd_t *pmd, unsigned long addr,
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +void __weak __meminit kernel_pte_init(void *addr)
>>>>>>>>>> +{
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
>>>>>>>>>> unsigned long end)
>>>>>>>>>> {
>>>>>>>>>> @@ -126,8 +130,10 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
>>>>>>>>>>
>>>>>>>>>> if (slab_is_available())
>>>>>>>>>> p = pte_alloc_one_kernel(&init_mm);
>>>>>>>>>> - else
>>>>>>>>>> + else {
>>>>>>>>>> p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
>>>>>>>>>> + kernel_pte_init(p);
>>>>>>>>>> + }
>>>>>>>>>> if (!p)
>>>>>>>>>> return -ENOMEM;
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>>>>>>>>> index edcc7a6b0f6f..c0388b2e959d 100644
>>>>>>>>>> --- a/mm/sparse-vmemmap.c
>>>>>>>>>> +++ b/mm/sparse-vmemmap.c
>>>>>>>>>> @@ -184,6 +184,10 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>>>>>>>>>> return p;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +void __weak __meminit kernel_pte_init(void *addr)
>>>>>>>>>> +{
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>>>>>>>>>> {
>>>>>>>>>> pmd_t *pmd = pmd_offset(pud, addr);
>>>>>>>>>> @@ -191,6 +195,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>>>>>>>>>> void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>>>>>>>>>> if (!p)
>>>>>>>>>> return NULL;
>>>>>>>>>> + kernel_pte_init(p);
>>>>>>>>>> pmd_populate_kernel(&init_mm, pmd, p);
>>>>>>>>>> }
>>>>>>>>>> return pmd;
>>>>>>>>>> --
>>>>>>>>>> 2.39.3
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
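
For reference, the effect of kernel_pte_init() in the quoted patch can be illustrated with a user-space sketch. It assumes, as the commit message states, that hardware treats a TLB entry as global only when both PTEs of the even/odd pair carry the global bit. The bit positions and the table size of 512 entries are illustrative assumptions, and every model_ or MODEL_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_VALID   UINT64_C(1)         /* illustrative valid bit */
#define MODEL_PAGE_GLOBAL  (UINT64_C(1) << 6)  /* illustrative global bit */
#define MODEL_PTRS_PER_PTE 512                 /* assumed table size */

/* Same effect as the unrolled loop in the patch: every slot starts as "global, not present". */
static void model_kernel_pte_init(uint64_t *table)
{
	for (size_t i = 0; i < MODEL_PTRS_PER_PTE; i++)
		table[i] = MODEL_PAGE_GLOBAL;
}

/* The pair containing idx is global only if both halves have the global bit. */
static bool model_tlb_entry_is_global(const uint64_t *table, size_t idx)
{
	assert(idx < MODEL_PTRS_PER_PTE);
	size_t even = idx & ~(size_t)1;

	return (table[even] & MODEL_PAGE_GLOBAL) &&
	       (table[even + 1] & MODEL_PAGE_GLOBAL);
}
```

With a zero-filled table, mapping only one page of a pair leaves the pair non-global until its buddy is also written, which is the window the removed buddy logic in set_pte() tried to close. With a table prepared by model_kernel_pte_init(), the pair is global as soon as the first page is mapped.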