linux-kernel - Re: [PATCH v2] RISC-V: Implement ASID allocator

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4f290078-54bd-62e2-6356-d700e53e6315@garyguo.net>
Date:   Thu, 28 Mar 2019 14:30:28 +0000
From:   Gary Guo <gary@...yguo.net>
To:     Anup Patel <anup@...infault.org>
CC:     Anup Patel <Anup.Patel@....com>,
        Palmer Dabbelt <palmer@...ive.com>,
        Albert Ou <aou@...s.berkeley.edu>,
        Atish Patra <Atish.Patra@....com>,
        Paul Walmsley <paul.walmsley@...ive.com>,
        Christoph Hellwig <hch@...radead.org>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] RISC-V: Implement ASID allocator



On 28/03/2019 14:09, Anup Patel wrote:
> On Thu, Mar 28, 2019 at 7:07 PM Gary Guo <gary@...yguo.net> wrote:
>>
>> Hi Anup,
>>
>> The code still does not use ASID in TLB flush routines. Without this
>> added the code does not boot on systems with true ASID support.
> 
> Can you elaborate why flush by ASID is need and flush_tlb_all() will
> not work?
 >
flush_tlb_all() will work, but not flush_tlb_mm, flush_tlb_page, 
flush_tlb_range. When we want to flush something related to a MM we need 
to get its ASID and SFENCE with that ASID.
>>
>> We also need to consider the case of CONTEXTID overflow on 32-bit
>> systems. 32-bit CONTEXTID may overflow in a month time.
> 
> On 32bit systems, upper 24bits of CONTEXTID will be VERSION and
> lower 8bits will be HW ASID.
> 
> Can you elaborate how did you reach to conclusion that CONTEXID
> will overflow in a month time?
> 
Assume a case where we have 256 processes to run, and 8 cores, 
2^32/(250Hz)/8 = 24 days.
>>
>> Please all see my inline comments.
>>
>> Best,
>> Gary
>>
>> On 28/03/2019 06:32, Anup Patel wrote:
>>> Currently, we do local TLB flush on every MM switch. This is very harsh
>>> on performance because we are forcing page table walks after every MM
>>> switch.
>>>
>>> This patch implements ASID allocator for assigning an ASID to every MM
>>> context. The number of ASIDs are limited in HW so we create a logical
>>> entity named CONTEXTID for assigning to MM context. The lower bits of
>>> CONTEXTID are ASID and upper bits are VERSION number. The number of
>>> usable ASID bits supported by HW are detected at boot-time by writing
>>> 1s to ASID bits in SATP CSR. This means last ASID is always reserved
>>> because it is used for initial MM context.
>>>
>>> We allocate new CONTEXTID on first MM switch for a MM context where
>>> the ASID is allocated from an ASID bitmap and VERSION is provide by
>>> an atomic counter. At time of allocating new CONTEXTID, if we run out
>>> of available ASIDs then:
>>> 1. We flush the ASID bitmap
>>> 2. Increment current VERSION atomic counter
>>> 3. Re-allocate ASID from ASID bitmap
>>> 4. Flush TLB on all CPUs
>>> 5. Try CONTEXTID re-assignment on all CPUs
>>>
>>> Using above approach, we have virtually infinite CONTEXTIDs on-top-of
>>> limited number of HW ASIDs. This approach is inspired from ASID allocator
>>> used for Linux ARM/ARM64 but we have adapted it for RISC-V. Overall,
>>> this ASID allocator helps us reduce rate of local TLB flushes on every
>>> CPU thereby increasing performance.
>>>
>>> This patch is tested on QEMU/virt machine and SiFive Unleashed board.
>>> On QEMU/virt machine, we see 10% (approx) performance improvement with
>>> SW emulated TLBs provided by QEMU. Unfortunately, ASID bits of SATP CSR
>>> are not implemented on SiFive Unleashed board so we don't see any change
>>> in performance.
>>>
>>> Signed-off-by: Gary Guo <gary@...yguo.net>
>> Could you add a Co-developed-by line in addition to Signed-off-by as
>> well? Thanks.
> 
> Sure, I will add.
> 
>>> Signed-off-by: Anup Patel <anup.patel@....com>
>>> ---
>>> Changes since v1:
>>> - We adapt good aspects from Gary Guo's ASID allocator implementation
>>>     and provide due credit to him by adding his SoB.
>>> - Track ASIDs active during context flush and mark them as reserved
>>> - Set ASID bits to all 1s to simplify number of ASID bit detection
>>> - Use atomic_long_t instead of atomic64_t for being 32bit friendly
>>> - Use unsigned long instead of u64 for being 32bit friendly
>>> - Use flush_tlb_all() instead of lazy local_tlb_flush_all() at time
>>>     of context flush
>>>
>>> This patch is based on Linux-5.1-rc2 and TLB flush cleanup patches v4
>>> from Gary Guo. It can be also found in riscv_asid_allocator_v2 branch
>>> of https://github.com/avpatel/linux.git
>>> ---
>>>    arch/riscv/include/asm/csr.h         |   6 +
>>>    arch/riscv/include/asm/mmu.h         |   1 +
>>>    arch/riscv/include/asm/mmu_context.h |   1 +
>>>    arch/riscv/kernel/head.S             |   2 +
>>>    arch/riscv/mm/context.c              | 249 +++++++++++++++++++++++++--
>>>    5 files changed, 247 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
>>> index 28a0d1cb374c..ce18ab8f53ed 100644
>>> --- a/arch/riscv/include/asm/csr.h
>>> +++ b/arch/riscv/include/asm/csr.h
>>> @@ -45,10 +45,16 @@
>>>    #define SATP_PPN     _AC(0x003FFFFF, UL)
>>>    #define SATP_MODE_32 _AC(0x80000000, UL)
>>>    #define SATP_MODE    SATP_MODE_32
>>> +#define SATP_ASID_BITS       9
>>> +#define SATP_ASID_SHIFT      22
>>> +#define SATP_ASID_MASK       _AC(0x1FF, UL)
>>>    #else
>>>    #define SATP_PPN     _AC(0x00000FFFFFFFFFFF, UL)
>>>    #define SATP_MODE_39 _AC(0x8000000000000000, UL)
>>>    #define SATP_MODE    SATP_MODE_39
>>> +#define SATP_ASID_BITS       16
>>> +#define SATP_ASID_SHIFT      44
>>> +#define SATP_ASID_MASK       _AC(0xFFFF, UL)
>>>    #endif
>>>
>>>    /* Interrupt Enable and Interrupt Pending flags */
>>> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
>>> index 5df2dccdba12..42a9ca0fe1fb 100644
>>> --- a/arch/riscv/include/asm/mmu.h
>>> +++ b/arch/riscv/include/asm/mmu.h
>>> @@ -18,6 +18,7 @@
>>>    #ifndef __ASSEMBLY__
>>>
>>>    typedef struct {
>>> +     atomic_long_t id;
>>>        void *vdso;
>>>    #ifdef CONFIG_SMP
>>>        /* A local icache flush is needed before user execution can resume. */
>>> diff --git a/arch/riscv/include/asm/mmu_context.h b/arch/riscv/include/asm/mmu_context.h
>>> index bf4f097a9051..ba6ab35c18dc 100644
>>> --- a/arch/riscv/include/asm/mmu_context.h
>>> +++ b/arch/riscv/include/asm/mmu_context.h
>>> @@ -30,6 +30,7 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
>>>    static inline int init_new_context(struct task_struct *task,
>>>        struct mm_struct *mm)
>>>    {
>>> +     atomic_long_set(&(mm)->context.id, 0);
>> Parenthesis around mm isn't necessary
> 
> Okay, I will update.
> 
>>>        return 0;
>>>    }
>>>
>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>> index fe884cd69abd..c3f9adc0d054 100644
>>> --- a/arch/riscv/kernel/head.S
>>> +++ b/arch/riscv/kernel/head.S
>>> @@ -95,6 +95,8 @@ relocate:
>>>        la a2, swapper_pg_dir
>>>        srl a2, a2, PAGE_SHIFT
>>>        li a1, SATP_MODE
>>> +     li a0, (SATP_ASID_MASK << SATP_ASID_SHIFT)
>>> +     or a1, a1, a0
>>>        or a2, a2, a1
>>>
>>>        /*
>>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>>> index 0f787bcd3a7a..1205d33d1b1b 100644
>>> --- a/arch/riscv/mm/context.c
>>> +++ b/arch/riscv/mm/context.c
>>> @@ -2,13 +2,209 @@
>>>    /*
>>>     * Copyright (C) 2012 Regents of the University of California
>>>     * Copyright (C) 2017 SiFive
>>> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
>>>     */
>>>
>>> +#include <linux/bitops.h>
>>>    #include <linux/mm.h>
>>> +#include <linux/slab.h>
>>>
>>>    #include <asm/tlbflush.h>
>>>    #include <asm/cacheflush.h>
>>>
>>> +static bool use_asid_allocator;
>>> +static unsigned long asid_bits;
>>> +static unsigned long num_asids;
>>> +static unsigned long asid_mask;
>>> +static unsigned long first_version;
>>> +
>>> +static atomic_long_t current_version;
>>> +
>>> +static DEFINE_RAW_SPINLOCK(context_lock);
>>> +static unsigned long *context_asid_map;
>>> +
>>> +static DEFINE_PER_CPU(atomic_long_t, active_context);
>>> +static DEFINE_PER_CPU(unsigned long, reserved_context);
>>> +
>>> +static bool check_update_reserved_context(unsigned long cntx,
>>> +                                       unsigned long newcntx)
>>> +{
>>> +     int cpu;
>>> +     bool hit = false;
>>> +
>>> +     /*
>>> +      * Iterate over the set of reserved CONTEXT looking for a match.
>>> +      * If we find one, then we can update our mm to use new CONTEXT
>>> +      * (i.e. the same CONTEXT in the current_version) but we can't
>>> +      * exit the loop early, since we need to ensure that all copies
>>> +      * of the old CONTEXT are updated to reflect the mm. Failure to do
>>> +      * so could result in us missing the reserved CONTEXT in a future
>>> +      * version.
>>> +      */
>>> +     for_each_possible_cpu(cpu) {
>>> +             if (per_cpu(reserved_context, cpu) == cntx) {
>>> +                     hit = true;
>>> +                     per_cpu(reserved_context, cpu) = newcntx;
>>> +             }
>>> +     }
>>> +
>>> +     return hit;
>>> +}
>>> +
>>> +/* Note: must be called with context_lock held */
>>> +static void __flush_context(void)
>>> +{
>>> +     int i;
>>> +     unsigned long cntx;
>>> +
>>> +     /* Update the list of reserved ASIDs and the ASID bitmap. */
>>> +     bitmap_clear(context_asid_map, 0, num_asids);
>>> +
>>> +     /* Mark already acitve ASIDs as used */
>>> +     for_each_possible_cpu(i) {
>>> +             cntx = atomic_long_xchg_relaxed(&per_cpu(active_context, i), 0);
>>> +             /*
>>> +              * If this CPU has already been through a rollover, but
>>> +              * hasn't run another task in the meantime, we must preserve
>>> +              * its reserved CONTEXT, as this is the only trace we have of
>>> +              * the process it is still running.
>>> +              */
>>> +             if (cntx == 0)
>>> +                     cntx = per_cpu(reserved_context, i);
>>> +
>>> +             __set_bit(cntx & asid_mask, context_asid_map);
>>> +             per_cpu(reserved_context, i) = cntx;
>>> +     }
>>> +
>>> +     /* Mark last ASID as used because it is used at boot-time */
>>> +     __set_bit(asid_mask, context_asid_map);
>> Looks unnecessary as we always start find_next_zero_bit from idx 1.
> 
> This is to ensure that we never use last ASID >
Uh, sorry. I misread. But we surely can use the last ASID after the 
first rollover?
>>> +} >>> +
>>> +/* Note: must be called with context_lock held */
>>> +static unsigned long __new_context(struct mm_struct *mm,
>>> +                                bool *need_tlb_flush)
>>> +{
>>> +     static u32 cur_idx = 1;
>>> +     unsigned long cntx = atomic_long_read(&mm->context.id);
>>> +     unsigned long asid, ver = atomic_long_read(&current_version);
>>> +
>>> +     if (cntx != 0) {
>>> +             unsigned long newcntx = ver | (cntx & ~asid_mask);
>> Shouldn't this be cntx & asid_mask ?
> 
> Ahh, typo. Thanks for catching.
> 
>>> +
>>> +             /*
>>> +              * If our current CONTEXT was active during a rollover, we
>>> +              * can continue to use it and this was just a false alarm.
>>> +              */
>>> +             if (check_update_reserved_context(cntx, newcntx))
>>> +                     return newcntx;
>>> +
>>> +             /*
>>> +              * We had a valid CONTEXT in a previous life, so try to
>>> +              * re-use it if possible.
>>> +              */
>>> +             if (!__test_and_set_bit(cntx & asid_mask, context_asid_map))
>>> +                     return newcntx;
>>> +     }
>>> +
>>> +     /*
>>> +      * Allocate a free ASID. If we can't find one then increment
>>> +      * current_version and flush all ASIDs.
>>> +      */
>>> +     asid = find_next_zero_bit(context_asid_map, num_asids, cur_idx);
>>> +     if (asid != num_asids)
>>> +             goto set_asid;
>>> +
>>> +     /* We're out of ASIDs, so increment current_version */
>>> +     ver = atomic_long_add_return_relaxed(first_version, &current_version);
>>> +
>>> +     /* Flush everything  */
>>> +     __flush_context();
>>> +     *need_tlb_flush = true;
>>> +
>>> +     /* We have more ASIDs than CPUs, so this will always succeed */
>>> +     asid = find_next_zero_bit(context_asid_map, num_asids, 1);
>>> +
>>> +set_asid:
>>> +     __set_bit(asid, context_asid_map);
>>> +     cur_idx = asid;
>>> +     return asid | ver;
>>> +}
>>> +
>>> +static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
>>> +{
>>> +     unsigned long flags;
>>> +     bool need_tlb_flush = false;
>>> +     unsigned long cntx, old_active_cntx;
>>> +
>>> +     cntx = atomic_long_read(&mm->context.id);
>>> +
>>> +     /*
>>> +      * If our active_context is non-zero and the context matches the
>>> +      * current_version, then we update the active_context entry with a
>>> +      * relaxed cmpxchg.
>>> +      *
>>> +      * Following is how we handle racing with a concurrent rollover:
>>> +      *
>>> +      * - We get a zero back from the cmpxchg and end up waiting on the
>>> +      *   lock. Taking the lock synchronises with the rollover and so
>>> +      *   we are forced to see the updated verion.
>>> +      *
>>> +      * - We get a valid context back from the cmpxchg then we continue
>>> +      *   using old ASID because __flush_context() would have marked ASID
>>> +      *   of active_context as used and next context switch we will allocate
>>> +      *   new context.
>>> +      */
>>> +     old_active_cntx = atomic_long_read(&per_cpu(active_context, cpu));
>>> +     if (old_active_cntx &&
>>> +         !((cntx ^ atomic_long_read(&current_version)) >> asid_bits) &&
>> This looks hard to read. Probably
>> (cntx &~ asid_mask) == atomic_long_read(&current_version)
>> is clearer.
> 
> No issues, I am fine with either way. I will update.
> 
>>> +         atomic_long_cmpxchg_relaxed(&per_cpu(active_context, cpu),
>>> +                                     old_active_cntx, cntx))
>>> +             goto switch_mm_fast;
>>> +
>>> +     raw_spin_lock_irqsave(&context_lock, flags);
>> Any reason that we prefer raw_ here?
>>> +
>>> +     /* Check that our ASID belongs to the current_version. */
>>> +     cntx = atomic_long_read(&mm->context.id);
>>> +     if ((cntx ^ atomic_long_read(&current_version)) >> asid_bits) {
>> Same as above, probably
>> (cntx &~ asid_mask) != atomic_long_read(&current_version)
>> makes more sense.
>>> +             cntx = __new_context(mm, &need_tlb_flush);
>>> +             atomic_long_set(&mm->context.id, cntx);
>>> +     }
>>> +
>>> +     atomic_long_set(&per_cpu(active_context, cpu), cntx);
>>> +
>>> +     raw_spin_unlock_irqrestore(&context_lock, flags);
>>> +
>>> +switch_mm_fast:
>>> +     /*
>>> +      * Use the old spbtr name instead of using the current satp
>>> +      * name to support binutils 2.29 which doesn't know about the
>>> +      * privileged ISA 1.10 yet.
>>> +      */
>>> +     csr_write(sptbr, virt_to_pfn(mm->pgd) |
>>> +               ((cntx & asid_mask) << SATP_ASID_SHIFT) |
>>> +               SATP_MODE);
>>> +
>>> +     if (need_tlb_flush)
>>> +             flush_tlb_all();
>> I think your intention here is to avoid calling flush_tlb_all when IRQs
>> are off in the critical region. However, switch_mm will be called from
>> scheduler as well which also turn irqs off, so this still cause issue. I
>> think a better way is to force flush_tlb_all to use SBI when IRQs are
>> off. What do you think?
> 
> We are still waiting for OpenSBI to provide complete implementation.
> 
> I agree that we should prefer SBI based remote TLB flush all here. Let's
> wait for more comments.
> 
I prefer SBI as well. Can you reopen OpenSBI issue #87 to track the 
progress until we can proper handle race conditions in OpenSBI? Once 
that's completed I'll drop the IPI patch and we can safely do 
flush_tlb_all within __flush_context.
>>> +}
>>> +
>>> +static void set_mm_noasid(struct mm_struct *mm)
>>> +{
>>> +     /*
>>> +      * Use the old spbtr name instead of using the current satp
>>> +      * name to support binutils 2.29 which doesn't know about the
>>> +      * privileged ISA 1.10 yet.
>>> +      */
>>> +     csr_write(sptbr, virt_to_pfn(mm->pgd) | SATP_MODE);
>>> +
>>> +     /*
>>> +      * sfence.vma after SATP write. We call it on MM context instead of
>>> +      * calling local_flush_tlb_all to prevent global mappings from being
>>> +      * affected.
>>> +      */
>>> +     local_flush_tlb_mm(mm);
>>> +}
>>> +
>>>    /*
>>>     * When necessary, performs a deferred icache flush for the given MM context,
>>>     * on the local CPU.  RISC-V has no direct mechanism for instruction cache
>>> @@ -58,20 +254,49 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>>        cpumask_clear_cpu(cpu, mm_cpumask(prev));
>>>        cpumask_set_cpu(cpu, mm_cpumask(next));
>>>
>>> -     /*
>>> -      * Use the old spbtr name instead of using the current satp
>>> -      * name to support binutils 2.29 which doesn't know about the
>>> -      * privileged ISA 1.10 yet.
>>> -      */
>>> -     csr_write(sptbr, virt_to_pfn(next->pgd) | SATP_MODE);
>>> +     if (use_asid_allocator)
>>> +             set_mm_asid(next, cpu);
>>> +     else
>>> +             set_mm_noasid(next);
>>> +
>>> +     flush_icache_deferred(next);
>>> +}
>>> +
>>> +static int asids_init(void)
>>> +{
>>> +     /* Figure-out number of ASID bits in HW */
>>> +     asid_bits = (csr_read(sptbr) >> SATP_ASID_SHIFT) & SATP_ASID_MASK;
>>> +     asid_bits = fls_long(asid_bits);
>>> +
>>> +     /* Pre-compute ASID details */
>>> +     num_asids = 1 << asid_bits;
>>> +     asid_mask = num_asids - 1;
>>> +     first_version = num_asids;
>> Is there any reason we want to have two set-once variables with same value?
> 
> Yap, "first_version" looks redundant. I will update.
> 
>>>
>>>        /*
>>> -      * sfence.vma after SATP write. We call it on MM context instead of
>>> -      * calling local_flush_tlb_all to prevent global mappings from being
>>> -      * affected.
>>> +      * Use ASID allocator only if number of HW ASIDs are
>>> +      * at-least twice more than CPUs
>>>         */
>>> -     local_flush_tlb_mm(next);
>>> +     use_asid_allocator =
>>> +             (num_asids <= (2 * num_possible_cpus())) ? false : true;
>>>
>>> -     flush_icache_deferred(next);
>>> -}
>>> +     /* Setup ASID allocator if available */
>>> +     if (use_asid_allocator) {
>>> +             atomic_long_set(&current_version, first_version);
>>>
>>> +             context_asid_map = kcalloc(BITS_TO_LONGS(num_asids),
>>> +                                sizeof(*context_asid_map), GFP_KERNEL);
>>> +             if (!context_asid_map)
>>> +                     panic("Failed to allocate bitmap for %lu ASIDs\n",
>>> +                           num_asids);
>>> +
>>> +             __set_bit(asid_mask, context_asid_map);
>>> +
>>> +             pr_info("ASID allocator using %lu entries\n", num_asids);
>>> +     } else {
>> If we decide not to use ASID allocator, we will need to set ASID field
>> to zero on *all harts* before we do our first switch_mm. Otherwise we
>> will end up a hart running non-zero ASID and another running zero ASID
>> with different page table.
> 
> Yes, I saw that in your implementation but for better readability and
> debugability. I have preserved asid_bits that we computed and added
> separate use_asid_allocator flag.
I didn't say I'm against having use_asid_allocator.
> 
> In future, I plan to show asid_bits in /proc/cpuinfo as-well.
> 
>>> +             pr_info("ASID allocator disabled\n");
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +early_initcall(asids_init);
>>>
> 
> Regards,
> Anup
>