linux-kernel - Re: [PATCH 3/3] resource: Introduce resource cache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <9175AC33-BB3A-4D1F-B7FC-D0B1A8F971B6@vmware.com>
Date:   Wed, 19 Jun 2019 20:35:14 +0000
From:   Nadav Amit <namit@...are.com>
To:     Bjorn Helgaas <bhelgaas@...gle.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>, Borislav Petkov <bp@...e.de>,
        Toshi Kani <toshi.kani@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCH 3/3] resource: Introduce resource cache

> On Jun 19, 2019, at 6:00 AM, Bjorn Helgaas <bhelgaas@...gle.com> wrote:
> 
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <namit@...are.com> wrote:
>>> On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@...are.com> wrote:
>>> 
>>>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@...ux-foundation.org> wrote:
>>>> 
>>>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@...are.com> wrote:
>>>> 
>>>>> For efficient search of resources, as needed to determine the memory
>>>>> type for dax page-faults, introduce a cache of the most recently used
>>>>> top-level resource. Caching the top-level should be safe as ranges in
>>>>> that level do not overlap (unlike those of lower levels).
>>>>> 
>>>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>>>> is added, removed or changed, invalidate all the resources. The
>>>>> invalidation takes place when the resource_lock is taken for write,
>>>>> preventing possible races.
>>>>> 
>>>>> This patch provides relatively small performance improvements over the
>>>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>>>> resources.
>>>> 
>>>>> --- a/kernel/resource.c
>>>>> +++ b/kernel/resource.c
>>>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>>>> 
>>>>> static DEFINE_RWLOCK(resource_lock);
>>>>> 
>>>>> +/*
>>>>> + * Cache of the top-level resource that was most recently use by
>>>>> + * find_next_iomem_res().
>>>>> + */
>>>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>>>> 
>>>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>>>> bit odd - the latency getting at that rwlock will swamp the benefit of
>>>> isolating the CPUs from each other when accessing resource_cache.
>>>> 
>>>> On the other hand, if we have multiple CPUs running
>>>> find_next_iomem_res() concurrently then yes, I see the benefit.  Has
>>>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>>>> been quantified?
>>> 
>>> No. I am not sure how easy it would be to measure it. On the other hander
>>> the lock is not supposed to be contended (at most cases). At the time I saw
>>> numbers that showed that stores to “exclusive" cache lines can be as
>>> expensive as atomic operations [1]. I am not sure how up to date these
>>> numbers are though. In the benchmark I ran, multiple CPUs ran
>>> find_next_iomem_res() concurrently.
>>> 
>>> [1] https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsigops.org%2Fs%2Fconferences%2Fsosp%2F2013%2Fpapers%2Fp33-david.pdf&amp;data=02%7C01%7Cnamit%40vmware.com%7Ca2706c5ab2c544283f3b08d6f4b6152b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C636965460234022371&amp;sdata=cD7Nhs4jcJGMD7Lav6D%2BC6E5Sei0DiWhKXL7vz2cVHA%3D&amp;reserved=0
>> 
>> Just to clarify - the main motivation behind the per-cpu variable is not
>> about contention, but about the fact the different processes/threads that
>> run concurrently might use different resources.
> 
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.

I don’t know much about this path and whether it is painful. The path I was
regarding is during page-fault handling:

   - handle_mm_fault
      - __handle_mm_fault
         - do_wp_page
            - ext4_dax_fault
               - ext4_dax_huge_fault
                  - dax_iomap_fault
                     - dax_iomap_pte_fault
                        - vmf_insert_mixed_mkwrite
                           - __vm_insert_mixed
                              - track_pfn_insert
                                 - lookup_memtype
                                    - pat_pagerange_is_ram

But indeed track_pfn_insert() in x86 specific. I guess the differences are
due to the page-table controlling the cachability in x86 (PAT), but I don’t
know much about other architectures and whether they have similar
cachability controls in the page-tables.