[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6698c486-56e7-1fc2-567c-69cc446a58be@netapp.com>
Date: Tue, 15 May 2018 16:24:05 +0300
From: Boaz Harrosh <boazh@...app.com>
To: Mark Rutland <mark.rutland@....com>,
Peter Zijlstra <peterz@...radead.org>
CC: Matthew Wilcox <willy@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Jeff Moyer <jmoyer@...hat.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"Rik van Riel" <riel@...hat.com>, Jan Kara <jack@...e.cz>,
Matthew Wilcox <mawilcox@...rosoft.com>,
Amit Golander <Amit.Golander@...app.com>
Subject: Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU
On 15/05/18 14:54, Boaz Harrosh wrote:
> On 15/05/18 03:44, Matthew Wilcox wrote:
>> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh <boazh@...app.com> wrote:
>>>> In this project we utilize a per-core server thread so everything
>>>> is kept local. If we use the regular zap_ptes() API All CPU's
>>>> are scheduled for the unmap, though in our case we know that we
>>>> have only used a single core. The regular zap_ptes adds a very big
>>>> latency on every operation and mostly kills the concurrency of the
>>>> over all system. Because it imposes a serialization between all cores
>>>
>>> I'd have thought that in this situation, only the local CPU's bit is
>>> set in the vma's mm_cpumask() and the remote invalidations are not
>>> performed. Is that a misunderstanding, or is all that stuff not working
>>> correctly?
>>
>> I think you misunderstand Boaz's architecture. He has one thread per CPU,
>> so every bit will be set in the mm's (not vma's) mm_cpumask.
>>
>
> Hi Andrew, Matthew
>
> Yes I have been trying to investigate and trace this for days.
> Please see the code below:
>
>> #define flush_tlb_range(vma, start, end) \
>> flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
>
> The mm_struct @mm below comes from here vma->vm_mm
>
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index e055d1a..1d398a0 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> unsigned long end, unsigned long vmflag)
>> {
>> int cpu;
>>
>> struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
>> .mm = mm,
>> };
>>
>> cpu = get_cpu();
>>
>> /* This is also a barrier that synchronizes with switch_mm(). */
>> info.new_tlb_gen = inc_mm_tlb_gen(mm);
>>
>> /* Should we flush just the requested range? */
>> if ((end != TLB_FLUSH_ALL) &&
>> !(vmflag & VM_HUGETLB) &&
>> ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
>> info.start = start;
>> info.end = end;
>> } else {
>> info.start = 0UL;
>> info.end = TLB_FLUSH_ALL;
>> }
>>
>> if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>> VM_WARN_ON(irqs_disabled());
>> local_irq_disable();
>> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>> local_irq_enable();
>> }
>>
>> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> + if (!(vmflag & VM_LOCAL_CPU) &&
>> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> flush_tlb_others(mm_cpumask(mm), &info);
>>
>
> I have been tracing the mm_cpumask(vma->vm_mm) at my driver at
> different points. At vma creation (file_operations->mmap()),
> and before the call to insert_pfn (which calls here).
>
> At the beginning I was wishful thinking that the mm_cpumask(vma->vm_mm)
> should have a single bit set just as the affinity of the thread on
> creation of that thread. But then I saw that at %80 of the times some
> other random bits are also set.
>
> Yes Random. Always the thread affinity (single) bit was set but
> then zero one or two more bits were set as well. Never seen more then
> two though, which baffles me a lot.
>
> If it was like Matthew said .i.e the cpumask of the all process
> then I would expect all the bits to be set. Because I have a thread
> on each core. And also I would even expect that all vma->vm_mm
> or maybe mm_cpumask(vma->vm_mm) to point to the same global object.
> But it was not so. it was pointing to some thread unique object but
> still those phantom bits were set all over. (And I am almost sure
> same vma had those bits change over time)
>
> So I would love some mm guy to explain where are those bits collected?
> But I do not think this is a Kernel bug because as Matthew showed.
> that vma above can and is allowed to leak memory addresses to other
> threads / cores in the same process. So it appears that the Kernel
> has some correct logic behind its madness.
>
Hi Mark
So you see %20 of the times the mm_cpumask(vma->vm_mm) is a single
bit set. And a natural call to flush_tlb_range will only invalidate
the local cpu. Are you familiar with this logic?
> Which brings me to another question. How can I find from
> within a thread Say at the file_operations->mmap() call that the thread
> is indeed core-pinned. What mm_cpumask should I inspect?
>
Mark, Peter do you know how I can check the above?
Thanks
Boaz
>> put_cpu();
>> }
>
> Thanks
> Boaz
>
Powered by blists - more mailing lists