Message-ID: <A87157D6-4A0A-44F8-8AC8-4411F5EDC57A@fb.com>
Date: Sun, 3 Jun 2018 00:51:46 +0000
From: Song Liu <songliubraving@...com>
To: Andy Lutomirski <luto@...nel.org>
CC: "riel@...riel.com" <riel@...riel.com>,
Mike Galbraith <efault@....de>,
LKML <linux-kernel@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
"Ingo Molnar" <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
> On Jun 2, 2018, at 1:14 PM, Andy Lutomirski <luto@...nel.org> wrote:
>
> On Fri, Jun 1, 2018 at 10:04 PM Rik van Riel <riel@...riel.com> wrote:
>>
>> On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@...riel.com> wrote:
>>>>
>>>> On Fri, 1 Jun 2018 14:21:58 -0700
>>>> Andy Lutomirski <luto@...nel.org> wrote:
>>>>
>>>>> Hmm. I wonder if there's a more clever data structure than a
>>>>> bitmap
>>>>> that we could be using here. Each CPU only ever needs to be in
>>>>> one
>>>>> mm's cpumask, and each cpu only ever changes its own state in the
>>>>> bitmask. And writes are much less common than reads for most
>>>>> workloads.
>>>>
>>>> It would be easy enough to add an mm_struct pointer to the
>>>> per-cpu tlbstate struct, and iterate over those.
>>>>
>>>> However, that would be an orthogonal change to optimizing
>>>> lazy TLB mode.
>>>>
>>>> Does the (untested) patch below make sense as a potential
>>>> improvement to the lazy TLB heuristic?
>>>>
>>>> ---8<---
>>>> Subject: x86,tlb: workload dependent per CPU lazy TLB switch
>>>>
>>>> Lazy TLB mode is a tradeoff between flushing the TLB and touching
>>>> the mm_cpumask(&init_mm) at context switch time, versus potentially
>>>> incurring a remote TLB flush IPI while in lazy TLB mode.
>>>>
>>>> Whether this pays off is likely to be workload dependent more than
>>>> anything else. However, the current heuristic keys off hardware
>>>> type.
>>>>
>>>> This patch changes the lazy TLB mode heuristic to a dynamic, per-
>>>> CPU
>>>> decision, dependent on whether we recently received a remote TLB
>>>> shootdown while in lazy TLB mode.
>>>>
>>>> This is a very simple heuristic. When a CPU receives a remote TLB
>>>> shootdown IPI while in lazy TLB mode, a counter in the same cache
>>>> line is set to 16. Every time we skip lazy TLB mode, the counter
>>>> is decremented.
>>>>
>>>> While the counter is zero (no recent TLB flush IPIs), allow lazy
>>>> TLB mode.
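A minimal C sketch of that heuristic, for reference. The counter name
flushed_while_lazy matches the tunable varied in the tests below, but the
struct layout and the helper names are illustrative, not copied from the
patch:

#include <linux/percpu.h>
#include <linux/types.h>

/* Illustrative sketch only, not the actual patch. */
#define LAZY_TLB_BACKOFF	16	/* the tunable varied in the tests below */

struct lazy_tlb_state {
	unsigned short flushed_while_lazy;	/* kept next to the lazy flag */
};
static DEFINE_PER_CPU(struct lazy_tlb_state, lazy_tlb_state);

/* Remote TLB shootdown IPI received while this CPU was in lazy TLB mode. */
static void note_flush_while_lazy(void)
{
	this_cpu_write(lazy_tlb_state.flushed_while_lazy, LAZY_TLB_BACKOFF);
}

/* Context switch time: may this CPU enter lazy TLB mode? */
static bool allow_lazy_tlb(void)
{
	unsigned short n = this_cpu_read(lazy_tlb_state.flushed_while_lazy);

	if (n) {
		/* Recently flushed while lazy: skip lazy mode and decay. */
		this_cpu_write(lazy_tlb_state.flushed_while_lazy, n - 1);
		return false;
	}
	return true;
}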
>>>
>>> Hmm, cute. That's not a bad idea at all. It would be nice to get
>>> some kind of real benchmark on both PCID and !PCID. If nothing else,
>>> I would expect the threshold (16 in your patch) to want to be lower
>>> on
>>> PCID systems.
>>
>> That depends on how well we manage to get rid of
>> the cpumask manipulation overhead. On the PCID
>> system we first found this issue, the atomic
>> accesses to the mm_cpumask took about 4x as much
>> CPU time as the TLB invalidation itself.
>>
>> That kinda limits how much the cost of cheaper
>> TLB flushes actually help :)
>>
>> I agree this code should get some testing.
>>
>
> Just to check: in the workload where you're seeing this problem, are
> you using an mm with many threads? I would imagine that, if you only
> have one or two threads, the bit operations aren't so bad.
Yes, we are running netperf/netserver with 300 threads. We don't see
this much overhead with real workloads.
Here are some test results. The tests are on a 2-socket system with E5-2678 v3,
so 48 logical cores, with PCID. I tested 3 kernels (and tuned
flushed_while_lazy):
Baseline: 0512e01345 in upstream
Baseline + lazy-tlb: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch"
Baseline + both patches: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch" + "x86,switch_mm: skip atomic operations for init_mm"
The tests are run with the following options:
./super_netperf 300 -P 0 -t TCP_RR -p 8888 -H <target_host> -l 60 -- -r 100,100 -o -s 1M,1M -S 1M,1M
Here are the results:
Kernel                    flushed_while_lazy  Throughput    %-cpu on switch_mm_irqs_off()
Baseline                  N/A                 2.02756e+06   2.82% (1.54% in ctx of netserver + 1.28% in ctx of swapper)
Baseline + lazy-tlb       16                  1.92706e+06   1.19% (only in ctx of swapper, same for all cases below)
Baseline + lazy-tlb       4                   2.03863e+06   1.12%   <- Good option 1
Baseline + lazy-tlb       2                   1.93847e+06   1.22%
Baseline + both patches   64                  1.92219e+06   1.09%
Baseline + both patches   16                  1.93597e+06   1.15%
Baseline + both patches   8                   1.92175e+06   1.11%
Baseline + both patches   4                   2.03465e+06   1.09%   <- Good option 2
Baseline + both patches   2                   2.03603e+06   1.10%   <- Good option 3
Baseline + both patches   1                   1.9241e+06    1.06%
I highlighted 3 good options above. They are about the same for this
test case.
Thanks,
Song