Date:   Sun, 3 Jun 2018 00:51:46 +0000
From:   Song Liu <songliubraving@...com>
To:     Andy Lutomirski <luto@...nel.org>
CC:     "riel@...riel.com" <riel@...riel.com>,
        Mike Galbraith <efault@....de>,
        LKML <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "Ingo Molnar" <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm



> On Jun 2, 2018, at 1:14 PM, Andy Lutomirski <luto@...nel.org> wrote:
> 
> On Fri, Jun 1, 2018 at 10:04 PM Rik van Riel <riel@...riel.com> wrote:
>> 
>> On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@...riel.com> wrote:
>>>> 
>>>> On Fri, 1 Jun 2018 14:21:58 -0700
>>>> Andy Lutomirski <luto@...nel.org> wrote:
>>>> 
>>>>> Hmm.  I wonder if there's a more clever data structure than a
>>>>> bitmap
>>>>> that we could be using here.  Each CPU only ever needs to be in
>>>>> one
>>>>> mm's cpumask, and each cpu only ever changes its own state in the
>>>>> bitmask.  And writes are much less common than reads for most
>>>>> workloads.
>>>> 
>>>> It would be easy enough to add an mm_struct pointer to the
>>>> per-cpu tlbstate struct, and iterate over those.
>>>> 
>>>> However, that would be an orthogonal change to optimizing
>>>> lazy TLB mode.
>>>> 
>>>> Does the (untested) patch below make sense as a potential
>>>> improvement to the lazy TLB heuristic?
>>>> 
>>>> ---8<---
>>>> Subject: x86,tlb: workload dependent per CPU lazy TLB switch
>>>> 
>>>> Lazy TLB mode is a tradeoff between flushing the TLB and touching
>>>> the mm_cpumask(&init_mm) at context switch time, versus potentially
>>>> incurring a remote TLB flush IPI while in lazy TLB mode.
>>>> 
>>>> Whether this pays off is likely to be workload dependent more than
>>>> anything else. However, the current heuristic keys off hardware
>>>> type.
>>>> 
>>>> This patch changes the lazy TLB mode heuristic to a dynamic, per-
>>>> CPU
>>>> decision, dependent on whether we recently received a remote TLB
>>>> shootdown while in lazy TLB mode.
>>>> 
>>>> This is a very simple heuristic. When a CPU receives a remote TLB
>>>> shootdown IPI while in lazy TLB mode, a counter in the same cache
>>>> line is set to 16. Every time we skip lazy TLB mode, the counter
>>>> is decremented.
>>>> 
>>>> While the counter is zero (no recent TLB flush IPIs), allow lazy
>>>> TLB mode.
>>> 
>>> Hmm, cute.  That's not a bad idea at all.  It would be nice to get
>>> some kind of real benchmark on both PCID and !PCID.  If nothing else,
>>> I would expect the threshold (16 in your patch) to want to be lower
>>> on
>>> PCID systems.
>> 
>> That depends on how well we manage to get rid of
>> the cpumask manipulation overhead. On the PCID
>> system we first found this issue, the atomic
>> accesses to the mm_cpumask took about 4x as much
>> CPU time as the TLB invalidation itself.
>> 
>> That kinda limits how much the cost of cheaper
>> TLB flushes actually help :)
>> 
>> I agree this code should get some testing.
>> 
> 
> Just to check: in the workload where you're seeing this problem, are
> you using an mm with many threads?  I would imagine that, if you only
> have one or two threads, the bit operations aren't so bad.

Yes, we are running netperf/netserver with 300 threads. We don't see
this much overhead with the real workload.
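
To make that concrete, here is a small userspace model (not kernel code; every
name in it is made up for the demo) of what the contention looks like: each
thread stands in for a CPU doing an atomic read-modify-write on a single shared
cpumask word at every simulated context switch, the way switch_mm_irqs_off()
clears and sets a bit in mm_cpumask(). With 48 threads the cacheline ping-pongs
constantly:

/*
 * Illustrative userspace model only -- not kernel code.  Many "CPUs"
 * toggle their own bit in one shared bitmap word, so every toggle is
 * an atomic RMW on a cacheline shared by all threads.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS   48           /* logical CPUs, as in the test below */
#define ITERATIONS 1000000UL    /* simulated context switches per thread */

static _Atomic unsigned long shared_mask;      /* one 64-bit "cpumask" word */

static void *cpu_thread(void *arg)
{
        unsigned long bit = 1UL << ((long)arg % 64);
        unsigned long i;

        for (i = 0; i < ITERATIONS; i++) {
                /* leave the old mm's cpumask ... */
                atomic_fetch_and_explicit(&shared_mask, ~bit,
                                          memory_order_relaxed);
                /* ... and join the new mm's cpumask */
                atomic_fetch_or_explicit(&shared_mask, bit,
                                         memory_order_relaxed);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, cpu_thread, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        printf("final mask: %lx\n", atomic_load(&shared_mask));
        return 0;
}

Build with something like "gcc -O2 -pthread demo.c". The only point is that
every simulated switch is an atomic RMW on a line shared by all threads, which
is the cost the switch_mm patch avoids for init_mm.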

Here are some test results. The tests are on a 2-socket system with E5-2678 v3,
so 48 logical cores with PCID. I tested 3 kernels (and tuned
flushed_while_lazy):

Baseline: 0512e01345 in upstream
Baseline + lazy-tlb: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch"   
Baseline + both patches: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch" + "x86,switch_mm: skip atomic operations for init_mm"    

The tests are run with the following options:

./super_netperf 300 -P 0 -t TCP_RR -p 8888 -H <target_host> -l 60 -- -r 100,100 -o -s 1M,1M -S 1M,1M

Here are the results:
Kernel                    flushed_while_lazy    Throughput      %-cpu-on-switch_mm_irqs_off()

Baseline                         N/A            2.02756e+06     2.82%    (1.54% in ctx of netserver + 1.28% in ctx of swapper)
Baseline + lazy-tlb              16             1.92706e+06     1.19%    (only in ctx of swapper, same for all cases below)
Baseline + lazy-tlb              4              2.03863e+06     1.12%    Good option 1
Baseline + lazy-tlb              2              1.93847e+06     1.22%
Baseline + both patches          64             1.92219e+06     1.09%
Baseline + both patches          16             1.93597e+06     1.15%
Baseline + both patches          8              1.92175e+06     1.11%
Baseline + both patches          4              2.03465e+06     1.09%    Good option 2
Baseline + both patches          2              2.03603e+06     1.10%    Good option 3
Baseline + both patches          1              1.9241e+06      1.06%

I highlighted 3 good options above. They are about the same for this 
test case. 
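
For anyone reading along, here is a minimal, untested model of the heuristic
that flushed_while_lazy tunes. Only the field name comes from the patch quoted
above; the struct, the function names, and the standalone main() are simplified
stand-ins, not the actual kernel code:

/*
 * Minimal model of the lazy TLB heuristic -- illustrative only.
 * flushed_while_lazy is the tunable from the discussion; everything
 * else is a made-up stand-in for the real per-CPU tlbstate.
 */
#include <stdbool.h>
#include <stdio.h>

#define FLUSHED_WHILE_LAZY_THRESHOLD 4   /* one of the "good option" values above */

struct cpu_tlbstate_model {
        bool is_lazy;
        int  flushed_while_lazy;         /* decays toward 0 over switches */
};

/* A remote TLB shootdown IPI arrived while this CPU was in lazy mode. */
static void remote_flush_received(struct cpu_tlbstate_model *ts)
{
        if (ts->is_lazy)
                ts->flushed_while_lazy = FLUSHED_WHILE_LAZY_THRESHOLD;
}

/*
 * At context switch to a kernel thread: allow lazy TLB mode only if we
 * have not been hit by a remote flush recently; otherwise skip lazy
 * mode and let the counter decay.
 */
static bool lazy_tlb_allowed(struct cpu_tlbstate_model *ts)
{
        if (ts->flushed_while_lazy > 0) {
                ts->flushed_while_lazy--;
                return false;
        }
        return true;
}

int main(void)
{
        struct cpu_tlbstate_model ts = { .is_lazy = true };
        int i;

        remote_flush_received(&ts);      /* one IPI while lazy */
        for (i = 0; i < 6; i++)
                printf("switch %d: lazy allowed = %d\n",
                       i, lazy_tlb_allowed(&ts));
        return 0;
}

With the threshold at 4, a single remote flush received while lazy skips lazy
mode for the next four switches and then decays back to allowing it, which is
the behavior behind the "good option" rows above.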

Thanks,
Song

