Message-ID: <67BFA611-F69E-4AE4-A03F-2EF546DC291A@vmware.com>
Date:   Fri, 31 May 2019 22:07:46 +0000
From:   Nadav Amit <namit@...are.com>
To:     Andy Lutomirski <luto@...capital.net>
CC:     Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...el.com>,
        Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>
Subject: Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages
 for flushing

> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@...capital.net> wrote:
> 
> 
> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@...are.com> wrote:
> 
>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@...nel.org> wrote:
>>> 
>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@...are.com> wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> as the following conditions are met:
>>>> 
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>> cause machine-checks.
>>>> 
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>> NMI handlers and kprobes would avoid accessing userspace.
>>> 
>>> I think I need to ask the big picture question.  When someone calls
>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>> were freed, they want the guarantee that future accesses (initiated
>>> observably after the flush returns) will not use paging entries that
>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>> need the guarantee that any effects from any memory access using the
>>> old paging entries will become globally visible before
>>> flush_tlb_mm_range().
>>> 
>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>> it finishes, but interrupt entries are epically slow, and I don't know
>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>> remote CPU?  There are a handful of places where we touch user memory
>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>> IRQs with iopl().
>>> 
>>> I *think* that Intel has stated recently that SMT siblings are
>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>> them, but SMT is very special.
>>> 
>>> My general conclusion is that I think the code needs to document what
>>> is guaranteed and why.
>> 
>> I think I might have managed to confuse you with a bug I introduced (a
>> last-minute bug from some cleanup). This bug does not affect performance
>> much, but it might have led you to think that I use the APIC send as
>> synchronization.
>> 
>> The idea is not to rely on the write to the ICR being serializing. The
>> flow should be as follows:
>> 
>> 
>>   CPU0                    CPU1
>> 
>> flush_tlb_mm_range()
>> __smp_call_function_many()
>> [ prepare call_single_data (csd) ]
>> [ lock csd ] 
>> [ send IPI ]
>>   (*)
>> [ wait for csd to be unlocked ]
>>                   [ interrupt ]
>>                   [ copy csd info to stack ]
>>                   [ csd unlock ]
>> [ find csd is unlocked ]
>> [ continue (**) ]
>>                   [ flush TLB ]
>> 
>> 
>> At (**) the pages might be recycled, written back to disk, etc. Note that
>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>> the csd will already be unlocked by the time CPU0 starts waiting for it.
>> 
>> As you can see, I don’t rely on any special micro-architectural behavior.
>> The synchronization is done purely in software.
>> 
>> Does it make more sense now?
> 
> Yes.  Have you benchmarked this?

Partially. Numbers are indeed worse. Here are preliminary results, comparing
to v1 (concurrent):

    n_threads   before   concurrent     +async
    ---------   ------   ----------     ------
    1           661      663            663
    2           1436     1225 (-14%)    1115 (-22%)
    4           1571     1421 (-10%)    1289 (-18%)

Note that the benefit of “async” would be greater if the initiator does not
flush the TLB at all. This might happen in the case of kswapd, for example.
Let me try some micro-optimizations first, run more benchmarks and get back
to you.
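
For reference, the software-only handshake in the flow quoted above can be
sketched in user space roughly as follows. This is only an illustration under
loud assumptions, not the kernel code: struct csd_sketch, remote_cpu() and
run_handshake() are hypothetical names, a pthread stands in for CPU1, an
atomic flag stands in for the IPI, and C11 acquire/release atomics stand in
for whatever barriers the real csd lock/unlock uses. It also does not model
the requirement that nothing touches userspace before the deferred flush.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's call_single_data (csd). */
struct csd_sketch {
    atomic_int locked;   /* 1 while the initiator must not reuse the csd */
    atomic_int ipi;      /* stands in for the IPI; not a real interrupt */
    int flush_arg;       /* payload the "remote CPU" has to copy out */
    int seen_arg;        /* value the remote ended up flushing with */
};

/* "CPU1": take the interrupt, copy csd info to the stack, unlock, flush. */
static void *remote_cpu(void *p)
{
    struct csd_sketch *csd = p;

    /* [ interrupt ] */
    while (!atomic_load_explicit(&csd->ipi, memory_order_acquire))
        ;

    /* [ copy csd info to stack ] */
    int local_arg = csd->flush_arg;

    /* [ csd unlock ]: from here on the initiator may reuse the csd */
    atomic_store_explicit(&csd->locked, 0, memory_order_release);

    /* [ flush TLB ]: placeholder, uses only the stack copy */
    csd->seen_arg = local_arg;
    return NULL;
}

/* "CPU0": returns the argument the remote side actually used. */
int run_handshake(int arg)
{
    struct csd_sketch csd;
    pthread_t remote;

    /* [ prepare csd ] and [ lock csd ] */
    atomic_init(&csd.locked, 1);
    atomic_init(&csd.ipi, 0);
    csd.flush_arg = arg;
    csd.seen_arg = 0;

    int err = pthread_create(&remote, NULL, remote_cpu, &csd);
    assert(err == 0);

    /* [ send IPI ] */
    atomic_store_explicit(&csd.ipi, 1, memory_order_release);

    /* (*) a local TLB flush could overlap with the remote side here */

    /* [ wait for csd to be unlocked ] */
    while (atomic_load_explicit(&csd.locked, memory_order_acquire))
        ;

    /* (**) continue: overwriting the payload is safe, the remote has a copy */
    csd.flush_arg = -1;

    pthread_join(remote, NULL);
    return csd.seen_arg;
}
```

With this sketch, run_handshake(42) returns 42 even though the initiator
overwrites the payload at (**), because the remote side copied it to its own
stack before unlocking the csd.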
