linux-kernel - Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu counters

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7dbac943-890b-af16-d6a0-705b3cd609a1@arm.com>
Date:   Tue, 25 Jun 2019 14:20:55 +0100
From:   Raphael Gault <raphael.gault@....com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Mark Rutland <mark.rutland@....com>
Cc:     Arnaldo Carvalho de Melo <arnaldo.melo@...il.com>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will.deacon@....com>, szabolcs.nagy@....com
Subject: Re: [PATCH 3/7] perf: arm64: Use rseq to test userspace access to pmu
 counters

Hi Mathieu, Hi Szabolcs,

On 6/11/19 8:33 PM, Mathieu Desnoyers wrote:
> ----- On Jun 11, 2019, at 6:57 PM, Mark Rutland mark.rutland@....com wrote:
> 
>> Hi Arnaldo,
>>
>> On Tue, Jun 11, 2019 at 11:33:46AM -0300, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Jun 11, 2019 at 01:53:11PM +0100, Raphael Gault escreveu:
>>>> Add an extra test to check userspace access to pmu hardware counters.
>>>> This test doesn't rely on the seqlock as a synchronisation mechanism but
>>>> instead uses the restartable sequences to make sure that the thread is
>>>> not interrupted when reading the index of the counter and the associated
>>>> pmu register.
>>>>
>>>> In addition to reading the pmu counters, this test is run several time
>>>> in order to measure the ratio of failures:
>>>> I ran this test on the Juno development platform, which is big.LITTLE
>>>> with 4 Cortex A53 and 2 Cortex A57. The results vary quite a lot
>>>> (running it with 100 tests is not so long and I did it several times).
>>>> I ran it once with 10000 iterations:
>>>> `runs: 10000, abort: 62.53%, zero: 34.93%, success: 2.54%`
>>>>
>>>> Signed-off-by: Raphael Gault <raphael.gault@....com>
>>>> ---
>>>>   tools/perf/arch/arm64/include/arch-tests.h    |   5 +-
>>>>   tools/perf/arch/arm64/include/rseq-arm64.h    | 220 ++++++++++++++++++
>>>
>>> So, I applied the first patch in this series, but could you please break
>>> this patch into at least two, one introducing the facility
>>> (include/rseq*) and the second adding the test?
>>>
>>> We try to enforce this kind of granularity as down the line we may want
>>> to revert one part while the other already has other uses and thus
>>> wouldn't allow a straight revert.
>>>
>>> Also, can this go to tools/arch/ instead? Is this really perf specific?
>>> Isn't there any arch/arm64/include files for the kernel that we could
>>> mirror and have it checked for drift in tools/perf/check-headers.sh?
>>
>> The rseq bits aren't strictly perf specific, and I think the existing
>> bits under tools/testing/selftests/rseq/ could be factored out to common
>> locations under tools/include/ and tools/arch/*/include/.
> 
> Hi Mark,
> 
> Thanks for CCing me!
> 
> Or into a stand-alone librseq project:
> 
> https://github.com/compudj/librseq (currently a development branch in
> my own github)
> 
> I don't see why this user-space code should sit in the kernel tree.
> It is not tooling-specific.
> 
>>
>>  From a scan, those already duplicate barriers and other helpers which
>> already have definitions under tools/, which seems unfortunate. :/
>>
>> Comments below are for Raphael and Matthieu.
>>
>> [...]
>>
>>>> +static u64 noinline mmap_read_self(void *addr, int cpu)
>>>> +{
>>>> +	struct perf_event_mmap_page *pc = addr;
>>>> +	u32 idx = 0;
>>>> +	u64 count = 0;
>>>> +
>>>> +	asm volatile goto(
>>>> +                     RSEQ_ASM_DEFINE_TABLE(0, 1f, 2f, 3f)
>>>> +		     "nop\n"
>>>> +                     RSEQ_ASM_STORE_RSEQ_CS(1, 0b, rseq_cs)
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>> +                     RSEQ_ASM_OP_R_LOAD(pc_idx)
>>>> +                     RSEQ_ASM_OP_R_AND(0xFF)
>>>> +		     RSEQ_ASM_OP_R_STORE(idx)
>>>> +                     RSEQ_ASM_OP_R_SUB(0x1)
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
>>>> +                     "msr pmselr_el0, " RSEQ_ASM_TMP_REG "\n"
>>>> +                     "isb\n"
>>>> +		     RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 3f)
> 
> I really don't understand why the cpu_id needs to be compared 3 times
> here (?!?)
> 
> Explicit comparison of the cpu_id within the rseq critical section
> should be done _once_.
> 
> If the kernel happens to preempt and migrate the thread while in the
> critical section, it's the kernel's job to move user-space execution
> to the abort handler.
> 
>>>> +                     "mrs " RSEQ_ASM_TMP_REG ", pmxevcntr_el0\n"
>>>> +                     RSEQ_ASM_OP_R_FINAL_STORE(cnt, 2)
>>>> +		     "nop\n"
>>>> +                     RSEQ_ASM_DEFINE_ABORT(3, abort)
>>>> +                     :/* No output operands */
>>>> +		     :  [cpu_id] "r" (cpu),
>>>> +			[current_cpu_id] "Qo" (__rseq_abi.cpu_id),
>>>> +			[rseq_cs] "m" (__rseq_abi.rseq_cs),
>>>> +			[cnt] "m" (count),
>>>> +			[pc_idx] "r" (&pc->index),
>>>> +			[idx] "m" (idx)
>>>> +                     :"memory"
>>>> +                     :abort
>>>> +                    );
>>
>> While baroque, this doesn't look as scary as I thought it would!
> 
> That's good to hear :)
> 
>>
>> However, I'm very scared that this is modifying input operands without
>> clobbering them. IIUC this is beacause we're trying to use asm goto,
>> which doesn't permit output operands.
> 
> This is correct. What is wrong with modifying the target of "m" input
> operands in an inline asm that has a "memory" clobber ?
> 
> gcc documentation at https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
> states:
> 
> "An asm goto statement cannot have outputs. This is due to an internal
> restriction of the compiler: control transfer instructions cannot have
> outputs. If the assembler code does modify anything, use the "memory"
> clobber to force the optimizers to flush all register values to memory
> and reload them if necessary after the asm statement."
> 
> If there is a problem with this approach, an alternative would be to
> pass &__rseq_abi.rseq.cs as a "r" input operand, explicitly dereference
> it in the assembly, and use the "memory" clobber to ensure the compiler
> knows that there are read/write references to memory.
> 
>> I'm very dubious to abusing asm goto in this way. Can we instead use a
>> regular asm volatile block, and place the abort handler _within_ the
>> asm? If performance is a concern, we can use .pushsection and
>> .popsection to move that far away...
> 
> Let's dig into what would be needed in order to move the abort into the
> asm block.
> 
> One approach would be to make that asm block return a nonzero value in
> an output register, and put zero in that register in the non-abort case,
> and then have a conditional check in C on that register to check
> whether it needs to branch to the abort. This adds overhead we want
> to avoid.
> 
> Another alternative would be to perform the entire abort handler in
> the same assembly block as the rseq critical section. However, this
> prevents us from going back to C to handle the abort, which is unwanted.
> For instance, in the use-case of perf counters on aarch64, a good
> fallback on abort would be to call the perf system call to read the
> value of the performance counter. However, requiring that the abort be
> implemented within the rseq assembly block would require that we
> re-implement system call invocation in user-space for this, which
> is rather annoying.
> 
>>
>>>> +
>>>> +	if (idx)
>>>> +		count += READ_ONCE(pc->offset);
>>
>> I'm rather scared that from GCC's PoV, idx was initialized to zero, and
>> not modified above (per the asm constraints). I realise that we've used
>> an "m" constraint and clobbered memory, but I could well imagine that
>> GCC can interpret that as needing to place a read-only copy in memory,
>> but still being permitted to use the original value in a register. That
>> would permit the above to be optimized away, since GCC knows no
>> registers were clobbered, and thus idx must still be zero.
> 
> I suspect this is based on a rather conservative interpretation of the
> following statement from https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:
> 
> "The "memory" clobber tells the compiler that the assembly code performs memory
> reads or writes to items other than those listed in the input and output operands"
> 
> Based on the previous sentence, it's tempting to conclude that the "m" input
> operands content is not clobbered by the "memory" clobber.
> 
> however, it is followed by this:
> 
> "Further, the compiler does not assume that any values read from memory before an
> asm remain unchanged after that asm; it reloads them as needed. Using the "memory"
> clobber effectively forms a read/write memory barrier for the compiler."
> 
> Based on this last sentence, my understanding is that a "memory" clobber would
> also clobber the content of any "m" operand.
> 
> If use of "m" (var) input-operand-as-output + "memory" clobber ends up being an
> issue, we can always fall-back to "r" (&var) input operand + "memory" clobber,
> which seems less ambiguous from a documentation standpoint.
> 
> I'd really like to have an authoritative answer from gcc folks before we start
> changing this in all rseq asm for all architectures.
> 

Hi Szabolcs, we would really appreciate to see what your opinion is on 
this matter.

>>
>>>> +
>>>> +	return count;
>>
>> ... and for similar reasons, always return zero here.
>>
>>>> +abort:
>>>> +        pr_debug("Abort handler\n");
>>>> +        exit(-2);
>>>> +}
>>
>> Given the necessary complexity evident above, I'm also fearful that the
>> sort of folk that want userspace counter access aren't going to bother
>> with the above.
> 
> The abort handler should be implemented in C, simply invoking the perf
> system call which lets the kernel perform the perf counter read.


Thanks,

-- 
Raphael Gault