linux-kernel - Re: selftests/x86/fsgsbase

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 26 Jan 2018 14:38:28 -0800
From:   Andy Lutomirski <luto@...nel.org>
To:     Andy Lutomirski <luto@...nel.org>, Borislav Petkov <bp@...en8.de>
Cc:     "H. Peter Anvin" <hpa@...or.com>, Dan Rue <dan.rue@...aro.org>,
        Shuah Khan <shuah@...nel.org>, Ingo Molnar <mingo@...nel.org>,
        Dmitry Safonov <dsafonov@...tuozzo.com>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: selftests/x86/fsgsbase_64 test problem

On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@...nel.org> wrote:
> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@...nel.org> wrote:
>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@...nel.org> wrote:
>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@...aro.org> wrote:
>>>>
>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>> following error:
>>>>
>>>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>                 Before schedule, set selector to 0x1
>>>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>
>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>
>>>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>
>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>>> i3-5010U.
>>>>
>>>
>>> Hmm, I can reproduce it, too.  I'll look in a bit.
>>
>> I'm triggering a different error, and I think what's going on is that
>> the kernel doesn't currently re-save GSBASE when a task switches out
>> and that task has save gsbase != 0 and in-register GS == 0.  This is
>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>> bit expensive.  I'm not sure what, if anything, to do about this.  I
>> suppose I could add some gross perf hackery to the test to detect this
>> case and suppress the error.
>>
>> I can also trigger the problem you're seeing, and I don't know what's
>> up.  It may be related to and old problem I've seen that causes signal
>> delivery to sometimes corrupt %gs.  It's deterministic, but it depends
>> in some odd way on register state.  I can currently reproduce that
>> issue 100% of the time, and I'm trying to see if I can figure out
>> what's happening.
>
> I think it's a CPU bug, and I'm a bit mystified.  I can trigger the
> following, plausibly related issue:
>
> Write a program that writes %gs = 1.
> Run that program under gdb
> break in which %gs == 1
> display/x $gs
> si
>
> Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
> changes to 0.
>
> On KVM or native, I do not observe do_debug getting called with %gs ==
> 1.  On TCG, I do.  I don't think that's precisely the problem that's
> causing the test to fail, since the test doesn't use TF or ptrace, but
> I wouldn't be shocked if it's related.
>
> hpa, any insight?
>
> (NB: if you want to play with this as I've described it, you may need
> to make invalid_selector() in ptrace.c always return false.  The
> current implementation is too strict and causes problems.)

Much simpler test.  Run the attached program (gs1).  It more or less
just sets %gs to 1 and spins until it stops being 1.  Do it on a
kernel with the attached patch applied.  I see stuff like this:

# ./gs1
PID = 129
[   15.703015] pid 129 saved gs = 1
[   15.703517] pid 129 loaded gs = 1
[   15.703973] pid 129 prepare_exit_to_usermode: gs = 1
ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in,
setting %gs to 1, observing that %gs is *still* 1 in
prepare_exit_to_usermode(), returning to usermode, and observing %gs
== 0.

Presumably what's happening is that the IRET microcode matches the
SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each SegReg in (ES, FS, GS, and DS)
  DO
    tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
    IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
    THEN (* Segment register invalid *)
      SegReg ← NULL;
    FI;
  OD;

But this is very odd.  The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector
THEN
  IF segment selector index is outside descriptor table limits
  or segment is not a data or readable code segment
  or ((segment is a data or nonconforming code segment)
  or ((RPL > DPL) and (CPL > DPL))
    THEN #GP(selector); FI;

^^^^
This makes no sense.  This says that the data segments cannot be
loaded with MOV.  Empirically, it seems like MOV works if CPL <= DPL
and RPL <= DPL, but I haven't checked that hard.

  IF segment not marked present
    THEN #NP(selector);
  ELSE
    SegmentRegister ← segment selector;
    SegmentRegister ← segment descriptor; FI;
  FI;

  IF DS, ES, FS, or GS is loaded with NULL selector
  THEN
    SegmentRegister ← segment selector;
    SegmentRegister ← segment descriptor;
    ^^^^
    wtf?  There is no "segment descriptor".  Presumably what actually
gets written to segment.DPL is nonsense.
  FI;

Anyway, I think it's nonsense that user code can load a selector using
MOV that is, in turn, rejected by IRET.  I don't suppose Intel would
consider fixing this going forward.

Borislav, any chance you could run the attached program on an AMD
machine to see what it does?

View attachment "gs1.c" of type "text/x-csrc" (588 bytes)