Date:	Fri, 08 Jun 2012 12:41:56 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH 2/3] x86: Remove cmpxchg from i386 NMI nesting code

On Fri, 2012-06-08 at 09:10 -0700, H. Peter Anvin wrote:

> Okay, slight interrupt here.

Pun intended?

> 
> The cost of this on real hardware better be zero (which I cannot
> immediately judge.)

Is dec_and_test cheaper than cmpxchg? I would think so.

> 
> Why?  Because cmpxchg has been in every CPU since the i486, the i386 is
> royally crippled on Linux anyway (due to minor architectural defects,
> the main one being the write protect issue) and NMI is almost never used
> on i386 as anything other than a fatal error indication.
> 
> Most "real" NMI users generate the NMI from the local APIC, but the i386
> has no local APIC, and unlike the i486 cannot even have an external
> local APIC to the best of my knowledge.

Yeah, this is why I didn't rush to do this change. But it does seem to
make the code simpler and it may actually speed things up. It replaces a
cmpxchg with a local_dec_and_test, which, I believe, doesn't even need
to lock the cache line.
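
For reference (quoting arch/x86/include/asm/local.h from memory, so take
the exact asm with a grain of salt), local_dec_and_test() on x86 is just
a plain dec with no LOCK prefix, because the data is only ever touched by
the CPU that owns it:

	/* roughly what arch/x86/include/asm/local.h has (from memory) */
	static inline int local_dec_and_test(local_t *l)
	{
		unsigned char c;

		asm volatile(_ASM_DEC "%0; sete %1"
			     : "+m" (l->a.counter), "=qm" (c)
			     : : "memory");
		return c != 0;
	}

Compare that to cmpxchg(), which on SMP gets a LOCK prefix and has to own
the cache line exclusively.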

So let's look at the patch in detail, shall we?


>  enum nmi_states {
> -       NMI_NOT_RUNNING,
> +       NMI_NOT_RUNNING = 0,

This change was done mainly to document that the first element must be
zero. Even though C already guarantees this, I wanted to point out that
we expect it to be zero and that it being zero really does matter. No
functional change whatsoever.
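
Trivial standalone illustration of the guarantee, if anyone wants to
convince themselves:

	#include <assert.h>

	/* C99 6.7.2.2: with no '= value', the first enumerator is 0 and
	 * each later one is the previous plus one. */
	enum nmi_states { NMI_NOT_RUNNING, NMI_EXECUTING, NMI_LATCHED };

	int main(void)
	{
		assert(NMI_NOT_RUNNING == 0);
		assert(NMI_EXECUTING   == 1);
		assert(NMI_LATCHED     == 2);
		return 0;
	}

The zero matters because per-cpu memory starts out zeroed and because the
local_dec_and_test() below counts down to zero.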

>         NMI_EXECUTING,
>         NMI_LATCHED,
>  };
> -static DEFINE_PER_CPU(enum nmi_states, nmi_state);
> +static DEFINE_PER_CPU(local_t, nmi_state);

local_t is just an atomic_long_t, which on i386 is no different from
what the enum would be.
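
(For the record, if I'm remembering include/asm-generic/local.h right,
it's just:

	typedef struct
	{
		atomic_long_t a;
	} local_t;

so on i386 it's a single long underneath, same storage as the enum.)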

>  
>  #define nmi_nesting_preprocess(regs)                                   \
>         do {                                                            \
> -               if (__get_cpu_var(nmi_state) != NMI_NOT_RUNNING) {      \
> -                       __get_cpu_var(nmi_state) = NMI_LATCHED;         \
> +               local_t *__state = &__get_cpu_var(nmi_state);           \
> +               if (local_read(__state) != NMI_NOT_RUNNING) {           \
> +                       local_set(__state, NMI_LATCHED);                \

The above change is probably a slight speed up, because we remove the
double '__get_cpu_var()' and replace it with a pointer that is reused. I
haven't looked at the assembly for this, but it should be either the same
as or better than what is there now.

Sure, we could improve this further by using this_cpu_read() and
this_cpu_write() (see the sketch below), which may be better than the
patch. But the patch as it stands is the same or better than what is
there now.
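
For the record, the this_cpu variant I have in mind would look something
like this (untested sketch of mine, not part of the patch; note that
this_cpu_read()/this_cpu_write() want a scalar, so this fits the plain
enum version, while the local_t version would go through this_cpu_ptr()
instead):

	/* sketch only: nesting check using the this_cpu ops, so gcc can
	 * emit segment-relative accesses without computing the address */
	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
		this_cpu_write(nmi_state, NMI_LATCHED);
		return;
	}
	this_cpu_write(nmi_state, NMI_EXECUTING);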

>                         return;                                         \
>                 }                                                       \
> -       nmi_restart:                                                    \
> -               __get_cpu_var(nmi_state) = NMI_EXECUTING;               \
> -       } while (0)
> +               local_set(__state, NMI_EXECUTING);                      \
> +       } while (0);                                                    \
> +       nmi_restart:

Here it's the same as or better than what is there now, as we again avoid
re-reading the per-cpu pointer in case gcc doesn't optimize it nicely.
But again, we could have switched to this_cpu_write(), which could be
better still.

The movement of nmi_restart helps too. I'll explain that below.

>  
>  #define nmi_nesting_postprocess()                                      \
>         do {                                                            \
> -               if (cmpxchg(&__get_cpu_var(nmi_state),                  \
> -                   NMI_EXECUTING, NMI_NOT_RUNNING) != NMI_EXECUTING)   \
> +               if (!local_dec_and_test(&__get_cpu_var(nmi_state)))     \

Now this is where I think the patch really helps. I'm almost certain that
local_dec_and_test() is faster than a cmpxchg by many cycles, especially
on i386.
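
(If anyone wants to see the difference in userspace, here's a toy of my
own, not kernel code, and the numbers will vary by machine, but the
direction should hold: a plain decrement versus a LOCK'd compare-and-swap
on private data.)

	/* build: gcc -O2 dec_vs_cmpxchg.c (add -lrt on old glibc) */
	#include <stdio.h>
	#include <time.h>

	#define ITERS 100000000L

	static double secs(struct timespec a, struct timespec b)
	{
		return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	}

	int main(void)
	{
		volatile long v = ITERS;
		struct timespec t0, t1, t2;
		long i;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < ITERS; i++)	/* plain RMW, like local_dec */
			v--;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		v = ITERS;
		for (i = 0; i < ITERS; i++)	/* LOCK cmpxchg via gcc builtin */
			__sync_val_compare_and_swap(&v, v, v - 1);
		clock_gettime(CLOCK_MONOTONIC, &t2);

		printf("plain dec: %.2fs   lock cmpxchg: %.2fs\n",
		       secs(t0, t1), secs(t1, t2));
		return 0;
	}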

>                         goto nmi_restart;                               \

And since local_dec_and_test(), on failure, leaves the state at
NMI_EXECUTING (NMI_LATCHED just decrements back down to it), we can
simply jump back to the new location of nmi_restart instead of having to
reset the state above.

This also closes the race where another NMI comes in at this moment and
we lose the latch. That was a small problem anyway, as we are about to
repeat the NMI regardless, and it would be the same as the hardware case
where the NMI came in just before we finished the last one. We still only
repeat once. This just changes where we consider the repeated NMI to
start (where a new latch can happen). But functionality wise, it's no
different.
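
If it helps, here's a dumb userspace model I knocked together to show the
replay behaviour (my own toy, not the kernel code; the injected nested
NMI stands in for real hardware):

	#include <stdio.h>

	enum nmi_states { NMI_NOT_RUNNING = 0, NMI_EXECUTING, NMI_LATCHED };

	static long nmi_state;		/* stands in for the per-cpu local_t */
	static int injected;		/* inject exactly one nested NMI */

	/* what the nesting check does when an NMI arrives mid-handler */
	static void nested_nmi(void)
	{
		if (injected++)
			return;
		if (nmi_state != NMI_NOT_RUNNING) {
			nmi_state = NMI_LATCHED;	/* remember it for a replay */
			return;
		}
		nmi_state = NMI_EXECUTING;
	}

	static void nmi_handler(void)
	{
		nmi_state = NMI_EXECUTING;
	nmi_restart:
		printf("running handler, state=%ld\n", nmi_state);
		nested_nmi();

		/* the local_dec_and_test() step:
		 * NMI_LATCHED(2) -> NMI_EXECUTING(1): not zero, replay;
		 * NMI_EXECUTING(1) -> NMI_NOT_RUNNING(0): zero, done. */
		if (--nmi_state != NMI_NOT_RUNNING)
			goto nmi_restart;
	}

	int main(void)
	{
		nmi_handler();	/* prints twice: the original pass plus one replay */
		return 0;
	}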

-- Steve

>         } while (0)
