linux-kernel - Re: [kernel-hardening] rowhammer protection [was Re: Getting interrupt every million cache misses]

Message-ID: <20161028172710.GA10309@amd>
Date:   Fri, 28 Oct 2016 19:27:10 +0200
From:   Pavel Machek <pavel@....cz>
To:     Mark Rutland <mark.rutland@....com>
Cc:     Kees Cook <keescook@...omium.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Arnaldo Carvalho de Melo <acme@...hat.com>,
        kernel list <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        "kernel-hardening@...ts.openwall.com" 
        <kernel-hardening@...ts.openwall.com>
Subject: Re: [kernel-hardening] rowhammer protection [was Re: Getting
 interrupt every million cache misses]

Hi!

> On Fri, Oct 28, 2016 at 01:21:36PM +0200, Pavel Machek wrote:
> > > Has this been tested on a system vulnerable to rowhammer, and if so, was
> > > it reliable in mitigating the issue?
> > > 
> > > Which particular attack codebase was it tested against?
> > 
> > I have rowhammer-test here,
> > 
> > commit 9824453fff76e0a3f5d1ac8200bc6c447c4fff57
> > Author: Mark Seaborn <mseaborn@...omium.org>
> 
> ... from which repo?

https://github.com/mseaborn/rowhammer-test.git

> > I do not have a vulnerable machine near me, so no "real" tests, but
> > I'm pretty sure it will make the error no longer reproducible with the
> > newer version. [Help welcome ;-)]
> 
> Even if we hope this works, I think we have to be very careful with that
> kind of assertion. Until we have data as to its efficacy, I don't think
> we should claim that this is an effective mitigation.

On my hardware, rowhammer errors are not trivial to reproduce. It
takes time (minutes). I'm pretty sure this will be enough to stop the
exploit. If you have machines where rowhammer errors are really easy
to reproduce, testing on them would be welcome.

> > Well, I'd like to postpone debate 'where does it live' to the later
> > stage. The problem is not arch-specific, the solution is not too
> > arch-specific either. I believe we can use Kconfig to hide it from
> > users where it does not apply. Anyway, let's decide if it works and
> > where, first.
> 
> You seem to have forgotten the drammer case here, which this would not
> have protected against. I'm not sure, but I suspect that we could have
> similar issues with mappings using other attributes (e.g write-through),
> as these would cause the memory traffic without cache miss events.

Can you get me example code for x86 or x86-64? If this is trivial to
work around using movnt or something like that, it would be good to
know.

I did not go through the drammer paper in too great detail. They have
some kind of DMA-able memory, and they abuse it to do direct writes?
So you can "simply" stop providing DMA-able memory to the userland,
right? [Ok, bye bye accelerated graphics, I guess. But living w/o
graphics acceleration is preferable to remote root...]

OTOH... the exploit that scares me most is javascript sandbox
escape. I should be able to stop that... and other JIT escape cases
where untrusted code does not have access to special instructions.

On x86, there seems to be a "DATA_MEM_REFS" performance counter; if
cache misses do not account for movnt, this one should. This will need
checking.

> Perhaps, but that depends on a number of implementation details. If "too
> often" means "all the time", people will turn this off when they could
> otherwise have been protected (e.g. if we can accurately monitor the
> last level of cache).

Yup. Doing it well is preferable to doing it badly.

> > > * On some implementations, it may be that the counters are not
> > >   interchangeable, and for those this would take away
> > >   PERF_COUNT_HW_CACHE_MISSES from existing users.
> > 
> > Yup. Note that with this kind of protection, one missing performance
> > counter is likely to be a small problem.
> 
> That depends. Who chooses when to turn this on? If it's down to the
> distro, this can adversely affect users with perfectly safe DRAM.

You don't want this enabled on machines with working DRAM, there will
be performance impact.
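
For context, this is roughly what an existing user of that counter
looks like from userspace. It is only a sketch, not the code from the
patch: the helper names are made up for illustration, and the period
of one million comes from the subject line, not from a tested setup.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe a counter that overflows once every
 * `period` cache misses. The counter starts disabled.
 */
static void cache_miss_attr_init(struct perf_event_attr *attr,
				 unsigned long long period)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_HARDWARE;
	attr->size = sizeof(*attr);
	attr->config = PERF_COUNT_HW_CACHE_MISSES;
	attr->sample_period = period;
	attr->disabled = 1;
}

/* glibc ships no wrapper for perf_event_open(2), so call it directly. */
static long perf_event_open_raw(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

On hardware where the counters are not interchangeable, a request like
this is what would lose out if the protection holds the cache-miss
counter permanently.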

> > > > +	/* FIXME msec per usec, reverse logic? */
> > > > +	if (delta < 64 * NSEC_PER_MSEC)
> > > > +		mdelay(56);
> > > > +}
> > > 
> > > If I round-robin my attack across CPUs, how much does this help?
> > 
> > See below for new explanation. With 2 CPUs, we are fine. On monster
> > big-little 8-core machines, we'd probably trigger protection too
> > often.
> 
> We see larger core counts in mobile devices these days. In China,
> octa-core phones are popular, for example. Servers go much larger.

Well, I can't help everyone :-(. On servers, there's ECC. On phones,
well, don't buy broken machines. This will work, but performance
impact will not be nice.
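
To make the quoted check concrete, here is a minimal userspace sketch
of the decision it encodes, assuming delta is the time in nanoseconds
that one counter period took. The 64 ms window and the 56 ms delay are
the constants from the quoted hunk; the helper names are hypothetical
and not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Constants taken from the quoted hunk. */
#define REFRESH_WINDOW_NS (64 * NSEC_PER_MSEC)
#define THROTTLE_DELAY_NS (56 * NSEC_PER_MSEC)

/* Hypothetical: true when one counter period finished inside the window. */
static bool rh_should_throttle(uint64_t delta_ns)
{
	return delta_ns < REFRESH_WINDOW_NS;
}

/* Hypothetical: how long one counter period takes once the delay is added. */
static uint64_t rh_period_after_throttle(uint64_t delta_ns)
{
	if (rh_should_throttle(delta_ns))
		return delta_ns + THROTTLE_DELAY_NS;
	return delta_ns;
}
```

With these constants, a period that completed in 8 ms is stretched to
64 ms, while a period that already took 64 ms or longer is left alone.
Since the counters are per-CPU, this only bounds the rate seen by one
CPU, which is why the number of attacking CPUs matters.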

> > +static struct perf_event_attr rh_attr = {
...
> > +	.sample_period = 10000,
> > +};
> 
> What kind of overhead (just from taking the interrupt) will this come
> with?

This is not used, see below.

> > +/*
> > + * How often is the DRAM refreshed. Setting it too high is safe.
> > + */
> 
> Stale comment? Given the check against delta below, this doesn't look to
> be true.

Thinko, actually. Too low is safe, AFAICT.

> > +/*
> > + * DRAM is shared between CPUs, but these performance counters are per-CPU.
> > + */
> > +	int max_attacking_cpus = 2;
> 
> As above, many systems today have more than two CPUs. In the drammer
> paper, it looks like the majority had four.

We can set this automatically, and we should also take cpu hotplug
into account. But let's get it working first.
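
A rough sketch of what "set this automatically" could look like,
assuming the per-CPU miss budget is a system-wide budget divided by
the number of CPUs that could attack at once. In the kernel this would
presumably use num_online_cpus() and be recomputed on hotplug; the
userspace version below uses sysconf() instead. The helper names and
the budget values are hypothetical.

```c
#include <stdint.h>
#include <unistd.h>

/* Number of CPUs currently online, never less than 1. */
static unsigned int rh_online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? (unsigned int)n : 1;
}

/*
 * Hypothetical: split a system-wide cache-miss budget evenly, so that
 * attacking_cpus CPUs hammering together stay within total_budget.
 */
static uint64_t rh_per_cpu_budget(uint64_t total_budget,
				  unsigned int attacking_cpus)
{
	if (attacking_cpus == 0)
		attacking_cpus = 1;
	return total_budget / attacking_cpus;
}
```

The division also shows the cost: going from 2 to 8 possible attackers
cuts each CPU's budget to a quarter, so the protection triggers that
much more often on legitimate workloads.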

Actually, in the ARM case (Drammer), it may be better to stop the
exploit some other way. Turning off/redesigning GPU acceleration should
work there, right?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
