[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <YnupYin9knBu6c91@antec>
Date: Wed, 11 May 2022 21:17:38 +0900
From: Stafford Horne <shorne@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "Jason A. Donenfeld" <Jason@...c4.com>,
Andy Shevchenko <andriy.shevchenko@...el.com>,
Mikulas Patocka <mpatocka@...hat.com>,
Andy Shevchenko <andy@...nel.org>,
device-mapper development <dm-devel@...hat.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Crypto Mailing List <linux-crypto@...r.kernel.org>,
Herbert Xu <herbert@...dor.apana.org.au>,
"David S. Miller" <davem@...emloft.net>,
Mike Snitzer <msnitzer@...hat.com>,
Mimi Zohar <zohar@...ux.ibm.com>,
Milan Broz <gmazyland@...il.com>
Subject: Re: [PATCH v2] hex2bin: make the function hex_to_bin constant-time
On Sun, May 08, 2022 at 09:37:26AM +0900, Stafford Horne wrote:
> On Thu, May 05, 2022 at 05:38:58AM +0900, Stafford Horne wrote:
> > On Wed, May 04, 2022 at 01:10:03PM -0700, Linus Torvalds wrote:
> > > On Wed, May 4, 2022 at 12:58 PM Stafford Horne <shorne@...il.com> wrote:
> > > >
> > > > I have uploaded a diff I created here:
> > > > https://gist.github.com/54334556f2907104cd12374872a0597c
> > > >
> > > > It shows the same output.
> > >
> > > In hex_to_bin itself it seems to only be a difference due to some
> > > register allocation (r19 and r3 switched around).
> > >
> > > But then it gets inlined into hex2bin and there changes there seem to
> > > be about instruction and basic block scheduling, so it's a lot harder
> > > to see what's going on.
> > >
> > > And a lot of constant changes, which honestly look just like code code
> > > moved around by 16 bytes and offsets changed due to that.
> > >
> > > So I doubt it's hex_to_bin() that is causing problems, I think it's
> > > purely code movement. Which explains why adding a nop or a fake printk
> > > fixes things.
> > >
> > > Some alignment assumption that got broken?
> >
> > This is what it looks like to me too. I will have to do a deep dive on what is
> > going on with this particular build combination as I can't figure out what it is
> > off the top of my head.
> >
> > This test is using a gcc 11 compiler, I tried with my gcc 12 toolchain and the
> > issue cannot be reproduced.
> >
> > - musl gcc 11 - https://musl.cc/or1k-linux-musl-cross.tgz
> > - openrisc gcc 12 - https://github.com/openrisc/or1k-gcc/releases/tag/or1k-12.0.1-20220210-20220304
> >
> > But again the difference between the two compiler outputs is a lot of register
> > allocation and offsets changes. Its not easy to see anything that stands out.
> > I checked the change log for the openrisc specific changes from gcc 11 to gcc
> > 12. Nothing seems to stand out, mcount profiler fix for PIC, a new large binary
> > link flag.
>
> Hello,
>
> Just an update on this. The issue so far has been traced to the alignment of
> the crypto multiplication function fe_mul_impl in lib/crypto/curve25519-fiat32.c.
>
> This patch e5be15767e7e ("hex2bin: make the function hex_to_bin constant-time")
> allowed for the alignment to be just right to cause the failure. Without
> this patch and forcing the alignment we can reproduce the issue. So though the
> bisect is correct, this patch is not the root cause of the issue.
>
> Using some l.nop sliding techniques and some strategically placed .align
> statements I have been able to reproduce the issue on:
>
> - gcc 11 and gcc 12
> - preempt and non-preempt kernels
>
> I have not been able to reproduce this on my FPGA, so far only QEMU. My
> hunch now is that since the fe_mul_impl function contains some rather long basic
> blocks it appears that timer interrupts that interrupt qemu mid basic block
> execution somehow is causing this. The hypothesis is it may be basic block
> resuming behavior in qemu that causes incosistent behavior.
>
> It will take a bit more time to trace this. Since I maintain OpenRISC QEMU the
> issue is on me.
>
> Again, It's safe to say that patch e5be15767e7e ("hex2bin: make the function
> hex_to_bin constant-time") is not an issue.
This issue has been fixed. I sent a patch to QEMU for it:
- https://lore.kernel.org/qemu-devel/20220511120541.2242797-1-shorne@gmail.com/T/#u
The issue was a bug in the OpenRISC emulation in QEMU which was triggered when
receiving a TICK TIMER interrupt, in a delay slot, on a page boundary.
The fix was simple enough, but investigation took quite some work.
Thanks,
-Stafford
Powered by blists - more mailing lists