linux-kernel - RE: [PATCH v3 4/6] x86/mce: Avoid tail copy when machine check terminated a copy from user


Message-ID: <65c0345e142c4c46a5cfd8f8b51489aa@AcuMS.aculab.com>
Date:   Wed, 7 Oct 2020 21:11:35 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "'Luck, Tony'" <tony.luck@...el.com>,
        Borislav Petkov <bp@...en8.de>
CC:     "Song, Youquan" <youquan.song@...el.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v3 4/6] x86/mce: Avoid tail copy when machine check
 terminated a copy from user

From: Luck, Tony
> Sent: 07 October 2020 19:50
> >> Machine checks are more serious. Just give up at the point where the
> >> main copy loop triggered the #MC and return from the copy code as if
> >> the copy succeeded. The machine check handler will use task_work_add() to
> >> make sure that the task is sent a SIGBUS.
> >
> > Isn't that just plain wrong?
> 
> It isn't pretty. I'm not sure how wrong it is.
> 
> > If copy is reported as succeeding the kernel code will use the 'old'
> > data that is in the buffer as if it had been read from userspace.
> > This could end up with kernel stack data being written to a file.
> 
> I ran a test with:
> 
> 	write(fd, buf, 512)
> 
> With poison injected into buf[256] to force a machine check mid-copy.
> 
> The size of the file did get incremented by 512 rather than 256. Which isn't good.
> 
> The data in the file up to the 256 byte mark was the user data from buf[0 ... 255].
> 
> The data in the file past offset 256 was all zeroes. I suspect that isn't by chance.
> The kernel has to defend against a user writing a partial page and using mmap(2)
> on the same file to peek at data past EOF and up to the next PAGE_SIZE boundary.
> So I think it must zero new pages allocated in page cache as they are allocated to
> a file.
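
For reference, the observable part of the test quoted above (the return
value of write(2) and the resulting file size) can be checked from
userspace roughly as follows. This is only a sketch: it does not inject
poison, which needs hardware or an injection facility not shown in this
thread, and the helper name write_and_measure() and its path argument are
made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes from `buf` to a freshly
 * truncated file at `path`, then report write(2)'s return value and
 * the file size seen by fstat(2).
 * Returns 0 on success, -1 if open() or fstat() failed.
 */
static int write_and_measure(const char *path, const void *buf, size_t len,
                             ssize_t *written, off_t *size)
{
    struct stat st;
    int fd;

    assert(written && size);            /* both out-parameters are required */

    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    *written = write(fd, buf, len);     /* may be short, or -1 on error */

    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    *size = st.st_size;
    close(fd);
    return 0;
}
```

With a healthy 512-byte buffer both values should come back as 512. In the
poisoned case described above, the file size still grew by 512 even though
only the first 256 bytes came from the user buffer.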

Think about what happens to a device write or an ioctl request.
These typically get copied into on-stack buffers.
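
The concern can be modelled in userspace. The sketch below is not kernel
code: model_copy_reporting_success() is a stand-in for a copy that stops
early but still returns 0, struct demo_req is a made-up request layout,
and the `stale` fill byte stands in for whatever happened to be on the
stack beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request layout, for illustration only. */
struct demo_req {
    unsigned char payload[64];
};

/*
 * Model of a copy that is cut short at byte `fault_at` but still
 * reports success (0 bytes not copied). Not the kernel's
 * copy_from_user().
 */
static size_t model_copy_reporting_success(void *dst, const void *src,
                                           size_t len, size_t fault_at)
{
    size_t done = fault_at < len ? fault_at : len;

    memcpy(dst, src, done);
    return 0;                           /* claims everything was copied */
}

/*
 * Model of a handler that copies a request into an on-stack buffer.
 * Returns how many payload bytes did NOT come from the user buffer.
 */
static size_t model_handler(const unsigned char *user, size_t fault_at,
                            unsigned char stale)
{
    struct demo_req req;
    size_t i, leaked = 0;

    assert(fault_at <= sizeof(req));

    memset(&req, stale, sizeof(req));   /* simulate old stack contents */
    if (model_copy_reporting_success(&req, user, sizeof(req), fault_at))
        return 0;                       /* error path, never taken here */

    /* The handler now trusts all of req, including the stale tail. */
    for (i = 0; i < sizeof(req.payload); i++)
        if (req.payload[i] != user[i])
            leaked++;
    return leaked;
}
```

If the model copy stops at byte 32 of a 64-byte request, the handler goes
on to use 32 bytes that were never supplied by the caller, since nothing
tells it the copy was short.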

> > IIRC the code to try to maximise the copy has been removed.
> > So the 'slow' retry won't happen any more.
> 
> Which code has been removed (and when ... TIP, and my testing, is based on 5.9-rc1)?

The code that retries byte by byte after an initial fault.
Might only be removed from 'next' and for some architectures.
Basically almost nothing ever relies on partial copies (except mount).

IIRC there is 'magic' in the syscall exit path to convert EFAULT
into SIGSEGV.
You probably want to hijack it to generate SIGBUS?

	David
