linux-kernel - Re: [mm 4.15-rc8] Random oopses under memory pressure.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180118234955.nlo55rw2qsfnavfm@node.shutemov.name>
Date:   Fri, 19 Jan 2018 02:49:55 +0300
From:   "Kirill A. Shutemov" <kirill@...temov.name>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Andrea Arcangeli <aarcange@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Tony Luck <tony.luck@...el.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Michal Hocko <mhocko@...nel.org>,
        "hillf.zj" <hillf.zj@...baba-inc.com>,
        Hugh Dickins <hughd@...gle.com>,
        Oleg Nesterov <oleg@...hat.com>,
        Rik van Riel <riel@...hat.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Ingo Molnar <mingo@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [mm 4.15-rc8] Random oopses under memory pressure.

On Thu, Jan 18, 2018 at 09:26:25AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:56 AM, Kirill A. Shutemov
> <kirill@...temov.name> wrote:
> >
> > I can't say I fully grasp how 'diff' got this value and how it leads to both
> > checks being false.
> 
> I think the problem is that page difference when they are in different sections.
> 
> When you do
> 
>      pte_page(*pvmw->pte) - pvmw->page
> 
> then the compiler takes the pointer difference, and then divides by
> the size of "struct page" to get an index.
> 
> But - and this is important - it does so knowing that the division it
> does will have no modulus: the two 'struct page *' pointers are really
> in the same array, and they really are 'n*sizeof(struct page)' apart
> for some 'n'.
> 
> That means that the compiler can optimize the division. In fact, for
> this case, gcc will generate
> 
>         subl    %ebx, %eax
>         sarl    $3, %eax
>         imull   $-858993459, %eax, %eax
> 
> because 'struct page' is 40 bytes in size, and that magic sequence
> happens to divide by 40 (first divide by 8, then that magical "imull"
> will divide by 5 *IFF* the thing is evenly divisible by 5 (and not too
> big - but the shift guarantees that).
> 
> Basically, it's a magic trick, because real divides are very
> expensive, but you can fake them more quickly if you can limit the
> input domain.
> 
> But what does it mean if the two "struct page *" are not in the same
> array, and the two arrays were allocated not aligned exactly 40 bytes
> away, but some random number of pages away?
> 
> You get *COMPLETE*GARBAGE* when you do the above optimized divide.
> Suddenly the divide had a modulus (because the base of the two arrays
> weren't 40-byte aligned), and the "trick" doesn't work.
> 
> So that's why you can't do pointer diffs between two arrays. Not
> because you can't subtract the two pointers, but because the
> *division* part of the C pointer diff rules leads to issues.

Thanks a lot for the explanation!

I wounder if this may be a problem in other places?

For instance, perf uses address of a mutex to determinate the lock
ordering. See mutex_lock_double(). The mutex is embedded into struct
perf_event_context, which is allocated with kzalloc() so I don't see how
we can presume that alignment is consistent between them.

I don't think it's the only example in kernel. Are we just lucky?

-- 
 Kirill A. Shutemov