[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180118234955.nlo55rw2qsfnavfm@node.shutemov.name>
Date: Fri, 19 Jan 2018 02:49:55 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>
Cc: Andrea Arcangeli <aarcange@...hat.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Tony Luck <tony.luck@...el.com>,
Vlastimil Babka <vbabka@...e.cz>,
Michal Hocko <mhocko@...nel.org>,
"hillf.zj" <hillf.zj@...baba-inc.com>,
Hugh Dickins <hughd@...gle.com>,
Oleg Nesterov <oleg@...hat.com>,
Rik van Riel <riel@...hat.com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Vladimir Davydov <vdavydov.dev@...il.com>,
Ingo Molnar <mingo@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>,
the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [mm 4.15-rc8] Random oopses under memory pressure.
On Thu, Jan 18, 2018 at 09:26:25AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:56 AM, Kirill A. Shutemov
> <kirill@...temov.name> wrote:
> >
> > I can't say I fully grasp how 'diff' got this value and how it leads to both
> > checks being false.
>
> I think the problem is that page difference when they are in different sections.
>
> When you do
>
> pte_page(*pvmw->pte) - pvmw->page
>
> then the compiler takes the pointer difference, and then divides by
> the size of "struct page" to get an index.
>
> But - and this is important - it does so knowing that the division it
> does will have no modulus: the two 'struct page *' pointers are really
> in the same array, and they really are 'n*sizeof(struct page)' apart
> for some 'n'.
>
> That means that the compiler can optimize the division. In fact, for
> this case, gcc will generate
>
> subl %ebx, %eax
> sarl $3, %eax
> imull $-858993459, %eax, %eax
>
> because 'struct page' is 40 bytes in size, and that magic sequence
> happens to divide by 40 (first divide by 8, then that magical "imull"
> will divide by 5 *IFF* the thing is evenly divisible by 5 (and not too
> big - but the shift guarantees that).
>
> Basically, it's a magic trick, because real divides are very
> expensive, but you can fake them more quickly if you can limit the
> input domain.
>
> But what does it mean if the two "struct page *" are not in the same
> array, and the two arrays were allocated not aligned exactly 40 bytes
> away, but some random number of pages away?
>
> You get *COMPLETE*GARBAGE* when you do the above optimized divide.
> Suddenly the divide had a modulus (because the base of the two arrays
> weren't 40-byte aligned), and the "trick" doesn't work.
>
> So that's why you can't do pointer diffs between two arrays. Not
> because you can't subtract the two pointers, but because the
> *division* part of the C pointer diff rules leads to issues.
Thanks a lot for the explanation!
I wounder if this may be a problem in other places?
For instance, perf uses address of a mutex to determinate the lock
ordering. See mutex_lock_double(). The mutex is embedded into struct
perf_event_context, which is allocated with kzalloc() so I don't see how
we can presume that alignment is consistent between them.
I don't think it's the only example in kernel. Are we just lucky?
--
Kirill A. Shutemov
Powered by blists - more mailing lists