linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.11.1411202204030.3950@nanos>
Date:	Thu, 20 Nov 2014 22:58:26 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Tejun Heo <tj@...nel.org>
cc:	Frederic Weisbecker <fweisbec@...il.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>, Don Zickus <dzickus@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	the arch/x86 maintainers <x86@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Andy Lutomirski <luto@...capital.net>,
	Arnaldo Carvalho de Melo <acme@...stprotocols.net>
Subject: Re: frequent lockups in 3.18rc4

On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > > Are we talking about different per cpu allocators here or am I missing
> > > something completely non obvious?
> > 
> > That's the same allocator yeah. So if the whole memory is dereferenced,
> > faults shouldn't happen indeed.
> > 
> > Maybe that was a bug a few years ago but not anymore.
> 
> It has been always like that tho.  Percpu memory given out is always
> populated and cleared.
> 
> > Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> > After all it's allocated with vzalloc() so that part could be skipped. The memset(0)
> 
> The vzalloc call is for the internal allocation bitmap not the actual
> percpu memory area.  The actual address areas for percpu memory are
> obtained using pcpu_get_vm_areas() call and later get populated using
> map_kernel_range_noflush() (flush is performed after mapping is
> complete).
> 
> Trying to remember what happens with vmalloc_fault().  Ah okay, so
> when a new PUD gets created for vmalloc area, we don't go through all
> PGDs and update them.  The PGD entries get faulted in lazily.  Percpu
> memory allocator clearing or not clearing the allocated area doesn't
> have anything to do with it.  The memory area is always fully
> populated in the kernel page table.  It's just that the population
> happened while a different PGD was active and this PGD hasn't been
> populated with the new PUD yet.

It's completely undocumented behaviour, whether it has been that way
for ever or not. And I agree with Fredric, that it is insane. Actuallu
it's beyond insane, really.

> So, yeap, vmalloc_fault() can always happen when accessing vmalloc
> areas and the only way to avoid that would be removing lazy PGD
> population - going through all PGDs and populating new PUDs
> immediately.

There is no requirement to go through ALL PGDs and populate that stuff
immediately.

Lets look at the two types of allocations

   1) Kernel percpu allocations

   2) Per process/task percpu allocations

Of course we do not have a way to distinguish those, but we really
should have one.

#1 Kernel percpu allocations usually happen in the context of driver
   bringup, subsystem initialization, interrupt setup etc.

   So this is functionality which is not a hotpath and usually
   requires some form of synchronization versus the rest of the system
   anyway.

   The per cpu population stuff is serialized with a mutex anyway, so
   what's wrong to have a globaly visible percpu sequence counter,
   which is incremented whenever a new allocation is populated or torn
   down?

   We can make that sequence counter a per cpu variable as well to
   avoid the issues of a global variable (preferrably that's a
   compile/boot time allocated percpu variable to avoid the obvious
   circulus vitiosus)

   Now after that increment the allocation side needs to wait for a
   scheduling cycle on all cpus (we have mechanisms for that)
   
   So in the scheduler if the same task gets reselected you check that
   sequence count and update the PGD if different. If a task switch
   happens then you also need to check the sequence count and act
   accordingly.

   If we make the sequence counter a percpu variable as outlined above
   the overhead of checking this is just noise versus the other
   nonsense we do in schedule().


#2 That's process related statistics and instrumentation stuff.

   Now that just needs a immediate population on the process->mm->pgd
   aside of the init_mm.pgd, but that's really not a big deal.

Of course that does not solve the issues we have with the current
infrastructure retroactively, but it allows us to avoid fuckups like
the one Frederic was talking about that perf invented its own kmalloc
based 'percpu' replacement just to workaround the shortcoming in a
particular place.

What really frightens me is the potential and well hidden fuckup
potential which lurks around the corner and the hard to debug once in
a while fallout which might be caused by this.

Thanks,

	tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/