Message-ID: <20251015164952.694882104@linutronix.de>
Date: Wed, 15 Oct 2025 19:29:23 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
Gabriele Monaco <gmonaco@...hat.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Michael Jeanson <mjeanson@...icios.com>,
Jens Axboe <axboe@...nel.dk>,
"Paul E. McKenney" <paulmck@...nel.org>,
"Gautham R. Shenoy" <gautham.shenoy@....com>,
Florian Weimer <fweimer@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
TCMalloc Team <tcmalloc-eng@...gle.com>
Subject: [patch 00/19] sched: Rewrite MM CID management
This is a follow-up on the discussion and the initial attempt to address the MM
CID management issues related to the CID space compaction in task work:
https://lore.kernel.org/20250716160603.138385-8-gmonaco@redhat.com
and my naive way to fix this, which brought back parts of the regression
that caused the current implementation of the CID magic:
https://lore.kernel.org/87tt0k3oks.ffs@tglx
As a couple of people have rightfully pointed out, this helps only in the
back-to-back scheduling case of threads which belong to the same process
(mm), so I went back to the drawing board, took five steps back and
started over from a fresh plate.
There are only two set-in-stone ABI requirements:
1) It is guaranteed that at no point in time the same CID is used
by concurrent tasks in userspace.
2) The CID space must not exceed the number of possible CPUs in a
system.
The CID space compaction itself is not a functional correctness
requirement; it is only a useful optimization to reduce the
memory footprint of unused user space pools.
The optimal CID space is:
min(nr_tasks, nr_cpus_allowed);
Where @nr_tasks is the number of actual user space threads associated with
the mm.
@nr_cpus_allowed is the superset of all task affinities. It is grow-only,
as it would be insane to take a racy snapshot of all task affinities when
the affinity of one task changes, just to redo it two milliseconds later
when the next task changes its affinity.
When looking at the current implementation, there are immediately a
couple of obvious problems related to the @nr_tasks and @nr_cpus_allowed
handling.
@nr_tasks is evaluated from mm::mm_users, which is the wrong metric because
kernel users which temporarily take a reference on the MM skew the
picture. That's easy enough to fix, as it can be separately accounted for
in a mm::mm_cid specific way in fork() and exit().
@nr_cpus_allowed management does way too much useless work for the common
case where a process starts with unrestricted affinity,
i.e. @nr_cpus_allowed is equal to the number of possible CPUs right
away. So obviously this wants to check whether that limit is reached
already and then avoid the whole cpumask update and evaluation.
The update is also called from the wrong place because migrate_disable()
and migrate_enable() related updates are completely irrelevant.
migrate_disable() is restricting the runnable CPUs to the current CPU,
which obviously is in the task's affinity mask at that time.
migrate_enable() restores the actual task affinity, which might have
changed by that time, but that change is already accounted for at that
point.
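For the common case this boils down to a cheap check before even looking at
the cpumask. A minimal sketch (the function of that name shows up in the
locking rules further down; mm_cpus_allowed() stands for whatever accessor
hands out the superset mask, and the member names and details are
illustrative, not the actual patch code):

        static void mm_update_cpus_allowed(struct mm_struct *mm, const struct cpumask *newmask)
        {
                // Fast path: the superset already covers all possible CPUs
                if (mm->mm_cid.nr_cpus_allowed == num_possible_cpus())
                        return;

                guard(raw_spinlock_irqsave)(&mm->mm_cid.lock);
                // Grow-only superset of all task affinities
                cpumask_or(mm_cpus_allowed(mm), mm_cpus_allowed(mm), newmask);
                mm->mm_cid.nr_cpus_allowed = cpumask_weight(mm_cpus_allowed(mm));
        }

with the call sites reduced to the actual affinity changes instead of
migrate_disable()/migrate_enable().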
As @nr_tasks only changes on fork() and exit() and @nr_cpus_allowed only
on affinity changes it's completely bogus to do the
min(nr_tasks, nr_cpus_allowed);
as a racy snapshot all over the place. This can be done at exactly those
points and the result provided as a non-atomic variable which is not
really subject to frequent changes, while mm::mm_users can be contended
under certain circumstances and workload scenarios.
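A sketch of that recomputation (the helper name matches mm_update_max_cids()
from the lock order section below; members like max_cids are illustrative and
not necessarily the real code):

        static void mm_update_max_cids(struct mm_struct *mm)
        {
                lockdep_assert_held(&mm->mm_cid.lock);

                // Optimal CID space: min(nr_tasks, nr_cpus_allowed)
                mm->mm_cid.max_cids = min(mm->mm_cid.users, mm->mm_cid.nr_cpus_allowed);
                // Per CPU mode is required once the tasks outnumber the allowed
                // CPUs. The switch back to per task mode has an additional
                // threshold, see the serialization section below.
                mm->mm_cid.percpu = mm->mm_cid.users > mm->mm_cid.nr_cpus_allowed;
        }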
When observing the current behaviour I also noticed that both the per MM
CID related variables and the per MM CID per CPU variables are subject
to false sharing. Moving them into data structures which are cache line
aligned makes that immediately go away.
So once I established that there are only three events where the
constraints can actually change it became pretty obvious what needs to
be done:
1) As long as the number of tasks is less than or equal to the number
of allowed CPUs, each task owns a CID. If the number of tasks
exceeds the number of allowed CPUs, it switches to per CPU mode,
where the CPUs own the CIDs and the tasks borrow them as long
as they are scheduled in.
2) Account the tasks in fork() and exit() and make the decision
whether to switch to per CPU mode or back to per task mode.
If fork() or exit() changes the mode, then that context is the
obvious candidate to do the fixup work which is required to do a
mode change.
While the kernel tries to make fork() as fast as possible, the
reality is that compared to context switching, task migration and
return to user space, it is pretty much in the noise when fork()
has to walk through possible CPUs or through the thread list of a
process. The same applies for exit().
3) The common fork() case is really fast if there is no mode change
happening. If the process (MM) is in per task mode it acquires a
CID and the new task, which is not yet visible, owns it. If it's in
per CPU mode, there is nothing to do as the task will borrow the CID
from the CPU when it becomes runnable.
4) The common exit() case is equally trivial if there is no mode
change. If the process (MM) is in per task mode the task drops the
CID and hands it back into the pool. In per CPU mode it does
nothing because the CPU owns the CID.
5) As a consequence, context switching becomes a very trivial sequence
of operations most of the time.
The scheduling-out task does exactly nothing, which spares access to
likely cache-cold data. The scheduling-in task obviously needs to do a
little bit of work:
        if (!next->mm_cid.active)
                return;

        cpu_cid = __this_cpu_read(mm->mm_cid.pcpu->cid);
        if (unlikely(READ_ONCE(mm->mm_cid.percpu)))
                mm_cid_from_cpu(next, cpu_cid);
        else
                mm_cid_from_task(next, cpu_cid);
mm_cid_from_cpu() does in the common case:
        tcid = next->mm_cid.cid;
        if (likely(cid_on_cpu(cpu_cid & tcid))) {
                if (likely(cpu_cid_in_range(cpu_cid))) {
                        next->mm_cid.tcid = cpu_cid;
                        return;
                }
        }
mm_cid_from_task() does in the common case:
        tcid = next->mm_cid.cid;
        if (likely(cid_on_task(tcid | cpu_cid))) {
                if (likely(task_cid_in_range(tcid))) {
                        this_cpu_write(next->mm.mm_cid.pcpu.cid, tcid);
                        return;
                }
        }
The common case is almost invisible in perf top with that change.
After some fighting with postgresql I got the sysbench benchmark
running which led to the current implementation. I don't have the
expertise to actually evaluate the sysbench output, but looking at
all those magic numbers the throughput trends upwards.
But on a 128 Core (256 threads) machine I clearly can see in perf
top that __schedule() drops:
                 upstream    +rseq/perf    +cid rework
                    0.43%         0.42%          0.30%    [k] __schedule
when I run sysbench with 255 threads, which means a total of 256 threads,
i.e. still in the per task limit on that box. PSQL uses processes for the
connections, which means there is only one user per process. That doesn't
look like a lot, but summed up over millions of operations it accumulates
pretty fast. There is a way bigger surprise in that benchmark as you will
see below.
Of course that does not exercise the switch from task mode to per CPU
mode and back, but with some artificial test cases which I wrote myself,
the assumption that the occasional overhead in fork() and exit() does not
matter at all turns out to be true.
6) Required serialization
A) The hotpath-relevant members of tsk->mm.mm_cid are protected by
mm::mm_cid::lock, a raw spinlock with a minimal hold time even
for the mode switch case. This lock is also taken in the
affinity change path.
B) As the mode changes (ignore affinity increasing @nr_cpus_allowed
for now) happen in fork() and exit(), this obviously requires
serialization for those. But that serialization wants to be
separate from the raw spinlock for obvious reasons and is simply
using a mutex. That's possible because the exit path does not
rely on sighand lock and the TASK_EXITING state anymore.
The contention case on that mutex is a scenario where really frequent
thread creation runs concurrently with frequent thread termination.
But as the common lock hold times are really short and both fork()
and exit() are subject to contention on various
sleeping locks already, this is the least of my worries.
The reason why this needs to be serialized is that mode switching
and the fixups it requires need to be synchronous from the
fork()/exit() context's point of view in order to maintain the
completely uncontended fast path in context_switch(), and that's
what really matters.
So on a mode switch from per task mode to per CPU mode, which can
only happen on fork(), the task which creates the thread has to
walk the thread list of the process (MM), which is simple and safe
because of the per MM serialization through the mutex, which
prevents it from changing, in the same way as sighand or tasklist lock would.
The simplified sequence in fork for going to per CPU mode is:
        guard(mutex)(&mm->mm_cid.mutex);

        scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
                mm->mm_cid.users++;
                mode_change = update_constraints(mm);
                if (!mm->mm_cid.percpu)
                        t->mm_cid.cid = mm_get_cid(mm);

                if (!mode_change)
                        return;

                if (mm->mm_cid.percpu)
                        transfer_cid_to_cpu(current);
                else
                        transfer_cid_to_task(current);
        }
        // mutex stays locked
Depending on mode, this either walks the possible CPUs and
transfers/removes ownership from them or walks the thread list
and transfers ownership to the corresponding CPUs.
You might ask why that can do both directions as the obvious
direction in fork() is from per task to per CPU mode as fork()
increases the number of tasks. But that fork can race with the
deferred mode switch towards per task mode caused by an affinity change
which increased @nr_cpus_allowed. See below.
That transfer obviously needs to be serialized against context
switching, which requires the runqueue lock, but the hold time of that
lock is minimal in both cases:
In the thread list walk case, i.e. the transition to per CPU mode, this
does the following (simplified) for each thread until all task owned
CIDs are transferred or the thread list walk terminates:
        // Protects both the task _and_ the per CPU CID store
        // and prevents a concurrent schedule from screwing it up.
        guard(task_rq_lock)(sibling);

        if (!cid_on_task(sibling->mm_cid.cid))
                continue;

        if (sibling == task_rq(sibling)->curr) {
                // Transfer it to the CPU
                sibling->mm_cid.cid |= MM_CID_ONCPU;
                per_cpu(mm->mm_cid.pcpu.cid, task_cpu(sibling)) = sibling->mm_cid.cid;
        } else {
                // Remove it from the sibling and drop it
                mm_cid_drop(sibling->mm_cid.cid);
                sibling->mm_cid.cid = MM_CID_UNSET;
        }
For the unlikely CLONE_VM case where multiple processes share a MM,
that's unfortunately not sufficient. In that case it has to walk the
full tasklist to find all tasks which reference the MM. As this is
mostly a historical leftover, I'm not losing sleep over that. As the
number of users is known, the walk can be terminated once all tasks have
been accounted for.
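A sketch of such a bounded walk (illustrative only; the per task fixup
helper and the exact tasklist protection are placeholders, not the actual
patch code):

        unsigned int users = mm->mm_cid.users;
        struct task_struct *g, *t;

        // Tasklist protection (RCU or tasklist_lock) elided in this sketch
        for_each_process_thread(g, t) {
                if (t->mm != mm)
                        continue;
                mm_cid_fixup_task(t);           // hypothetical per task fixup
                if (!--users)
                        break;
        }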
In the reverse case, which is also shared by exit(), this walks the
possible CPUs and does (again simplified):
        pcpu = per_cpu_ptr(mm->mm_cid.pcpu, cpu);
        rq = cpu_rq(cpu);

        // See above ....
        guard(rq_lock_irq)(rq);

        if (!cid_on_cpu(pcpu->cid))
                continue;

        if (rq->curr->mm == mm) {
                // Transfer it to rq->curr
                rq->curr->mm_cid.cid = cpu_cid_to_cid(pcpu->cid);
                pcpu->cid = rq->curr->mm_cid.cid;
        } else {
                // Remove ownership from the CPU and drop it
                pcpu->cid = cpu_cid_to_cid(pcpu->cid);
                mm_drop_cid(mm, pcpu->cid);
        }
So the whole point of this exercise is that _all_ other tasks
related to the MM can freely schedule independent of the mode
change happening. The only constraint is on fork() vs. exit().
That puts a constraint on the user count threshold which allows switching
back to per task mode: it can't be larger than nr_cpu_ids / 2.
Otherwise the CID space might become exhausted when tasks are scheduled
in on CPUs which already transferred ownership before the fixup
function was able to free or transfer enough CIDs. That would result in
a livelock because the task loops in mm_get_cid() with the runqueue lock
held and the fixup function is stuck on that runqueue lock.
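Expressed as a predicate, the constraint is roughly (illustrative helper
and member names, not the actual patch code):

        static bool mm_cid_fits_task_mode(struct mm_struct *mm)
        {
                // Per task mode needs one CID per user, and the switch back is
                // only considered when at most half of the CID space would be
                // needed, so CPUs which still own CIDs during the transition
                // cannot exhaust it.
                return mm->mm_cid.users <= mm->mm_cid.nr_cpus_allowed &&
                       mm->mm_cid.users <= nr_cpu_ids / 2;
        }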
A very reasonable limitation in my opinion because the kernel should
never optimize for user space being careless and just expecting that
the most insane configurations magically offer the best results.
C) The other case is the affinity change when it increases the number of
allowed CPUs to the point where per CPU mode can be disabled.
As that happens in the guts of the scheduler with the runqueue lock
held, there is no way to do any of that sanely. So it is deferred
first to irq work and then to a regular work, which then does the
thread/tasklist walk.
If a fork() or exit() happens before the work runs, the deferred
operation is taken care of in that context.
But that's still using all the same infrastructure and logic without
affecting the actual context switch path in a noticeable way.
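The deferral chain itself is unspectacular; roughly (struct and member
names are illustrative, not the actual patch code):

        // The affinity change path, which runs with the runqueue lock held,
        // only queues irq work:
        //
        //      irq_work_queue(&mm->mm_cid.irq_work);
        //
        // The irq work cannot sleep either, so it hands off to a work item,
        // which then does the thread/tasklist walk in sleepable context:
        static void mm_cid_irq_work(struct irq_work *work)
        {
                struct mm_cid *mc = container_of(work, struct mm_cid, irq_work);

                schedule_work(&mc->work);
        }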
D) Lock order and protection rules
mm::mm_cid::mutex serializes fork() and exit() and therefore protects
mm::mm_cid::users.
mm::mm_cid::lock serializes mm_update_max_cids() and
mm_update_cpus_allowed().
The mm_cidmask bitmap is not protected by any of the mm::mm_cid locks
and can only be modified with atomic operations.
The mm::mm_cid::pcpu per CPU storage is protected by the CPU's runqueue
lock.
Possible lock nesting
mm::mm_cid::mutex -> mm::mm_cid::lock
mm::mm_cid::mutex -> rq::lock
rq::lock -> mm::mm_cid::lock
7) CID space consolidation
If the number of tasks decreases, the CID space is consolidated when a
task schedules in. In the CID-on-task case this means:
        tcid = next->mm_cid.cid;
        if (likely(cid_on_task(tcid | cpu_cid))) {
                if (likely(task_cid_in_range(tcid))) {          <- range check fails
                        this_cpu_write(next->mm.mm_cid.pcpu.cid, tcid);
                        return;
                }
                // Try to get a lower CID and drop the old one
In that case it does not matter if a task with a high numbered CID
sleeps forever, because that CID is not visible to user space; when the
task wakes up it will correct its CID into the then valid range before
going back to user space.
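The slow path below that failed range check is then just (a sketch reusing
the helpers named earlier; the real code may differ in detail):

        // Acquire a CID in the now valid range and retire the old one
        new_cid = mm_get_cid(mm);
        mm_drop_cid(mm, tcid);
        next->mm_cid.cid = new_cid;
        this_cpu_write(next->mm.mm_cid.pcpu.cid, new_cid);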
8) Concurrency
Context switch has no real constraints vs. concurrent mode changes
because the entity which initiates it ensures that the mode change
operation is finished before the number of users can change again.
Due to that it does not matter whether mm_cid_schedin() observes the
correct state of mm::mm_cid::percpu or not, which means none of this is
subject to (memory) ordering constraints outside of those provided by
rq::lock.
9) Size impact
            text    data     bss     dec     hex filename
           62852   25501     640   88993   15ba1 kernel/sched/core.o - Before
           62316   25513     576   88405   15955 kernel/sched/core.o - After
10) Test results
All tests except the malloc benchmark were run with glibc 2.41 as
provided by Debian 13. It registers rseq unconditionally.
malloc benchmarks
I tried https://github.com/f18m/malloc-benchmarks, but that does not even
produce stable results for repeated runs on the same kernel...
Then I noticed that neither glibc nor the community tcmalloc nor
jemalloc use rseq::mm_cid.
The google tcmalloc code does, but that just does not build without
trying to debug their bazel java buildsystem monstrosity. So *shrug*
Which makes me really ask the obvious question why this CID mechanism
had to be shoved unconditionally into the kernel when there is very
close to zero usage for it. The proper solution would have been to
make it opt-in with the rseq registration so that it's only used when
actually required. But that ship has mostly sailed or perhaps not...
sysbench postgres benchmark
That's the one which exposed the lock contention in the original MMCID
implementation and led to the complexity we have today.
After capping the maximum frequency of my test machine so it stays within
the thermal limits under load, I was able to produce halfway
reproducible results with a low single digit standard deviation,
except for the number of CPU migrations which vary widely.
                        transactions/sec    queries/sec
        v6.17                89024.34        1424389.43
        +rseq/perf          102123.97        1633984.04   +14.7%
        +cid rework         105239.86        1683837.80    +3.1%
I triple checked the upstream vs. rseq/perf numbers because I couldn't
believe them at first sight. They stayed that way. perf does not give an
obvious explanation, but enabling the lightweight counters from the
rseq/perf branch shows:
exit: 666811173
signal: 2243
slowp: 198556051 // TIF_NOTIFY_RESUME 30% of all exits
ids: 198408670
cs: 198408670
clear: 198408670
fixup: 0
versus rseq/perf:
exit: 797571142
signal: 2262
slowp: 1908 // TIF_NOTIFY_RESUME ~0%
fastp: 1991494 // TIF_RSEQ 0.25% of all exits
ids: 1864356
cs: 0
clear: 0
fixup: 0
While perf top does not really notice as there is more noise going on
in the system, this simply adds up pretty fast. Assuming 50 instructions
per rseq update, the ~200 million slow path exits add up to roughly 10
billion instructions, i.e. ~6 seconds of time on that machine. Take the
corresponding exits of the psql workers into
account and it gets pretty close to the loss projected on the full 60
seconds runtime of the benchmark: 60s * 0.147 = 8.82s
I really need to sit down now and get this rseq rewrite sorted, but it
seems that everything related to it is amazingly efficient.
Thread create teardown
I wrote a micro benchmark which spawns pools that each create
threads and let the threads die after creation. The 32 pools / 32
threads case triggers the ownership mode change in both
directions. The source is appended at the end of this mail.
I initially wrote it to stress the mode change mechanics, but then I
noticed the massive difference when I ran it on upstream:
                      8 pools / 8 threads     32 pools / 32 threads
        v6.17         23666 thr/sec           16161 thr/sec
        +rseq/perf    23656 thr/sec    0%     16196 thr/sec    0%
        +cid rework   32025 thr/sec  +35%     21004 thr/sec  +30%
Both v6.17 and v6.17 + rseq/perf show this in perf top:
            14.62%  [kernel]  [k] update_sd_lb_stats.constprop.0
            13.08%  [kernel]  [k] native_queued_spin_lock_slowpath
             4.66%  [kernel]  [k] osq_lock
             3.06%  [kernel]  [k] _find_next_and_bit
             2.21%  [kernel]  [k] __schedule
             2.16%  [kernel]  [k] sched_balance_rq
with the CID rewrite this becomes:
            13.48%  [kernel]  [k] native_queued_spin_lock_slowpath
             8.98%  [kernel]  [k] update_sd_lb_stats.constprop.0
             5.16%  [kernel]  [k] osq_lock
             2.28%  [kernel]  [k] _find_next_and_bit
             2.11%  [kernel]  [k] __schedule
             1.75%  [kernel]  [k] psi_group_change
             ...
             1.32%  [kernel]  [k] sched_balance_rq
I haven't been able to understand that massive difference yet.
The following series rewrites MM CID management as outlined above. After
trying to gradually rework the upstream implementation which ended up in
incomprehensible patches, I more or less reverted the complex CID
management and switched back to the simple set/clear bit approach in
context_switch(). That let me gradually work towards the final solution in
a hopefully comprehensible way. Obviously the actual revert at the
beginning does not qualify.
The series is based on the V4 series of the rseq rewrite:
https://lore.kernel.org/20250908212737.353775467@linutronix.de
which is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
That branch is now rebased on v6.17 to have a sane base point.
The series on top of the rseq/perf branch is available from git as well:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
While this is slightly more code than the naive solution, this still
removes more code than it adds. See below.
Further testing required
As there is no ready-to-use benchmark for MM CID, this really needs to be
tested by those people who use it and can actually build their stuff.
The other benchmarks like sysbench etc. want to be run by people who
actually know how to do proper benchmarking and have automated test rigs
at hand.
It would be helpful to have results based on v6.17, the rseq/perf branch
and the rseq/cid branch of my devel repository. See links above.
Thanks,
tglx
---
include/linux/bitmap.h | 15 +
include/linux/cpumask.h | 16 +
include/linux/mm_types.h | 125 ---------
include/linux/rseq.h | 8
include/linux/rseq_types.h | 54 ++++
include/linux/sched.h | 19 -
init/init_task.c | 3
kernel/exit.c | 1
kernel/fork.c | 8
kernel/sched/core.c | 582 +++++++--------------------------------------
kernel/sched/sched.h | 426 +++++++++++++++-----------------
kernel/signal.c | 2
lib/bitmap.c | 17 +
13 files changed, 426 insertions(+), 850 deletions(-)
---
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/sysinfo.h>
struct threadarg {
        pthread_mutex_t mut;
        pthread_cond_t cond;
        unsigned int cnt;
        unsigned int die;
        unsigned int step;
        unsigned int timeout;
        unsigned int fail;
        unsigned int create;
        unsigned long tcreate;
};
static int num_threads = 8;
static int num_pools = 8;
static void *tfn(void *arg)
{
        struct threadarg *ta = arg;

        pthread_mutex_lock(&ta->mut);
        ta->cnt++;
        pthread_cond_broadcast(&ta->cond);
        while (!ta->die)
                pthread_cond_wait(&ta->cond, &ta->mut);
        ta->die--;
        ta->cnt--;
        pthread_cond_broadcast(&ta->cond);
        pthread_mutex_unlock(&ta->mut);
        return NULL;
}
static void *poolfn(void *arg)
{
        struct threadarg *ta = arg;

        while (1) {
                unsigned int threads = 0;
                pthread_t t;

                for (int i = 0; i < num_threads; i++) {
                        struct timespec st, end;

                        clock_gettime(CLOCK_MONOTONIC, &st);
                        if (!pthread_create(&t, NULL, tfn, ta)) {
                                clock_gettime(CLOCK_MONOTONIC, &end);
                                pthread_detach(t);
                                ta->step++;
                                ta->create++;
                                threads++;
                                unsigned long s = st.tv_sec * 1000000000UL + st.tv_nsec;
                                unsigned long e = end.tv_sec * 1000000000UL + end.tv_nsec;
                                ta->tcreate += e - s;
                        } else {
                                ta->fail++;
                        }
                }

                pthread_mutex_lock(&ta->mut);
                ta->step = 1000;
                while (ta->cnt < threads) {
                        struct timespec to;

                        clock_gettime(CLOCK_REALTIME, &to);
                        to.tv_sec++;
                        int r = pthread_cond_timedwait(&ta->cond, &ta->mut, &to);

                        switch (r) {
                        case ETIMEDOUT:
                                ta->timeout++;
                                /* fall through: broadcast and re-check */
                        case EINTR:
                                pthread_cond_broadcast(&ta->cond);
                                break;
                        default:
                                ta->step++;
                        }
                }
                ta->step = 2000;
                ta->die = threads;
                pthread_cond_broadcast(&ta->cond);
                while (ta->cnt > 0) {
                        pthread_cond_wait(&ta->cond, &ta->mut);
                        ta->step++;
                }
                pthread_mutex_unlock(&ta->mut);
                ta->step = 0;
        }
        return NULL;
}
static struct threadarg targs[1024];
static void __sigfun(unsigned int secs)
{
        unsigned long tot = 0;

        if (secs) {
                for (int i = 0; i < num_pools; i++)
                        tot += targs[i].create;
                printf("%8u %16lu %lu / sec\n", secs, tot, tot / secs);
        }

        printf("    Step     Cnt     Die    Tout  Create    Fail         Tcreate\n");
        for (int i = 0; i < num_pools; i++) {
                /* Guard against division by zero before the first successful create */
                unsigned long avg = targs[i].create ? targs[i].tcreate / targs[i].create : 0;

                printf("%8u%8u%8u%8u%8u%8u%16lu\n", targs[i].step, targs[i].cnt,
                       targs[i].die, targs[i].timeout, targs[i].create,
                       targs[i].fail, avg);
        }
}
static void sigfun(int sig)
{
        __sigfun(0);
}
int main(int argc, char *argv[])
{
        pthread_t t;
        int opt;

        while ((opt = getopt(argc, argv, "p:t:")) != -1) {
                switch (opt) {
                case 'p':
                        num_pools = atoi(optarg);
                        break;
                case 't':
                        num_threads = atoi(optarg);
                        break;
                default: /* '?' */
                        fprintf(stderr, "Usage: %s [-p pools] [-t threads]\n",
                                argv[0]);
                        exit(1);
                }
        }

        signal(SIGUSR1, sigfun);

        for (int i = 0; i < num_pools; i++) {
                targs[i] = (struct threadarg) {
                        .mut = PTHREAD_MUTEX_INITIALIZER,
                        .cond = PTHREAD_COND_INITIALIZER,
                };
                pthread_create(&t, NULL, poolfn, &targs[i]);
        }

        unsigned int strt = time(NULL);

        for (;;) {
                sleep(10);
                __sigfun(time(NULL) - strt);
        }
        return 0;
}