linux-kernel - Wislist for Linux from the mold linker's POV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CACKH++baPUaoQQhL0+qcc_DzX7kGcmAOizgfaCQ8gG=oBKDDYw@mail.gmail.com>
Date: Thu, 28 Nov 2024 11:52:35 +0900
From: Rui Ueyama <rui314@...il.com>
To: LKML <linux-kernel@...r.kernel.org>
Subject: Wislist for Linux from the mold linker's POV

Hi,

I'm the author of the mold linker. I developed mold for speed, and I
think I achieved that goal. As a ballpark number, mold can create a 1
GiB executable in a second on a recent 32-core x86 machine. While
developing mold, I noticed that the kernel's performance occasionally
became a bottleneck. I’d like to share these observations as a
wishlist so that kernel developers can at least recognize potential
areas for improvement.

mold might be somewhat unique from the kernel's point of view. Speed
is the utmost goal for the program, so we care about every
millisecond. Its performance characteristics are very bursty: as soon
as the linker is invoked, it reads hundreds or thousands of object
files, creates a multi-gibibyte output file, and then exits, while
utilizing all available cores on a machine, all within just a few
seconds.

Here is what I noticed while developing mold:

- exit(2) takes a few hundred milliseconds for a large process

I believe this is because mold mmaps all input files and an output
file, and clearing/flushing memory-mapped data is fairly expensive. I
wondered if this could be improved. If it is unavoidable, could the
cleanup process be made asynchronous so that exit(2) takes effect
immediately?

To avoid this overhead, mold currently forks a child process, lets the
child handle the actual linking task, and then, as soon as the child
closes the output file, the parent exits (which takes no time since
the parent is lightweight). Since the child is not an interactive
process, it can afford to take its time for exit. While this works, I
would prefer to avoid it if possible, as it is somewhat a hacky
workaround.

- Writing to a fresh file is slower than writing to an existing file

mold can link a 4 GiB LLVM/clang executable in ~1.8 seconds on my
machine if the linker reuses an existing file and overwrites it.
However, the speed decreases to ~2.8 seconds if the output file does
not exist and mold needs to create a fresh file. I tried using
fallocate(2) to preallocate disk blocks, but it didn't help. While 4
GiB is not small, should creating a file really take almost a second?

- Lack of a safe system-wide semaphore

mold is multi-threaded itself, so it doesn't make much sense to run
multiple instances of the linker in parallel if the number of cores
is, say, less than 16. In fact, doing so could decrease performance
because the working set increases as the number of linker processes
grows. In the worst case, they may even crash due to OOM. Therefore,
we want mold to wait for other mold processes to terminate if another
instance is already running. However, achieving this appears to be
difficult.

Currently, we are using a lockfile. This approach is simple and
reliable -- a file lock is guaranteed to be released by the kernel if
the process exits, whether gracefully or unexpectedly. However, this
only allows one active process at a time. If your machine has 64
cores, you may want to run a few linker processes simultaneously.
However, allowing up to N processes where N>1 is significantly harder.
POSIX semaphores are not released on process exit, so it may cause
resource leaks. We could run a daemon process to count the number of
active processes, but that feels overkill for achieving this goal.

After all, we just want a system-wide semaphore that is guaranteed to
be released on process exit. But it seems like such a thing doesn't
exist.