linux-kernel - RE: [PATCH] printk: add early_counter

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <MW5PR13MB56325360CFB26B7CAE6B0E03FDDFA@MW5PR13MB5632.namprd13.prod.outlook.com>
Date: Thu, 27 Nov 2025 00:03:32 +0000
From: "Bird, Tim" <Tim.Bird@...y.com>
To: Petr Mladek <pmladek@...e.com>, Francesco Valla <francesco@...la.it>
CC: Tim Bird <tim@...d.org>, Steve Rostedt <rostedt@...dmis.org>,
        "john.ogness@...utronix.de" <john.ogness@...utronix.de>,
        "senozhatsky@...omium.org" <senozhatsky@...omium.org>,
        Andrew Morton
	<akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Anna-Maria
 Behnsen <anna-maria@...utronix.de>,
        Frederic Weisbecker
	<frederic@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Linux Embedded
	<linux-embedded@...r.kernel.org>
Subject: RE: [PATCH] printk: add early_counter_ns routine for printk blind
 spot

> -----Original Message-----
> From: Petr Mladek <pmladek@...e.com>
> On Tue 2025-11-25 14:08:40, Francesco Valla wrote:
> > Hi Tim,
> >
> > I tested this on my i.MX93 FRDM (arm64) board and after a bit of
> > fiddling with the MULT/SHIFT values I got it working. It can be a very
> > valuable addition.
> >
> > Some comments follow.
> >
> > On Mon, Nov 24, 2025 at 10:30:52PM -0700, Tim Bird wrote:
> > > From: Tim Bird <tim.bird@...y.com>
> > >
> > > During early boot, printk timestamps are reported as zero,
> > > which creates a blind spot in early boot timings.  This blind
> > > spot hinders timing and optimization efforts for code that
> > > executes before time_init(), which is when local_clock() is
> > > initialized sufficiently to start returning non-zero timestamps.
> > > This period is about 400 milliseconds for many current desktop
> > > and embedded machines running Linux.
> > >
> > > Add an early_counter_ns function that returns nanosecond
> > > timestamps based on get_cycles().  get_cycles() is operational
> > > on arm64 and x86_64 from kernel start.  Add some calibration
> > > printks to allow setting configuration variables that are used
> > > to convert cycle counts to nanoseconds (which are then used
> > > in early printks).  Add CONFIG_EARLY_COUNTER_NS, as well as
> > > some associated conversion variables, as new kernel config
> > > variables.
> > >
> > > After proper configuration, this yields non-zero timestamps for
> > > printks from the very start of kernel execution.  The timestamps
> > > are relative to the start of the architecture-specific counter
> > > used in get_cycles (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > > This means that the time reported reflects time-from-power-on for
> > > most embedded products.  This is also a useful data point for
> > > boot-time optimization work.
> > >
> > > Note that there is a discontinuity in the timestamp sequencing
> > > when standard clocks are finally initialized in time_init().
> > > The printk timestamps are thus not monotonically increasing
> > > through the entire boot.
> >
> > This is... not going to work, IMO, and might lead to breakages in
> > userspace tools (are printk timings a userspace API?).
> 
> Honestly, I am not sure if it would break anything. The fact is
> that printk() always used monotonic timers. And it is possible
> that some userspace depends on it.
> 
> I personally thing that non-monotonic time stamps might be confusing
> but they should not cause any serious breakage. But I might be wrong.
> People are creative...

I worried about this, but I'm skeptical it's a big deal.  Humans might be
a little confused, but it's not difficult to see what's going on just by looking
at the timestamps.   If a tool breaks, especially something that's used
in automation, e.g. it's used to report results, or is in some sort of CI
loop where the break will cascade into a test failure, then that's a bigger issue. 
But right now I'm not aware of any boot-time tests where that would be the case.

I'll comment more on different fixes for this below.

> 
> > I actually have a counter-proposal: the time obtained through cycle
> > evaluation is used as an offset to be added to the printk time after
> > time_init() is called. A (working, but maybe sub-optimal) patch to
> > obtain this is attached at the end.
> 
> I am not sure if this is a good idea. The offset would cause
> that all post-timer-init printk timestamps differ from values
> provided by the timer API. And it might cause confusion,
> for example, when they are printed as part of the message,
> or when analyzing a crash dump.
> 
> On the other hand, there are various clock sources in the kernel
> which are not comparable anyway. So maybe I am too cautious.

I thought of adding an offset, but I didn't want to disturb anything past
the time_init() call.  As it is now, the early_counter_ns feature only
changes the zero-valued timestamps.  So anything relying on the absolute
value of an existing timestamp later in the boot would not be affected.
I thought that if people suddenly saw the timestamps jump by 10 to 30 seconds
(since they are now relative to machine start instead of to kernel clock start
(time_init()), it would be very jarring.  I suppose they would get used to it,
though, and all relative timings should stay the same.

I also didn't want to add additional overhead (even a single add) in the case
where CONFIG_EARLY_COUNTER_NS was disabled.  But, realistically, I
don't think an additional add in the printk path is not going to be noticeable,
and I can probably structure it so that there's absolutely no overhead when
CONFIG_EARLY_COUNTER_NS is disabled.

I considered embedding an early counter offset into local_clock(), and thus
not modifying the printk code at all.  This would have the benefit
of keeping the printk timestamps consistent with other uses of local_clock()
data (such as crash dumps or inside other messages). But then that would embed the
early_counter overhead into every user of local_clock() (whether humans
saw the timestamp values or not). And some of those users might be more
performance sensitive than printk is.

Finally, I considered adding another config option to control adding the offset,
but at 3 configs for this fairly niche functionality, I thought I was already
pushing my luck.  But a new config would be easy to add.

My plan is to add an offset for the early_counter_ns value to printk, without
a config, in the next patch version, and see what people think.

> 
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -770,6 +770,53 @@ config IKHEADERS
> > >  	  or similar programs.  If you build the headers as a module, a module called
> > >  	  kheaders.ko is built which can be loaded on-demand to get access to headers.
> > >
> > > +config EARLY_COUNTER_NS
> > > +	bool "Use counter for early printk timestamps"
> > > +	default y
> > > +	depends on PRINTK
> > > +	help
> > > +	  Use a cycle-counter to provide printk timestamps during early
> > > +	  boot.  This allows seeing timing information that would
> > > +	  otherwise be displayed with 0-valued timestamps.
> > > +
> > > +	  In order for this to work, you need to specify values for
> > > +	  EARLY_COUNTER_MULT and EARLY_COUNTER_SHIFT, used to convert
> > > +	  from the cycle count to nanoseconds.
> > > +
> > > +config EARLY_COUNTER_MULT
> > > +	int "Multiplier for early cycle counter"
> > > +	depends on PRINTK && EARLY_COUNTER_NS
> > > +	default 1
> > > +	help
> > > +	  This value specifies a multiplier to be used when converting
> > > +	  cycle counts to nanoseconds.  The formula used is:
> > > +		  ns = (cycles * mult) >> shift
> > > +
> > > +	  Use a multiplier that will bring the value of (cycles * mult)
> > > +	  to near a power of two, that is greater than 1000.  The
> > > +	  nanoseconds returned by this conversion are divided by 1000
> > > +	  to be used as the printk timestamp counter (with resolution
> > > +	  of microseconds).
> > > +
> > > +	  As an example, for a cycle-counter with a frequency of 200 Mhz,
> > > +	  the multiplier would be: 10485760, and the shift would be 21.
> > > +
> >
> > If I got this correclty:
> >
> > 	EARLY_COUNTER_MULT = (10^9 / freq) << EARLY_COUNTER_SHIFT
> >
> > where EARLY_COUNTER_SHIFT can be chosen at will, provided it is big
> > enough to survice the ns->us conversion but small enough not to overflow
> > the u64 container.

Yeah, I was worried about these Kconfig explanations.  I think it will be easier
to just explain how I recommend this should be configured, which is something like:

Turn on CONFIG_EARLY_COUNTER_NS, run the kernel once, get the values for MULT
and SHIFT (printed by the calibration code), enter them in the appropriate configs, and
then build and run the kernel again.  This only needs to be done once per platform,
and could even be put into the defconfig (or a config fragment) for a platform.
(More on having hardcoded config for this below).  It was not my intent to have
kernel developers doing weird (shift) math to enable this feature.  Note that I can't
get the values from devicetree or the kernel command line, as this is used before
either of those is initialized or parsed.

> >
> > > +config EARLY_COUNTER_SHIFT
> > > +	int "Shift value for early cycle counter"
> > > +	range 0 63
> > > +	depends on PRINTK && EARLY_COUNTER_NS
> > > +	default 0
> > > +	help
> > > +	  This value specifies a shift value to be used when converting
> > > +	  cycle counts to nanoseconds.  The formula used is:
> > > +		  ns = (cycles * mult) >> shift
> > > +
> > > +	  Use a shift that will bring the result to a value
> > > +	  in nanoseconds.
> > > +
> > > +	  As an example, for a cycle-counter with a frequency of 200 Mhz,
> > > +	  the multiplier would be: 10485760, and the shift would be 21.
> > > +
> > >  config LOG_BUF_SHIFT
> > >  	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
> > >  	range 12 25
> 
> So, it is usable only for a particular HW. It is not usable for a
> generic kernel which is supposed to run on misc HW.

That's correct.  It is mostly targeted at embedded products,
where shaving off 10 to 200 milliseconds in the pre-clock-initialization
region of boot would be valuable.  For people doing aggressive
boot time optimization, they will have a custom kernel anyway
(and probably a custom bootloader, device tree, initramfs, udev rules,
SE linux rules, module load ordering, systemd config, etc.)

Basically, if you're optimizing the code in this kernel boot time blind
spot, you are very likely not using a generic kernel.  (That's not to
say that the optimizations made won't ultimately be valuable to people using
generic kernels).

> 
> I guess that there is no way to detect the CPU frequency at runtime?
This "feature" is used before clock initialization, which is what
would be used to calibrate the CPU frequency at runtime.  This runs so 
early that doing the calibration inline doesn't work.  Not enough kernel
services are available (Actually, zero services are available, as this
can be used in the very first printk in start_kernel.)

Plan from here...

I got a compile error from 0-day on powerpc, so I need to re-spin to fix that.
I'll address the other issues raised and submit a new version when I can.
I'm off to Japan this week, and between business travel and the holidays,
and being away from my lab where I can do hardware testing, it will
probably be some time in January before I send the next version.

Thanks very much for the feedback!
 -- Tim