linux-kernel - Re: [RFC 0/1] block: export windowed IO P99 latency

Message-Id: <003ed785-455d-4245-bb7d-47e58eae7814@bytedance.com>
Date: Fri, 30 Jan 2026 10:59:06 +0800
From: "Diangang Li" <lidiangang@...edance.com>
To: "Diangang Li" <diangangli@...il.com>, <axboe@...nel.dk>
Cc: <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>, 
	<changfengnan@...edance.com>
Subject: Re: [RFC 0/1] block: export windowed IO P99 latency

On 2026/1/9 16:31, Diangang Li wrote:
> Production environments occasionally run into elevated tail latencies. The
> source can be the underlying device, but it can also be higher in the
> stack (filesystem contention/journaling, memory reclaim, writeback, etc.).
> Existing block IO statistics only provide throughput and average latency,
> which fail to capture the critical tail end of the latency distribution
> that often causes user-visible performance problems.
> 
> This patch adds windowed P99 latency tracking for block IO operations,
> exposing the 99th percentile latency in /proc/diskstats and
> /sys/block/<dev>/stat. System administrators can now monitor tail latency
> trends over time using tools like iostat, enabling quick validation or
> elimination of disk hardware as the source of latency issues.
> 
> Implementation uses per-CPU sliced ring histograms (21 buckets, 8us..~8s
> range) with minimal overhead. P99 values are computed by aggregating
> recent 1-second slices when reading statistics, reported in microseconds
> using bucket midpoints.
> 
> The added work on the IO path is intentionally small (bucket selection and
> a per-CPU counter update, with occasional per-slice reset), and in our
> testing it does not have a measurable impact on IO performance.
> 
> Diangang Li (1):
>    block: export windowed IO P99 latency
> 
>   block/blk-core.c          |  5 ++-
>   block/blk-flush.c         |  6 ++-
>   block/blk-mq.c            |  5 ++-
>   block/genhd.c             | 50 ++++++++++++++++++++++++-
>   include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
>   5 files changed, 139 insertions(+), 6 deletions(-)
> 

Hi Jens, hi all,

I would like a quick sanity check on the motivation and design before I respin.

I want to expose a simple tail-latency metric (P99) via /proc/diskstats and
/sys/block/<dev>/stat, since average latency and throughput often miss the
spikes we see in production.

I considered read-to-read deltas, but diskstats is polled frequently
(often sub-second, by multiple agents), so the effective window would be
reader-dependent and often too short and noisy. The current approach
instead uses a fixed window: each CPU records into 1-second slices of a
small ring histogram, and the slices are aggregated on read.
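
To make the fixed-window idea concrete, here is a minimal userspace sketch.
It is not the patch code. The 21 buckets, the 8us..~8s range, the 1-second
slices and the midpoint reporting come from the cover letter; the ring depth
(NR_SLICES), the exact power-of-two bucket edges, and all names (lat_ring,
lat_bucket, lat_record, lat_p99) are assumptions made up for illustration.
One lat_ring stands in for what would be a per-CPU structure in the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_BUCKETS 21	/* from the cover letter: 21 buckets, 8us..~8s */
#define NR_SLICES  8	/* assumption: ring depth, one slice per second */

struct lat_slice {
	uint64_t epoch;			/* the second this slice belongs to */
	uint32_t count[NR_BUCKETS];
};

struct lat_ring {			/* one instance per CPU in a real design */
	struct lat_slice slice[NR_SLICES];
};

/*
 * Assumed log2 layout: bucket i nominally covers [8 << i, 16 << i) us.
 * Anything below 16us lands in bucket 0, and the last bucket is open-ended.
 */
static unsigned int lat_bucket(uint64_t us)
{
	unsigned int b = 0;

	while (b < NR_BUCKETS - 1 && us >= (16ULL << b))
		b++;
	return b;
}

/* IO completion path: pick the slice for the current second, bump one counter. */
static void lat_record(struct lat_ring *r, uint64_t now_sec, uint64_t us)
{
	struct lat_slice *s = &r->slice[now_sec % NR_SLICES];

	if (s->epoch != now_sec) {	/* slot still holds an older second */
		memset(s->count, 0, sizeof(s->count));
		s->epoch = now_sec;
	}
	s->count[lat_bucket(us)]++;
}

/*
 * Read path: sum the slices from the last `window` seconds across all rings,
 * then walk the merged histogram until 99% of the samples are covered.
 * Returns the bucket midpoint in microseconds, or 0 if the window is empty.
 */
static uint64_t lat_p99(const struct lat_ring *rings, int nr_rings,
			uint64_t now_sec, unsigned int window)
{
	uint64_t sum[NR_BUCKETS] = { 0 }, total = 0, seen = 0, target;

	assert(window <= NR_SLICES);
	for (int c = 0; c < nr_rings; c++) {
		for (int i = 0; i < NR_SLICES; i++) {
			const struct lat_slice *s = &rings[c].slice[i];

			if (s->epoch > now_sec || now_sec - s->epoch >= window)
				continue;	/* outside the window */
			for (int b = 0; b < NR_BUCKETS; b++) {
				sum[b] += s->count[b];
				total += s->count[b];
			}
		}
	}
	if (!total)
		return 0;

	target = (total * 99 + 99) / 100;	/* ceil(0.99 * total) */
	for (int b = 0; b < NR_BUCKETS; b++) {
		seen += sum[b];
		if (seen >= target)
			return 12ULL << b;	/* midpoint of [8 << b, 16 << b) */
	}
	return 12ULL << (NR_BUCKETS - 1);
}
```

The point of the sketch is that the reported value depends only on the
current time and the window length, not on when the previous reader polled,
so several agents reading at different rates would see the same window.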

Does this direction make sense? Is diskstats/sysfs the right place for
it? Is there a better low-overhead, polling-independent approach? I would
also welcome preferences on the percentile, window length, and bucket
layout.

Best regards,
Diangang Li