Message-Id: <003ed785-455d-4245-bb7d-47e58eae7814@bytedance.com>
Date: Fri, 30 Jan 2026 10:59:06 +0800
From: "Diangang Li" <lidiangang@...edance.com>
To: "Diangang Li" <diangangli@...il.com>, <axboe@...nel.dk>
Cc: <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<changfengnan@...edance.com>
Subject: Re: [RFC 0/1] block: export windowed IO P99 latency
On 2026/1/9 16:31, Diangang Li wrote:
> Production environments occasionally run into elevated tail latencies. The
> source can be the underlying device, but it can also be higher in the
> stack (filesystem contention/journaling, memory reclaim, writeback, etc.).
> Existing block IO statistics only provide throughput and average latency,
> which fail to capture the critical tail end of the latency distribution
> that often causes user-visible performance problems.
>
> This patch adds windowed P99 latency tracking for block IO operations,
> exposing the 99th percentile latency in /proc/diskstats and
> /sys/block/<dev>/stat. System administrators can now monitor tail latency
> trends over time using tools like iostat, enabling quick validation or
> elimination of disk hardware as the source of latency issues.
>
> Implementation uses per-CPU sliced ring histograms (21 buckets, 8us..~8s
> range) with minimal overhead. P99 values are computed by aggregating
> recent 1-second slices when reading statistics, reported in microseconds
> using bucket midpoints.
>
> The added work on the IO path is intentionally small (bucket selection and
> a per-CPU counter update, with occasional per-slice reset), and in our
> testing it does not have a measurable impact on IO performance.
>
> Diangang Li (1):
> block: export windowed IO P99 latency
>
> block/blk-core.c | 5 ++-
> block/blk-flush.c | 6 ++-
> block/blk-mq.c | 5 ++-
> block/genhd.c | 50 ++++++++++++++++++++++++-
> include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 139 insertions(+), 6 deletions(-)
>
Hi Jens, hi all,

Quick sanity check on the motivation/design before I respin.

I want to expose a simple tail metric (P99) via diskstats/sysfs stat,
since avg latency/throughput often miss the spikes seen in prod.

I considered read-to-read deltas, but diskstats is polled frequently
(often sub-second, multiple agents), so the effective window becomes
reader-dependent and too short/noisy. Current approach uses a fixed
window (per-CPU 1s slices in a small ring histogram) and aggregates on read.

Does this direction make sense? Is diskstats/sysfs the right place for
it? Any better low-overhead, polling-independent approach (and
preferences on percentile/window/buckets)?
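
For concreteness, here is a rough userspace sketch of the fixed-window
scheme described above. It is not the patch code. All names (lat_hist,
lat_record, lat_p99, LAT_SLICES, LAT_NCPUS) are made up for illustration, a
plain array stands in for real per-CPU data, and there is no handling of
concurrent readers/writers. The bucket layout is an assumption: power-of-two
upper bounds from 8us (bucket 0) to 8us << 20, roughly 8.4s (bucket 20), with
anything larger clamped into the last bucket.

```c
#include <stdint.h>
#include <string.h>

#define LAT_BUCKETS 21  /* bucket count from the cover letter */
#define LAT_SLICES   8  /* assumption: ring depth, in 1-second slices */
#define LAT_NCPUS    4  /* assumption: stand-in for per-CPU storage */

struct lat_slice {
	uint64_t epoch;               /* second this slot currently holds */
	uint32_t count[LAT_BUCKETS];  /* completions per latency bucket */
};

struct lat_hist {
	struct lat_slice slice[LAT_NCPUS][LAT_SLICES];
};

/* Bucket b has upper bound (8 << b) us; the last bucket also takes overflow. */
static unsigned lat_bucket(uint64_t us)
{
	unsigned b = 0;

	while (b < LAT_BUCKETS - 1 && us >= (8ULL << b))
		b++;
	return b;
}

/* Midpoint of bucket b in microseconds (bucket 0 starts at 0). */
static uint64_t lat_bucket_mid(unsigned b)
{
	uint64_t hi = 8ULL << b;
	uint64_t lo = b ? hi / 2 : 0;

	return lo + (hi - lo) / 2;
}

/* IO path: pick a bucket, bump a counter, reset the slot when it is stale. */
static void lat_record(struct lat_hist *h, unsigned cpu,
		       uint64_t now_sec, uint64_t us)
{
	struct lat_slice *s = &h->slice[cpu][now_sec % LAT_SLICES];

	if (s->epoch != now_sec) {
		memset(s->count, 0, sizeof(s->count));
		s->epoch = now_sec;
	}
	s->count[lat_bucket(us)]++;
}

/* Read path: merge slices from the last `window` seconds, walk to rank 99%. */
static uint64_t lat_p99(const struct lat_hist *h, uint64_t now_sec,
			unsigned window)
{
	uint64_t sum[LAT_BUCKETS] = { 0 };
	uint64_t total = 0, seen = 0, rank;
	unsigned cpu, i, b;

	for (cpu = 0; cpu < LAT_NCPUS; cpu++) {
		for (i = 0; i < LAT_SLICES; i++) {
			const struct lat_slice *s = &h->slice[cpu][i];

			if (s->epoch > now_sec || now_sec - s->epoch >= window)
				continue;  /* outside the window */
			for (b = 0; b < LAT_BUCKETS; b++) {
				sum[b] += s->count[b];
				total += s->count[b];
			}
		}
	}
	if (!total)
		return 0;

	rank = (total * 99 + 99) / 100;  /* ceil(0.99 * total) */
	for (b = 0; b < LAT_BUCKETS; b++) {
		seen += sum[b];
		if (seen >= rank)
			return lat_bucket_mid(b);
	}
	return lat_bucket_mid(LAT_BUCKETS - 1);
}
```

The point of the sketch is that the reported value depends only on the
current time and the window length, not on when anyone last read the file,
so several agents polling at different rates should see the same number.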
Best regards,
Diangang Li