lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <77f6fd38-e216-47cf-8ceb-930395614aca@oracle.com>
Date: Fri, 13 Sep 2024 13:08:34 +0100
From: John Garry <john.g.garry@...cle.com>
To: Daniel Gomez <da.gomez@...sung.com>
Cc: Jens Axboe <axboe@...nel.dk>, Steven Rostedt <rostedt@...dmis.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-trace-kernel@...r.kernel.org, gost.dev@...sung.com,
        Luis Chamberlain <mcgrof@...nel.org>,
        Pankaj Raghav <p.raghav@...sung.com>,
        Dave Chinner <dchinner@...hat.com>, Daniel Gomez <d@...ces.com>
Subject: Re: [PATCH RFC] block: trace: add block alignment information

On 13/09/2024 12:26, Daniel Gomez wrote:
> On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
>> On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
>>> From: Daniel Gomez <da.gomez@...sung.com>
>>>
>>> Report block alignment in terms of LBA and size during block tracing for
>>> block_rq. Calculate alignment only for read/writes where the length is
>>> greater than 0. Otherwise, report 0 to indicate no alignment calculated.
>>>
>>> Suggested-by: Dave Chinner <dchinner@...hat.com>
>>> Signed-off-by: Daniel Gomez <da.gomez@...sung.com>
>>> ---
>>> This patch introduces LBA and size alignment information for
>>> the block_rq tracepoints (block_rq{insert, issue, merge} and
>>> block_{io_start, io_done}).
>>
>> eh, shouldn't this belong in the description of the patch?
> 
> Yes. I'll move this to the commit message.
> 
>>
>> And I still don't know what we mean by alignment in this context.
>>
>>  From looking at the code, it seems to be the max detected block size
>> granularity. For example, for a 64KB write at a 32KB offset, that would give
>> a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
>> alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
>> 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2

note: I meant "8KB is the largest power-of-2"

>> which is divisible into 24KB. Is this a correct understanding?
> 
> That is correct. 

So maybe it's me, but I just find it odd to call this information 
"alignment". To me, what you are looking for is largest block size 
granularity.

 > Do you think adding examples like yours can help to explain
 > this better?
> Below the same examples using fio with the trace output:
> 
> 
> 	sudo fio -bs=64k -size=64k -offset=32k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 	
> 	sudo fio -bs=64k -size=64k -offset=64k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 	
> 	sudo fio -bs=8k -size=8k -offset=64k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 	
> 	sudo fio -bs=24k -size=24k -offset=0k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 
> 	fio-789     [000] .....  4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
> 	fio-801     [000] .....  4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
> 	fio-813     [000] .....  4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
> 	fio-825     [000] .....  4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
> 
> 
> Also, the motivation behind this is explained in the LBS RFC [1] and I should
> have included it here for context. I hope [1] and my description below helps to
> explain what alignment means and why is needed:
> 
> [1] Subject: [RFC 00/23] Enable block size > page size in XFS
> https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$
> 
> Tracing alignment information is important for high-capacity and QLC SSDs with
> Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
> Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
> boundaries can imply in a read-modify-write (RMW).

Yes, I get that this might be important to know.

> 
> The graph below is a representation of the device IU vs an I/O block aligned/
> unaligned.
> 
>      |--- IU Boundaries ----|      |-PS-|
> a)  [====][====][====][====][····][····][····]--
>      |                      |
> b)  [····][====][====][====][====][····][····]--
>      |                      |
> c)  [====][====][====][====][····][====][====]--

is there meant to be a gap at page index #4?

>      |                      |
> d)  [····][····][====][====][····][····][····]--
>      |                      |
> LBA 0                      4
>    
>      Key:
>      [====] = I/O Block
>      [····] = Memory in Page Size (PS) chunks
> 
> a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
> b) The size of the I/O matches the IU size but the I/O is not aligned to the
> IU boundaries. I/O is unaligned.
> c) I/O does not match in either size or LBA. I/O is unaligned.

what about d)? Not aligned to IU, I assume.

> 
>>
>>>
>>> The idea of reporting alignment in a tracepoint was first suggested in
>>> this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
>>> tracing tool [2] was developed and used during LBS development, as
>>> mentioned in the patch series [3] and in [1].
>>>
>>> With this addition, users can check block alignment directly through the
>>> block layer tracepoints without needing any additional tools.
>>>
>>> In case we have a use case, this can be extended to other tracepoints,
>>> such as complete and error.
>>>
>>> Another potential enhancement could be the integration of this
>>> information into blktrace. Would that be a feasible option to consider?
>>>
>>> [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
>>> [2] blkalgn tool written in eBPF/bcc:
>>> https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
>>> [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
>>> ---
>>>    block/blk-mq.c               | 29 +++++++++++++++++++++++++++++
>>>    include/linux/blk-mq.h       | 11 +++++++++++
>>>    include/linux/blkdev.h       |  6 ++++++
>>>    include/trace/events/block.h |  7 +++++--
>>>    4 files changed, 51 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 831c5cf5d874..714452bc236b 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>>>    }
>>>    EXPORT_SYMBOL_GPL(blk_rq_poll);
>>> +u32 __blk_rq_lba_algn(struct request *req)
>>
>> why use "algn", and not "align"?
>>
>> "algn" is not a natural abbreviation of "alignment".
> 
> That's okay with me, changing the var name to a more natural abbreviation.
> 
>>
>> And why can't userspace figure this out? All the info is available already,
>> right?
> 
> We are interested in the block alignment (LBA and size) at block device driver
> level, not userspace level. That is, everything that is going out from the block
> layer. Using the block tracing points currently available makes it block-driver
> generic.

I am just saying that the information already present in the block trace 
point can be used to get this "alignment" info, right? And userspace can 
do the work of reading those trace events to find this "alignment" info.

Thanks,
John


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ