Date:   Sat, 4 May 2019 18:40:42 +0900
From:   Minwoo Im <minwoo.im.dev@...il.com>
To:     Akinobu Mita <akinobu.mita@...il.com>,
        Christoph Hellwig <hch@....de>
Cc:     Jens Axboe <axboe@...com>, Sagi Grimberg <sagi@...mberg.me>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-nvme@...ts.infradead.org,
        Keith Busch <keith.busch@...el.com>,
        Keith Busch <kbusch@...nel.org>,
        Johannes Berg <johannes@...solutions.net>
Subject: Re: [PATCH 0/4] nvme-pci: support device coredump

Hi Akinobu,

On 5/4/19 1:20 PM, Akinobu Mita wrote:
> On Fri, May 3, 2019 at 21:20, Christoph Hellwig <hch@....de> wrote:
>>
>> On Fri, May 03, 2019 at 06:12:32AM -0600, Keith Busch wrote:
>>> Could you actually explain how the rest is useful? I personally have
>>> never encountered an issue where knowing these values would have helped:
>>> every device timeout always needed device specific internal firmware
>>> logs in my experience.
> 
> I agree that the device-specific internal logs like telemetry are the most
> useful.  The memory dump of the command queues and completion queues is
> not that powerful, but it helps to know what commands had been submitted
> before the controller went wrong (IOW, it's sometimes not enough to know
> which commands actually failed), and it can be parsed without
> vendor-specific knowledge.

I'm not sure I can say that the memory dump of the queues is useless
at all.

As you mentioned, sometimes it's not enough to know which command
actually failed, because we might want to know what happened before and
after the actual failure.

But the information about how commands were handled inside the device
would be much more useful for figuring out what happened, because in
the case of multiple queues, the arbitration among them cannot be
represented by this memory dump.
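
That said, the dump is at least self-describing.  As a minimal sketch
(my illustration, not code from this series; the struct and field names
are made up here), a dumped submission queue can be decoded using only
the 64-byte SQE layout fixed by the NVMe specification:

#include <stdint.h>
#include <stdio.h>

/* Spec-defined 64-byte submission queue entry; no vendor knowledge
 * is needed to decode a raw SQ memory dump into these fields. */
struct nvme_sqe {
	uint8_t  opcode;	/* CDW0[07:00]: opcode */
	uint8_t  flags;		/* CDW0[15:08]: FUSE/PSDT bits */
	uint16_t cid;		/* CDW0[31:16]: command identifier */
	uint32_t nsid;		/* namespace identifier */
	uint64_t rsvd2;
	uint64_t mptr;		/* metadata pointer */
	uint64_t prp1;		/* data pointer, PRP entry 1 */
	uint64_t prp2;		/* data pointer, PRP entry 2 */
	uint32_t cdw10;		/* command-specific dwords 10..15 */
	uint32_t cdw11;
	uint32_t cdw12;
	uint32_t cdw13;
	uint32_t cdw14;
	uint32_t cdw15;
};

/* Print what had been queued in a dumped SQ region. */
static void decode_sq(const struct nvme_sqe *sq, unsigned int entries)
{
	unsigned int i;

	for (i = 0; i < entries; i++)
		printf("entry %3u: opc 0x%02x cid %u nsid %u cdw10 0x%08x\n",
		       i, sq[i].opcode, sq[i].cid, sq[i].nsid, sq[i].cdw10);
}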

> 
> If the issue is reproducible, the nvme trace is the most powerful for this
> kind of information.  The memory dump of the queues is not that powerful,
> but it can always be enabled by default.

If the memory dump is a key to reproducing some issues, then it will be
powerful to hand it to a vendor to solve them.  But I'm afraid the dump
might not be able to give the relative submission times among the
commands in the queues.
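
(As an aside, the nvme trace can be enabled through tracefs; a tiny
sketch, assuming the usual tracefs mount point, with the same effect as
echo 1 > /sys/kernel/tracing/events/nvme/enable:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* on older systems tracefs may sit at /sys/kernel/debug/tracing */
	int fd = open("/sys/kernel/tracing/events/nvme/enable", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "1", 1) != 1)	/* "1" enables all nvme events */
		perror("write");
	close(fd);
	return 0;
}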

> 
>> Yes.  Also note that NVMe now has the 'device initiated telemetry'
>> feature, which is just a weird name for device coredump.  Wiring that
>> up so that we can easily provide that data to the device vendor would
>> actually be pretty useful.
> 
> This version of the nvme coredump captures the controller registers and
> each queue.  So just before resetting the controller is a suitable time
> to capture these.  If we capture other log pages in this mechanism, the
> coredump procedure will be split into two phases (before resetting the
> controller, and after resetting, as soon as the admin queue is available).

I agree that it would be nice to have information that might not be
that powerful rather than nothing.

But could we request the controller-initiated telemetry log page, if
supported by the controller, to get the internal information at the
point of failure such as a reset?  If the dump were generated along
with the telemetry log page, I think it would be a great clue for
solving the issue.
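
For what it's worth, here is a rough userspace sketch of what I mean
(my own illustration, not code from this series): fetching the header
of the controller-initiated telemetry log page (Log Identifier 08h)
through the standard admin passthrough ioctl.  A real consumer would
parse this 512-byte header to learn the data area sizes and then fetch
the rest with further Get Log Page commands:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	uint8_t buf[512];			/* telemetry log header */
	uint32_t numd = sizeof(buf) / 4 - 1;	/* 0-based dword count */
	struct nvme_admin_cmd cmd;
	int fd, ret;

	fd = open("/dev/nvme0", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode   = 0x02;			/* Get Log Page */
	cmd.nsid     = 0xffffffff;		/* controller-wide log */
	cmd.addr     = (uint64_t)(uintptr_t)buf;
	cmd.data_len = sizeof(buf);
	cmd.cdw10    = 0x08 | (numd & 0xffff) << 16;	/* LID 08h, NUMDL */
	cmd.cdw11    = numd >> 16;			/* NUMDU */

	ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
	if (ret)
		perror("NVME_IOCTL_ADMIN_CMD");
	close(fd);
	return ret ? 1 : 0;
}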

Thanks,
