lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CADnq5_NvBsbmTteDKmzi1DZHPKGfoSMjW5TFfy2x60YDxydc=g@mail.gmail.com>
Date: Thu, 14 Mar 2024 13:51:42 -0400
From: Alex Deucher <alexdeucher@...il.com>
To: "Khatri, Sunil" <sukhatri@....com>
Cc: Sunil Khatri <sunil.khatri@....com>, Alex Deucher <alexander.deucher@....com>, 
	Christian König <christian.koenig@....com>, 
	Shashank Sharma <shashank.sharma@....com>, amd-gfx@...ts.freedesktop.org, 
	dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

On Thu, Mar 14, 2024 at 12:16 PM Khatri, Sunil <sukhatri@....com> wrote:
>
>
> On 3/14/2024 8:12 PM, Alex Deucher wrote:
> > On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil <sukhatri@....com> wrote:
> >>
> >> On 3/14/2024 1:58 AM, Alex Deucher wrote:
> >>> On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@....com> wrote:
> >>>> Add all the IP's information on a SOC to the
> >>>> devcoredump.
> >>>>
> >>>> Signed-off-by: Sunil Khatri <sunil.khatri@....com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
> >>>>    1 file changed, 19 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> index a0dbccad2f53..611fdb90a1fc 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> >>>>                              coredump->reset_task_info.process_name,
> >>>>                              coredump->reset_task_info.pid);
> >>>>
> >>>> +       /* GPU IP's information of the SOC */
> >>>> +       if (coredump->adev) {
> >>>> +               drm_printf(&p, "\nIP Information\n");
> >>>> +               drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
> >>>> +               drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
> >>>> +
> >>>> +               for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
> >>>> +                       struct amdgpu_ip_block *ip =
> >>>> +                               &coredump->adev->ip_blocks[i];
> >>>> +                       drm_printf(&p, "IP type: %d IP name: %s\n",
> >>>> +                                  ip->version->type,
> >>>> +                                  ip->version->funcs->name);
> >>>> +                       drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
> >>>> +                                  ip->version->major,
> >>>> +                                  ip->version->minor,
> >>>> +                                  ip->version->rev);
> >>>> +               }
> >>>> +       }
> >>> I think the IP discovery table would be more useful.  Either walk the
> >>> adev->ip_versions structure, or just include the IP discovery binary.
> >> I did explore the adev->ip_versions and if i just go through the array
> >> it doesn't give any useful information directly.
> >> There are no ways to find directly from adev->ip_versions below things
> >> until i also reparse the discovery binary again like done the discovery
> >> amdgpu_discovery_reg_base_init and walk through the headers of various
> >> ips using discovery binary.
> >> a. Which IP is available on soc or not.
> >> b. How many instances are there
> >> Also i again have to change back to major, minor and rev convention for
> >> this information to be useful. I am exploring it more if i find some
> >> other information i will update.
> >>
> >> adev->ip_block[] is derived from ip discovery only for each block which
> >> is there on the SOC, so we are not reading information which isnt
> >> applicable for the soc. We have name , type and version no of the IPs
> >> available on the soc. If you want i could add no of instances of each IP
> >> too if you think that's useful information here. Could you share what
> >> information is missing in this approach so i can include that.
> > I was hoping to get the actual IP versions for the IPs from IP
> > discovery rather than the versions from the ip_block array.  The
> > latter are common so you can end up with the same version used across
> > a wide variety of chips (e.g., all gfx10.x based chips use the same
> > gfx 10 IP code even if the actual IP version is different for most of
> > the chips).
> Got it. let me check how to get it could be done rightly.
> >
> >> For dumping the IP discovery binary, i dont understand how that
> >> information would be useful directly and needs to be decoded like we are
> >> doing in discovery init. Please correct me if my understanding is wrong
> >> here.
> > It's probably not a high priority, I was just thinking it might be
> > useful to have in case there ended up being some problem related to
> > the IP discovery table on some boards.  E.g., we'd know that all
> > boards with a certain harvest config seem to align with a reported
> > problem.  Similar for vbios.  It's more for telemetry.  E.g., all the
> > boards reporting some problem have a particular powerplay config or
> > whatever.
> I got it.
> But two points of contention here in my understanding. The dump works
> only where there is reset and not sure if it could be used very early in
> development of not. Second point is that devcoredump is 4096
> bytes/4Kbyte of memory where we are dumping all the information. Not
> sure if that could be increased but it might not be enough if we are
> planning to dump all to it.

ah, ok.  Let's skip the IP versions in that case, we can use the
family and rev_id and external_rev_id to look up the IP versions.

Alex

>
> Another point is since we have sysfs/debugfs/info ioctl etc information
> available. We should sort out what really is helpful in debugging GPU
> hang and that's added in devcore.
>
> Regards
> Sunil
>
> >
> > Alex
> >
> >
> >>> Alex
> >>>
> >>>> +
> >>>>           if (coredump->ring) {
> >>>>                   drm_printf(&p, "\nRing timed out details\n");
> >>>>                   drm_printf(&p, "IP Type: %d Ring Name: %s\n",
> >>>> --
> >>>> 2.34.1
> >>>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ