linux-stable-mirror

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2026-01-23 15:12:55 +01:00

Author	SHA1	Message	Date
Shuicheng Lin	08fdfd260e	drm/xe/hw_engine_group: Fix double write lock release in error path In xe_hw_engine_group_get_mode(), a write lock is acquired before calling switch_mode(), which in turn invokes xe_hw_engine_group_suspend_faulting_lr_jobs(). On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(), the write lock is released there, and then again in xe_hw_engine_group_get_mode(), leading to a double release. Fix this by keeping both acquire and release operation in xe_hw_engine_group_get_mode(). Fixes: `770bd1d341` ("drm/xe/hw_engine_group: Ensure safe transition between execution modes") Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `662d98b8b3`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-03 14:17:00 -05:00
Matthew Auld	2d1684a077	drm/xe/uapi: loosen used tracking restriction Currently this is hidden behind perfmon_capable() since this is technically an info leak, given that this is a system wide metric. However the granularity reported here is always PAGE_SIZE aligned, which matches what the core kernel is already willing to expose to userspace if querying how many free RAM pages there are on the system, and that doesn't need any special privileges. In addition other drm drivers seem happy to expose this. The motivation here if with oneAPI where they want to use the system wide 'used' reporting here, so not the per-client fdinfo stats. This has also come up with some perf overlay applications wanting this information. Fixes: `1105ac15d2` ("drm/xe/uapi: restrict system wide accounting") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com (cherry picked from commit `4d0b035fd6`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-03 14:16:55 -05:00
Mallesh Koujalagi	6982a462cb	drm/xe/xe_late_bind_fw: Initialize uval variable in xe_late_bind_fw_num_fans() Initialize the uval variable to 0 in xe_late_bind_fw_num_fans() to fix a potential use of uninitialized variable warning and ensure predictable behavior. The variable is passed by reference to xe_pcode_read() which should populate it on success, but initializing it to 0 provides a safe default value and follows kernel coding best practices. v2: - uval = 0 which serves as both a safe default and the fallback value when the pcode read operation fails. v3: - Handle MMIO failure (Rodrigo) - The function should probably return the error and make the uval as pointer-argument, like the pcode_read. - Change the caller of this function to propagate the error upwards if mmio failed. Fixes: `45832bf9c1` ("drm/xe/xe_late_bind_fw: Initialize late binding firmware") Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> Link: https://lore.kernel.org/r/20251002005648.3185636-1-mallesh.koujalagi@intel.com Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit `07abc16c14`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:52 -07:00
Thomas Hellström	10aa5c8060	drm/gpusvm, drm/xe: Fix userptr to not allow device private pages When userptr is used on SVM-enabled VMs, a non-NULL hmm_range::dev_private_owner value might mean that hmm_range_fault() attempts to return device private pages. Either that will fail, or the userptr code will not know how to handle those. Use NULL for hmm_range::dev_private_owner to migrate such pages to system. In order to do that, move the struct drm_gpusvm::device_private_page_owner field to struct drm_gpusvm_ctx::device_private_page_owner so that it doesn't remain immutable over the drm_gpusvm lifetime. v2: - Don't conditionally compile xe_svm_devm_owner(). - Kerneldoc xe_svm_devm_owner(). Fixes: `9e97874148` ("drm/xe/userptr: replace xe_hmm with gpusvm") Cc: Matthew Auld <matthew.auld@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://lore.kernel.org/r/20250930122752.96034-1-thomas.hellstrom@linux.intel.com (cherry picked from commit `ad298d9ec9`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:52 -07:00
Colin Ian King	5b440a8bba	drm/xe/xe_late_bind_fw: Fix missing initialization of variable offset The variable offset is not being initialized, and it is only set inside a for-loop if entry->name is the same as manifest_entry. In the case where it is not initialized a non-zero check on offset is potentialy checking a bogus uninitalized value. Fix this by initializing offset to zero. Fixes: `efa29317a5` ("drm/xe/xe_late_bind_fw: Extract and print version info") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250924102208.9216-1-colin.i.king@gmail.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit `20f3b28e2e`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:52 -07:00
Thomas Hellström	17f6f6f25a	drm/xe/bo: Fix an idle assertion for local bos Before calling ttm_bo_populate() in the CPU fault path of a bo, we assert that the bo is not being migrated. However, for local bos we share the reservation object with other local bos that might be in the process of being migrated. Also some VM operations may attach USAGE_KERNEL fences to the common reservation object and trigger false positives from the assert. So remove the assert and instead wait for bo idle. This may unnecessarily wait for idle in some cases but since we're doing this wait later in the fault path anyway we might as well do it here as well. This fixes warnings like: Sep 25 14:56:23 desky kernel: ------------[ cut here ]------------ Sep 25 14:56:23 desky kernel: xe 0000:03:00.0: [drm] Assertion `dma_resv_test_signaled(tbo->base.resv, DMA_RESV_USAGE_KERNEL) \|\| (tbo->ttm && ttm_tt_is_populated(tbo->ttm))` failed! platform: BATTLEMAGE subplatform: 1 graphics: Xe2_HPG 20.01 step A0 media: Xe2_HPM 13.01 step A1 Sep 25 14:56:23 desky kernel: WARNING: CPU: 6 PID: 24767 at drivers/gpu/drm/xe/xe_bo.c:1748 xe_bo_fault_migrate+0x1bb/0x300 [xe] Sep 25 14:56:23 desky kernel: Modules linked in: cpuid dm_crypt xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfr> Sep 25 14:56:23 desky kernel: snd_soc_sdca snd_seq_midi prime_numbers coretemp snd_seq_midi_event drm_ttm_helper snd_hda_codec drm_buddy drm_exec snd_rawmidi snd_soc_core snd_hda_cor> Sep 25 14:56:23 desky kernel: CPU: 6 UID: 1000 PID: 24767 Comm: steamwebhelper Tainted: G U W 6.17.0-rc7+ #32 PREEMPT(voluntary) Sep 25 14:56:23 desky kernel: Tainted: [U]=USER, [W]=WARN Sep 25 14:56:23 desky kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D36/PRO Z690-P DDR4 (MS-7D36), BIOS A.A1 10/18/2022 Sep 25 14:56:23 desky kernel: RIP: 0010:xe_bo_fault_migrate+0x1bb/0x300 [xe] Sep 25 14:56:23 desky kernel: Code: fa 64 29 f9 48 c7 c7 40 e0 d3 c1 51 48 c7 c1 c0 e3 d3 c1 52 4c 8b 45 c0 41 50 44 8b 4d c8 4d 89 e0 48 8b 55 a8 e8 25 27 95 ef <0f> 0b 48 83 c4 40 4> Sep 25 14:56:23 desky kernel: RSP: 0000:ffffae1ca88c7b10 EFLAGS: 00010286 Sep 25 14:56:23 desky kernel: RAX: 0000000000000000 RBX: ffff8d7cfd7e6800 RCX: 0000000000000027 Sep 25 14:56:23 desky kernel: RDX: ffff8d845019cec8 RSI: 0000000000000001 RDI: ffff8d845019cec0 Sep 25 14:56:23 desky kernel: RBP: ffffae1ca88c7bc8 R08: 0000000000000000 R09: 0000000000000000 Sep 25 14:56:23 desky kernel: R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffc1db1faa Sep 25 14:56:23 desky kernel: R13: ffffffffc1db2ab4 R14: 0000000000000001 R15: ffffae1ca88c7bd8 Sep 25 14:56:23 desky kernel: FS: 00007fb1baf31940(0000) GS:ffff8d849c870000(0000) knlGS:0000000000000000 Sep 25 14:56:23 desky kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 25 14:56:23 desky kernel: CR2: 00007fb1b2860020 CR3: 00000001705a9004 CR4: 0000000000772ef0 Sep 25 14:56:23 desky kernel: PKRU: 55555558 Sep 25 14:56:23 desky kernel: Call Trace: Sep 25 14:56:23 desky kernel: <TASK> Sep 25 14:56:23 desky kernel: xe_bo_cpu_fault_fastpath+0x11e/0x220 [xe] Sep 25 14:56:23 desky kernel: xe_bo_cpu_fault+0x84/0x410 [xe] Sep 25 14:56:23 desky kernel: ? __x64_sys_mmap+0x33/0x50 Sep 25 14:56:23 desky kernel: ? x64_sys_call+0x1b2e/0x20d0 Sep 25 14:56:23 desky kernel: ? do_syscall_64+0x9d/0x1f0 Sep 25 14:56:23 desky kernel: ? __check_object_size+0x4a/0x2e0 Sep 25 14:56:23 desky kernel: __do_fault+0x36/0x190 Sep 25 14:56:23 desky kernel: do_fault+0xcf/0x570 Sep 25 14:56:23 desky kernel: __handle_mm_fault+0x92b/0xfe0 Sep 25 14:56:23 desky kernel: ? ktime_get_mono_fast_ns+0x39/0xd0 Sep 25 14:56:23 desky kernel: handle_mm_fault+0x164/0x2c0 Sep 25 14:56:23 desky kernel: do_user_addr_fault+0x2cb/0x840 Sep 25 14:56:23 desky kernel: exc_page_fault+0x75/0x180 Sep 25 14:56:23 desky kernel: asm_exc_page_fault+0x27/0x30 Sep 25 14:56:23 desky kernel: RIP: 0033:0x7fb1bc388bb7 Sep 25 14:56:23 desky kernel: Code: 48 ff c7 48 01 fe 48 8d 54 11 80 0f 1f 84 00 00 00 00 00 c5 fe 6f 0e c5 fe 6f 56 20 c5 fe 6f 5e 40 c5 fe 6f 66 60 48 83 ee 80 <c5> fd 7f 0f c5 fd 7> Sep 25 14:56:23 desky kernel: RSP: 002b:00007ffd7814fad8 EFLAGS: 00010207 Sep 25 14:56:23 desky kernel: RAX: 00007fb1b2860000 RBX: 0000000000000690 RCX: 00007fb1b2860000 Sep 25 14:56:23 desky kernel: RDX: 00007fb1b2860610 RSI: 0000556eda79f4c0 RDI: 00007fb1b2860020 Sep 25 14:56:23 desky kernel: RBP: 00007ffd7814fb60 R08: 0000000000000000 R09: 000000012be0e000 Sep 25 14:56:23 desky kernel: R10: 00007fb1b2860000 R11: 0000000000000246 R12: 0000556edd39a240 Sep 25 14:56:23 desky kernel: R13: 00007fb1b2dcb010 R14: 0000556eda79f420 R15: 0000000000000000 Sep 25 14:56:23 desky kernel: </TASK> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5250 Fixes: `c2ae94cf8c` ("drm/xe: Convert the CPU fault handler for exhaustive eviction") Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250929112649.6131-1-thomas.hellstrom@linux.intel.com (cherry picked from commit `8f1756a7ea`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:52 -07:00
Michal Wajdeczko	3734c8184d	drm/xe/vf: Don't claim support for firmware late-bind if VF In general, the VFs can't load firmwares so attempt to initialize the firmware late-bind component leads to errors like: [] xe 0000:03:00.1: [drm] ERROR Late bind component not bound Fixes: `918bd789d6` ("drm/xe/xe_late_bind_fw: Introduce xe_late_bind_fw") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6190 Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://lore.kernel.org/r/20250928174811.198933-3-michal.wajdeczko@intel.com (cherry picked from commit `e35e288090`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:52 -07:00
Michal Wajdeczko	de61d5944c	drm/xe/vf: Rename sriov_update_device_info This is a VF only function and its name should reflect that to avoid any confusion. Move the VF check to the caller side. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Link: https://lore.kernel.org/r/20250928174811.198933-2-michal.wajdeczko@intel.com (cherry picked from commit `b88bb1eefa`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Lucas De Marchi	59b7ed0ba2	drm/xe/configfs: Improve doc for ctx_restore* attributes Spell out the syntax instead of only using examples. Particularly important the <engine-class> part since that's different than engines_allowed and may confuse users. The same batch buffer is used for all engines of a certain class. Cc: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Fixes: `e2a9854d80` ("drm/xe/configfs: Allow to select by class only") Link: https://lore.kernel.org/r/20250924152709.659483-4-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `47ca7acff4`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Lucas De Marchi	7646423c7f	drm/xe/configfs: Fix engine class parsing If mask is NULL, only the engine class should be accepted, so the pattern string should be completely parsed. This should fix passing e.g. rcs0 to ctx_restore_post_bb when it's only expecting the engine class. Reported-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Closes: https://lore.kernel.org/r/20250922155544.67712-1-jonathan.cavitt@intel.com Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aNJKnrCQmL9xS9Gv@stanley.mountain Fixes: `e2a9854d80` ("drm/xe/configfs: Allow to select by class only") Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250924152709.659483-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `dd79796716`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Michal Wajdeczko	7bd03e3914	drm/xe/tests: Fix build break on clang 16.0.6 The following error was reported when building with clang 16.0.6: In file included from drivers/gpu/drm/xe/xe_pci.c:1104: >> drivers/gpu/drm/xe/tests/xe_pci.c:214:2: error: initializer \ element is not a compile-time constant graphics_ip_xelp, ^~~~~~~~~~~~~~~~ drivers/gpu/drm/xe/tests/xe_pci.c:221:2: error: initializer \ element is not a compile-time constant media_ip_xem, ^~~~~~~~~~~~ 2 errors generated. Fix that by explicit re-definition of pre-GMDID IPs, as there are not so many of them. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509192041.tQwdE4DS-lkp@intel.com/ Fixes: `5bb5258e35` ("drm/xe/tests: Add pre-GMDID IP descriptors to param generators") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250922101207.192028-1-michal.wajdeczko@intel.com (cherry picked from commit `2de80e2da7`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2025-10-02 21:57:51 -07:00
Dave Airlie	b2ec5ca9d5	Merge tag 'amd-drm-next-6.18-2025-09-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-09-26: amdgpu: - Misc fixes - Misc cleanups - SMU 13.x fixes - MES fix - VCN 5.0.1 reset fixes - DCN 3.2 watermark fixes - AVI infoframe fixes - PSR fix - Brightness fixes - DCN 3.1.4 fixes - DCN 3.1+ DTM fixes - DCN powergating fixes - DMUB fixes - DCN/SMU cleanup - DCN stutter fixes - DCN 3.5 fixes - GAMMA_LUT fixes - Add UserQ documentation - GC 9.4 reset fixes - Enforce isolation cleanups - UserQ fixes - DC/non-DC common modes cleanup - DCE6-10 fixes amdkfd: - Fix a race in sw_fini - Switch partition fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250926143918.2030854-1-alexander.deucher@amd.com	2025-09-30 09:26:31 +10:00
Dave Airlie	62bea0e1d5	Merge tag 'drm-habanalabs-next-2025-09-25' of https://github.com/HabanaAI/drivers.accel.habanalabs.kernel into drm-next This tag contains habanalabs driver changes for v6.18. It continues the previous upstream work from tags/drm-habanalabs-next-2024-06-23, Including improvements in debug and visibility, alongside general code cleanups, and new features such as vmalloc-backed coherent mmap, HLDIO infrastructure, etc. Signed-off-by: Dave Airlie <airlied@redhat.com> From: "Elbaz, Koby" <koby.elbaz@intel.com> Link: https://lore.kernel.org/r/da02d370-9967-49d2-9eef-7aeaa40c987c@intel.com	2025-09-26 13:26:51 +10:00
Dave Airlie	a2caae58f8	Merge tag 'drm-misc-next-fixes-2025-09-25' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next Short summary of fixes pull: bridge: - waveshare-dsi: Fix error handling in probe function pixpaper: - select GEM SHMEM helpers Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20250925064257.GA9107@linux.fritz.box	2025-09-26 13:06:39 +10:00
Mario Limonciello	df2ba57094	drm/amd: Add name to modes from amdgpu_connector_add_common_modes() [Why] When DC adds common modes it adds modes with a string to match what they are. Non-DC doesn't. This can be inconsistent when turning on/off DC support. [How] Add a name member to common_modes[] and copy it into the drm display mode. Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Link: https://lore.kernel.org/r/20250924161624.1975819-6-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:54:22 -04:00
Mario Limonciello	6d622755bc	drm/amd: Drop some common modes from amdgpu_connector_add_common_modes() [Why] DC and non-DC codepaths have different sets of common modes that are added for eDP and LVDS cases. This can cause different behaviors for turning on DC on hardware that can support both. [How] Drop extra modes from amdgpu_connector_add_common_modes() not present in amdgpu_dm_connector_add_common_modes(). Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250924161624.1975819-5-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:54:12 -04:00
Alex Deucher	dbf2341569	drm/amdgpu: update MODULE_PARM_DESC for freesync_video To better describe what it does. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3756 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:54:07 -04:00
Mario Limonciello	123a1750c5	drm/amd: Use dynamic array size declaration for amdgpu_connector_add_common_modes() [Why] Adding or removing a mode from common_modes[] can be fragile if a user forgot to update the for loop boundaries. [How] Use ARRAY_SIZE() to detect size of the array and use that instead. Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Link: https://lore.kernel.org/r/20250924161624.1975819-4-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:55 -04:00
Timur Kristóf	1f721ebcf3	drm/amd/display: Share dce100_validate_global with DCE6-8 The dce100_validate_global function was verbatim exactly the same as dce60_validate_global and dce80_validate_global. Share dce100_validate_global between DCE6-10 to save code size. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:46 -04:00
Timur Kristóf	ee352f6c56	drm/amd/display: Share dce100_validate_bandwidth with DCE6-8 DCE6-8 have very similar capabilities to DCE10, they support the same DP and HDMI versions and work similarly. Share dce100_validate_bandwidth between DCE6-10 to reduce code duplication in the DC driver. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:33 -04:00
Jesse.Zhang	b8ae2640f9	drm/amdgpu: Fix fence signaling race condition in userqueue This commit fixes a potential race condition in the userqueue fence signaling mechanism by replacing dma_fence_is_signaled_locked() with dma_fence_is_signaled(). The issue occurred because: 1. dma_fence_is_signaled_locked() should only be used when holding the fence's individual lock, not just the fence list lock 2. Using the locked variant without the proper fence lock could lead to double-signaling scenarios: - Hardware completion signals the fence - Software path also tries to signal the same fence By using dma_fence_is_signaled() instead, we properly handle the locking hierarchy and avoid the race condition while still maintaining the necessary synchronization through the fence_list_lock. v2: drop the comment (Christian) Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:23 -04:00
Yifan Zhang	45da20e00d	amd/amdkfd: enhance kfd process check in switch partition current switch partition only check if kfd_processes_table is empty. kfd_prcesses_table entry is deleted in kfd_process_notifier_release, but kfd_process tear down is in kfd_process_wq_release. consider two processes: Process A (workqueue) -> kfd_process_wq_release -> Access kfd_node member Process B switch partition -> amdgpu_xcp_pre_partition_switch -> amdgpu_amdkfd_device_fini_sw -> kfd_node tear down. Process A and B may trigger a race as shown in dmesg log. This patch is to resolve the race by adding an atomic kfd_process counter kfd_processes_count, it increment as create kfd process, decrement as finish kfd_process_wq_release. v2: Put kfd_processes_count per kfd_dev, move decrement to kfd_process_destroy_pdds and bug fix. (Philip Yang) [3966658.307702] divide error: 0000 [#1] SMP NOPTI [3966658.350818] i10nm_edac [3966658.356318] CPU: 124 PID: 38435 Comm: kworker/124:0 Kdump: loaded Tainted [3966658.356890] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu] [3966658.362839] nfit [3966658.366457] RIP: 0010:kfd_get_num_sdma_engines+0x17/0x40 [amdgpu] [3966658.366460] Code: 00 00 e9 ac 81 02 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 48 8b 4f 08 48 8b b7 00 01 00 00 8b 81 58 26 03 00 99 <f7> be b8 01 00 00 80 b9 70 2e 00 00 00 74 0b 83 f8 02 ba 02 00 00 [3966658.380967] x86_pkg_temp_thermal [3966658.391529] RSP: 0018:ffffc900a0edfdd8 EFLAGS: 00010246 [3966658.391531] RAX: 0000000000000008 RBX: ffff8974e593b800 RCX: ffff888645900000 [3966658.391531] RDX: 0000000000000000 RSI: ffff888129154400 RDI: ffff888129151c00 [3966658.391532] RBP: ffff8883ad79d400 R08: 0000000000000000 R09: ffff8890d2750af4 [3966658.391532] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000 [3966658.391533] R13: ffff8883ad79d400 R14: ffffe87ff662ba00 R15: ffff8974e593b800 [3966658.391533] FS: 0000000000000000(0000) GS:ffff88fe7f600000(0000) knlGS:0000000000000000 [3966658.391534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3966658.391534] CR2: 0000000000d71000 CR3: 000000dd0e970004 CR4: 0000000002770ee0 [3966658.391535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3966658.391535] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [3966658.391536] PKRU: 55555554 [3966658.391536] Call Trace: [3966658.391674] deallocate_sdma_queue+0x38/0xa0 [amdgpu] [3966658.391762] process_termination_cpsch+0x1ed/0x480 [amdgpu] [3966658.399754] intel_powerclamp [3966658.402831] kfd_process_dequeue_from_all_devices+0x5b/0xc0 [amdgpu] [3966658.402908] kfd_process_wq_release+0x1a/0x1a0 [amdgpu] [3966658.410516] coretemp [3966658.434016] process_one_work+0x1ad/0x380 [3966658.434021] worker_thread+0x49/0x310 [3966658.438963] kvm_intel [3966658.446041] ? process_one_work+0x380/0x380 [3966658.446045] kthread+0x118/0x140 [3966658.446047] ? __kthread_bind_mask+0x60/0x60 [3966658.446050] ret_from_fork+0x1f/0x30 [3966658.446053] Modules linked in: kpatch_20765354(OEK) [3966658.455310] kvm [3966658.464534] mptcp_diag xsk_diag raw_diag unix_diag af_packet_diag netlink_diag udp_diag act_pedit act_mirred act_vlan cls_flower kpatch_21951273(OEK) kpatch_18424469(OEK) kpatch_19749756(OEK) [3966658.473462] idxd_mdev [3966658.482306] kpatch_17971294(OEK) sch_ingress xt_conntrack amdgpu(OE) amdxcp(OE) amddrm_buddy(OE) amd_sched(OE) amdttm(OE) amdkcl(OE) intel_ifs iptable_mangle tcm_loop target_core_pscsi tcp_diag target_core_file inet_diag target_core_iblock target_core_user target_core_mod coldpgs kpatch_18383292(OEK) ip6table_nat ip6table_filter ip6_tables ip_set_hash_ipportip ip_set_hash_ipportnet ip_set_hash_ipport ip_set_bitmap_port xt_comment iptable_nat nf_nat iptable_filter ip_tables ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sn_core_odd(OE) i40e overlay binfmt_misc tun bonding(OE) aisqos(OE) aisqos_hotfixes(OE) rfkill uio_pci_generic uio cuse fuse nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm idxd_mdev [3966658.491237] vfio_pci [3966658.501196] vfio_pci vfio_virqfd mdev vfio_iommu_type1 vfio iax_crypto intel_pmt_telemetry iTCO_wdt intel_pmt_class iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq [3966658.508537] vfio_virqfd [3966658.517569] snd_seq_device ipmi_ssif isst_if_mbox_pci isst_if_mmio pcspkr snd_pcm idxd intel_uncore ses isst_if_common intel_vsec idxd_bus enclosure snd_timer mei_me snd i2c_i801 i2c_smbus mei i2c_ismt soundcore joydev acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vfat fat [3966658.526851] mdev [3966658.536096] nfsd auth_rpcgss nfs_acl lockd grace slb_vtoa(OE) sunrpc dm_mod hookers mlx5_ib(OE) ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlx5_core(OE) mlxfw(OE) [3966658.540381] vfio_iommu_type1 [3966658.544341] nvme mpt3sas tls drm nvme_core pci_hyperv_intf raid_class psample libcrc32c crc32c_intel mlxdevm(OE) i2c_core [3966658.551254] vfio [3966658.558742] scsi_transport_sas wmi pinctrl_emmitsburg sd_mod t10_pi sg ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [3966658.563004] iax_crypto [3966658.570988] [last unloaded: diagnose] [3966658.571027] ---[ end trace cc9dbb180f9ae537 ]--- Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Philip.Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:53:08 -04:00
Yifan Zhang	99d7181bca	amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw There is race in amdgpu_amdkfd_device_fini_sw and interrupt. if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and kfree(kfd), and KGD interrupt generated. kernel panic log: BUG: kernel NULL pointer dereference, address: 0000000000000098 amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP PGD d78c68067 P4D d78c68067 kfd kfd: amdgpu: Allocated 3969056 bytes on gart PUD 1465b8067 PMD @ Oops: @002 [#1] SMP NOPTI kfd kfd: amdgpu: Total number of KFD nodes to be created: 4 CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40 Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO> OF b1 17 75 Ba 4c 89 e@ 41 Sc 89 c6 e8 07 38 5d RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018 0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098 ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020 0000000000000000 R11: 0000000000000000 R12: 0900000000000002 ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00 CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033 CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu] ? amdgpu_fence_process+0xa4/0x150 [amdgpu] kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace amdgpu_irq_dispatch+0x165/0x210 [amdgpu] amdgpu_ih_process+0x80/0x100 [amdgpu] amdgpu: Virtual CRAT table created for GPU amdgpu_irq_handler+0x1f/@x60 [amdgpu] __handle_irq_event_percpu+0x3d/0x170 amdgpu: Topology: Add dGPU node [0x74a2:0x1002] handle_irq_event+0x5a/@xcO handle_edge_irq+0x93/0x240 kfd kfd: amdgpu: KFD node 1 partition @ size 49148M asm_call_irq_on_stack+0xf/@x20 </IRQ> common_interrupt+0xb3/0x130 asm_common_interrupt+0x1le/0x40 5.10.134-010.a1i5000.a18.x86_64 #1 Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:52:36 -04:00
Timur Kristóf	118800b079	drm/amd/display: Reject modes with too high pixel clock on DCE6-10 Reject modes with a pixel clock higher than the maximum display clock. Use 400 MHz as a fallback value when the maximum display clock is not known. Pixel clocks that are higher than the display clock just won't work and are not supported. With the addition of the YUV422 fallback, DC can now accidentally select a mode requiring higher pixel clock than actually supported when the DP version supports the required bandwidth but the clock is otherwise too high for the display engine. DCE 6-10 don't support these modes but they don't have a bandwidth calculation to reject them properly. Fixes: `db291ed173` ("drm/amd/display: Add fallback path for YCBCR422") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:51:07 -04:00
Mario Limonciello	210844d2c0	drm/amd: Drop unnecessary check in amdgpu_connector_add_common_modes() [Why] amdgpu_connector_add_common_modes() has a check for the width and height of common modes being too small, but the array of common_modes[] has fixed values. The check is dead code. [How] Drop unnecessary check. Cc: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250924161624.1975819-3-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:50:58 -04:00
Mario Limonciello	0fb915d64d	drm/amd/display: Only enable common modes for eDP and LVDS [Why] The main reason common modes are added is for compatibility with clone mode when a laptop is connected to a projector or external monitor. Since commit `978fa2f6d0` ("drm/amd/display: Use scaling for non-native resolutions on eDP") when non-native modes are picked for eDP the GPU scalar will be used. This is because it is inconsistent whether eDP panels have the capability to actually drive non-native resolutions. With panels connected to other connectors this limitation generally doesn't exist as we the EDID will advertise support for a number of resolutions and monitors will use built in scaling hardware. Comparing DC and non-DC code paths the non-DC code path only adds common modes for LVDS and eDP whereas the DC codepath does it for all connector types. In the past there was an experiment done to disable common mode adding for eDP and LVDS from commit `6d396e7ac1` ("drm/amd/display: Disable common modes for LVDS") and commit `7948afb46a` ("drm/amd/display: Disable common modes for eDP") but this was reverted in commit `a8b79b0918` ("drm/amd: Re-enable common modes for eDP and LVDS") because it caused problems with Xorg. [How] Only add common modes for eDP and LVDS for DC, matching the behavior of non-DC. Suggested-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250924161624.1975819-2-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:47:55 -04:00
Sunil Khatri	4e3b45d7b6	drm/amdgpu: remove the redeclaration of variable i Variable "i" has been redeclared as integer later in the function which is wrong and not serving any purpose. Fixes: `899fbde146` ("drm/amdgpu: replace get_user_pages with HMM mirror helpers") Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:46:35 -04:00
Prike Liang	883bd89d00	drm/amdgpu/userq: assign an error code for invalid userq va It should return an error code if userq VA validation fails. Fixes: `9e46b8bb05` ("drm/amdgpu: validate userq buffer virtual address and size") Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:40:18 -04:00
Christian König	90e09ea4cf	drm/amdgpu: revert "rework reserved VMID handling" v2 This reverts commit `e44a0fe630`. Initially we used VMID reservation to enforce isolation between processes. That has now been replaced by proper fence handling. Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID for SPM, so restore that approach by reverting back to only allowing a single process to use the reserved VMID. Only compile tested for now. v2: use -ENOENT instead of -EINVAL if VMID is not available Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:39:00 -04:00
Christian König	66f3883dbc	drm/amdgpu: remove leftover from enforcing isolation by VMID Initially we enforced isolation by reserving a VMID, but that practice was now removed. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:38:54 -04:00
Jesse.Zhang	7469567d88	drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails Add a fallback mechanism to attempt pipe reset when KCQ reset fails to recover the ring. After performing the KCQ reset and queue remapping, test the ring functionality. If the ring test fails, initiate a pipe reset as an additional recovery step. v2: fix the typo (Lijo) v3: try pipeline reset when kiq mapping fails (Lijo) Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-25 15:38:48 -04:00
Pavan S	6ca282c3e6	accel/habanalabs: add Infineon version check On HL338 ASICs, the Infineon first‑stage firmware is not present and the reported version is 0. In this case printing a version number is misleading, as it suggests valid firmware when it does not exist. Fix this by printing the first‑stage Infineon firmware version only if the reported value is non‑zero. This avoids confusing or incorrect log messages on devices where the first stage is not applicable. Signed-off-by: Pavan S <pavan.sreenivas@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:14:45 +03:00
Konstantin Sinyuk	a0d866bab1	accel/habanalabs/gaudi2: read preboot status after recovering from dirty state Dirty state can occur when the host VM undergoes a reset while the device does not. In such a case, the driver must reset the device before it can be used again. As part of this reset, the device capabilities are zeroed. Therefore, the driver must read the Preboot status again to learn the Preboot state, capabilities, and security configuration. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:32 +03:00
Ariel Aviad	65a3f5bc33	accel/habanalabs: add HL_GET_P_STATE passthrough type Add a new passthrough type HL_GET_P_STATE to the cpucp generic ioctl to allow userspace to read the device performance state via firmware. Signed-off-by: Ariel Aviad <ariel.aviad@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:31 +03:00
Konstantin Sinyuk	eeb38d0e91	accel/habanalabs: add debugfs interface for HLDIO testing Add debugfs files for NVMe Direct I/O (HLDIO) functionality. This interface allows userspace access to direct SSD ↔ device transfers through debugfs nodes. Four debugfs files are created under /sys/kernel/debug/habanalabs/hlN/: - dio_ssd2hl : trigger SSD-to-device transfers - dio_hl2ssd : trigger device-to-SSD transfers (placeholder, not yet implemented) - dio_stats : show transfer statistics - dio_reset : reset statistics counters Usage examples: # Perform SSD → device transfer echo "fd=3 va=0x10000 off=0 len=4096" > \ /sys/kernel/debug/habanalabs/hl0/dio_ssd2hl # View statistics cat /sys/kernel/debug/habanalabs/hl0/dio_stats # Reset counters echo 1 > /sys/kernel/debug/habanalabs/hl0/dio_reset This interface provides access to HLDIO functionality for validation and diagnostics. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Farah Kassabri <farah.kassabri@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:31 +03:00
Konstantin Sinyuk	8cbacc9a27	accel/habanalabs: add NVMe Direct I/O (HLDIO) infrastructure Introduce NVMe Direct I/O (HLDIO) infrastructure to support peer‑to‑peer DMA in the habanalabs driver. This adds internal helpers and data structures to enable direct transfers between NVMe storage and device memory. The feature is built only when CONFIG_HL_HLDIO is enabled. A debugfs interface is also provided for functional validation. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Farah Kassabri <farah.kassabri@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:30 +03:00
Moti Haimovski	513024d5a0	accel/habanalabs: support mapping cb with vmalloc-backed coherent memory When IOMMU is enabled, dma_alloc_coherent() with GFP_USER may return addresses from the vmalloc range. If such an address is mapped without VM_MIXEDMAP, vm_insert_page() will trigger a BUG_ON due to the VM_PFNMAP restriction. Fix this by checking for vmalloc addresses and setting VM_MIXEDMAP in the VMA before mapping. This ensures safe mapping and avoids kernel crashes. The memory is still driver-allocated and cannot be accessed directly by userspace. Signed-off-by: Moti Haimovski <moti.haimovski@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:30 +03:00
Ilia Levi	0668db41b5	accel/habanalabs: remove old interface variation of 'access_ok()' The access_ok() API no longer requires the VERIFY_WRITE argument, and the use of the old interface with VERIFY_WRITE is deprecated. Clean up the habanalabs memory manager to use the modern access_ok() interface consistently. This removes old #ifdef guards and aligns the driver with current upstream kernel APIs. Signed-off-by: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:29 +03:00
Konstantin Sinyuk	0529b191ac	accel/habanalabs/gaudi2: use the CPLD_SHUTDOWN event handler After CPLD shutdown event the device is not usable anymore. The common CPLD_SHUTDOWN event handler disables any subsequent device access. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:29 +03:00
Konstantin Sinyuk	083c53a854	accel/habanalabs: disable device access after CPLD_SHUTDOWN After a CPLD shutdown event the device becomes unusable. Prevent further device access once this event is received. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:28 +03:00
Tomer Tayar	cade027efa	accel/habanalabs: fix typo in trace output (cms -> cmd) Fix a typo in TP_printk format string of habanalabs tracepoint: replace "cms" with "cmd". Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:28 +03:00
Tomer Tayar	d0dd796bec	accel/habanalabs: clarify ctx use after hl_ctx_put() in dmabuf release In hl_release_dmabuf(), ctx is dereferenced after calling hl_ctx_put() to obtain the compute device file. This is safe because the dma-buf object holds a file reference taken in export_dmabuf(), and the file release (which drops another ctx reference) can only happen after we drop that file reference via fput(). Thus, this hl_ctx_put() call cannot be the last one at this point. Add a comment explaining this to avoid confusion. Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:27 +03:00
Sharley Calzolari	b5cddeb0dc	accel/habanalabs/gaudi2: add support for logging register accesses from debugfs Add infrastructure for logging the last configuration register accesses that occur via debugfs read/write operations. At interrupt time, these log entries can be dumped to dmesg, which helps in diagnosing the cause of RAZWI and ADDR_DEC interrupts. The logging is implemented as a ring buffer of access entries, with each entry recording timestamp and access details. To ensure correctness under concurrent access, operations are now protected using spinlocks. Entries are copied under lock and then printed after releasing it, which minimizes time spent in the critical section. Signed-off-by: Sharley Calzolari <sharley.calzolari@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:26 +03:00
Ariel Suller	214e26a43f	accel/habanalabs/gaudi2: stringify engine/queue ids Print engine/queue names instead of numerical engine/queue IDs to make logs and debug output more readable. Signed-off-by: Ariel Suller <ariel.suller@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:25 +03:00
Vitaly Margolin	5295be6c4e	accel/habanalabs: add generic message type to get error counters Add a new CPUCP generic message type to retrieve HBM, SRAM and critical error counters from the device. Signed-off-by: Vitaly Margolin <vitaly.margolin@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:25 +03:00
Vered Yavniely	b4fd8e56c9	accel/habanalabs/gaudi2: fix BMON disable configuration Change the BMON_CR register value back to its original state before enabling, so that BMON does not continue to collect information after being disabled. Signed-off-by: Vered Yavniely <vered.yavniely@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:24 +03:00
Tomer Tayar	9f5067531c	accel/habanalabs: return ENOMEM if less than requested pages were pinned EFAULT is currently returned if less than requested user pages are pinned. This value means a "bad address" which might be confusing to the user, as the address of the given user memory is not necessarily "bad". Modify the return value to ENOMEM, as "out of memory" is more suitable in this case. Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>	2025-09-25 09:09:22 +03:00
Jesse.Zhang	4c709ccc47	drm/amd/pm: Add VCN reset message support for SMU v13.0.12 This commit adds support for VCN reset functionality in SMU v13.0.12 by: 1. Adding two new PPSMC messages in smu_v13_0_12_ppsmc.h: - PPSMC_MSG_ResetVCN (0x5E) - Updates PPSMC_Message_Count to 0x5F to account for new messages 2. Adding message mapping for ResetVCN in smu_v13_0_12_ppt.c: - Maps SMU_MSG_ResetVCN to PPSMC_MSG_ResetVCN These changes enable proper VCN reset handling through the SMU firmware interface for compatible AMD GPUs. v2: Added fw version check to support vcn queue reset. Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:41 -04:00
Jesse.Zhang	5886090032	drm/amdgpu: Move VCN reset mask setup to late_init for VCN 5.0.1 This patch moves the initialization of the VCN supported_reset mask from sw_init to a new late_init function for VCN 5.0.1. The change ensures that all necessary hardware and firmware initialization is complete before determining the supported reset types. Key changes: - Added vcn_v5_0_1_late_init() function to handle late initialization - Moved supported_reset mask setup from sw_init to late_init - Added check for per-queue reset support via amdgpu_dpm_reset_vcn_is_supported() - Updated ip_funcs to use the new late_init function This change helps ensure proper reset behavior by waiting until all dependencies are initialized before determining available reset types. Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Ruili Ji <ruiliji2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:37 -04:00
Jesse.Zhang	dc704458dd	drm/amdgpu: Add ring reset support for VCN v5.0.1 Implement the ring reset callback for VCN v5.0.1 to properly handle hardware recovery when encountering GPU hangs. The new functionality: 1. Adds vcn_v5_0_1_ring_reset() function that: - Prepares for reset using amdgpu_ring_reset_helper_begin() - Performs VCN instance reset via amdgpu_dpm_reset_vcn() - Re-initializes hardware through vcn_v5_0_1_hw_init_inst() - Restarts DPG mode with vcn_v5_0_1_start_dpg_mode() - Completes reset with amdgpu_ring_reset_helper_end() 2. Hooks the reset function into the unified ring functions via: - Adding .reset = vcn_v5_0_1_ring_reset to vcn_v5_0_1_unified_ring_vm_funcs 3. Maintains existing behavior for SR-IOV VF cases by checking RRMT status This provides proper hardware recovery capabilities for VCN 5.0.1 IP block during fault conditions, matching functionality available in other VCN versions. v2: Remove the RRMT_ENABLED cap setting in the reset function and replace adev->vcn.inst[ring->me].indirect_sram with vinst->indirect_sram (Lijo) Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Ruili Ji <ruiliji2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-23 10:41:27 -04:00

1 2 3 4 5 ...

1384752 Commits