linux-stable-mirror

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2026-04-03 12:05:13 +02:00

Author	SHA1	Message	Date
Shuicheng Lin	549b68ba83	drm/xe/reg_sr: Fix leak on xa_store failure [ Upstream commit `3091723785` ] Free the newly allocated entry when xa_store() fails to avoid a memory leak on the error path. v2: use goto fail_free. (Bala) Fixes: `e5283bd4df` ("drm/xe/reg_sr: Remove register pool") Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260204172810.1486719-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 6bc6fec71ac45f52db609af4e62bdb96b9f5fadb) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-13 17:20:43 +01:00
Matthew Brost	1a42ea28e0	drm/xe: Do not preempt fence signaling CS instructions [ Upstream commit `cdc8a1e11f` ] If a batch buffer is complete, it makes little sense to preempt the fence signaling instructions in the ring, as the largest portion of the work (the batch buffer) is already done and fence signaling consists of only a few instructions. If these instructions are preempted, the GuC would need to perform a context switch just to signal the fence, which is costly and delays fence signaling. Avoid this scenario by disabling preemption immediately after the BB start instruction and re-enabling it after executing the fence signaling instructions. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Carlos Santa <carlos.santa@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260115004546.58060-1-matthew.brost@intel.com (cherry picked from commit 2bcbf2dcde0c839a73af664a3c77d4e77d58a3eb) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-13 17:20:40 +01:00
Matthew Brost	5b27fcb5d1	drm/xe: Only toggle scheduling in TDR if GuC is running [ Upstream commit `dd1ef5e245` ] If the firmware is not running during TDR (e.g., when the driver is unloading), there's no need to toggle scheduling in the GuC. In such cases, skip this step. v4: - Bail on wait UC not running (Niranjana) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Link: https://patch.msgid.link/20260110012739.2888434-4-matthew.brost@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:21:02 -05:00
Matt Roper	bc78bfd287	drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138 [ Upstream commit `bc6387a2e0` ] The PSS_CHICKEN register has been part of the RCS engine's LRC since it was first introduced in Xe_LP. That means that any workarounds that adjust its value (such as Wa_14019988906 and Wa_14019877138) need to be implemented in the lrc_was[] table so that they become part of the default LRC from which all subsequent LRCs are copied. Although these workarounds were implemented correctly on most platforms, they were incorrectly placed on the engine_was[] table for Xe2_HPG. Move the workarounds to the proper lrc_was[] table and switch the 'xe_rtp_match_first_render_or_compute' rule to specifically match the RCS since that's the engine whose LRC manages the register. Bspec: 65182 Fixes: `7f3ee7d880` ("drm/xe/xe2hpg: Add initial GT workarounds") Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com> Link: https://patch.msgid.link/20260205220508.51905-2-matthew.d.roper@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit e04c609eedf4d6748ac0bcada4de1275b034fed6) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:45 -05:00
Shekhar Chauhan	c89bde96e8	drm/xe/xe2_hpg: Add set of workarounds [ Upstream commit `a5d221924e` ] Add set of workarounds for xe2_hpg. -v2: Fix xe2_hpg GMD version for some workarounds. -v3: Removed extra Workaround (Matt Roper) Signed-off-by: Shekhar Chauhan <shekhar.chauhan@intel.com> Signed-off-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250605190804.1287289-3-dnyaneshwar.bhadane@intel.com Stable-dep-of: `bc6387a2e0` ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:45 -05:00
Vinay Belgaumkar	9e18acc5aa	drm/xe/ptl: Apply Wa_13011645652 [ Upstream commit `dddc53806d` ] Extend Wa_13011645652 to PTL. Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250116184659.384874-1-vinay.belgaumkar@intel.com Stable-dep-of: `bc6387a2e0` ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:45 -05:00
Shuicheng Lin	8f6848b2f6	drm/xe/mmio: Avoid double-adjust in 64-bit reads [ Upstream commit `4a9b4e1fa5` ] xe_mmio_read64_2x32() was adjusting register addresses and then calling xe_mmio_read32(), which applies the adjustment again. This may shift accesses twice if adj_offset < adj_limit. There is no issue currently, as for media gt, adj_offset > adj_limit, so the 2nd adjust will be a no-op. But it may not work in future. To fix it, replace the adjusted-address comparison with a direct sanity check that ensures the MMIO address adjustment cutoff never falls within the 8-byte range of a 64-bit register. And let xe_mmio_read32() handle address translation. v2: rewrite the sanity check in a more natural way. (Matt) v3: Add Fixes tag. (Jani) Fixes: `07431945d8` ("drm/xe: Avoid 64-bit register reads") Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260130165621.471408-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit a30f999681126b128a43137793ac84b6a5b7443f) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:44 -05:00
Matt Roper	26a40327c2	drm/xe: Switch MMIO interface to take xe_mmio instead of xe_gt [ Upstream commit `a84590c5ce` ] Since much of the MMIO register access done by the driver is to non-GT registers, use of 'xe_gt' in these interfaces has been a long-standing design flaw that's been hard to disentangle. To avoid a flag day across the whole driver, munge the function names and add temporary compatibility macros with the original function names that can accept either the new xe_mmio or the old xe_gt structure as a parameter. This will allow us to slowly convert parts of the driver over to the new interface independently. Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-54-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:44 -05:00
Matt Roper	f5508f1e65	drm/xe: Adjust mmio code to pass VF substructure to SRIOV code [ Upstream commit `6fb5d1a1d3` ] Although we want to break the GT-centric nature of the MMIO code in the general driver, the SRIOV handling still relies on data in a VF substructure of the GT. So add a GT backpointer, but name it sriov_vf_gt to make it clear that it's only for this one specific special case and will not be set or usable for anything else. v2: - Store backpointer to the GT itself rather than the SRIOV-specific substructure. (Michal) Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> # v1 Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-53-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:44 -05:00
Matt Roper	9472d36b3d	drm/xe: Add xe_tile backpointer to xe_mmio [ Upstream commit `1877c88fa9` ] Once MMIO operations stop being (incorrectly) tied to a GT, we'll still need a backpointer for feature checks, message logging, and tracepoints. Use a tile backpointer since that may allow the most useful debugging output, while also providing access to the xe_device. v2: - Make backpointer an xe_tile instead of xe_device. (Michal) Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> # v1 Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-52-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:44 -05:00
Matt Roper	74bff541ec	drm/xe: Switch mmio_ext to use 'struct xe_mmio' [ Upstream commit `960a83799f` ] The mmio_ext stuff is completely unused right now, but it isn't providing any functionality that couldn't be treated as a regular mmio space. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-51-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:44 -05:00
Matt Roper	0b433e086b	drm/xe: Populate GT's mmio iomap from tile during init [ Upstream commit `fa599b8c95` ] Each GT should share the same register iomap as its parent tile. Future patches will switch to access the iomap through the GT's mmio substruct rather than through the tile. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-50-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:44 -05:00
Matt Roper	85354b21a5	drm/xe: Move GSI offset adjustment fields into 'struct xe_mmio' [ Upstream commit `9d383916a5` ] By moving the GSI adjustment fields into 'struct xe_mmio' we can replace the GT's MMIO substructure with another instance of xe_mmio. At the moment this means MMIO operations wind up pulling information from two different places (the tile's xe_mmio for the iomap and the GT's xe_mmio for the adjustment), but we'll address that in future patches. The type headers change a bit with this change, meaning that various files should be including xe_device_types.h instead of (or in addition to) xe_gt_types.h. v2: - Fix pre-existing kerneldoc typo while moving the fields (Lucas) v3: - Add missing '@' in kerneldoc. (Rodrigo) Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-49-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:43 -05:00
Matt Roper	c31019bca8	drm/xe: Clarify size of MMIO region [ Upstream commit `d4aff99aef` ] xe_mmio currently has a size parameter that is assigned but never used anywhere. The current values assigned appear to be the size of the BAR region assigned for the tile (both for registers and other purposes such as the GGTT). Since the current field isn't being used for anything, change the assignments to 4MB (the size of the register region on all current platform) and rename the field to 'regs_size' to more clearly describe what it represents. We can use this value in later patches to help ensure no register accesses accidentally go past the end of the desired register space (which might not be caught easily if they still fall within the iomap). v2: - s/regs_length/regs_size/ (Lucas) - Clarify kerneldoc description (Lucas) Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-48-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:43 -05:00
Matt Roper	6c5894d7ac	drm/xe: Create dedicated xe_mmio structure [ Upstream commit `34953ee349` ] Pull the 'mmio' substructure from xe_tile out into a dedicated type. Future patches will expand this structure and then eventually move MMIO read/write operations over to using this type. v2: - Fix kerneldoc of 'size' field. The rename/refocusing of this field got moved to the next patch of the series. (Lucas) - Correct commit message; it's the tile, not the device, mmio that's been pulled out to a separate type. (Michal) Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-47-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:43 -05:00
Matt Roper	4d4e940bc0	drm/xe: Move forcewake to 'gt.pm' substructure [ Upstream commit `998fde0647` ] Forcewake is a general GT power management concept that isn't specific to MMIO register access. Move the forcewake information for a GT out of the 'mmio' substruct and into a 'pm' substruct. Also use the gt_to_fw() helper in a few more places where it was being open-coded. v2: - Kerneldoc tweaks. (Lucas) Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-46-matthew.d.roper@intel.com Stable-dep-of: `4a9b4e1fa5` ("drm/xe/mmio: Avoid double-adjust in 64-bit reads") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:20:43 -05:00
Shuicheng Lin	be74863074	drm/xe: Unregister drm device on probe error [ Upstream commit `96c2c72b81` ] Call drm_dev_unregister() when xe_device_probe() fails after successful drm_dev_register(). This ensures the DRM device is promptly unregistered before returning an error, avoiding leaving it registered on the failure path. Otherwise, there is warn message if xe_device_probe() is called again: " [ 207.322365] [drm:drm_minor_register] [ 207.322381] debugfs: '128' already exists in 'dri' [ 207.322432] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/renderD128' [ 207.322435] CPU: 5 UID: 0 PID: 10261 Comm: modprobe Tainted: G B W 6.19.0-rc2-lgci-xe-kernel+ #223 PREEMPT(voluntary) [ 207.322439] Tainted: [B]=BAD_PAGE, [W]=WARN [ 207.322440] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023 [ 207.322441] Call Trace: [ 207.322442] <TASK> [ 207.322443] dump_stack_lvl+0xa0/0xc0 [ 207.322446] dump_stack+0x10/0x20 [ 207.322448] sysfs_warn_dup+0xd5/0x110 [ 207.322451] sysfs_create_dir_ns+0x1f6/0x280 [ 207.322453] ? __pfx_sysfs_create_dir_ns+0x10/0x10 [ 207.322455] ? lock_acquire+0x1a4/0x2e0 [ 207.322458] ? __kasan_check_read+0x11/0x20 [ 207.322461] kobject_add_internal+0x28d/0x8e0 [ 207.322464] kobject_add+0x11f/0x1f0 [ 207.322465] ? lock_acquire+0x1a4/0x2e0 [ 207.322467] ? __pfx_kobject_add+0x10/0x10 [ 207.322469] ? __kasan_check_write+0x14/0x20 [ 207.322471] ? kobject_put+0x62/0x4a0 [ 207.322473] ? get_device_parent.isra.0+0x1bb/0x4c0 [ 207.322475] ? kobject_put+0x62/0x4a0 [ 207.322477] device_add+0x2d7/0x1500 [ 207.322479] ? __pfx_device_add+0x10/0x10 [ 207.322481] ? drm_debugfs_add_file+0xfa/0x170 [ 207.322483] ? drm_debugfs_add_files+0x82/0xd0 [ 207.322485] ? drm_debugfs_add_files+0x82/0xd0 [ 207.322487] drm_minor_register+0x10a/0x2d0 [ 207.322489] drm_dev_register+0x143/0x860 [ 207.322491] ? xe_configfs_get_psmi_enabled+0x12/0x90 [xe] [ 207.322667] xe_device_probe+0x185b/0x2c40 [xe] [ 207.322812] ? __pfx___drm_dev_dbg+0x10/0x10 [ 207.322815] ? add_dr+0x180/0x220 [ 207.322818] ? __pfx___drmm_mutex_release+0x10/0x10 [ 207.322821] ? __pfx_xe_device_probe+0x10/0x10 [xe] [ 207.322966] ? xe_pm_init_early+0x33a/0x410 [xe] [ 207.323136] xe_pci_probe+0x936/0x1250 [xe] [ 207.323298] ? lock_acquire+0x1a4/0x2e0 [ 207.323302] ? __pfx_xe_pci_probe+0x10/0x10 [xe] [ 207.323464] local_pci_probe+0xe6/0x1a0 [ 207.323468] pci_device_probe+0x523/0x840 [ 207.323470] ? __pfx_pci_device_probe+0x10/0x10 [ 207.323473] ? sysfs_do_create_link_sd.isra.0+0x8c/0x110 [ 207.323476] ? sysfs_create_link+0x48/0xc0 [ 207.323479] really_probe+0x1fd/0x8a0 ... " Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patch.msgid.link/20260109211041.2446012-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 60bfb8baf8f0d5b0d521744dfd01c880ce1a23f3) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-03-04 07:19:56 -05:00
Karthik Poosa	c0de1cc6a6	drm/xe/pm: Disable D3Cold for BMG only on specific platforms [ Upstream commit `bb36170d95` ] Restrict D3Cold disablement for BMG to unsupported NUC platforms, instead of disabling it on all platforms. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: `3e331a6715` ("drm/xe/pm: Temporarily disable D3Cold on BMG") Link: https://patch.msgid.link/20260123173238.1642383-1-karthik.poosa@intel.com Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 39125eaf8863ab09d70c4b493f58639b08d5a897) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-02-11 13:40:27 +01:00
Rodrigo Vivi	c8a5ec95c9	drm/xe/pm: Also avoid missing outer rpm warning on system suspend [ Upstream commit `f2eedadf19` ] Fix the false-positive "Missing outer runtime PM protection" warning triggered by release_async_domains() -> intel_runtime_pm_get_noresume() -> xe_pm_runtime_get_noresume() during system suspend. xe_pm_runtime_get_noresume() is supposed to warn if the device is not in the runtime resumed state, using xe_pm_runtime_get_if_in_use() for this. However the latter function will fail if called during runtime or system suspend/resume, regardless of whether the device is runtime resumed or not. Based on the above suppress the warning during system suspend/resume, similarly to how this is done during runtime suspend/resume. Suggested-by: Imre Deak <imre.deak@intel.com> Reviewed-by: Imre Deak <imre.deak@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241217230547.1667561-1-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Stable-dep-of: `bb36170d95` ("drm/xe/pm: Disable D3Cold for BMG only on specific platforms") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-02-11 13:40:27 +01:00
Shuicheng Lin	422f646b4a	drm/xe/query: Fix topology query pointer advance [ Upstream commit `7ee9b3e091` ] The topology query helper advanced the user pointer by the size of the pointer, not the size of the structure. This can misalign the output blob and corrupt the following mask. Fix the increment to use sizeof(*topo). There is no issue currently, as sizeof(*topo) happens to be equal to sizeof(topo) on 64-bit systems (both evaluate to 8 bytes). Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260130043907.465128-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit c2a6859138e7f73ad904be17dd7d1da6cc7f06b3) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-02-11 13:40:27 +01:00
Xin Wang	32dc49f49e	drm/xe: Ensure GT is in C0 during resumes commit `95d0883ac8` upstream. This patch ensures the gt will be awake for the entire duration of the resume sequences until GuCRC takes over and GT-C6 gets re-enabled. Before suspending GT-C6 is kept enabled, but upon resume, GuCRC is not yet alive to properly control the exits and some cases of instability and corruption related to GT-C6 can be observed. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037 Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037 Link: https://lore.kernel.org/r/20250827000633.1369890-3-x.wang@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-17 16:31:18 +01:00
Xin Wang	e724d0261b	drm/xe: make xe_gt_idle_disable_c6() handle the forcewake internally commit `1313351e71` upstream. Move forcewake_get() into xe_gt_idle_enable_c6() to streamline the code and make it easier to use. Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250827000633.1369890-2-x.wang@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-17 16:31:18 +01:00
Thomas Hellström	700cd81dc5	drm/xe: Drop preempt-fences when destroying imported dma-bufs. commit `fe3ccd2413` upstream. When imported dma-bufs are destroyed, TTM is not fully individualizing the dma-resv, but it is copying the fences that need to be waited for before declaring idle. So in the case where the bo->resv != bo->_resv we can still drop the preempt-fences, but make sure we do that on bo->_resv which contains the fence-pointer copy. In the case where the copying fails, bo->_resv will typically not contain any fences pointers at all, so there will be nothing to drop. In that case, TTM would have ensured all fences that would have been copied are signaled, including any remaining preempt fences. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Fixes: `fa0af721bd` ("drm/ttm: test private resv obj on release/destroy") Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Tested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251217093441.5073-1-thomas.hellstrom@linux.intel.com (cherry picked from commit `425fe550fb`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-08 10:14:55 +01:00
Matthew Brost	dd3278ebfc	drm/xe: Use usleep_range for accurate long-running workload timeslicing commit `80f9c601d9` upstream. msleep is not very accurate in terms of how long it actually sleeps, whereas usleep_range is precise. Replace the timeslice sleep for long-running workloads with the more accurate usleep_range to avoid jitter if the sleep period is less than 20ms. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20251212182847.1683222-3-matthew.brost@intel.com (cherry picked from commit `ca415c4d4c`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-08 10:14:55 +01:00
Matthew Brost	d420cea519	drm/xe: Adjust long-running workload timeslices to reasonable values commit `6f0f404bd2` upstream. A 10ms timeslice for long-running workloads is far too long and causes significant jitter in benchmarks when the system is shared. Adjust the value to 5ms for preempt-fencing VMs, as the resume step there is quite costly as memory is moved around, and set it to zero for pagefault VMs, since switching back to pagefault mode after dma-fence mode is relatively fast. Also change min_run_period_ms to 'unsigned int' type rather than 's64' as only positive values make sense. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20251212182847.1683222-2-matthew.brost@intel.com (cherry picked from commit `33a5abd9a6`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-08 10:14:54 +01:00
Ashutosh Dixit	641797734d	drm/xe/oa: Disallow 0 OA property values commit `3595114bc3` upstream. An OA property value of 0 is invalid and will cause a NPD. Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6452 Fixes: `cc4e6994d5` ("drm/xe/oa: Move functions up so they can be reused for config ioctl") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Harish Chegondi <harish.chegondi@intel.com> Link: https://patch.msgid.link/20251212061850.1565459-3-ashutosh.dixit@intel.com (cherry picked from commit `7a100e6ddc`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-08 10:14:54 +01:00
Thomas Hellström	4f26159adc	drm/xe/bo: Don't include the CCS metadata in the dma-buf sg-table commit `449bcd5d45` upstream. Some Xe bos are allocated with extra backing-store for the CCS metadata. It's never been the intention to share the CCS metadata when exporting such bos as dma-buf. Don't include it in the dma-buf sg-table. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251209204920.224374-1-thomas.hellstrom@linux.intel.com (cherry picked from commit `a4ebfb9d95`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-08 10:14:54 +01:00
Sanjay Yadav	c6d30b65b7	drm/xe/oa: Fix potential UAF in xe_oa_add_config_ioctl() commit `dcb1719319` upstream. In xe_oa_add_config_ioctl(), we accessed oa_config->id after dropping metrics_lock. Since this lock protects the lifetime of oa_config, an attacker could guess the id and call xe_oa_remove_config_ioctl() with perfect timing, freeing oa_config before we dereference it, leading to a potential use-after-free. Fix this by caching the id in a local variable while holding the lock. v2: (Matt A) - Dropped mutex_unlock(&oa->metrics_lock) ordering change from xe_oa_remove_config_ioctl() Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6614 Fixes: `cdf02fe1a9` ("drm/xe/oa/uapi: Add/remove OA config perf ops") Cc: <stable@vger.kernel.org> # v6.11+ Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251118114859.3379952-2-sanjay.kumar.yadav@intel.com (cherry picked from commit `28aeaed130`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-08 10:14:53 +01:00
Shuicheng Lin	b963636331	drm/xe/oa: Limit num_syncs to prevent oversized allocations [ Upstream commit `f8dd66bfb4` ] The OA open parameters did not validate num_syncs, allowing userspace to pass arbitrarily large values, potentially leading to excessive allocations. Add check to ensure that num_syncs does not exceed DRM_XE_MAX_SYNCS, returning -EINVAL when the limit is violated. v2: use XE_IOCTL_DBG() and drop duplicated check. (Ashutosh) Fixes: `c8507a25ce` ("drm/xe/oa/uapi: Define and parse OA sync properties") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251205234715.2476561-6-shuicheng.lin@intel.com (cherry picked from commit `e057b2d2b8`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-01-08 10:14:05 +01:00
Shuicheng Lin	e281d1fd69	drm/xe: Limit num_syncs to prevent oversized allocations [ Upstream commit `8e46130400` ] The exec and vm_bind ioctl allow userspace to specify an arbitrary num_syncs value. Without bounds checking, a very large num_syncs can force an excessively large allocation, leading to kernel warnings from the page allocator as below. Introduce DRM_XE_MAX_SYNCS (set to 1024) and reject any request exceeding this limit. " ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1217 at mm/page_alloc.c:5124 __alloc_frozen_pages_noprof+0x2f8/0x2180 mm/page_alloc.c:5124 ... Call Trace: <TASK> alloc_pages_mpol+0xe4/0x330 mm/mempolicy.c:2416 ___kmalloc_large_node+0xd8/0x110 mm/slub.c:4317 __kmalloc_large_node_noprof+0x18/0xe0 mm/slub.c:4348 __do_kmalloc_node mm/slub.c:4364 [inline] __kmalloc_noprof+0x3d4/0x4b0 mm/slub.c:4388 kmalloc_noprof include/linux/slab.h:909 [inline] kmalloc_array_noprof include/linux/slab.h:948 [inline] xe_exec_ioctl+0xa47/0x1e70 drivers/gpu/drm/xe/xe_exec.c:158 drm_ioctl_kernel+0x1f1/0x3e0 drivers/gpu/drm/drm_ioctl.c:797 drm_ioctl+0x5e7/0xc50 drivers/gpu/drm/drm_ioctl.c:894 xe_drm_ioctl+0x10b/0x170 drivers/gpu/drm/xe/xe_device.c:224 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:598 [inline] __se_sys_ioctl fs/ioctl.c:584 [inline] __x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:584 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xbb/0x380 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... " v2: Add "Reported-by" and Cc stable kernels. v3: Change XE_MAX_SYNCS from 64 to 1024. (Matt & Ashutosh) v4: s/XE_MAX_SYNCS/DRM_XE_MAX_SYNCS/ (Matt) v5: Do the check at the top of the exec func. (Matt) Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reported-by: Koen Koning <koen.koning@intel.com> Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6450 Cc: <stable@vger.kernel.org> # v6.12+ Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: Carl Zhang <carl.zhang@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Cc: Ivan Briano <ivan.briano@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251205234715.2476561-5-shuicheng.lin@intel.com (cherry picked from commit `b07bac9bd7`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Stable-dep-of: `f8dd66bfb4` ("drm/xe/oa: Limit num_syncs to prevent oversized allocations") Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-01-08 10:14:05 +01:00
Jan Maslak	ca29fc28fb	drm/xe: Restore engine registers before restarting schedulers after GT reset [ Upstream commit `eed5b815fa` ] During GT reset recovery in do_gt_restart(), xe_uc_start() was called before xe_reg_sr_apply_mmio() restored engine-specific registers. This created a race window where the scheduler could run jobs before hardware state was fully restored. This caused failures in eudebug tests (xe_exec_sip_eudebug@breakpoint- waitsip-*) where TD_CTL register (containing TD_CTL_GLOBAL_DEBUG_ENABLE) wasn't restored before jobs started executing. Breakpoints would fail to trigger SIP entry because the debug enable bit wasn't set yet. Fix by moving xe_uc_start() after all MMIO register restoration, including engine registers and CCS mode configuration, ensuring all hardware state is fully restored before any jobs can be scheduled. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Jan Maslak <jan.maslak@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251210145618.169625-2-jan.maslak@intel.com (cherry picked from commit `825aed0328`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-01-08 10:14:04 +01:00
Junxiao Chang	d0326fd9df	drm/me/gsc: mei interrupt top half should be in irq disabled context [ Upstream commit `17445af7dc` ] MEI GSC interrupt comes from i915 or xe driver. It has top half and bottom half. Top half is called from i915/xe interrupt handler. It should be in irq disabled context. With RT kernel(PREEMPT_RT enabled), by default IRQ handler is in threaded IRQ. MEI GSC top half might be in threaded IRQ context. generic_handle_irq_safe API could be called from either IRQ or process context, it disables local IRQ then calls MEI GSC interrupt top half. This change fixes B580 GPU boot issue with RT enabled. Fixes: `e02cea83d3` ("drm/xe/gsc: add Battlemage support") Tested-by: Baoli Zhang <baoli.zhang@intel.com> Signed-off-by: Junxiao Chang <junxiao.chang@intel.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251107033152.834960-1-junxiao.chang@intel.com Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> (cherry picked from commit `3efadf0287`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-01-08 10:14:04 +01:00
Harish Chegondi	15f4066889	drm/xe: Fix conversion from clock ticks to milliseconds [ Upstream commit `7276878b06` ] When tick counts are large and multiplication by MSEC_PER_SEC is larger than 64 bits, the conversion from clock ticks to milliseconds can go bad. Use mul_u64_u32_div() instead. Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Suggested-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Fixes: `49cc215aad` ("drm/xe: Add xe_gt_clock_interval_to_ms helper") Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/1562f1b62d5be3fbaee100f09107f3cc49e40dd1.1763408584.git.harish.chegondi@intel.com (cherry picked from commit `96b93ac214`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-12-07 06:24:55 +09:00
Shuicheng Lin	23ba534d73	drm/xe: Prevent BIT() overflow when handling invalid prefetch region [ Upstream commit `d52dea485c` ] If user provides a large value (such as 0x80) for parameter prefetch_mem_region_instance in vm_bind ioctl, it will cause BIT(prefetch_region) overflow as below: " ------------[ cut here ]------------ UBSAN: shift-out-of-bounds in drivers/gpu/drm/xe/xe_vm.c:3414:7 shift exponent 128 is too large for 64-bit type 'long unsigned int' CPU: 8 UID: 0 PID: 53120 Comm: xe_exec_system_ Tainted: G W 6.18.0-rc1-lgci-xe-kernel+ #200 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023 Call Trace: <TASK> dump_stack_lvl+0xa0/0xc0 dump_stack+0x10/0x20 ubsan_epilogue+0x9/0x40 __ubsan_handle_shift_out_of_bounds+0x10e/0x170 ? mutex_unlock+0x12/0x20 xe_vm_bind_ioctl.cold+0x20/0x3c [xe] ... " Fix it by validating prefetch_region before the BIT() usage. v2: Add Closes and Cc stable kernels. (Matt) Reported-by: Koen Koning <koen.koning@intel.com> Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com> Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6478 Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251112181005.2120521-2-shuicheng.lin@intel.com (cherry picked from commit `8f565bdd14`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `d52dea485c`) Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-12-01 11:43:37 +01:00
Jouni Högander	008d3b0f09	drm/xe: Do clean shutdown also when using flr [ Upstream commit `b11a020d91` ] Currently Xe driver is triggering flr without any clean-up on shutdown. This is causing random warnings from pending related works as the underlying hardware is reset in the middle of their execution. Fix this by performing clean shutdown also when using flr. Fixes: `501d799a47` ("drm/xe: Wire up device shutdown handler") Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Jouni Högander <jouni.hogander@intel.com> Reviewed-by: Maarten Lankhorst <dev@lankhorst.se> Link: https://patch.msgid.link/20251031122312.1836534-1-jouni.hogander@intel.com Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> (cherry picked from commit `a4ff26b7c8`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-24 10:35:46 +01:00
Tejas Upadhyay	006a41c935	drm/xe: Move declarations under conditional branch [ Upstream commit `9cd27eec87` ] The xe_device_shutdown() function was needing a few declarations that were only required under a specific condition. This change moves those declarations to be within that conditional branch to avoid unnecessary declarations. Reviewed-by: Nitin Gote <nitin.r.gote@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20251007100208.1407021-1-tejas.upadhyay@intel.com Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> (cherry picked from commit `15b3036045`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Stable-dep-of: `b11a020d91` ("drm/xe: Do clean shutdown also when using flr") Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-24 10:35:46 +01:00
Balasubramani Vivekanandan	35959ab7d1	drm/xe/guc: Synchronize Dead CT worker with unbind [ Upstream commit `95af8f4fdc` ] Cancel and wait for any Dead CT worker to complete before continuing with device unbinding. Otherwise the worker will end up using resources freed by the unbind operation. Cc: Zhanjun Dong <zhanjun.dong@intel.com> Fixes: `d2c5a5a926` ("drm/xe/guc: Dead CT helper") Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20251103123144.3231829-6-balasubramani.vivekanandan@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `4926713391`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-24 10:35:45 +01:00
John Harrison	91630b700f	drm/xe/guc: Return an error code if the GuC load fails [ Upstream commit `3b09b11805` ] Due to multiple explosion issues in the early days of the Xe driver, the GuC load was hacked to never return a failure. That prevented kernel panics and such initially, but now all it achieves is creating more confusing errors when the driver tries to submit commands to a GuC it already knows is not there. So fix that up. As a stop-gap and to help with debug of load failures due to invalid GuC init params, a wedge call had been added to the inner GuC load function. The reason being that it leaves the GuC log accessible via debugfs. However, for an end user, simply aborting the module load is much cleaner than wedging and trying to continue. The wedge blocks user submissions but it seems that various bits of the driver itself still try to submit to a dead GuC and lots of subsequent errors occur. And with regards to developers debugging why their particular code change is being rejected by the GuC, it is trivial to either add the wedge back in and hack the return code to zero again or to just do a GuC log dump to dmesg. v2: Add support for error injection testing and drop the now redundant wedge call. CC: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://lore.kernel.org/r/20250909224132.536320-1-John.C.Harrison@Intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:34:25 -05:00
Michal Wajdeczko	8831d3a5d8	drm/xe/guc: Set upper limit of H2G retries over CTB [ Upstream commit `2506af5f81` ] The GuC communication protocol allows GuC to send NO_RESPONSE_RETRY reply message to indicate that due to some interim condition it can not handle incoming H2G request and the host shall resend it. But in some cases, due to errors, this unsatisfied condition might be final and this could lead to endless retries as it was recently seen on the CI: [drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT) [drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT) [drm] GT0: PF: VF1 FLR failed! [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0 To avoid such dangerous loops allow only limited number of retries (for now 50) and add some delays (n * 5ms) to slow down the rate of resending this repeated request. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:34:19 -05:00
Zhanjun Dong	4a988c672b	drm/xe/guc: Increase GuC crash dump buffer size [ Upstream commit `ad83b1da5b` ] Some platforms already have a maximum dump size of 12KB; to avoid truncating data, increase the GuC crash dump buffer size to 16KB. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250829160427.1245732-1-zhanjun.dong@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:34:18 -05:00
Maarten Lankhorst	99428bd612	drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test. [ Upstream commit `1cda3c755b` ] I saw an oops in xe_gem_fault when running the xe-fast-feedback testlist against the realtime kernel without debug options enabled. The panic happens after core_hotunplug unbind-rebind finishes. Presumably what happens is that a process mmaps, unlocks because of the FAULT_FLAG_RETRY_NOWAIT logic, has no process memory left, causing ttm_bo_vm_dummy_page() to return VM_FAULT_NOPAGE, since there was nothing left to populate, and then oopses in "mem_type_is_vram(tbo->resource->mem_type)" because tbo->resource is NULL. It's convoluted, but fits the data and explains the oops after the test exits. Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250715152057.23254-2-dev@lankhorst.se Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:34:09 -05:00
John Harrison	efffbbbe80	drm/xe/guc: Add more GuC load error status codes [ Upstream commit `45fbb51050` ] The GuC load process will abort if certain status codes (which are indicative of a fatal error) are reported. Otherwise, it keeps waiting until the 'success' code is returned. New error codes have been added in recent GuC releases, so add support for aborting on those as well. v2: Shuffle HWCONFIG_START to the front of the switch to keep the ordering as per the enum define for clarity (review feedback by Jonathan). Also add a description for the basic 'invalid init data' code which was missing. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250726024337.4056272-1-John.C.Harrison@Intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:34:09 -05:00
Matthew Brost	c8788295ce	drm/xe: Do not wake device during a GT reset commit `b3fbda1a63` upstream. Waking the device during a GT reset can lead to unintended memory allocation, which is not allowed since GT resets occur in the reclaim path. Prevent this by holding a PM reference while a reset is in flight. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20251022005538.828980-3-matthew.brost@intel.com (cherry picked from commit `480b358e7d`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-11-13 15:34:00 -05:00
Shuicheng Lin	2c6e5904c5	drm/xe/guc: Check GuC running state before deregistering exec queue commit `9f64b3cd05` upstream. In normal operation, a registered exec queue is disabled and deregistered through the GuC, and freed only after the GuC confirms completion. However, if the driver is forced to unbind while the exec queue is still running, the user may call exec_destroy() after the GuC has already been stopped and CT communication disabled. In this case, the driver cannot receive a response from the GuC, preventing proper cleanup of exec queue resources. Fix this by directly releasing the resources when GuC is not running. Here is the failure dmesg log: " [ 468.089581] ---[ end trace 0000000000000000 ]--- [ 468.089608] pci 0000:03:00.0: [drm] ERROR GT0: GUC ID manager unclean (1/65535) [ 468.090558] pci 0000:03:00.0: [drm] GT0: total 65535 [ 468.090562] pci 0000:03:00.0: [drm] GT0: used 1 [ 468.090564] pci 0000:03:00.0: [drm] GT0: range 1..1 (1) [ 468.092716] ------------[ cut here ]------------ [ 468.092719] WARNING: CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298 ttm_vram_mgr_fini+0xf8/0x130 [xe] " v2: use xe_uc_fw_is_running() instead of xe_guc_ct_enabled(). As CT may go down and come back during VF migration. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20251010172529.2967639-2-shuicheng.lin@intel.com (cherry picked from commit `9b42321a02`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-23 16:20:17 +02:00
Matthew Auld	ee49c1cf1b	drm/xe/uapi: loosen used tracking restriction commit `2d1684a077` upstream. Currently this is hidden behind perfmon_capable() since this is technically an info leak, given that this is a system wide metric. However the granularity reported here is always PAGE_SIZE aligned, which matches what the core kernel is already willing to expose to userspace if querying how many free RAM pages there are on the system, and that doesn't need any special privileges. In addition other drm drivers seem happy to expose this. The motivation here if with oneAPI where they want to use the system wide 'used' reporting here, so not the per-client fdinfo stats. This has also come up with some perf overlay applications wanting this information. Fixes: `1105ac15d2` ("drm/xe/uapi: restrict system wide accounting") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com (cherry picked from commit `4d0b035fd6`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:33:47 +02:00
Shuicheng Lin	c772e7cc90	drm/xe/hw_engine_group: Fix double write lock release in error path [ Upstream commit `08fdfd260e` ] In xe_hw_engine_group_get_mode(), a write lock is acquired before calling switch_mode(), which in turn invokes xe_hw_engine_group_suspend_faulting_lr_jobs(). On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(), the write lock is released there, and then again in xe_hw_engine_group_get_mode(), leading to a double release. Fix this by keeping both acquire and release operation in xe_hw_engine_group_get_mode(). Fixes: `770bd1d341` ("drm/xe/hw_engine_group: Ensure safe transition between execution modes") Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `662d98b8b3`) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-10-19 16:33:39 +02:00
Dan Carpenter	ea5cbcecd5	drm/xe: Fix a NULL vs IS_ERR() in xe_vm_add_compute_exec_queue() [ Upstream commit `cbc7f3b4f6` ] The xe_preempt_fence_create() function returns error pointers. It never returns NULL. Update the error checking to match. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/aJTMBdX97cof_009@stanley.mountain Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit `75cc23ffe5`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-09-25 11:13:48 +02:00
Shuicheng Lin	503de75db4	drm/xe/tile: Release kobject for the failure path [ Upstream commit `013e484dbd` ] Call kobject_put() for the failure path to release the kobject v2: remove extra newline. (Matt) Fixes: `e3d0839aa5` ("drm/xe/tile: Abort driver load for sysfs creation failure") Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://lore.kernel.org/r/20250819153950.2973344-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit `b98775bca9`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-09-25 11:13:48 +02:00
Thomas Hellström	7d07bc9c4f	drm/xe: Attempt to bring bos back to VRAM after eviction commit `5c87fee3c9` upstream. VRAM+TT bos that are evicted from VRAM to TT may remain in TT also after a revalidation following eviction or suspend. This manifests itself as applications becoming sluggish after buffer objects get evicted or after a resume from suspend or hibernation. If the bo supports placement in both VRAM and TT, and we are on DGFX, mark the TT placement as fallback. This means that it is tried only after VRAM + eviction. This flaw has probably been present since the xe module was upstreamed but use a Fixes: commit below where backporting is likely to be simple. For earlier versions we need to open- code the fallback algorithm in the driver. v2: - Remove check for dgfx. (Matthew Auld) - Update the xe_dma_buf kunit test for the new strategy (CI) - Allow dma-buf to pin in current placement (CI) - Make xe_bo_validate() for pinned bos a NOP. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995 Fixes: `a78a8da51b` ("drm/ttm: replace busy placement with flags v6") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v6.9+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250904160715.2613-2-thomas.hellstrom@linux.intel.com (cherry picked from commit `cb3d7b3b46`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-09-19 16:35:46 +02:00
Thomas Hellström	c8277d229c	drm/xe/vm: Clear the scratch_pt pointer on error commit `2b55ddf362` upstream. Avoid triggering a dereference of an error pointer on cleanup in xe_vm_free_scratch() by clearing any scratch_pt error pointer. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: `06951c2ee7` ("drm/xe: Use NULL PTEs as scratch PTEs") Cc: Brian Welty <brian.welty@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250821143045.106005-4-thomas.hellstrom@linux.intel.com (cherry picked from commit `358ee50ab5`) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-09-04 15:31:55 +02:00


2574 Commits