Commit Graph

2574 Commits

Author SHA1 Message Date
Shuicheng Lin
549b68ba83 drm/xe/reg_sr: Fix leak on xa_store failure
[ Upstream commit 3091723785 ]

Free the newly allocated entry when xa_store() fails to avoid a memory
leak on the error path.

v2: use goto fail_free. (Bala)

Fixes: e5283bd4df ("drm/xe/reg_sr: Remove register pool")
Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260204172810.1486719-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 6bc6fec71ac45f52db609af4e62bdb96b9f5fadb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13 17:20:43 +01:00
Matthew Brost
1a42ea28e0 drm/xe: Do not preempt fence signaling CS instructions
[ Upstream commit cdc8a1e11f ]

If a batch buffer is complete, it makes little sense to preempt the
fence signaling instructions in the ring, as the largest portion of the
work (the batch buffer) is already done and fence signaling consists of
only a few instructions. If these instructions are preempted, the GuC
would need to perform a context switch just to signal the fence, which
is costly and delays fence signaling. Avoid this scenario by disabling
preemption immediately after the BB start instruction and re-enabling it
after executing the fence signaling instructions.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patch.msgid.link/20260115004546.58060-1-matthew.brost@intel.com
(cherry picked from commit 2bcbf2dcde0c839a73af664a3c77d4e77d58a3eb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13 17:20:40 +01:00
Matthew Brost
5b27fcb5d1 drm/xe: Only toggle scheduling in TDR if GuC is running
[ Upstream commit dd1ef5e245 ]

If the firmware is not running during TDR (e.g., when the driver is
unloading), there's no need to toggle scheduling in the GuC. In such
cases, skip this step.

v4:
 - Bail on wait UC not running (Niranjana)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Link: https://patch.msgid.link/20260110012739.2888434-4-matthew.brost@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:21:02 -05:00
Matt Roper
bc78bfd287 drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138
[ Upstream commit bc6387a2e0 ]

The PSS_CHICKEN register has been part of the RCS engine's LRC since it
was first introduced in Xe_LP.  That means that any workarounds that
adjust its value (such as Wa_14019988906 and Wa_14019877138) need to be
implemented in the lrc_was[] table so that they become part of the
default LRC from which all subsequent LRCs are copied.  Although these
workarounds were implemented correctly on most platforms, they were
incorrectly placed on the engine_was[] table for Xe2_HPG.

Move the workarounds to the proper lrc_was[] table and switch the
'xe_rtp_match_first_render_or_compute' rule to specifically match the
RCS since that's the engine whose LRC manages the register.

Bspec: 65182
Fixes: 7f3ee7d880 ("drm/xe/xe2hpg: Add initial GT workarounds")
Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Link: https://patch.msgid.link/20260205220508.51905-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit e04c609eedf4d6748ac0bcada4de1275b034fed6)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:45 -05:00
Shekhar Chauhan
c89bde96e8 drm/xe/xe2_hpg: Add set of workarounds
[ Upstream commit a5d221924e ]

Add set of workarounds for xe2_hpg.

-v2: Fix xe2_hpg GMD version for some workarounds.
-v3: Removed extra Workaround (Matt Roper)

Signed-off-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Signed-off-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20250605190804.1287289-3-dnyaneshwar.bhadane@intel.com
Stable-dep-of: bc6387a2e0 ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:45 -05:00
Vinay Belgaumkar
9e18acc5aa drm/xe/ptl: Apply Wa_13011645652
[ Upstream commit dddc53806d ]

Extend Wa_13011645652 to PTL.

Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250116184659.384874-1-vinay.belgaumkar@intel.com
Stable-dep-of: bc6387a2e0 ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:45 -05:00
Shuicheng Lin
8f6848b2f6 drm/xe/mmio: Avoid double-adjust in 64-bit reads
[ Upstream commit 4a9b4e1fa5 ]

xe_mmio_read64_2x32() was adjusting register addresses and then
calling xe_mmio_read32(), which applies the adjustment again.
This may shift accesses twice if adj_offset < adj_limit. There is
no issue currently, as for media gt, adj_offset > adj_limit, so
the 2nd adjust will be a no-op. But it may not work in future.

To fix it, replace the adjusted-address comparison with a direct
sanity check that ensures the MMIO address adjustment cutoff never
falls within the 8-byte range of a 64-bit register. And let
xe_mmio_read32() handle address translation.

v2: rewrite the sanity check in a more natural way. (Matt)
v3: Add Fixes tag. (Jani)

Fixes: 07431945d8 ("drm/xe: Avoid 64-bit register reads")
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://patch.msgid.link/20260130165621.471408-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit a30f999681126b128a43137793ac84b6a5b7443f)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:44 -05:00
Matt Roper
26a40327c2 drm/xe: Switch MMIO interface to take xe_mmio instead of xe_gt
[ Upstream commit a84590c5ce ]

Since much of the MMIO register access done by the driver is to non-GT
registers, use of 'xe_gt' in these interfaces has been a long-standing
design flaw that's been hard to disentangle.

To avoid a flag day across the whole driver, munge the function names
and add temporary compatibility macros with the original function names
that can accept either the new xe_mmio or the old xe_gt structure as a
parameter.  This will allow us to slowly convert parts of the driver
over to the new interface independently.

Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-54-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:44 -05:00
Matt Roper
f5508f1e65 drm/xe: Adjust mmio code to pass VF substructure to SRIOV code
[ Upstream commit 6fb5d1a1d3 ]

Although we want to break the GT-centric nature of the MMIO code in the
general driver, the SRIOV handling still relies on data in a VF
substructure of the GT.  So add a GT backpointer, but name it
sriov_vf_gt to make it clear that it's only for this one specific
special case and will not be set or usable for anything else.

v2:
 - Store backpointer to the GT itself rather than the SRIOV-specific
   substructure.  (Michal)

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>  # v1
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-53-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:44 -05:00
Matt Roper
9472d36b3d drm/xe: Add xe_tile backpointer to xe_mmio
[ Upstream commit 1877c88fa9 ]

Once MMIO operations stop being (incorrectly) tied to a GT, we'll still
need a backpointer for feature checks, message logging, and tracepoints.
Use a tile backpointer since that may allow the most useful debugging
output, while also providing access to the xe_device.

v2:
 - Make backpointer an xe_tile instead of xe_device.  (Michal)

Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>  # v1
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-52-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:44 -05:00
Matt Roper
74bff541ec drm/xe: Switch mmio_ext to use 'struct xe_mmio'
[ Upstream commit 960a83799f ]

The mmio_ext stuff is completely unused right now, but it isn't
providing any functionality that couldn't be treated as a regular mmio
space.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-51-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:44 -05:00
Matt Roper
0b433e086b drm/xe: Populate GT's mmio iomap from tile during init
[ Upstream commit fa599b8c95 ]

Each GT should share the same register iomap as its parent tile.  Future
patches will switch to access the iomap through the GT's mmio substruct
rather than through the tile.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-50-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:44 -05:00
Matt Roper
85354b21a5 drm/xe: Move GSI offset adjustment fields into 'struct xe_mmio'
[ Upstream commit 9d383916a5 ]

By moving the GSI adjustment fields into 'struct xe_mmio' we can replace
the GT's MMIO substructure with another instance of xe_mmio.  At the
moment this means MMIO operations wind up pulling information from two
different places (the tile's xe_mmio for the iomap and the GT's xe_mmio
for the adjustment), but we'll address that in future patches.

The type headers change a bit with this change, meaning that various
files should be including xe_device_types.h instead of (or in addition
to) xe_gt_types.h.

v2:
 - Fix pre-existing kerneldoc typo while moving the fields (Lucas)
v3:
 - Add missing '@' in kerneldoc.  (Rodrigo)

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-49-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:43 -05:00
Matt Roper
c31019bca8 drm/xe: Clarify size of MMIO region
[ Upstream commit d4aff99aef ]

xe_mmio currently has a size parameter that is assigned but never used
anywhere.  The current values assigned appear to be the size of the BAR
region assigned for the tile (both for registers and other purposes such
as the GGTT).  Since the current field isn't being used for anything,
change the assignments to 4MB (the size of the register region on all
current platform) and rename the field to 'regs_size' to more clearly
describe what it represents.  We can use this value in later patches to
help ensure no register accesses accidentally go past the end of the
desired register space (which might not be caught easily if they still
fall within the iomap).

v2:
 - s/regs_length/regs_size/  (Lucas)
 - Clarify kerneldoc description (Lucas)

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-48-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:43 -05:00
Matt Roper
6c5894d7ac drm/xe: Create dedicated xe_mmio structure
[ Upstream commit 34953ee349 ]

Pull the 'mmio' substructure from xe_tile out into a dedicated type.
Future patches will expand this structure and then eventually move MMIO
read/write operations over to using this type.

v2:
 - Fix kerneldoc of 'size' field.  The rename/refocusing of this field
   got moved to the next patch of the series.  (Lucas)
 - Correct commit message; it's the tile, not the device, mmio that's
   been pulled out to a separate type.  (Michal)

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-47-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:43 -05:00
Matt Roper
4d4e940bc0 drm/xe: Move forcewake to 'gt.pm' substructure
[ Upstream commit 998fde0647 ]

Forcewake is a general GT power management concept that isn't specific
to MMIO register access.  Move the forcewake information for a GT out of
the 'mmio' substruct and into a 'pm' substruct.  Also use the gt_to_fw()
helper in a few more places where it was being open-coded.

v2:
 - Kerneldoc tweaks.  (Lucas)

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-46-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa5 ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:20:43 -05:00
Shuicheng Lin
be74863074 drm/xe: Unregister drm device on probe error
[ Upstream commit 96c2c72b81 ]

Call drm_dev_unregister() when xe_device_probe() fails after successful
drm_dev_register(). This ensures the DRM device is promptly unregistered
before returning an error, avoiding leaving it registered on the failure
path.
Otherwise, there is warn message if xe_device_probe() is called again:
"
[  207.322365] [drm:drm_minor_register]
[  207.322381] debugfs: '128' already exists in 'dri'
[  207.322432] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/renderD128'
[  207.322435] CPU: 5 UID: 0 PID: 10261 Comm: modprobe Tainted: G    B   W           6.19.0-rc2-lgci-xe-kernel+ #223 PREEMPT(voluntary)
[  207.322439] Tainted: [B]=BAD_PAGE, [W]=WARN
[  207.322440] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
[  207.322441] Call Trace:
[  207.322442]  <TASK>
[  207.322443]  dump_stack_lvl+0xa0/0xc0
[  207.322446]  dump_stack+0x10/0x20
[  207.322448]  sysfs_warn_dup+0xd5/0x110
[  207.322451]  sysfs_create_dir_ns+0x1f6/0x280
[  207.322453]  ? __pfx_sysfs_create_dir_ns+0x10/0x10
[  207.322455]  ? lock_acquire+0x1a4/0x2e0
[  207.322458]  ? __kasan_check_read+0x11/0x20
[  207.322461]  kobject_add_internal+0x28d/0x8e0
[  207.322464]  kobject_add+0x11f/0x1f0
[  207.322465]  ? lock_acquire+0x1a4/0x2e0
[  207.322467]  ? __pfx_kobject_add+0x10/0x10
[  207.322469]  ? __kasan_check_write+0x14/0x20
[  207.322471]  ? kobject_put+0x62/0x4a0
[  207.322473]  ? get_device_parent.isra.0+0x1bb/0x4c0
[  207.322475]  ? kobject_put+0x62/0x4a0
[  207.322477]  device_add+0x2d7/0x1500
[  207.322479]  ? __pfx_device_add+0x10/0x10
[  207.322481]  ? drm_debugfs_add_file+0xfa/0x170
[  207.322483]  ? drm_debugfs_add_files+0x82/0xd0
[  207.322485]  ? drm_debugfs_add_files+0x82/0xd0
[  207.322487]  drm_minor_register+0x10a/0x2d0
[  207.322489]  drm_dev_register+0x143/0x860
[  207.322491]  ? xe_configfs_get_psmi_enabled+0x12/0x90 [xe]
[  207.322667]  xe_device_probe+0x185b/0x2c40 [xe]
[  207.322812]  ? __pfx___drm_dev_dbg+0x10/0x10
[  207.322815]  ? add_dr+0x180/0x220
[  207.322818]  ? __pfx___drmm_mutex_release+0x10/0x10
[  207.322821]  ? __pfx_xe_device_probe+0x10/0x10 [xe]
[  207.322966]  ? xe_pm_init_early+0x33a/0x410 [xe]
[  207.323136]  xe_pci_probe+0x936/0x1250 [xe]
[  207.323298]  ? lock_acquire+0x1a4/0x2e0
[  207.323302]  ? __pfx_xe_pci_probe+0x10/0x10 [xe]
[  207.323464]  local_pci_probe+0xe6/0x1a0
[  207.323468]  pci_device_probe+0x523/0x840
[  207.323470]  ? __pfx_pci_device_probe+0x10/0x10
[  207.323473]  ? sysfs_do_create_link_sd.isra.0+0x8c/0x110
[  207.323476]  ? sysfs_create_link+0x48/0xc0
[  207.323479]  really_probe+0x1fd/0x8a0
...
"

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patch.msgid.link/20260109211041.2446012-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 60bfb8baf8f0d5b0d521744dfd01c880ce1a23f3)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04 07:19:56 -05:00
Karthik Poosa
c0de1cc6a6 drm/xe/pm: Disable D3Cold for BMG only on specific platforms
[ Upstream commit bb36170d95 ]

Restrict D3Cold disablement for BMG to unsupported NUC platforms,
instead of disabling it on all platforms.

Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: 3e331a6715 ("drm/xe/pm: Temporarily disable D3Cold on BMG")
Link: https://patch.msgid.link/20260123173238.1642383-1-karthik.poosa@intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 39125eaf8863ab09d70c4b493f58639b08d5a897)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11 13:40:27 +01:00
Rodrigo Vivi
c8a5ec95c9 drm/xe/pm: Also avoid missing outer rpm warning on system suspend
[ Upstream commit f2eedadf19 ]

Fix the false-positive "Missing outer runtime PM protection" warning
triggered by
release_async_domains() -> intel_runtime_pm_get_noresume() ->
xe_pm_runtime_get_noresume()
during system suspend.

xe_pm_runtime_get_noresume() is supposed to warn if the device is not in
the runtime resumed state, using xe_pm_runtime_get_if_in_use() for this.
However the latter function will fail if called during runtime or system
suspend/resume, regardless of whether the device is runtime resumed or
not.

Based on the above suppress the warning during system suspend/resume,
similarly to how this is done during runtime suspend/resume.

Suggested-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Imre Deak <imre.deak@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241217230547.1667561-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Stable-dep-of: bb36170d95 ("drm/xe/pm: Disable D3Cold for BMG only on specific platforms")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11 13:40:27 +01:00
Shuicheng Lin
422f646b4a drm/xe/query: Fix topology query pointer advance
[ Upstream commit 7ee9b3e091 ]

The topology query helper advanced the user pointer by the size
of the pointer, not the size of the structure. This can misalign
the output blob and corrupt the following mask. Fix the increment
to use sizeof(*topo).
There is no issue currently, as sizeof(*topo) happens to be equal
to sizeof(topo) on 64-bit systems (both evaluate to 8 bytes).

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260130043907.465128-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit c2a6859138e7f73ad904be17dd7d1da6cc7f06b3)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11 13:40:27 +01:00
Xin Wang
32dc49f49e drm/xe: Ensure GT is in C0 during resumes
commit 95d0883ac8 upstream.

This patch ensures the gt will be awake for the entire duration
of the resume sequences until GuCRC takes over and GT-C6 gets
re-enabled.

Before suspending GT-C6 is kept enabled, but upon resume, GuCRC
is not yet alive to properly control the exits and some cases of
instability and corruption related to GT-C6 can be observed.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037

Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037
Link: https://lore.kernel.org/r/20250827000633.1369890-3-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17 16:31:18 +01:00
Xin Wang
e724d0261b drm/xe: make xe_gt_idle_disable_c6() handle the forcewake internally
commit 1313351e71 upstream.

Move forcewake_get() into xe_gt_idle_enable_c6() to streamline the
code and make it easier to use.

Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250827000633.1369890-2-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17 16:31:18 +01:00
Thomas Hellström
700cd81dc5 drm/xe: Drop preempt-fences when destroying imported dma-bufs.
commit fe3ccd2413 upstream.

When imported dma-bufs are destroyed, TTM is not fully
individualizing the dma-resv, but it *is* copying the fences that
need to be waited for before declaring idle. So in the case where
the bo->resv != bo->_resv we can still drop the preempt-fences, but
make sure we do that on bo->_resv which contains the fence-pointer
copy.

In the case where the copying fails, bo->_resv will typically not
contain any fences pointers at all, so there will be nothing to
drop. In that case, TTM would have ensured all fences that would
have been copied are signaled, including any remaining preempt
fences.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Fixes: fa0af721bd ("drm/ttm: test private resv obj on release/destroy")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Tested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251217093441.5073-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 425fe550fb)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:55 +01:00
Matthew Brost
dd3278ebfc drm/xe: Use usleep_range for accurate long-running workload timeslicing
commit 80f9c601d9 upstream.

msleep is not very accurate in terms of how long it actually sleeps,
whereas usleep_range is precise. Replace the timeslice sleep for
long-running workloads with the more accurate usleep_range to avoid
jitter if the sleep period is less than 20ms.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20251212182847.1683222-3-matthew.brost@intel.com
(cherry picked from commit ca415c4d4c)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:55 +01:00
Matthew Brost
d420cea519 drm/xe: Adjust long-running workload timeslices to reasonable values
commit 6f0f404bd2 upstream.

A 10ms timeslice for long-running workloads is far too long and causes
significant jitter in benchmarks when the system is shared. Adjust the
value to 5ms for preempt-fencing VMs, as the resume step there is quite
costly as memory is moved around, and set it to zero for pagefault VMs,
since switching back to pagefault mode after dma-fence mode is
relatively fast.

Also change min_run_period_ms to 'unsiged int' type rather than 's64' as
only positive values make sense.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20251212182847.1683222-2-matthew.brost@intel.com
(cherry picked from commit 33a5abd9a6)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:54 +01:00
Ashutosh Dixit
641797734d drm/xe/oa: Disallow 0 OA property values
commit 3595114bc3 upstream.

An OA property value of 0 is invalid and will cause a NPD.

Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6452
Fixes: cc4e6994d5 ("drm/xe/oa: Move functions up so they can be reused for config ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Link: https://patch.msgid.link/20251212061850.1565459-3-ashutosh.dixit@intel.com
(cherry picked from commit 7a100e6ddc)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:54 +01:00
Thomas Hellström
4f26159adc drm/xe/bo: Don't include the CCS metadata in the dma-buf sg-table
commit 449bcd5d45 upstream.

Some Xe bos are allocated with extra backing-store for the CCS
metadata. It's never been the intention to share the CCS metadata
when exporting such bos as dma-buf. Don't include it in the
dma-buf sg-table.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20251209204920.224374-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit a4ebfb9d95)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:54 +01:00
Sanjay Yadav
c6d30b65b7 drm/xe/oa: Fix potential UAF in xe_oa_add_config_ioctl()
commit dcb1719319 upstream.

In xe_oa_add_config_ioctl(), we accessed oa_config->id after dropping
metrics_lock. Since this lock protects the lifetime of oa_config, an
attacker could guess the id and call xe_oa_remove_config_ioctl() with
perfect timing, freeing oa_config before we dereference it, leading to
a potential use-after-free.

Fix this by caching the id in a local variable while holding the lock.

v2: (Matt A)
- Dropped mutex_unlock(&oa->metrics_lock) ordering change from
  xe_oa_remove_config_ioctl()

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6614
Fixes: cdf02fe1a9 ("drm/xe/oa/uapi: Add/remove OA config perf ops")
Cc: <stable@vger.kernel.org> # v6.11+
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20251118114859.3379952-2-sanjay.kumar.yadav@intel.com
(cherry picked from commit 28aeaed130)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 10:14:53 +01:00
Shuicheng Lin
b963636331 drm/xe/oa: Limit num_syncs to prevent oversized allocations
[ Upstream commit f8dd66bfb4 ]

The OA open parameters did not validate num_syncs, allowing
userspace to pass arbitrarily large values, potentially
leading to excessive allocations.

Add check to ensure that num_syncs does not exceed DRM_XE_MAX_SYNCS,
returning -EINVAL when the limit is violated.

v2: use XE_IOCTL_DBG() and drop duplicated check. (Ashutosh)

Fixes: c8507a25ce ("drm/xe/oa/uapi: Define and parse OA sync properties")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251205234715.2476561-6-shuicheng.lin@intel.com
(cherry picked from commit e057b2d2b8)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08 10:14:05 +01:00
Shuicheng Lin
e281d1fd69 drm/xe: Limit num_syncs to prevent oversized allocations
[ Upstream commit 8e46130400 ]

The exec and vm_bind ioctl allow userspace to specify an arbitrary
num_syncs value. Without bounds checking, a very large num_syncs
can force an excessively large allocation, leading to kernel warnings
from the page allocator as below.

Introduce DRM_XE_MAX_SYNCS (set to 1024) and reject any request
exceeding this limit.

"
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1217 at mm/page_alloc.c:5124 __alloc_frozen_pages_noprof+0x2f8/0x2180 mm/page_alloc.c:5124
...
Call Trace:
 <TASK>
 alloc_pages_mpol+0xe4/0x330 mm/mempolicy.c:2416
 ___kmalloc_large_node+0xd8/0x110 mm/slub.c:4317
 __kmalloc_large_node_noprof+0x18/0xe0 mm/slub.c:4348
 __do_kmalloc_node mm/slub.c:4364 [inline]
 __kmalloc_noprof+0x3d4/0x4b0 mm/slub.c:4388
 kmalloc_noprof include/linux/slab.h:909 [inline]
 kmalloc_array_noprof include/linux/slab.h:948 [inline]
 xe_exec_ioctl+0xa47/0x1e70 drivers/gpu/drm/xe/xe_exec.c:158
 drm_ioctl_kernel+0x1f1/0x3e0 drivers/gpu/drm/drm_ioctl.c:797
 drm_ioctl+0x5e7/0xc50 drivers/gpu/drm/drm_ioctl.c:894
 xe_drm_ioctl+0x10b/0x170 drivers/gpu/drm/xe/xe_device.c:224
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:598 [inline]
 __se_sys_ioctl fs/ioctl.c:584 [inline]
 __x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:584
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xbb/0x380 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
"

v2: Add "Reported-by" and Cc stable kernels.
v3: Change XE_MAX_SYNCS from 64 to 1024. (Matt & Ashutosh)
v4: s/XE_MAX_SYNCS/DRM_XE_MAX_SYNCS/ (Matt)
v5: Do the check at the top of the exec func. (Matt)

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Reported-by: Koen Koning <koen.koning@intel.com>
Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6450
Cc: <stable@vger.kernel.org> # v6.12+
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Carl Zhang <carl.zhang@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Ivan Briano <ivan.briano@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251205234715.2476561-5-shuicheng.lin@intel.com
(cherry picked from commit b07bac9bd7)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Stable-dep-of: f8dd66bfb4 ("drm/xe/oa: Limit num_syncs to prevent oversized allocations")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08 10:14:05 +01:00
Jan Maslak
ca29fc28fb drm/xe: Restore engine registers before restarting schedulers after GT reset
[ Upstream commit eed5b815fa ]

During GT reset recovery in do_gt_restart(), xe_uc_start() was called
before xe_reg_sr_apply_mmio() restored engine-specific registers. This
created a race window where the scheduler could run jobs before hardware
state was fully restored.

This caused failures in eudebug tests (xe_exec_sip_eudebug@breakpoint-
waitsip-*) where TD_CTL register (containing TD_CTL_GLOBAL_DEBUG_ENABLE)
wasn't restored before jobs started executing. Breakpoints would fail to
trigger SIP entry because the debug enable bit wasn't set yet.

Fix by moving xe_uc_start() after all MMIO register restoration,
including engine registers and CCS mode configuration, ensuring all
hardware state is fully restored before any jobs can be scheduled.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Jan Maslak <jan.maslak@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251210145618.169625-2-jan.maslak@intel.com
(cherry picked from commit 825aed0328)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08 10:14:04 +01:00
Junxiao Chang
d0326fd9df drm/me/gsc: mei interrupt top half should be in irq disabled context
[ Upstream commit 17445af7dc ]

MEI GSC interrupt comes from i915 or xe driver. It has top half and
bottom half. Top half is called from i915/xe interrupt handler. It
should be in irq disabled context.

With RT kernel(PREEMPT_RT enabled), by default IRQ handler is in
threaded IRQ. MEI GSC top half might be in threaded IRQ context.
generic_handle_irq_safe API could be called from either IRQ or
process context, it disables local IRQ then calls MEI GSC interrupt
top half.

This change fixes B580 GPU boot issue with RT enabled.

Fixes: e02cea83d3 ("drm/xe/gsc: add Battlemage support")
Tested-by: Baoli Zhang <baoli.zhang@intel.com>
Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251107033152.834960-1-junxiao.chang@intel.com
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
(cherry picked from commit 3efadf0287)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08 10:14:04 +01:00
Harish Chegondi
15f4066889 drm/xe: Fix conversion from clock ticks to milliseconds
[ Upstream commit 7276878b06 ]

When tick counts are large and multiplication by MSEC_PER_SEC is larger
than 64 bits, the conversion from clock ticks to milliseconds can go bad.

Use mul_u64_u32_div() instead.

Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Suggested-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Fixes: 49cc215aad ("drm/xe: Add xe_gt_clock_interval_to_ms helper")
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/1562f1b62d5be3fbaee100f09107f3cc49e40dd1.1763408584.git.harish.chegondi@intel.com
(cherry picked from commit 96b93ac214)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-07 06:24:55 +09:00
Shuicheng Lin
23ba534d73 drm/xe: Prevent BIT() overflow when handling invalid prefetch region
[ Upstream commit d52dea485c ]

If user provides a large value (such as 0x80) for parameter
prefetch_mem_region_instance in vm_bind ioctl, it will cause
BIT(prefetch_region) overflow as below:
"
 ------------[ cut here ]------------
 UBSAN: shift-out-of-bounds in drivers/gpu/drm/xe/xe_vm.c:3414:7
 shift exponent 128 is too large for 64-bit type 'long unsigned int'
 CPU: 8 UID: 0 PID: 53120 Comm: xe_exec_system_ Tainted: G        W           6.18.0-rc1-lgci-xe-kernel+ #200 PREEMPT(voluntary)
 Tainted: [W]=WARN
 Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
 Call Trace:
  <TASK>
  dump_stack_lvl+0xa0/0xc0
  dump_stack+0x10/0x20
  ubsan_epilogue+0x9/0x40
  __ubsan_handle_shift_out_of_bounds+0x10e/0x170
  ? mutex_unlock+0x12/0x20
  xe_vm_bind_ioctl.cold+0x20/0x3c [xe]
 ...
"
Fix it by validating prefetch_region before the BIT() usage.

v2: Add Closes and Cc stable kernels. (Matt)

Reported-by: Koen Koning <koen.koning@intel.com>
Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6478
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20251112181005.2120521-2-shuicheng.lin@intel.com
(cherry picked from commit 8f565bdd14)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit d52dea485c)
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01 11:43:37 +01:00
Jouni Högander
008d3b0f09 drm/xe: Do clean shutdown also when using flr
[ Upstream commit b11a020d91 ]

Currently Xe driver is triggering flr without any clean-up on
shutdown. This is causing random warnings from pending related works as the
underlying hardware is reset in the middle of their execution.

Fix this by performing clean shutdown also when using flr.

Fixes: 501d799a47 ("drm/xe: Wire up device shutdown handler")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Maarten Lankhorst <dev@lankhorst.se>
Link: https://patch.msgid.link/20251031122312.1836534-1-jouni.hogander@intel.com
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
(cherry picked from commit a4ff26b7c8)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24 10:35:46 +01:00
Tejas Upadhyay
006a41c935 drm/xe: Move declarations under conditional branch
[ Upstream commit 9cd27eec87 ]

The xe_device_shutdown() function was needing a few declarations
that were only required under a specific condition. This change
moves those declarations to be within that conditional branch
to avoid unnecessary declarations.

Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20251007100208.1407021-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
(cherry picked from commit 15b3036045)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Stable-dep-of: b11a020d91 ("drm/xe: Do clean shutdown also when using flr")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24 10:35:46 +01:00
Balasubramani Vivekanandan
35959ab7d1 drm/xe/guc: Synchronize Dead CT worker with unbind
[ Upstream commit 95af8f4fdc ]

Cancel and wait for any Dead CT worker to complete before continuing
with device unbinding. Else the worker will end up using resources freed
by the undind operation.

Cc: Zhanjun Dong <zhanjun.dong@intel.com>
Fixes: d2c5a5a926 ("drm/xe/guc: Dead CT helper")
Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20251103123144.3231829-6-balasubramani.vivekanandan@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 4926713391)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24 10:35:45 +01:00
John Harrison
91630b700f drm/xe/guc: Return an error code if the GuC load fails
[ Upstream commit 3b09b11805 ]

Due to multiple explosion issues in the early days of the Xe driver,
the GuC load was hacked to never return a failure. That prevented
kernel panics and such initially, but now all it achieves is creating
more confusing errors when the driver tries to submit commands to a
GuC it already knows is not there. So fix that up.

As a stop-gap and to help with debug of load failures due to invalid
GuC init params, a wedge call had been added to the inner GuC load
function. The reason being that it leaves the GuC log accessible via
debugfs. However, for an end user, simply aborting the module load is
much cleaner than wedging and trying to continue. The wedge blocks
user submissions but it seems that various bits of the driver itself
still try to submit to a dead GuC and lots of subsequent errors occur.
And with regards to developers debugging why their particular code
change is being rejected by the GuC, it is trivial to either add the
wedge back in and hack the return code to zero again or to just do a
GuC log dump to dmesg.

v2: Add support for error injection testing and drop the now redundant
wedge call.

CC: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://lore.kernel.org/r/20250909224132.536320-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:25 -05:00
Michal Wajdeczko
8831d3a5d8 drm/xe/guc: Set upper limit of H2G retries over CTB
[ Upstream commit 2506af5f81 ]

The GuC communication protocol allows GuC to send NO_RESPONSE_RETRY
reply message to indicate that due to some interim condition it can
not handle incoming H2G request and the host shall resend it.

But in some cases, due to errors, this unsatisfied condition might
be final and this could lead to endless retries as it was recently
seen on the CI:

 [drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT)
 [drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT)
 [drm] GT0: PF: VF1 FLR failed!
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
 [drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0

To avoid such dangerous loops allow only limited number of retries
(for now 50) and add some delays (n * 5ms) to slow down the rate of
resending this repeated request.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:19 -05:00
Zhanjun Dong
4a988c672b drm/xe/guc: Increase GuC crash dump buffer size
[ Upstream commit ad83b1da5b ]

There are platforms already have a maximum dump size of 12KB, to avoid
data truncating, increase GuC crash dump buffer size to 16KB.

Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250829160427.1245732-1-zhanjun.dong@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:18 -05:00
Maarten Lankhorst
99428bd612 drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test.
[ Upstream commit 1cda3c755b ]

I saw an oops in xe_gem_fault when running the xe-fast-feedback
testlist against the realtime kernel without debug options enabled.

The panic happens after core_hotunplug unbind-rebind finishes.
Presumably what happens is that a process mmaps, unlocks because
of the FAULT_FLAG_RETRY_NOWAIT logic, has no process memory left,
causing ttm_bo_vm_dummy_page() to return VM_FAULT_NOPAGE, since
there was nothing left to populate, and then oopses in
"mem_type_is_vram(tbo->resource->mem_type)" because tbo->resource
is NULL.

It's convoluted, but fits the data and explains the oops after
the test exits.

Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250715152057.23254-2-dev@lankhorst.se
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:09 -05:00
John Harrison
efffbbbe80 drm/xe/guc: Add more GuC load error status codes
[ Upstream commit 45fbb51050 ]

The GuC load process will abort if certain status codes (which are
indicative of a fatal error) are reported. Otherwise, it keeps waiting
until the 'success' code is returned. New error codes have been added
in recent GuC releases, so add support for aborting on those as well.

v2: Shuffle HWCONFIG_START to the front of the switch to keep the
ordering as per the enum define for clarity (review feedback by
Jonathan). Also add a description for the basic 'invalid init data'
code which was missing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250726024337.4056272-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13 15:34:09 -05:00
Matthew Brost
c8788295ce drm/xe: Do not wake device during a GT reset
commit b3fbda1a63 upstream.

Waking the device during a GT reset can lead to unintended memory
allocation, which is not allowed since GT resets occur in the reclaim
path. Prevent this by holding a PM reference while a reset is in flight.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20251022005538.828980-3-matthew.brost@intel.com
(cherry picked from commit 480b358e7d)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13 15:34:00 -05:00
Shuicheng Lin
2c6e5904c5 drm/xe/guc: Check GuC running state before deregistering exec queue
commit 9f64b3cd05 upstream.

In normal operation, a registered exec queue is disabled and
deregistered through the GuC, and freed only after the GuC confirms
completion. However, if the driver is forced to unbind while the exec
queue is still running, the user may call exec_destroy() after the GuC
has already been stopped and CT communication disabled.

In this case, the driver cannot receive a response from the GuC,
preventing proper cleanup of exec queue resources. Fix this by directly
releasing the resources when GuC is not running.

Here is the failure dmesg log:
"
[  468.089581] ---[ end trace 0000000000000000 ]---
[  468.089608] pci 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
[  468.090558] pci 0000:03:00.0: [drm] GT0:     total 65535
[  468.090562] pci 0000:03:00.0: [drm] GT0:     used 1
[  468.090564] pci 0000:03:00.0: [drm] GT0:     range 1..1 (1)
[  468.092716] ------------[ cut here ]------------
[  468.092719] WARNING: CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298 ttm_vram_mgr_fini+0xf8/0x130 [xe]
"

v2: use xe_uc_fw_is_running() instead of xe_guc_ct_enabled().
    As CT may go down and come back during VF migration.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251010172529.2967639-2-shuicheng.lin@intel.com
(cherry picked from commit 9b42321a02)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23 16:20:17 +02:00
Matthew Auld
ee49c1cf1b drm/xe/uapi: loosen used tracking restriction
commit 2d1684a077 upstream.

Currently this is hidden behind perfmon_capable() since this is
technically an info leak, given that this is a system wide metric.
However the granularity reported here is always PAGE_SIZE aligned, which
matches what the core kernel is already willing to expose to userspace
if querying how many free RAM pages there are on the system, and that
doesn't need any special privileges. In addition other drm drivers seem
happy to expose this.

The motivation here if with oneAPI where they want to use the system
wide 'used' reporting here, so not the per-client fdinfo stats. This has
also come up with some perf overlay applications wanting this
information.

Fixes: 1105ac15d2 ("drm/xe/uapi: restrict system wide accounting")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com
(cherry picked from commit 4d0b035fd6)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19 16:33:47 +02:00
Shuicheng Lin
c772e7cc90 drm/xe/hw_engine_group: Fix double write lock release in error path
[ Upstream commit 08fdfd260e ]

In xe_hw_engine_group_get_mode(), a write lock is acquired before
calling switch_mode(), which in turn invokes
xe_hw_engine_group_suspend_faulting_lr_jobs().

On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(),
the write lock is released there, and then again in
xe_hw_engine_group_get_mode(), leading to a double release.

Fix this by keeping both acquire and release operation in
xe_hw_engine_group_get_mode().

Fixes: 770bd1d341 ("drm/xe/hw_engine_group: Ensure safe transition between execution modes")
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 662d98b8b3)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-19 16:33:39 +02:00
Dan Carpenter
ea5cbcecd5 drm/xe: Fix a NULL vs IS_ERR() in xe_vm_add_compute_exec_queue()
[ Upstream commit cbc7f3b4f6 ]

The xe_preempt_fence_create() function returns error pointers.  It
never returns NULL.  Update the error checking to match.

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/aJTMBdX97cof_009@stanley.mountain
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 75cc23ffe5)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-25 11:13:48 +02:00
Shuicheng Lin
503de75db4 drm/xe/tile: Release kobject for the failure path
[ Upstream commit 013e484dbd ]

Call kobject_put() for the failure path to release the kobject

v2: remove extra newline. (Matt)

Fixes: e3d0839aa5 ("drm/xe/tile: Abort driver load for sysfs creation failure")
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250819153950.2973344-2-shuicheng.lin@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit b98775bca9)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-25 11:13:48 +02:00
Thomas Hellström
7d07bc9c4f drm/xe: Attempt to bring bos back to VRAM after eviction
commit 5c87fee3c9 upstream.

VRAM+TT bos that are evicted from VRAM to TT may remain in
TT also after a revalidation following eviction or suspend.

This manifests itself as applications becoming sluggish
after buffer objects get evicted or after a resume from
suspend or hibernation.

If the bo supports placement in both VRAM and TT, and
we are on DGFX, mark the TT placement as fallback. This means
that it is tried only after VRAM + eviction.

This flaw has probably been present since the xe module was
upstreamed but use a Fixes: commit below where backporting is
likely to be simple. For earlier versions we need to open-
code the fallback algorithm in the driver.

v2:
- Remove check for dgfx. (Matthew Auld)
- Update the xe_dma_buf kunit test for the new strategy (CI)
- Allow dma-buf to pin in current placement (CI)
- Make xe_bo_validate() for pinned bos a NOP.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995
Fixes: a78a8da51b ("drm/ttm: replace busy placement with flags v6")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250904160715.2613-2-thomas.hellstrom@linux.intel.com
(cherry picked from commit cb3d7b3b46)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19 16:35:46 +02:00
Thomas Hellström
c8277d229c drm/xe/vm: Clear the scratch_pt pointer on error
commit 2b55ddf362 upstream.

Avoid triggering a dereference of an error pointer on cleanup in
xe_vm_free_scratch() by clearing any scratch_pt error pointer.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Fixes: 06951c2ee7 ("drm/xe: Use NULL PTEs as scratch PTEs")
Cc: Brian Welty <brian.welty@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250821143045.106005-4-thomas.hellstrom@linux.intel.com
(cherry picked from commit 358ee50ab5)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04 15:31:55 +02:00