Forum: >>> Magnum BBS <<<

PREEMPT_RT vs i915

From Ben Hutchings@21:1/5 to All on Tue Jul 8 21:40:01 2025

Hi all,

Debian currently provides non-default kernel packages with the
PREEMPT_RT patchset applied and enabled. However, for the Debian 14
"forky" release the plan is to use only the upstream RT support.

One result of this is that the i915 driver will not be included in our
RT kernel package on amd64 because the upstream version lacks the
patches to make it compatible with PREEMPT_RT. This was not a surprise
to us, but may disappoint some of our users (for example see <https://bugs.debian.org/1108324>).

I see that Sebastian submitted the i915 fixes upstream in October 2024.
If I understand the explanation in <https://lore.kernel.org/intel-gfx/[email protected]/> rightly,
much of these changes are unsafe because i915 has its own hard timing requirement for reprogramming multiple display controller registers
within a single frame. Is that still the sticking point?

It seems like the critical uncore lock is currently held in a lot of
places and potentially for a long time. Would it be practical to split
this lock into:

1. raw spinlock protecting only state needed for the atomic (within-one-
frame) update
2. regular spinlock protecting everything in uncore

or is almost all the uncore state potentially used during an atomic
update?

Would it help to offload the atomic updates to a kthread that runs with
RT priority but still with hard interrupts enabled?

Would it make things easier if setting CONFIG_PREEMPT_RT=y limited i915
to not run on some older hardware?

Ben.

--
Ben Hutchings
All the simple programs have been written, and all the good names taken

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEErCspvTSmr92z9o8157/I7JWGEQkFAmhtcvkACgkQ57/I7JWG EQnhMxAAm3p4Mf9jNXbIN0mu2OooFE5xYUjVuzj887FjqEydecdn3YXVdZnpkIVH jALbfFQLiGomejWH15jyQHig1c4/mwbSddrSJ0+Tl9Pkr6cdcvO2Yg1Kv6vVZS8b RrrMGJzsz8zN75pqptFDaDo8NWNFTcOcUfAn2PWZbpWfLtkX/w05nvZHTMIwILKL Gq5LiEEazGPO8T8BMcQYUGdPY2/3qNELtTWxHbmg8n4wrqcGm25mq8p13CRXwNbO XA25YH2ityz3Iwv5fCpdmtrhXfNpZI8XtUgAWjwFzeRy0ZPT7+SHZRboQN2czPjG 3BIdzaxxHlSEXOnCJ+5gkqHqEhdd8g9NJnS+6gxRlyvKTdbb1RACzSKBqHEfCK/C DzcId/O9Y8+iXpWSX7vO55fkU1c0zGSei+1Bl/kTtOccnvQM5zS7t9MKMIZM29JM QIYrcXx236A6n/5lV2fCz4PX+p3QP7tyjPCcbkO6N5b81Lw7pkG/VgCLTCqtqbkk KeGrLwZlgFgHHFapJHmBhpqGt3mAMZXUL10uGdhnVQucNc6JvyjX8wiWlSbwcjId 9S1iBnIIt3a2i/bbuFySySL2yTE5RwIKX/NynsWyaF6/pmYpfPjT6sue/Z38dbAN YaxkIbOPVYNeGAoEJ9gK/kNlYMHMGLxfTX+QyL4CKrJbMd2w3wU=
=nw8J
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Ben Hutchings on Wed Jul 9 20:00:01 2025

On Tue, Jul 08, 2025 at 09:35:21PM +0200, Ben Hutchings wrote:

Hi all,

Debian currently provides non-default kernel packages with the
PREEMPT_RT patchset applied and enabled. However, for the Debian 14
"forky" release the plan is to use only the upstream RT support.

One result of this is that the i915 driver will not be included in our
RT kernel package on amd64 because the upstream version lacks the
patches to make it compatible with PREEMPT_RT. This was not a surprise
to us, but may disappoint some of our users (for example see <https://bugs.debian.org/1108324>).

I see that Sebastian submitted the i915 fixes upstream in October 2024.
If I understand the explanation in <https://lore.kernel.org/intel-gfx/[email protected]/> rightly,
much of these changes are unsafe because i915 has its own hard timing requirement for reprogramming multiple display controller registers
within a single frame. Is that still the sticking point?

It seems like the critical uncore lock is currently held in a lot of
places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

The only edge case I know is the weird retry hack in __intel_get_crtc_scanline() which I suspect is just due to PSR
and could potentially be handled in a nicer way by actually
checking the PSR state.

Would it be practical to split
this lock into:

1. raw spinlock protecting only state needed for the atomic (within-one- frame) update

Spinlocks aren't involved in that. It is achieved by racing against
the beam, with interrupts disabled to make it more likely the CPU
wins the race.

2. regular spinlock protecting everything in uncore

or is almost all the uncore state potentially used during an atomic
update?

Would it help to offload the atomic updates to a kthread that runs with
RT priority but still with hard interrupts enabled?

Not sure what another thread would specifically get us, as opposed
to eg. just boosting the priority of the existing thread? But whatever
thread does the work needs to not be interrupted for any significant
amount of time.

The interrupt disabling part I suppose is rather hardware/workload
specific, so hard to say anything general about it.

Would it make things easier if setting CONFIG_PREEMPT_RT=y limited i915
to not run on some older hardware?

No. All hardware needs this.

Anyways, all of this is rather academic at this point. Someone
needs to try and see what works, and hammer it hard while doing so
to make sure it doesn't fall over easily.

--
Ville Syrj�l�
Intel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Sebastian Andrzej Siewior@21:1/5 to All on Wed Jul 9 22:10:01 2025

On 2025-07-09 20:30:26 [+0300], Ville Syrjälä wrote:

It seems like the critical uncore lock is currently held in a lot of
places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the
lock should be made raw just to be done with it.

Sebastian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Sebastian Andrzej Siewior on Wed Jul 9 22:40:01 2025

On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the
lock should be made raw just to be done with it.

Until someone actually does the work to confirm the thing is working
reliably there's no point in posting anything.

And IIRC the other remaining problem with RT was the spinlocks used
inside tracepoints (which is uncore lock, and probably some vblank
locks). So that too needs some kind of solution because it's going to
very hard to debug the timing sensitive parts without the tracepoints.

--
Ville Syrj�l�
Intel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Sebastian Andrzej Siewior@21:1/5 to Matthew Brost on Thu Jul 10 08:50:01 2025

On 2025-07-09 15:04:27 [-0700], Matthew Brost wrote:

And IIRC the other remaining problem with RT was the spinlocks used
inside tracepoints (which is uncore lock, and probably some vblank
locks). So that too needs some kind of solution because it's going to
very hard to debug the timing sensitive parts without the tracepoints.

A bit of a drive-by comment, but taking locks inside tracepoints seems
like a pretty horrible idea in general. We've managed to write an entire driver (Xe) from scratch and bring it up without doing this. I'd be very surprised if this is truly necessary in i915.

Steven made suggestions how to get around it make it work but my
impression was that this was not appreciated by the i915 side.
The general unwritten rule is to not to take any locks but simply assign variables.

Matt

Sebastian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Sebastian Andrzej Siewior@21:1/5 to All on Thu Jul 10 09:00:01 2025

On 2025-07-09 23:09:22 [+0300], Ville Syrjälä wrote:

On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 20:30:26 [+0300], Ville Syrjälä wrote:

It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the
lock should be made raw just to be done with it.

Until someone actually does the work to confirm the thing is working
reliably there's no point in posting anything.

Well it works on my machine and this machine dose not pass the code
paths that I patch.

Every patch made was done because someone reported an error/ warning and confirmed afterwards that the patch fixes it for them and they can use
the machine and don't observe anything.

And IIRC the other remaining problem with RT was the spinlocks used
inside tracepoints (which is uncore lock, and probably some vblank
locks). So that too needs some kind of solution because it's going to
very hard to debug the timing sensitive parts without the tracepoints.

no, not just that. Making the lock raw led to latency spikes in simple
spikes and I just disabled trace points. It could be worked around by
taking the lock if the tracepoint is enabled and then invoking the
tracepoint unconditionally and not taking the lock anymore. Steven made
a suggestion a while ago how to put this in macro as far as I remember.

Looking at series there are things like execlists_dequeue_irq() which do local_irq_disable() follwed by spin_lock() which is a no-no and
explained in Documentation/locking/locktypes.rst. There is also intel_guc_send_busy_loop() no considering RCU read-locks for their
atomic detection. I have 8 patches in total and one in the pipe.

I have patches with Acks by Tvrtko Ursulin but I remain caring them.
I can repost it but usually the bot complains, the bot gets retriggered.
The first patch in series is vblank detection and I've been told this is
old code and scares people. So that might be why the remaining part is
ignored.

Sebastian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Sebastian Andrzej Siewior@21:1/5 to All on Thu Jul 10 17:40:01 2025

On 2025-07-10 18:04:42 [+0300], Ville Syrjälä wrote:

When this was last discussed I suggested that there should be a
versions of the tracepoint macros that do the sampling outside
the lock, but that wasn't deemed acceptable for whatever reason.
I don't even know why the current macros are doing the
sampling while holding the lock...

Any objections to me sending the batch and we figure out later how get
the tracepoints for i915 enabled again on RT?
It would be an improvement because you could take a vanilla kernel and
enable PREEMPT_RT and you would only miss the tracepoints while now you
can't enable i915 at all and XE either doesn't compile or spills
warnings at runtime due to the code it shares with i915.

Sebastian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Sebastian Andrzej Siewior on Thu Jul 10 17:30:01 2025

On Thu, Jul 10, 2025 at 08:41:36AM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 23:09:22 [+0300], Ville Syrj�l� wrote:

On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

Well it works on my machine and this machine dose not pass the code
paths that I patch.

Every patch made was done because someone reported an error/ warning and confirmed afterwards that the patch fixes it for them and they can use
the machine and don't observe anything.

And IIRC the other remaining problem with RT was the spinlocks used
inside tracepoints (which is uncore lock, and probably some vblank
locks). So that too needs some kind of solution because it's going to
very hard to debug the timing sensitive parts without the tracepoints.

no, not just that. Making the lock raw led to latency spikes in simple
spikes and I just disabled trace points. It could be worked around by
taking the lock if the tracepoint is enabled and then invoking the
tracepoint unconditionally and not taking the lock anymore. Steven made
a suggestion a while ago how to put this in macro as far as I remember.

When this was last discussed I suggested that there should be a
versions of the tracepoint macros that do the sampling outside
the lock, but that wasn't deemed acceptable for whatever reason.
I don't even know why the current macros are doing the
sampling while holding the lock...

--
Ville Syrj�l�
Intel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Matthew Brost on Thu Jul 10 17:40:01 2025

On Wed, Jul 09, 2025 at 03:04:27PM -0700, Matthew Brost wrote:

On Wed, Jul 09, 2025 at 11:09:22PM +0300, Ville Syrj�l� wrote:

On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

And IIRC the other remaining problem with RT was the spinlocks used
inside tracepoints (which is uncore lock, and probably some vblank
locks). So that too needs some kind of solution because it's going to
very hard to debug the timing sensitive parts without the tracepoints.

A bit of a drive-by comment, but taking locks inside tracepoints seems
like a pretty horrible idea in general. We've managed to write an entire driver (Xe) from scratch and bring it up without doing this.

For xe gt stuff specifically the one reason for needing a lock
could be forcewake. Ie. if you read a register that needs
forcewake from a tracepoint you might need some kind of protection
against concurrent access. But xe lacks any kind of forcewake
sanity checker, so no one would likely even know if you get that
wrong. Unless they notice a bogus register value in the
trace that is. But maybe xe doesn't use such registers in its
tracepoints atm, who knows.

And speaking for hardware in general, indexed registers aren't
exactly an uncommon thing. So the tracing stuff really should
have a sane standard way to deal with them...

I'd be very
surprised if this is truly necessary in i915.

The most fundemental reason is the hardware issue present on
some platforms which can hang the machine if you access the
same cacheline from multiple threads simultaneously.

The other reason is that some machines lack the hardware
frame counter so we have to cook one up based on cpu side
timestamps, and that involves the vblank machinery locks.

--
Ville Syrj�l�
Intel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Mike Galbraith on Thu Jul 10 18:10:01 2025

On Thu, Jul 10, 2025 at 06:52:58AM +0200, Mike Galbraith wrote:

On Wed, 2025-07-09 at 23:09 +0300, Ville Syrj�l� wrote:

On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

What does that entail?

Basic testing would be something like this:
- enable CONFIG_DRM_I915_DEBUG_VBLANK_EVADE
- set i915.enable_dsb=0 to make sure everything takes the
mmio path
- stress the heck out of it and make sure the histogram
doesn't look significantly worse than on !RT
(kms_atomic_transition --extended might take care of the display
side here, but it should probably be accompanied with some
horrendous system loads which is a less well defined part)
- ideally do that on a potato
(some VLV/CHV (Atom) thing would probably be a good candidate)
- repeat with lockdep enabled to make everything even harder

--
Ville Syrj�l�
Intel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Matthew Brost on Thu Jul 10 20:40:01 2025

On Thu, Jul 10, 2025 at 11:04:22AM -0700, Matthew Brost wrote:

On Thu, Jul 10, 2025 at 06:21:09PM +0300, Ville Syrjälä wrote:

On Wed, Jul 09, 2025 at 03:04:27PM -0700, Matthew Brost wrote:

On Wed, Jul 09, 2025 at 11:09:22PM +0300, Ville Syrjälä wrote:

On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:

On 2025-07-09 20:30:26 [+0300], Ville Syrjälä wrote:

It seems like the critical uncore lock is currently held in a lot of
places and potentially for a long time.

It shouldn't be held for that long. I think it should just be
a raw spinlock.

What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

And IIRC the other remaining problem with RT was the spinlocks used inside tracepoints (which is uncore lock, and probably some vblank locks). So that too needs some kind of solution because it's going to very hard to debug the timing sensitive parts without the tracepoints.

A bit of a drive-by comment, but taking locks inside tracepoints seems like a pretty horrible idea in general. We've managed to write an entire driver (Xe) from scratch and bring it up without doing this.

For xe gt stuff specifically the one reason for needing a lock
could be forcewake. Ie. if you read a register that needs

Yes, we don't forcewake on demand in Xe; instead, we expect the upper
layers to hold these.

forcewake from a tracepoint you might need some kind of protection
against concurrent access. But xe lacks any kind of forcewake
sanity checker, so no one would likely even know if you get that

We do have some asserts to catch missing forcewake—see xe_force_wake_assert_held—but this is incomplete compared to the i915 checker.

wrong. Unless they notice a bogus register value in the
trace that is. But maybe xe doesn't use such registers in its
tracepoints atm, who knows.

I just audited the Xe tracepoints. We mostly just assign values from
software state, or in some cases, read memory from the VRAM BAR. The
latter again doesn't require any locks—just asserts that the device is awake.

And speaking for hardware in general, indexed registers aren't
exactly an uncommon thing. So the tracing stuff really should
have a sane standard way to deal with them...

I'd be very
surprised if this is truly necessary in i915.

The most fundemental reason is the hardware issue present on
some platforms which can hang the machine if you access the
same cacheline from multiple threads simultaneously.

Ooo, yeah—yikes.

One novel solution here: move the MMIO reads out of the tracepoint code
and into the caller, passing the values as arguments to the tracepoint.
Then, just assign the values inside the tracepoint. I think that would
work and be functionally equivalent.

I don't want a silly open coded solution like that. The tracepoint
should just do the right thing on its own. But I haven't actually
looked how hard it would to add that into the existing macro
jungle that is tracepoints...

--
Ville Syrjälä
Intel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Maarten Lankhorst@21:1/5 to Sebastian Andrzej Siewior on Fri Jul 11 15:10:01 2025

Hey,

On 2025-07-10 17:20, Sebastian Andrzej Siewior wrote:

On 2025-07-10 18:04:42 [+0300], Ville Syrjälä wrote:

When this was last discussed I suggested that there should be a
versions of the tracepoint macros that do the sampling outside
the lock, but that wasn't deemed acceptable for whatever reason.
I don't even know why the current macros are doing the
sampling while holding the lock...

Any objections to me sending the batch and we figure out later how get
the tracepoints for i915 enabled again on RT?
It would be an improvement because you could take a vanilla kernel and
enable PREEMPT_RT and you would only miss the tracepoints while now you
can't enable i915 at all and XE either doesn't compile or spills
warnings at runtime due to the code it shares with i915.

Sebastian

FWIW, I did some quick testing. There should be no need for disabling tracepoints on xe.
The uncore lock is for a very specific reason (intel_de.h):
* Certain architectures will die if the same cacheline is concurrently accessed
* by different clients (e.g. on Ivybridge). Access to registers should
* therefore generally be serialised, by either the dev_priv->uncore.lock or
* a more localised lock guarding all access to that bank of registers.

Since we only support modern platforms on xe there is no need for the uncore lock and the specific error will not trigger.
On xe, it only exists to be used in vblank evasion for compatiblity with i915.

From the xe point of view, I would say applying patch 1-3 is sufficient.

Patch 3 because i915_utils.h is used by display still, unfortunately..

I would recommend adjusting the patch to keep the display tracepoints enabled on xe,
the non-vblank i915-specific patches should be harmless to apply.

After that, the rest can be applied too.

While I understand the theoretical need for more testing, I think we should go for practical and apply
patch 1-2 too. Even on normal kernels there is absolutely no guarantee of fast completion.
I'd say on a deterministic -rt kernel, I'd say it's less likely that vblank evasion is a problem.

Kind regards,
~Maarten

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Sat Aug 1 13:03:37 2026
  from Madison, Nc via Telnet
- Centurion
  Fri Jul 31 22:18:08 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Fri Jul 31 21:25:53 2026
  from Wales, Uk via Telnet
- Rixter
  Fri Jul 31 19:41:16 2026
  from Madison, Nc via Telnet
- Rixter
  Fri Jul 31 19:29:50 2026
  from Madison, Nc via Telnet
- Rixter
  Fri Jul 31 19:18:30 2026
  from Madison, Nc via Telnet
- Bob Worm
  Fri Jul 31 15:23:30 2026
  from Wales, Uk via Telnet
- Rixter
  Fri Jul 31 12:17:09 2026
  from Madison, Nc via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	140:04:37
Calls:	12,471
Calls today:	1
Files:	15,203
Messages:	6,538,765

PREEMPT_RT vs i915

Who's Online

Recent Visitors

System Info