• PREEMPT_RT vs i915

    From Ben Hutchings@21:1/5 to All on Tue Jul 8 21:40:01 2025
    Hi all,

    Debian currently provides non-default kernel packages with the
    PREEMPT_RT patchset applied and enabled. However, for the Debian 14
    "forky" release the plan is to use only the upstream RT support.

    One result of this is that the i915 driver will not be included in our
    RT kernel package on amd64 because the upstream version lacks the
    patches to make it compatible with PREEMPT_RT. This was not a surprise
    to us, but may disappoint some of our users (for example see <https://bugs.debian.org/1108324>).

    I see that Sebastian submitted the i915 fixes upstream in October 2024.
    If I understand the explanation in <https://lore.kernel.org/intel-gfx/[email protected]/> rightly,
    much of these changes are unsafe because i915 has its own hard timing requirement for reprogramming multiple display controller registers
    within a single frame. Is that still the sticking point?

    It seems like the critical uncore lock is currently held in a lot of
    places and potentially for a long time. Would it be practical to split
    this lock into:

    1. raw spinlock protecting only state needed for the atomic (within-one-
    frame) update
    2. regular spinlock protecting everything in uncore

    or is almost all the uncore state potentially used during an atomic
    update?

    Would it help to offload the atomic updates to a kthread that runs with
    RT priority but still with hard interrupts enabled?

    Would it make things easier if setting CONFIG_PREEMPT_RT=y limited i915
    to not run on some older hardware?

    Ben.

    --
    Ben Hutchings
    All the simple programs have been written, and all the good names taken

    -----BEGIN PGP SIGNATURE-----

    iQIzBAABCgAdFiEErCspvTSmr92z9o8157/I7JWGEQkFAmhtcvkACgkQ57/I7JWG EQnhMxAAm3p4Mf9jNXbIN0mu2OooFE5xYUjVuzj887FjqEydecdn3YXVdZnpkIVH jALbfFQLiGomejWH15jyQHig1c4/mwbSddrSJ0+Tl9Pkr6cdcvO2Yg1Kv6vVZS8b RrrMGJzsz8zN75pqptFDaDo8NWNFTcOcUfAn2PWZbpWfLtkX/w05nvZHTMIwILKL Gq5LiEEazGPO8T8BMcQYUGdPY2/3qNELtTWxHbmg8n4wrqcGm25mq8p13CRXwNbO XA25YH2ityz3Iwv5fCpdmtrhXfNpZI8XtUgAWjwFzeRy0ZPT7+SHZRboQN2czPjG 3BIdzaxxHlSEXOnCJ+5gkqHqEhdd8g9NJnS+6gxRlyvKTdbb1RACzSKBqHEfCK/C DzcId/O9Y8+iXpWSX7vO55fkU1c0zGSei+1Bl/kTtOccnvQM5zS7t9MKMIZM29JM QIYrcXx236A6n/5lV2fCz4PX+p3QP7tyjPCcbkO6N5b81Lw7pkG/VgCLTCqtqbkk KeGrLwZlgFgHHFapJHmBhpqGt3mAMZXUL10uGdhnVQucNc6JvyjX8wiWlSbwcjId 9S1iBnIIt3a2i/bbuFySySL2yTE5RwIKX/NynsWyaF6/pmYpfPjT6sue/Z38dbAN YaxkIbOPVYNeGAoEJ9gK/kNlYMHMGLxfTX+QyL4CKrJbMd2w3wU=
    =nw8J
    -----END PGP SIGNATURE-----

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Ben Hutchings on Wed Jul 9 20:00:01 2025
    On Tue, Jul 08, 2025 at 09:35:21PM +0200, Ben Hutchings wrote:
    Hi all,

    Debian currently provides non-default kernel packages with the
    PREEMPT_RT patchset applied and enabled. However, for the Debian 14
    "forky" release the plan is to use only the upstream RT support.

    One result of this is that the i915 driver will not be included in our
    RT kernel package on amd64 because the upstream version lacks the
    patches to make it compatible with PREEMPT_RT. This was not a surprise
    to us, but may disappoint some of our users (for example see <https://bugs.debian.org/1108324>).

    I see that Sebastian submitted the i915 fixes upstream in October 2024.
    If I understand the explanation in <https://lore.kernel.org/intel-gfx/[email protected]/> rightly,
    much of these changes are unsafe because i915 has its own hard timing requirement for reprogramming multiple display controller registers
    within a single frame. Is that still the sticking point?

    It seems like the critical uncore lock is currently held in a lot of
    places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    The only edge case I know is the weird retry hack in __intel_get_crtc_scanline() which I suspect is just due to PSR
    and could potentially be handled in a nicer way by actually
    checking the PSR state.

    Would it be practical to split
    this lock into:

    1. raw spinlock protecting only state needed for the atomic (within-one- frame) update

    Spinlocks aren't involved in that. It is achieved by racing against
    the beam, with interrupts disabled to make it more likely the CPU
    wins the race.

    2. regular spinlock protecting everything in uncore

    or is almost all the uncore state potentially used during an atomic
    update?

    Would it help to offload the atomic updates to a kthread that runs with
    RT priority but still with hard interrupts enabled?

    Not sure what another thread would specifically get us, as opposed
    to eg. just boosting the priority of the existing thread? But whatever
    thread does the work needs to not be interrupted for any significant
    amount of time.

    The interrupt disabling part I suppose is rather hardware/workload
    specific, so hard to say anything general about it.


    Would it make things easier if setting CONFIG_PREEMPT_RT=y limited i915
    to not run on some older hardware?

    No. All hardware needs this.

    Anyways, all of this is rather academic at this point. Someone
    needs to try and see what works, and hammer it hard while doing so
    to make sure it doesn't fall over easily.

    --
    Ville Syrj�l�
    Intel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Sebastian Andrzej Siewior@21:1/5 to All on Wed Jul 9 22:10:01 2025
    On 2025-07-09 20:30:26 [+0300], Ville Syrjälä wrote:

    It seems like the critical uncore lock is currently held in a lot of
    places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the
    lock should be made raw just to be done with it.

    Sebastian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Sebastian Andrzej Siewior on Wed Jul 9 22:40:01 2025
    On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

    It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the
    lock should be made raw just to be done with it.

    Until someone actually does the work to confirm the thing is working
    reliably there's no point in posting anything.

    And IIRC the other remaining problem with RT was the spinlocks used
    inside tracepoints (which is uncore lock, and probably some vblank
    locks). So that too needs some kind of solution because it's going to
    very hard to debug the timing sensitive parts without the tracepoints.

    --
    Ville Syrj�l�
    Intel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Sebastian Andrzej Siewior@21:1/5 to Matthew Brost on Thu Jul 10 08:50:01 2025
    On 2025-07-09 15:04:27 [-0700], Matthew Brost wrote:
    And IIRC the other remaining problem with RT was the spinlocks used
    inside tracepoints (which is uncore lock, and probably some vblank
    locks). So that too needs some kind of solution because it's going to
    very hard to debug the timing sensitive parts without the tracepoints.

    A bit of a drive-by comment, but taking locks inside tracepoints seems
    like a pretty horrible idea in general. We've managed to write an entire driver (Xe) from scratch and bring it up without doing this. I'd be very surprised if this is truly necessary in i915.

    Steven made suggestions how to get around it make it work but my
    impression was that this was not appreciated by the i915 side.
    The general unwritten rule is to not to take any locks but simply assign variables.

    Matt

    Sebastian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Sebastian Andrzej Siewior@21:1/5 to All on Thu Jul 10 09:00:01 2025
    On 2025-07-09 23:09:22 [+0300], Ville Syrjälä wrote:
    On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 20:30:26 [+0300], Ville Syrjälä wrote:

    It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the
    lock should be made raw just to be done with it.

    Until someone actually does the work to confirm the thing is working
    reliably there's no point in posting anything.

    Well it works on my machine and this machine dose not pass the code
    paths that I patch.

    Every patch made was done because someone reported an error/ warning and confirmed afterwards that the patch fixes it for them and they can use
    the machine and don't observe anything.

    And IIRC the other remaining problem with RT was the spinlocks used
    inside tracepoints (which is uncore lock, and probably some vblank
    locks). So that too needs some kind of solution because it's going to
    very hard to debug the timing sensitive parts without the tracepoints.

    no, not just that. Making the lock raw led to latency spikes in simple
    spikes and I just disabled trace points. It could be worked around by
    taking the lock if the tracepoint is enabled and then invoking the
    tracepoint unconditionally and not taking the lock anymore. Steven made
    a suggestion a while ago how to put this in macro as far as I remember.

    Looking at series there are things like execlists_dequeue_irq() which do local_irq_disable() follwed by spin_lock() which is a no-no and
    explained in Documentation/locking/locktypes.rst. There is also intel_guc_send_busy_loop() no considering RCU read-locks for their
    atomic detection. I have 8 patches in total and one in the pipe.

    I have patches with Acks by Tvrtko Ursulin but I remain caring them.
    I can repost it but usually the bot complains, the bot gets retriggered.
    The first patch in series is vblank detection and I've been told this is
    old code and scares people. So that might be why the remaining part is
    ignored.

    Sebastian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Sebastian Andrzej Siewior@21:1/5 to All on Thu Jul 10 17:40:01 2025
    On 2025-07-10 18:04:42 [+0300], Ville Syrjälä wrote:

    When this was last discussed I suggested that there should be a
    versions of the tracepoint macros that do the sampling outside
    the lock, but that wasn't deemed acceptable for whatever reason.
    I don't even know why the current macros are doing the
    sampling while holding the lock...

    Any objections to me sending the batch and we figure out later how get
    the tracepoints for i915 enabled again on RT?
    It would be an improvement because you could take a vanilla kernel and
    enable PREEMPT_RT and you would only miss the tracepoints while now you
    can't enable i915 at all and XE either doesn't compile or spills
    warnings at runtime due to the code it shares with i915.

    Sebastian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Sebastian Andrzej Siewior on Thu Jul 10 17:30:01 2025
    On Thu, Jul 10, 2025 at 08:41:36AM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 23:09:22 [+0300], Ville Syrj�l� wrote:
    On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

    It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

    Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

    Well it works on my machine and this machine dose not pass the code
    paths that I patch.

    Every patch made was done because someone reported an error/ warning and confirmed afterwards that the patch fixes it for them and they can use
    the machine and don't observe anything.

    And IIRC the other remaining problem with RT was the spinlocks used
    inside tracepoints (which is uncore lock, and probably some vblank
    locks). So that too needs some kind of solution because it's going to
    very hard to debug the timing sensitive parts without the tracepoints.

    no, not just that. Making the lock raw led to latency spikes in simple
    spikes and I just disabled trace points. It could be worked around by
    taking the lock if the tracepoint is enabled and then invoking the
    tracepoint unconditionally and not taking the lock anymore. Steven made
    a suggestion a while ago how to put this in macro as far as I remember.

    When this was last discussed I suggested that there should be a
    versions of the tracepoint macros that do the sampling outside
    the lock, but that wasn't deemed acceptable for whatever reason.
    I don't even know why the current macros are doing the
    sampling while holding the lock...

    --
    Ville Syrj�l�
    Intel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Matthew Brost on Thu Jul 10 17:40:01 2025
    On Wed, Jul 09, 2025 at 03:04:27PM -0700, Matthew Brost wrote:
    On Wed, Jul 09, 2025 at 11:09:22PM +0300, Ville Syrj�l� wrote:
    On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

    It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

    Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

    And IIRC the other remaining problem with RT was the spinlocks used
    inside tracepoints (which is uncore lock, and probably some vblank
    locks). So that too needs some kind of solution because it's going to
    very hard to debug the timing sensitive parts without the tracepoints.

    A bit of a drive-by comment, but taking locks inside tracepoints seems
    like a pretty horrible idea in general. We've managed to write an entire driver (Xe) from scratch and bring it up without doing this.

    For xe gt stuff specifically the one reason for needing a lock
    could be forcewake. Ie. if you read a register that needs
    forcewake from a tracepoint you might need some kind of protection
    against concurrent access. But xe lacks any kind of forcewake
    sanity checker, so no one would likely even know if you get that
    wrong. Unless they notice a bogus register value in the
    trace that is. But maybe xe doesn't use such registers in its
    tracepoints atm, who knows.

    And speaking for hardware in general, indexed registers aren't
    exactly an uncommon thing. So the tracing stuff really should
    have a sane standard way to deal with them...

    I'd be very
    surprised if this is truly necessary in i915.

    The most fundemental reason is the hardware issue present on
    some platforms which can hang the machine if you access the
    same cacheline from multiple threads simultaneously.

    The other reason is that some machines lack the hardware
    frame counter so we have to cook one up based on cpu side
    timestamps, and that involves the vblank machinery locks.

    --
    Ville Syrj�l�
    Intel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Mike Galbraith on Thu Jul 10 18:10:01 2025
    On Thu, Jul 10, 2025 at 06:52:58AM +0200, Mike Galbraith wrote:
    On Wed, 2025-07-09 at 23:09 +0300, Ville Syrj�l� wrote:
    On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 20:30:26 [+0300], Ville Syrj�l� wrote:

    It seems like the critical uncore lock is currently held in a lot of places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

    Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

    What does that entail?

    Basic testing would be something like this:
    - enable CONFIG_DRM_I915_DEBUG_VBLANK_EVADE
    - set i915.enable_dsb=0 to make sure everything takes the
    mmio path
    - stress the heck out of it and make sure the histogram
    doesn't look significantly worse than on !RT
    (kms_atomic_transition --extended might take care of the display
    side here, but it should probably be accompanied with some
    horrendous system loads which is a less well defined part)
    - ideally do that on a potato
    (some VLV/CHV (Atom) thing would probably be a good candidate)
    - repeat with lockdep enabled to make everything even harder

    --
    Ville Syrj�l�
    Intel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ville =?iso-8859-1?Q?Syrj=E4l=E4?=@21:1/5 to Matthew Brost on Thu Jul 10 20:40:01 2025
    On Thu, Jul 10, 2025 at 11:04:22AM -0700, Matthew Brost wrote:
    On Thu, Jul 10, 2025 at 06:21:09PM +0300, Ville Syrjälä wrote:
    On Wed, Jul 09, 2025 at 03:04:27PM -0700, Matthew Brost wrote:
    On Wed, Jul 09, 2025 at 11:09:22PM +0300, Ville Syrjälä wrote:
    On Wed, Jul 09, 2025 at 09:44:43PM +0200, Sebastian Andrzej Siewior wrote:
    On 2025-07-09 20:30:26 [+0300], Ville Syrjälä wrote:

    It seems like the critical uncore lock is currently held in a lot of
    places and potentially for a long time.

    It shouldn't be held for that long. I think it should just be
    a raw spinlock.

    What about I resubmit the series and we look again? I don't think the lock should be made raw just to be done with it.

    Until someone actually does the work to confirm the thing is working reliably there's no point in posting anything.

    And IIRC the other remaining problem with RT was the spinlocks used inside tracepoints (which is uncore lock, and probably some vblank locks). So that too needs some kind of solution because it's going to very hard to debug the timing sensitive parts without the tracepoints.

    A bit of a drive-by comment, but taking locks inside tracepoints seems like a pretty horrible idea in general. We've managed to write an entire driver (Xe) from scratch and bring it up without doing this.

    For xe gt stuff specifically the one reason for needing a lock
    could be forcewake. Ie. if you read a register that needs

    Yes, we don't forcewake on demand in Xe; instead, we expect the upper
    layers to hold these.

    forcewake from a tracepoint you might need some kind of protection
    against concurrent access. But xe lacks any kind of forcewake
    sanity checker, so no one would likely even know if you get that

    We do have some asserts to catch missing forcewake—see xe_force_wake_assert_held—but this is incomplete compared to the i915 checker.

    wrong. Unless they notice a bogus register value in the
    trace that is. But maybe xe doesn't use such registers in its
    tracepoints atm, who knows.


    I just audited the Xe tracepoints. We mostly just assign values from
    software state, or in some cases, read memory from the VRAM BAR. The
    latter again doesn't require any locks—just asserts that the device is awake.

    And speaking for hardware in general, indexed registers aren't
    exactly an uncommon thing. So the tracing stuff really should
    have a sane standard way to deal with them...

    I'd be very
    surprised if this is truly necessary in i915.

    The most fundemental reason is the hardware issue present on
    some platforms which can hang the machine if you access the
    same cacheline from multiple threads simultaneously.


    Ooo, yeah—yikes.

    One novel solution here: move the MMIO reads out of the tracepoint code
    and into the caller, passing the values as arguments to the tracepoint.
    Then, just assign the values inside the tracepoint. I think that would
    work and be functionally equivalent.

    I don't want a silly open coded solution like that. The tracepoint
    should just do the right thing on its own. But I haven't actually
    looked how hard it would to add that into the existing macro
    jungle that is tracepoints...

    --
    Ville Syrjälä
    Intel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Maarten Lankhorst@21:1/5 to Sebastian Andrzej Siewior on Fri Jul 11 15:10:01 2025
    Hey,

    On 2025-07-10 17:20, Sebastian Andrzej Siewior wrote:
    On 2025-07-10 18:04:42 [+0300], Ville Syrjälä wrote:

    When this was last discussed I suggested that there should be a
    versions of the tracepoint macros that do the sampling outside
    the lock, but that wasn't deemed acceptable for whatever reason.
    I don't even know why the current macros are doing the
    sampling while holding the lock...

    Any objections to me sending the batch and we figure out later how get
    the tracepoints for i915 enabled again on RT?
    It would be an improvement because you could take a vanilla kernel and
    enable PREEMPT_RT and you would only miss the tracepoints while now you
    can't enable i915 at all and XE either doesn't compile or spills
    warnings at runtime due to the code it shares with i915.

    Sebastian

    FWIW, I did some quick testing. There should be no need for disabling tracepoints on xe.
    The uncore lock is for a very specific reason (intel_de.h):
    * Certain architectures will die if the same cacheline is concurrently accessed
    * by different clients (e.g. on Ivybridge). Access to registers should
    * therefore generally be serialised, by either the dev_priv->uncore.lock or
    * a more localised lock guarding all access to that bank of registers.

    Since we only support modern platforms on xe there is no need for the uncore lock and the specific error will not trigger.
    On xe, it only exists to be used in vblank evasion for compatiblity with i915.

    From the xe point of view, I would say applying patch 1-3 is sufficient.
    Patch 3 because i915_utils.h is used by display still, unfortunately..

    I would recommend adjusting the patch to keep the display tracepoints enabled on xe,
    the non-vblank i915-specific patches should be harmless to apply.

    After that, the rest can be applied too.

    While I understand the theoretical need for more testing, I think we should go for practical and apply
    patch 1-2 too. Even on normal kernels there is absolutely no guarantee of fast completion.
    I'd say on a deterministic -rt kernel, I'd say it's less likely that vblank evasion is a problem.

    Kind regards,
    ~Maarten

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)