In article <[email protected]>, minforth <[email protected]> wrote:
On 27.06.2025 at 20:15, [email protected] wrote:
In article <[email protected]>,
LIT <[email protected]> wrote:
It really depends on how counted loops are implemented.
Most CPUs have operators for register-based count-down loops
that are blazingly fast.
If they can be used within Forth-based loop constructs
I would expect a greater speed increase than what you measured.
In that old fig-Forth it's rather short and simple:
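        sqHeader '(LOOP)'
XLOOP   dw  $ + 2           ; code field: machine code follows
        mov BX,1            ; increment = 1
XLOO1:  add [BP],BX         ; index += increment (BP = return stack pointer)
        mov AX,[BP]         ; AX = new index
        sub AX,[BP+2]       ; AX = index - limit
        xor AX,BX           ; fold in the sign of the increment
        js  BRAN1           ; negative: take the backward branch (loop again)
        add BP,4            ; done: drop index and limit
        inc SI              ; skip the branch offset cell
        inc SI
        jmp NEXT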
It doesn't look that bad. Can it be
done even shorter?
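
For reference, the termination test in that listing can be modelled in high-level Forth. This is only a sketch of the arithmetic (add the increment, subtract the limit, XOR with the increment, test the sign), assuming two's-complement cells; the names +loop-continues? and loop-continues? are made up for illustration and are not part of fig-Forth.

```forth
\ Hypothetical model of the sign test used by the (LOOP) listing.
\ Assumes two's-complement cells.
: +loop-continues? ( limit index incr -- flag )
  >R  R@ +          \ index' = index + incr
  SWAP -            \ index' - limit
  R> XOR            \ fold in the sign of the increment
  0< ;              \ negative result: loop again

\ (LOOP) is the special case with an increment of 1.
: loop-continues? ( limit index -- flag )
  1 +loop-continues? ;

\ 10 5 loop-continues?   leaves true  (index 6 is still below limit 10)
\ 10 9 loop-continues?   leaves false (index 10 has reached the limit)
```

The XOR lets one sign test serve both counting directions, which is why the machine code needs only a single conditional jump.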
My optimiser looks at the combination of DO and LOOP and, after
inlining everything, moves the return-stack items into registers.
The result is near VFX performance.
It is all experimental, but yes, there is much to be gained.
Must be tricky to do UNLOOP in a register-based loop. ;-)
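
To make the difficulty concrete, here is a minimal sketch in standard Forth of the two ways out of a DO loop. The words via-leave and via-exit are invented for this illustration. LEAVE discards the loop parameters and continues after LOOP, while UNLOOP EXIT discards them and returns from the definition at once, so an optimiser that keeps index and limit in registers has to handle both exits.

```forth
\ Hypothetical examples: both loops stop when I reaches 3.
: via-leave ( -- n )
  0  10 0 DO
    I 3 = IF LEAVE THEN        \ jump to just after LOOP
    1+
  LOOP
  100 + ;                      \ still executed after LEAVE

: via-exit ( -- n )
  0  10 0 DO
    I 3 = IF UNLOOP EXIT THEN  \ drop loop parameters, return now
    1+
  LOOP
  100 + ;                      \ skipped when the early EXIT is taken

\ via-leave leaves 103, via-exit leaves 3
```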
Indeed. One of my test cases is:
---------------------------
\ Interfering LEAVEs and EXITs.
: (TESTL) DO
DUP IF LEAVE ELSE UNLOOP EXIT THEN SWAP
2DUP IF LEAVE ELSE UNLOOP EXIT THEN 2SWAP
LOOP ROT ;
: testL (TESTL) 2OVER ;
SEE (TESTL)
'testL SHOW-IT
-------------------------------------------------
The result after inlining is:
---------------------------------------------
: testL
0 >R SWAP >R >R DUP
0BRANCH [ 20 , ] ( between ? UNLOOP )
BRANCH [ B0 , ] ( between ? UNLOOP )
BRANCH [ 18 , ] ( between ? SWAP ) UNLOOP
BRANCH [ 98 , ] ( between ROT 2OVER ) SWAP 2DUP
0BRANCH [ 20 , ] ( between ? UNLOOP )
BRANCH [ 58 , ] ( between ? UNLOOP )
BRANCH [ 18 , ] ( between ? 2SWAP ) UNLOOP
BRANCH [ 40 , ] ( between ROT 2OVER ) 2SWAP 1 (+LOOP)
0BRANCH [ -D8 , ] ( between >R DUP ) UNLOOP ROT 2OVER
;
--------------------------------------------
Then, after some peephole optimisation, the assembly looks like this:
Report about return stack usage
new report
2 8 1 0
1 9 1 1
0 10 2 2
LEA, BP'| XO| [BP] 4294967272 L,
Q: MOVI, X| R| BX| 0 IL,
Q: MOV, X| F| BX'| XO| [BP] 16 L,
POP|X, DX|
POP|X, AX|
PUSH|X, DX|
Q: MOV, X| F| AX'| XO| [BP] 8 L,
POP|X, BX|
Q: MOV, X| F| BX'| XO| [BP] 0 L,
POP|X, AX|
PUSH|X, AX|
Q: TEST, X| AX'| R| AX|
J|X, Z| Y| 17 (RL,)
POP|X, DX|
POP|X, BX|
POP|X, AX|
PUSH|X, BX|
PUSH|X, DX|
PUSH|X, AX|
LEA, BP'| XO| [BP] 24 L,
JMP, 27 (RL,)
LEA, BP'| XO| [BP] 24 L,
JMP, 16 (RL,)
JMP, 4294967263 (RL,)
LEA, BP'| XO| [BP] 24 L,
JMP, 0 (RL,)
POP|X, BX|
POP|X, CX|
POP|X, AX|
POP|X, DX|
PUSH|X, DX|
PUSH|X, AX|
PUSH|X, CX|
PUSH|X, BX|
PUSH|X, DX|
PUSH|X, AX|
You see that here the elimination of BP (the return stack) has not succeeded.
Three BRANCH/0BRANCH instances have disappeared, though.
Simple cases are more successful:
------------------------------------
: test2aa 4 >R 2 >R 1 R> 3 R> ;
'test2aa SHOW-IT
: test2aa
4 >R 2 >R 1 R> 3 R>
;
Report about return stack usage
new report
1 8 1 1
0 9 1 1
PUSHI|X, 1 IL,
PUSHI|X, 2 IL,
PUSHI|X, 3 IL,
QN: MOVI, X| R| AX| 4 IL,
QN: MOVI, X| R| CX| 2 IL,
PUSHI|X, 4 IL,
------------------------------------
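
As a sanity check on that listing, the source definition can be traced by hand: 4 and 2 go to the return stack, 1 is pushed, 2 comes back, 3 is pushed, 4 comes back. The sketch below repeats the definition so that it is self-contained; the word show-test2aa is a made-up helper for this check.

```forth
: test2aa  4 >R  2 >R  1 R>  3 R> ;   \ leaves 1 2 3 4, with 4 on top

\ Hypothetical helper: print the four results, top of stack first.
: show-test2aa ( -- )
  test2aa  . . . . ;

\ show-test2aa prints the values in the order 4 3 2 1
```

This matches the four PUSHI instructions in the generated code; the two MOVI instructions contribute nothing to the result.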
But the optimiser doesn't detect that the moves into registers AX and CX
can be eliminated. (Only DSP, RSP and HIP, held in SP, BP and DI, are live
at the end of a definition.)
--
The Chinese government is satisfied with its military superiority over USA.
The next 5 year plan has as primary goal to advance life expectancy
over 80 years, like Western Europe.