Forum: >>> Magnum BBS <<<

Re: python text, Byte Addressability And Beyond

From John Levine@21:1/5 to All on Sat May 11 22:53:09 2024

According to Anton Ertl <[email protected]>:

Looking up "splicing strings", I find that this is a term used in
connection with Python for specifying substrings. Python3 is a
language that lives the codepoint mistake to the extreme (and from
what I read, this was one of the major pain points in the
Python2->Python3 transition), but anyway, with UTF-8 one way to
represent a substring is to use the start index and length in bytes
(aka code units) rather than code points.

Python3 has a complex internal string format that stores each string
as 1, 2, or 4 byte values, depending on what the contents of the
string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
contain code points beyond UCS-2 are four bytes. It's not clear how
hard they try to shrink stuff down when taking substrings.

https://peps.python.org/pep-0393/
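
A rough way to observe the 1/2/4-byte storage from Python itself, as a sketch only: it assumes CPython 3.3 or later (where PEP 393 is implemented) and relies on sys.getsizeof, which other implementations may not support. The helper name bytes_per_code_point is made up for this illustration.

```python
import sys

def bytes_per_code_point(s: str) -> int:
    """Estimate the storage width CPython chose for s (1, 2, or 4 bytes)."""
    # Two freshly built strings of the same kind share the same header size,
    # so their size difference is exactly len(s) * width. CPython-specific.
    return (sys.getsizeof(s * 3) - sys.getsizeof(s * 2)) // len(s)

for sample in ("abc", "caf\u00e9", "\u03b1\u03b2\u03b3", "\U0001F600"):
    print(ascii(sample), bytes_per_code_point(sample))

# Does taking a substring narrow the representation again?
# Measure on your own interpreter rather than assume.
mixed = "a" * 100 + "\U0001F600"
print(bytes_per_code_point(mixed), bytes_per_code_point(mixed[:100]))
```

The last line answers the "how hard do they try to shrink" question empirically for whatever interpreter runs it.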

Python lets you subscript strings to get either individual items or
substrings, and I have written a fair amount of code that does that. I
realize that if I were doing semantic processing on Greek or Arabic, I
would not be subscripting and expecting it to return straightforwardly
useful results.

The string structure has a field for the length of the string in
UTF-8, but they don't seem to use it for anything, at least not yet.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Levine on Sun May 12 05:40:45 2024

John Levine <[email protected]> writes:

Python3 has a complex internal string format that stores each string
as 1, 2, or 4 byte values, depending on what the contents of the
string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
contain code points beyond UCS-2 are four bytes. It's not clear how
hard they try to shrink stuff down when taking substrings.

https://peps.python.org/pep-0393/

This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to. In the general case they can have three
representations of the same string: wstr, utf8, and data; only one of
them needs to be non-NULL, and data is canonical if it is non-NULL
(not sure what is canonical if wstr and utf8 are present but data is
not). If data is in latin1 format, but not ASCII, outputting both
UTF-8 and UTF-16 needs conversion (it's just 8bit->16bit expansion in
the UTF-16 case, but that means that a fast block copy is
insufficient). On top of that, they specify both zero termination and
length indicators: length, utf8_length and wstr_length.

Of course Python3 has baked this mistake into their API, and once
software has been written for that API, the complexity becomes
necessary.

But if they had decided to just store the data as UTF-8 and use byte
indexes and lengths in their API, and adjusted the rest of their API
accordingly, they could have avoided this complexity and inefficiency,
and only palindrome and anagram programs that limit themselves to
character=codepoint would have become harder to write.

Python lets you subscript strings to get either individual items or
substrings, and I have written a fair amount of code that does that. I
realize that if I were doing semantic processing on Greek or Arabic, I
would not be subscripting and expecting it to return straightforwardly
useful results.

I don't doubt that the API works, it just leads to unnecessary
complexity in the implementation.

The string structure has a field for the length of the string in
UTF-8, but they don't seem to use it for anything, at least not yet.

My understanding from the PEP is that they use it for specifying the
length of the utf8 representation; of course, they also use zero
termination, so if the utf8 field is only passed to functions that use
zero-termination, the utf8_length field is not used. Given that, as
soon as data has been initialized, the contents of the utf8 and wstr
fields are no longer used (they are not canonical), I expect that the
only function that is called for the utf8 field is that for converting
from utf8 to the data form.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Anton Ertl@21:1/5 to Thomas Koenig on Sun May 12 09:00:53 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by using
UTF-32 (or UTF-16); it isn't, and one cannot.

Do you really mean one cannot change an individual character
using UTF-32?

Correct. That's the "one cannot" part. A Unicode code point is not
a character, and what UTF-32 gives you is one code point per code unit
(a code unit is a fixed-size container, 32 bits for UTF-32, 8 bits for
UTF-8), not one character per code unit. But Unicode supports
characters that consist of a sequence of several code points, see
<https://en.wikipedia.org/wiki/Combining_character>, so if you just
store one Unicode code point at the address where a different code
point currently is, you have not overwritten a character, just a code
point; admittedly, the result is that you have changed one or two
characters, but that's probably not what the user wanted.

E.g., consider the following Gforth code (others can tell you how to
do it in Python):

"Ko\u0308nig" cr type

The output is:

König

That is, the second character consists of two Unicode code points, the
"o" and the "\u0308" (Combining Diaeresis).

(I think that somewhere along the way from the Forth system to the
xterm through copying and pasting into Emacs the second character has
become precomposed, but that's probably just as well, so you can see
what I see).

If I replace the third code point with an e, I get "Koenig". So by
overwriting one code point, I insert a character into the string.

If instead I replace the second code point with a "\u0316" (Combining
Grave Accent Below):

"K\u0316\u0308nig" cr type

I get this (which looks as expected in my xterm, but not in Emacs)

K̖̈nig

The first character is now a K with a diaeresis above and a grave
accent below, and there are now a total of 4 characters, but still 6
code points in the string; the second character has been deleted by
this code-point replacement.
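
For readers who want the Python counterpart of the Gforth lines above, here is a minimal sketch. Python str objects are immutable, so "overwriting" a code point is expressed as building a new string; overwrite and count_characters are names invented for this example, and counting non-combining code points is only an approximation of what a user perceives as characters.

```python
import unicodedata

def count_characters(s: str) -> int:
    """Count base code points; combining marks attach to the preceding base."""
    return sum(1 for cp in s if unicodedata.combining(cp) == 0)

def overwrite(s: str, index: int, cp: str) -> str:
    """Return s with the code point at index replaced (str is immutable)."""
    return s[:index] + cp + s[index + 1:]

koenig = "Ko\u0308nig"
inserted = overwrite(koenig, 2, "e")        # third code point becomes "e"
deleted = overwrite(koenig, 1, "\u0316")    # second code point becomes a combining mark

# Print: number of code points, approximate number of characters, the string.
for s in (koenig, inserted, deleted):
    print(len(s), count_characters(s), ascii(s))
```

All three strings have 6 code points, while the approximate character count goes from 5 to 6 in one case and to 4 in the other.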

Back to replacing characters instead of overwriting code points: If
you want to replace the second character, you would need to replace
two code points; if the replacement of the character has only one code
point or more than two, you need to move the remaining three
characters. You have this problem whether the string is represented
as UTF-32 or UTF-8.
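
The point that the shifting problem is the same in both encodings can be checked with a small sketch. replace_span is a hypothetical helper; the byte offsets of the second character are worked out by hand for this particular string.

```python
def replace_span(encoded: bytes, start: int, end: int, new: bytes) -> bytes:
    """Replace encoded[start:end]; everything after end shifts if sizes differ."""
    return encoded[:start] + new + encoded[end:]

text = "Ko\u0308nig"

utf8 = text.encode("utf-8")        # K: 1 byte, o: 1, U+0308: 2, then n, i, g
utf32 = text.encode("utf-32-le")   # 4 bytes per code point, no BOM

# The second character ("o" + U+0308) occupies bytes 1..4 in UTF-8
# and bytes 4..12 in UTF-32.
utf8_new = replace_span(utf8, 1, 4, "e".encode("utf-8"))
utf32_new = replace_span(utf32, 4, 12, "e".encode("utf-32-le"))

print(len(utf8), len(utf8_new))
print(len(utf32), len(utf32_new))
print(utf8_new.decode("utf-8"), utf32_new.decode("utf-32-le"))
```

In both encodings the tail "nig" ends up at a different offset after the replacement.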

I assume you mean "there is no need to do it".

That, too. That is the "it isn't" part of the statement.

If you stick with UTF-8
and use byte lengths and byte indexes, you can do almost everything as
well or better (with less complication and more efficiently) as by
converting to UTF-32 and back.
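
As an illustration of working with byte indexes and byte lengths, here is a sketch using Python's bytes type as a stand-in for a UTF-8 string API (the sample text is invented). Searching for a complete UTF-8 sequence inside valid UTF-8 yields a byte offset that can be used directly for slicing, without decoding anything.

```python
text = "Gr\u00fc\u00dfe an Ko\u0308nig und K\u00f6nigin".encode("utf-8")
needle = "Ko\u0308nig".encode("utf-8")

start = text.find(needle)          # byte index, not code-point index
length = len(needle)               # byte length
substring = text[start:start + length]

print(start, length, substring.decode("utf-8"))

# For comparison: the code-point index of the same match.
print(text.decode("utf-8").find(needle.decode("utf-8")))
```

The (start, length) pair in bytes is all a caller needs to pass the substring on.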

Assume you're implementing a language which has a function of
setting an individual character in a string.

That's a design mistake in the language, and I know no language that
has this misfeature.

Instead, what we see is one language (Python3) that has an even worse
misfeature: You can set an individual code point in a string; see
above for the things you get when you overwrite code points.

But why would one want to set individual code points? What about
setting individual code units (in the case of UTF-8, the code unit is
a byte) or bits? If you think that replacing parts of a character is
a feature, why not go all the way?

How would you implement it? Run through the string?

You have to do that anyway, because of combining characters.

Would you then also
store additional information somewhere so that the next character
that the user sets does not need to do it again?

Probably not. I would discourage the users from using this misfeature
and steer them to better alternatives.

Alternatively, if it's a really important misfeature, I would use an
editing-friendly string representation (maybe a piece table or rope)
and/or maybe do some Python3-style craziness and have the string be
represented by an array of characters, and every character is
represented by a pointer into a UTF-8 sequence.

In the case of Python3, the sequence seems to have been that they
started out with the bad idea that indexing a string by code point is
the way to go, and then designed a first implementation catering to
that premise, and published it without reconsidering the premise,
despite the efficiency cost. And of course it was too inefficient for
some use cases, but it was too late to switch to a more sensible
design, so they invented the more complex, but more efficient (than
the first implementation) PEP 393 implementation.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From David Brown@21:1/5 to Anton Ertl on Sun May 12 13:10:56 2024

On 12/05/2024 07:40, Anton Ertl wrote:

John Levine <[email protected]> writes:

Python3 has a complex internal string format that stores each string
as 1, 2, or 4 byte values, depending on what the contents of the
string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
contain code points beyond UCS-2 are four bytes. It's not clear how
hard they try to shrink stuff down when taking substrings.

https://peps.python.org/pep-0393/

This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to.

A lot of this is, I suspect, for historical reasons. When Python was
young, most software and languages used either plain ASCII or a mess of
code pages for 8-bit encodings (or an even bigger mess of 16-bit
encodings for CJK languages). Unicode was the new hope for a unifying
16-bit system that would work for all characters in all languages. So
Python, like Java, Windows NT, Qt, and some other systems of that era,
chose UCS-2 as the modern, international and future-proof solution to
strings and characters.

It turns out that UCS-2 was not enough, and these have all been
suffering from mixed APIs ever since.

From Anton Ertl@21:1/5 to David Brown on Sun May 12 16:12:26 2024

David Brown <[email protected]> writes:

On 12/05/2024 07:40, Anton Ertl wrote:

John Levine <[email protected]> writes:

Python3 has a complex internal string format that stores each string
as 1, 2, or 4 byte values, depending on what the contents of the
string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
contain code points beyond UCS-2 are four bytes. It's not clear how
hard they try to shrink stuff down when taking substrings.

https://peps.python.org/pep-0393/

This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to.

A lot of this is, I suspect, for historical reasons. When Python was
young, most software and languages used either plain ASCII or a mess of
code pages for 8-bit encodings (or an even bigger mess of 16-bit
encodings for CJK languages). Unicode was the new hope for a unifying
16-bit system that would work for all characters in all languages. So
Python, like Java, Windows NT, Qt, and some other systems of that era,
chose UCS-2 as the modern, international and future-proof solution to
strings and characters.

It turns out that UCS-2 was not enough, and these have all been
suffering from mixed APIs ever since.

That's certainly true for Java (first release 1995), Windows NT (first
released 1993) and Qt (first released 1995).

At that time Unicode 1.x (released 1991) was supposed to be the wave
of the future, and it offered the (to Westerners) familiar environment
of character = code unit (= 16 bits), ignoring the experience of the
East Asians with ASCII-compatible variable-width encodings. For new
systems the 16-bit code unit seemed to be the way to go, and the mixed
APIs directly stem from that, because they imagined that legacy
software that uses 8-bit code units would be rewritten to use 16-bit
code units after a while, but of course the new system has to run
legacy software, so it also provided a legacy API.

It did not work out. Software using 8-bit code units was (for the
most part) not converted to use 16-bit code units, and 16 bits was
found to be not enough for a universal character set.

In the meantime, the Silicon Valley based Unicode effort was merged
with the ISO-based Universal Coded Character Set (UCS) effort (the
name Unicode was kept) and we got Unicode 2.0 in 1996. Now if code
unit = character had been as important as was thought in
Silicon Valley, the logical step would have been to go for 32-bit
characters. But the UCS effort had brought in the experience with
ASCII-compatible variable-width encodings, and so we got not just
fixed-width UTF-32, but also variable-width ASCII-compatible UTF-8
and variable-width UTF-16 (to be backwards compatible with the
systems/interfaces that were designed for 16-bit code units in the
early 1990s).

And, lo and behold, the systems that had adopted 16-bit code units
kept the 16-bit code units and accepted that characters were now
variable-width, because variable width is obviously easier to add to
an existing code base than switching the code unit size.

Plus at some point (not sure when) they decided that characters have
to be composable, so even an encoding like UTF-32 with 32-bit code
units would not be enough for a character. A 32-bit code unit would
only be a code point.
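
The distinction between code units, code points and characters can be made concrete with a short sketch (code_units is a name invented here; the little-endian codecs are used only to avoid counting a byte-order mark).

```python
def code_units(s: str) -> dict:
    """Count code points and the code units needed in each encoding."""
    return {
        "code points": len(s),
        "UTF-8 units": len(s.encode("utf-8")),
        "UTF-16 units": len(s.encode("utf-16-le")) // 2,
        "UTF-32 units": len(s.encode("utf-32-le")) // 4,
    }

# "o", precomposed o-diaeresis, "o" + combining diaeresis, an emoji outside the BMP
for sample in ("o", "\u00f6", "o\u0308", "\U0001F600"):
    print(ascii(sample), code_units(sample))
```

The third sample is a single character for the reader but takes two code units even in UTF-32; the fourth takes two code units in UTF-16.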

At that point, all encodings are variable-width, so why not just use
UTF-8? And that's what everyone who had not introduced a new platform
between 1991 and 1996 did. E.g., that's what we see in Unix (from
around 1970) and in Rust (started 2006, first release 2015).

Except Python3. I am not familiar with Python, but from the
discussions I have read my impression is: Python2 (released 2000)
supported strings of bytes, and people put UTF-8 in there and worked
with that. Python3 (released 2008) was supposed to be a cleanup and
instead of refining the code-unit-based approach of Python2 they
introduced a code-point-based approach, which supported fast indexing
of code points, a worthless feature. And they found out how hard it
is to migrate a code base.

So whatever the reason for the code point mistake in Python3 was, that
mistake was made long after Unicode 2.0 was introduced in 1996 and the
success of UTF-8 made it clear that variable-width encodings work out
fine.

For comparison: The 1994 Forth standard was designed to support 16-bit
characters, and one implementation, JaxForth, actually demonstrated
that. Most Forth implementations kept 8-bit characters for the time
being, many assuming that they would have to do something like mixed
APIs at some point. But when we actually thought and worked on the
issue in 2004/2005, we were delighted to discover that UTF-8 works
very well in the existing code base (of our Forth system and others)
and there are only a few places that need changes; the additional
words proposed in <http://www.euroforth.org/ef05/ertl-paysan05.pdf>
have mostly been standardized in Forth-2012, but are actually rarely
used, because ordinary string words don't care whether a string is
ASCII or UTF-8. Anyway, this demonstrates that by 2005 it was clear
that variable-width encodings are very workable, so the Python3
mistake cannot be explained with its 2008 release date.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From John Levine@21:1/5 to All on Sun May 12 18:48:10 2024

According to Anton Ertl <[email protected]>:

It turns out that UCS-2 was not enough, and these have all been
suffering from mixed APIs ever since.

That's certainly true for Java (first release 1995), Windows NT (first
released 1993) and Qt (first released 1995).

Don't forget Javascript, which means every browser is full of UCS-2
and/or UTF-16.

Except Python3. I am not familiar with Python, but from the
discussions I have read my impression is: Python2 (released 2000)
supported strings of bytes, and people put UTF-8 in there and worked
with that. Python3 (released 2008) was supposed to be a cleanup and
instead of refining the code-unit-based approach of Python2 they
introduced a code-point-based approach, which supported fast indexing
of code points, a worthless feature. And they found out how hard it
is to migrate a code base.

It makes somewhat more sense than that.

Python2 had a string type which was a variable-length array of 8-bit
characters, and a Unicode type which was a variable-length array of
code points. You could use a string to hold either ASCII text or
arbitrary strings of bytes, depending on what operators and functions
you used. Python3 reorganized this so that there is only one string
type used for both ASCII and Unicode and a separate byte type for
arbitrary strings of data.
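
A short sketch of the Python3 split between the two types (the sample word is arbitrary):

```python
text = "K\u00f6nig"                 # str: a sequence of code points
data = text.encode("utf-8")        # bytes: a sequence of 8-bit values

print(len(text), len(data))        # the two types count different things
print(text[1], data[1])            # a one-code-point str vs. an int

try:
    text + data                    # mixing the two types is rejected
except TypeError as exc:
    print("TypeError:", exc)

print(data.decode("utf-8") == text)
```

Conversion between the two is always explicit, via encode and decode.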

I can say from experience that the python3 approach is less confusing,
and that in contexts where you know the strings are ASCII, e.g., mail
or http message headers, subscripting makes sense, even though it
mostly doesn't for sequences of Unicode code points. Even with code
points it can make some sense, e.g., if you know you have text in an
alphabetic language, you can find the code points that are white space
to do stuff with words.

In recent years python has been adding type declarations so you can
say that a particular variable or function parameter has to be of a
particular type or one of a union of types, e.g. it can be an int or a
float or a Decimal object. They haven't yet created subtypes to limit
the range of a type but I expect they will. That would let you say a
Percent is an integer between 0 and 100 or an ASCII is a string
with all the code points <= 0x7f. You could write more robust code
that doesn't accidentally try to subscript into random non-ASCII code
points.
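
Until such range-limited subtypes exist, something close can be approximated today. The sketch below is an assumption-laden illustration, not an existing Python feature: AsciiStr, as_ascii and header_name are invented names, typing.NewType is only enforced by static type checkers, and the actual check happens at runtime in as_ascii (str.isascii needs Python 3.7 or later).

```python
from typing import NewType

AsciiStr = NewType("AsciiStr", str)   # hypothetical name; only type checkers see it

def as_ascii(s: str) -> AsciiStr:
    """Validate at runtime, then hand back the narrowed type."""
    if not s.isascii():
        raise ValueError(f"non-ASCII code point in {s!r}")
    return AsciiStr(s)

def header_name(line: AsciiStr) -> str:
    """Subscripting is unproblematic here: the string is known to be ASCII."""
    return line[:line.index(":")]

print(header_name(as_ascii("Content-Type: text/plain")))
```

A type checker would then flag a call such as header_name("plain str"), while as_ascii rejects non-ASCII input when the program runs.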
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Stefan Monnier@21:1/5 to All on Tue May 14 12:24:31 2024

Assume you're implementing a language which has a function of setting
an individual character in a string.

That's a design mistake in the language, and I know no language that
has this misfeature.

I suspect "individual character" meant "code point" above.
Does Unicode even have the notion of "character", really?

Instead, what we see is one language (Python3) that has an even worse
misfeature: You can set an individual code point in a string; see
above for the things you get when you overwrite code points.

I think it's fairly common for languages that started with strings
as "arrays of 8bit chars".

Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
It's really hard to get rid of it, even though it's used *very* rarely.
In ELisp, strings are represented internally as utf-8 (tho it pretends
to be an array of code points), so an assignment that replaces a single
char can require reallocating the array!
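
To show why a code-point assignment on a UTF-8 buffer can change the buffer's size, here is a sketch using a Python bytearray as a stand-in. It is not how Emacs implements it; byte_offset and set_code_point are invented helpers, and the buffer is assumed to hold valid UTF-8.

```python
def byte_offset(buf: bytearray, index: int) -> int:
    """Walk the buffer to find where code point number `index` starts."""
    seen = -1
    for offset, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:        # not a continuation byte: a code point starts here
            seen += 1
            if seen == index:
                return offset
    raise IndexError(index)

def set_code_point(buf: bytearray, index: int, cp: str) -> None:
    """Overwrite one code point in place; the buffer may grow or shrink."""
    start = byte_offset(buf, index)
    end = start + 1
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1                       # skip the continuation bytes of the old code point
    buf[start:end] = cp.encode("utf-8")

border = bytearray("+--+".encode("utf-8"))
set_code_point(border, 0, "\u250c")    # BOX DRAWINGS LIGHT DOWN AND RIGHT
print(len(border), border.decode("utf-8"))
```

Both costs mentioned in the thread show up: finding the position requires a scan, and the slice assignment has to move the tail when the old and new encodings differ in length.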

But why would one want to set individual code points?

Because you know your string only contains "characters" made of a single
code point?

E.g. your string contains the representation of the border of a table
(to be displayed in a tty), and you want to "move" the `+` of a column
separator (or a prettier version that takes advantage of the wider
choice offered by Unicode).

Stefan

From MitchAlsup1@21:1/5 to Anton Ertl on Tue May 14 17:43:43 2024

Anton Ertl wrote:

Thomas Koenig <[email protected]> writes:

E.g., consider the following Gforth code (others can tell you how to
do it in Python):

"Ko\u0308nig" cr type

The output is:

König

That is, the second character consists of two Unicode code points, the
"o" and the "\u0308" (Combining Diaeresis).

(I think that somewhere along the way from the Forth system to the
xterm through copying and pasting into Emacs the second character has
become precomposed, but that's probably just as well, so you can see
what I see).

If I replace the third code point with an e, I get "Koenig". So by
overwriting one code point, I insert a character into the string.

If instead I replace the second code point with a "\u0316" (Combining
Grave Accent Below):

"K\u0316\u0308nig" cr type

I get this (which looks as expected in my xterm, but not in Emacs)

K̖̈nig

The first character is now a K with a diaeresis above and a grave
accent below, and there are now a total of 4 characters, but still 6
code points in the string; the second character has been deleted by
this code-point replacement.

It seems to me (in my vast ignorance) that names for things should be
written in the most appropriate set of characters in the language of
the person/thing being named.

Then when such a name is "sent out to be displayed" that it is a property
of the display what character set(s) it can properly emit, and thereby
alter the string of characters as appropriate to its capabilities.

For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
When displayed on an ASCII-only line printer it would be written Koenig
When displayed on an enhanced ASCII printer it would be written König
When displayed on a fully functional printer it would be written K̖̈nig

The problem is the mapping function between how it should be encoded
in its own native language to what can be expressed on a particular
device.

Only the display device needs to understand this mapping and NOT the
program/software/device holding the string.
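
One way such a device-side mapping could look, as a loose sketch: the stored string stays as it is, and a render step composes it (NFC) and substitutes whatever the target character set cannot express. The render function and the tiny FALLBACKS table are invented for this example and cover only a few German letters.

```python
import unicodedata

# Hypothetical fallback table for devices that cannot show these letters.
FALLBACKS = {"\u00f6": "oe", "\u00e4": "ae", "\u00fc": "ue", "\u00df": "ss"}

def render(text: str, charset: str) -> str:
    """Return what a device limited to `charset` would print for `text`."""
    composed = unicodedata.normalize("NFC", text)
    out = []
    for ch in composed:
        try:
            ch.encode(charset)             # can the device represent it?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(FALLBACKS.get(ch, "?"))
    return "".join(out)

name = "Ko\u0308nig"                       # stored once, with a combining mark
for charset in ("ascii", "latin-1", "utf-8"):
    print(charset, render(name, charset))
```

A real mapping would need language-specific rules; the table here only demonstrates the division of labour between the stored string and the display.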

I think people in Japan should be able to use printf by using プリントフ.
There is way too much "English" in the way computers are being used.
It is similar to Anthropomorphizing animal behavior.

From David Brown@21:1/5 to All on Tue May 14 20:35:37 2024

On 14/05/2024 19:43, MitchAlsup1 wrote:

I think people in Japan should be able to use printf by using プリントフ.
There is way too much "English" in the way computers are being used.

I disagree entirely here.

For many things, international consistency is more important than
picking local-sounding names for things that have no localised meaning.
Having a Japanese name and spelling for "printf" doesn't give Japanese
programmers any useful information, it is not easier to type or read,
and simply ensures that they can't cooperate and collaborate with
programmers using different languages. MS Office uses local languages
for its macros and formulas in Excel - I've never heard anyone in Norway
say they like it, and many who say it is a PITA that makes it hard to
work with and hard to search for information. Most people IME who
use macros a lot prefer to stick to English.

It works the other way too. When discussing Karate or Judo, most
practitioners the world over know what a "mawashi geri" or an "o soto
gari" is - most consistently use the Japanese terms regardless of
native languages. Most, that is, except Americans and some other English
speakers who feel they have to use English language terms, losing a lot
of the subtlety and nuances of the terms and being different from their
international peers.

And when people try to force localisation of terms that have no local
words, the result is just to encourage people to move everything over to
a single language (English).

It is similar to Anthropomorphizing animal behavior.

No, it is not.

From Thomas Koenig@21:1/5 to [email protected] on Tue May 14 20:47:12 2024

MitchAlsup1 <[email protected]> schrieb:

I think people in Japan should be able to use printf by using プリントフ

I have to put up with a minor version of that - Microsoft decided to
localize folder names ("Program files" is displayed as "Programme"
if you use German settings, except when you access it via the
command line), and all Excel functions are localized; depending
if you use English or German versions, arguments are separated
via comma or semicolon. Of course, the other way is a syntax error.

Saving things in native Excel format is OK, but generating a CSV
file from a program will either work or not, depending on locale
("," vs ";" and "." vs ",").

This is about as annoying as it gets...
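
A program that has to feed both kinds of installation ends up parameterizing the two separators. The sketch below uses Python's csv module; write_csv is an invented helper, and which delimiter a given spreadsheet installation actually expects is an assumption the caller has to supply.

```python
import csv
import io

def write_csv(rows, delimiter: str, decimal: str) -> str:
    """Write rows with the given field delimiter and decimal separator."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow(
            [str(v).replace(".", decimal) if isinstance(v, float) else v for v in row]
        )
    return out.getvalue()

rows = [["width", 1.5], ["height", 2.25]]
print(write_csv(rows, ",", "."))   # comma-separated, decimal point
print(write_csv(rows, ";", ","))   # semicolon-separated, decimal comma
```

Neither output can be read correctly under the other convention, which is the annoyance described above.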

From Anton Ertl@21:1/5 to Stefan Monnier on Sat May 18 05:29:20 2024

Stefan Monnier <[email protected]> writes:

[Anton Ertl:]

[Thomas Koenig:]

Assume you're implementing a language which has a function of setting
an individual character in a string.

That's a design mistake in the language, and I know no language that
has this misfeature.

I suspect "individual character" meant "code point" above.

I meant character, not code point, as should have become clear from
the following. I think that Thomas Koenig meant "character", too, but
he may have been unaware of the difference between "character" and
"Unicode code point".

Does Unicode even have the notion of "character", really?

AFAIK it does not. But applications like palindrome checkers care
about characters, not code points.

OTOH, most code can be implemented fine as working on strings, without
knowing how many characters there are in the string (and it then does
not need to know about code points, either). In other words, it can
be implemented just as well when the strings are represented as
strings of code units (whether UTF-8 (bytes), UTF-16 (16-bit code
units) or UTF-32 (32-bit code units)), and then it does not help to
convert UTF-8 to something else on input and something else to UTF-8
on output.

For the code that cares about characters, if it wants to work
correctly for characters that cannot be precomposed into a single code
point, it has to deal with characters that consist of multiple code
points, i.e., that even in UTF-32 are variable-width. So given that
you have to bite the variable-width bullet anyway, you can just as
well use UTF-8.
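
A palindrome checker makes the point concrete. The sketch below groups each base code point with the combining marks that follow it, which is only an approximation of user-perceived characters; clusters and is_palindrome are names invented for this example.

```python
import unicodedata

def clusters(s: str) -> list:
    """Group each base code point with the combining marks that follow it."""
    out = []
    for cp in s:
        if out and unicodedata.combining(cp):
            out[-1] += cp              # attach the mark to the previous cluster
        else:
            out.append(cp)
    return out

def is_palindrome(s: str) -> bool:
    parts = clusters(s)
    return parts == parts[::-1]

word = "no\u0308n"                     # n, o + COMBINING DIAERESIS, n
print(word[::-1] == word)              # code-point reversal
print(is_palindrome(word))             # cluster-based check
```

The clusters have variable length whether the underlying storage is a Python str, UTF-32 or UTF-8, so fixed-width code points do not make the check any simpler.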

Instead, what we see is one language (Python3) that has an even worse
misfeature: You can set an individual code point in a string; see
above for the things you get when you overwrite code points.

I think it's fairly common for languages that started with strings
as "arrays of 8bit chars".

Apart from Python3, not in the languages that I have looked at more
closely with respect to this feature.

In particular, C was created by adding a byte type to B, and that type
was called "char". It was allowed to be wider to cater for
word-addressed machines, but on byte-addressed machines "char" is
invariably a byte. To cater to Unicode, they used a two-pronged
approach: they added wchar_t and multi-byte functions (IIRC both
already in C89); wchar_t was obviously introduced to cater for the
upcoming Unicode 1.0 (which satisfied code unit=code point=character),
while the multibyte stuff was probably introduced originally for
dealing with the ASCII-compatible East-Asian encodings.

When UTF-8 arrived, the multi-byte functions proved to fit that well;
but of course there is not much usage of those functions, because most
code works fine without knowing about individual code points or
characters. And UTF-8 turned out to be the answer to dealing with
Unicode that the Unix programmers who had a lot of code working with
strings of chars (i.e., bytes) were looking for.

Then Unicode 2.0 arrived and the Win32 API (which had embraced wchar_t
and defined it as being 16-bit) stuck with 16-bit wchar_t, which
breaks "code unit=code point"; this may not be in line with the
intentions of the inventors of wchar_t (e.g., there are no
multi-wchar_t functions in the C standard last time I looked), but
that has been the existing practice in wchar_t use in C for more than
a quarter-century.

Unix, where wchar_t was (and still is) little used, switched to 32-bit
wchar_t, but

1) given that Unicode at some point (probably already in 2.0) broke
"code point=character", that does not really help software like
palindrome checkers.

2) wchar_t is little-used in Unix-specific code.

3) Code that wants to be portable between Unix and Windows and uses
wchar_t cannot rely on "code unit=code point" anyway.

So, in practice, C code does not make use of the ability to set an
individual code point by overwriting a fixed-size code unit.

Forth has chars that are 8 bits wide in traditional Forth systems on
byte-addressed machines. The 1994 standard (written in the middle of
the reign of Unicode 1.0, and with lots of Californians on the
standardization committee) provided the option to implement Forth
systems with chars that take a fixed number >1 of bytes, and one
system (JaxForth by Jack Woehr for Windows NT) implemented 16-bit
chars.

However, JaxForth was not very popular, and most code assumed that 1
chars = 1 (i.e., 8 bits on a byte-addressed machine), and given that
there was no widely available system that deviated from that, even
code that wanted to avoid this assumption could not be tested. And
given that most code has this assumption and would not work on systems
with 1 chars > 1, all the other systems stuck with 1 chars = 1. A
chicken-and-egg problem? Not really:

When we looked at the problem in 2004, we found that most code works
fine with UTF-8; that's because most code does not care about
characters. Even code that uses words like C@ (load a char from
memory) typically does it in a way that works with UTF-8. We proposed
a number of words for dealing with variable-width xchars (what C calls
multi-byte characters), and you can theoretically use them with the
pre-Unicode East-Asian encodings as well as with UTF-8. These words
were standardized in Forth-2012, but they are actually little-used
(including by me), because most code actually works fine with opaque
strings.

In Gforth, an xchar is a code point, not a character, so these words
are currently less useful for writing palindrome checkers than one
might hope. Maybe at some point we will look at the problem again,
and provide words for dealing with characters, Unicode normalization,
collating order and such things, but for now the pain is not big
enough to tackle that problem.

Finally, I proposed to standardize the common practice 1 chars = 1;
this proposal was accepted for standardization in 2016.

Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
It's really hard to get rid of it, even though it's used *very* rarely.
In ELisp, strings are represented internally as utf-8 (tho it pretends
to be an array of code points), so an assignment that replaces a single
char can require reallocating the array!

One way forward might be to also provide a string-oriented API with
byte (code unit) indices, and recommend that people use that instead
of the inefficient code-point-indexed API. For a high-level language
like Elisp or Python, the internal representation can depend on which
function was last used on the string. So if code uses only the
string-oriented API, you may be able to avoid the costs of the
code-point API completely.

But why would one want to set individual code points?

Because you know your string only contains "characters" made of a single
code point?

This incorrect "knowledge" may be the reason why Emacs 27.1 displays

K̖̈nig

as if the first three-code-point character actually was three characters.

E.g. your string contains the representation of the border of a table
(to be displayed in a tty), and you want to "move" the `+` of a column
separator (or a prettier version that takes advantage of the wider
choice offered by Unicode).

These kinds of things involve additional complications. Not only do
you have to know the difference between code points and characters,
you also have to know the visual width of a character, which is 0-2
columns in the fixed-width fonts used in xterm or the like. Actually,
if you treat a combining mark as having width 0, you may be able to
work with code points and do not need characters.
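
The width rule above can be sketched in Python. This is only an
approximation under stated assumptions: the helper name display_width is
made up for illustration, Python's unicodedata tables stand in for
whatever width tables a given terminal uses, and corner cases (control
characters, emoji sequences, ambiguous-width characters) are ignored.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed for text."""
    width = 0
    for cp in text:
        if unicodedata.category(cp) in ("Mn", "Me"):
            continue  # nonspacing/enclosing mark: width 0, drawn on the previous cell
        if unicodedata.east_asian_width(cp) in ("W", "F"):
            width += 2  # wide or fullwidth code point: two cells
        else:
            width += 1  # everything else is assumed to take one cell
    return width

print(display_width("K\u0316\u0308nig"))  # K plus two combining marks, then "nig"
print(display_width("Ko\u0308nig"))       # decomposed o-umlaut
print(display_width("\u6f22\u5b57"))      # two CJK ideographs
```

Note that the loop never needs to know where one character ends and the
next begins; it only sums per-code-point widths.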

Why do you want to move the column separator and what do you want to
overwrite with it? This is likely the result of another operation,
and maybe that involves another string replacement; and displaying the
result involves so much overhead that using a string replacement
instead of a fixed-width store is probably not the dominant cost. And
if the replacement string happens to have as many bytes as the
replaced string (which would happen for, e.g., replacing " " with
"+"), the operation is not so expensive anyway.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Thomas Koenig@21:1/5 to Anton Ertl on Sat May 18 14:09:31 2024

Anton Ertl <[email protected]> schrieb:

[email protected] (MitchAlsup1) writes:

It seems to me (in my vast ignorance) that names for things should be
written in the most appropriate set of characters in the language of
the person/thing being named.

Then when such a name is "sent out to be displayed" that it is a property
of the display what character set(s) it can properly emit, and thereby
alter the string of characters as appropriate to its capabilities.

For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
When displayed on an ASCII-only line printer it would be written Koenig
When displayed on an enhanced ASCII printer it would be written König
When displayed on a fully functional printer it would be written K̖̈nig

Why do you think that K̖̈nig should be written as Koenig or König?

On my display, this reads K, n with a diacritic and something close to
a cedilla under the n.

However, for König

Again, the diaeresis is over the n, not the o.

Unicode specifies that the precomposed form is
König. And if you want a transcription into ASCII with the knowledge
that it's German, the result would be Koenig.

This is actually sometimes a (fairly minor) problem because the
name on my passport actually reads "König" (o-diacritic), but
people without knowledge of German tend to transcribe this as
"Konig", whereas I transcribe it as "Koenig" on official forms
such as the one I need to fill out prior to entering the US.

This is why modern EU passports have a canonical form of the
name, which then is "KOENIG".


From Terje Mathisen@21:1/5 to Thomas Koenig on Sat May 18 16:25:54 2024

Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

[email protected] (MitchAlsup1) writes:

It seems to me (in my vast ignorance) that names for things should be
written in the most appropriate set of characters in the language of
the person/thing being named.

Then when such a name is "sent out to be displayed" that it is a property
of the display what character set(s) it can properly emit, and thereby
alter the string of characters as appropriate to its capabilities.

For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
When displayed on an ASCII-only line printer it would be written Koenig
When displayed on an enhanced ASCII printer it would be written König
When displayed on a fully functional printer it would be written K̖̈nig

Why do you think that K̖̈nig should be written as Koenig or König?

On my display, this reads K, n with a diacritic and something close to
a cedilla under the n.

However, for König

Again, the diaeresis is over the n, not the o.

Unicode specifies that the precomposed form is
König. And if you want a transcription into ASCII with the knowledge
that it's German, the result would be Koenig.

This is actually sometimes a (fairly minor) problem because the
name on my passport actually reads "König" (o-diacritic), but
people without knowledge of German tend to transcribe this as
"Konig", whereas I transcribe it as "Koenig" on official forms
such as the one I need to fill out prior to entering the US.

This is why modern EU passports have a canonical form of the
name, which then is "KOENIG".

Same problem as my wife and kids, who have Norløff either as a part of
their surname or (my wife) as-is.

Canonical simplification of the 'ø' character is either 'o' or 'oe', and
passports and airline tickets differ, something which can cause all
sorts of issues with US passport control.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Thomas Koenig@21:1/5 to Terje Mathisen on Sat May 18 14:41:04 2024

Terje Mathisen <[email protected]> schrieb:

Canonical simplification of the 'ø' character is either 'o' or 'oe', and
passports and airline tickets differ, something which can cause all
sorts of issues with US passport control.

Reminds me of either "Asterix and the Great Crossing" or "Asterix
and the Normans", where Viking speech was indicated by having
slashes through letters (like ø). When Obelix tries to speak
their language, he also applies slashes, but does so randomly
(like through a c) so nobody can understand him.

Hmm... a challenge, can this be represented as Unicode codepoints?
I would not be surprised if some Asterix fan had snuck it in while
nobody was looking.

(For those who don't know Asterix: It is a comic that was/is wildly
popular in France and Germany at least, about Gauls who keep on
resisting Roman occupation in the times of Julius Caesar, aided
by a magic potion which gives them superhuman strength.)


From Anton Ertl@21:1/5 to Thomas Koenig on Sat May 18 15:43:05 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

Why do you think that K̖̈nig should be written as Koenig or König?

On my display, this reads K, n with a diacritic and something close to
a cedilla under the n.

That displays correctly then. The "close to a cedilla" is a grave
accent below.

However, for König

Again, the diaeresis is over the n, not the o.

That's strange: in the first case your display system composes the
diaeresis correctly with the preceding glyph (at that point, a K with
grave accent below), but in the o case, it incorrectly composes it
with the next glyph.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Thomas Koenig on Sat May 18 15:48:35 2024

Thomas Koenig <[email protected]> writes:

Terje Mathisen <[email protected]> schrieb:

Canonical simplification of the 'ø' character is either 'o' or 'oe', and
passports and airline tickets differ, something which can cause all
sorts of issues with US passport control.

Reminds me of either "Asterix and the Great Crossing" or "Asterix
and the Normans", where Viking speech was indicated by having
slashes through letters (like ø). When Obelix tries to speak
their language, he also applies slashes, but does so randomly
(like through a c) so nobody can understand him.

Hmm... a challenge, can this be represented as Unicode codepoints?

Sure. See <https://en.wikipedia.org/wiki/Bar_(diacritic)>.
Interestingly, the Obelix character ȼ you mention above has its own
precomposed code point U+023C (Latin Small Letter C with Stroke) and
its own Wikipedia page: https://en.wikipedia.org/wiki/%C8%BB, but you
can also compose it from c and the combining short solidus overlay: c̷
(this does not display correctly on Emacs 27.1, but composes correctly
on an xterm). There is no precomposed d with a slash (U+0111, Latin
Small Letter D with Stroke, has a horizontal bar instead), but you can
compose one in the same way: d̷.
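
For anyone who wants to check the two spellings, here is a small Python
sketch. It only relies on the Unicode data shipped with Python's
unicodedata module; the variable names are made up. The point it
illustrates is that the precomposed letter and the c-plus-overlay
sequence are separate spellings that NFC normalization does not unify,
so code comparing strings has to expect both.

```python
import unicodedata

precomposed = "\u023c"  # the single code point mentioned above
composed = "c\u0337"    # c followed by the combining overlay

print(unicodedata.name(precomposed))
print(unicodedata.name(composed[1]))

# U+023C carries no canonical decomposition, so NFC leaves the
# two-code-point spelling alone instead of folding it into U+023C.
print(repr(unicodedata.decomposition(precomposed)))
print(unicodedata.normalize("NFC", composed) == precomposed)
```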

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From John Levine@21:1/5 to All on Sat May 18 17:09:44 2024

According to Thomas Koenig <[email protected]>:

Considering the huge market for palindrome checkers, that is a
real concern, especially if they involve characters for which
UTF-32 is not sufficient, such as smileys.

Is there any language whose characters cannot be represented in
UTF-32?

Chinese. There is a huge backlog of obscure but real Chinese characters
that do not have a Unicode code point. This ISO committee is slowly
working through them. Every couple of years they approve a batch of
several thousand of them.

https://en.wikipedia.org/wiki/Ideographic_Research_Group

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Stephen Fuld@21:1/5 to Anton Ertl on Sat May 18 17:11:32 2024

Anton Ertl wrote:

snip

A similar concept was implemented in COBOL, where the designers thought
that having to write

ADD A TO B GIVING C

or somesuch makes programming easier than writing

C = A+B

in FORTRAN.

I would put a slightly different spin on it. I believe that the
original COBOL was designed not so much to make programming easier, but
to make *learning* programming (for non-programmers) easier, and
because it was supposedly "self documenting", easier for managers, etc.
to see how the program worked. Remember, when COBOL was developed
(late 1950s), there weren't many programmers in existence, and it was
felt that the "mathematical" syntax of Fortran would be too unfamiliar
to the business people who developed the new programs to solve
business problems, and who were generally not mathematicians.

Of course, they were wrong about "self documenting", and as more people
became programmers, the advantages of concise syntax made a big
difference.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From John Savard@21:1/5 to All on Sun May 19 15:32:49 2024

On Tue, 14 May 2024 17:43:43 +0000, [email protected] (MitchAlsup1)
wrote:

I think people in Japan should be able to use printf by using プリントフ
There is way too much "English" in the way computers are being used.
It is similar to anthropomorphizing animal behavior.

One could quibble.

If Japanese people needed to enter kana from their keyboards to write
programs, that would be awkward; there is not yet a good way to enter
that kind of text from a keyboard.

However, I think your point is valid. At least in some contexts.

Remember back in the early 8-bit days of computing, and before them,
when schools were exposing children to PDP-8 computers?

Children were learning to program computers in BASIC.

Obviously, here, if children in other countries used modified versions
of BASIC that used keywords in their own natural language, it would be
much easier for them to get started with programming than if the
keywords were simply arbitrary strings of letters, taken from a
foreign language of which they may not necessarily have any knowledge.

If Algol was supposed to be an _international_ algorithmic language,
why weren't its keywords taken from Latin or Esperanto, instead of
English?

Historical note: Algol was originally called IAL; remember what JOVIAL
stood for.

But the objections about sharing code between countries, and the fact
that English is so widely known in technical circles, are also true.
It is a complicated issue, made worse by the fact that nationalism and
ethnocentrism are often bad things.

John Savard


From John Savard@21:1/5 to [email protected] on Sun May 19 15:36:45 2024

On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

and
because it was supposedly "self documenting", easier for managers, etc.
to see how the program worked.

Of course, if they designed COBOL that way, why did they include a
statement that let you re-direct GOTO statements from elsewhere in a
program?

I mean, that was just *asking* for dishonest programmers to direct the
odd pennies into their bank accounts and so on.

John Savard


From Anton Ertl@21:1/5 to John Savard on Mon May 20 11:46:20 2024

John Savard <[email protected]d> writes:

Remember back in the early 8-bit days of computing, and before them,
when schools were exposing children to PDP-8 computers?

Children were learning to program computers in BASIC.

Obviously, here, if children in other countries used modified versions
of BASIC that used keywords in their own natural language, it would be
much easier for them to get started with programming than if the
keywords were simply arbitrary strings of letters, taken from a
foreign language of which they may not necessarily have any knowledge.

Logo came in versions for different native languages, but looking at
<https://de.wikipedia.org/wiki/Logo_(Programmiersprache)>, it shows
English Logo examples before German Logo examples. I tried Logo on my
C64; I don't know whether it was in English or German, but in any case
I was not particularly impressed.

The C64 as well as many other home computers came with BASIC, and
BASIC was widely used, and before today I never heard or read any
suggestion to use native-language commands in BASIC.

I have seen some suggestions to provide native-language versions of
Forth, but they never went anywhere (if they were serious). The main
motivation here seems to have been that it's easy to do that in Forth,
so is there a nail to which we can apply this hammer? I attend
German-language Forth events where some of the participants are not
good enough at English to, e.g., read articles about Forth in English,
but none of them has Germanized his personal Forth system.

Scratch is also designed for children and supports native-language
switching, which eliminates one of the drawbacks of native-language
versions.

Like Logo, Scratch comes out of MIT, and I wonder if the idea that
programmers have problems with names that are not in their native
language is due to their American background.

If Algol was supposed to be an _international_ algorithmic language,
why weren't its keywords taken from Latin or Esperanto, instead of
English?

Algol 60 does not standardize a program representation in characters
(a grave mistake fixed by most later programming languages). It
also does not standardize reserved words (aka keywords); instead, it
has symbols that are typically written in bold in publications to
differentiate them from identifiers written in a normal typeface.

It is up to the compiler implementor how the programmer has to provide
these symbols; one way is to surround each such symbol with single
quotes (used in ICT 1900 Algol). A compiler implementor could instead
(or in addition) support native-language representations of these
symbols, but I am not aware that this has happened. After all, it's
an international language, not a national language; or maybe such
attempts were made and sunk without much notice, for the same reasons
we have been discussing all along.

Elliott 803 Algol uses the reserved-word approach, which means that
programs that use, e.g., "if" as an identifier don't work, but has the
advantage that you don't need to put that many single quotes in the
code. This is the approach that won in later programming languages,
but it makes it hard to introduce new reserved words in later versions
(they may conflict with existing programs).
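
To make the difference between the two approaches concrete, here is a
toy Python tokenizer. Everything in it is hypothetical (the keyword set,
the tokenize function, the token kinds); it does not reproduce the
lexical rules of ICT 1900 Algol or Elliott 803 Algol, it only shows why
a quoted symbol leaves the bare word free for use as an identifier while
a reserved word does not.

```python
import re

KEYWORDS = {"if", "then", "else"}  # illustrative subset only
TOKEN = re.compile(r"'(\w+)'|(\w+)|(\S)")

def tokenize(source: str, stropped: bool):
    """Classify words as keywords or names under one of the two conventions."""
    tokens = []
    for quoted, word, other in TOKEN.findall(source):
        if quoted:
            tokens.append(("keyword", quoted))  # 'if' in single quotes
        elif word:
            # Without stropping, any bare word in KEYWORDS is reserved.
            reserved = (not stropped) and word in KEYWORDS
            tokens.append(("keyword" if reserved else "name", word))
        else:
            tokens.append(("punct", other))
    return tokens

print(tokenize("'if' if 'then' x", stropped=True))
print(tokenize("if if then x", stropped=False))
```

In the first call the second "if" comes out as an ordinary name; in the
second call it cannot be anything but a keyword, which is the conflict
described above.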

As for why the Algol standard was written in English and used names
from English rather than from Latin, that's because Algol was designed
in 1960 when English was the lingua franca among scholars, not before
~1700 when Latin served that role. And Esperanto never reached that
status.

But concerning Latin, at the last EuroForth conference (near Rome)
Ulrich Hoffmann gave an amusing talk where he presented a Latinized
Forth complete with Roman numerals. Unfortunately, that talk is not
(yet?) online.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to John Savard on Mon May 20 18:27:57 2024

On Sun, 19 May 2024 15:32:49 -0600
John Savard <[email protected]d> wrote:

On Tue, 14 May 2024 17:43:43 +0000, [email protected] (MitchAlsup1)
wrote:

I think people in Japan should be able to use printf by using プリントフ
There is way too much "English" in the way computers are being used.
It is similar to anthropomorphizing animal behavior.

One could quibble.

If Japanese people needed to enter kana from their keyboards to write
programs, that would be awkward; there is not yet a good way to enter
that kind of text from a keyboard.

However, I think your point is valid. At least in some contexts.

Remember back in the early 8-bit days of computing, and before them,
when schools were exposing children to PDP-8 computers?

Children were learning to program computers in BASIC.

Obviously, here, if children in other countries used modified versions
of BASIC that used keywords in their own natural language, it would be
much easier for them to get started with programming than if the
keywords were simply arbitrary strings of letters, taken from a
foreign language of which they may not necessarily have any knowledge.

If Algol was supposed to be an _international_ algorithmic language,
why weren't its keywords taken from Latin or Esperanto, instead of
English?

Historical note: Algol was originally called IAL; remember what JOVIAL
stood for.

But the objections about sharing code between countries, and the fact
that English is so widely known in technical circles, are also true.
It is a complicated issue, made worse by the fact that nationalism and
ethnocentrism are often bad things.

John Savard

https://en.wikipedia.org/wiki/Non-English-based_programming_languages
Long list.


From Anton Ertl@21:1/5 to Michael S on Mon May 20 17:00:08 2024

Michael S <[email protected]> writes:

https://en.wikipedia.org/wiki/Non-English-based_programming_languages

Long list.

Compared to all programming languages? Not really. The HOPL data
base reports 8945 languages, and Landin already wrote "The next 700
programming languages" (probably based on the idea that there were 700
up to that point) in 1967.

The first part points out that while only a little over 1/3 of the
programming languages were designed in countries where the primary
language is English, the share of languages that use English-based
keywords is far larger. And that's especially true for languages that
achieved some popularity.

My guess is that if you take the proportion of lines of code written
in languages where the builtins and (if present) reserved words are
based on English, you would get a result that's very close to 100%.
Code where identifiers are all based on English may have a
significantly lower percentage, though.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to John Savard on Mon May 20 17:44:48 2024

John Savard wrote:

Historical note: Algol was originally called IAL; remember what JOVIAL
stood for.

Who was Joe ?? in Jovial


From Stephen Fuld@21:1/5 to All on Mon May 20 19:26:39 2024

MitchAlsup1 wrote:

John Savard wrote:

Historical note: Algol was originally called IAL; remember what
JOVIAL stood for.

Who was Joe ?? in Jovial

Just in case you weren't joking,

Jules' Own Version of the International Algebraic Language

Jules was Jules Schwartz

https://en.wikipedia.org/wiki/Jules_Schwartz

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Thomas Koenig@21:1/5 to Anton Ertl on Sat May 18 08:29:12 2024

Anton Ertl <[email protected]> schrieb:

Stefan Monnier <[email protected]> writes:

Does Unicode even have the notion of "character", really?

AFAIK it does not. But applications like palindrome checkers care
about characters, not code points.

Considering the huge market for palindrome checkers, that is a
real concern, especially if they involve characters for which
UTF-32 is not sufficient, such as smileys.

Is there any language whose characters cannot be represented in
UTF-32?


From Anton Ertl@21:1/5 to [email protected] on Sat May 18 08:40:40 2024

[email protected] (MitchAlsup1) writes:

It seems to me (in my vast ignorance) that names for things should be
written in the most appropriate set of characters in the language of
the person/thing being named.

Then when such a name is "sent out to be displayed" that it is a property
of the display what character set(s) it can properly emit, and thereby
alter the string of characters as appropriate to its capabilities.

For example:: Take > "K\u0316\u0308nig" cr type ==> K̖̈nig
When displayed on an ASCII-only line printer it would be written Koenig
When displayed on an enhanced ASCII printer it would be written König
When displayed on a fully functional printer it would be written K̖̈nig

Why do you think that K̖̈nig should be written as Koenig or König?

However, for König Unicode specifies that the precomposed form is
König. And if you want a transcription into ASCII with the knowledge
that it's German, the result would be Koenig.
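
As a concrete illustration of the two steps, a short Python sketch using
the standard unicodedata module follows. The NFC call is the real
normalization; the german_ascii helper and its mapping table are
assumptions made up for this example (a language-specific
transliteration is not something Unicode normalization provides).

```python
import unicodedata

decomposed = "Ko\u0308nig"  # o followed by a combining diaeresis
precomposed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(precomposed))  # code point counts before and after
print(precomposed == "K\u00f6nig")

# Hypothetical German-specific ASCII transliteration table.
GERMAN_ASCII = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
}

def german_ascii(name: str) -> str:
    """Normalize to NFC first so decomposed umlauts are mapped as well."""
    nfc = unicodedata.normalize("NFC", name)
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in nfc)

print(german_ascii(decomposed))

# K with U+0316 and U+0308 has no precomposed form, so NFC leaves it alone.
odd = "K\u0316\u0308nig"
print(unicodedata.normalize("NFC", odd) == odd)
```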

Only the display device needs to understand this mapping and NOT the
program/software/device holding the string.

Yes, that's why treating string data as opaque works for most of the
code.

I think people in Japan should be able to use printf by using プリントフ
There is way too much "English" in the way computers are being used.

I don't know how Japanese feel about that, but I certainly don't want
to have to use some Germanized form of C or Forth. This kind of
catering for different natural-language programmers has been tried and
has not taken over the world. I guess that's because

1) You need to learn a lot about what "printf" means and how it is
used; remembering the name is only a minor aspect.

2) Having the same name all over the world allows you to read programs
from all over the world, use reference material from all over the
world, etc.

A similar concept was implemented in COBOL, where the designers thought
that having to write

ADD A TO B GIVING C

or somesuch makes programming easier than writing

C = A+B

in FORTRAN. That approach has not found many followers, either.
Interestingly, the BCPL (and later B and C) syntax, which, e.g.,
replaced 'or' with || or |, and was otherwise more symbolic and less
natural-language-oriented than its ancestor Algol 60, was the most
successful syntax style among the Algol descendants, including
spreading to languages like Java that are closer to Algol 60 or Pascal
in other respects.

I have seen programmers define their own names based on their native
language, however. But if they use names in their own language, these
names should not depend on the environment.

In the macro language of a game I play, you can refer to things
through their name or through their numeric id. Unfortunately, the
names are localized, so the only way to write portable macros is by
using the unmnemonic numeric ids:-(.

What is more common than localized programming languages is producing
error messages in localized languages. I find this annoying, too,
because it makes it harder to find out how others have solved the same
problem.

And, e.g., ENOTSUP in Unix, has such a specific meaning that the
localized text does not help the person unfamiliar with Unix, while it
makes life harder for people who know Unix enough to make sense of the
message; i.e., even though my native language is German, I find
"Operation not supported" easier to understand than "Operation wird
nicht unterstützt"; in the latter case I first have to guess what the
English error message would have been and then I can start analysing
the problem.
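
One way a program can keep both audiences happy is to print the symbolic
name next to the message. A hedged Python sketch (the describe helper is
made up; errno.errorcode and os.strerror are the standard library calls):
the symbol is the same everywhere, while the message text is whatever the
C library returns for the process's current locale settings.

```python
import errno
import os

def describe(code: int) -> str:
    """Pair the locale-independent symbolic name with the C library's message."""
    symbol = errno.errorcode.get(code, "E?")
    return f"{symbol}: {os.strerror(code)}"

# On some systems ENOTSUP and EOPNOTSUPP share one number, so either
# symbol may be printed here.
print(describe(errno.ENOTSUP))
```

Searching the web for the symbol then works regardless of the language
the message was printed in.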

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Thomas Koenig on Sat May 18 10:14:44 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

Stefan Monnier <[email protected]> writes:

Does Unicode even have the notion of "character", really?

AFAIK it does not. But applications like palindrome checkers care
about characters, not code points.

Considering the huge market for palindrome checkers, that is a
real concern, especially if they involve characters for which
UTF-32 is not sufficient, such as smileys.

Is there any language whose characters cannot be represented in
UTF-32?

The goal of Unicode is to support all writing systems; AFAIK they are
not yet finished, but they expect that these writing systems will all
fit into the space provided by UTF-16 (i.e., a little over one million
code points), but they found it necessary to introduce the concept of
composing glyphs from multiple code points.

So if your question is: "Is there any language where one character
cannot be represented by a single Unicode code point?" The answer is
that the Unicode designers certainly expect that there are such
writing systems.

And looking at <https://en.wikipedia.org/wiki/Telugu_script> (just an
example), I see that the table of Unicode code points for Telugu
<https://en.wikipedia.org/wiki/Telugu_script#Unicode> is much smaller
than the tables of glyphs in
<https://en.wikipedia.org/wiki/Telugu_script#Articulation_of_consonants>
and
<https://en.wikipedia.org/wiki/Telugu_script#Consonants_with_vowel_diacritics>,
so the Telugu script seems to be one writing system that cannot be
represented with only precomposed characters.

I don't know if palindromes are a thing in Telugu, though.

But, as your reference to the size of the market for palindrome
checkers indicates, there is actually little code where dealing with
individual characters is relevant. For code where individual
characters are not relevant and opaque strings are sufficient, there
is no reason to use UTF-32. And for code where individual characters
are relevant, code points are not sufficient in general, so there is
no reason to use UTF-32 for that, either.

Interestingly, Emacs 27.1 manages to deal with "తెలుగు లిపి" (which
contains 6 characters composed of a total of 11 code points) just
fine, while it fails on König (with a decomposed Umlaut-o).
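
The counts above can be reproduced with a few lines of Python. This is a
rough sketch, not a grapheme-cluster implementation: the made-up helper
count_characters simply skips every code point whose general category is
a mark, which happens to work for this sample but ignores the real
segmentation rules (conjuncts, joiners, emoji sequences and so on).

```python
import unicodedata

def count_characters(text: str) -> int:
    """Count code points that are not marks (a crude stand-in for characters)."""
    return sum(1 for cp in text if not unicodedata.category(cp).startswith("M"))

# The Telugu sample from the text, written with escapes to keep it unambiguous.
telugu = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41 \u0c32\u0c3f\u0c2a\u0c3f"

print(len(telugu))                  # code points (UTF-32 code units)
print(count_characters(telugu))     # characters, counting the space
print(len(telugu.encode("utf-8")))  # UTF-8 code units (bytes)
print(count_characters("Ko\u0308nig"))
```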

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From John Savard@21:1/5 to [email protected] on Wed May 22 02:16:21 2024

On Mon, 20 May 2024 19:26:39 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Savard wrote:

Historical note: Algol was originally called IAL; remember what
JOVIAL stood for.

Who was Joe ?? in Jovial

Just in case you weren't joking,

Jules' Own Version of the International Algebraic Language

Jules was Jules Schwartz

https://en.wikipedia.org/wiki/Jules_Schwartz

Not to be confused with Julius Schwartz.

https://en.wikipedia.org/wiki/Julius_Schwartz

John Savard


From Stefan Monnier@21:1/5 to All on Wed May 22 15:38:51 2024

Assume you're implementing a language which has a function of setting
an individual character in a string.

That's a design mistake in the language, and I know no language that
has this misfeature.

I suspect "individual character" meant "code point" above.

I meant character, not code point, as should have become clear from
the following. I think that Thomas Koenig meant "character", too, but
he may have been unaware of the difference between "character" and
"Unicode code point".

I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings. 🙁

OTOH, most code can be implemented fine as working on strings, without
knowing how many characters there are in the string (and it then does
not need to know about code points, either).

Indeed, most operations on strings are conversion of things to strings,
concatenation of strings, search (typically for a substring or a
regexp), extraction of substring where the boundaries result from an
earlier search, and parsing (which at the bottom relies often on some
sort of regexp or equivalent system).

All of those work just fine on a UTF-8 sequence of bytes.
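
A small Python sketch of that claim, using bytes objects as the "UTF-8
sequence of bytes" (the sample data and variable names are made up).
Search, extraction with boundaries taken from the search, and
concatenation all use byte offsets, and nothing here counts code points
or characters. It relies on the property that a valid UTF-8 needle can
only match at code point boundaries of valid UTF-8 text.

```python
# Some UTF-8 text containing a precomposed umlaut and Telugu code points.
data = "Name: K\u00f6nig; Ort: \u0c24\u0c46\u0c32\u0c41\u0c17\u0c41".encode("utf-8")

# Search returns byte offsets ...
start = data.find(b"Name: ") + len(b"Name: ")
end = data.find(b";", start)

# ... which are then used directly for substring extraction.
name = data[start:end]

# Concatenation is plain byte concatenation.
greeting = b"Hallo, " + name

print(start, end)
print(name.decode("utf-8"))
print(greeting.decode("utf-8"))
```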

Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
It's really hard to get rid of it, even though it's used *very* rarely.
In ELisp, strings are represented internally as utf-8 (tho it pretends
to be an array of code points), so an assignment that replaces a single
char can require reallocating the array!
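
To see why, here is a Python sketch with a bytearray standing in for a
UTF-8 string buffer. The replace_at helper is hypothetical and is not how
Emacs implements it; it just shows that overwriting one code point with
another keeps the buffer length only when both encode to the same number
of bytes, and otherwise shifts the tail (and may have to grow the
buffer).

```python
def replace_at(buf: bytearray, start: int, old: str, new: str) -> bool:
    """Replace old with new at byte offset start; True if no bytes had to move."""
    old_b, new_b = old.encode("utf-8"), new.encode("utf-8")
    end = start + len(old_b)
    if buf[start:end] != old_b:
        raise ValueError("old text not found at the given offset")
    buf[start:end] = new_b  # same length: plain overwrite; else the tail shifts
    return len(old_b) == len(new_b)

row = bytearray("| a | b |".encode("utf-8"))
print(replace_at(row, 4, "|", "+"), len(row), row.decode("utf-8"))
print(replace_at(row, 4, "+", "\u253c"), len(row), row.decode("utf-8"))
```

The first replacement swaps one ASCII byte for another; the second puts a
three-byte box-drawing character in its place, so the buffer grows.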

One way forward might be to also provide a string-oriented API with
byte (code unit) indices, and recommend that people use that instead
of the inefficient code-point-indexed API.

I think the long term solution for ELisp will be to declare strings as
basically immutable.

Because you know your string only contains "characters" made of a single
code point?

This incorrect "knowledge" may be the reason why Emacs 27.1 displays

K̖̈nig

as if the first three-code-point character actually was three characters.

No, the above seems like a problem in the redisplay code, and that code
is quite aware of combining characters and stuff. You're probably
seeing simply a missing rule to allow composition/shaping of your word.
(the composition/shaping library operates on whole strings at a time,
but Emacs tends to be quite conservative about the string-chunks it
sends to that library).

I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.

E.g. your string contains the representation of the border of a table
(to be displayed in a tty), and you want to "move" the `+` of a column
separator (or a prettier version that takes advantage of the wider
choice offered by Unicode).

These kinds of things involve additional complications.

Very much so, indeed. It usually breaks down in many different ways
because of the common-but-not-guaranteed assumptions.

Stefan


From Anton Ertl@21:1/5 to Stefan Monnier on Sat May 25 15:48:07 2024

Stefan Monnier <[email protected]> writes:
[Anton Ertl:]

I meant character, not code point, as should have become clear from
the following. I think that Thomas Koenig meant "character", too, but
he may have been unaware of the difference between "character" and
"Unicode code point".

I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings.

My experiments with Telugu suggest that Emacs understands the concept
of a character at least for the Telugu script (in contrast to
decomposed Umlauts). If I press a cursor key in Telugu text, Emacs
advances to the next character, not the next code point. However, if
I press DEL or BS, it deletes a code point.

Here's some text again for playing around with it:

తెలుగు లిపి

Anyway, the Emacs Lisp functions right-char (and, after testing, also
left-char, forward-char, and backward-char) support the notion of
character at least for some scripts. That may be the result of an
interaction with the redisplay code that you mention later, but in
that case it's that code that knows about characters in Unicode.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Stephen Fuld@21:1/5 to John Savard on Sun May 26 03:50:46 2024

John Savard wrote:

On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

and
because it was supposedly "self documenting", easier for managers,
etc. to see how the program worked.

Of course, if they designed COBOL that way, why did they include a
statement that let you re-direct GOTO statements from elsewhere in a
program?

That feature (Alter GOTO) was also in Fortran, as the, long since
deprecated, assigned GOTO statement. I believe they were there to
support some older computers that didn't have indexed jump/branch
instructions, so achieved the effect by modifying the branch
destination in the instruction itself. And yes, it was ugly and made
comprehension of the program, and also debugging it, much harder.

I mean, that was just asking for dishonest programmers to direct the
odd pennies into their bank accounts and so on.

Not really. You had to Alter the goto statement to some pre-existing
label, not just anywhere in the code.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Thomas Koenig@21:1/5 to Stephen Fuld on Sun May 26 08:33:50 2024

Stephen Fuld <[email protected]d> schrieb:

John Savard wrote:

On Sat, 18 May 2024 17:11:32 -0000 (UTC), "Stephen Fuld"
<[email protected]d> wrote:

and
because it was supposedly "self documenting", easier for managers,
etc. to see how the program worked.

Of course, if they designed COBOL that way, why did they include a
statement that let you re-direct GOTO statements from elsewhere in a
program?

That feature (Alter GOTO) was also in Fortran, as the, long since
deprecated, assigned GOTO statement.

Assigned is

ASSIGN 10 to N

GOTO N (10, 20, 30, 40)

10 CONTINUE

which I don't think is what John S. is describing.

What old FORTRAN compilers had was, for debugging, an AT statement,
which sucked control from the statement into a DEBUG section, without
visibility at the place where it came from. The proverbial COME FROM
statement, used as a debugging aid; in the DEBUG section, variables
could be printed _or changed_.

Rumor has it that the AD statement was regularly abused, so there
were a lot of programs which did not run correctly unless debugging
was enabled...


From Thomas Koenig@21:1/5 to Thomas Koenig on Sun May 26 10:16:27 2024

Thomas Koenig <[email protected]> schrieb:

Rumor has it that the AD statement was regularly abused,

s/AD/AT


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon May 27 01:09:44 2024

On Sun, 12 May 2024 05:40:45 GMT, Anton Ertl wrote:

This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to. ...

But if they had decided to just store the data as UTF-8 and use byte
indexes and lengths in their API, and adjusted the rest of their API accordingly, they could have avoided this complexity and
inefficiency ...

But UTF-8 is just a representation of code points, not characters. So I
don’t understand why one way leads to “unnecessary complexity” and the other way does not.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon May 27 01:11:22 2024

On Sun, 12 May 2024 16:12:26 GMT, Anton Ertl wrote:

Plus at some point (not sure when) they decided that characters have to
be composable ...

I think that was true right from the beginning. Else you would have had a combinatorial explosion of alphabetic characters with diacritic marks.


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon May 27 06:20:33 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Sun, 12 May 2024 05:40:45 GMT, Anton Ertl wrote:

This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to. ...

But if they had decided to just store the data as UTF-8 and use byte
indexes and lengths in their API, and adjusted the rest of their API
accordingly, they could have avoided this complexity and
inefficiency ...

But UTF-8 is just a representation of code points, not characters. So I
don’t understand why one way leads to “unnecessary complexity” and the
other way does not.

In UTF-32 a character is a sequence of code points. In UTF-8 it is a
sequence of code units. In either case, if you have to deal with
characters, you have to deal with sequences (and most of the code does
not have to deal with characters and even less code has to deal with
code points). So converting to UTF-32 buys you nothing and is
unnecessary complexity.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon May 27 06:25:28 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Sun, 12 May 2024 16:12:26 GMT, Anton Ertl wrote:

Plus at some point (not sure when) they decided that characters have to
be composable ...

I think that was true right from the beginning. Else you would have had a
combinatorial explosion of alphabetic characters with diacritic marks.

Unicode has precomposed variants of the Latin characters that are used
in normal text. It does not have a precomposed character for, e.g.,
K̖̈, but then such a character does not occur in normal text.

Unicode 1.0 with its expansion to 16-bit code units only makes sense
if the resulting code units are characters. If at that point they had
planned to have variable-width characters, they could have gone with
something like UTF-8 from the start and spared us a lot of pain.
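
The precomposed-versus-combining distinction mentioned above can be
checked with Python's standard unicodedata module. This is only an
illustrative sketch: the "ö" sample is chosen here, and the K cluster
is assumed to be K followed by U+0316 and U+0308.

```python
import unicodedata

# "ö" exists both precomposed (U+00F6) and decomposed ("o" + U+0308).
precomposed = "\u00f6"
decomposed = unicodedata.normalize("NFD", precomposed)

nfd_len = len(decomposed)                                # code points after NFD
nfc_len = len(unicodedata.normalize("NFC", decomposed))  # code points after NFC

# Assumed sample: K + COMBINING GRAVE ACCENT BELOW + COMBINING DIAERESIS.
# Unicode has no precomposed K with diaeresis, so NFC cannot shrink it.
k_cluster = "K\u0316\u0308"
k_nfc_len = len(unicodedata.normalize("NFC", k_cluster))

print(nfd_len, nfc_len, k_nfc_len)
```

Normalization collapses the umlaut to one code point but leaves the K
cluster as a sequence of three.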

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon May 27 07:34:48 2024

On Sat, 18 May 2024 05:29:20 GMT, Anton Ertl wrote:

Stefan Monnier <[email protected]> writes:

Does Unicode even have the notion of "character", really?

AFAIK it does not.

It uses terms like “grapheme” and “text element” for the concept, leaving
“character” without a fixed meaning.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon May 27 07:36:50 2024

On Mon, 27 May 2024 06:20:33 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sun, 12 May 2024 05:40:45 GMT, Anton Ertl wrote:

This is a nice demonstration of the unnecessary complexity that the
codepoint mistake leads to. ...

But if they had decided to just store the data as UTF-8 and use byte
indexes and lengths in their API, and adjusted the rest of their API
accordingly, they could have avoided this complexity and inefficiency
...

But UTF-8 is just a representation of code points, not characters. So I
don’t understand why one way leads to “unnecessary complexity” and the
other way does not.

In UTF-32 a character is a sequence of code points. In UTF-8 it is a sequence of code units.

UTF-8 is a sequence of bytes encoding code points.


From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Mon May 27 07:40:42 2024

On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings. 🙁

Surely a “character” (or “grapheme” I think is (one of) the Unicode terms)
is (represented by) a non-combining code point combined with all the immediately-following combining code points.
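
The rule proposed above (a base code point plus the combining code
points that follow it) can be sketched in Python as below. This is a
simplification written for illustration, not the segmentation that
Unicode itself specifies; the function name naive_clusters and the
sample word (K assumed to be followed by U+0316 and U+0308) are
assumptions.

```python
import unicodedata

def naive_clusters(text):
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns the canonical combining class; 0 means
        # the code point is not a combining mark in that sense.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "K\u0316\u0308nig"      # K + two combining marks, then "nig"
parts = naive_clusters(word)
print(len(word), len(parts))
```

For this word the rule yields four clusters from six code points. It
says nothing about sequences held together by other means, such as
joiner characters.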


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon May 27 07:42:32 2024

On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

Algol 60 does not standardize a program representation in characters (a
grave mistake fixed by most later programming languages ...

That would likely not have been considered feasible in 1960, given the
wide variation in character sets between computer systems. Even I/O was considered to be in the too-hard basket back then.


From Lawrence D'Oliveiro@21:1/5 to All on Mon May 27 07:43:42 2024

On Mon, 20 May 2024 17:44:48 +0000, MitchAlsup1 wrote:

John Savard wrote:

Historical note: Algol was originally called IAL; remember what JOVIAL
stood for.

Who was Joe ?? in Jovial

Jules Schwartz <http://bitsavers.trailing-edge.com/pdf/sdc/jovial/Schwartz_-_The_Development_of_JOVIAL_1978.pdf>


From Lawrence D'Oliveiro@21:1/5 to John Savard on Mon May 27 07:45:59 2024

On Sun, 19 May 2024 15:32:49 -0600, John Savard wrote:

If Algol was supposed to be an _international_ algorithmic language,
why weren't its keywords taken from Latin or Esperanto, instead of
English?

Much of its syntax came from mathematics, which is international.

Semi-related question: are there non-English equivalents for mathematical operators like “grad”, “div” and “curl”?


From John Levine@21:1/5 to All on Mon May 27 15:16:13 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings. 🙁

Surely a “character” (or “grapheme” I think is (one of) the Unicode terms)
is (represented by) a non-combining code point combined with all the
immediately-following combining code points.

Take another look at the table I referred to yesterday. When you have
ZWJ the rules of what combines with what get awfully complicated.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From John Levine@21:1/5 to [email protected] on Mon May 27 16:41:26 2024

It appears that Lawrence D'Oliveiro <[email protected]d> said:

Much of its syntax came from mathematics, which is international.

Semi-related question: are there non-English equivalents for mathematical
operators like “grad”, “div” and “curl”?

Grad is written as a nabla, an upside down delta, div as nabla followed by a center dot,
and curl as nabla followed by a multiplication sign.

I'm reasonably sure my 1970 math textbook used them but I can't find it at the moment.

If you're asking how they're written in programming languages, I
expect they use the English names since we have the better part of a
century of anglophone numerical programming. Wikipedia says that curl
is often called "rot" for rotation outside North America.

I happen to have a copy of "Algol 60 Implementation" published in 1963
which describes the KDF9 Algol compiler in considerable detail. They
considered the translation of the Algol publication language to the
5-bit paper tape code their computer used so trivial that they don't
even describe it.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:08:06 2024

On Mon, 27 May 2024 15:16:13 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

I don't know of any language (or even library) that supports the
notion of "character" for Unicode strings. 🙁

Surely a “character” (or “grapheme” I think is (one of) the Unicode
terms) is (represented by) a non-combining code point combined with all
the immediately-following combining code points.

Take another look at the table I referred to yesterday. When you have
ZWJ the rules of what combines with what get awfully complicated.

ZWJ is classed as “punctuation”, and has no combining class. So it forms a
“character” or “grapheme” in its own right.


From John Levine@21:1/5 to All on Tue May 28 01:25:38 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Mon, 27 May 2024 15:16:13 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 22 May 2024 15:38:51 -0400, Stefan Monnier wrote:

I don't know of any language (or even library) that supports the
notion of "character" for Unicode strings. 🙁

Surely a “character” (or “grapheme” I think is (one of) the Unicode
terms) is (represented by) a non-combining code point combined with all
the immediately-following combining code points.

Take another look at the table I referred to yesterday. When you have
ZWJ the rules of what combines with what get awfully complicated.

ZWJ is classed as “punctuation”, and has no combining class. So it forms a
“character” or “grapheme” in its own right.

Really, you need to look at that combined emoji table I told you about yesterday.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From John Levine@21:1/5 to [email protected] on Tue May 28 01:36:22 2024

It appears that Lawrence D'Oliveiro <[email protected]d> said:

On Tue, 28 May 2024 01:25:38 -0000 (UTC), John Levine wrote:

Really, you need to look at that combined emoji table I told you about
yesterday.

I’m just telling you what the official Unicode spec says.

Um, so am I. Those nine code point things are supposed to display
as a single little picture, regardless of what some other bit of
the spec may assert about ZWJ.
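
For anyone who wants to poke at this, here is a small Python sketch
using the standard unicodedata module. The sample is an assumption: a
four-person family sequence with seven code points, shorter than the
nine-code-point entries in the table, but built the same way.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man + ZWJ + woman + ZWJ + girl + ZWJ + boy: an emoji ZWJ sequence
# that a supporting font draws as a single family picture.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

code_points = len(family)                 # 4 emoji + 3 joiners
utf8_bytes = len(family.encode("utf-8"))  # 4 * 4 bytes + 3 * 3 bytes
zwj_category = unicodedata.category(ZWJ)  # general category of U+200D
zwj_ccc = unicodedata.combining(ZWJ)      # canonical combining class

print(code_points, utf8_bytes, zwj_category, zwj_ccc)
```

Python reports general category Cf (format) and combining class 0 for
U+200D, so a rule based only on combining classes would split this
sequence at every joiner.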

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:22:46 2024

On Mon, 27 May 2024 16:41:26 -0000 (UTC), John Levine wrote:

It appears that Lawrence D'Oliveiro <[email protected]d> said:

Much of its syntax came from mathematics, which is international.

Semi-related question: are there non-English equivalents for
mathematical operators like “grad”, “div” and “curl”?

Grad is written as a nabla, an upside down delta, div as nabla followed
by a center dot, and curl as nabla followed by a multiplication sign.

That’s right, I’d forgotten about that.

I happen to have a copy of "Algol 60 Implementation" published in 1963
which describes the KDF9 Algol compiler in considerable detail. They considered the translation of the Algol publication language to the
5-bit paper tape code their computer used so trivial that they don't
even describe it.

Only 32 code symbols? It must have used shifts, à la Baudot code. It
probably was Baudot code.


From John Levine@21:1/5 to All on Tue May 28 01:34:36 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?

Presumably it makes some inner loop faster. They have instructions
to convert among all of UTF-8, UTF-16, and UTF-32, with an optional
bit (available at extra cost) to check that the incoming code points
are valid in the selected encoding.
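
To give an idea of the work such an inner loop does, here is a plain
Python sketch of UTF-8 to code point conversion. It is not a
description of how CU14 is implemented; the function names are made up
for illustration, and the checks are minimal (overlong forms and
surrogates are not rejected).

```python
def decode_utf8_char(data, pos=0):
    """Decode one code point from UTF-8 bytes; return (code_point, next_pos)."""
    first = data[pos]
    if first < 0x80:                 # 0xxxxxxx: one byte
        length, cp = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx: two bytes
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: three bytes
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: four bytes
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if pos + length > len(data):
        raise ValueError("truncated UTF-8 sequence")
    for byte in data[pos + 1:pos + length]:
        if byte >> 6 != 0b10:        # continuation bytes are 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp, pos + length

def utf8_to_code_points(data):
    """Turn UTF-8 bytes into a list of code points (the UTF-32 values)."""
    out, pos = [], 0
    while pos < len(data):
        cp, pos = decode_utf8_char(data, pos)
        out.append(cp)
    return out

# Sample with 1-, 2-, 3- and 4-byte sequences.
sample = "K\u00f6\u20ac\U0001F600".encode("utf-8")
decoded = utf8_to_code_points(sample)
print([hex(cp) for cp in decoded])
```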

zSeries has a lot of instructions like that. They even have packed
decimal vector instructions (not to be confused with decimal floating
point vector instructions, which they also have.) I can sort of guess
why but I don't really know.

They're almost certainly implemented in what they call millicode,
vertical microcode that uses the hardware implemented subset of the
instruction set plus a few extras to manage internal state info. So
it's not extra hardware, just extra microcode.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:29:31 2024

On Tue, 28 May 2024 01:25:38 -0000 (UTC), John Levine wrote:

Really, you need to look at that combined emoji table I told you about yesterday.

I’m just telling you what the official Unicode spec says.


From moi@21:1/5 to Lawrence D'Oliveiro on Tue May 28 15:43:25 2024

On 28/05/2024 02:22, Lawrence D'Oliveiro wrote:

On Mon, 27 May 2024 16:41:26 -0000 (UTC), John Levine wrote:

It appears that Lawrence D'Oliveiro <[email protected]d> said:

Much of its syntax came from mathematics, which is international.

Semi-related question: are there non-English equivalents for
mathematical operators like “grad”, “div” and “curl”?

Grad is written as a nabla, an upside down delta, div as nabla followed
by a center dot, and curl as nabla followed by a multiplication sign.

That’s right, I’d forgotten about that.

I happen to have a copy of "Algol 60 Implementation" published in 1963
which describes the KDF9 Algol compiler in considerable detail. They
considered the translation of the Algol publication language to the
5-bit paper tape code their computer used so trivial that they don't
even describe it.

Only 32 code symbols? It must have used shifts, à la Baudot code. It probably was Baudot code.

It was Ferranti 5-channel paper tape code: <http://www.findlayw.plus.com/KDF9/The%20KDF9%20Character%20Codes.pdf>
--
Bill F.


From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Tue May 28 17:04:20 2024

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Sun, 19 May 2024 15:32:49 -0600, John Savard wrote:

If Algol was supposed to be an _international_ algorithmic language,
why weren't its keywords taken from Latin or Esperanto, instead of
English?

Much of its syntax came from mathematics, which is international.

Semi-related question: are there non-English equivalents for mathematical operators like “grad”, “div” and “curl”?

German has "grad", "div" and "rot". People also use the nabla
operator, which I personally don't like.


From Stefan Monnier@21:1/5 to All on Tue May 28 16:37:22 2024

Anyway, the Emacs Lisp functions right-char (and, after testing, also left-char, forward-char, and backward-char) support the notion of
character at least for some scripts. That may be the result of an interaction with the redisplay code that you mention later, but in
that case it's that code that knows about characters in Unicode.

Indeed, the concept is somewhat visible, but it's not really exposed in
the language. I think what you're seeing is implemented elsewhere than
in `forward-char`, it's a part of the interactive loop which sees that
after `forward-char` you end up "in the middle" of a composition and it
moves the point further, based on information that mostly belongs to the redisplay code.

Try `C-u 2 C-f` and I suspect you'll see that it doesn't always advance
by 2 characters but rather it advances by "2 code points + rounding up
to the next character boundary".
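
A toy model of that suspected behaviour, in Python rather than Emacs
Lisp, might look like this. It is not Emacs's code: the combining
class stands in for the composition information that the redisplay
code holds, and the names is_boundary and forward are made up.

```python
import unicodedata

def is_boundary(text, pos):
    """True unless pos falls between a base code point and its combining marks."""
    return pos <= 0 or pos >= len(text) or unicodedata.combining(text[pos]) == 0

def forward(text, pos, n):
    """Advance n code points, then round up to the next cluster boundary."""
    pos = min(pos + n, len(text))
    while not is_boundary(text, pos):
        pos += 1
    return pos

word = "K\u0316\u0308nig"        # K + two combining marks, then "nig"
one_step = forward(word, 0, 1)    # lands inside the cluster, rounds up
two_steps = forward(word, 0, 2)   # rounds up to the same place
print(one_step, two_steps)
```

Both calls stop at index 3, just after the first cluster, whereas
moving by two characters would stop at index 4.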

Stefan


From Stefan Monnier@21:1/5 to All on Tue May 28 16:53:14 2024

Um, so am I. Those nine code point things are supposed to display
as a single little picture, regardless of what some other bit of
the spec may assert about ZWJ.

Maybe it's a good time to start taking bets for which will be the year
that Unicode becomes Turing complete?

Stefan


From Lawrence D'Oliveiro@21:1/5 to moi on Wed May 29 04:49:26 2024

On Tue, 28 May 2024 15:43:25 +0100, moi wrote:

On 28/05/2024 02:22, Lawrence D'Oliveiro wrote:

On Mon, 27 May 2024 16:41:26 -0000 (UTC), John Levine wrote:

I happen to have a copy of "Algol 60 Implementation" published in 1963
which describes the KDF9 Algol compiler in considerable detail. They
considered the translation of the Algol publication language to the
5-bit paper tape code their computer used so trivial that they don't
even describe it.

Only 32 code symbols? It must have used shifts, à la Baudot code. It
probably was Baudot code.

It was Ferranti 5-channel paper tape code: <http://www.findlayw.plus.com/KDF9/The%20KDF9%20Character%20Codes.pdf>

That doc says it’s a 6-bit code.

By the way, don’t you hate sites that block user agents like wget?


From Anton Ertl@21:1/5 to Stefan Monnier on Wed May 29 06:59:55 2024

Stefan Monnier <[email protected]> writes:

Anyway, the Emacs Lisp functions right-char (and, after testing, also
left-char, forward-char, and backward-char) support the notion of
character at least for some scripts. That may be the result of an
interaction with the redisplay code that you mention later, but in
that case it's that code that knows about characters in Unicode.

Indeed, the concept is somewhat visible, but it's not really exposed in
the language. I think what you're seeing is implemented elsewhere than
in `forward-char`, it's a part of the interactive loop which sees that
after `forward-char` you end up "in the middle" of a composition and it
moves the point further, based on information that mostly belongs to the
redisplay code.

Try `C-u 2 C-f` and I suspect you'll see that it doesn't always advance
by 2 characters but rather it advances by "2 code points + rounding up
to the next character boundary".

Confirmed. So Emacs Lisp has a codepoint-oriented interface and then
needs to compensate for that elsewhere. This does not indicate that a codepoint-oriented interface is a good idea, rather the opposite.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From moi@21:1/5 to Lawrence D'Oliveiro on Wed May 29 08:32:17 2024

On 29/05/2024 05:49, Lawrence D'Oliveiro wrote:

On Tue, 28 May 2024 15:43:25 +0100, moi wrote:

On 28/05/2024 02:22, Lawrence D'Oliveiro wrote:

On Mon, 27 May 2024 16:41:26 -0000 (UTC), John Levine wrote:

I happen to have a copy of "Algol 60 Implementation" published in 1963
which describes the KDF9 Algol compiler in considerable detail. They
considered the translation of the Algol publication language to the
5-bit paper tape code their computer used so trivial that they don't
even describe it.

Only 32 code symbols? It must have used shifts, à la Baudot code. It
probably was Baudot code.

It was Ferranti 5-channel paper tape code:
<http://www.findlayw.plus.com/KDF9/The%20KDF9%20Character%20Codes.pdf>

That doc says it’s a 6-bit code.

KDF9 characters are 6 bits.
Ferranti paper tape characters are 5 bits.
When dealing with the latter, the KDF9 paper tape reader
sets the high bit of each input character to 1,
and the paper tape punch discards the high bit.
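
In bit terms, the mapping just described amounts to the following
Python sketch. It models only the high-bit handling stated above; the
function names are made up, and the actual code tables are not
reproduced.

```python
HIGH_BIT = 0b100000   # sixth bit of a KDF9 character
LOW_FIVE = 0b011111   # the five bits present on the tape

def read_tape_char(tape_code):
    """Reader direction: 5-bit tape code to 6-bit character, high bit set."""
    if not 0 <= tape_code <= LOW_FIVE:
        raise ValueError("tape code must fit in 5 bits")
    return tape_code | HIGH_BIT

def punch_tape_char(char_code):
    """Punch direction: 6-bit character to 5-bit tape code, high bit dropped."""
    return char_code & LOW_FIVE

in_memory = read_tape_char(0b00101)
round_trip = punch_tape_char(in_memory)
print(bin(in_memory), bin(round_trip))
```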

By the way, don’t you hate sites that block user agents like wget?

No.
I hate user agents like wget, which is why I block them.

--
Bill F.


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed May 29 08:07:50 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

Algol 60 does not standardize a program representation in characters (a
grave mistake fixed by most later programming languages ...

That would likely not have been considered feasible in 1960, given the
wide variation in character sets between computer systems.

COBOL did it. LISP did it. It was feasible in 1960. It's just that
the Algol 60 committee did not want to go there. And the Algol 68
committee did not want to go there even though ASCII was standardized
in 1963, and Algol 68 was only finished in 1974 AFAIK.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed May 29 08:20:03 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 27 May 2024 06:20:33 GMT, Anton Ertl wrote:

In UTF-32 a character is a sequence of code points. In UTF-8 it is a
sequence of code units.

UTF-8 is a sequence of bytes encoding code points.

Yes, but it is even rarer that code points are needed than that
characters are needed. Another, better way of stating this is:

In UTF-32 a character is a sequence of (32-bit) code units.
In UTF-8 a character is a sequence of (8-bit) code units.

Given that the data is present in files in UTF-8 form, any conversion
to and from UTF-32 is just an unnecessary complication.
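
A quick Python illustration of the two statements, using a sample
assumed here to be K followed by U+0316 and U+0308 (one character as
displayed, three code points):

```python
text = "K\u0316\u0308"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")   # little-endian, no byte-order mark

utf8_units = len(utf8)             # number of 8-bit code units
utf32_units = len(utf32) // 4      # number of 32-bit code units

# Converting between the two forms is lossless in both directions.
round_trip = utf32.decode("utf-32-le").encode("utf-8")

print(utf8_units, utf32_units, round_trip == utf8)
```

The character is a multi-unit sequence in both encodings: five code
units in UTF-8 and three in UTF-32.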

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Stefan Monnier@21:1/5 to All on Wed May 29 10:44:21 2024

Confirmed. So Emacs Lisp has a codepoint-oriented interface and then
needs to compensate for that elsewhere. This does not indicate that a codepoint-oriented interface is a good idea, rather the opposite.

Note that the "round to the next character boundary" is actually
generalized to non-Unicode concepts: you can mark a chunk of text as
being "intangible" or make it invisible and the "round up" will
correspondingly move to the next boundary to avoid the cursor being in
the middle of an invisible or intangible chunk of text.

I'm not sure the codepoint-oriented API is the best option, but it's not completely clear what *is* the best option. You mention a byte-oriented
API and you might be right that it's a better option, but in the case of
Emacs that's what we used in Emacs-20.1 but it worked really poorly
because of backward compatibility issues. I think if we started from
scratch now (i.e. without having to contend with backward compatibility,
and with a better understanding of Unicode (which barely existed back
then)) it might work better, indeed, but that's not been an option 🙁

Stefan


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu May 30 02:50:33 2024

On Wed, 29 May 2024 08:07:50 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

Algol 60 does not standardize a program representation in characters
(a grave mistake fixed by most later programming languages ...

That would likely not have been considered feasible in 1960, given the
wide variation in character sets between computer systems.

COBOL did it. LISP did it.

And so did Fortran. They all did it by severely curtailing their allowed character sets.

It's just that the Algol 60 committee did not want to go there.

They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”, “∧”,
“¬” ... you get the idea. I don’t think any computer system on earth could
provide all those symbols at the time, or even, say, 20 years later.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu May 30 02:53:28 2024

On Wed, 29 May 2024 08:20:03 GMT, Anton Ertl wrote:

In UTF-32 a character is a sequence of (32-bit) code units.
In UTF-8 a character is a sequence of (8-bit) code units.

The point being, there is a 1:1 correspondence between the two
representations of the same characters/code points. So your claim that use
of one is somehow a “mistake” while the other is not, is spurious.


From Lawrence D'Oliveiro@21:1/5 to moi on Thu May 30 02:43:39 2024

On Wed, 29 May 2024 08:32:17 +0100, moi wrote:

I hate user agents like wget, which is why I block them.

Which is completely futile, which is why it’s so stupid to do.


From John Levine@21:1/5 to All on Thu May 30 03:25:14 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Doesn’t matter. The principles of designing high-performance architectures
still apply: simpler instructions are better than more complex ones.

Nobody buys a mainframe just for its compute speed.

I do not entirely understand why IBM keeps adding special purpose
instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that. Given the
millicode design, a lot of the instructions are basically microcoded
subroutines that may well run faster than the normal code equivalent
because they have access to more machine state. If anyone is about to
say then let all the instructions see all the state, see our
discussion a week or two ago about architecture vs. implementation.

If you want something that gives you more MIPS/$, IBM is happy to sell
you POWER systems.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Stephen Fuld@21:1/5 to John Levine on Thu May 30 03:29:28 2024

John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler
instructions?

The IBM z series are not RISCs.

Doesn’t matter. The principles of designing high-performance architectures still apply: simpler instructions are better than
more complex ones.

Nobody buys a mainframe just for its compute speed.

I do not entirely understand why IBM keeps adding special purpose instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that. Given the
millicode design, a lot of the instructions are basically microcoded
subroutines that may well run faster than the normal code equivalent
because they have access to more machine state. If anyone is about to
say then let all the instructions see all the state, see our
discussion a week or two ago about architecture vs. implementation.

Thanks John. Your post and my previous one "crossed in the night". I
think you answered my question.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Thu May 30 03:21:13 2024

Lawrence D'Oliveiro wrote:

snip

They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”,
“∧”, “¬” ... you get the idea. I don’t think any computer system on earth
could provide all those symbols at the time, or even, say, 20 years
later.

See APL. So many symbols that the language is almost impossible to
read without a significant investment in learning them.

https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

Please note that I am not advocating this. It is at the opposite end
of the spectrum from COBOL where you could get by with no special
characters beyond periods. Neither was a good choice.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Stephen Fuld on Wed May 29 21:47:52 2024

"Stephen Fuld" <[email protected]d> writes:

Lawrence D'Oliveiro wrote:

snip

They wanted symbols like [...]

See APL. So many symbols that the language is almost impossible to
read without a significant investment in learning them.

https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

The problem with learning APL is not the character set. APL without
any special characters (which I actually have some experience using)
is still unlike any other programming language that existed in the
1960s or 1970s.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to Tim Rentsch on Thu May 30 06:12:11 2024

Tim Rentsch wrote:

"Stephen Fuld" <[email protected]d> writes:

Lawrence D'Oliveiro wrote:

snip

They wanted symbols like [...]

See APL. So many symbols that the language is almost impossible to
read without a significant investment in learning them.

https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

The problem with learning APL is not the character set. APL without
any special characters (which I actually have some experience using)
is still unlike any other programming language that existed in the
1960s or 1970s.

OK, but my main point was to show, by counter example, the error of
Lawrence's statement quoted below

I don’t think any computer system on earth could
provide all those symbols at the time, or even, say, 20 years later.

If the part about the difficulty of learning APL was wrong, then I
apologise.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Stephen Fuld on Thu May 30 05:38:00 2024

"Stephen Fuld" <[email protected]d> writes:

Tim Rentsch wrote:

"Stephen Fuld" <[email protected]d> writes:

Lawrence D'Oliveiro wrote:

snip

They wanted symbols like [...]

See APL. So many symbols that the language is almost impossible to
read without a significant investment in learning them.

https://en.wikipedia.org/wiki/APL_syntax_and_symbols#Monadic_functions

The problem with learning APL is not the character set. APL without
any special characters (which I actually have some experience using)
is still unlike any other programming language that existed in the
1960s or 1970s.

OK, but my main point was to show, by counter example, the error of Lawrence's statement quoted below

I see. I misunderstood the point of what you were saying. Sorry
about that.

I don't think any computer system on earth could provide all those
symbols at the time, or even, say, 20 years later.

If the part about the difficulty of learning APL was wrong, then I
apologise.

No apology needed. Even if the APL character set wasn't the main
source of the difficulty, there is no question that the unusual
choice of operator characters used contributed to the effort needed
to understand and use APL.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Levine on Thu May 30 12:27:17 2024

John Levine <[email protected]> writes:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Doesn’t matter. The principles of designing high-performance architectures
still apply: simpler instructions are better than more complex ones.

Nobody buys a mainframe just for its compute speed.

I do not entirely understand why IBM keeps adding special purpose
instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that.

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property: When Intel and ARM evaluate whether they should implement
these features in their architectures, they find that the benefits of
these features do not justify their costs, so they refrain from adding
them to their architectures, preserving the marketing value of the
feature to IBM.

Given the
millicode design, a lot of the instructions are basically microcoded
subroutines that may well run faster than the normal code equivalent
because they have access to more machine state.

Maybe IBM adds a microarchitectural stream buffer to allow efficient
implementation of CU14, but I doubt it. The marketing value of CU14
is there whether there is such a stream buffer or not, so why go to
the expense? If they already have such a stream buffer for other
features, they might as well use it, though.

Maybe they internally do the SIMDified RISCy variant I outlined, and
then have a microcode loop. The SIMDified RISCy variant should be
cheap enough to implement.

Or maybe they just have a microcode routine that does what a C program
would do. In that case there is no performance benefit to having a
separate instruction, but the marketing benefit is still there.
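
To make "what a C program would do" concrete, here is a minimal sketch
of a scalar UTF-8 to UTF-32 conversion loop, written in Python for
brevity. It is not IBM's millicode and does not model CU14's register
interface or condition codes; the name utf8_to_utf32 and the choice to
raise ValueError on malformed input are assumptions made for
illustration.

```python
def utf8_to_utf32(src):
    """Decode UTF-8 bytes into a list of code points (UTF-32 code units).

    Illustrative scalar loop only; rejects overlong forms, surrogates,
    and values above U+10FFFF.
    """
    out = []
    i = 0
    n = len(src)
    while i < n:
        b0 = src[i]
        if b0 < 0x80:                    # 1-byte sequence (ASCII)
            cp, need, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:         # 2-byte sequence
            cp, need, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:         # 3-byte sequence
            cp, need, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:         # 4-byte sequence
            cp, need, lowest = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + need >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, need + 1):
            b = src[i + k]
            if (b & 0xC0) != 0x80:       # continuation bytes are 10xxxxxx
                raise ValueError("bad continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (b & 0x3F)
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("overlong, out-of-range, or surrogate at offset %d" % i)
        out.append(cp)
        i += need + 1
    return out
```

Whether a dedicated instruction beats a loop like this depends on the
implementation, which is the point under discussion.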

If you want something that gives you more MIPS/$, IBM is happy to sell
you POWER systems.

If you want something that gives you more MIPS (as well as more
MIPS/$), lots of companies will be happy to sell you gear with AMD or
Intel CPUs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu May 30 12:47:35 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Wed, 29 May 2024 08:20:03 GMT, Anton Ertl wrote:

In UTF-32 a character is a sequence of (32-bit) code units.
In UTF-8 a character is a sequence of (8-bit) code units.

The point being, there is a 1:1 correspondence between the two
representations of the same characters/code points. So your claim that use
of one is somehow a “mistake” while the other is not, is spurious.

If the data you are working on is provided in files containing UTF-8,
conversion to UTF-32 does not provide any benefits and is therefore an
unnecessary complication, and therefore a mistake.
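
A small sketch of both halves of this exchange, in Python: the
conversion is lossless in both directions (the 1:1 correspondence),
but for data that arrives as UTF-8 the round trip only adds work and,
for mostly-ASCII text, memory. The function name roundtrip_sizes is
made up for illustration.

```python
def roundtrip_sizes(text):
    """Return (UTF-8 size, UTF-32 size, round-trip-is-lossless) for text."""
    utf8 = text.encode("utf-8")
    # Convert UTF-8 -> UTF-32 (little-endian, no BOM) and back again.
    utf32 = utf8.decode("utf-8").encode("utf-32-le")
    back = utf32.decode("utf-32-le").encode("utf-8")
    return len(utf8), len(utf32), back == utf8


# 9 code points: 15 bytes as UTF-8, 36 bytes as UTF-32.
sizes = roundtrip_sizes("na\u00efve \u2264 \U0001F600")
```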

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Thomas Koenig on Thu May 30 14:08:04 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Note that the feature was introduced in Znext (2012). That it is
still there must indicate that it gets some usage.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Scott Lurndal on Thu May 30 17:12:22 2024

On Thu, 30 May 2024 14:08:04 GMT
[email protected] (Scott Lurndal) wrote:

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such
reasons. Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Note that the feature was introduced in Znext (2012). That it is
still there must indicate that it gets some usage.

Not necessarily.
After a feature has been given a publicly documented opcode, it's very
hard to remove it.
Naturally, I don't know if this particular feature got a publicly
documented opcode and don't know where to look.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Anton Ertl on Thu May 30 13:41:58 2024

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Care to show exactly what you did, and what the results were?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Michael S on Thu May 30 14:53:28 2024

Michael S <[email protected]> schrieb:

Naturally, I don't know if this particular feature got a publicly
documented opcode and don't know where to look.

Search for the famed "Principles of Operation" for zSystems.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Thu May 30 15:28:51 2024

Michael S <[email protected]> writes:

On Thu, 30 May 2024 14:08:04 GMT
[email protected] (Scott Lurndal) wrote:

Note that the feature was introduced in Znext (2012). That it is
still there must indicate that it gets some usage.

Not necessarily.
After a feature has been given a publicly documented opcode, it's very
hard to remove it.

Even if this reason did not exist, the marketing reason for having
this instruction still exists, so why should they remove it?

Naturally, I don't know if this particular feature got a publicly
documented opcode and don't know where to look.

These instructions have public opcodes, and I gave a URL and page
number in <[email protected]>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 30 15:49:24 2024

According to Anton Ertl <[email protected]>:

Concerning benchmarks, last I heard IBM forbids benchmarking z
hardware. Until they change this, I'll assume their z hardware is
abysmally slow and any benchmarking would result in embarrassment, IBM
knows this and that's why they forbid benchmarking.

My guess is that it's not so much that it's slow but that, even more
than usual, benchmarks show what you want them to show. For example,
if you benchmark a portable version of gzip or heapsort it will look
worse than one that knows to use the accelerator instructions.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Anton Ertl on Thu May 30 15:55:59 2024

Anton Ertl <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Care to show exactly what you did, and what the results were?

It provides no actual benefit, because UTF-32 provides no actual
benefit.

In other words, you didn't.

Thanks for the explanation.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Thomas Koenig on Thu May 30 15:04:35 2024

Thomas Koenig <[email protected]> writes:

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Care to show exactly what you did, and what the results were?

It provides no actual benefit, because UTF-32 provides no actual
benefit. In nearly all code you don't need code points. Dealing with
data as mostly opaque strings in UTF-8 is less complicated *and* more
efficient than converting them to UTF-32, working with UTF-32 strings,
and converting back (even if the conversion was very cheap).
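
As an illustration of the "mostly opaque strings" style, here is a
sketch in Python that parses and searches UTF-8 data purely as bytes,
never decoding to code points. It relies on a property of UTF-8: no
ASCII byte and no complete encoded character can match in the middle
of another character. The record layout (key=value fields separated by
";") and the names find_field and byte_span are assumptions made up
for the example.

```python
def find_field(record, key):
    """Return the value bytes for key in a b'k=v;k=v' record, or None."""
    for field in record.split(b";"):       # ';' cannot occur inside a multi-byte char
        k, sep, v = field.partition(b"=")
        if sep and k == key:
            return v
    return None


def byte_span(haystack, needle):
    """Describe a substring as (start, length) in bytes, or None if absent."""
    start = haystack.find(needle)
    if start < 0:
        return None
    return start, len(needle)


record = "name=J\u00fcrgen;city=Z\u00fcrich".encode("utf-8")
city = find_field(record, b"city")         # still UTF-8 bytes, never decoded
span = byte_span(record, "Z\u00fcrich".encode("utf-8"))
```

The (start, length) pair is in bytes, so slicing with it is a plain
memory operation that needs no knowledge of code points.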

Of course there are API mistakes (like Python3) that lead to some
usage of UTF-32, but even on Intel and AMD CPUs where Python3 code
probably consumes more cycles than on other hardware, that usage has
not been enough to add instructions like CU14.

IBM z also has CU12 and CU21 (for converting between UTF-8 and
UTF-16), and such instructions could see some usage in environments
where UTF-16 is big, such as Java, JavaScript, and Windows, but even
in CPUs by Intel and AMD (with lots of Windows and JavaScript) and ARM
(Android, i.e., Java) this has not led to instructions for converting
between UTF-8 and UTF-16.

Concerning benchmarks, last I heard IBM forbids benchmarking z
hardware. Until they change this, I'll assume their z hardware is
abysmally slow and any benchmarking would result in embarrassment, IBM
knows this and that's why they forbid benchmarking.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stefan Monnier on Thu May 30 16:25:46 2024

Stefan Monnier <[email protected]> writes:

I'm not sure the codepoint-oriented API is the best option, but it's not
completely clear what *is* the best option. You mention a byte-oriented
API and you might be right that it's a better option, but in the case of
Emacs that's what we used in Emacs-20.1 but it worked really poorly
because of backward compatibility issues. I think if we started from
scratch now (i.e. without having to contend with backward compatibility,
and with a better understanding of Unicode (which barely existed back
then)) it might work better, indeed, but that's not been an option

Plus, editors are among the very few uses where you have to deal with
individual characters, so the "treat it as opaque string" approach
that works so well for most other code is not good enough there. The
command-line editor of Gforth is one case where we use the xchar words
(those for dealing with code points of UTF-8).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Thu May 30 14:01:53 2024

The problem with learning APL is not the character set. APL without
any special characters (which I actually have some experience using)
is still unlike any other programming language that existed in the
1960s or 1970s.

There have been a few languages that took similar approaches, but the
most recent and successful I've heard of is [jq](https://en.wikipedia.org/wiki/Jq_%28programming_language%29).

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From moi@21:1/5 to Lawrence D'Oliveiro on Thu May 30 19:01:11 2024

On 30/05/2024 03:43, Lawrence D'Oliveiro wrote:

On Wed, 29 May 2024 08:32:17 +0100, moi wrote:

I hate user agents like wget, which is why I block them.

Which is completely futile, which is why it’s so stupid to do.

What a know-all you are. And offensive with it.

--
Bill F.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to [email protected] on Thu May 30 22:22:34 2024

On Thu, 30 May 2024 02:50:33 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

And so did Fortran. They all did it by severely curtailing their allowed
character sets.

It's just that the Algol 60 committee did not want to go there.

They wanted symbols like “÷”, “×”, “↑”, “≤”, “≥”, “≠”, “≡”, “⊃”, “∨”,
“∧”, “¬” ... you get the idea. I don’t think any computer system on earth
could provide all those symbols at the time, or even, say, 20 years later.

Well, the 120 character chain for the STRETCH computer's printer
handled Algol's character set. And so did the punched card code for a
couple of Russian computers. So the attempt was made.

And then there was the LISP machine, which started life with the
infamous "Space Cadet" keyboard.

Today, of course, we have Unicode, but that doesn't mean the entire
Algol character set is conveniently accessible directly from the
keyboard.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to Anton Ertl on Thu May 30 22:19:14 2024

On Wed, 29 May 2024 08:07:50 GMT, [email protected]
(Anton Ertl) wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 20 May 2024 11:46:20 GMT, Anton Ertl wrote:

Algol 60 does not standardize a program representation in characters (a
grave mistake fixed by most later programming languages ...

That would likely not have been considered feasible in 1960, given the
wide variation in character sets between computer systems.

COBOL did it. LISP did it. It was feasible in 1960. It's just that
the Algol 60 committee did not want to go there.

There was a famous article by Bob Bemer in 1960 in the Communications
of the ACM in which he gave a table of all this variation in character
sets between computers. This helped spur the adoption of ASCII.

Algol 60 was intended as an international algorithmic language. In
fact, Algol was first called the International Algebraic Language
(IAL), hence JOVIAL. So it is _not_ particularly hard for me to believe
that the international committee behind Algol 60 wished to support a
wider variety of computers than
the people behind COBOL and LISP. Yes, those languages, unlike
FORTRAN, weren't the creations of a single manufacturer.

But they _were_ fairly U.S. - centric, and Algol was *not*. For
example, there were British computer systems that offered Algol
compilers that based their character sets on modified 5-unit
teleprinter codes.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to [email protected] on Thu May 30 22:25:47 2024

On Thu, 30 May 2024 06:12:11 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

If the part about the difficulty of learning APL was wrong, then I
apologise.

I would not say that it was wrong. APL "without special characters"
was achieved by way of a transliteration scheme, where short codes
represented the special characters. So instead of memorizing funny
shapes, you memorized cryptic abbreviations.

So the character set was _still_ the source of the difficulty of
learning APL even if you happened to be using an implementation that
didn't have any special characters.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to [email protected] on Thu May 30 22:31:41 2024

On Thu, 30 May 2024 03:25:14 -0000 (UTC), John Levine
<[email protected]> wrote:

I do not entirely understand why IBM keeps adding special purpose
instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that.

One possibility is to _keep_ that audience captive even after all the
patents expire that are applicable to machines with the z/Architecture
in its current state, if you are reluctant to believe that these new
instructions genuinely improve performance.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to John Savard on Fri May 31 12:59:42 2024

On Thu, 30 May 2024 22:19:14 -0600
John Savard <[email protected]d> wrote:

But they _were_ fairly U.S. - centric, and Algol was *not*. For
example,

U.S.-centric vs U.S. eccentric. http://www.cs.yale.edu/homes/perlis-alan/quotes.html

Actually I am pretty sure that "eccentric" is not a fair
characterisation of his personality, but can't resist.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Thomas Koenig on Fri May 31 14:23:36 2024

Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Care to show exactly what you did, and what the results were?

I am pretty sure Anton is correct, at least for data residing in RAM,
since any reasonably efficient sw algorithm to do the same thing should
be able to keep up with memory bandwidth, right?

If the data is already in cache, then you have presumably already
converted to whatever format you need to use internally while loading.

It is only when working with smallish blocks (up to a few kB of data,
fitting in $L1) and needing to run some temporary operation on decoded
codepoints, that this would be a significant win.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to John Savard on Fri May 31 09:47:58 2024

John Savard <[email protected]d> writes:

On Thu, 30 May 2024 06:12:11 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

If the part about the difficulty of learning APL was wrong, then I
apologise.

I would not say that it was wrong. APL "without special characters"
was achieved by way of a transliteration scheme, where short codes represented the special characters. So instead of memorizing funny
shapes, you memorized cryptic abbreviations.

So the character set was _still_ the source of the difficulty of
learning APL even if you happened to be using an implementation that
didn't have any special characters.

The character set was a source of some of the difficulty of
learning APL. Certainly not all of it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Fri May 31 17:21:53 2024

BGB wrote:

On 5/30/2024 11:25 AM, Anton Ertl wrote:

Stefan Monnier <[email protected]> writes:

I'm not sure the codepoint-oriented API is the best option, but it's not
completely clear what *is* the best option. You mention a byte-oriented
API and you might be right that it's a better option, but in the case of
Emacs that's what we used in Emacs-20.1 but it worked really poorly
because of backward compatibility issues. I think if we started from
scratch now (i.e. without having to contend with backward compatibility,
and with a better understanding of Unicode (which barely existed back
then)) it might work better, indeed, but that's not been an option

Plus, editors are among the very few uses where you have to deal with
individual characters, so the "treat it as opaque string" approach
that works so well for most other code is not good enough there. The
command-line editor of Gforth is one case where we use the xchar words
(those for dealing with code points of UTF-8).

Yeah.

For text editors, this is one of the few cases it makes sense to use 32
or 64 bit characters (say, combining the 'character' with some
additional metadata such as formatting).

Though, one thing that makes sense for text editors is if only the
"currently being edited" lines are fully unpacked, whereas the others
can remain in a more compact form (such as UTF-8), and are then unpacked
as they come into view (say, treating the editor window as a 32-entry
modulo cache or similar).

For the rest, say, one can have, say, a big buffer, with an array of
lines giving the location and size of the line's text in the buffer.

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif, ..}
along with text from different fonts and different backgrounds on a per
character basis.

If a line is modified, it can be reallocated at the end of the buffer,
and if the buffer gets full, it can be "repacked" and/or expanded as
needed. When written back to a file, the buffer lines can be emitted
in-order to the text file.
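
The buffer scheme described above can be sketched roughly as follows,
in Python. This is one possible reading of the description, not
anyone's actual editor code: the class name LineBuffer, its method
names, the default capacity, and the doubling growth policy are all
assumptions for illustration. Line text is kept as UTF-8 bytes, and a
table records each line's (offset, size) in the buffer.

```python
class LineBuffer:
    """Lines live in one byte buffer; a table maps line number -> (offset, size)."""

    def __init__(self, text, capacity=64):
        self.capacity = max(capacity, len(text), 1)
        self.buf = bytearray()
        self.lines = []                        # (offset, size) per line, in file order
        for line in text.split(b"\n"):
            self.lines.append((len(self.buf), len(line)))
            self.buf += line

    def get(self, n):
        off, size = self.lines[n]
        return bytes(self.buf[off:off + size])

    def set(self, n, new_text):
        """Modify line n by reallocating its text at the end of the buffer."""
        if len(self.buf) + len(new_text) > self.capacity:
            self.lines[n] = (0, 0)             # the old text of line n is now dead
            self._repack(extra=len(new_text))
        self.lines[n] = (len(self.buf), len(new_text))
        self.buf += new_text

    def _repack(self, extra):
        """Copy live lines into a fresh buffer, dropping dead text; grow if needed."""
        fresh = bytearray()
        table = []
        for off, size in self.lines:
            table.append((len(fresh), size))
            fresh += self.buf[off:off + size]
        self.buf, self.lines = fresh, table
        while len(self.buf) + extra > self.capacity:
            self.capacity *= 2                 # assumed growth policy

    def to_bytes(self):
        """Emit the lines in order, as they would be written back to the file."""
        return b"\n".join(self.get(n) for n in range(len(self.lines)))
```

Editing a line never moves its neighbours; the cost is deferred to the
occasional repack.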

Not entirely sure how other text editors manage things here, not really
looked into it.

If you think about it with the above features, you quickly realize it
is not just text anymore.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Fri May 31 19:41:01 2024

According to Terje Mathisen <[email protected]>:

Read all about it: https://www.vm.ibm.com/library/other/22783213.pdf

It's on page 7-251.

Thanks!

I did read all of it, and it was pretty close to how I would have
designed a sw function to do the same, except for the very funky ABI:

Both source and destination _must_ be an even register number, with the
following odd register providing the count/length.

That's the way they've been handling address+length pairs since they
added long compare and move instructions in S/370. They're so common
I'd expect there to be hardware to deal with them.

Just from this little snippet I'm pretty sure this instruction has a
sizeable startup overhead, compiler support is probably in the form of
an intrinsic that knows about the need to allocate two pairs of
registers, each pair starting at an even-numbered register.

Same register allocation would be needed for a string compare or move,
so that's nothing new.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Fri May 31 19:44:49 2024

According to John Savard <[email protected]d>:

On Thu, 30 May 2024 03:25:14 -0000 (UTC), John Levine
<[email protected]> wrote:

I do not entirely understand why IBM keeps adding special purpose
instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that.

One possibility is to _keep_ that audience captive even after all the
patents expire that are applicable to machines with the z/Architecture
in its current state, if you are reluctant to believe that these new
instructions genuinely improve performance.

Back in the last millennium there were a bunch of companies that made
clones of IBM mainframes. They all failed. It's the whole ecosystem of
hardware and software that keeps the customers, not just individual
features or patents.

I have to say I'm somewhat surprised that IBM has put a lot of effort
into running Linux on zSeries, since that's about as un-captive as you
can get. I would imagine that for some kinds of heavily threaded
workloads they could be competitive since the z machines have upwards
of a hundred CPUs with a shared mostly consistent cache.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Fri May 31 19:12:49 2024

BGB wrote:

On 5/31/2024 12:21 PM, MitchAlsup1 wrote:

For the rest, say, one can have, say, a big buffer, with an array of
lines giving the location and size of the line's text in the buffer.

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif, ..}
along with text from different fonts and different backgrounds on a per
character basis.

Errm, I think we call this a word processor, not a text editor.

So, you are calling AOL e-mail editor a word processor ??? !!?! Gasp !
And every modern forum editor (this one not included) word processors !!

Methinks your definition is overly inclusive.

Granted, text editors don't usually store font or formatting information
in the text itself, but rather it exists temporarily for things like
"syntax highlighting".

If a line is modified, it can be reallocated at the end of the buffer,
and if the buffer gets full, it can be "repacked" and/or expanded as
needed. When written back to a file, the buffer lines can be emitted
in-order to the text file.

Not entirely sure how other text editors manage things here, not really
looked into it.

If you think about it with the above features, you quickly realize it
is not just text anymore.

But, word processors are their own category...

Typically, they also have their own specialized formats (though, "big
blob of XML inside a ZIP package" seems to have become popular).

Whereas text-editors typically use plain ASCII/UTF-8/UTF-16 files...
The great "feature creep" in text editors is mostly that modern ones
support syntax highlighting and emojis.

An intermediate option would be a WYSIWYG editor that does MediaWiki or
Markdown. Though, annoyingly, there don't seem to be any that exist as
standalone desktop programs (seemingly invariably they are written in
JavaScript or similar and intended to operate inside a browser).

I might eventually need to get around to writing something like this
(mostly because I use MediaWiki notation for some of my own
documentation). Also arguably more advanced than the system used by
"info" and "man", though a tool along these lines could make sense (but
possibly as an intermediate, with an interface more like "man" but able
to jump between documents more like "info").

Also, bug hunt is annoying. Find/fix one bug, but more bugs remain...
My project is seemingly in a rather buggy state right at the moment.

But, I guess, did add things like file redirection and similar, along
with a few more standard commands.

So, in the working version, things like "cat file1 > file2"
or "program > file" and similar are now technically possible...

But, also, everything has turned into a crapstorm of crashes...

- anton


From John Levine@21:1/5 to All on Fri May 31 19:47:36 2024

According to Michael S <[email protected]>:

U.S.-centric vs U.S. eccentric.
http://www.cs.yale.edu/homes/perlis-alan/quotes.html

Actually I am pretty sure that "eccentric" is not a fair
characterisation of his personality, but can't resist.

He was my thesis advisor and he was pretty eccentric. In a nice way,
but still quite a character.


From Scott Lurndal@21:1/5 to John Levine on Fri May 31 21:03:30 2024

John Levine <[email protected]> writes:

According to John Savard <[email protected]d>:

On Thu, 30 May 2024 03:25:14 -0000 (UTC), John Levine
<[email protected]> wrote:

I do not entirely understand why IBM keeps adding special purpose
instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that.

One possibility is to _keep_ that audience captive even after all the
patents expire that are applicable to machines with the z/Architecture
in its current state, if you are reluctant to believe that these new
instructions genuinely improve performance.

Back in the last millennium there were a bunch of companies that made
clones of IBM mainframes. They all failed. It's the whole ecosystem of
hardware and software, not just individual features or patents, that
keeps the customers.

I have to say I'm somewhat surprised that IBM has put a lot of effort
into running linux on zSeries, since that's about as un-captive as you
can get. I would imagine that for some kinds of heavily threaded
workloads they could be competitive since the z machines have upwards
of a hundred CPUs with a shared mostly consistent cache.

I had heard somewhere that the linux use cases on Z run
multiple VMs, rather than single large SMP.


From MitchAlsup1@21:1/5 to John Levine on Fri May 31 21:05:36 2024

John Levine wrote:

According to Michael S <[email protected]>:

U.S.-centric vs U.S. eccentric.
http://www.cs.yale.edu/homes/perlis-alan/quotes.html

Actually I am pretty sure that "eccentric" is not a fair
characterisation of his personality, but can't resist.

He was my thesis advisor and he was pretty eccentric. In a nice way,
but still quite a character.

Back in my day, eccentric was used in the British fashion to point out
a person with certain qualities that make him instantly memorable, but
not in any bad way. The Characters on Monty Python were eccentric !!

Now it means a person with creepy qualities.

My how the language has migrated.


From Scott Lurndal@21:1/5 to [email protected] on Fri May 31 21:01:13 2024

[email protected] (MitchAlsup1) writes:

BGB wrote:

On 5/31/2024 12:21 PM, MitchAlsup1 wrote:

For the rest, say, one can have, say, a big buffer, with an array of
lines giving the location and size of the line's text in the buffer.

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif, ..}
along with text from different fonts and different backgrounds on a
per-character basis.

Errm, I think we call this a word processor, not a text editor.

So, you are calling AOL e-mail editor a word processor ???

Yep.

And every modern forum editor (this one not included) word processors

Yep. They're certainly not text editors along the lines of vim or emacs.


From John Savard@21:1/5 to [email protected] on Fri May 31 21:51:56 2024

On Fri, 31 May 2024 19:44:49 -0000 (UTC), John Levine
<[email protected]> wrote:

I have to say I'm somewhat surprised that IBM has put a lot of effort
into running linux on zSeries, since that's about as un-captive as you
can get.

You can buy a zSeries machine more cheaply if it can only run Linux,
but not any IBM operating systems. So this is presumably for the
purpose of expanding the popularity of the z/Architecture without in
any way threatening the profitability of their base market.

If they took it to its logical conclusion, and packaged zArchitecture
chips without the ability to run current IBM operating systems in the
same way as POWER chips, I might actually be interested.

John Savard


From Thomas Koenig@21:1/5 to John Savard on Sat Jun 1 07:47:49 2024

John Savard <[email protected]d> schrieb:

On Fri, 31 May 2024 19:44:49 -0000 (UTC), John Levine
<[email protected]> wrote:

I have to say I'm somewhat surprised that IBM has put a lot of effort
into running linux on zSeries, since that's about as un-captive as you
can get.

You can buy a zSeries machine more cheaply if it can only run Linux,
but not any IBM operating systems. So this is presumably for the
purpose of expanding the popularity of the z/Architecture without in
any way threatening the profitability of their base market.

One of the main selling points is the hardware reliability, and
you get this with Linux, too. Plus, you can always run z/OS in
parallel with Linux, either in LPAR mode or as a guest under VM
(or under KVM, if you're so inclined).

Software availability is probably the main driver. Even SAP made
SAP HANA Linux-only, and they have announced that other systems
will be dropped, so IBM is probably very glad they did that Linux port.

From what I heard, it actually started out as some people trying
out a port of Linux as an unofficial hobby project, and finding
it surprisingly easy.

If they took it to its logical conclusion, and packaged zArchitecture
chips without the ability to run current IBM operating systems in the
same way as POWER chips, I might actually be interested.

Would you like to buy one, then? That would be a large investment
of money and space in your home... but then again, an 18-year old
once bought a z890, see https://www.youtube.com/watch?v=45X4VP8CGtk


From Thomas Koenig@21:1/5 to Terje Mathisen on Sat Jun 1 08:50:22 2024

Terje Mathisen <[email protected]> schrieb:

Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

It's still marketing. I have listened to several talks about
converting S/360 programs to C code that can be run on arbitrary
hardware, and IBM's audience hears about such things, too, so IBM's
sales force has to provide reasons for not jumping ship. And all
these new features that sound like they are useful are such reasons.
Things like decimal FP and CU14.

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads
and can support your claims with benchmarks.

Care to show exactly what you did, and what the results were?

I am pretty sure Anton is correct, at least for data residing in RAM,
since any reasonably efficient sw algorithm to do the same thing should
be able to keep up with memory bandwidth, right?

I'm not sure that would be the case for text containing some
non-ASCII characters, where you cannot predict branches well
(consider Å, Ø and Æ, which together appear to make up a bit
more than 2.5% according to a random statistic I just
grabbed off the Internet), or ä, ö and ü which have around 1.5%
occurrence together.

In Chinese or Japanese text, I assume the spaces and punctuation
are 7-bit ASCII (are they, actually?) so things would be even
worse for branch prediction.
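
To make the branch in question concrete, here is a rough Python sketch of
a scalar UTF-8 decoder. It assumes well-formed input and does no
validation; the function name decode_utf8 is made up for this
illustration. The if/elif chain on the lead byte is the branch that
becomes hard to predict when ASCII and multi-byte characters are
interleaved.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode well-formed UTF-8 into a list of code points (no validation)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 1 byte: ASCII
            cp, n = lead, 1
        elif lead < 0xE0:      # 2 bytes: e.g. Å, Ø, Æ, ä, ö, ü
            cp, n = lead & 0x1F, 2
        elif lead < 0xF0:      # 3 bytes: rest of the BMP, including most CJK
            cp, n = lead & 0x0F, 3
        else:                  # 4 bytes: code points beyond the BMP
            cp, n = lead & 0x07, 4
        # Each continuation byte contributes its low 6 bits.
        for k in range(1, n):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        out.append(cp)
        i += n
    return out
```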


From Anton Ertl@21:1/5 to Thomas Koenig on Sat Jun 1 17:08:53 2024

Thomas Koenig <[email protected]> writes:

I'm not sure that would be the case for text containing some
non-ASCII characters, where you cannot predict branches well
(consider Å, Ø and Æ, which together appear to make up a bit
more than 2.5% according to a random statistic I just
grabbed off the Internet), or ä, ö and ü which have around 1.5%
occurrence together.

In Chinese or Japanese text, I assume the spaces and punctuation
are 7-bit ASCII (are they, actually?) so things would be even
worse for branch prediction.

Branch prediction for what purpose?

A typical usage is processing of csv files containing participant
lists of a course, and the results for the course. The participant
names contain various non-ASCII characters*. The participant names
are usually just copied literally from some inputs to some outputs. I
don't know if the tools I use (awk, join, sort (in the C.utf8 locale)
etc.) do some code point processing, but if they do, it's totally
unnecessary.

In one case I sort on the names to produce reports, so in that case a
different locale and actual knowledge of the characters for collating
order purposes might be a good idea, but none of the report users has
complained yet about the sorting. And the question is which locale
one should use when some names are from Turkey, some from Hungary,
some from Austria, some from Croatia etc.; how would
LC_COLLATE=de_AT.UTF-8 deal with all the characters that don't occur
in German? In any case, sorting in a locale other than C involves
much more (and much more expensive operations) than just code point
recognition. And the actual lion's share of the CPU time spent on
report processing is the conversion from .md to .pdf using pandoc. I
am sure that this would not be measurably faster if there were only
ASCII characters in the .md files.
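
As a small illustration of why sorting in the C locale needs no code
point processing: comparing UTF-8 strings byte by byte gives the same
order as comparing them code point by code point. The names below are
made up for the example.

```python
# Hypothetical participant names, not taken from any real list.
names = ["Zoran", "Ágnes", "Özge", "Čedomir", "Anna", "Ünal"]

# What a bytewise sort of the UTF-8 encoding produces.
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

# Python compares str values code point by code point.
by_code_points = sorted(names)

print(by_bytes)
print(by_bytes == by_code_points)
```

Note that "Zoran" sorts before "Ágnes" in both cases, which is exactly the
kind of result a language-specific collation would change.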

Bottom line: Code point conversion instructions like CU14 solve a
problem imagined by people who have no experience working with UTF-8.

* I have never encountered names containing characters outside the
roman-based alphabets, though, they are probably all romanized at some
earlier administrative step (probably when they register for the
university, or earlier), but Cyrillic, Greek, or CJK characters would
make no difference to the scripts; they would make it harder for
course staff to pronounce the names, though, so it's probably good
that the names are all romanized.

BTW, the biggest problems stem from ASCII characters in names, in
particular " " and "'".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From John Savard@21:1/5 to [email protected] on Sat Jun 1 20:00:48 2024

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig
<[email protected]> wrote:

John Savard <[email protected]d> schrieb:

If they took it to its logical conclusion, and packaged zArchitecture
chips without the ability to run current IBM operating systems in the
same way as POWER chips, I might actually be interested.

Would you like to buy one, then? That would be a large investment
of money and space in your home... but then again, an 18-year old
once bought a z890, see https://www.youtube.com/watch?v=45X4VP8CGtk

Well, when I said "packaged... in the same way as POWER chips", I
meant that they would make systems with fewer CPUs than a mainframe
which were in the category of ordinary desktop computers if they were
to do that... which, of course, they won't.

John Savard


From Thomas Koenig@21:1/5 to John Savard on Sun Jun 2 06:58:37 2024

John Savard <[email protected]d> schrieb:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig
<[email protected]> wrote:

John Savard <[email protected]d> schrieb:

If they took it to its logical conclusion, and packaged zArchitecture
chips without the ability to run current IBM operating systems in the
same way as POWER chips, I might actually be interested.

Would you like to buy one, then? That would be a large investment
of money and space in your home... but then again, an 18-year old
once bought a z890, see https://www.youtube.com/watch?v=45X4VP8CGtk

Well, when I said "packaged... in the same way as POWER chips", I
meant that they would make systems with fewer CPUs than a mainframe
which were in the category of ordinary desktop computers if they were
to do that... which, of course, they won't.

You can buy POWER9 machines from RaptorCS. The command prompt
does not look different from AMD64, but of course the coolness
factor is much higher. (Also the noise level, if you do not order
with soundproofing...)

But maybe you can also run a 360/30 on an FPGA board, somebody
has apparently implemented it in VHDL from the logic diagrams:
https://github.com/ibm2030/IBM2030


From Michael S@21:1/5 to John Levine on Sun Jun 2 12:01:11 2024

On Fri, 31 May 2024 19:44:49 -0000 (UTC)
John Levine <[email protected]> wrote:

According to John Savard <[email protected]d>:

On Thu, 30 May 2024 03:25:14 -0000 (UTC), John Levine
<[email protected]> wrote:

I do not entirely understand why IBM keeps adding special purpose
instructions to z. Maybe it's partly marketing, but they have a
largely captive audience so it has to be more than that.

One possibility is to _keep_ that audience captive even after all the
patents expire that are applicable to machines with the
z/Architecture in its current state, if you are reluctant to believe
that these new instructions genuinely improve performance.

Back in the last millennium there were a bunch of companies that made
clones of IBM mainframes. They all failed. It's the whole ecosystem of
hardware and software, not just individual features or patents, that
keeps the customers.

I have to say I'm somewhat surprised that IBM has put a lot of effort
into running linux on zSeries, since that's about as un-captive as you
can get. I would imagine that for some kinds of heavily threaded
workloads they could be competitive since the z machines have upwards
of a hundred CPUs with a shared mostly consistent cache.

z15 appears to peak at 190/380 user-visible cores/threads.
That's fewer than a quad-socket 56-core Intel Xeon (4 x 56 = 224 cores).
Quad-socket Xeons are much less popular than they were 20 years ago, but
HP/Dell/Lenovo would still sell you one if you insist.
IBM's own Power System E980 can give you a whopping 1536 threads in its
maximal configuration.

Maybe Telum pulls zArch ahead of those. I don't know much about it.


From John Dallman@21:1/5 to John Savard on Sun Jun 2 11:23:00 2024

In article <[email protected]>, [email protected]d (John Savard) wrote:

If they took it to its logical conclusion, and packaged
zArchitecture chips without the ability to run current
IBM operating systems in the same way as POWER chips,
I might actually be interested.

A deskside zSeries machine that would boot and run Linux (probably under
z/VM) reasonably simply would be interesting to me. A big-endian machine
with comprehensive hardware trapping has software QA uses in the current
era of machines that hardly trap on anything apart from SEGV.

I looked into Hercules, but the community for that is mostly interested
in running historical 31-bit MVS and other elderly OSes. Hercules needs
quite a lot of configuration setup, which requires using a lot of IBM
mainframe terminology and concepts, and doesn't supply a configuration
file for Linux. Setting it up is hard if you've never been a mainframe
operator, and the community isn't all that helpful to outsiders.

John


From John Dallman@21:1/5 to All on Sun Jun 2 11:23:00 2024

In article <v3d9bh$s9a$[email protected]>, [email protected] (John Levine)
wrote:

I have to say I'm somewhat surprised that IBM has put a lot of
effort into running linux on zSeries, since that's about as
un-captive as you can get. I would imagine that for some kinds
of heavily threaded workloads they could be competitive since
the z machines have upwards of a hundred CPUs with a shared
mostly consistent cache.

It seems to have been the easiest way to get zSeries used for web serving
and other internet tasks. Getting Linux software running on zSeries that
way is /much/ easier than porting it to z/OS or z/VSE.

John


From Thomas Koenig@21:1/5 to John Dallman on Sun Jun 2 13:32:40 2024

John Dallman <[email protected]> schrieb:

In article <[email protected]>, [email protected]d (John Savard) wrote:

If they took it to its logical conclusion, and packaged
zArchitecture chips without the ability to run current
IBM operating systems in the same way as POWER chips,
I might actually be interested.

A deskside zSeries machine that would boot and run Linux (probably under
z/VM) reasonably simply would be interesting to me. A big-endian machine
with comprehensive hardware trapping has software QA uses in the current
era of machines that hardly trap on anything apart from SEGV.

There are POWER8 machines on sale on eBay, on which you can run
either Linux or AIX, and big-endian too, if you want.

I also have a login shell open on such a machine right now, but that's
on the gcc compile farm, which is only for open-source projects.

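$ cat foo.c
#include <stdio.h>
#include <string.h>

int main()
{
  int a;
  char c;
  a = 0;
  c = 1;
  /* Copy one byte into the lowest-addressed byte of a. */
  memcpy (&a,&c,1);
  /* With a 4-byte int: 16777216 (0x01000000) on big-endian, 1 on little-endian. */
  printf ("%d\n", a);
  return 0;
}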
$ gcc foo.c
$ ./a.out
16777216
$ uname -a
Linux cfarm203 6.0.0-6-powerpc64 #1 SMP Debian 6.0.12-1 (2022-12-09) ppc64 GNU/Linux
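
For anyone without a big-endian box at hand, the same effect can be
sketched in Python. This assumes a 4-byte int, as in the C program above;
the variable names are just for illustration.

```python
import struct
import sys

# The four bytes of `a` after memcpy(&a, &c, 1): first byte 1, rest 0.
raw = bytes([1, 0, 0, 0])

big = int.from_bytes(raw, "big")         # most significant byte first
little = int.from_bytes(raw, "little")   # least significant byte first
native = struct.unpack("=i", raw)[0]     # whatever this machine uses

print(big, little, native, sys.byteorder)
```

The "big" interpretation gives 16777216, matching the output on the
ppc64 machine; the "little" one gives 1.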


From John Dallman@21:1/5 to Koenig on Sun Jun 2 18:29:00 2024

In article <v3hs9o$3c8gd$[email protected]>, [email protected] (Thomas Koenig) wrote:

There are POWER8 machines on sale on eBay, on which you can run
either Linux or AIX, and big-endian too, if you want.

Yup. Considered that. Their trapping is not as comprehensive as zSeries,
and I could not justify them.

John


From John Levine@21:1/5 to All on Sun Jun 2 19:44:57 2024

According to Thomas Koenig <[email protected]>:

But maybe you can also run a 360/30 on an FPGA board, somebody
has apparently implemented it in VHDL from the logic diagrams:
https://github.com/ibm2030/IBM2030

IBM made several S/360 and S/370 add-in boards for PCs. They worked
but were never very popular, probably because nobody bought a
mainframe for the CPU and PC peripherals are underpowered.


From John Levine@21:1/5 to All on Sun Jun 2 19:43:18 2024

According to Anton Ertl <[email protected]>:

Bottom line: Code point conversion instructions like CU14 solve a
problem imagined by people who have no experience working with UTF-8.

The original instructions were CU12 and CU21 which convert between
UTF-8 and UTF-16. That really is useful, e.g., read a file of UTF-8
into a program in Java or JavaScript which uses UTF-16. I agree the
UTF-32 versions added in zSeries are less likely to be useful.
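
As a rough picture of what a UTF-8 to UTF-16 conversion has to do, here
is the same transformation done in software with Python's standard
codecs. The sample string is made up, and this says nothing about how
the hardware instructions are implemented.

```python
text = "naïve 日本 😀"

utf8 = text.encode("utf-8")
# Decode the UTF-8 bytes, then re-encode as big-endian UTF-16 without a BOM.
utf16 = utf8.decode("utf-8").encode("utf-16-be")

print(len(text), len(utf8), len(utf16))
# The emoji is outside the BMP, so UTF-16 needs a surrogate pair for it.
print(utf16[-4:].hex())
```

The 10 code points take 18 bytes in UTF-8 and 22 bytes (11 code units) in
UTF-16; the final four bytes are the surrogate pair D83D DE00.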


From Anton Ertl@21:1/5 to John Dallman on Mon Jun 3 05:47:50 2024

[email protected] (John Dallman) writes:

In article <v3hs9o$3c8gd$[email protected]>, [email protected] (Thomas Koenig) wrote:

There are POWER8 machines on sale on eBay, on which you can run
either Linux or AIX, and big-endian too, if you want.

Yup. Considered that. Their trapping is not as comprehensive as zSeries,
and I could not justify them.

SPARCs are big-endian and trap on unaligned access (at least that was
the case when I last used one long ago), while S/370 ff. does not trap
on unaligned access. What's wrong with SPARC? What other trapping do
you have in mind?

- anton


From Lawrence D'Oliveiro@21:1/5 to BGB on Mon Jun 3 08:06:01 2024

On Fri, 31 May 2024 12:14:19 -0500, BGB wrote:

Though, one thing that makes sense for text editors is if only the
"currently being edited" lines are fully unpacked, whereas the others
can remain in a more compact form (such as UTF-8), and are then unpacked
as they come into view (say, treating the editor window as a 32-entry
modulo cache or similar).
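
A minimal Python sketch of that idea, assuming lines are kept packed as
UTF-8 bytes and unpacked into lists of characters on demand. The class
name, the slot layout and the write-back-on-eviction policy are
assumptions made for this illustration, not a description of any
existing editor.

```python
class UnpackedLineCache:
    """32-entry modulo (direct-mapped) cache of unpacked lines."""

    SLOTS = 32

    def __init__(self, packed_lines):
        self.packed = packed_lines          # list of bytes, one UTF-8 line each
        self.slots = [None] * self.SLOTS    # each slot: (line_no, list of chars)

    def get(self, line_no):
        """Return the editable, unpacked form of a line, unpacking on a miss."""
        slot = line_no % self.SLOTS
        entry = self.slots[slot]
        if entry is None or entry[0] != line_no:
            self.flush(slot)                # evict whatever line lived here
            entry = (line_no, list(self.packed[line_no].decode("utf-8")))
            self.slots[slot] = entry
        return entry[1]

    def flush(self, slot):
        """Repack a cached line to UTF-8 and free its slot."""
        entry = self.slots[slot]
        if entry is not None:
            line_no, chars = entry
            self.packed[line_no] = "".join(chars).encode("utf-8")
            self.slots[slot] = None
```

Lines 3 and 35 share a slot here, so touching line 35 repacks line 3;
that is the price of the simple modulo mapping.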

That may make sense if you are implementing a *text* editor, like the
vi/vim family. Remember that Emacs is usable for editing things other
than text.


From Lawrence D'Oliveiro@21:1/5 to BGB on Mon Jun 3 08:07:44 2024

On Fri, 31 May 2024 12:55:59 -0500, BGB wrote:

On 5/31/2024 12:21 PM, MitchAlsup1 wrote:

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
..} along with text from different fonts and different backgrounds on a
per character basis.

Errm, I think we call this a word processor, not a text editor.

Emacs has things called “text attributes” and “overlays”, for doing
precisely this sort of thing. You can even use these things to define
clickable buttons. Yet nobody would call Emacs a “word processor”.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Jun 3 08:11:10 2024

On Thu, 30 May 2024 12:47:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Wed, 29 May 2024 08:20:03 GMT, Anton Ertl wrote:

In UTF-32 a character is a sequence of (32-bit) code units.
In UTF-8 a character is a sequence of (8-bit) code units.

The point being, there is a 1:1 correspondence between the two
representations of the same characters/code points. So your claim that
use of one is somehow a “mistake” while the other is not, is spurious.

If the data you are working on is provided in files containing UTF-8,
conversion to UTF-32 does not provide any benefits and is therefore an
unnecessary complication, and therefore a mistake.

Assuming it does not provide any benefits is the mistake.
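
Both sides of this exchange can be made concrete with a short Python
sketch. The sample string and the two helper functions are made up for
illustration: the two encodings round-trip to the same code points, and
what differs is how the n-th code point is located.

```python
def nth_code_point_utf32(buf: bytes, n: int) -> str:
    """UTF-32: every code point is 4 bytes, so the n-th sits at a fixed offset."""
    return chr(int.from_bytes(buf[4 * n:4 * n + 4], "big"))


def nth_code_point_utf8(buf: bytes, n: int) -> str:
    """UTF-8: scan for lead bytes (anything not of the form 10xxxxxx)."""
    starts = [i for i, b in enumerate(buf) if b & 0xC0 != 0x80]
    end = starts[n + 1] if n + 1 < len(starts) else len(buf)
    return buf[starts[n]:end].decode("utf-8")


text = "Grüße, 世界"
utf8 = text.encode("utf-8")        # variable width
utf32 = text.encode("utf-32-be")   # fixed width, big-endian, no BOM

print(len(utf8), len(utf32))
print(nth_code_point_utf8(utf8, 7), nth_code_point_utf32(utf32, 7))
```

Whether fixed-offset access to code points is worth more than double the
storage depends on whether the program ever needs the n-th code point
at all.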


From Lawrence D'Oliveiro@21:1/5 to moi on Mon Jun 3 08:10:10 2024

On Thu, 30 May 2024 19:01:11 +0100, moi wrote:

On 30/05/2024 03:43, Lawrence D'Oliveiro wrote:

On Wed, 29 May 2024 08:32:17 +0100, moi wrote:

I hate user agents like wget, which is why I block them.

Which is completely futile, which is why it’s so stupid to do.

What a know-all you are. And offensive with it.

You find it offensive that your block is so easy to bypass?

Sucks to be you.


From Lawrence D'Oliveiro@21:1/5 to John Savard on Mon Jun 3 08:16:06 2024

On Thu, 30 May 2024 22:22:34 -0600, John Savard wrote:

And then there was the LISP machine, which started life with the
infamous "Space Cadet" computer.

“Space Cadet” keyboard, you mean?
<https://www.deviantart.com/default-cube/art/Space-Cadet-Keyboard-650629356>
(my exercise in recreating it).


From Lawrence D'Oliveiro@21:1/5 to John Levine on Mon Jun 3 08:20:48 2024

On Thu, 30 May 2024 14:42:14 -0000 (UTC), John Levine wrote:

The condition code tells you which it was. If it was an interrupt, you
just branch back and keep going.

Does it really hurt performance for the CPU to keep track of the fact that
an instruction has to be restarted after an interrupt?

On the old VAX, there was a processor status bit called “First Part Done”,
which was used for interruptible instructions. When an interrupt happened
with such an instruction, the PC was not incremented past the instruction;
instead, the saved PC pointed back at the instruction itself, while the
saved processor status had the FPD bit set.

So on a return from the interrupt, the CPU knew not to redo the
instruction setup, but just continue executing the instruction from the
current register state.


From John Dallman@21:1/5 to Anton Ertl on Mon Jun 3 09:20:00 2024

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

SPARCs are big-endian and trap on unaligned access (at least that
was the case when I last used one long ago), while S/370 ff. does
not trap on unaligned access.

OK, that shoots down S/370 for this job.

John


From Michael S@21:1/5 to John Dallman on Mon Jun 3 13:08:21 2024

On Mon, 3 Jun 2024 09:20 +0100 (BST)
[email protected] (John Dallman) wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

SPARCs are big-endian and trap on unaligned access (at least that
was the case when I last used one long ago), while S/370 ff. does
not trap on unaligned access.

OK, that shoots down S/370 for this job.

John

What exactly is the job?
Is it for pure personal amusement, or are there practical needs?


From John Levine@21:1/5 to All on Mon Jun 3 10:49:34 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Thu, 30 May 2024 14:42:14 -0000 (UTC), John Levine wrote:

The condition code tells you which it was. If it was an interrupt, you
just branch back and keep going.

Does it really hurt performance for the CPU to keep track of the fact that
an instruction has to be restarted after an interrupt?

I should have been clearer, it's not just an interrupt. The CPU does
some maximum amount of work for the instruction, and sets the
condition code if it didn't do the whole string. Maybe it was an
interrupt, maybe it just hit the limit. Many other instructions that
process long chunks of data work the same way.
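
The resulting software pattern can be modelled with a short Python
sketch. Everything here is a stand-in: copy_some plays the role of one
execution of such an instruction, max_units is an arbitrary work limit,
and the condition-code values are placeholders chosen for this example.

```python
CC_DONE, CC_PARTIAL = 0, 3   # placeholder condition-code values for this sketch


def copy_some(dst, src, pos, max_units=4):
    """Model one execution: process at most max_units bytes, then report status.

    Returns the updated position (the "registers") and a condition code.
    """
    end = min(pos + max_units, len(src))
    dst[pos:end] = src[pos:end]
    return end, (CC_DONE if end == len(src) else CC_PARTIAL)


def copy_all(src):
    """The caller's loop: re-execute until the condition code says done."""
    dst = bytearray(len(src))
    pos = 0
    executions = 0
    while True:
        pos, cc = copy_some(dst, src, pos)
        executions += 1
        if cc == CC_DONE:
            return bytes(dst), executions
        # cc == CC_PARTIAL: branch back and keep going from the updated position.
```

The caller does not need to know whether a partial result came from an
interrupt or from the work limit; the updated position carries all the
state needed to resume.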

On the old VAX, there was a processor status bit called “First Part Done”,

Actually that was the PDP-6 and -10 for the byte instructions.


From moi@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 12:50:47 2024

On 03/06/2024 09:10, Lawrence D'Oliveiro wrote:

On Thu, 30 May 2024 19:01:11 +0100, moi wrote:

On 30/05/2024 03:43, Lawrence D'Oliveiro wrote:

On Wed, 29 May 2024 08:32:17 +0100, moi wrote:

I hate user agents like wget, which is why I block them.

Which is completely futile, which is why it’s so stupid to do.

What a know-all you are. And offensive with it.

You find it offensive that your block is so easy to bypass?

Sucks to be you.

You just cannot help yourself, can you? I am sorry for you,
but my tolerance has limits and you have just passed them.

--
Bill F.


From John Levine@21:1/5 to All on Mon Jun 3 13:46:05 2024

According to OrangeFish <[email protected]d>:

On 2024-06-02 15:44, John Levine wrote:

IBM made several S/360 and S/370 add-in boards for PCs. They worked
but were never very popular, probably because nobody bought a
mainframe for the CPU and PC peripherals are underpowered.

Were they not marketed as a way of developing s/w on a PC without
chewing up mainframe time?

I heard it was software licensing. You were allowed to run stuff on
your PC/360 without paying for an extra seat as you would if you were
using a mainframe terminal.

They still weren't very popular, even though they were technically quite
clever.

From OrangeFish@21:1/5 to John Levine on Mon Jun 3 09:29:25 2024

On 2024-06-02 15:44, John Levine wrote:

IBM made several S/360 and S/370 add-in boards for PCs. They worked
but were never very popular, probably because nobody bought a
mainframe for the CPU and PC peripherals are underpowered.

Were they not marketed as a way of developing s/w on a PC without
chewing up mainframe time?

OF


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 14:13:10 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 30 May 2024 14:42:14 -0000 (UTC), John Levine wrote:

The condition code tells you which it was. If it was an interrupt, you
just branch back and keep going.

Does it really hurt performance for the CPU to keep track of the fact that
an instruction has to be restarted after an interrupt?

Yes, of course. And it complicates the design, which makes it harder
to verify, particularly for an out-of-order design.


From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Jun 3 16:36:37 2024

Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 30 May 2024 14:42:14 -0000 (UTC), John Levine wrote:

The condition code tells you which it was. If it was an interrupt, you
just branch back and keep going.

Does it really hurt performance for the CPU to keep track of the fact that
an instruction has to be restarted after an interrupt?

It is already a requirement that we have precise interrupts. Those
requirements impose that the unfinished instruction is pointed at by IP
on return.

Yes, of course. And it complicates the design, which makes it harder
to verify, particularly for an out-of-order design.

If you can back up mispredicted branches, you have all the OoO HW to
restart a long running instruction.


From Josh Vanderhoof@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 18:50:28 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 31 May 2024 12:55:59 -0500, BGB wrote:

On 5/31/2024 12:21 PM, MitchAlsup1 wrote:

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
..} along with text from different fonts and different backgrounds on a
per character basis.

Errm, I think we call this a word processor, not a text editor.

Emacs has things called “text attributes” and “overlays”, for doing
precisely this sort of thing. You can even use these things to define
clickable buttons. Yet nobody would call Emacs a “word processor”.

RMS did call it a word processor.

https://lists.gnu.org/archive/html/emacs-devel/2013-11/msg00515.html


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Tue Jun 4 01:21:56 2024

On Thu, 30 May 2024 13:41:58 -0000 (UTC), Thomas Koenig wrote:

Anton Ertl <[email protected]> schrieb:

The fact that these features provide no actual benefit is their best
property:

No actual benefit?

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads and
can support your claims with benchmarks.

We already know the answer to that. It’s why RISC has taken over the computing world.

Remember that “mainframe workloads” are primarily I/O bound, not
CPU-bound. The whole concept of a “mainframe” arose in the era when CPU
time was scarce and expensive, so you had all these intelligent I/O
peripherals that could be given sequences of operations to perform, with
minimal CPU intervention. It was all about maximizing throughput (batch
operation), not minimizing latency (interactive operation).

Nowadays, the whole concept is obsolete. So the only thing keeping it a
viable business has to be marketing, not technical, reasons.


From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Jun 4 01:26:20 2024

On Fri, 31 May 2024 19:44:49 -0000 (UTC), John Levine wrote:

Back in the last millennium there were a bunch of companies that made
clones of IBM mainframes. They all failed.

They didn’t “fail” as such. Companies like Amdahl and Wang were able to maintain profitable businesses for quite a few years, decades even. And
then there were other entirely separate companies, like CDC where Seymour
Cray invented the concept of the “supercomputer”, much to the surprise of his upper management who just wanted to sell “business” machines.

All the mainframe companies apart from IBM eventually went out of business because the whole mainframe concept is obsolete. The only reason IBM is
still going is because it was able to muster more marketing clout than all
its competitors put together. But even that part of its business is in
decline. The only part of the company currently making money would be its
Red Hat acquisition. The rest will eventually wither away.


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Tue Jun 4 01:30:49 2024

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like Google achieve essentially 0% downtime? By running a swarm of half a million
commodity servers, that’s how. Every part has been built to the lowest
cost, except the power supply. And they discovered they can run their data centres a little hot, to save on cooling costs, at the expense of a
slightly higher failure rate. Because if a few thousand servers are down
at any particular time, none of their users even notices.


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Jun 4 01:32:58 2024

Lawrence D'Oliveiro wrote:

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads and
can support your claims with benchmarks.

We already know the answer to that. It’s why RISC has taken over the computing world.

Oh Wait !?!

Remember that “mainframe workloads” are primarily I/O bound, not
CPU-bound. The whole concept of a “mainframe” arose in the era when CPU
time was scarce and expensive, so you had all these intelligent I/O
peripherals that could be given sequences of operations to perform, with
minimal CPU intervention. It was all about maximizing throughput (batch
operation), not minimizing latency (interactive operation).

One of the reasons those CPUs were microcoded was to allow I/O
activities to have 50% of the compute power and 50% of the memory
bandwidth. Thus, from one set of HW logic one got 2 different computers,
one designed for COBOL, the other designed for I/O (of that era), sharing
the same expensive lump of circuits.

Nowadays, the whole concept is obsolete. So the only thing keeping it a
viable business has to be marketing, not technical, reasons.

Microcode that "runs the instruction pipeline" is obsolete. And if
anyone slogged through the Nick Tredennick book they would understand why.

Microcode is still viable at the function unit level in converting FMUL
logic into performing FDIV and SQRT calculations at low added cost.
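
For readers unfamiliar with the trick, here is a minimal sketch of how a divide can be built out of multiply steps, assuming Newton-Raphson iteration on the reciprocal. The function name nr_divide, the linear seed, and the iteration count are illustrative assumptions, not a description of any particular FPU's microcode (real units typically seed from a small lookup table and may use Goldschmidt's method instead).

```python
import math


def nr_divide(a: float, b: float, iterations: int = 5) -> float:
    """Approximate a / b using multiplies and subtracts only (hypothetical sketch).

    The divisor is scaled into [0.5, 1), a reciprocal estimate is refined
    with x = x * (2 - m * x), and the result is rescaled by a power of two.
    """
    if b == 0.0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    a, b = abs(a), abs(b)

    # b = m * 2**e with 0.5 <= m < 1 (exponent handling is cheap in hardware).
    m, e = math.frexp(b)

    # Linear seed for 1/m on [0.5, 1). The two constants would be
    # precomputed; they are written as quotients here only for readability.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m

    # Each step roughly doubles the number of correct bits.
    for _ in range(iterations):
        x = x * (2.0 - m * x)

    # a / b = a * (1/m) * 2**-e
    return sign * math.ldexp(a * x, -e)


if __name__ == "__main__":
    print(nr_divide(1.0, 3.0), 1.0 / 3.0)
```

The result is close to, but not guaranteed bit-identical with, a correctly rounded IEEE 754 divide; getting the final rounding right is where real implementations spend extra effort.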


From Lawrence D'Oliveiro@21:1/5 to Josh Vanderhoof on Tue Jun 4 01:36:52 2024

On Mon, 03 Jun 2024 18:50:28 -0400, Josh Vanderhoof wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 31 May 2024 12:55:59 -0500, BGB wrote:

On 5/31/2024 12:21 PM, MitchAlsup1 wrote:

In a modern text editor, one can paste in {*.xls tables, *.jpg,
*.gif, ..} along with text from different fonts and different
backgrounds on a per character basis.

Errm, I think we call this a word processor, not a text editor.

Emacs has things called “text attributes” and “overlays”, for doing
precisely this sort of thing. You can even use these things to define
clickable buttons. Yet nobody would call Emacs a “word processor”.

RMS did call it a word processor.

https://lists.gnu.org/archive/html/emacs-devel/2013-11/msg00515.html

No he didn’t: “more features are still needed” to “extend Emacs to do WYSIWYG word processing”. So he admits it’s not doing that yet.


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Tue Jun 4 01:51:31 2024

On Sun, 2 Jun 2024 06:58:37 -0000 (UTC), Thomas Koenig wrote:

You can buy POWER9 machines from RaptorCS. The command prompt does not
look different from AMD64, but of course the coolness factor is much
higher.

Linux is Linux. There was an article on theinquirer.net (defunct now) some years ago where a guy from SGI was giving an interactive demo (remotely,
via SSH) on a thousand-core Altix super. It still looked like a Linux
system. Though commands like “lspci” and “lscpu” produced output that went
on ... and on ... and on ...


From Lynn Wheeler@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 16:42:17 2024

Lawrence D'Oliveiro <[email protected]d> writes:

Remember that “mainframe workloads” are primarily I/O bound, not
CPU-bound. The whole concept of a “mainframe” arose in the era when CPU
time was scarce and expensive, so you had all these intelligent I/O
peripherals that could be given sequences of operations to perform, with
minimal CPU intervention. It was all about maximizing throughput (batch
operation), not minimizing latency (interactive operation).

1980, I was con'ed into helping IBM STL lab that was overcrowded and
moving 300 people to offsite bldg with dataprocessing back to STL
datacenter. I was asked to do channel-extender support to place "local"
channel "attached" controllers at the remote bldg (cutting various
protocol round-trip latencies). Part of the issue was that the 60s-era
mainframe also had limited memory, and so there are enormous numbers of
protocol round-trips utilizing data back in mainframe memory.

1988, local IBM branch asks if I could help LLNL national lab
standardize some serial stuff they had been playing ... which quickly
becomes fibre-channel standard (FCS). Some time later, some IBM
engineers become involved with FCS and define a heavy-weight protocol
that radically cuts the native throughput ... which was eventually
released as FICON (used for mainframe I/O, w/extensive protocol
round-trip latencies, significant impact for even short distances at
gbit rates).

Most recent public benchmark that I've found is IBM z196 "Peak I/O"
benchmark which had 104 FICON (running over 104 FCS) getting 2M IOPS.
About the same time, a native FCS was announced for E5-2600 blades
claiming over a million IOPS (two such FCS have higher throughput than
104 FICON running over 104 FCS). Note IBM docs recommend that SAPs
(CPUs dedicated for running I/O) be kept to no more than 70% CPU
... which would be more like 1.5M (rather than 2M) IOPS.
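
To make the comparison concrete, the arithmetic can be sketched as below. The figures are the ones quoted in this post (2M IOPS over 104 FICON, "over a million" IOPS for one native FCS, a 70% SAP ceiling); the constant and variable names are assumptions made up for illustration.

```python
# Figures as quoted in the post; names are hypothetical.
Z196_PEAK_IOPS = 2_000_000      # z196 "Peak I/O" benchmark
FICON_CHANNELS = 104            # FICON channels used in that benchmark
NATIVE_FCS_IOPS = 1_000_000     # lower bound claimed for one native FCS
SAP_CAP_PERCENT = 70            # recommended ceiling on SAP utilization

# Average throughput per FICON channel in the benchmark.
per_ficon_iops = Z196_PEAK_IOPS / FICON_CHANNELS

# How many FICON channels one native FCS is worth at that rate.
ficon_per_native_fcs = NATIVE_FCS_IOPS / per_ficon_iops

# Benchmark figure derated to the recommended SAP utilization.
derated_iops = Z196_PEAK_IOPS * SAP_CAP_PERCENT // 100

print(f"per FICON channel : {per_ficon_iops:,.0f} IOPS")
print(f"one native FCS    : about {ficon_per_native_fcs:.0f} FICON channels")
print(f"at 70% SAP load   : {derated_iops:,} IOPS")
```

A straight 70% of 2M works out to 1.4M IOPS, in the same ballpark as the "more like 1.5M" figure above.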

--
virtualization experience starting Jan1968, online at home since Mar1970


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Jun 4 13:11:27 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 31 May 2024 19:44:49 -0000 (UTC), John Levine wrote:

All the mainframe companies apart from IBM eventually went out of business
because the whole mainframe concept is obsolete.

Is that a fact?

https://www.unisys.com/


From John Levine@21:1/5 to All on Tue Jun 4 14:18:56 2024

According to Scott Lurndal <[email protected]>:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 31 May 2024 19:44:49 -0000 (UTC), John Levine wrote:

All the mainframe companies apart from IBM eventually went out of business
because the whole mainframe concept is obsolete.

Is that a fact?

https://www.unisys.com/

I think that IBM is the only one that still makes CPUs. Aren't the
Unisys machines all emulated on commodity microprocessors now?

That doesn't keep them from working perfectly well, of course.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Scott Lurndal@21:1/5 to John Levine on Tue Jun 4 14:55:39 2024

John Levine <[email protected]> writes:

According to Scott Lurndal <[email protected]>:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 31 May 2024 19:44:49 -0000 (UTC), John Levine wrote:

All the mainframe companies apart from IBM eventually went out of business
because the whole mainframe concept is obsolete.

Is that a fact?

https://www.unisys.com/

I think that IBM is the only one that still makes CPUs. Aren't the
Unisys machines all emulated on commodity microprocessors now?

Yes, although many of the custom CMOS systems are still operational.

That doesn't keep them from working perfectly well, of course.

Indeed.


From Stefan Monnier@21:1/5 to All on Tue Jun 4 16:03:39 2024

For text editors, this is one of the few cases it makes sense to use 32 or
64 bit characters (say, combining the 'character' with some additional metadata such as formatting).

Even just 64bit is very tight to encode all the information in an emoji.

Though, one thing that makes sense for text editors is if only the
"currently being edited" lines are fully unpacked, whereas the others can remain in a more compact form (such as UTF-8), and are then unpacked as they come into view (say, treating the editor window as a 32-entry modulo cache
or similar).

You sufficiently rarely need to care about "character boundaries" that
such encoding/decoding is probably not worthwhile (especially if you
consider the case of multi-MB lines).

It's easy enough to move through UTF-8 itself.
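
A minimal sketch of what "moving through UTF-8" can look like, assuming the buffer holds valid UTF-8 and the index already sits on a character boundary. The helper names are made up for illustration. The only fact relied on is that continuation bytes have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def next_boundary(buf: bytes, i: int) -> int:
    """Byte index of the code point after the one starting at i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


def prev_boundary(buf: bytes, i: int) -> int:
    """Byte index of the code point before the boundary at i (i > 0)."""
    i -= 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


if __name__ == "__main__":
    buf = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1-, 2-, 3- and 4-byte sequences
    i = 0
    while i < len(buf):
        j = next_boundary(buf, i)
        print(i, buf[i:j].decode("utf-8"))
        i = j
```

Note this steps over code points, not user-perceived characters; grapheme clusters (combining marks, emoji sequences) need more than this.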

Not entirely sure how other text editors manage things here, not really looked into it.

Several different options.
Emacs uses a gap buffer, which is a quite primitive approach which in
theory has poor worst case behavior but works surprisingly well in
practice (especially with the speed at which current CPUs can copy/move
large chunks of memory).
Others use structures like ropes.

https://coredumped.dev/2023/08/09/text-showdown-gap-buffers-vs-ropes/
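
For anyone who has not met the data structure, here is a toy gap buffer, written as a sketch only. It is not how Emacs implements it (Emacs is written in C and works on bytes); the class and method names are assumptions for illustration. The idea is to keep the unused space at the edit point, so that typing there costs nothing and only a change of edit position has to copy text.

```python
class GapBuffer:
    """Toy gap buffer over a list of characters (illustrative sketch)."""

    def __init__(self, text: str = "", gap_size: int = 16) -> None:
        self._buf = list(text) + [None] * gap_size
        self._gap_start = len(text)      # first free slot
        self._gap_end = len(self._buf)   # one past the last free slot

    def __len__(self) -> int:
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos: int) -> None:
        """Slide the gap so that it starts at text position pos."""
        if not 0 <= pos <= len(self):
            raise IndexError("position out of range")
        while self._gap_start > pos:     # move gap left: shift text right
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:     # move gap right: shift text left
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self, needed: int) -> None:
        """Enlarge the gap by at least `needed` slots."""
        extra = max(needed, len(self._buf))
        self._buf[self._gap_end:self._gap_end] = [None] * extra
        self._gap_end += extra

    def insert(self, pos: int, text: str) -> None:
        self._move_gap(pos)
        if self._gap_end - self._gap_start < len(text):
            self._grow(len(text))
        for ch in text:
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def delete(self, pos: int, count: int) -> None:
        """Delete up to `count` characters starting at pos."""
        self._move_gap(pos)
        self._gap_end = min(self._gap_end + count, len(self._buf))

    def text(self) -> str:
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])


if __name__ == "__main__":
    g = GapBuffer("hello world", gap_size=4)
    g.insert(5, ",")
    print(g.text())
```

The poor worst case mentioned above shows up in _move_gap: jumping from one end of a large buffer to the other copies everything in between.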

Stefan


From Stefan Monnier@21:1/5 to All on Tue Jun 4 16:58:06 2024

Bottom line: Code point conversion instructions like CU14 solve a
problem that people imagine who have no experience working with UTF-8.

The original instructions were CU12 and CU21 which convert between
UTF-8 and UTF-16. That really is useful, e.g., read a file of UTF-8
into a program in Java or Javascript which uses UTF-16. I agree the
UTF-32 versions added in zseries are less likely to be useful.

It's all really instances of the same: conversion between UTF-N1 and
UTF-N2 is only ever worthwhile if you receive something using UTF-N1
and you have to return something that uses UTF-N2.

If your task is described at a higher level and you're not constrained
by some arbitrary choices in intermediate APIs then you're almost always
better off working straight from the encoding you receive.
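
As a small illustration of working straight from the received encoding, the sketch below splits a UTF-8 record on a separator without decoding it first. The function name is an assumption for illustration. It relies on UTF-8 being self-synchronizing: in valid UTF-8, the encoding of one character never appears inside the encoding of another, so a byte-level search cannot match in the middle of a character.

```python
def split_utf8_fields(raw: bytes, sep: str = ",") -> list[bytes]:
    """Split a valid UTF-8 byte string on `sep` without decoding it."""
    return raw.split(sep.encode("utf-8"))


if __name__ == "__main__":
    raw = "na\u00efve,caf\u00e9,\u65e5\u672c\u8a9e".encode("utf-8")
    for field in split_utf8_fields(raw):
        # Decoding here is only for display; the split itself never needed it.
        print(len(field), field.decode("utf-8"))
```

The fields come back as byte slices, which can be passed along or written out unchanged; no UTF-16 or UTF-32 intermediate form is created.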

Stefan


From Stefan Monnier@21:1/5 to All on Tue Jun 4 17:01:16 2024

If you make such a strong statement, I assume that you have done a
thorough analysis of this feature for typical mainframe workloads and
can support your claims with benchmarks.

We already know the answer to that. It’s why RISC has taken over the computing world.

🙂

Stefan


From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Tue Jun 4 23:30:45 2024

On Tue, 04 Jun 2024 16:03:39 -0400, Stefan Monnier wrote:

Emacs uses a gap buffer, which is a quite primitive approach which in
theory has poor worst case behavior but works surprisingly well in
practice (especially with the speed at which current CPUs can copy/move
large chunks of memory).
Others use structures like ropes.

https://coredumped.dev/2023/08/09/text-showdown-gap-buffers-vs-ropes/

Interesting. Most of the article seems to be about constructing
benchmarks, measuring them, discovering that gap buffers are just as good
if not the best, and then trying to handwave that away before rinsing and repeating.


From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Tue Jun 4 23:25:18 2024

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Jun 4 23:56:46 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Tandem and Stratus did it three decades ago.


From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Wed Jun 5 04:10:52 2024

On Tue, 04 Jun 2024 23:56:46 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Tandem and Stratus did it three decades ago.

At a high cost.


From John Dallman@21:1/5 to Michael S on Wed Jun 5 09:40:00 2024

In article <[email protected]>, [email protected] (Michael S) wrote:

[email protected] (John Dallman) wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

SPARCs are big-endian and trap on unaligned access (at least
that was the case when I last used one long ago), while S/370
ff. does not trap on unaligned access.

OK, that shoots down S/370 for this job.

What exactly is the job?
Is it for pure personal amusement, or are there practical needs?

I would like to keep testing the commercial product I work on in a
big-endian, alignment-trapping environment. However, there isn't much
budget available for this. We have a SPARC box doing it, left over from
when we actually supported Solaris, but as testing grows, its CPU power
becomes less adequate for the job.

New SPARC boxes are expensive, dealing with Oracle is hard work, and the architecture has no future.

I've never been very serious about using Linux on IBM Z for this - it's expensive and dealing with IBM is hard work, although the architecture
still seems to have a future - but if it doesn't trap misaligned accesses,
it's disqualified.

John


From John Dallman@21:1/5 to D'Oliveiro on Wed Jun 5 09:40:00 2024

In article <v3lqfs$48om$[email protected]>, [email protected]d (Lawrence
D'Oliveiro) wrote:

. . . there were other entirely separate companies, like CDC
where Seymour Cray invented the concept of the _supercomputer_,
much to the surprise of his upper management who just wanted
to sell _business_ machines.

Another view is that the supercomputer was implicit in the needs of the
US nuclear weapons laboratories to do simulations of their designs.
Computers are much cheaper than nuclear testing.

John


From Michael S@21:1/5 to John Dallman on Wed Jun 5 12:49:57 2024

On Wed, 5 Jun 2024 09:40 +0100 (BST)
[email protected] (John Dallman) wrote:

In article <[email protected]>,
[email protected] (Michael S) wrote:

[email protected] (John Dallman) wrote:

In article <[email protected]>, [email protected] (Anton Ertl) wrote:

SPARCs are big-endian and trap on unaligned access (at least
that was the case when I last used one long ago), while S/370
ff. does not trap on unaligned access.

OK, that shoots down S/370 for this job.

What exactly is the job?
Is it for pure personal amusement, or are there practical needs?

I would like to keep testing the commercial product I work on in a big-endian, alignment-trapping environment.

Maybe now is the time to stop wanting to keep it?
If I were you, I'd stop caring not only about a big-endian
alignment-trapping environment, but about any alignment-trapping
environment.

However, there isn't much
budget available for this. We have a SPARC box doing it, left over
from when we actually supported Solaris, but as testing grows, its
CPU power becomes less adequate for the job.

New SPARC boxes are expensive, dealing with Oracle is hard work, and
the architecture has no future.

I've never been very serious about using Linux on IBM Z for this -
it's expensive and dealing with IBM is hard work, although the
architecture still seems to have a future - but if it doesn't trap
misaligned accesses, it's disqualified.

John

One of the reasons it has a future is because it doesn't trap
misaligned accesses.


From Michael S@21:1/5 to John Dallman on Wed Jun 5 13:07:39 2024

On Wed, 5 Jun 2024 09:40 +0100 (BST)
[email protected] (John Dallman) wrote:

In article <v3lqfs$48om$[email protected]>, [email protected]d (Lawrence D'Oliveiro) wrote:

. . . there were other entirely separate companies, like CDC
where Seymour Cray invented the concept of the _supercomputer_,
much to the surprise of his upper management who just wanted
to sell _business_ machines.

Another view is that the supercomputer was implicit in the needs of
the US nuclear weapons laboratories to do simulations of their
designs. Computers are much cheaper than nuclear testing.

John

Another view is that Lawrence D'Oliveiro made it up.
Reading the Wikipedia article, it looks like CDC never had much of a
"business machines" business. What they had were a "business machine
peripherals" business and a government/scientific machines business. Also
they offered public cloud services, but that part of the company was
losing money earned by other divisions.


From Anton Ertl@21:1/5 to John Dallman on Wed Jun 5 10:32:25 2024

[email protected] (John Dallman) writes:

I would like to keep testing the commercial product I work on in a
big-endian, alignment-trapping environment.

Computer architecture exhibits convergence. Starting in the 1960s it
converged on byte addressing with 8-bit bytes and on 2s-complement,
starting in the 1980s it converged on IEEE FP, and ending in the 2010s
it converged on supporting unaligned accesses and on little-endian
byte order. Your difficulties in getting hardware for testing whether
software can work with alignment restrictions and with big-endian byte
order are a result of that convergence. Maybe your desire to keep your
software ready for big-endian hardware and hardware with alignment
restrictions is misguided.

New SPARC boxes are expensive, dealing with Oracle is hard work, and the
architecture has no future.

Ebay?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Scott Lurndal@21:1/5 to Michael S on Wed Jun 5 13:20:16 2024

Michael S <[email protected]> writes:

On Wed, 5 Jun 2024 09:40 +0100 (BST)
[email protected] (John Dallman) wrote:

In article <[email protected]>,
[email protected] (Michael S) wrote:

[email protected] (John Dallman) wrote:

In article <[email protected]>,
[email protected] (Anton Ertl) wrote:

SPARCs are big-endian and trap on unaligned access (at least
that was the case when I last used one long ago), while S/370
ff. does not trap on unaligned access.

OK, that shoots down S/370 for this job.

What exactly is the job?
Is it for pure personal amusement, or are there practical needs?

I would like to keep testing the commercial product I work on in a
big-endian, alignment-trapping environment.

Maybe now is the time to stop wanting to keep it?

Or he can use an ARM64 chip. They can be configured to
trap all unaligned accesses and can be configured to run
in big-endian.

It's pretty easy to build a big-endian linux for it.


From MitchAlsup1@21:1/5 to Anton Ertl on Wed Jun 5 17:00:09 2024

Anton Ertl wrote:

[email protected] (John Dallman) writes:

I would like to keep testing the commercial product I work on in a
big-endian, alignment-trapping environment.

Computer architecture exhibits convergence. Starting in the 1960s it converged on byte addressing with 8-bit bytes and on 2s-complement,
starting in the 1980s it converged on IEEE FP, and ending in the 2010s

Although we did not converge on doing denorms properly until the mid
2000s.

GPUs followed a more meandering path: starting out with crappy but
fast FP, then adopting IEEE containers, then over several generations
adopting more and more of IEEE 754 semantics.

Then there are the SW (and a few HW) holdouts that still believe that
denorms are hard/slow and we need mechanisms to flush them from the
numerics. No, we don't, we need circuitry where denorms are not slower
than norms without having slowed down the norms.

it converged on supporting unaligned accesses and on little-endian
byte order. Your difficulties in getting hardware for testing whether
software can work with alignment restrictions and with big-endian byte
order are a result of that convergence. Maybe your desire to keep your
software ready for big-endian hardware and hardware with alignment
restrictions is misguided.

New SPARC boxes are expensive, dealing with Oracle is hard work, and the
architecture has no future.

Ebay?

- anton


From Lawrence D'Oliveiro@21:1/5 to John Dallman on Thu Jun 6 01:23:57 2024

On Wed, 5 Jun 2024 09:40 +0100 (BST), John Dallman wrote:

Another view is that the supercomputer was implicit in the needs of the
US nuclear weapons laboratories to do simulations of their designs.

And in code cracking. All very much a function of the Cold War.

No coincidence that Cray’s fortunes took a downturn when that ended.


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Jun 6 02:42:20 2024

Lawrence D'Oliveiro wrote:

On Wed, 5 Jun 2024 09:40 +0100 (BST), John Dallman wrote:

Another view is that the supercomputer was implicit in the needs of the
US nuclear weapons laboratories to do simulations of their designs.

And in code cracking. All very much a function of the Cold War.

No coincidence that Cray’s fortunes took a downturn when that ended.

Cray sold the first CRAY-1 for $60M; this is what the nuclear physicists
could afford; writing off the entire development costs.


From George Neuner@21:1/5 to [email protected] on Thu Jun 6 00:42:43 2024

On Tue, 4 Jun 2024 23:25:18 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Can't find it now and don't remember many details, but ...

A long time ago, there was a story going around about Microsoft vs IBM regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.


From Lawrence D'Oliveiro@21:1/5 to All on Thu Jun 6 07:50:48 2024

On Thu, 6 Jun 2024 02:42:20 +0000, MitchAlsup1 wrote:

Cray sold the first CRAY-1 for $60M; this is what the nuclear physicists
could afford; writing off the entire development costs.

I think the Cray-1 line was the only product family from Cray Research/
Cray Computer that made money. I don’t think the Cray-2 machines were profitable; only two (I think) Cray-3 units were built, and Seymour gave
them both away; and the Cray-4 was never finished.


From John Levine@21:1/5 to All on Thu Jun 6 11:55:22 2024

According to George Neuner <[email protected]>:

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Can't find it now and don't remember many details, but ...

A long time ago, there was a story going around about Microsoft vs IBM
regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.

It depends on what you want to do.

If you're doing something that is mostly read-only and easy to
parallelize, then it makes sense to use a farm of cheap PCs. But if
you are a bank or an airline, you need to be able to lock your
database so that you debit a bank account or sell a plane seat exactly
once. There is a rule of thumb that the cost of locking something
grows roughly as the square of the number of things contending for
the lock.
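
To put numbers on that rule of thumb, here is a toy model only: it assumes the cost is exactly proportional to the square of the number of contenders, which is a simplification of "grows roughly as the square". The function name is made up for illustration.

```python
def relative_lock_cost(contenders: int) -> int:
    """Locking cost relative to a single uncontended user, assuming cost ~ n**2."""
    if contenders < 1:
        raise ValueError("need at least one contender")
    return contenders ** 2


if __name__ == "__main__":
    for n in (1, 3, 10, 1000):
        print(f"{n:>5} contenders -> ~{relative_lock_cost(n):,}x the single-user cost")
```

Under this model, three machines sharing a lock pay about 9 times the single-user cost, while a thousand machines contending for the same lock pay about a million times, which is why the farm approach fits read-mostly work and not seat or account updates.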

For example, airline reservation systems are the classic example of a
mainframe database. About 25 years ago, ITA Software had the bright
idea to do searches for seats and prices on racks of cheap PCs, which
worked great since it's read only, and if they suggest a seat or fare
that turns out to have just sold out, too bad, try again. But when
travel agents and airlines used it, they kept the ticketing info in a
regular database because it has to work.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Scott Lurndal@21:1/5 to George Neuner on Thu Jun 6 13:49:48 2024

George Neuner <[email protected]> writes:

On Tue, 4 Jun 2024 23:25:18 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Can't find it now and don't remember many details, but ...

A long time ago, there was a story going around about Microsoft vs IBM
regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.

In 2010, when the City of Santa Ana decommissioned their Unisys V380[*],
they replaced it with 21 windows servers. At the time, the V380
had been running production for almost thirty years.

[*] Penultimate descendant of the Burroughs B3500.


From Lynn Wheeler@21:1/5 to George Neuner on Thu Jun 6 09:11:06 2024

George Neuner <[email protected]> writes:

On Tue, 4 Jun 2024 23:25:18 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Can't find it now and don't remember many details, but ...

A long time ago, there was a story going around about Microsoft vs IBM regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.

Microsoft had hundreds of millions of customers that were more internet
oriented, while IBM had thousands of customers that were much less
internet oriented (and rate of changing information was much lower) ...
and IBM number may have only been for the web operation, as opposed to
total support people.

Jan1979, I was conned into doing benchmark for national lab that was
looking at getting 70 4341s for compute farm (sort of leading edge of
the coming cluster supercomputing tsunami). 4341s were also selling into
the same mid-range market as VAX and in about same numbers for small
unit orders. Big difference was large companies were ordering hundreds
of vm/4341s at a time for deployment out into departmental areas (sort
of the leading edge of the coming distributed computing tsunami).

The IBM batch system (MVS) was looking at the exploding distributed
computing market. First problem was only disk product for non-datacenter
environment was FBA (fixed-block architecture) and MVS only supported
CKD. Eventually there was CKD simulation made available on FBA disks
(currently no CKD disks have been made for decades, all being simulated
on industry standard fixed block disks). It didn't do MVS much good
because distributed operation was looking at dozens of systems per
support person while MVS still required dozens of support people per
system.

admittedly 14 year old comparison, max configured z196 mainframe
benchmarked at 50BIPS ... still dozens of support people. Equivalent
cloud megadatacenter was half million or more E5-2600 blades that each
benchmarked at 500BIPS with enormous automation requiring 70-80 support
people (per megadatacenter, at least 6000-7000 systems per person and
each system ten times max configured mainframe) ... also the megacenter
comparison was linux (not windows).
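
The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch using the figures as stated in the post (50 BIPS, 500 BIPS, half a million blades, 70-80 staff); the variable names are made up for illustration.

```python
# Figures as quoted in the post; names are illustrative only.
Z196_BIPS = 50          # max configured z196 mainframe
BLADE_BIPS = 500        # one E5-2600 blade
BLADES = 500_000        # "half million or more" blades per megadatacenter
STAFF_LOW, STAFF_HIGH = 70, 80  # support people per megadatacenter

# Systems looked after by each support person (worst case, best case).
systems_per_person = (BLADES // STAFF_HIGH, BLADES // STAFF_LOW)

# How many max configured mainframes one blade is worth.
blade_vs_mainframe = BLADE_BIPS // Z196_BIPS

# Mainframe-equivalents of processing in one megadatacenter.
mainframe_equivalents = BLADES * blade_vs_mainframe

if __name__ == "__main__":
    print(systems_per_person, blade_vs_mainframe, mainframe_equivalents)
```

With exactly half a million blades, the per-person figure comes out between 6250 and 7142, which is where the "6000-7000 systems per person" range comes from.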

--
virtualization experience starting Jan1968, online at home since Mar1970

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From OrangeFish@21:1/5 to George Neuner on Thu Jun 6 16:24:03 2024

On 2024-06-06 00:42, George Neuner wrote:

On Tue, 4 Jun 2024 23:25:18 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 04 Jun 2024 13:11:52 GMT, Scott Lurndal wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Sat, 1 Jun 2024 07:47:49 -0000 (UTC), Thomas Koenig wrote:

One of the main selling points [of zSeries] is the hardware
reliability ...

Quite an expensive way to get reliability. How does an outfit like
Google achieve essentially 0% downtime? By running a swarm of half a
million commodity servers, that’s how.

And that's not expensive?

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Can't find it now and don't remember many details, but ...

A long time ago, there was a story going around about Microsoft vs IBM
regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.

Not the story but this reminds me of Microsoft Scalability Day:
https://www.cnet.com/tech/tech-industry/scalability-day-falls-short/

OF.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Fri Jun 7 02:26:01 2024

On Thu, 6 Jun 2024 11:55:22 -0000 (UTC), John Levine wrote:

If you're doing something that is mostly read-only and easy to
parallelize, then it makes sense to use a farm of cheap PCs. But if you
are a bank or an airline, you need to be able to lock your database so
that you debit a bank account or sell a plane seat exactly once. There
is a rule of thumb that the cost of locking something grows roughly as
the square of the number of things contending for the lock.

Remember that the number of users actually buying a product at any given
time is only a small proportion (say 1%) of the number of users
currently accessing the site.

So, by that same square law, the locking problem is only 1/10,000 as bad
as one might think.
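
To spell out the arithmetic, here is a small sketch. It assumes the rule of thumb is taken literally (cost proportional to the square of the number of contenders) and uses a made-up figure of 10,000 users on the site; the function name is hypothetical.

```python
def relative_lock_cost(contenders: int) -> int:
    """Rule-of-thumb model: locking cost grows as contenders squared."""
    return contenders ** 2


users_on_site = 10_000              # assumed figure, for illustration only
buyers = users_on_site // 100       # the "say 1%" who are actually buying

# Cost if only the buyers contend, relative to everyone contending.
ratio = relative_lock_cost(buyers) / relative_lock_cost(users_on_site)

if __name__ == "__main__":
    print(f"buyers={buyers}, relative cost={ratio}")
```

Cutting the contenders to 1/100 cuts the modelled cost to (1/100)^2 = 1/10,000, whatever the starting number of users.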

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Fri Jun 7 02:19:46 2024

A long time ago, there was a story going around about Microsoft vs IBM
regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.

Those mainframes were probably running Linux.

Not sure why a comparison with servers running Windows is relevant to the
point I was making, anyway.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Fri Jun 7 03:13:54 2024

Lawrence D'Oliveiro wrote:

On Thu, 6 Jun 2024 11:55:22 -0000 (UTC), John Levine wrote:

If you're doing something that is mostly read-only and easy to
parallelize, then it makes sense to use a farm of cheap PCs. But if
you are a bank or an airline, you need to be able to lock your
database so that you debit a bank account or sell a plane seat
exactly once. There is a rule of thumb that the cost of locking
something grows roughly as the square of the number of things
contending for the lock.

Remember that the number of users actually buying a product at any
given time is only a small proportion (say 1%) of the number of users
currently accessing the site.

I don't know where you got that number, but even if it is true for a
retail storefront type site, I am pretty sure it isn't true for a bank
(what John was talking about, and a substantial part of the mainframe
user base). Few people "browse" a bank's products. :-) Even for an
airline (the other example John gave), I suspect that far more than 1%
of the accesses are updates.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Fri Jun 7 03:32:53 2024

On Fri, 7 Jun 2024 03:13:54 -0000 (UTC), Stephen Fuld wrote:

Lawrence D'Oliveiro wrote:

Remember that the number of users actually buying a product at any
given time is only a small proportion (say 1%) of the number of users
currently accessing the site.

I don't know where you got that number ...

From actual experience.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Fri Jun 7 03:48:05 2024

Lawrence D'Oliveiro wrote:

On Fri, 7 Jun 2024 03:13:54 -0000 (UTC), Stephen Fuld wrote:

Lawrence D'Oliveiro wrote:

Remember that the number of users actually buying a product at any
given time is only a small proportion (say 1%) of the number of users
currently accessing the site.

I don't know where you got that number ...

From actual experience.

OK. Was that experience with a bank or airline (what John was
discussing)?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to John Levine on Fri Jun 7 11:06:47 2024

John Levine wrote:

According to George Neuner <[email protected]>:

Consider the equivalent number of mainframes, with their inbuilt
diagnostics capabilities etc, to match that reliability.

Can't find it now and don't remember many details, but ...

A long time ago, there was a story going around about Microsoft vs IBM
regarding the day-to-day operation of their company web sites. It
claimed that Microsoft was running a ~1000 machine server farm with a
crew of ~100, whereas IBM was running 3 mainframes with a crew of ~10.

It depends on what you want to do.

If you're doing something that is mostly read-only and easy to
parallelize, then it makes sense to use a farm of cheap PCs. But if
you are a bank or an airline, you need to be able to lock your
database so that you debit a bank account or sell a plane seat exactly
once. There is a rule of thumb that the cost of locking something
grows roughly as the square of the number of things contending for
the lock.

Which is why you use my trick (probably old?) of setting up an array of
N preliminary locks, as gate-keepers: N would be approx
sqrt(number_of_competing_users), and only after winning that first stage
are you allowed to compete for the "real" lock.

I've shown a way here in c.arch to make this adaptive, so it would only
kick in after a given amount of contention.
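
As a rough illustration of the two-stage idea, here is a minimal, non-adaptive sketch. It is not the code posted earlier in c.arch: the class name GatedLock, the random choice of gate, and the demo function are all assumptions made for the example.

```python
import math
import random
import threading


class GatedLock:
    """Two-stage lock: win one of N gate locks before contending for the real one."""

    def __init__(self, expected_contenders: int) -> None:
        # N is roughly sqrt(number of competing users).
        n_gates = max(1, math.isqrt(expected_contenders))
        self._gates = [threading.Lock() for _ in range(n_gates)]
        self._real = threading.Lock()
        self._held_gate = None  # only read or written while the real lock is held

    def __enter__(self) -> "GatedLock":
        # Stage 1: compete only with the others that picked the same gate.
        gate = random.choice(self._gates)
        gate.acquire()
        # Stage 2: at most n_gates stage-1 winners compete for the real lock.
        self._real.acquire()
        self._held_gate = gate
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        gate = self._held_gate
        self._held_gate = None
        self._real.release()
        gate.release()
        return False  # do not swallow exceptions


def run_demo(n_threads: int = 16, increments: int = 200) -> int:
    """Increment a shared counter from several threads; return the final count."""
    lock = GatedLock(expected_contenders=n_threads)
    counter = 0

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_demo())
```

Locks are always taken in the order gate, then real lock, and nobody waits for a gate while holding the real lock, so the two stages cannot deadlock against each other. With N gates, neither stage ever has much more than about sqrt(users) contenders at once.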

For example, airline reservation systems are the classic example of a
mainframe database. About 25 years ago, ITA Software had the bright
idea to do searches for seats and prices on racks of cheap PCs, which
worked great since it's read only, and if they suggest a seat or fare
that turns out to have just sold out, too bad, try again. But when
travel agents and airlines used it, they kept the ticketing info in a
regular database because it has to work.

The main problem here is how long you are allowed to "soft lock" a set
of seats that you are contemplating buying.
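
One way to picture a "soft lock" is a hold that simply expires. The sketch below is a single-threaded illustration under that assumption; the SeatHolds class, its method names, and the timeout value are hypothetical and not taken from any real reservation system.

```python
import time


class SeatHolds:
    """Soft locks on seats: a hold lapses on its own after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry can be tested
        self._holds = {}             # seat -> (holder, expires_at)

    def try_hold(self, seat: str, holder: str) -> bool:
        """Place or refresh a hold; fail if someone else holds an unexpired one."""
        now = self._clock()
        current = self._holds.get(seat)
        if current is not None and current[1] > now and current[0] != holder:
            return False
        self._holds[seat] = (holder, now + self._ttl)
        return True

    def holder_of(self, seat: str):
        """Return the current holder, or None if the seat is free or the hold lapsed."""
        current = self._holds.get(seat)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]


if __name__ == "__main__":
    holds = SeatHolds(ttl_seconds=600)   # assumed 10-minute hold
    print(holds.try_hold("12A", "alice"), holds.try_hold("12A", "bob"))
```

The choice of ttl_seconds is exactly the trade-off in question: too short and buyers lose seats mid-purchase, too long and seats sit unavailable while nobody is buying them.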

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
