On Tue, 21 Jul 2020 01:40:05 +0200, Daniele Futtorovic wrote:
1. Those are POSIX features, right? They don't have a ready equivalent
in other systems, right? Then Java, which is supposed to be platform-independent, won't have them built-in.
Right, and you often need to go deeper - that is once you've used the UNIX/Linux time command to help determine that its not something dumb in
your own code that's gobbling cpu or wasting time waiting.
Here's a case from around 2000 when DEC were still a thing and their big
iron was running UNIX on Alpha chips.
At the time I was involved in a high performance system for a telco: its
job was to ingest logs from their switches and populate a datawarehouse
with the data - it was like drinking from a firehose. We used a
conventional star schema for the DW, but some of the dimensions contained
a lot of data, so we needed to compress it. The initial approach was was
to use the dimensions themselves as indexes: each holding data, e.g. a
phone number (19 characters is needed to hold a full international
number) while the key was an integer. The conversion was initially by SQL lookup on the DW, assigning a new key and adding the resulting row to the dimension each time the lookup scored a miss.
This allowed log items to be loaded at 350 items a sec, max - limited by
the disk system. We needed to load 2000 items/sec, with 8 dimension
lookups needed per item loaded. Merely by replacing dimension data in the
fact table with these integer keys we got something like a 3x data
compression ratio, which is why we did it.
Some detective work here showed that DECs state of the art SCSI
controllers were single threaded and had all SCSI drive chains connected
to a common bus, so no simultaneous access to disks on different threads.
On top of that, the controller couldn't interleave commands and responses
so couldn't do only sequentially access one disk at a time. Each
controller had something like 8 strings of 8 disks attached to it, but
the whole thing ran at the speed of a single disk.
IOW, we needed to do the key generation using in-memory lookups.
Fortunately we had lots of RAM, so built a configurable process that used Red-Black B-trees to perform the lookups and then passed the log items, complete with indexes, to the DW loader. We had a separate lookup process
for each key so they could all run in parallel, which meant in practise
that each lookup was chewing through its configured task on a different
batch of input: each match was loaded into memory and passed from task to
task until it was finally handed to the DW loader and removed from memory
when it had been loaded into the DW. The DW loading process was now fast enough, but the lookups bottlenecked at around 800 lookups/sec in a
growing B-tree. More detective work needed.
It turned out that this DEC UNIX had a Mach-based kernel with a single
request queue that serialised malloc() calls. Yes, I know this is a Java
Group, but this system was written in C. Worse, it turned out that adding
a single node to the B-tree required three mallocs() for each node root, left+right pointer and node data. So I wrote my own B-tree management
function (cribbed from Sedgewick: "Algorithms") and that gave 2100
lookups/sec. Still not enough: we needed at least around 7500 lookups/sec
to handle the projected log volumes.
It turned out that sbrk(), which allocated chunks of heap storage, was
yet another process that the Mach kernel serialised, so I rewrote the B-
tree algorithm again, this time grabbing several MB of RAM at a time and
doing its own storage allocation inside these chunks. This did the trick:
it now ran at 25,000 lookups/sec.
The point of this story is that merely looking at elapsed time and CPU
time usage couldn't tell us much: the user process was too slow, but we
already knew that. the UNIX 'time' command showed that it wasn't CPU-
bound and nor were any of the system processes, so we needed to look at
other stuff to work out where the problems were. This is what we found:
- disc controller schematics showed that all disk strings were
attached to a common bus, serialising access to them
- the tech specs for the disk controller told us that it would not
interleave commands to disks on a string even though SCSI disk drives
were capable of handling interleaved commands
Now we knew why the Data Warehouse performance couldn't be improved
beyond what we'd already achieved by tuning it.
- reading the source code of the tsearch() function told us about the
multiple malloc()s needed to add a node to the tree
FWIW the Java TreeMap class is also a Red-Black binary tree
implementation, so similar to the C library tsearch() function
and friends, but seems to have somewhat better performance than the
DEC implementation.
- the bottleneck causes by the way the Mach kernel serialised each type
of system call was found by looking at kernel statistics and kernel
documentation.
2. System and user time are process-level notions, aren't they?
Yes.
I don't know the C world that well, but could you get them at the method-level if you limited the scope to native code on a
POSIX-compliant system? If not, you can be certain they won't
be available in Java.
No. UNIX/Linux only aggregates elapsed and mill time per process, so
even for a multi-threaded process you only get process totals. Under UNIX/ Linux the JVM is a multi-threaded process with a single thread handling
user program execution: IIRC it spawns threads to handle garbage
collection and Swing/awt but nothing external to the JVM process will see either these or any user-defined threads that your program may execute.
If you watch 'top' while a Java program is running, you only see one
entry for it.
C programs can launch multiple processes and share data between them:
these will show up in 'top' as separate processes accumulating their own
totals for clock and mill time and with their own memory allocations.
IIRC shared memory is credited to the process that allocated it.
3. How would even this work? What if the thread executing your method is suspended and another takes over for a few loops? What about parallel execution? How would you sort those out in terms of method-level user
and system time?
All accumulated into the process totals. See above.
--
Martin | martin at
Gregorie | gregorie dot org
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)