Elhana <[email protected]> writes:
> Keith Thompson:
>> Are you using multiple encodings of the same text?
> Yes.
>> distribution of byte counts for your input text?
> A natural language one.
>> What are the numbers...
> The input text (in UTF-8 form) had 4023k bytes in 2252k
> characters. The DEFLATE algorithm reduced those to 892 or 1061 bytes
> correspondingly.

That's not enough information, or at least is unclear. And did you
really mean 892 and 1061 bytes, or 892k and 1061k bytes? (I suggest
quoting exact character/byte counts. "4023k" is both approximate and ambiguous; "k" could be either 1000 or 1024.)
Here's my best guess at what you're saying:
You have two files containing different encodings of the same text:
- utf8.txt is 4023k bytes (averaging about 1.79 bytes per character).
- latin1.txt is 2252k bytes.
All characters have code points in the range 0..255 (otherwise a Latin-1 encoding would not be possible).
Compressing utf8.txt with the DEFLATE algorithm (using what program?)
yields 892k bytes of compressed output.
Compressing latin1.txt with the DEFLATE algorithm yields 1061k bytes of compressed output.
Since both utf8.txt and latin1.txt contain very nearly the same
information, ideally a compression algorithm *should* yield outputs of
similar size for both input files, but you're seeing a difference of
about 19% (1061k vs. 892k), with the smaller Latin-1 input producing
the larger output, and you're wondering why.
Is my description correct?
(BTW, I got roughly similar results with a randomly generated chunk of
text and the gzip command.)
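
For anyone who wants to try a comparison of this kind, here is a minimal Python sketch. It is not the procedure used above: it calls the zlib module rather than the gzip command (so the container overhead differs slightly), and the names make_text and deflate_sizes, the alphabet, and the 0.79 fraction of non-ASCII characters (picked only to mimic the roughly 1.79 bytes per character) are assumptions made for illustration.

```python
import random
import zlib


def make_text(n_chars, high_fraction, seed=0):
    """Build random text whose code points all fit in Latin-1.

    Each character is drawn from U+00C0..U+00FF with probability
    high_fraction, otherwise from lowercase ASCII letters and space.
    (Hypothetical generator; real natural-language text behaves differently.)
    """
    rng = random.Random(seed)
    low = "abcdefghijklmnopqrstuvwxyz "
    high = [chr(c) for c in range(0xC0, 0x100)]
    return "".join(
        rng.choice(high) if rng.random() < high_fraction else rng.choice(low)
        for _ in range(n_chars)
    )


def deflate_sizes(text, level=9):
    """Encode text as UTF-8 and as Latin-1, then DEFLATE both via zlib."""
    utf8 = text.encode("utf-8")
    latin1 = text.encode("latin-1")  # raises if any code point exceeds 255
    return {
        "chars": len(text),
        "utf8_raw": len(utf8),
        "latin1_raw": len(latin1),
        "utf8_deflated": len(zlib.compress(utf8, level)),
        "latin1_deflated": len(zlib.compress(latin1, level)),
    }


if __name__ == "__main__":
    sample = make_text(200_000, high_fraction=0.79)
    for name, value in deflate_sizes(sample).items():
        print(f"{name:16} {value}")
```

Printing exact byte counts like this avoids the "k" ambiguity. Whether the sizes come out in the same order as the figures quoted above depends on the input text and the compression level.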
--
Keith Thompson (The_Other_Keith) [email protected] <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */