Forum: >>> Magnum BBS <<<

realloc() - frequency, conditions, or experiences about relocation?

From Janis Papanagnou@21:1/5 to All on Mon Jun 17 08:08:07 2024

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the enlarged storage will be relocated, along with the
previously stored data. On huge data sets that might be a performance
factor. Is there any experience, or are there any concrete factors,
concerning the conditions under which this relocation happens? I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often, it might be an issue to consider. It would be good to get some
feeling for those internals.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Janis Papanagnou on Mon Jun 17 10:18:40 2024

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the discussion. "Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.
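
To put rough numbers on the O(log(N)) versus O(N) point above, here is a minimal sketch in C that only counts how many reallocations each growth policy would need; it allocates nothing. The function names, the starting size and the fixed 3/2 factor are illustrative assumptions, not anything from the thread's code, and size_t overflow is deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Number of growth steps needed to get from `start` bytes to at least
   `target` bytes when every step adds a constant `step` bytes. */
size_t grow_steps_additive(size_t start, size_t target, size_t step)
{
    assert(start > 0 && step > 0);
    size_t cap = start, n = 0;
    while (cap < target) {
        cap += step;
        n++;
    }
    return n;
}

/* Same count, but every step multiplies the capacity by 3/2
   (rounded down, and always growing by at least one byte). */
size_t grow_steps_geometric(size_t start, size_t target)
{
    assert(start > 0);
    size_t cap = start, n = 0;
    while (cap < target) {
        size_t next = cap + cap / 2;
        if (next <= cap)        /* only happens for cap == 1 */
            next = cap + 1;
        cap = next;
        n++;
    }
    return n;
}
```

Growing from 16 bytes to 1 MiB in 16-byte steps needs 65535 steps with the additive policy, while the 3/2 policy needs only a few dozen.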

--
Ben.


From Ben Bacarisse@21:1/5 to Malcolm McLean on Mon Jun 17 10:55:33 2024

Malcolm McLean <[email protected]> writes:

On 17/06/2024 10:18, Ben Bacarisse wrote:

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.

So can we work it out?

What is "it"?

Let's assume for the moment that the allocations have a semi-normal distribution,

What allocations? The allocations I talked about don't have that
distribution.

with negative values disallowed. Now ignoring the first few
values, if we have allocated, say, 1K, we ought to be able to predict the value by integrating the distribution from 1k to infinity and taking the mean.

I have no idea what you are talking about. What "value" are you looking
to calculate?

--
Ben.


From David Brown@21:1/5 to Janis Papanagnou on Mon Jun 17 14:15:11 2024

On 17/06/2024 08:08, Janis Papanagnou wrote:

In a recent thread realloc() was a substantial part of the discussion. "Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

Janis

Consider your target audience and their hardware, the target OS, and the realistic size of your data. If the target is a PC, you can happily
malloc tens of MB at the start without a care, and for systems that do
not actually allocate system memory until you try to access the area,
there is no cost to this.

So in many situations where you are reading and parsing data from a
file, you can just do the initial malloc with more than enough space for
any realistic input file. You might still implement a realloc solution
for occasional extreme uses, and because it is nice to avoid artificial
limits for programs, but efficiency matters a lot less in those cases.

It may also be the case that, even if realloc returns a different address
and logically copies a lot of data, the move is done by smarter virtual
memory mapping, so that only the mapping changes and the underlying
physical RAM does not need to be copied. But I don't know if OSes and
realloc implementations are smart enough to do that.


From Janis Papanagnou@21:1/5 to Janis Papanagnou on Mon Jun 17 15:21:24 2024

On 17.06.2024 08:08, Janis Papanagnou wrote:

In a recent thread realloc() was a substantial part of the discussion. "Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

Let me add...

I'd assume that there's some basic allocation size defined; some
simple test sample with a handful of bytes didn't relocate the data.
Yet I don't know whether allocated memory is managed sequentially
or has linked blocks. A peek into the source code might help. What
I found is this comment for extending chunks:[*]
* Extending forward into following adjacent free chunk.
* Shifting backwards, joining preceding adjacent space
* Both shifting backwards and extending forward.
* Extending into newly sbrked space
Going to investigate that source code[*] later...
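
The kind of simple test mentioned above can be sketched as a small probe that doubles a block repeatedly and counts how often realloc() hands back a different address. This is only an illustration under assumptions: the names `realloc_stats` and `probe_realloc` are made up for this sketch, and the counts it reports depend entirely on the allocator, the platform and the state of the heap, so they say nothing general about when relocation happens.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result record for the probe below. */
struct realloc_stats {
    size_t calls;   /* number of successful realloc() calls */
    size_t moves;   /* how many of them returned a new address */
};

/* Double a block from `start` bytes until it holds at least `limit`
   bytes. Returns 0 on success, -1 if an allocation failed. */
int probe_realloc(size_t start, size_t limit, struct realloc_stats *out)
{
    assert(start > 0 && out != NULL);
    out->calls = 0;
    out->moves = 0;

    char *p = malloc(start);
    if (p == NULL)
        return -1;

    size_t cap = start;
    while (cap < limit) {
        /* Save the address as an integer first: after a successful
           realloc() the old pointer value must not be used. */
        uintptr_t before = (uintptr_t)p;
        size_t next = cap * 2;
        char *q = realloc(p, next);
        if (q == NULL) {
            free(p);            /* the old block is still valid here */
            return -1;
        }
        out->calls++;
        if ((uintptr_t)q != before)
            out->moves++;
        p = q;
        cap = next;
    }
    free(p);
    return 0;
}
```

Calling `probe_realloc(16, 1 << 16, &stats)` performs 12 doublings; how many of them count as moves is whatever the local allocator decides.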

Janis

[*] https://elixir.bootlin.com/glibc/glibc-2.1.2/source/malloc/malloc.c
(line 3077 ff)


From Ben Bacarisse@21:1/5 to Malcolm McLean on Mon Jun 17 15:33:47 2024

Malcolm McLean <[email protected]> writes:

On 17/06/2024 10:55, Ben Bacarisse wrote:

Malcolm McLean <[email protected]> writes:

On 17/06/2024 10:18, Ben Bacarisse wrote:

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.

So can we work it out?

What is "it"?

Let's assume for the moment that the allocations have a semi-normal
distribution,

What allocations? The allocations I talked about don't have that
distribution.

with negative values disallowed. Now ignoring the first few
values, if we have allocated, say, 1K, we ought to be able to predict the
value by integrating the distribution from 1k to infinity and taking the
mean.

I have no idea what you are talking about. What "value" are you looking
to calculate?

We have a continuously growing buffer, and we want the best strategy for
reallocations as the stream of characters comes at us. So, given we know
how many characters have arrived, can we predict how many will arrive, and
therefore ask for the best amount when we reallocate, so that we neither
make too many reallocations (reallocate on every byte received) nor ask for
too much (demand SIZE_MAX memory when the first byte is received)?

Obviously not, or we'd use the prediction. Your question was probably
rhetorical, but it didn't read that way.

Your strategy for avoiding these extremes is exponential growth.

It's odd to call it mine. It's very widely known and used. "The one I
mentioned" might be a less confusing description.

You allocate a small amount for the first few bytes. Then you use
exponential growth, with a factor of either 2 or 1.5. My question is
whether or not we can be cuter. And of course we need to know the
statistical distribution of the input files. And I'm assuming a semi-normal
distribution, ignoring the files with small values, which we will allocate
enough for anyway.

And so we integrate the distribution between the point we are at and
infinity. Then we take the mean. And that gives us a best estimate of how
many bytes are to come, and therefore how much to grow the buffer by.

I would be surprised if that were worth the effort at run time. A
static analysis of "typical" input sizes might be interesting as that
could be used to get an estimate of good factors to use, but anything
more complicated than maybe a few factors (e.g. doubling up to 1MB then
3/2 thereafter) is likely to be too messy to be useful.

Also, the cost of reallocations is not constant. Larger ones are
usually more costly than small ones, so if one were going to a lot of
effort to make run-time guesses, that cost should be factored in as
well.
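
The "few factors" idea above (doubling up to 1MB, then 3/2 thereafter) could look roughly like the following C helper. It is a sketch under assumptions: the name `next_capacity`, the 16-byte initial size and the exact 1 MiB threshold are illustrative choices, and returning 0 on overflow is just one possible convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INITIAL_CAPACITY ((size_t)16)        /* assumed starting size */
#define DOUBLING_LIMIT   ((size_t)1 << 20)   /* assumed switch point: 1 MiB */

/* Given the current capacity and the number of bytes now required,
   return a new capacity >= needed: double while below DOUBLING_LIMIT,
   grow by 3/2 at or above it. Returns 0 if size_t would overflow. */
size_t next_capacity(size_t cap, size_t needed)
{
    assert(needed > cap);   /* only call this when growth is required */

    size_t next = cap ? cap : INITIAL_CAPACITY;
    while (next < needed) {
        size_t extra = (next < DOUBLING_LIMIT) ? next : next / 2;
        if (next > SIZE_MAX - extra)
            return 0;       /* cannot grow any further */
        next += extra;
    }
    return next;
}
```

For example, a 16-byte buffer that needs 100 bytes grows to 128, while a 1 MiB buffer that needs one more byte grows to 1.5 MiB.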

--
Ben.


From Anton Shepelev@21:1/5 to All on Mon Jun 17 18:02:49 2024

XPost: sci.stat.math

[cross-posted to: sci.stat.math]

Malcolm McLean:

We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we know how many
characters have arrived, can we predict how many will
arrive,

Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?

and therefore ask for the best amount when we reallocate,
so that we neither make too many reallocations (reallocate
on every byte received) nor ask for too much (demand
SIZE_MAX memory when the first byte is received)?

Your strategy for avoiding these extremes is exponential
growth. You allocate a small amount for the first few
bytes. Then you use exponential growth, with a factor of
either 2 or 1.5.

This strategy ensures a constant ratio of the amount of
reallocated data to the length of the buffer by making
reallocations less frequent as the buffer grows.

And so we integrate the distribution between the point we
are at and infinity. Then we take the mean. And that gives
us a best estimate of how many bytes are to come, and
therefore how much to grow the buffer by.

You have an a priori distribution of the buffer size (can be
tracked on-the-fly, if unknown beforehand) and a partially
filled buffer. The task is to calculate the a posteriori
distribution of /that/ buffer's final size, and then to
allocate the predicted value based on a good percentile.

How about using a percentile instead of the mean, e.g. if
the current size corresponds to percentile p, you allocate a
capacity corresponding to percentile 1-(1-p)/k , where k>1
denotes the balance between space and time efficiency. For
example, if the current size sits at the 60th percentile
and k=2, you allocate a capacity sufficient to hold
100-(100-60)/2=80% of buffers.
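
One way to read the percentile rule above in C, as a rough sketch: keep a sorted array of previously observed final buffer sizes and work in ranks rather than fractions, so that percentile 1-(1-p)/k becomes rank n - (n - rank)/k. The function name, the sorted-history representation and the restriction to an integer k are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* `sorted` holds n previously observed final sizes in ascending order.
   `current` is the size the buffer has reached so far; k > 1 trades
   space against time. Returns the recorded size at the target
   percentile. If `current` already exceeds every recorded size, the
   largest recorded size is returned and the caller needs a fallback
   (for instance plain geometric growth). */
size_t percentile_capacity(const size_t *sorted, size_t n,
                           size_t current, size_t k)
{
    assert(sorted != NULL && n > 0 && k > 1);

    /* rank = number of recorded sizes <= current, so p = rank / n */
    size_t rank = 0;
    while (rank < n && sorted[rank] <= current)
        rank++;

    /* Target percentile 1 - (1 - p) / k, expressed as a 1-based rank.
       Integer division rounds the skipped tail down, which rounds the
       target rank up. */
    size_t target = n - (n - rank) / k;
    return sorted[target - 1];
}
```

With recorded sizes 100, 200, ..., 1000, a current size of 600 (the 60th percentile) and k=2, the function returns 800, the 80th percentile, matching the arithmetic in the example above.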

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments


From Anton Shepelev@21:1/5 to All on Mon Jun 17 18:10:34 2024

Ben Bacarisse to Malcolm McLean:

We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we know how many
characters have arrived, can we predict how many will
arrive, and therefore ask for the best amount when we
reallocate, so that we neither make too many
reallocations (reallocate on every byte received) nor ask
for too much (demand SIZE_MAX memory when the first byte
is received)?

Obviously not, or we'd use the prediction.

Not so obvious to me, for the exponential algorithm may be
the best when the distribution of buffer size is /not/
known, whereas Malcolm is interested in the cases when we
know it.

Your strategy for avoiding these extremes is exponential
growth.

It's odd to call it mine. It's very widely known and used.
"The one I mentioned" might be a less confusing description.

I think it is a modern English idiom, which I dislike as
well. StackOverflow is full of questions starting like:
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do this?"
or "What is the way to do that?"

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments


From Richard Harnden@21:1/5 to Ben Bacarisse on Mon Jun 17 16:15:27 2024

On 17/06/2024 15:33, Ben Bacarisse wrote:

Malcolm McLean <[email protected]> writes:

On 17/06/2024 10:55, Ben Bacarisse wrote:

Malcolm McLean <[email protected]> writes:

On 17/06/2024 10:18, Ben Bacarisse wrote:

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.

So can we work it out?

What is "it"?

Let's assume for the moment that the allocations have a semi-normal
distribution,

What allocations? The allocations I talked about don't have that
distribution.

with negative values disallowed. Now ignoring the first few
values, if we have allocated, say, 1K, we ought to be able to predict the
value by integrating the distribution from 1k to infinity and taking the
mean.

I have no idea what you are talking about. What "value" are you looking
to calculate?

We have a continuously growing buffer, and we want the best strategy for
reallocations as the stream of characters comes at us. So, given we know
how many characters have arrived, can we predict how many will arrive, and
therefore ask for the best amount when we reallocate, so that we neither
make too many reallocations (reallocate on every byte received) nor ask for
too much (demand SIZE_MAX memory when the first byte is received)?

Obviously not, or we'd use the prediction. Your question was probably
rhetorical, but it didn't read that way.

Your strategy for avoiding these extremes is exponential growth.

It's odd to call it mine. It's very widely known and used. "The one I
mentioned" might be a less confusing description.

You allocate a small amount for the first few bytes. Then you use
exponential growth, with a factor of either 2 or 1.5. My question is
whether or not we can be cuter. And of course we need to know the
statistical distribution of the input files. And I'm assuming a semi-normal
distribution, ignoring the files with small values, which we will allocate
enough for anyway.

And so we integrate the distribution between the point we are at and
infinity. Then we take the mean. And that gives us a best estimate of how
many bytes are to come, and therefore how much to grow the buffer by.

I would be surprised if that were worth the effort at run time. A
static analysis of "typical" input sizes might be interesting as that
could be used to get an estimate of good factors to use, but anything
more complicated than maybe a few factors (e.g. doubling up to 1MB then
3/2 thereafter) is likely to be too messy to be useful.

Also, the cost of reallocations is not constant. Larger ones are
usually more costly than small ones, so if one were going to a lot of
effort to make run-time guesses, that cost should be factored in as
well.

I usually keep track:

struct
{
    size_t used;
    size_t allocated;
    void *data;
};

Then, if used + new_size is more than what's already been allocated then
a realloc will be required.

Start with an initial allocated size that's 'reasonable' - the happy path
will never need any reallocs.

Otherwise multiply by some factor. Typically I just double it.
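
A minimal sketch of how that bookkeeping might be used for appending data follows. The struct fields are the ones shown above; the tag `buffer`, the function name `buffer_append` and the 4096-byte initial size are assumptions added for illustration, not part of the original post.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_INITIAL ((size_t)4096)   /* assumed 'reasonable' start */

struct buffer {
    size_t used;
    size_t allocated;
    void  *data;
};

/* Append new_size bytes from src, growing the storage by doubling when
   used + new_size exceeds what is allocated. Returns 0 on success and
   -1 on failure, in which case the buffer is left unchanged. */
int buffer_append(struct buffer *b, const void *src, size_t new_size)
{
    assert(b != NULL && (src != NULL || new_size == 0));

    if (new_size > SIZE_MAX - b->used)
        return -1;                       /* used + new_size would overflow */
    size_t needed = b->used + new_size;

    if (needed > b->allocated) {
        size_t cap = b->allocated ? b->allocated : BUFFER_INITIAL;
        while (cap < needed) {
            if (cap > SIZE_MAX / 2) {    /* doubling would overflow */
                cap = needed;
                break;
            }
            cap *= 2;
        }
        /* Assign to a temporary so the old block is not leaked
           if realloc() fails. */
        void *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->allocated = cap;
    }

    if (new_size > 0)
        memcpy((char *)b->data + b->used, src, new_size);
    b->used = needed;
    return 0;
}
```

A buffer initialised as `{0, 0, NULL}` gets its first 4096 bytes on the first append; the caller releases it with `free(b.data)` when done.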


From Scott Lurndal@21:1/5 to Janis Papanagnou on Mon Jun 17 16:50:07 2024

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

I've not found a use for realloc in the last forty five years, myself.

I suspect that the performance issues are not an issue for relatively
small datasets, and are often exhibited during the non-performance critical 'setup' phase of an algorithm.


From Scott Lurndal@21:1/5 to Malcolm McLean on Mon Jun 17 16:58:52 2024

Malcolm McLean <[email protected]> writes:

On 17/06/2024 10:55, Ben Bacarisse wrote:

Malcolm McLean <[email protected]> writes:

I have no idea what you are talking about. What "value" are you looking
to calculate?

We have a continuously growing buffer,

At this point, you should be asking yourself
if there are better alternatives for storing
the incoming data than to a continuously growing
dynamically allocated piecemeal buffer.

C character stdio tends to work well for streaming applications
(i.e. pipelines where the input is (minimally) processed and forwarded
to the output), but not so efficiently for applications that need to
look at the data en masse.

Personally, I'd mmap the input file and eschew stdio completely
and just walk through memory with the appropriate pointer.

(mmap showed up in the late 80s, so you can pretend it
is C90 if you like).
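
As a rough illustration of the mmap approach described above, the following sketch maps a file read-only and walks it with a pointer, here just counting newline bytes. It assumes a POSIX system and a regular file (pipes and terminals cannot be mapped this way); the function name `count_newlines_mmap` is made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the file at `path` read-only and count its '\n' bytes.
   Returns the count, or -1 if the file cannot be opened or mapped. */
long count_newlines_mmap(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* a zero-length mapping is an error */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    if (map == MAP_FAILED)
        return -1;

    /* No buffer to grow: simply walk through the mapped bytes. */
    const char *p = map;
    long count = 0;
    for (size_t i = 0; i < len; i++)
        if (p[i] == '\n')
            count++;

    munmap(map, len);
    return count;
}
```

The whole file is addressable at once, so there is no realloc() growth policy to tune; the price is that the input has to be a mappable file rather than an arbitrary stream.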


From David Brown@21:1/5 to Malcolm McLean on Mon Jun 17 20:11:48 2024

On 17/06/2024 11:31, Malcolm McLean wrote:

On 17/06/2024 10:18, Ben Bacarisse wrote:

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

There is obviously a cost, but there is (usually) no alternative if
contiguous storage is required. In practice, the cost is usually
moderate and can be very effectively managed by using an exponential
allocation scheme: at every reallocation multiply the storage space by
some factor greater than 1 (I often use 3/2, but doubling is often used
as well). This results in O(log(N)) rather than O(N) allocations as in
your code that added a constant to the size. Of course, some storage is
wasted (that /might/ be retrieved by a final realloc down to the final
size) but that's rarely significant.

So can we work it out?

Let's assume for the moment that the allocations have a semi-normal distribution, with negative values disallowed. Now ignoring the first
few values, if we have allocated, say, 1K, we ought to be able to
predict the value by integrating the distribution from 1k to infinity
and taking the mean.

First, there is no reason for assuming such a distribution, other than
saying "lots of things are roughly normal".

Secondly, knowing the distribution gives you /no/ information about any
given particular case. You know the distribution for the results of
rolling two dice - does that mean you can predict the next roll?

Thirdly, not all distributions have a mean (look up the Cauchy
distribution if you like).

Fourthly, even if you know the mean, it tells you nothing of use.

Knowing a bit about the distribution of file sizes can be useful, but
not nearly in the way you describe here. If you know that the files are
rarely or never bigger than 10 MB, malloc 10 MB and forget the realloc.
If you know they are often bigger than that, mmap the file and forget
the realloc.


From Michael S@21:1/5 to Scott Lurndal on Mon Jun 17 20:20:57 2024

On Mon, 17 Jun 2024 16:50:07 GMT
[email protected] (Scott Lurndal) wrote:

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the
discussion. "Occasionally" the increased data storage will be
relocated along with the previously stored data. On huge data sets
that might be a performance factor. Is there any experience or are
there any concrete factors about the conditions when this relocation
happens? - I could imagine that it's no issue as long as you're in
some kB buffer range, but if, say, we're using realloc() to
substantially increase buffers often it might be an issue to
consider. It would be good to get some feeling about that internal.

I've not found a use for realloc in the last forty five years, myself.

Did you find a use for std::vector::resize()?
If yes, that could be a major reason behind not finding a use for realloc().
Another possible reason is coding for environments where dynamic
allocation is either not used at all or used only during start-up.

At least for me, those are the major reasons why I have very rarely used
realloc since I began programming professionally.

I suspect that the performance issues are not an issue for relatively
small datasets, and are often exhibited during the non-performance
critical 'setup' phase of an algorithm.


From Scott Lurndal@21:1/5 to Michael S on Mon Jun 17 19:02:13 2024

Michael S <[email protected]> writes:

On Mon, 17 Jun 2024 16:50:07 GMT
[email protected] (Scott Lurndal) wrote:

Janis Papanagnou <[email protected]> writes:

In a recent thread realloc() was a substantial part of the
discussion. "Occasionally" the increased data storage will be
relocated along with the previously stored data. On huge data sets
that might be a performance factor. Is there any experience or are
there any concrete factors about the conditions when this relocation
happens? - I could imagine that it's no issue as long as you're in
some kB buffer range, but if, say, we're using realloc() to
substantially increase buffers often it might be an issue to
consider. It would be good to get some feeling about that internal.

I've not found a use for realloc in the last forty five years, myself.

Did you find a use for std::vector::resize()?

I'm pretty sure (checks) that I posted this reply to comp.lang.c.

std::vector::resize() doesn't work well from C (well, I can mangle
the names and use an explicit this pointer, but why bother?).

If yes, that could be a major reason behind not finding a use for realloc().
Another possible reason is coding for environments where dynamic
allocation is either not used at all or used only during start-up.

Or because the algorithms used don't call for realloc. Or there
are better alternatives (like mmap).


From Tim Rentsch@21:1/5 to Anton Shepelev on Tue Jun 18 00:09:24 2024

Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

Ben Bacarisse to Malcolm McLean:

[next is a comment from Malcolm]

Your strategy for avoiding these extremes is exponential
growth.

It's odd to call it mine. It's very widely known and used.
"The one I mentioned" might be a less confusing description.

I think it is a modern English idiom, which I dislike as
well. StackOverflow is full of questions starting like:
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do this?"
or "What is the way to do that?"

I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
(And incidentally is subtly insulting because of that,
whether it was meant that way or not.)

Second the use of "you" to mean an unspecified other person
is not idiom but standard usage. The word "you" is both a
definite pronoun and an indefinite pronoun, depending on
context. The word "they" also has this property. Consider
these two examples:

The bank downtown was robbed. They haven't been caught
yet.

They say the sheriff isn't going to run for re-election.

In the first example "they" is a definite pronoun, referring
to the people who robbed the bank. In the second example,
"they" is an indefinite pronoun, referring to unspecified
people in general (perhaps but not necessarily everyone).
The word "you" is similar: it can mean specifically the
listener, or it can mean generically anyone in a broader
audience, even those who never hear or read the statement
with "you" in it.

The word "one" used as a pronoun is more formal, and to me
at least often sounds stilted. In US English "one" is most
often an indefinite pronoun, either second person or third
person. But "one" can also be used as a first person
definite pronoun (referring to the speaker), which an online
reference tells me is chiefly British English. (I would
guess that this usage predominates in "the Queen's English"
dialect of English, but I have very little experience in
such things.)

Finally I would normally read "I" as a first person definite
pronoun, and not an indefinite pronoun. So I don't have any
problem with someone saying "how should I ..." when asking
for advice. They aren't asking how someone else should ...
but how they should ..., and what advice I might give could
very well depend on who is doing the asking.


From Rosario19@21:1/5 to [email protected] on Tue Jun 18 11:50:48 2024

On Mon, 17 Jun 2024 08:08:07 +0200, Janis Papanagnou <[email protected]> wrote:

In a recent thread realloc() was a substantial part of the discussion.
"Occasionally" the increased data storage will be relocated along
with the previously stored data. On huge data sets that might be a
performance factor. Is there any experience or are there any concrete
factors about the conditions when this relocation happens? - I could
imagine that it's no issue as long as you're in some kB buffer range,
but if, say, we're using realloc() to substantially increase buffers
often it might be an issue to consider. It would be good to get some
feeling about that internal.

Janis

The only problem I see is that memory that is already free should be the
first to be used, i.e. returned from malloc or realloc, because that
memory is already in a good position near the CPU.


From David Jones@21:1/5 to Anton Shepelev on Tue Jun 18 17:59:30 2024

XPost: sci.stat.math

Anton Shepelev wrote:

[cross-posted to: sci.stat.math]

Malcolm McLean:

We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we know how many
characters have arrived, can we predict how many will
arrive,

Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?

and therefore ask for the best amount when we reallocate,
so that we neither make too many reallocations (reallocate
on every byte received) nor ask for too much (demand
SIZE_MAX memory when the first byte is received)?

Your strategy for avoiding these extremes is exponential
growth. You allocate a small amount for the first few
bytes. Then you use exponential growth, with a factor of
either 2 or 1.5.

This strategy ensures a constant ratio of the amount of
reallocated data to the length of the buffer by making
reallocations less frequent as the buffer grows.

And so we integrate the distribution between the point we
are at and infinity. Then we take the mean. And that gives
us a best estimate of how many bytes are to come, and
therefore how much to grow the buffer by.

You have an a priori distribution of the buffer size (can be
tracked on-the-fly, if unknown beforehand) and a partially
filled buffer. The task is to calculate the a posteriori
distribution of that buffer's final size, and then to
allocate the predicted value based on a good percentile.

How about using a percentile instead of the mean, e.g. if
the current size corresponds to percentile p, you allocate a
capacity corresponding to percentile 1-(1-p)/k , where k>1
denotes the balance between space and time efficiency. For
example, if the 60th percentile of the buffer is required
and k=2, you allocate a capacity sufficient to hold
100-(100-60)/2=80% of buffers.

Based on essentially no background to this question, not much can be
said. However, if one starts from the suggestion above to use the mean
of some distribution (or later some percentile), one notes that the
"mean" is just the minimum of a quadratic cost function, so an
improvement would be to base the choice on some more realistic cost
function, chosen for the actual application. Given that the scenario
apparently involves a sequence of such decisions, the obvious extension
of the cost-based approach would be to employ some form of dynamic
programming. Of course, this might not be appealing, in which case one
might choose the theoretically-simple approach of tuning a policy based
on good stochastic simulations of the situation.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Duffy@21:1/5 to Anton Shepelev on Wed Jun 19 06:48:17 2024

XPost: sci.stat.math

In sci.stat.math Anton Shepelev <anton.txt@g{oogle}mail.com> wrote:

[cross-posted to: sci.stat.math]

Malcolm McLean:

We have a continuously growing buffer, and we want the
best strategy for reallocations as the stream of
characters comes at us. So, given we know how many
characters have arrived, can we predict how many will
arrive,

Do you mean in the next bunch, or in total (till the end of
the buffer's lifetime)?

Isn't this a halting problem? Aren't the more important data:
how much memory the user is allowed to allocate, the properties of
the current system's memory allocation algorithm, when your stream
will have to go to disc or other slow large volume storage, how
the stream can be compressed on the fly (the latter might well give
strong predictions for future storage requirements based on what
has been read to date)?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Wed Jun 19 15:20:00 2024

XPost: sci.stat.math

Malcolm McLean writes that, given the log-normal distribution
of file sizes with known parameters,

we can work out, given that a file is at least N
characters, what is the probability that an allocation of
any size will contain the whole file, and how many bytes,
on average will be wasted.

This is why I thought statisticians might help him: Malcolm
wants to find the a posteriori distribution of the size of a
file, after it has been found to exceed N bytes. Am I right
that if we take the remaining (N>20) part of the density
function and re-normalise it, we shall obtain the desired
distribution?

My proposition was as follows:

1. Find quantile q0 corresponding to the buffer size
currently requested.

2. Calculate new quantile q1 = 1-(1-q0)/k, where k>1 is
an adjustable parameter, and use its corresponding
value as the new allocation size.

For example, assuming for simplicity a uniform [0,20]
distribution of file sizes and k=2, a sequence of allocations
may look like this:

requested   allocated
    2       20-(20- 2)/2 = 11
   12       20-(20-12)/2 = 16
   18       20-(20-18)/2 = 19
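
A minimal C sketch of the two steps above, assuming the uniform [0,max]
distribution of the example, so that a size x sits at quantile x/max. The
names next_quantile and next_alloc_uniform are made up for illustration;
any other distribution would need its own CDF and quantile function.

```c
/* Step 2 of the proposition: q1 = 1 - (1 - q0)/k, with k > 1. */
double next_quantile(double q0, double k)
{
    return 1.0 - (1.0 - q0) / k;
}

/* Uniform [0, max] sizes (an assumption of this sketch): the quantile of
 * size x is x / max, and the size at quantile q is q * max. */
double next_alloc_uniform(double requested, double max, double k)
{
    double q0 = requested / max;          /* step 1: quantile of the request */
    return next_quantile(q0, k) * max;    /* step 2: size at the new quantile */
}
```

With max = 20 and k = 2 this gives 11, 16 and 19 for requests of 2, 12
and 18, matching the table.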
--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Malcolm McLean on Wed Jun 19 16:36:01 2024

XPost: sci.stat.math

Malcolm McLean <[email protected]> writes:

No. We have to have some knowledge. And what we probably know is that the
input is a file stored on someone's personal computer. And someone has
published on the statistical distribution of such files

That's not the case that matters (to me at least). If the input is a
file, we have a much better way of "guessing" the size than guessing and growing -- just ask for the size. Sure, we might need to make
adjustments if the file is changing, but there is always a better
measure than any statistical analysis.

To some extent this seems like a solution in search of a problem.
Growing the buffer exponentially is simple and effective.
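
To put rough numbers on "simple and effective", here is a small simulation
sketch in C. It does no real allocation; it only counts how often a buffer
filled one byte at a time would have to grow, and how many bytes would be
copied if every growth relocated the data (a worst-case assumption). The
function names and the starting capacity are illustrative choices.

```c
struct growth_stats {
    unsigned long long reallocs;   /* number of times the buffer grew */
    unsigned long long copied;     /* bytes copied, if every growth relocates */
};

/* Append n bytes one at a time; when full, capacity becomes cap*num/den. */
struct growth_stats simulate_geometric(unsigned long long n,
                                       unsigned long long cap,
                                       unsigned num, unsigned den)
{
    struct growth_stats s = {0, 0};
    for (unsigned long long len = 0; len < n; len++) {
        if (len == cap) {
            unsigned long long next = cap * num / den;
            if (next <= cap)
                next = cap + 1;    /* always make progress for tiny caps */
            s.reallocs++;
            s.copied += len;
            cap = next;
        }
    }
    return s;
}

/* Same, but capacity grows by a constant step each time. */
struct growth_stats simulate_additive(unsigned long long n,
                                      unsigned long long cap,
                                      unsigned long long step)
{
    struct growth_stats s = {0, 0};
    for (unsigned long long len = 0; len < n; len++) {
        if (len == cap) {
            s.reallocs++;
            s.copied += len;
            cap += step;
        }
    }
    return s;
}
```

For one million bytes and an initial capacity of 16, doubling grows 16
times and copies 1,048,560 bytes in total; adding 16 each time grows
62,499 times and copies about 31 GB.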

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Ben Bacarisse on Wed Jun 19 19:41:49 2024

XPost: sci.stat.math

On 19/06/2024 17:36, Ben Bacarisse wrote:

Malcolm McLean <[email protected]> writes:

No. We have to have some knowledge. And what we probably know is that the
input is a file stored on someone's personal computer. And someone has
published on the statistical distribution of such files

That's not the case that matters (to me at least). If the input is a
file, we have a much better way of "guessing" the size than guessing and growing -- just ask for the size. Sure, we might need to make
adjustments if the file is changing, but there is always a better
measure than any statistical analysis.

To some extent this seems like a solution in search of a problem.

It seems more like a solution that doesn't exist in search of a problem
with absurdly unrealistic requirements. And even if Malcolm's solution existed, and the problem existed, it /still/ wouldn't work - knowing the distribution of file sizes tells us nothing about the size of any given
file.

Growing the buffer exponentially is simple and effective.

Yes, that's the general way to handle buffers when you don't know what
size they should be.

A better solution for this sort of program is usually, as you say,
asking the OS for the file size (there is no standard library function
for getting the file size, but it's not hard to do for any realistic
target OS). And then for big files, prefer mmap to reading the file
into a buffer.
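
As a sketch of "asking the OS for the file size", assuming a POSIX target
where stat() is available (other systems have their own calls). The
function name file_size_or_minus1 is hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Return the size in bytes of the regular file at `path`, or -1 if it
 * cannot be examined or is not a regular file (pipe, device, ...). */
long long file_size_or_minus1(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISREG(st.st_mode))
        return -1;                 /* size is not meaningful here */
    return (long long)st.st_size;
}
```

The size is only a snapshot: if the file can change while it is being
read, the reading code still has to cope with a shorter or longer file.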

It's only really for unsized "files" such as piped input that you have
no way of getting the size, and then exponential growth is the way to
go. Personally, I'd start with a big size (perhaps 10 MB) that is
bigger than you are likely to need in practice, but small enough that it
is negligible on even vaguely modern computers. Then the realloc code is unlikely to be used (but it can still be there for completeness).
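
A minimal sketch of that approach for unsized input, assuming standard C
stdio only. The name slurp_stream and the doubling factor are illustrative
choices; initial_cap is where a generous starting size such as 10 MB would
be passed.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything from `in` into a malloc'd buffer, doubling the capacity
 * whenever it fills up. Returns NULL on failure; on success the caller
 * owns the buffer and *out_len holds the number of bytes read. */
char *slurp_stream(FILE *in, size_t initial_cap, size_t *out_len)
{
    size_t cap = initial_cap ? initial_cap : 4096;
    size_t len = 0;
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    for (;;) {
        len += fread(buf + len, 1, cap - len, in);
        if (len < cap)
            break;                         /* short read: EOF or error */
        if (cap > SIZE_MAX / 2) {          /* doubling would overflow */
            free(buf);
            return NULL;
        }
        char *bigger = realloc(buf, cap * 2);
        if (!bigger) {                     /* old buffer is still valid */
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

With a large enough initial_cap the realloc branch is never taken for
typical inputs, but it stays correct for the rare oversized one.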

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to David Brown on Wed Jun 19 22:24:35 2024

XPost: sci.stat.math

David Brown <[email protected]> writes:

On 19/06/2024 17:36, Ben Bacarisse wrote:

Growing the buffer exponentially is simple and effective.

Yes, that's the general way to handle buffers when you don't know what size they should be.

A better solution for this sort of program is usually, as you say, asking the OS for the file size (there is no standard library function for getting the file size, but it's not hard to do for any realistic target OS). And then for big files, prefer mmap to reading the file into a buffer.

It's only really for unsized "files" such as piped input that you have no
way of getting the size, and then exponential growth is the way to go. Personally, I'd start with a big size (perhaps 10 MB) that is bigger than
you are likely to need in practice, but small enough that it is negligible
on even vaguely modern computers. Then the realloc code is unlikely to be used (but it can still be there for completeness).

There are other uses that have nothing to do with files. I have a small dynamic array library (just a couple of functions) that I use for all
sorts of things. I can read a file or parse tokens or input a line just
by adding characters. Because of its rather general use, I don't start
with a large buffer (though the initial size can be set).

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Thu Jun 20 01:53:47 2024

XPost: sci.stat.math

Malcolm McLean:

We have to have some knowledge. And what we probably know
is that the input is a file stored on someone's personal
computer. And someone has published on the statistical
distribution of such files. And they have a log-normal
distribution with a mean and a median which he gives. So
with that information, we can work out, given that a file
is at least N characters, what is the probability that an
allocation of any size will contain the whole file, and
how many bytes, on average will be wasted.

Observe that the standard algorithm of exponential growth is
memoryless and self-similar in that it does not depend on
context, or the history of previous reallocations. These
properties belong to (or even identify?) the exponential
distribution. We can therefore assume that exponential-
growth strategy is ideal for exponentially distributed
buffer sizes, and under that assumption determine the
relation between the CDF values (p) corresponding to
consecutive re-allocations:

p  = 1-e^(-L*x)   (CDF of the exponential distribution with rate L),
p0 = 1-e^(-L*x0) ,
p1 = 1-e^(-L*x1) ,
x1 = k*x0 (by our strategy), =>
p1 = 1-(1-p0)^k .

which does not depend on the distribution and lets us
generalise this approach for any distribution:

x1 = Q( 1 - ( 1 - CDF(x0) )^k )

where:

x0 : the required size
x1 : the new recommended capacity
Q(p) : the p-Quantile of the given distribution
CDF(x): the CDF of the given distribution
k>1 : balance between speed and space efficiency
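
A rough C sketch of this rule, under two assumptions that are not part of
the formula: k is restricted to a whole number so the power can be computed
without the math library, and the distribution is supplied as a pair of
function pointers. The uniform [0,20] helpers are only a stand-in for a
real file-size distribution.

```c
typedef double (*dist_fn)(double);

/* x1 = Q( 1 - (1 - CDF(x0))^k ), with integer k >= 1 in this sketch. */
double next_capacity(double x0, unsigned k, dist_fn cdf, dist_fn quantile)
{
    double tail = 1.0 - cdf(x0);       /* probability of exceeding x0 */
    double tail_k = 1.0;
    for (unsigned i = 0; i < k; i++)
        tail_k *= tail;                /* (1 - CDF(x0))^k */
    return quantile(1.0 - tail_k);
}

/* Example distribution: sizes uniform on [0, 20]. */
double uniform20_cdf(double x)
{
    if (x < 0.0)  return 0.0;
    if (x > 20.0) return 1.0;
    return x / 20.0;
}

double uniform20_quantile(double p)
{
    return p * 20.0;
}
```

For the uniform example and k=2, a required size of 12 has a tail of 0.4,
so the recommended capacity is 20*(1-0.16) = 16.8.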

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Ben Bacarisse on Thu Jun 20 13:22:31 2024

On 19/06/2024 23:24, Ben Bacarisse wrote:

David Brown <[email protected]> writes:

On 19/06/2024 17:36, Ben Bacarisse wrote:

Growing the buffer exponentially is simple and effective.

Yes, that's the general way to handle buffers when you don't know what
size they should be.

A better solution for this sort of program is usually, as you say, asking
the OS for the file size (there is no standard library function for getting
the file size, but it's not hard to do for any realistic target OS). And
then for big files, prefer mmap to reading the file into a buffer.

It's only really for unsized "files" such as piped input that you have no
way of getting the size, and then exponential growth is the way to go.
Personally, I'd start with a big size (perhaps 10 MB) that is bigger than
you are likely to need in practice, but small enough that it is negligible
on even vaguely modern computers. Then the realloc code is unlikely to be
used (but it can still be there for completeness).

There are other uses that have nothing to do with files.

Of course. This comment was for the specific purposes being discussed
here. For other uses, there can be many other structures and algorithms
that fit better. Exponentially increasing the size when needed is a
good general-purpose method.

I have a small
dynamic array library (just a couple of functions) that I use for all
sorts of things. I can read a file or parse tokens or input a line just
by adding characters. Because of its rather general use, I don't start
with a large buffer (though the initial size can be set).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Thu Jun 20 21:08:00 2024

On 17/06/2024 10:22, Bonita Montero wrote:

realloc() is just a convenience function. Usually the reallocation
can't happen in-place and a second malloc() followed by a copy and
a free() does the same.
For large data it would be nice if the pages being deallocated later
would be incrementally marked as discardable after copying a portion.
This would result in only a small portion of additional physical
memory being allocated since the newly allocated pages become
associated with physical pages when they're touched first. Windows has
VirtualAlloc() with MEM_RESET for that, Linux has madvise() with
MADV_DONTNEED.

"Usually can't happen in place"?

Really? It's not something I use a lot, but when it's appropriate I
will. It's got the advantage over doing this myself that for some
portion of calls all the run time library needs to do is change the size
field in the structure.

Nothing else.

No copying, and no duplicate allocations.

What proportion of calls can be managed by changing the size field alone depends on your workload and the platform. But I doubt there are many
cases where it is 0%.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Mon Jun 24 08:40:03 2024

On Fri, 21 Jun 2024 21:12:12 +0200, Bonita Montero wrote:

Usually you don't resize the block with a few bytes ...

The usual way I use realloc is to maintain separate counts of the number
of array elements I have allocated, and the number I am actually using. A realloc call is only needed when the latter hits the former. Every time I
call realloc, I will extend by some minimum number of array elements (e.g. 128), roughly comparable to the sort of array size I typically end up
with.

And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...
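
A minimal sketch of that pattern in C, with hypothetical names (int_vec,
GROW_STEP). The final shrink treats a failed realloc as harmless, because
the original block is still valid and merely larger than needed.

```c
#include <stdlib.h>

struct int_vec {
    int   *data;
    size_t used;        /* elements actually in use */
    size_t allocated;   /* elements allocated */
};

enum { GROW_STEP = 128 };   /* fixed extension, as in the description */

/* Append one value; returns 0 on success, -1 if memory ran out. */
int int_vec_push(struct int_vec *v, int value)
{
    if (v->used == v->allocated) {
        size_t new_alloc = v->allocated + GROW_STEP;
        int *nd = realloc(v->data, new_alloc * sizeof *nd);
        if (!nd)
            return -1;              /* v->data is untouched and still valid */
        v->data = nd;
        v->allocated = new_alloc;
    }
    v->data[v->used++] = value;
    return 0;
}

/* Final shrink to the used size; on failure the old block is kept. */
void int_vec_shrink(struct int_vec *v)
{
    if (v->used == 0)
        return;                     /* avoid realloc(p, 0) */
    int *nd = realloc(v->data, v->used * sizeof *nd);
    if (nd) {
        v->data = nd;               /* the block may have moved */
        v->allocated = v->used;
    }
}
```

The important detail is that the result of realloc goes into a temporary
first, so neither a failure nor a relocation loses track of the data.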

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Keith Thompson on Mon Jun 24 13:40:08 2024

On 24/06/2024 11:55, Keith Thompson wrote:

Something else that occurs to me: If a shrinking realloc() never fails
in practice, then any code you write to handle a failure won't be
tested.

That is always a problem with allocation functions. Have you ever known
a non-pathological malloc() to fail?

I think, in fact, there's a good argument for ignoring the possibility
of malloc (and calloc and realloc) failures for most PC code. There is virtually no chance of failure in reality, and if you get one, there is
almost never a sensible way to deal with it - you just kick the can down
the road by having functions return NULL until something gives up and
stops the program with an error message. You might as well just let the
OS kill the program when you try to access memory at address 0.

I've seen more than enough error handling code that has never been
tested in practice - including error handling code with bugs that lead
to far worse problems than just killing the program.

Of course such treatment is not appropriate for all allocations (or
other functions that could fail). But often I think it is better to
write clearer and fully testable (and tested!) code which ignores
hypothetical errors, rather than some of the untestable and untested
jumbles that are sometimes seen in an attempt to "handle" allocation
failures.
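
One common middle ground, shown here only as a sketch and not as anything
proposed above, is a tiny wrapper that stops the program with a message at
the point of failure, so callers never see a null pointer and there is no
error path to leave untested. The name xmalloc is a convention, not a
standard function.

```c
#include <stdio.h>
#include <stdlib.h>

/* malloc that never returns NULL for a non-zero request:
 * on failure it reports the problem and terminates the program. */
void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (!p && size != 0) {
        fprintf(stderr, "out of memory (%zu bytes requested)\n", size);
        abort();
    }
    return p;
}
```

This is not suitable where an allocation failure has to be survived, for
example when the request size comes from untrusted input.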

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Malcolm McLean on Mon Jun 24 09:32:40 2024

Malcolm McLean <[email protected]> writes:

On 18/06/2024 08:09, Tim Rentsch wrote:

Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

Ben Bacarisse to Malcolm McLean:

[next is a comment from Malcolm]

Your strategy for avoiding these extremes is exponential
growth.

It's odd to call it mine. It's very widely known and used.
"The one I mentioned" might be a less confusing description.

I think it is a modern English idiom, which I dislike as
well. StackOverflow is full of questions starting like:
"How do you do this?" and "How do I do that?" They are
informal ways of the more literary "How does one do this?"
or "What is the way to do that?"

I have a different take here. First the "your" of "your
strategy" reads as a definite pronoun, meaning it refers
specifically to Ben and not to some unknown other party.
(And incidentally is subtly insulting because of that,
whether it was meant that way or not.)

Second the use of "you" to mean an unspecified other person
is not idiom but standard usage. The word "you" is both a
definite pronoun and an indefinite pronoun, depending on
context. The word "they" also has this property. Consider
these two examples:

The bank downtown was robbed. They haven't been caught
yet.

They say the sheriff isn't going to run for re-election.

In the first example "they" is a definite pronoun, referring
to the people who robbed the bank. In the second example,
"they" is an indefinite pronoun, referring to unspecified
people in general (perhaps but not necessarily everyone).
The word "you" is similar: it can mean specifically the
listener, or it can mean generically anyone in a broader
audience, even those who never hear or read the statement
with "you" in it.

The word "one" used as a pronoun is more formal, and to me
at least often sounds stilted. In US English "one" is most
often an indefinite pronoun, either second person or third
person. But "one" can also be used as a first person
definite pronoun (referring to the speaker), which an online
reference tells me is chiefly British English. (I would
guess that this usage predominates in "the Queen's English"
dialect of English, but I have very little experience in
such things.)

Finally I would normally read "I" as a first person definite
pronoun, and not an indefinite pronoun. So I don't have any
problem with someone saying "how should I ..." when asking
for advice. They aren't asking how someone else should ...
but how they should ..., and what advice I might give could
very well depend on who is doing the asking.

Ben said

Restore snipped Ben upthread

"In practice, the cost is usually moderate and can be very
effectively managed by using an exponential allocation scheme: at
every reallocation multiply the storage space by some factor greater
than 1 (I often use 3/2, but doubling is often used as well)."

So it's open and shut, and no two ways about it. Ben's strategy is exponential growth. And to be fair I use that strategy myself in
functions like fslutp(). It's only not Ben's strategy if we mean to
imply that Ben was the first person to use exponential growth, or the
first to understand the mathematical implications, and of course
that's not the case. It was all worked out by Euler long before any
of us were born. [...]

You have an annoying habit. Your writing often comes across as
authoritarian and somewhat condescending. Furthermore you tend not
to listen very well. Your response above is a case in point. You
ignore what I'm talking about (which is not whether Ben uses an
exponential growth strategy, or whether such a strategy is "Ben's"
or not), and instead talk about something that is irrelevant to what
I was saying. You have completely missed the point. Your comments
do nothing to extend the conversation. From where I sit all they do
is cause irritation and illustrate how muddled your thinking is.
I'm sure this isn't the first time you've heard comments along these
lines. It would be nice if you would make an effort to improve
your behavior in light of these repeated comments.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Tim Rentsch on Mon Jun 24 19:19:38 2024

On 24/06/2024 18:32, Tim Rentsch wrote:

You have an annoying habit. Your writing often comes across as
authoritarian and somewhat condescending. Furthermore you tend not
to listen very well.

The irony of that post is /astounding/.

I have met few people with a greater knowledge and insight in the C
language than you. And I have met few with less self-insight.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Malcolm McLean on Mon Jun 24 18:20:32 2024

Malcolm McLean <[email protected]> writes:

On 24/06/2024 12:40, David Brown wrote:

Of course such treatment is not appropriate for all allocations (or
other functions that could fail). But often I think it is better to
write clearer and fully testable (and tested!) code which ignores
hypothetical errors, rather than some of the untestable and untested
jumbles that are sometimes seen in an attempt to "handle" allocation
failures.

Baby X has bbx_malloc() which is guaranteed never to return NULL, and
never to return a pointer to an allocation which cannot be indexed by an
int.

What do you mean by 'indexed by an int'? So, what happens if I index
your allocation with -109235?

Or did you mean to say unsigned (or positive) int less than the
size of the allocation?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to David Brown on Mon Jun 24 22:59:23 2024

On Mon, 24 Jun 2024 13:40:08 +0200, David Brown wrote:

Have you ever known a non-pathological malloc() to fail?

I was once commissioned, many decades ago, to write a multispectral image viewer to run on old MacOS. I followed my usual memory-allocation
discipline. The client reported how he tried to open too many images at
once, and ran out of memory; my program reported one out-of-memory error,
gave up trying to open the rest of the files, and gracefully recovered
without crashing.

The program that had been supplied to him for Microsoft Windows, however,
gave an error for *each* file it failed to open.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Malcolm McLean on Tue Jun 25 07:06:41 2024

On Mon, 24 Jun 2024 18:50:15 +0100, Malcolm McLean wrote:

Baby X has bbx_malloc() which is guaranteed never to return NULL ...

Does it actually allocate the (physical) memory?

I wrote a memory-hog app for Android once, and found that allocating large amounts of memory space had very little impact on the system. Then when I
added code to actually write data into those allocated pages, that’s when
it really started to break into a sweat ...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Keith Thompson on Tue Jun 25 07:02:39 2024

On Mon, 24 Jun 2024 02:55:39 -0700, Keith Thompson wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

The usual way I use realloc is to maintain separate counts of the
number of array elements I have allocated, and the number I am actually
using. A realloc call is only needed when the latter hits the former.
Every time I call realloc, I will extend by some minimum number of
array elements (e.g. 128), roughly comparable to the sort of array size
I typically end up with.

And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...

It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.

...

Having said all that, if realloc fails (indicated by returning a null pointer), you still have the original pointer to the object.

In other words, it’s safe to ignore any error from that last shrinking realloc? That’s good enough for me. ;)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Tue Jun 25 11:55:02 2024

On 25/06/2024 09:48, Bonita Montero wrote:

Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.

#include <stdio.h>
#include <stdlib.h>

int main()
{
    void *p = malloc( 0x100000000 );
    printf( "%p\n", p );
    p = realloc( p, 1 );
    printf( "%p\n", p );
    malloc( 0x100000000 - 0x10000 );
    p = realloc( p, 0x100000000 );
    printf( "%p\n", p );
}

Try allocating a bunch of little items, and looking at where they are.
They'll likely be contiguous, or evenly spaced, depending on your implementation and what "little" is.

Then resize them all. Some will move.
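
A sketch of that experiment in C, with a hypothetical helper name. Each
address is saved as an integer before the realloc call, since the old
pointer value should not be inspected afterwards. The result is entirely
implementation dependent, so no particular count is expected.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n blocks of old_size bytes, resize each to new_size, and
 * return how many of them ended up at a different address. */
size_t count_moved(size_t n, size_t old_size, size_t new_size)
{
    void **blocks = calloc(n, sizeof *blocks);
    size_t moved = 0;
    if (!blocks)
        return 0;

    for (size_t i = 0; i < n; i++)
        blocks[i] = malloc(old_size);

    for (size_t i = 0; i < n; i++) {
        if (!blocks[i])
            continue;                       /* that malloc failed */
        uintptr_t before = (uintptr_t)blocks[i];
        void *np = realloc(blocks[i], new_size);
        if (!np)
            continue;                       /* old block is still valid */
        if ((uintptr_t)np != before)
            moved++;
        blocks[i] = np;
    }

    for (size_t i = 0; i < n; i++)
        free(blocks[i]);
    free(blocks);
    return moved;
}
```

Something like count_moved(1000, 16, 4096) can then be compared across
allocators, or against a single large block resized on its own.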

Andy.
--
Your C++ comment up-thread BTW is off-topic here. My favourite C++
container is vector, and that has a reserve call so you can keep growing
the container without lots of reallocations.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Richard Damon@21:1/5 to Keith Thompson on Tue Jun 25 07:21:42 2024

On 6/25/24 6:05 AM, Keith Thompson wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Mon, 24 Jun 2024 02:55:39 -0700, Keith Thompson wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

The usual way I use realloc is to maintain separate counts of the
number of array elements I have allocated, and the number I am actually
using. A realloc call is only needed when the latter hits the former.
Every time I call realloc, I will extend by some minimum number of
array elements (e.g. 128), roughly comparable to the sort of array size
I typically end up with.

And then when the structure is complete, I do a final realloc call to
shrink it down so the size is actually that used. Is it safe to assume
such a call will never fail? Hmm ...

It's not safe to assume that a shrinking realloc call will never fail.
It's possible that it will never fail in any existing implementation,
but the standard makes no such guarantee.

...

Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.

In other words, it’s safe to ignore any error from that last shrinking
realloc? That’s good enough for me. ;)

What? No, that's not what I said at all.

Suppose you do something like:

some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);

If the realloc() succeeds and doesn't relocate and copy the object,
you're fine. If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory. If realloc() fails, it returns
a null pointer, but the original memory is still valid -- but again, the assignment clobbers your only pointer to it.

I presume you can write code that handles all three possibilities, but
you can't just ignore any errors.

The idiom I always learned for realloc was something like:

some_type *p = malloc(size);
if (!p) {
    // allocation failed, do something about it. (might be just abort)
}

...

some_type *np = realloc(p, new_size);
if (np) {
    p = np;
} else {
    // p still points to old buffer, but you didn't get the new size
    // so do what you can to handle the situation.
}

// p here points to the current buffer,
// might be the old size or the new.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From DFS@21:1/5 to Bonita Montero on Tue Jun 25 09:56:58 2024

On 6/25/2024 4:48 AM, Bonita Montero wrote:

Test this code with your Linux installation. For my installation
glibc does all realloc()ations in-place. Really surprising for me.

#include <stdio.h>
#include <stdlib.h>

int main()
{
    void *p = malloc( 0x100000000 );
    printf( "%p\n", p );
    p = realloc( p, 1 );
    printf( "%p\n", p );
    malloc( 0x100000000 - 0x10000 );
    p = realloc( p, 0x100000000 );
    printf( "%p\n", p );
}

$ gcc -Wall montera_test.c -o mt
montera_test.c: In function ‘main’:
montera_test.c:10:9: warning: ignoring return value of ‘malloc’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
10 | malloc( 0x100000000 - 0x10000 );
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

$ ./mt
0x7fb976f12010
0x7fb976f12010
0x7fb876f11010

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Bonita Montero on Wed Jun 26 00:51:40 2024

On Tue, 25 Jun 2024 10:38:28 +0200, Bonita Montero wrote:

Am 25.06.2024 um 09:06 schrieb Lawrence D'Oliveiro:

I wrote a memory-hog app for Android once, and found that allocating
large amounts of memory space had very little impact on the system.
Then when I added code to actually write data into those allocated
pages, that’s when it really started to break into a sweat ...

Then android is also doing overcommit.

It is running a Linux kernel, and that tends to be the default setup in
Linux.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Wed Jun 26 12:15:33 2024

On 25/06/2024 12:28, Bonita Montero wrote:

The interesting part is that after doing the first realloc()
the memory being freed isn't reused for the next malloc().

That's entirely implementation dependent.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Phil Carmody@21:1/5 to Keith Thompson on Fri Jun 28 11:01:38 2024

Keith Thompson <[email protected]> writes:

Suppose you do something like:

some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);

... If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory.

Surely some mistake?

However, such self-assignments are bad for the reasons you state later;
verify, then update.

Phil
--
We are no longer hunters and nomads. No longer awed and frightened, as we have gained some understanding of the world in which we live. As such, we can cast aside childish remnants from the dawn of our civilization.
-- NotSanguine on SoylentNews, after Eugen Weber in /The Western Tradition/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Kuyper@21:1/5 to Lawrence D'Oliveiro on Fri Jun 28 06:36:45 2024

On 6/25/24 03:02, Lawrence D'Oliveiro wrote:

On Mon, 24 Jun 2024 02:55:39 -0700, Keith Thompson wrote:

...

Having said all that, if realloc fails (indicated by returning a null
pointer), you still have the original pointer to the object.

In other words, it’s safe to ignore any error from that last shrinking realloc? That’s good enough for me. ;)

No, you misunderstand:

q = realloc(p, SMALL_VALUE);

Then if q is null, p still points at the originally allocated memory. If
q is not null, then it may point at newly allocated memory, and p has an indeterminate value. You cannot go forward ignoring the possibility that
no new object was allocated, because if you do, you have no way of
knowing which of the two pointers you can safely dereference. You need,
at least,

if (q)
    p = q;

then you can safely use p, regardless of whether realloc() allocated new memory.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Kuyper@21:1/5 to Keith Thompson on Fri Jun 28 06:37:49 2024

On 6/25/24 06:05, Keith Thompson wrote:
...

Suppose you do something like:

some_type *p = malloc(BIG_VALUE);
// ...
p = realloc(p, SMALL_VALUE);

If the realloc() succeeds and doesn't relocate and copy the object,
you're fine. If realloc() succeeds and *does* relocate the object, p
still points to memory that has now been deallocated, and you don't have
a pointer to the newly allocated memory. ...

? I believe that, in that case, p does point to the newly allocated memory.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Malcolm McLean on Sat Jun 29 00:14:46 2024

On Tue, 18 Jun 2024 11:46:36 +0100, Malcolm McLean wrote:

Here are some real stats on file sizes, in case anyone is interested.

Data set / OS    Log-normal median  Log-normal mean  Arithmetic mean  50% occupied by (< mean)
whole data set    9.0 KB            730 KB           1.5 MB           < 5.4 KB
Mac OS            8.0 KB            533 KB           1.4 MB           < 4.9 KB
Windows          11.5 KB            1.0 MB           1.7 MB           < 8.3 KB
GNU/Linux        10.8 KB            1.7 MB           2.2 MB           < 4.8 KB

https://www.researchgate.net/publication/353066615_How_Big_Are_Peoples%27_Computer_Files_File_Size_Distributions_Among_User-managed_Collections

I don’t see any error bars. Without those, it is hard to attach any
significance to the differences in figures.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich Ulrich@21:1/5 to anton.txt@g{oogle}mail.com on Tue Jul 2 00:51:33 2024

XPost: sci.stat.math

On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
<anton.txt@g{oogle}mail.com> wrote:

[cross-posted to: sci.stat.math]

Anton,

The post being responded to was originally to comp.lang.c
which I don't subscribe to.

I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.

Was this thread something posted 15 or 20 years ago?

I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.

My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.

I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.

--
Rich Ulrich

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Paul@21:1/5 to Rich Ulrich on Tue Jul 2 03:02:07 2024

XPost: sci.stat.math

On 7/2/2024 12:51 AM, Rich Ulrich wrote:

On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev <anton.txt@g{oogle}mail.com> wrote:

[cross-posted to: sci.stat.math]

Anton,

The post being responded to was originally to comp.lang.c
which I don't subscribe to.

I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.

Was this thread something posted 15 or 20 years ago?

I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.

My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.

I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.

MID: <v4ojs8$gvji$[email protected]>

http://al.howardknight.net/

That gives this URL, as a copy of the message kicking off the thread.

http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cv4ojs8%24gvji%241%40dont-email.me%3E

Some USENET News clients can work from the MID directly, but Thunderbird does not.
A bare MID does not work for everyone.

Paul

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Malcolm McLean on Tue Jul 2 16:39:21 2024

Malcolm McLean <[email protected]> writes:

On 29/06/2024 01:14, Lawrence D'Oliveiro wrote:

On Tue, 18 Jun 2024 11:46:36 +0100, Malcolm McLean wrote:

Here are some real stats on file sizes, in case anyone is interested.

Data set / OS     Log-normal median   Log-normal mean   Arithmetic mean   50% occupied by (< mean)

whole data set    9.0 KB              730 KB            1.5 MB            < 5.4 KB
Mac OS            8.0 KB              533 KB            1.4 MB            < 4.9 KB
Windows           11.5 KB             1.0 MB            1.7 MB            < 8.3 KB
GNU/Linux         10.8 KB             1.7 MB            2.2 MB            < 4.8 KB

https://www.researchgate.net/publication/353066615_How_Big_Are_Peoples%27_Computer_Files_File_Size_Distributions_Among_User-managed_Collections

I don’t see any error bars. Without those, it is hard to attach any
significance to the differences in figures.

You don't need error bars because those figures indicate a
distribution. The files are log-normally distributed with given means
and median. So the spread is part of that data.

There are (or should be) two different distributions. Error bars are
intended to show the spread within the data. The log normal
distribution is across the data.

Now I suspect they didn't do it this way and just amalgamated all the
file save data into one, but that explains rather than excuses the lack
of error bars!

--
Ben.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich Ulrich@21:1/5 to [email protected] on Tue Jul 2 11:45:25 2024

XPost: sci.stat.math

On Mon, 01 Jul 2024 22:10:00 -0700, Keith Thompson <[email protected]> wrote:

Rich Ulrich <[email protected]> writes:

On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
<anton.txt@g{oogle}mail.com> wrote:

[cross-posted to: sci.stat.math]

Anton,

The post being responded to was originally to comp.lang.c
which I don't subscribe to.

I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.

Was this thread something posted 15 or 20 years ago?

I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.

My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.

I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.

The first article in this thread was posted to comp.lang.c by Janis
Papanagnou on 17 Jun 2024.

There were several followups on the same day. The direct parent of your
article was cross-posted to comp.lang.c and sci.stat.math by Anton
Shepelev (his was the first cross-posted article in the thread).

Thanks, so it looks like a failure by GigaNews to retrieve the
recent posts.

I did see a bunch of cross-posted followups. I don't know C,
and I thought there could be more context.

Looking at original, absent posts is something I've done dozens
of times over the years. Never a problem, except for a time or two
with posts from the 1990s. I've used GigaNews since my regular
ISP stopped providing Usenet access, maybe 15 years ago.

--
Rich Ulrich

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich Ulrich@21:1/5 to [email protected] on Tue Jul 2 11:58:11 2024

XPost: sci.stat.math

On Tue, 02 Jul 2024 11:52:56 -0400, Rich Ulrich
<[email protected]> wrote:

Forte Agent invites me to click on the MID; asks if it is a
mail or MID; asks if it should search the net. It still works
when I test it on an old message in another group.

Now it occurs to me -- It actually does make sense,
economically, if what is searched online is limited to the
groups I subscribe to, or (even) only the group that
is currently active.

--
Rich Ulrich

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich Ulrich@21:1/5 to Paul on Tue Jul 2 11:52:56 2024

XPost: sci.stat.math

On Tue, 2 Jul 2024 03:02:07 -0400, Paul <[email protected]d> wrote:

On 7/2/2024 12:51 AM, Rich Ulrich wrote:

On Mon, 17 Jun 2024 18:02:49 +0300, Anton Shepelev
<anton.txt@g{oogle}mail.com> wrote:

[cross-posted to: sci.stat.math]

Anton,

The post being responded to was originally to comp.lang.c
which I don't subscribe to.

I have a question that I suppose reflects on my news source,
GigaNews, or else on my reader, Forte Agent.

Was this thread something posted 15 or 20 years ago?

I tried to call up the original post by clicking on the Message
ID when looking at headers; nothing comes up when Agent goes
online to look. The header shows multiple earlier messages;
none of them come up for me.

My clicking on Message ID works elsewhere. The logical and
simple explanation is that this is a thread old enough that
GigaNews does not have it.

I suppose that someone else might be able to tell me, if their
supplier goes back further or if GigaNews is somehow failing
to show me something that is recent.

MID: <v4ojs8$gvji$[email protected]>

http://al.howardknight.net/

That gives this URL, as a copy of the message kicking off the thread.

http://al.howardknight.net/?STYPE=msgid&MSGI=%3Cv4ojs8%24gvji%241%40dont-email.me%3E

Yes, that's the message I see when I plug the Message ID
into the program at http://al.howardknight.net/

Thanks. I'm saving that.

Some USENET News clients can work from the MID directly, but Thunderbird does not.
A bare MID does not work for everyone.

Forte Agent invites me to click on the MID; asks if it is a
mail or MID; asks if it should search the net. It still works
when I test it on an old message in another group.

--
Rich Ulrich

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Paul@21:1/5 to Rich Ulrich on Tue Jul 2 15:09:25 2024

XPost: sci.stat.math

On 7/2/2024 11:58 AM, Rich Ulrich wrote:

On Tue, 02 Jul 2024 11:52:56 -0400, Rich Ulrich
<[email protected]> wrote:

Forte Agent invites me to click on the MID; asks if it is a
mail or MID; asks if it should search the net. It still works
when I test it on an old message in another group.

Now it occurs to me -- It actually does make sense,
economically, if what is searched online is limited to the
groups I subscribe to, or (even) only the group that
is currently active.

Every device has "retention", but retention is limited.

Whether it's a search site, or a USENET server (even Forte had
their own news server, at one time), you need retention for
older articles to be search-able either as body text, or as
a <mid>.

Now that Google Groups is no longer connected to USENET,
that's one fewer place with a decent-sized archive. Google closed
their service after the ThaiSpam incident. The Eternal-September
server changed from one server to a two-server setup. The
Transit Server had Spam Assassin loaded on it, removing ThaiSpam,
and the second server continued to offer normal ("filtered") service.

The comp.lang.c group was one of the groups under attack. Since
Google was letting the spam in, now that Google is disconnected,
the spam is gone, and the readership on CLC has gone up.

Paul

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Kuyper@21:1/5 to Paul on Tue Jul 2 16:58:14 2024

XPost: sci.stat.math

On 7/2/24 15:09, Paul wrote:
...

Now that Google Groups is no longer connected to USENET,

...

that's one fewer place with a decent-sized archive. Google closed
their service after the ThaiSpam incident.

While they no longer store new messages, they still have one of the
largest archives of old messages, and it's still available for searching.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Kuyper@21:1/5 to Paul on Tue Jul 2 16:54:46 2024

XPost: sci.stat.math

On 7/2/24 15:09, Paul wrote:
...

Now that Google Groups is no longer connected to USENET,

I just checked, and as I had expected, Google is still connected.

that's one fewer place with a decent-sized archive. Google closed
their service after the ThaiSpam incident.

They didn't close their service. They just stopped adding new messages
to their archives. The messages that were stored prior to the closing
are still available for searching.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Malcolm McLean on Wed Jul 3 23:48:41 2024

On Tue, 2 Jul 2024 10:18:32 +0100, Malcolm McLean wrote:

The files are log-normally distributed with given means and median. So the spread is part of that data.

That’s an assumption of the parametric fit, not a fact of the data. Error bars would indicate how close the fit is.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Mon Jul 8 19:34:56 2024

XPost: sci.stat.math

I had plumb forgot about this solution of mine:

p0 = 1-e^(-L*x0) ,
p1 = 1-e^(-L*x1) ,
x1 = k*x0 (by our strategy), =>
p1 = 1-(1-p0)^k .

which does not depend on the distribution and lets us
generalise this approach for any distribution:

x1 = Q( 1 - ( 1 - CDF(x0) )^k )
where:

x0 : the required size
x1 : the new recommended capacity
Q(p) : the p-Quantile of the given distribution
CDF(x): the CDF of the given distribution
k>1 : balance between speed and space efficiency

Let us test it with the exponential distribution, for which:

Q (p) = -Ln( 1 - p )/L
CDF(x) = 1 - e^(-Lx)

Substituting these into the equation for x1:

x1 = Q ( 1 - ( 1 - ( 1 - e^(-Lx0) ) )^k )
   = Q ( 1 - ( e^(-Lx0) )^k )
   = Q ( 1 - e^(-kLx0) )
   = -Ln( e^(-kLx0) )/L
   = k*x0 (QED)

That is, my solution is a/the generalisation of the
exponential growth strategy.
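
The rule x1 = Q( 1 - ( 1 - CDF(x0) )^k ) can also be tried in code on a distribution other than the exponential. What follows is only a sketch under stated assumptions: the names dist_fn, next_capacity, uniform_cdf and uniform_quantile are hypothetical, the uniform distribution on [0, 1000] is chosen purely for illustration, and k is restricted to a positive integer so that the power is a plain loop (a real-valued k such as 3/2 would need pow() from <math.h>).

```c
#include <assert.h>

/* Hypothetical callback type: a CDF or a quantile function of one variable. */
typedef double (*dist_fn)(double);

/* Assumed example distribution: sizes uniform on [0, UNIFORM_MAX]. */
#define UNIFORM_MAX 1000.0

double uniform_cdf(double x)
{
    if (x <= 0.0) return 0.0;
    if (x >= UNIFORM_MAX) return 1.0;
    return x / UNIFORM_MAX;
}

double uniform_quantile(double p)
{
    return p * UNIFORM_MAX;
}

/*
 * x1 = Q(1 - (1 - CDF(x0))^k), with integer k >= 1 in this sketch.
 * x0 is the required size, the return value is the recommended capacity.
 */
double next_capacity(double x0, unsigned k, dist_fn cdf, dist_fn quantile)
{
    assert(k >= 1);

    double tail = 1.0 - cdf(x0);    /* probability that the final size exceeds x0 */
    double tail_k = 1.0;
    for (unsigned i = 0; i < k; i++)
        tail_k *= tail;             /* tail^k */

    return quantile(1.0 - tail_k);
}
```

With sizes uniform on [0, 1000] and k = 2, a required size of 500 gives a recommended capacity of 750, and k = 1 returns the required size unchanged.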

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Mon Jul 8 20:01:21 2024

XPost: sci.stat.math

Rich Ulrich:

Thanks, so it looks like a failure by GigaNews to retrieve
the recent posts.

I did see a bunch of cross-posted followups. I don't know
C, and I thought there could be more context.

Characters are being read in (say, from a file) sequentially
and stored in computer memory (RAM) in an "array" -- a
linear data structure storing elements (our characters) in a
sequential order, one after the other at addresses with
increasing indexes -- somewhat like a mathematical vector.

In order to store a character in an array, sufficient memory
has to be "allocated" for it, but while reading we do not
know beforehand the size of the file (or the total length of
the sequence), and therefore increase the allocated array
size prospectively once the previous allocation is filled.
This operation is called `realloc' and frequently involves
the tedious copying of the entire array onto a new location
in memory, taking a time in proportion to the number of
elements so far allocated.
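
For readers who do not know C, the mechanism just described might look roughly like the sketch below. The struct charbuf and the function charbuf_push are hypothetical names, and doubling the capacity is just one possible choice of growth step.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable character array. Start with all fields zero. */
struct charbuf {
    char  *data;   /* allocated storage, or NULL while empty */
    size_t len;    /* number of characters stored */
    size_t cap;    /* number of characters the storage can hold */
};

/* Appends one character; returns 1 on success, 0 if memory runs out. */
int charbuf_push(struct charbuf *b, char c)
{
    assert(b != NULL);

    if (b->len == b->cap) {                 /* previous allocation is full */
        if (b->cap > SIZE_MAX / 2)
            return 0;                       /* doubling would overflow */
        size_t new_cap = b->cap ? b->cap * 2 : 16;

        /* realloc may extend the block in place or copy it elsewhere. */
        char *tmp = realloc(b->data, new_cap);
        if (tmp == NULL)
            return 0;                       /* old data is still intact */
        b->data = tmp;
        b->cap = new_cap;
    }
    b->data[b->len++] = c;
    return 1;
}
```

A reading loop would call charbuf_push(&b, (char)c) for each character c returned by getc() until EOF, and free(b.data) once the data is no longer needed.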

The question is to develop an optimal allocation strategy for
a given distribution of file sizes. The fastest solution is
to allocate a gigantic array beforehand, but it is a
terrible waste of memory. The slowest solution is to
reallocate for each single character read, but it is a
terrible waste of CPU time. As I understand the problem, a
strategy is needed that manifests some compromise between
the extremes.
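
The trade-off between the extremes can be put into numbers with a small counting model. This is a sketch only: simulate_growth and struct growth_cost are hypothetical names, no memory is actually allocated, and the model assumes the worst case in which every capacity increase relocates the array and copies everything stored so far.

```c
#include <assert.h>
#include <stddef.h>

struct growth_cost {
    size_t reallocs;   /* capacity increases, including the initial allocation */
    size_t copied;     /* elements copied if every increase relocates the array */
    size_t final_cap;  /* capacity once all n elements are stored */
};

/*
 * Hypothetical model: n elements are stored one at a time, and a full array
 * grows from cap to cap * num / den + add (but always by at least 1).
 * Overflow of cap * num is ignored in this sketch.
 */
struct growth_cost simulate_growth(size_t n, size_t num, size_t den, size_t add)
{
    struct growth_cost r = {0, 0, 0};
    size_t cap = 0;

    assert(den > 0);
    for (size_t len = 0; len < n; len++) {
        if (len == cap) {                   /* no room for the next element */
            size_t new_cap = cap * num / den + add;
            if (new_cap <= cap)
                new_cap = cap + 1;
            r.reallocs++;
            r.copied += len;                /* worst case: all stored elements move */
            cap = new_cap;
        }
    }
    r.final_cap = cap;
    return r;
}
```

For 1000 characters, simulate_growth(1000, 1, 1, 1) (grow by one element each time) reports 1000 reallocations and up to 499500 element copies, while simulate_growth(1000, 2, 1, 0) (doubling) reports 11 reallocations and at most 1023 copies, at the price of a final capacity of 1024.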

Looking at original, absent posts is something I've done
dozens of times over the years. Never a problem, except
for a time or two with posts from the 1990s. I've used
GigaNews since my regular ISP stopped providing Usenet
access, maybe 15 years ago.

Just in case, there are many totally free Usenet servers,
e.g.:
http://www.eternal-september.org/
https://www.i2pn2.org/

and even a web interface:

https://www.novabbs.com

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich Ulrich@21:1/5 to anton.txt@g{oogle}mail.com on Sun Jul 21 19:40:16 2024

XPost: sci.stat.math

On Mon, 8 Jul 2024 20:01:21 +0300, Anton Shepelev
<anton.txt@g{oogle}mail.com> wrote:

Rich Ulrich:

Thanks, so it looks like a failure by GigaNews to retrieve
the recent posts.

I did see a bunch of cross-posted followups. I don't know
C, and I thought there could be more context.

<snip. Thanks for the details.>

Okay, today I discovered that there was no failure by Giganews
or by Forte Agent -- Instead, there was BEHAVIOR by Agent
that I was not aware of.

Today, I was looking at All Desks (Agent terminology) to see what
was in Sent, and I noticed there were messages in Inbox -- Those
were the messages that I thought Agent had failed to retrieve.
(Nice discussion there.)

I guess - every other time I've clicked on Message-ID, the old
message was in the group where I was reading and the old one
showed up where I was reading. I've read the Agent group for
ages, and I don't remember this feature ever being mentioned.

Live and learn.

--
Rich Ulrich

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Tue Jul 23 16:47:39 2024

XPost: sci.stat.math

Rich Ulrich:

Today, I was looking at All Desks (Agent terminology) to see
what was in Sent, and I noticed there were messages in
Inbox -- Those were the messages that I thought Agent had
failed to retrieve. (Nice discussion there.)

Looking forward to your take on the problem. Mine is rather
simplistic, but can be easily tested with several
distributions. This is like extrapolation: we know the
optimal solution for the given distribution (exponential)
and want to devise a general method to get the optimal
solution for any given distribution.

Malcolm, if you are still interested, can you provide a test
program that measures some statistics for various allocation
strategies on various distributions, including the
exponential?

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
