===============================================================================
Brief Background, Definition, Scope, and Purpose of the Proposal
===============================================================================
AI software grows more and more popular and has become a notable part of the
software ecosystem. This trend raises new questions and challenges, especially
about how the Debian Free Software Guidelines (DFSG) apply to pre-trained AI
models, and urges the Debian Project to revisit its interpretation of the DFSG
in the context of AI software and models.
A pre-trained "AI model" is usually stored on disk in a binary format designed
for numerical arrays, as a "model checkpoint" or "state dictionary". It is
essentially a collection of matrices and vectors holding the information
learned from the training data or simulator. When a user makes use of such a
file, it is usually loaded by an inference program, which performs numerical
computations to produce outputs based on the learned information in the model.
Please refer to the appendix for more background information about AI.
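To make the description above concrete, here is a minimal sketch in Python of
a "state dictionary" being written to a binary container, read back, and used
by a tiny inference routine. The on-disk layout, the names save_checkpoint,
load_checkpoint and infer, and the one-layer linear model are all assumptions
made for illustration; real frameworks use their own formats and far larger
arrays.

```python
import io
import struct


def save_checkpoint(state, stream):
    """Write each named array as: name, shape, then raw float64 values."""
    stream.write(struct.pack("<I", len(state)))
    for name, (shape, values) in state.items():
        encoded = name.encode("utf-8")
        stream.write(struct.pack("<I", len(encoded)))
        stream.write(encoded)
        stream.write(struct.pack("<I", len(shape)))
        stream.write(struct.pack(f"<{len(shape)}I", *shape))
        stream.write(struct.pack(f"<{len(values)}d", *values))


def load_checkpoint(stream):
    """Read back the arrays written by save_checkpoint."""
    (count,) = struct.unpack("<I", stream.read(4))
    state = {}
    for _ in range(count):
        (name_len,) = struct.unpack("<I", stream.read(4))
        name = stream.read(name_len).decode("utf-8")
        (ndim,) = struct.unpack("<I", stream.read(4))
        shape = struct.unpack(f"<{ndim}I", stream.read(4 * ndim))
        size = 1
        for dim in shape:
            size *= dim
        values = list(struct.unpack(f"<{size}d", stream.read(8 * size)))
        state[name] = (shape, values)
    return state


def infer(state, x):
    """Hypothetical inference program: one linear layer, y = W x + b."""
    (rows, cols), w = state["weight"]
    _, b = state["bias"]
    return [
        sum(w[r * cols + c] * x[c] for c in range(cols)) + b[r]
        for r in range(rows)
    ]


# The "learned information" is nothing more than these numbers.
state = {
    "weight": ((2, 3), [1.0, 0.0, -1.0, 0.5, 0.5, 0.5]),
    "bias": ((2,), [0.1, -0.2]),
}

buffer = io.BytesIO()          # stands in for a file on disk
save_checkpoint(state, buffer)
buffer.seek(0)
restored = load_checkpoint(buffer)
output = infer(restored, [2.0, 4.0, 6.0])
```

Note that nothing in the restored dictionary records which data or training
program produced the numbers, which is the gap this proposal is concerned
with.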
This proposal focuses on one interpretation of the DFSG for a particular type
of pre-trained AI model, one that (1) is released under a DFSG-compliant free
software license such as MIT/Expat or Apache-2.0, and satisfies at least one of
(2) and (3) below: (2) it is trained on data or a simulator that is private,
proprietary, or inaccessible to the public; (3) the original training program
is not provided.
To avoid creating new terminology, we will refer to this type of file as "AI
models released under open source license without original training data or
program" without any abbreviation. Such models are referred to as "Open
Weights" in some circumstances (See: https://opensource.org/ai/open-weights).
The purpose of this proposal is to reach a community consensus on how we should
treat and handle the described type of AI models, an issue we will inevitably
face in the future. If necessary, I can work with the Debian Policy Team to
incorporate the GR result into appropriate sections of the Debian Policy (e.g.,
in Section 10 "Files").
Note: While nowadays people use "AI" to refer to LLMs, it is a very broad term
that covers much more than language models. AI models apart from language
models must be considered as well, such as computer vision models, audio recognition models, etc.
Note: If condition (1) is not satisfied, the model is usually seen as "non-free"
in the context of the Debian community and no voting is needed. In addition, if
everything (including but not limited to the model itself, training data,
training program, and inference program) is released under DFSG-compliant
licenses, that again needs no voting.
Note: Traditional software parts, like a Python script or a C++ program, are
out of the scope of this proposal since that is a well-defined case. For example, a deep learning framework or inference software written in
Python or C++, i.e., the program that runs the AI models, is out of the scope of this proposal.
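The scope defined by conditions (1), (2) and (3) and by the notes above can be
read as a small decision rule. The following Python sketch is only one reading
of that rule; the ModelRelease class, its field names and the in_scope
function are hypothetical and reduce each condition to a yes/no flag, which a
real review would not be able to do so cleanly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRelease:
    # Hypothetical yes/no summary of a model release.
    dfsg_free_license: bool            # condition (1)
    data_private_or_proprietary: bool  # condition (2)
    training_program_missing: bool     # condition (3)


def in_scope(release: ModelRelease) -> bool:
    """True when the release is the type of model this proposal covers."""
    if not release.dfsg_free_license:
        # Already seen as non-free; no vote is needed for this case.
        return False
    # Condition (1) holds, so the model is in scope if (2) or (3) holds.
    return release.data_private_or_proprietary or release.training_program_missing


open_weights = ModelRelease(True, True, False)
non_free = ModelRelease(False, True, True)
fully_free = ModelRelease(True, False, False)
```

Under this reading, only open_weights is covered by the proposal; non_free and
fully_free are the two cases the notes describe as needing no vote.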
===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.
===============================================================================
The "AI models released under open source license without original training
data or program", a particular type of file as explained above, are not seen
as DFSG-compliant. Hence, they cannot be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section
of the Debian archive can include those files.
-------------------------------------------------------------------------------
Appendix
-------------------------------------------------------------------------------
Inevitably, some terminology and background may not be well known or well
understood by the general public. Please refer to the appendices for more
information. If you cannot find relevant information to answer your question,
please consult a human professional -- or an LLM.
See appendix A for detailed rationale of this proposal.
See appendix B for background and comments about current AI software.
See appendix C for some related previous efforts and discussions.
See appendix D for comments on potential implications of this proposal.
[Appendix A] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixA.txt
[Appendix B] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixB.txt
[Appendix C] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixC.txt
[Appendix D] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixD.txt
Disclaimer
----------
We acknowledge that releasing useful AI models under permissive licenses like
MIT/Expat and Apache-2.0 is a generous act by the original authors, given the
huge costs involved, and that it is a great contribution to the software
ecosystem and to society. We sincerely respect the respective authors' work.
On the other hand, the DFSG sets a pretty high standard for software that can
be included in the Debian distribution, which means the GR may lead to some
results that not everybody agrees with. Nevertheless, we appreciate your
understanding of the mission of the Debian project -- to create a free
operating system, where "free" means "software freedom".
On Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou wrote:
> We acknowledge that releasing useful AI models under permissive licenses like
> MIT/Expat and Apache-2.0 is a generous act from the original authors due to
> huge costs, and it is a great contribution to the software ecosystem and the
> society. We sincerely respect the respective authors' work.
i'm not sure i can subscribe to this. after all, most if not all "AI" models
exist because of stealing other people's work...
or did i miss consensual "AIs"?
On Sat, 19 Apr 2025 at 13:56:17 -0400, M. Zhou made this GR proposal:
> The "AI models released under open source license without original
> training data or program", a particular type of files as explained
> above, are not seen as DFSG-compliant. Hence, they can not be included
> in the "main" section of the Debian archive.
Do we have an idea of whether/how many models that match this definition
already exist in main? In the Policy process it's usual to require an
estimate of how many packages a particular Policy change will make
"insta-RC-buggy", and I think GRs that change our self-imposed rules for
what we consider to be Free should do similarly.
This is the part to vote on. The appendixes are just supplementary.
> ===============================================================================
> Proposal A: "AI models released under open source license without original
> training data or program" are not seen as DFSG-compliant.
> ===============================================================================
> The "AI models released under open source license without original training
> data or program", a particular type of files as explained above, are not seen
> as DFSG-compliant. Hence, they can not be included in the "main" section of the
> Debian archive. This proposal does not specify whether the "non-free" section
> of Debian archive can include those files.
But the practical effects of passing the GR are probably (among other things):
a) Removal of OCR software (like tesseract[1])
b) Removal of image recognition software (like opencv[2])
c) Possibly removal of text-to-speech software (like festival[3] or flite[4])
Ansgar
[1]: https://sources.debian.org/src/tesseract-lang/1%3A4.1.0-2/
[2]: https://sources.debian.org/src/opencv/4.10.0%2Bdfsg-5/data/haarcascades/haarcascade_fullbody.xml/
[3]: https://sources.debian.org/src/festival-hi/0.1-11/hindi_NSK_diphone/festvox/hindi_NSK_ene.scm/#L3
[4]: https://sources.debian.org/src/flite/2.2-7/lang/cmu_us_kal/
My concern about the Japanese keyboard input method was addressed in the
"ToxicCandy Allowlist" by assessing it as a non-AI model in ML-policy.
The current policy proposal is vague about what is not an "AI model", and
it lacks a direct reference to the "ToxicCandy Allowlist". (Why is it missing?
Or did I overlook something?)
> "ToxicCandy Allowlist" by assessing it as non-AI model in ML-policy.
Could we stop using terms like "toxic" or "cancerous" or whatever in technical
discussions? (Unless we talk about toxic products or cancer treatment or
similar.)
FWIW, I looked specifically in the gnubg case a while ago, because it
was an interesting test case for this discussion.
Here's what I found out:
- The training program (using the language from the GR draft) is
allegedly available and licensed under GPL3.
- The training data is allegedly available as well, but comes without
any declared license. I tend to concur with you, Russ, that it's very
likely non-copyrightable material. But that's only partly reassuring
to me, because I'm not sure how Debian would practically go about
ruling that certain stuff that comes without copyright/license is fine
for main, whereas other stuff in the same situation is not.
I suspect no source technically exists for those weights anywhere, since upstream's training work was fairly manual as I recall and is not
something they tend to work iteratively on, but it's possible that one of
the upstream developers has all of the data and scripts somewhere.
The fact remains that our builders will be unable to reproduce the
resulting network, for well-known practical reasons. Thus we
mostly-have-to-trust the original publisher that their network has been
built as documented (or even "documented" given the status of gnubg). In
practice this is not a problem for a Backgammon engine, or even for
Tesseract, because any serious use case supports, if not requires, human
verification of the result -- but how sure can I be that an LLM intended
for home automation doesn't contain an Open Sesame backdoor that unlocks
my *home*'s back door?
However, I'm not sure it's very *practical* unless our position is that[...]
we're simply not going to package software that uses machine learning
models (a decision that we could certainly make, but which seems a bit contrary to our normal desire to be a universal operating system).
Problems just off the top of my head include:
On Sun Apr 27, 2025 at 12:37 PM BST, Holger Levsen wrote:
> I'm also very uncomfortable speaking about AIs similar like I don't
> like the term IP=intellectual property...
It would be worthwhile to restrict ourselves to "LLM" for these things
since "AI" is a much broader term and many other technologies (past or
future) may be described as "AI" whilst not being LLMs.
The GR as proposed would apply to a lot of things that are not LLMs,
though. I think the right terminology for what we're currently talking
about might be "machine learning model," which encompasses a wider set of
constructions from processed training data without limiting them to only
large-language models.
IMHO we are here having a very annoying mixture of technical, legal and
philosophical problems.
Hypothetical 1: Bob reads all programming manuals and all DFSG-free code in
Debian and GitHub and teaches themselves Python programming. They are asked
to solve a simple problem. Their answer basically matches sample solutions
from a few Python coding manuals.
Can Bob release this solution as DFSG-free code? Does it matter if the
specific programming manual or Python course manual was DFSG-free licensed
or not? Does it matter if the manual had a GPL license? What if they
learned in a university setting?
Hypothetical 2: An abstract AI Alice does the same learning process as Bob
and produces the same output in an answer to the same request.
Do the conditions on the output of Alice change? Is the change technical or
legal/philosophical? You could call this a Turing test for copyright.
Processing of experiences into expert opinion is IMHO not directly
comparable with compilation of source to a binary, regardless of whether it
is done by a human or a software system. Copyright law makes a distinction
here for humans. And while no explicit legal precedent is yet set for any
kind of AI (including LLMs), the very lack of massive copyright violation
lawsuits from very sue-happy corporations, like Disney, is already a
noteworthy precedent. If LLMs from Meta and OpenAI (and others) are not
being sued for massive copyright violations, then it is the consensus of
our society and of our legal system that the same kind of expert opinion /
learning protections that humans enjoy also seem to apply to complex-enough
artificial expert systems. One hand-wavy legal loophole could be that the
learning process splits the copyrighted works into chunks small enough that
none of those chunks would legally retain the copyright protection anymore.
But that is just one of many speculations until a law or a court
establishes such guidelines.
What does that mean in terms of this proposal (or a potential alternative
proposal)?
If we take as a given that copyright does *not* survive the learning
process of a (sufficiently complex) AI system, then it is *not* necessary
for all training *data* for training a DFSG-free AI to also be DFSG-free.
It is however necessary that:
* software needed for inference (usage) of the AI model be DFSG-free
* software needed for the training process of the AI model be DFSG-free
* software needed to gather, assemble and process the training data be
  DFSG-free, or the manual process for it be documented
In this perspective, we would be seeing the training data itself as
immutable and uncopyrightable facts of world and nature, like positions and
spectra of stars in the sky (because its copyright does not survive the
learning process). It is data that can be gathered again, maybe with slight
variation in results, and it does not really change based on who does the
gathering (assuming similar resources get invested).
*However*, models again are substantially different from regular
software (that gets modified in source and then compiled to a binary)
because such a model can be *modified* and adapted to your needs
directly from the end state. In fact, for adjusting an LLM for use in a
particular domain or a particular company it actually *is* the "binary"
that is the *preferred* form to be modified - you take a model that
"knows" a lot in general and "knows" how your language works and you
train the model further by doing specialisation training for your
specific data set. As a result, you get from one "generic" binary
another, "specialized" binary.
So, very precisely speaking, modification of an LLM does *not* require
the original training data. Recreating an LLM does. Also developing a new
LLM with different training methods or training conditions does need
some training data (ideally the original training data, especially to
compare end performance). But all in all a developer on a Desert Island
would be better off with a "binary" model to be modified than without
it.
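The point about modifying a model directly from its end state can be sketched
in a few lines of Python. This is a toy illustration under stated assumptions,
not how any real LLM is fine-tuned: the two-parameter linear model, the
fine_tune function, the learning rate and the small domain_data set are all
made up for the example. What it shows is that further training starts from
the published weights and uses only the new data.

```python
def predict(weights, x):
    """Toy model: a single linear unit."""
    return weights["w"] * x + weights["b"]


def mean_squared_error(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def fine_tune(weights, data, learning_rate=0.05, steps=200):
    """Continue gradient descent from existing weights, on new data only."""
    w, b = weights["w"], weights["b"]
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return {"w": w, "b": b}


# Stands in for a published "generic" checkpoint; the data that produced
# these numbers is not available here and is never used below.
pretrained = {"w": 2.0, "b": 0.0}

# Hypothetical domain-specific data held by the person doing the adaptation.
domain_data = [(0.0, 1.0), (1.0, 3.5), (2.0, 6.0), (3.0, 8.5)]

specialized = fine_tune(pretrained, domain_data)
loss_before = mean_squared_error(pretrained, domain_data)
loss_after = mean_squared_error(specialized, domain_data)
```

Recreating pretrained from scratch, by contrast, would need whatever data and
procedure produced it, which matches the distinction drawn above between
modifying and recreating a model.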
Say for example that an IDE saves its configuration state not in a
common text file, but as a binary memory dump. Say the maintainer of
such a package would use their experience of the IDE and years of
development to go through the GUI of this software to assemble a great
setup configuration that is great for anyone starting to use the IDE and
also has clues left around it how to tailor it further for your needs.
This configuration (as a binary memory dump of the software state) is
then distributed to the users as the default configuration. What is "the source" of it?
Isn't this binary (that the GUI can both read and write) not the
preferred form for modification? The maintainer can describe how he
created the GUI state (document the training process), but not really
include all his relevant experience (training data) that led him to
believe that this state is the best for the new users.
Or Debian could go the MS TTF route - have the software in the archive,
but no models at all. And to get the software to work users would get
used to running a script that would always be pulling a model from
huggingface.co either manually or even during package installation.
Possibly with a barely functional placeholder model in the package that
99% of users would replace in real usage. That would keep the "evil" AI
away from the archive, but will that benefit our users?
Will that benefit the development of a freer and more accessible AI
landscape?
Wait — Training data are chunks of software. I understand where you are getting to, but in order to redistribute it, we must have the right to. How do we say that training data are "immutable and uncopyrightable facts of world and nature"? The heavily trained machine didn't learn from objects randomly happening in nature...
(...)
So what is LLama if not a **very** complex nvim configfile focused on
autocomplete? :D Quite a few of those questions also apply to fonts (IMO).
In fact, for adjusting an LLM for use in a particular domain or
a particular company it actually *is* the "binary" that is the
*preferred* form to be modified - you take a model that "knows" a lot
in general and "knows" how your language works and you train the model
further by doing specialisation training on your specific data set.
As a result, from one "generic" binary you get another, "specialized" binary.
That *could* be the technical difference in definitions between what
is "DFSG-free AI" and what is "Debian-main-grade-free AI".
Processing of experiences into expert opinion is IMHO not directly
comparable with compilation of source to a binary. Regardless if it's
done by a human or a software system.
The copyright law makes a distinction here for humans. And while no
explicit legal precedent is yet set for any kinds of AI (including LLMs),
the very lack of massive copyright violation lawsuits from very sue-happy
corporations, like Disney, is already a noteworthy precedent. One
hand-wavy legal loophole could be that the learning process splits the
copyrighted works into chunks small enough that none of those chunks
would legally retain the copyright protection anymore.
Russ Allbery said [Mon, Apr 28, 2025 at 09:46:41AM -0700]:
I'm also very uncomfortable speaking about "AIs", in the same way that I
don't like the term IP (intellectual property)...
It would be worthwhile to restrict ourselves to "LLM" for these things
since "AI" is a much broader term and many other technologies (past or
future) may be described as "AI" whilst not being LLMs.
The GR as proposed would apply to a lot of things that are not LLMs,
though. I think the right terminology for what we're currently talking
about might be "machine learning model," which encompasses a wider set of
constructions from processed training data without limiting them to only
large-language models.
This is an important point, to which I subscribe. Since its inception over 60 years ago, "Artificial Intelligence" is fluffy marketspeak […]
However, here we have a clear and fundamental change happening in the
copyright law level - there is a legal break/firewall that is happening
during training. The model *is* a derivative work of the source code of
the training software, but is *not* a derivative work of the training
data.
This means that we also have to consider what exactly is training
data and how to deal with it, without automatically falling back to
equating it with source code.
On 04.05.25 14:27, Aigars Mahinovs wrote:
The simple fact that none of the LLMs have been sued out of
existence by any copyright owner is de facto proof that it does not
work that way in the eyes of the judicial system.
That may or may not be correct in the long run, IANAL and all that.
However. Copyright is only one aspect of whether or not models should
end up in main. Plain old reproducibility is important to us too.
On 04.05.25 15:44, Ansgar 🙀 wrote:
What is not reproducible (in the reproducible build sense Debian uses) about, say, the Tesseract OCR models?
My point is that reproducing a model requires input data, which requires us to distribute said data, which requires them to be of suitable copyright.
===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.
===============================================================================
The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.
This has so far also been the case for statistical data in Debian, such
as simple aggregates such as the number of packages in Debian, which
might be included in Debian without also including the entire Debian
archive as source, data about word or character frequencies in natural language texts, and so on. I guess proponents of the original GR would
also find this problematic?
On Sun, 4 May 2025 at 13:12, Wouter Verhelst <[email protected]> wrote:
On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
> However, here we have a clear and fundamental change happening in the
> copyright law level - there is a legal break/firewall that is happening
> during training. The model *is* a derivative work of the source code of
> the training software, but is *not* a derivative work of the training
> data.
I would disagree with this statement. How is a model not a derivative
work of the training data? Wikipedia defines it as
The simple fact that none of the LLMs have been sued out of
existence by any copyright owner is de facto proof that it does not
work that way in the eyes of the judicial system.
Wikipedia definition is a layman's simplification.
On Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou wrote:
===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.
===============================================================================
The "AI models released under open source license without original training
data or program", a particular type of files as explained above, are not seen
as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section
of Debian archive can include those files.
Could we avoid using the term 'Artificial intelligence' in the text of
the proposal (not in the appendix)? This term dates from 1970 and has had
a different meaning in each decade since then. In ten years it is likely
that, while the question this GR addresses will still be relevant, the
term 'Artificial intelligence' will refer to something quite different.
Wikipedia includes this citation:
"" However, many AI applications are not perceived as AI: "A lot of cutting
edge AI has filtered into general applications, often without being
called AI because once something becomes useful enough and common enough
it's not labeled AI anymore."[2][3] ""
<https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=1286364868>
We do not just follow the law in deciding what to distribute and how to
do it; if this were the case, then there would never have been any need
for a non-US, non-free, or non-free-firmware section of our archive.
On Sun, 4 May 2025 at 17:30, Wouter Verhelst <[email protected]> wrote:
> Wikipedia definition is a layman's simplification.
It may be a simplification, but that in and of itself does not make it
incorrect.
I have specifically addressed this point with examples in my reply.
Copyright very clearly does not survive learning and then generation of
new solutions. In humans that is a given.
For software I would assume the equivalence, unless proven
differently.
"M" == M Zhou <[email protected]> writes:
M> ===============================================================================
M> Proposal A: "AI models released under open source license without
M> original training data or program" are not seen as
M> DFSG-compliant.
M> ===============================================================================
M> The "AI models released under open source license without
M> original training data or program", a particular type of files as
I find the use of Open Source License in a Debian context problematic.
The DFSG is not the OSD, and we should care whether a license is
DFSG-free, not whether it is OSI-approved.
I hope that you would be willing to accept an amendment to replace all
uses of open source in your proposal.
The issue is also discussed here:
https://lwn.net/Articles/1019028/
A better wording goes:
s/open source license/DFSG-compatible license/g
It is fixed in the git repo:
https://salsa.debian.org/lumin/gr-ai-dfsg/-/commit/9496f9fb6405db5a99fff1672cd4bad66c925c24
The proposal after amendment:
===============================================================================
Proposal A: "AI models released under DFSG-compatible license without
original training data or program" are not seen as DFSG-compliant.
===============================================================================
The "AI models released under DFSG-compatible license without original
training data or program", a particular type of files as explained above, are
not seen as DFSG-compliant. Hence, they can not be included in the "main"
section of the Debian archive. This proposal does not specify whether the
"non-free" section of Debian archive can include those files.
"Gunnar" == Gunnar Wolf <[email protected]> writes:
Wikipedia includes this citation:
"" However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being
called AI because once something becomes useful enough and common enough it's not labeled AI anymore."[2][3] "" <https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=1286364868>
I agree with you. However, it is a term firmly set in the mind of too many people.
Keep in mind Mo Zhou's proposal is in a large way an answer to
OSI's OSAID⁴, which many among us feel to be a gross mistake.
⁴ https://opensource.org/blog/the-open-source-initiative-announces-the-release-of-the-industrys-first-open-source-ai-definition
Too many people (both "in the trade" and not) recognize the term AI.
On Sun, 4 May 2025 at 17:30, Wouter Verhelst <[email protected]> wrote:
It is incorrect, because the New York Times did in fact file suit
against Microsoft, OpenAI, and other parties related to copyright infringement of their large library of news articles in creating ChatGPT[1]. The case is still in court.
[1] https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/
Thanks for this link, it has been a very interesting read.
b) Removal of image recognition software (like opencv[2])
Not likely. I'm the uploader of src:opencv.
This is a pretty large library that contains lots of functionality
that does not require a "model" to function. For opencv, it is at
most adding one more file for the +dfsg file exclusion, or splitting
the model out into, say, a non-free package and setting Recommends to
pull the package.
This one is much simpler. Maybe because the lawyers being used are not too good.
https://www.courtlistener.com/docket/67538258/tremblay-v-openai-inc/
Authors claim a lot of stuff, basically a generic shotgun of copyright claims, but all secondary claims get dismissed by the court at pre-trial stage due to bad legal reasoning and failing to detail or prove any actual wrongdoing. And specifically a claim that all outputs from a LLM are
derived works of all inputs is dismissed based on already decided case law.
Only the claim of direct copyright infringement of using a text of a book
in the training process of a model still stands to await the actual trial. And there OpenAI is citing a lot of good reasons why that does not
constitute distribution at all and why the result of the work is transformative and thus is protected by fair use. Just the fact of
accessing some data at some point does not create copyright infringement.
The whole lawsuit is very sloppy IMHO, IANAL.
AFAIK this legal theory has not been tested in court yet. But the big commercial players (who, remember, have vested interests in being
copyright absolutists) believe in it so much, that they go as far as
offering legal indemnity promises to users of their LLMs who encounter
legal issues due to the use of generated output.
The transformative criteria here is that the resulting work needs to be
transformed in such a way that it adds value. And generating new texts
from a LLM is pretty clearly a value-adding transformation compared to
the original articles. Even more so than the already ruled-on Google
Books case.
On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:
> The transformative criteria here is that the resulting work needs to be
> transformed in such a way that it adds value. And generating new texts
> from a LLM is pretty clearly a value-adding transformation compared to
> the original articles. Even more so than the already ruled-on Google
> Books case.
OK, let me change it around a bit, because I don't think this discussion
is going in any direction that is relevant for Debian.
The only way in which you can build a model is by taking loads and loads
of data, running some piece of software over it, and storing the result
somewhere.
How can we do this legally, reproducibly, and openly if we do not have
the rights to redistribute the said "loads and loads of data"?
The answer is, we can't.
(...)
On Thu, 8 May 2025 at 12:46, Wouter Verhelst <[email protected]> wrote:
> On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:
> > The transformative criteria here is that the resulting work needs to be
> > transformed in such a way that it adds value. And generating new texts
> > from a LLM is pretty clearly a value-adding transformation compared to
> > the original articles. Even more so than the already ruled-on Google
> > Books case.
> OK, let me change it around a bit, because I don't think this discussion
> is going in any direction that is relevant for Debian.
> The only way in which you can build a model is by taking loads and loads
> of data, running some piece of software over it, and storing the result
> somewhere.
> How can we do this legally, reproducibly, and openly if we do not have
> the rights to redistribute the said "loads and loads of data"?
> The answer is, we can't.
Sure we can. It is a technical problem, actually. As long as the data
is still available, you can store and redistribute information about
which data you gathered, from where, and what it looked like - hashes of
copyrighted content are not copyrighted ;)
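The idea of recording which data was gathered, from where, and what it
looked like can be sketched as a small manifest of content hashes. This is
only an illustration in Python, not a method used by any project mentioned
in this thread; the ManifestEntry type, the describe, build_manifest and
verify helpers, and the example.org URLs are hypothetical names introduced
for the sketch.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ManifestEntry:
    """One record: where a document came from and a digest of its bytes."""
    source_url: str  # hypothetical location the document was fetched from
    sha256: str      # hex digest of the content as it was retrieved
    size: int        # content length in bytes


def describe(source_url: str, content: bytes) -> ManifestEntry:
    """Build a manifest entry without storing the content itself."""
    digest = hashlib.sha256(content).hexdigest()
    return ManifestEntry(source_url, digest, len(content))


def build_manifest(documents: Dict[str, bytes]) -> List[ManifestEntry]:
    """Describe every document, sorted by URL so the manifest is stable."""
    return [describe(url, data) for url, data in sorted(documents.items())]


def verify(entry: ManifestEntry, content: bytes) -> bool:
    """Check that freshly downloaded content still matches the record."""
    return describe(entry.source_url, content) == entry
```

Someone who later re-downloads a document from the recorded location can
call verify() to see whether it is byte-identical to what was originally
gathered. The manifest holds only URLs, digests and sizes, so it says
nothing about whether the data is still available or may be redistributed.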
Training data is not source code,
The problem is that all those missing factors are destined to go
un-missing — and then what? We can't base our rules on biological exceptionalism.
Why not? The entirety of law, politics, and civilization is designed by humans, for humans. Free software is a movement of humans that attempts to provide other humans with specific freedoms and guarantees around the software they use. I don't work on free software because I want to make something easier for Google's LLM. I work on free software because I want
to give freedom and control to human beings.
We're the ones building the system. Why should we not design the system
for us, to help us, to make our lives better?
The LLMs are by and large the creations of corporations because they have collective resources that dwarf the resources of nearly all individual humans. Where this line of reasoning goes in practice is to (further)
create a legal system that treats corporations and their tools as the most important actors and humans as secondary material for corporations to consume. We already have too much of that.
We *absolutely* should base our rules on what's best for human beings, not corporate constructs. That is the entire point of the free software
movement.
Russ Allbery writes ("Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models"):
We *absolutely* should base our rules on what's best for human beings, not corporate constructs. That is the entire point of the free software movement.
*applause*
> > We *absolutely* should base our rules on what's best for human beings, not
> > corporate constructs. That is the entire point of the free software
> > movement.
> *applause*
/me joins the cheering!
With all this enthusiasm, maybe it's time to resurrect
https://www.debian.org/vote/2004/vote_002
Matthias Urlichs <[email protected]> writes:
The problem is that all those missing factors are destined to go
un-missing — and then what? We can't base our rules on biological exceptionalism.
Why not? The entirety of law, politics, and civilization is designed by humans, for humans. Free software is a movement of humans that attempts to provide other humans with specific freedoms and guarantees around the software they use. I don't work on free software because I want to make something easier for Google's LLM. I work on free software because I want
to give freedom and control to human beings.
I'm not disputing any of that. *Of course* we should write our rules and
laws to benefit humans / humanity, not robots or AIs or corporate profiteering or what-have-you.
All I'm saying is that the idea "a human can examine a lot of
copyrighted stuff and then produce non-copyrighted output but a computer cannot" might still hold some water today, but the bucket is leaky and getting leakier every couple of months, if not weeks.
I find that thinking to be rather limited. LLMs are not self-aware or self-operating entities. There is always a human that uses an LLM.
It's their freedom that you are discounting.
Moreover - there are *far* more people that can use an LLM to benefit
from its gathered knowledge compared to the number of people that have
spent decades learning programming like we have. Hating on LLMs hurts
the freedom of a lot more people.
This was in response to Russ articulating that: "I don't work on free software because I want to make something easier for Google's LLM. I
work on free software because I want to give freedom and control to
human beings."
The false assumption here being that making "something easier for LLMs"
will only benefit Google (who are nowhere near top in terms of AI development, btw) and not "human beings", which quite obviously fails to
take into account any freedom and control that an LLM *does* in fact give
its users, who are also human beings.
Aigars, it would be a lot easier to have this conversation with you if you pay somewhat closer attention to what other people are really arguing.
2. What is the preferred form of modification? This is IMHO the
deciding, relevant question.
Aigars says weights and I've heard that from several other people active
in machine learning. OSI says the same.
Mo Zhou says training data is. I haven't heard that from anybody else.
It would be a lot easier to have a conversation with you, if you would
spend more time articulating and detailing *your own* position, instead
of guessing about the positions of others (and then talking down to
those positions). Ideally in the actual manner that matters to you.
This is going to be a really long mail message and I'm sorry.
I care about three different things when it comes to machine learning
models in Debian.
This is going to be a really long mail message and I'm sorry. I'm not
making it long to try to browbeat people; I'm making it long because I
don't know how to express how I feel in fewer words and still try to
capture the nuance and complications.
Second, in the specific case of *software*, I think our current compromise
is over-broad in what it protects. Software is frequently *not* a deeply
meaningful creative human communication that reflects its creator. It's
often algorithmic, mechanical, and functional, attributes that, elsewhere
in our copyright compromise, define works that are not protected by
copyright. I don't consider protecting every software program as strongly
as a novel or painting to be morally justifiable.
On Wed, 14 May 2025 at 08:58, Simon Josefsson <[email protected]> wrote:
To me I think we have at least two camps:
1) We must have DFSG-compliant licensing of source code for everything
in main, and that source code should encompass everything needed for a
skilled person to re-create identical (although possibly not bit-by-bit
identical) artifacts.
2) We must have DFSG-compliant licensing of source code for everything
in main, but training data is not part of source code. Instead source code for
training models would be code and protocol describing how to generate
or gather training data in such a way that a skilled person would be able to re-create functionally the same (although not identical) artifacts. If re-creation
is impractical (due to compute costs) then the model must also be modifiable after training by a skilled person with tooling in the archive.
2) It is acceptable to not have DFSG-compliant licensing for things that aren't important for Debian and still ship those, because doing so helps
our users and helping users is more important than DFSG-licensing.
Thanks for writing this! I find myself in agreement with most (if not
all) of it, but what is puzzling me is how this differs from the other proposals presented earlier.
Is it possible to give a short summary of the principles that differ in your thinking from Thorsten Glaser's proposal? I find myself agreeing
with both of you.
Thank you very much for the write-up. I do highly appreciate the time
taken to express your position. Now we do have a clear and coherent
moral position that could be read and understood.
I clearly disagree on a few key things, but, as you said, there is
little point discussing it if no decision is to be made right now.
On Wed, 14 May 2025 at 00:03, Soren Stoutner <[email protected]> wrote:
> On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:
> > 2. What is the preferred form of modification? This is IMHO the
> > deciding, relevant question.
> > Aigars says weights and I've heard that from several other people active
> > in machine learning. OSI says the same.
> > Mo Zhou says training data is. I haven't heard that from anybody else.
> I thought several other people besides Mo Zhou had also said that on this
> list, but just in case they haven’t, I would like to go on the record
> that I also feel that training data is one of the preferred forms of
> modification in machine learning and should thus be considered for
> anything being included in main.
Could you expand a bit on this topic, so I can understand this position better?
Say that we are talking about an otherwise-free LLM model trained on a multi-gigabyte data set. Data from the dataset may be downloaded from
the Internet (but may not be redistributed by Debian). Let's assume that
the source code of the LLM also includes a script that would, if
executed, do all the downloading and formatting of the training data
from Internet sources for you. The data *may* even be binary identical
to the original training data (if it is only trained on snapshotted
data mining collections that one can download from torrent via a
magnet link for example), or it may be in a newer state than when it
was trained originally (if you choose to switch to newer snapshots or
if data collection happens directly from source servers or their
proxies). You can add, remove or filter data sources to modify the
contents of the training data on a high or granular level.
Would that be a sufficient definition of training data to satisfy the preferred form of modification criteria for you?
If any use of the original training data (or of its description as
above) requires 100 000 Nvidia H100 cards running for a month using a
few billion USD of investment and several million dollars of
electricity, does that training data *still* satisfy the criteria for "preferred form of modification"?
And, to ask explicitly, is raw training data a better form of
modification for you compared to a description of that same training
data, in automated form that would generate the training data for you
on request?
Is it important for you if the training data *only* comes to you from
Debian mirrors? Or is the same data coming to you from other sources
also fine?
In my opinion, it is fine to include otherwise distributable ML
applications without available training data in non-free.
Technically - yes, and I would be fine to include OSI-free AI in
Debian non-free, but IMHO it does nothing to resolve ethical concerns.
If we limit that to only OSI-free AI then that would also be giving
the same kind of guidance to the AI community - with both upsides and downsides.
That is not what I asked. Redistributing is a completely different
question from a different point of DFSG and even from interpretation
of whether DFSG even applies to the training data as such. And that in
turn very specifically depends on a very isolated question - what is
the preferred form of modification. And that is why I am
*specifically* asking how your opinion that "training data is the
preferred form of modification" works in real world examples.
Only that specific criteria. Not about Debian, not about main or
non-main. Not for other people or for the project.
What does "preferable form of modification" mean for *you*? For
example in that case above. Is the raw training data *really* _the_ preferable form of modification? Or is it the data definition? Which
would you *prefer* to *modify*?
On Wed, 14 May 2025 at 23:13, Soren Stoutner <[email protected]> wrote:
> On Wednesday, May 14, 2025 1:51:27 PM Mountain Standard Time Aigars Mahinovs wrote:
> > That is not what I asked. Redistributing is a completely different
> > question from a different point of DFSG and even from interpretation
> > of whether DFSG even applies to the training data as such. And that in
> > turn very specifically depends on a very isolated question - what is
> > the preferred form of modification. And that is why I am
> > *specifically* asking how your opinion that "training data is the
> > preferred form of modification" works in real world examples.
> > Only that specific criteria. Not about Debian, not about main or
> > non-main. Not for other people or for the project.
> > What does "preferable form of modification" mean for *you*? For
> > example in that case above. Is the raw training data *really* _the_
> > preferable form of modification? Or is it the data definition? Which
> > would you *prefer* to *modify*?
> In my opinion, the preferred form of modification is the raw training data. I
> apologize if I did not make this clear in my previous email. I thought I had.
You would *actually* technically, in reality, prefer digging through gigabytes of text files and doing some kind of manual modifications in
that sea of raw data? Modifications that are basically impossible to
track in any kind of change tracker. That are excessively hard and
time consuming to actually do and check. Instead of just adjusting
input parameters on the ingest script? *That* is what I consider to be frankly very hard to believe.
I rather get the impression that you prefer expressing this position
because of the logical consequences on the discussion. Especially if
you immediately change the topic from preferred form of modification to
redistribution and DFSG and main and other things that are entirely
irrelevant to the question of what is the preferred form of
modification. Technically. In practice. Not morally or spiritually.
However, as you asked my opinion of what Debian’s policy should be, I endorse the above.
That is *very* explicitly *not* what I asked your opinion on. I asked
you to consider very specific examples and what is the preferred form
of modification in those cases. Really consider.
--
Best regards,
Aigars Mahinovs
During the course of my semester thesis on Retrieval-Augmented Generation (RAG), I encountered a compelling example wherein an AI model identified a previously unknown biomarker associated with cancer. This discovery was
only possible because the researchers had access to the underlying dataset. Without that access, the model’s findings would have been opaque and potentially unverifiable.
This brings me to a central concern: when data scientists are given a model to work with, their first question is often:
“What data was used to train it?”
This question is not incidental. It is fundamental to understanding the model’s behaviour, biases, and limitations. It is also essential for scientific reproducibility.
In the course of the earlier email exchange, it was argued that the
hardware requirements for training large-scale models place them out of
reach for anyone without a budget in the range of 100 M€. While this may be true for frontier-scale models, I believe it overlooks a significant
portion of real-world use cases.
In my undergraduate work, we frequently relied on publicly available
datasets from sources such as Kaggle. These enabled us to train our own models, interpret results, and explore data-driven questions in a hands-on manner. Providing access to training data empowers researchers,
institutions, and independent developers to create models adapted to their specific needs. Moreover, it facilitates the composability of data, an essential feature in interdisciplinary research and real-world applications.
It is absolutely critical to a very specific DFSG question on what is
the "preferred form of modification".
On Wed, May 14, 2025 at 10:51:27PM +0200, Aigars Mahinovs wrote:
It is absolutely critical to a very specific DFSG question on what is
the "preferred form of modification".
This is maybe a side terminological note, but note that the DFSG talks
about "source code", not "preferred form of modification". The latter expression comes historically from the GPL, rather than the DFSG (or OSD).
On Thu, 15 May 2025 at 10:06, Stefano Zacchiroli wrote: [..]
But I don't think it is disputable that the *most general* way of
modifying an ML model is achievable only starting from the full training dataset and pipeline. There are simply things that you cannot do
starting from the trained model.
This is not quite the point I was trying to make in this specific
thread. I was pointing out the difference between raw blob of training
data and pipeline that creates/gathers that raw blob of training data.
But I do think that it should be perfectly fine to have an ingest
pipeline that simply downloads "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz" for example.
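To make the distinction between the raw blob and the pipeline concrete, here is a minimal sketch of what such an ingest step might look like. It is an illustration only, not a description of any existing tool: the function names, the `limit` and `contains` parameters, and the base URL prefix are assumptions, and the demo builds a small listing locally instead of downloading anything.

```python
import gzip
import hashlib

# Assumption: entries in the path listing are relative to this prefix.
BASE_URL = "https://data.commoncrawl.org/"


def parse_path_listing(gz_bytes, limit=None, contains=None):
    """Turn a gzip-compressed, newline-separated path listing into full URLs.

    `contains` and `limit` are hypothetical knobs: the kind of input
    parameters one would adjust on an ingest script instead of editing
    the raw data by hand.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    paths = [line.strip() for line in text.splitlines() if line.strip()]
    if contains is not None:
        paths = [p for p in paths if contains in p]
    if limit is not None:
        paths = paths[:limit]
    return [BASE_URL + p for p in paths]


def sha256_hex(data):
    """Checksum of a downloaded file, so a later fetch can be compared."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Stand-in for the real listing; nothing is fetched from the network.
    listing = gzip.compress(b"crawl-data/a/x.warc.gz\ncrawl-data/b/y.warc.gz\n")
    for url in parse_path_listing(listing, limit=1):
        print(url)
    print(sha256_hex(listing))
```

In this sketch the "data definition" is the listing plus the parameters, which is small enough to keep under version control. Recording a checksum of each fetched file would at least make it detectable when externally hosted data changes, though it does not help if the data disappears.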
FWIW, I agree that "where is it hosted?" is a less important question
wrt the one of whether the full/pristine training dataset is
available, for our users, *somewhere* in the first place. But note
that if Debian accepts not to host datasets on its own infrastructure,
then a number of practical issues arise, e.g., what do we do with the package in main if/when the data disappears from the external hosting
place?