Forum: >>> Magnum BBS <<<

Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Mode

From M. Zhou@21:1/5 to All on Sat Apr 19 20:00:01 2025

=============================================================================== Brief Background, Definition, Scope, and Purpose of the Proposal ===============================================================================

AI software grows more and more popular, becoming a notable part of the software ecosystem. This trend reveals some new questions and challenges, especially in the interpretation of the Debian Free Software Guidelines (DFSG) on pre-trained AI models, urging the Debian Project to revisit its interpretation of the Debian Free Software Guidelines (DFSG) in the context of AI software and models.

A pre-trained "AI model" is usually stored on disk in binary formats designed for numerical arrays, as a "model checkpoint" or "state dictionary", which is essentially a collection of matrices and vectors, holding the learned information from the training data or simulator. When the user make use of such file, it is usually loaded by an inference program, which performs numerical computations to produce outputs based on the learned information in the model. Please refer to the appendix for more background information about AI.

This proposal focuses on one interpretation of the DFSG on a particular type of pre-trained AI models, that (1) is released under DFSG-compliant free software licenses like MIT/Expat, Apache-2.0, etc, and satisfies any of (2) and (3) below -- (2) is trained on data or simulator that is private, proprietary, or inaccessible to the public; (3) does not provide the original training program. To avoid creating new terminologies, we will refer to this type of file as "AI models released under open source license without original training data or program" without any abbreviation. Such models are referred to as "Open Weights" in some circumstances (See: https://opensource.org/ai/open-weights).

The purpose of this proposal is to reach a community consensus on how we should treat and handle the described type of AI models, which is an inevitable issue in the future. If necessary, I can work with the Debian Policy Team to incorporate the GR result into appropriate sections of the Debian Policy (e.g., in Section 10 "Files").

| Note: While nowadays people use "AI" to refer to LLMs, it is a very broad term
| that covers much more than language models. AI models apart from language
| models must be considered as well, such as computer vision models, audio
| recognition models, etc.

| Note: If condition (1) is not satisfied, it is usually seen "non-free" in the | context of Debian community and no voting is needed. In addition, if
| everything (including but not limited to the model itself, training data
| training program, and inference program) is released under DFSG-compliant
| licenses, that again needs no voting.

| Note: Traditional software parts, like a Python script or a C++ program, are | out of the scope of this proposal since that is a well-defined case. For
| example, a deep learning framework or inference software written in
| Python or C++, i.e., the program that runs the AI models, is out of the
| scope of this proposal.

=============================================================================== Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant. ===============================================================================

The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

------------------------------------------------------------------------------- Appendix -------------------------------------------------------------------------------

Inevitably there may be some terminology and/or backgrounds that is not well-known or well-understood by the general public. Please refer to the appendices for more information. If you cannot find relevant information to answer your question, please consult a human professional -- or an LLM.

See appendix A for detailed rationale of this proposal.
See appendix B for background and comments about current AI software.
See appendix C for some related previous efforts and discussions.
See appendix D for comments on potential implications of this proposal.

[Appendix A] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixA.txt
[Appendix B] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixB.txt
[Appendix C] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixC.txt
[Appendix D] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixD.txt

Disclaimer
----------

We acknowledge that releasing useful AI models under permissive licenses like MIT/Expat and Apache-2.0 is a generous act from the original authors due to huge costs, and it is a great contribution to the software ecosystem and the society. We sincerely respect the respective authors' work. On the other hand, DFSG sets a pretty high standard on software that can be included in the Debian distribution, which means the GR may lead to some results that not everybody agrees with. Nevertheless, we appreciate your understanding of the mission of the Debian project -- to create a free operating system, where the "free" means "software freedom".

-----BEGIN PGP SIGNATURE-----

iQJFBAABCgAvFiEEY4vHXsHlxYkGfjXeYmRes19oaooFAmgD48ERHGx1bWluQGRl Ymlhbi5vcmcACgkQYmRes19oaorFyQ/+Orf6+6RYieYD95s0gxBM2qmHISEa3Xeh yLKJ7t2uiOeI6iHSOsfHCfJ6Bn4WATQDH6k12DCnFYglWflIRXh9hh+uGpfA7bY3 PhvU4+hpbE3F2trV5Qi7ai4HuJvUEp6rA50pgHjA90kpKANHvAaNMRPBBuYuy3k2 JOBGKDAOYKTxA+l6NMpzaD99BD/5eBrK3BsXW/zMtFNhdXQhFDBlRJAdKZrE95gg nZkQP5I6kDzmHax9+7X3btq99OTPU6szD3s142ytV/+AGh32yYBQJGn0glXz2ito 7toDnCDZWTcTrbqZh6+9XHFAYBbxHdE4wIUk/UX6qzKidj1gBD31ED1DwYetBT3B sf0TuP8/5xie3ShmLOWHFx1q5k4rMvjpVObtie3glStYYzBG3ov3oDJ7iyibZgnT oK+89ajpDfbrWzHLLDZYPVSdIyrNLTPi4j8YXmqdURi1NXEOysjupndx1pAondxj FyJ4ijozuC6vq3w5jHw5U/nz/tmITLaw6/LSpkUURV6z7rg01t0xvW/74UQ4ejir nspyAWNZNkniUftrva3aKBA1JkJahYGyGFnaBmbeQNhi2/4CyWPJ8xp95L/euOr4 FZGG8McF3nM6ciq8SjTqBnJ5eoZlS1+a46D9pG+G59tVoDpaOIobFnaGqHETvofg
YoGwL06OB+E=
=ad91
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Francois Mazen@21:1/5 to All on Mon Apr 21 10:10:01 2025

I support and sponsor this proposal in its entirety.

Thanks M.Zhou/lumin for your hard work regarding DFSG and AI models.

François (mzf)

Le samedi 19 avril 2025 à 13:56 -0400, M. Zhou a écrit :

==============================================================================
=
Brief Background, Definition, Scope, and Purpose of the Proposal ==============================================================================
=

AI software grows more and more popular, becoming a notable part of the software ecosystem. This trend reveals some new questions and challenges, especially in the interpretation of the Debian Free Software Guidelines (DFSG)
on pre-trained AI models, urging the Debian Project to revisit its interpretation of the Debian Free Software Guidelines (DFSG) in the context of
AI software and models.

A pre-trained "AI model" is usually stored on disk in binary formats designed for numerical arrays, as a "model checkpoint" or "state dictionary", which is essentially a collection of matrices and vectors, holding the learned information from the training data or simulator. When the user make use of such
file, it is usually loaded by an inference program, which performs numerical computations to produce outputs based on the learned information in the model.
Please refer to the appendix for more background information about AI.

This proposal focuses on one interpretation of the DFSG on a particular type of
pre-trained AI models, that (1) is released under DFSG-compliant free software
licenses like MIT/Expat, Apache-2.0, etc, and satisfies any of (2) and (3) below -- (2) is trained on data or simulator that is private, proprietary, or inaccessible to the public; (3) does not provide the original training program.
To avoid creating new terminologies, we will refer to this type of file as "AI
models released under open source license without original training data or program" without any abbreviation. Such models are referred to as "Open Weights" in some circumstances (See: https://opensource.org/ai/open-weights).

The purpose of this proposal is to reach a community consensus on how we should
treat and handle the described type of AI models, which is an inevitable issue
in the future. If necessary, I can work with the Debian Policy Team to incorporate the GR result into appropriate sections of the Debian Policy (e.g.,
in Section 10 "Files").

Note: While nowadays people use "AI" to refer to LLMs, it is a very broad term
that covers much more than language models. AI models apart from language
models must be considered as well, such as computer vision models, audio recognition models, etc.

Note: If condition (1) is not satisfied, it is usually seen "non-free" in the
context of Debian community and no voting is needed. In addition, if
everything (including but not limited to the model itself, training data training program, and inference program) is released under DFSG-compliant
licenses, that again needs no voting.

Note: Traditional software parts, like a Python script or a C++ program, are
out of the scope of this proposal since that is a well-defined case. For example, a deep learning framework or inference software written in
Python or C++, i.e., the program that runs the AI models, is out of the scope of this proposal.

==============================================================================
=
Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.
==============================================================================
=

The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

------------------------------------------------------------------------------
-
Appendix ------------------------------------------------------------------------------
-

Inevitably there may be some terminology and/or backgrounds that is not well-known or well-understood by the general public. Please refer to the appendices for more information. If you cannot find relevant information to answer your question, please consult a human professional -- or an LLM.

See appendix A for detailed rationale of this proposal.
See appendix B for background and comments about current AI software.
See appendix C for some related previous efforts and discussions.
See appendix D for comments on potential implications of this proposal.

[Appendix A] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixA.txt
[Appendix B] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixB.txt
[Appendix C] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixC.txt
[Appendix D] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixD.txt

Disclaimer
----------

We acknowledge that releasing useful AI models under permissive licenses like MIT/Expat and Apache-2.0 is a generous act from the original authors due to huge costs, and it is a great contribution to the software ecosystem and the society. We sincerely respect the respective authors' work. On the other hand,
DFSG sets a pretty high standard on software that can be included in the Debian
distribution, which means the GR may lead to some results that not everybody agrees with. Nevertheless, we appreciate your understanding of the mission of
the Debian project -- to create a free operating system, where the "free" means
"software freedom".

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEhqWr1v/bCgx/UFfTR5f6chw1HJ4FAmgF+9UACgkQR5f6chw1 HJ54PxAA1TNdT1k+OLTapdX4UlljoQqD7D6QnB5g13qniVCy9OFJcnoyBrgLc0nH Zq/noA97NMm/TnLUh5KXXLuHFhuQSh1ZLaP6NblxlapdCR3Cs2lzzAD80S07OBzo t+jEsiAnU2YkPHx9ABubhMELhBdB1PQrwcadNN4XS35e0X3X6mumvrhgMipYITMY mWWe0t6HiYEgo49cQz6kOgfUnJxfGwBwH9GhCWOgHma1LTghwYtzle0TYtlDgEDX 8BGTd4HztsdooT6CYQORDCTJSUrjnHEAlqMN+OAdsOx8cOhNclXVLZLzc2nER06J L+wK8tt7oS5qB4oK4oIjPXE3yc8Z7FFbWrorc0DQk7/TweEZkgJWbEW9rdn2T0lp HkBPMuO+avpJn+Igi5q4AT6NnwHNnnSY6p7jPqH4wvsDJcYZPGmMxvgcShadi91y /pgWyHyZRDLC26hMzcQ7D7yteyMzxdxXvgeMe1i4ByqKKHJD/BfIiwJMIWWMcDFy HzRDnhPmZH3RTz7IqPRl+aVIOR2hubLi7EB8UZzHy46smFOO3HC0IVojIp0sLeUg nVfJhiiJRnGosMA+Kk/2hSh8pBM3oW2qfU3kVrpTXTQrxhblXcwiZHVu0FFmV7Le BBs14PYyjCChktjqT+Qb73BqjcodJJxWLH7lpQRq6aoYPi9Qay4=
=58VN
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Timo =?utf-8?Q?R=C3=B6hling?=@21:1/5 to All on Mon Apr 21 11:00:01 2025

Hi,

* M. Zhou <[email protected]> [2025-04-19 13:56]: >===============================================================================

Brief Background, Definition, Scope, and Purpose of the Proposal >===============================================================================

AI software grows more and more popular, becoming a notable part of the >software ecosystem. This trend reveals some new questions and challenges, >especially in the interpretation of the Debian Free Software Guidelines (DFSG) >on pre-trained AI models, urging the Debian Project to revisit its >interpretation of the Debian Free Software Guidelines (DFSG) in the context of >AI software and models.

A pre-trained "AI model" is usually stored on disk in binary formats designed >for numerical arrays, as a "model checkpoint" or "state dictionary", which is >essentially a collection of matrices and vectors, holding the learned >information from the training data or simulator. When the user make use of such
file, it is usually loaded by an inference program, which performs numerical >computations to produce outputs based on the learned information in the model. >Please refer to the appendix for more background information about AI.

This proposal focuses on one interpretation of the DFSG on a particular type of
pre-trained AI models, that (1) is released under DFSG-compliant free software >licenses like MIT/Expat, Apache-2.0, etc, and satisfies any of (2) and (3) >below -- (2) is trained on data or simulator that is private, proprietary, or >inaccessible to the public; (3) does not provide the original training program.
To avoid creating new terminologies, we will refer to this type of file as "AI >models released under open source license without original training data or >program" without any abbreviation. Such models are referred to as "Open >Weights" in some circumstances (See: https://opensource.org/ai/open-weights).

The purpose of this proposal is to reach a community consensus on how we should
treat and handle the described type of AI models, which is an inevitable issue >in the future. If necessary, I can work with the Debian Policy Team to >incorporate the GR result into appropriate sections of the Debian Policy (e.g.,
in Section 10 "Files").

| Note: While nowadays people use "AI" to refer to LLMs, it is a very broad term
| that covers much more than language models. AI models apart from language >| models must be considered as well, such as computer vision models, audio >| recognition models, etc.

| Note: If condition (1) is not satisfied, it is usually seen "non-free" in the
| context of Debian community and no voting is needed. In addition, if
| everything (including but not limited to the model itself, training data >| training program, and inference program) is released under DFSG-compliant >| licenses, that again needs no voting.

| Note: Traditional software parts, like a Python script or a C++ program, are >| out of the scope of this proposal since that is a well-defined case. For >| example, a deep learning framework or inference software written in
| Python or C++, i.e., the program that runs the AI models, is out of the
| scope of this proposal.

===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.
===============================================================================

The "AI models released under open source license without original training >data or program", a particular type of files as explained above, are not seen >as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section >of Debian archive can include those files.

-------------------------------------------------------------------------------
Appendix >-------------------------------------------------------------------------------

Inevitably there may be some terminology and/or backgrounds that is not >well-known or well-understood by the general public. Please refer to the >appendices for more information. If you cannot find relevant information to >answer your question, please consult a human professional -- or an LLM.

See appendix A for detailed rationale of this proposal.
See appendix B for background and comments about current AI software.
See appendix C for some related previous efforts and discussions.
See appendix D for comments on potential implications of this proposal.

[Appendix A] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixA.txt
[Appendix B] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixB.txt
[Appendix C] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixC.txt
[Appendix D] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixD.txt

Disclaimer
----------

We acknowledge that releasing useful AI models under permissive licenses like >MIT/Expat and Apache-2.0 is a generous act from the original authors due to >huge costs, and it is a great contribution to the software ecosystem and the >society. We sincerely respect the respective authors' work. On the other hand,
DFSG sets a pretty high standard on software that can be included in the Debian
distribution, which means the GR may lead to some results that not everybody >agrees with. Nevertheless, we appreciate your understanding of the mission of >the Debian project -- to create a free operating system, where the "free" means
"software freedom".

I sponsor this GR proposal.

Cheers
Timo

--
⢀⣴⠾⠻⢶⣦⠀ ╭────────────────────────────────────────────────────╮
⣾⠁⢠⠒⠀⣿⡁ │ Timo Röhling │
⢿⡄⠘⠷⠚⠋⠀ │ 9B03 EBB9 8300 DF97 C2B1 23BF CC8C 6BDD 1403 F4CA │
⠈⠳⣄⠀⠀⠀⠀ ╰────────────────────────────────────────────────────╯

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEmwPruYMA35fCsSO/zIxr3RQD9MoFAmgGBuUACgkQzIxr3RQD 9MojRQ/9H+jUmr1XNY1MA1RyMEf1FJzCa18OB5ImLUVAAl/xOfcQnUbkk1vR0uTy RCmzywFFPpHXR5n08VaXT0IseLtp7yEKPT4EuzMZp7gYY+dVz8kBlF+0b8UKTPgP gkGggZ3fLzsm3mUoh582s5iyMvFaiCgfPVUBoAtArh7Rk2CtgmwSm89NWfk4f7bM yzGq8QjzkmdaJxn/dYSHV7gePjFB0QTatz4LWmH7czmPc3RBc4rWwOEVRMZc6uxr 531CJl5JverlIsBUkloFUHH3aIxbAGRAGDBrmcwi80sfgil5hSM2NLxrGlAiLgfn M3lmlSpvWjJbbgt3K91/Wqqncl/nSXzRQ317Ebfuk5k

From Matthias Urlichs@21:1/5 to All on Mon Apr 21 11:30:02 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------dk0kwqXKuruchXvfSHXQjNvA
Content-Type: multipart/mixed; boundary="------------ipOPWXcJLt0qorcr5htRE04j"

--------------ipOPWXcJLt0qorcr5htRE04j
Content-Type: multipart/alternative;
boundary="------------oqGGLdbVnDFKCbjc0sgdYeBf"

--------------oqGGLdbVnDFKCbjc0sgdYeBf
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMTkuMDQuMjUgMTk6NTYsIE0uIFpob3Ugd3JvdGU6DQo+IFByb3Bvc2FsIEE6ICJBSSBt b2RlbHMgcmVsZWFzZWQgdW5kZXIgb3BlbiBzb3VyY2UgbGljZW5zZSB3aXRob3V0IG9yaWdp bmFsDQo+ICAgICAgICAgICAgICB0cmFpbmluZyBkYXRhIG9yIHByb2dyYW0iIGFyZSBub3Qg c2VlbiBhcyBERlNHLWNvbXBsaWFudC4NCg0KSSBzcG9uc29yIHRoaXMgR1IgcHJvcG9zYWwu DQoNCi0tIA0KLS0gcmVnYXJkcw0KLS0gDQotLSBNYXR0aGlhcyBVcmxpY2hzDQoNCg== --------------oqGGLdbVnDFKCbjc0sgdYeBf
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 19.04.25 19:56, M. Zhou wrote:<br>
</div>
<blockquote type="cite" cite="mid:[email protected]">
<pre class="moz-quote-pre" wrap="">Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.</pre>
</blockquote>
<p>I sponsor this GR proposal.<br>
</p>
<pre class="moz-signature" cols="72">--
-- regards
--
-- Matthias Urlichs</pre>
</body>
</html>

--------------oqGGLdbVnDFKCbjc0sgdYeBf--

--------------ipOPWXcJLt0qorcr5htRE04j
Content-Type: text/vcard; charset=UTF-8; name="matthias.vcf" Content-Disposition: attachment; filename="matthias.vcf" Content-Transfer-Encoding: base64

QkVHSU46VkNBUkQNClZFUlNJT046NC4wDQpOOlVybGljaHM7TWF0dGhpYXM7OzsNCk5JQ0tO QU1FOlNtdXJmDQpFTUFJTDtQUkVGPTE6bWF0dGhpYXNAdXJsaWNocy5kZQ0KVEVMO1RZUEU9 d29yaztWQUxVRT1URVhUOis0OSA5MTEgNTk4MTggMA0KVVJMO1RZUEU9aG9tZTpodHRwczov L21hdHRoaWFzLnVybGljaHMuZGUNCkVORDpWQ0FSRA0K

--------------ipOPWXcJLt0qorcr5htRE04j--

--------------dk0kwqXKuruchXvfSHXQjNvA--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgGD7sFAwAAAAAACgkQcs+OXiW0wpMR 9w/+OyNlTrXY3PiOl96HUblYtXh6kWINu6u94oT9BRveWgvFPnPSvjInOVjH9+eZ4O893gH8yZTt cW5svq6eLVe2vBl7C1OfvFeL5YySVQ+54SLtmgJOnjz6FOjHHpG8SBEnJZf6KSRM1ya/4Z8gAyeX wuXNbbCNc3ScsoRb5QrQf2GKLvRxVSFG1hWU5xYGUqaGwj7nEP1uQGGuhG+0pFKfc20+Pbfivr1F oeNZw8Zt7aCsYRNUlv0QbFr5kN

From Gunnar Wolf@21:1/5 to All on Mon Apr 21 18:30:03 2025

Hello,

M. Zhou dijo [Sat, Apr 19, 2025 at 01:56:17PM -0400]: >===============================================================================

Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.
===============================================================================

The "AI models released under open source license without original training >data or program", a particular type of files as explained above, are not seen >as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section >of Debian archive can include those files.

Thank you very much, Mo Zhou. I am happy to sponsor this proposal as
presented.

We cannot magically extend DFSG-freeness to a binary we have no way to
recreate (or modify) just because it serves as pre-training for a kind of software that is likely to be increasingly used and that changes many assumptions we have long had.

This does not mean Debian will be shut aside from participating in the LLM world or other AI-related applications. As it has often happened, users interested in running non-DFSG-free models will be able to download them
from other sources — or even from our own non-free section.

– Gunnar.

-----BEGIN PGP SIGNATURE-----

wr0EABYKAG8FgmgGcG4JEOL2O0NT9FmJRxQAAAAAAB4AIHNhbHRAbm90YXRpb25z LnNlcXVvaWEtcGdwLm9yZ3Y3Djyp6SNA4NHr2x9wgT+PV3UMw+Zld60sgs+AejaT FiEEYLMJPZYQjly5cULv4vY7Q1P0WYkAADZZAQC8vuJCm+06iE5WM63uRmw/GGFr 0U3Qfh8NNI21C+JihwEAocCxjAHVBJGLH2Z8Wi0EWceCxklpnSd6Uwb/xlBnAA4=
=V7Ge
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gunnar Wolf@21:1/5 to All on Tue Apr 22 19:30:01 2025

Holger Levsen dijo [Tue, Apr 22, 2025 at 05:16:51PM +0000]:

On Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou wrote:

We acknowledge that releasing useful AI models under permissive licenses like
MIT/Expat and Apache-2.0 is a generous act from the original authors due to >> huge costs, and it is a great contribution to the software ecosystem and the >> society. We sincerely respect the respective authors' work.

i'm not sure i can subscribe to this. after all, most if not all "AI" models >exist because of stealing other peoples work...

or did i miss consentual "AIs"?

AI models _can_ be prepared with legally distributable content. And that
would be very adequate in this context. If we had an LLM trained _only_,
say, on the Wikipedia and the books of Project Gutenberg... Of course, it
would not be a generally-queriable chatbot, but it could be just-enough for sustaining some level of dialogue. It could be distributed in our main
section. And if you "post-train" that very little model with the domain-specific documents/knowledge you want to work in, you can find clear value off a truly-free LLM.

Of course, I don't know if the training set I'm proposing would be anywhere near enough. And I cannot say we have buildds with enough GPU power to
perform the training. But it's just an example that can surely be "beat
into correctness".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Holger Levsen@21:1/5 to M. Zhou on Tue Apr 22 19:20:01 2025

On Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou wrote:

We acknowledge that releasing useful AI models under permissive licenses like MIT/Expat and Apache-2.0 is a generous act from the original authors due to huge costs, and it is a great contribution to the software ecosystem and the society. We sincerely respect the respective authors' work.

i'm not sure i can subscribe to this. after all, most if not all "AI" models exist because of stealing other peoples work...

or did i miss consentual "AIs"?

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

Nach wieviel Einzelfällen wird ein Einzelfall zum Normalfall?
(Jan Böhmermann)

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEuL9UE3sJ01zwJv6dCRq4VgaaqhwFAmgHzwMACgkQCRq4Vgaa qhw/AQ/9FIOz061RBfxy2bou1rakW88U20HozTjllpawgswacJjU8JyQhBnTlP3I Ip3mnOrELsZ4RtWqvSeTyBJb1QFO95ji7NsyP2ItFPMmaxwE7hRHNs46GmCmsilu u8xcjUEecoubgSQbn9vCyv0azqGPBrrDTu53tRoJct2QWSXkeInU22Bo1APgkZkf KnCGJA9K9Eoju5HRmynCLz/GNaBA7nGfv1ABNFbhGaXUHhF3Qj5zXCP5PP5acUz0 VTqGNLCwMwS/VJFbJ3beB9n0588LjMBFP+EW2k8bHc/c+jgnd9B7PExNbaillmMC ozqXZHsTuiudIjP8s/6ZhbU7+kU9MXD1ravWjsaBE7InPIfvIiRVXdKNsIU72Ejp dACQDaSTYlwSNCvmHIhWQWZznZ8ozJIs7e4w5HnfSOVbIpkmgR3Qa72wd1Venqul Vr69C+nIkgf8coGT6Sbhx7gu+1go5R7DASHO1PvF32ttk4IbiRaNZLc8XSzqMcxb OH6KkActCun8DB73pPxyMOhvvtFT5vJxqEpp28jYpXZYI/2Bv26g7NOYVrMdQJGb D/8pu3/TCtJ3W3pKiv+d+1/8b3NSAm5

From Carsten Leonhardt@21:1/5 to M. Zhou on Tue Apr 22 20:40:01 2025

I second this proposal, cited below in full.

Regards

Carsten

"M. Zhou" <[email protected]> writes:

===============================================================================
Brief Background, Definition, Scope, and Purpose of the Proposal ===============================================================================

AI software grows more and more popular, becoming a notable part of the software ecosystem. This trend reveals some new questions and challenges, especially in the interpretation of the Debian Free Software Guidelines (DFSG)
on pre-trained AI models, urging the Debian Project to revisit its interpretation of the Debian Free Software Guidelines (DFSG) in the context of
AI software and models.

A pre-trained "AI model" is usually stored on disk in binary formats designed for numerical arrays, as a "model checkpoint" or "state dictionary", which is essentially a collection of matrices and vectors, holding the learned information from the training data or simulator. When the user make use of such
file, it is usually loaded by an inference program, which performs numerical computations to produce outputs based on the learned information in the model.
Please refer to the appendix for more background information about AI.

This proposal focuses on one interpretation of the DFSG on a particular type of
pre-trained AI models, that (1) is released under DFSG-compliant free software
licenses like MIT/Expat, Apache-2.0, etc, and satisfies any of (2) and (3) below -- (2) is trained on data or simulator that is private, proprietary, or inaccessible to the public; (3) does not provide the original training program.
To avoid creating new terminologies, we will refer to this type of file as "AI
models released under open source license without original training data or program" without any abbreviation. Such models are referred to as "Open Weights" in some circumstances (See: https://opensource.org/ai/open-weights).

The purpose of this proposal is to reach a community consensus on how we should
treat and handle the described type of AI models, which is an inevitable issue
in the future. If necessary, I can work with the Debian Policy Team to incorporate the GR result into appropriate sections of the Debian Policy (e.g.,
in Section 10 "Files").

| Note: While nowadays people use "AI" to refer to LLMs, it is a very broad term
| that covers much more than language models. AI models apart from language | models must be considered as well, such as computer vision models, audio | recognition models, etc.

| Note: If condition (1) is not satisfied, it is usually seen "non-free" in the
| context of Debian community and no voting is needed. In addition, if
| everything (including but not limited to the model itself, training data | training program, and inference program) is released under DFSG-compliant | licenses, that again needs no voting.

| Note: Traditional software parts, like a Python script or a C++ program, are
| out of the scope of this proposal since that is a well-defined case. For | example, a deep learning framework or inference software written in
| Python or C++, i.e., the program that runs the AI models, is out of the
| scope of this proposal.

===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant. ===============================================================================

The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

-------------------------------------------------------------------------------
Appendix -------------------------------------------------------------------------------

Inevitably there may be some terminology and/or backgrounds that is not well-known or well-understood by the general public. Please refer to the appendices for more information. If you cannot find relevant information to answer your question, please consult a human professional -- or an LLM.

See appendix A for detailed rationale of this proposal.
See appendix B for background and comments about current AI software.
See appendix C for some related previous efforts and discussions.
See appendix D for comments on potential implications of this proposal.

[Appendix A] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixA.txt
[Appendix B] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixB.txt
[Appendix C] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixC.txt
[Appendix D] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixD.txt

Disclaimer
----------

We acknowledge that releasing useful AI models under permissive licenses like MIT/Expat and Apache-2.0 is a generous act from the original authors due to huge costs, and it is a great contribution to the software ecosystem and the society. We sincerely respect the respective authors' work. On the other hand,
DFSG sets a pretty high standard on software that can be included in the Debian
distribution, which means the GR may lead to some results that not everybody agrees with. Nevertheless, we appreciate your understanding of the mission of
the Debian project -- to create a free operating system, where the "free" means
"software freedom".

-----BEGIN PGP SIGNATURE-----

iQJDBAEBCgAtFiEETxO/995XSygF6EUmKrIctfbulV4FAmgH3tEPHGxlb0BkZWJp YW4ub3JnAAoJECqyHLX27pVerNQQAL+1Ihx0XfyqtKJsK8SxSutijq5MKZI5WYFg +cZCYydA1OChaaWOTHNQHrIB+eI+GLH7i1WRs05QeM5i8aJflaHUj93xkJfixEq4 MgEtgAWQAgvY4dZsrR4oXkTL6yyBP7jwr2MQXyDYatXBM5C1RT7aZtpRZbqVs/Kq uISaJt2qxUYmq2KJ+qZS7q5r8ALcX0kbCAHYnPBitKSn1WPhhMXsl8AzySj681zr /ZIekt363ThdI8zmvw/EXbe5enyFDTUg7vQEWplKtdZBR252oDmMOa9iuTPcQ+Vo YLYOzzIkhJDvDBx2Dn0kbZAzrEhWUNKBFfydJmGBga/Mph+InnBzknoBs4oyvd+M lJx5KiTH9AxuJzKvJtU1jp6cU3RBbZ4PQJyw4OhSfXLnrlRcgtBZ2TyfcqNk5pZf cFO2+Dj97HdjMyfqh6QHDJtq+RaN/+MlycbTlPJeRYbpKtwMUO6lrNSG4MXTpvYi 2XMDc99LgfZHcVtdnj8XtInw7GeSKcYYCx77hMDFI/hKzES9UYF0S7X9DhUrRNar cN3ajag6LngYHCDaWbfUCQEOTO1gqmommkrDfRjqM+gf5aNAMx40bnRZPGfVjyOw PGlVi4LVWrzsTbPRgsKHZkSI8NWkAuM6vKPnaaPX+M2aO+7iJfKkyAiRIEohYDzJ
nrH4I20q
=WcMq
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ian Jackson@21:1/5 to M. Zhou on Tue Apr 22 21:20:01 2025

M. Zhou writes ("Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models"):

Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant.

The "AI models released under open source license without original
training data or program", a particular type of files as explained
above, are not seen as DFSG-compliant. Hence, they can not be
included in the "main" section of the Debian archive. This proposal
does not specify whether the "non-free" section of Debian archive
can include those files.

I support the intent of this proposal and the wording seems to do the
job.

Thank you for working on this. I have some comments about wording and structure.

Firstly, the clause "without original training data or program" is a
little ambiguous in a couple of respects:

a. I interpret this to mean that we neeed *both* the training data,
*and* the training software, for the model to be considered Free.

b. I interpret this to mean that the training data and the training
software must *also be in main*, not merely "present" (possibly
somewhere else outside Debian, and possibly not itself DFSG-free).

I think this wording could be read in other ways. But I think it
should be clear to full Debian Members, and to the ftpmasters - and
those are the people who will be implementing and enforcing this
policy.

----------
Appendix
----------

Does this form part of the GR proposal? In my experience GR proposals
often come with some preparatory text from the proponent, which we
aren't formally voting on.

See appendix A for detailed rationale of this proposal.
See appendix B for background and comments about current AI software.
See appendix C for some related previous efforts and discussions.
See appendix D for comments on potential implications of this proposal.

[Appendix A] https://salsa.debian.org/lumin/gr-ai-dfsg/-/blob/main/AppendixA.txt

...

When giving links in formal documents like this, you may wish to
consider whether to give URLs that specify the particular git
commitid. That nails down precisely what you're talking about - and
can even be used on a future occasion if the repository has moved
elsewhere.

If this is supposed to be part of the GR that we're voting on, and not
just explanatory background, then stating the commitids is essential.

Disclaimer
----------

We acknowledge that releasing useful AI models under permissive
licenses like MIT/Expat and Apache-2.0 is a generous act from the
original authors due to huge costs, and it is a great contribution
to the software ecosystem and the society.

I am uncomfortable with this part of the Disclaimer.

Ian.

--
Ian Jackson <[email protected]> These opinions are my own.

Pronouns: they/he. If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon McVittie@21:1/5 to All on Tue Apr 22 23:00:01 2025

On Sat, 19 Apr 2025 at 13:56:17 -0400, M. Zhou made this GR proposal:

The "AI models released under open source license without original training >data or program", a particular type of files as explained above, are not seen >as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive.

Do we have an idea of whether/how many models that match this definition already exist in main? In the Policy process it's usual to require an
estimate of how many packages a particular Policy change will make "insta-RC-buggy", and I think GRs that change our self-imposed rules for
what we consider to be Free should do similarly.

The project is currently in freeze in preparation for our next stable release[1], the result of around 2 years of work since the last stable
release. If there are models currently in main that match this
definition, is this GR intended to take effect immediately, potentially
forcing the packages containing those models and their
reverse-dependencies to be removed from main? Or is its effect intended
to start from the beginning of the forky cycle?

Or, would it be considered to be valid for some relevant team (release
team? ftp team?) to tag RC bugs like "foobar contains an AI model without original training data" with the trixie-ignore tag, so that we can get the trixie release out, leaving the bug to be addressed for forky?

Our pre-release freeze process is already quite long (which it has to
be, to get a distribution of this size to a releasable quality) and a
long freeze hurts the project's momentum and level of motivation, so I
think it would be best if we can avoid lengthening that process further
by pausing the release process to vote on a change to our self-imposed
rules, particularly if that requires a subsequent pause to vote on
something similar to [2] or [3] before the release can happen.

smcv

[1] https://lists.debian.org/debian-devel-announce/2025/03/msg00011.html
[2] https://www.debian.org/vote/2004/vote_004
[3] https://www.debian.org/vote/2008/vote_003

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ansgar =?UTF-8?Q?=F0=9F=99=80?=@21:1/5 to Holger Levsen on Tue Apr 22 23:20:01 2025

Hi,

On Tue, 2025-04-22 at 17:16 +0000, Holger Levsen wrote:

On Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou wrote:

We acknowledge that releasing useful AI models under permissive licenses like
MIT/Expat and Apache-2.0 is a generous act from the original authors due to huge costs, and it is a great contribution to the software ecosystem and the
society. We sincerely respect the respective authors' work.

i'm not sure i can subscribe to this. after all, most if not all "AI" models exist because of stealing other peoples work...

So just like Linux only exists by stealing other people's operating
system design? :-)

But the practical effects of passing the GR is probably (among other
things):

a) Removal of OCR software (like tesseract[1])
b) Removal of image recognition software (like opencv[2])
c) Possibly removal of text-to-speech software (like festival[3] or
flite[4])

I'm not sure what it means for other software with weights or similar
data of uncertain origin (say S-boxes in cryptographic algorithms,
possibly pre-set tuning parameters in drivers, who knows) or what
happens if someone manages to use the DFSG document as weights for an
AI model: it would certainly miss training data ;-)

Ansgar

[1]: https://sources.debian.org/src/tesseract-lang/1%3A4.1.0-2/
[2]: https://sources.debian.org/src/opencv/4.10.0%2Bdfsg-5/data/haarcascades/haarcascade_fullbody.xml/
[3]: https://sources.debian.org/src/festival-hi/0.1-11/hindi_NSK_diphone/festvox/hindi_NSK_ene.scm/#L3
[4]: https://sources.debian.org/src/flite/2.2-7/lang/cmu_us_kal/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Simon McVittie on Tue Apr 22 23:20:01 2025

Simon McVittie <[email protected]> writes:

On Sat, 19 Apr 2025 at 13:56:17 -0400, M. Zhou made this GR proposal:

The "AI models released under open source license without original
training data or program", a particular type of files as explained
above, are not seen as DFSG-compliant. Hence, they can not be included
in the "main" section of the Debian archive.

Do we have an idea of whether/how many models that match this definition already exist in main? In the Policy process it's usual to require an estimate of how many packages a particular Policy change will make "insta-RC-buggy", and I think GRs that change our self-imposed rules for
what we consider to be Free should do similarly.

gnubg, at least, comes with neural network weights that do not have source
code under this definition. I have to admit that while I was maintaining
the package I didn't give this a ton of a thought because it predates the
whole LLM craze. I'm not even sure if the data on which it's trained (backgammon games, I think mostly against bots) is copyrightable.

I suspect no source technically exists for those weights anywhere, since upstream's training work was fairly manual as I recall and is not
something they tend to work iteratively on, but it's possible that one of
the upstream developers has all of the data and scripts somewhere.

I'm not sure how many other old-school machine learning applications like
that we may have lurking around. I also have no strong opinion about what
we should do with such packages, to be clear; this should not be taken as
an objection to the proposed GR.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andrea Pappacoda@21:1/5 to M. Zhou on Wed Apr 23 13:00:01 2025

--4f233a12cf2ff5f7b2639ba4faa292da46ec9c37f28f5eb7f778644c7002 Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8; format=Flowed

Hi,

On Sat Apr 19, 2025 at 7:56 PM CEST, M. Zhou wrote:

===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant. ===============================================================================

Thank you, Mo, for your work on this. I support and sponsor this
proposal.

Bye!

--4f233a12cf2ff5f7b2639ba4faa292da46ec9c37f28f5eb7f778644c7002
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iIcEABYIAC8WIQS6VuNIvZRFHt7JcAdKkgiiRVB3pwUCaAjFyxEcdGFjaGlAZGVi aWFuLm9yZwAKCRBKkgiiRVB3p1KUAQCBHq9XhkAALqdvaHV5WG65rOHNVzYg6K/e DujNRXoCtgD9HnBXXWNIJ0hpPSDoi5b/pfIo1lPS8196JJZDxfGeIgg=NQXl
-----END PGP SIGNATURE-----

--4f233a12cf2ff5f7b2639ba4faa292da46ec9c37f28f5eb7f778644c7002--

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Matthias Urlichs@21:1/5 to All on Wed Apr 23 12:40:01 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------Ypdh5ziFjxdpKyCHt3NDSu0i
Content-Type: multipart/mixed; boundary="------------xe7Oo800R2WDsxqdTPnqZuiu"

--------------xe7Oo800R2WDsxqdTPnqZuiu
Content-Type: multipart/alternative;
boundary="------------SM2uvQUYUsobyrNPz7A0dqlV"

--------------SM2uvQUYUsobyrNPz7A0dqlV
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMjIuMDQuMjUgMjI6NTksIEFuc2dhciDwn5mAIHdyb3RlOg0KPiBCdXQgdGhlIHByYWN0 aWNhbCBlZmZlY3RzIG9mIHBhc3NpbmcgdGhlIEdSIGlzIHByb2JhYmx5IChhbW9uZyBvdGhl cg0KPiB0aGluZ3MpOg0KPg0KPiBhKSBSZW1vdmFsIG9mIE9DUiBzb2Z0d2FyZSAobGlrZSB0 ZXNzZXJhY3RbMV0pDQo+IGIpIFJlbW92YWwgb2YgaW1hZ2UgcmVjb2duaXRpb24gc29mdHdh cmUgKGxpa2Ugb3BlbmN2WzJdKQ0KPiBjKSBQb3NzaWJseSByZW1vdmFsIG9mIHRleHQtdG8t c3BlZWNoIHNvZnR3YXJlIChsaWtlIGZlc3RpdmFsWzNdIG9yDQo+IGZsaXRlWzRdKQ0KDQpZ b3UgbWlnaHQgd2FudCB0byB3cml0ZSBhIGNvdW50ZXItcHJvcG9zYWwgQiwgdGhlbi4NCg0K T3IgZXZlbiBhIHByb3Bvc2FsIEMgdGhhdCdzIG1vcmUgbnVhbmNlZC4NCg0KSSBtZWFuLCB3 aXRoIHRoZSByaWdodCBwcm9tcHQgeW91IGNhbiBnZXQgbWFueSBBSSBtb2RlbHMgdG8gcmVn dXJnaXRhdGUgDQpzb21lIG9mIHRoZSB0ZXh0cyBvciBpbWFnZXMgdGhleSd2ZSBiZWVuIHRy YWluZWQgd2l0aC4gVFRCT01LIGl0J3MgDQptb3N0bHktaW1wb3NzaWJsZSB0byBkbyB0aGF0 IHdpdGggVGVzc2VyYWN0IG9yIE9wZW5DVi4NCg0KTkIsIGRvIHdlIHJlYWxseSBuZWVkIHRv ICpyZW1vdmUqIHRoZXNlIHBhY2thZ2VzPyBvciBtYXliZSBqdXN0IG1vdmUgDQp0aGVtIHRv IGNvbnRyaWIsIGFuZCB0aGVpciBtb2RlbCBmaWxlcyB0byBub24tZnJlZT8NCg0KLS0gDQot LSByZWdhcmRzDQotLSANCi0tIE1hdHRoaWFzIFVybGljaHMNCg0K --------------SM2uvQUYUsobyrNPz7A0dqlV
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 22.04.25 22:59, Ansgar 🙀 wrote:<br>
</div>
<blockquote type="cite" cite="mid:[email protected]">
<pre class="moz-quote-pre" wrap="">But the practical effects of passing the GR is probably (among other
things):

a) Removal of OCR software (like tesseract[1])
b) Removal of image recognition software (like opencv[2])
c) Possibly removal of text-to-speech software (like festival[3] or flite[4])</pre>
</blockquote>
<p>You might want to write a counter-proposal B, then.</p>
<p>Or even a proposal C that's more nuanced.</p>
<p>I mean, with the right prompt you can get many AI models to
regurgitate some of the texts or images they've been trained with.
TTBOMK it's mostly-impossible to do that with Tesseract or OpenCV.</p>
<p>NB, do we really need to *remove* these packages? or maybe just
move them to contrib, and their model files to non-free?<br>
</p>
<pre class="moz-signature" cols="72">--
-- regards
--
-- Matthias Urlichs</pre>
</body>
</html>

--------------SM2uvQUYUsobyrNPz7A0dqlV--

--------------xe7Oo800R2WDsxqdTPnqZuiu
Content-Type: text/vcard; charset=UTF-8; name="matthias.vcf" Content-Disposition: attachment; filename="matthias.vcf" Content-Transfer-Encoding: base64

QkVHSU46VkNBUkQNClZFUlNJT046NC4wDQpOOlVybGljaHM7TWF0dGhpYXM7OzsNCk5JQ0tO QU1FOlNtdXJmDQpFTUFJTDtQUkVGPTE6bWF0dGhpYXNAdXJsaWNocy5kZQ0KVEVMO1RZUEU9 d29yaztWQUxVRT1URVhUOis0OSA5MTEgNTk4MTggMA0KVVJMO1RZUEU9aG9tZTpodHRwczov L21hdHRoaWFzLnVybGljaHMuZGUNCkVORDpWQ0FSRA0K

--------------xe7Oo800R2WDsxqdTPnqZuiu--

--------------Ypdh5ziFjxdpKyCHt3NDSu0i--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgIv/wFAwAAAAAACgkQcs+OXiW0wpPS OBAA0E1d1YzXQC0vPcOmvQYFCXc/SHd6vMKzdQOxpdsjHl3xXsAvKdIAwHL72H2CkGMKQZMSL90p 03ii2ex2KCwksZFIbf8aVUgO06AP1jIT3pj6Qcku7WpqQc/1xwWPWbMPy+JOvPrRQk2/gJkPKlkX cQQzmqJQcQ8ewW/q0hDMaWTXG2KVq4nxAN+nSaA7F35bZbtvobhjw/pRMC0LZiQcsT/kJdvTKlxc 5lKWl1EV1BLQ1PfpH4fImGXz8T

From Mo Zhou@21:1/5 to M. Zhou on Thu Apr 24 16:30:01 2025

Hi all,

Too busy to write response to detailed questions. But here is a quick
comment about the original proposal:

AI can be an extremely complicated matter, trying to resolve a wildcard
AI issue
in a single GR will gradually expose much more problems, making things more
and more complicated. I know that because the original proposal is exactly
the result of going through wildcard cases, simplifying things, and
eventually
distilled into a very short text -- namely, it is the most problematic
case for AI. The case in the original proposal is technically well-defined.
A side effect is that people may feel the original proposal unclear because
it does not cover all cases about AI.

The particular case in the original proposal, I believe, will follow the Pareto's law -- that we can address >80% issues by investing a relatively
tiny effort. This is not made clear in the original proposal but I think
the "Note:" parts in the original proposal have already implied this.

=== Schrodinger's Firmware ===

Apart from that, "whether AI can enter non-free{,-firmware}" is a different issue which derails from my simplified result (that said, different opinions are welcome). Assume Debian disallows AI in non-free{,-firmware}, then
here is an imagined example:

1. In January, a CPU/GPU company/organization released firmware, it is
   proprietary, but redistributable. Then, Debian integrates the firmware
   into non-free-firmware.
2. In Feburary, the company/organization happily announced that they used
   machine learning / AI in their firmware. Then, Debian has to remove the
   firmware due to policy violation.
3. In March, Debian start to worry about the remaining black-box blobs in
   the non-free{,-firmware} sections, because the blobs have to be removed
   once there is any news that acknowledges AI/ML usage inside -- we are
   dealing with Schrodinger's cats.

I insist on my own opinion to stay neutral towards non-free{,-firmware}, and delay the non-free{,.*} issue for future investigation. It derails from
the most
problematic case that I care about in the original proposal, and will add complication to the GR, and discussion length.

On 4/19/25 13:56, M. Zhou wrote:

===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant. ===============================================================================

The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

This is the part to vote on. The appendixes are just supplementary
information
on how I explain this proposal. People may disagree with my rational
behind this
proposal, but as long as we converge into the same conclusion, it is
always good
to stay efficient and avoid voting on my personal opinions.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Helmut Grohne@21:1/5 to M. Zhou on Sat Apr 26 21:50:01 2025

Hi Mo,

Please Cc me in replies as I am not subscribed.

I am aware that you have been working on this for quite some while and
have extensively collected feedback already. Thanks for taking the next
step and attempting to form project consensus.

On Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou wrote:

===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant. ===============================================================================

The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

Others have taken up some aspects already. Ian observed a bit of
vagueness and Simon also asked about how this would be applied to what
is in Debian. Ansgar and Russ identified possibly affected packages. I'd appreciate answers to these before going to vote.

Maybe we can also approach this from a different angle. The main
approach here appears to be drawing a line using principles and turning
that into policy. How about also approaching it from practical effects?
When it comes to individual packages, many of us have an easier time
forming an opinion as to whether it should be included in Debian and
whether it should be included in Debian main. For some packages, we
disagree here, but for many we likely agree. The risk here is that we
may get lost in details.

We presently include trained networks without training data or program
for OCR, TTS, board games, and image recognition in main. For some of
those, it may be questionable whether those really should be in main,
but I guess that we mostly have consensus on including them in Debian
(with Thorsten being an exception here) being a good thing. I hope that
we find a way that enables us to upload more existing models to some
section of Debian.

My impression is that Mo's proposal attempts to clarify DFSG into a
relatively literal interpretation that thinks of training as a
compilation step, but such consideration would result in us
re-evaluating existing components and likely require us to move some
pieces from main to non-free. Practically speaking accessibility may no
longer work unless enabling non-free.

When we split off non-free-firmware from non-free, one of the big
reasons for doing it was that firmware would not typically run on the
primary CPU. To me, machine learning models are a bit similar. Often
enough, the model architecture is DFSG-free software and it is merely
the model weights that lack "sources" in a strict DFSG interpretation.
The model weights influence the computation, but the choice of weights typically does not allow execution of arbitrary instructions. Like
firmware, model weights are somewhat sandboxed. This kinda also applies
to non-free documentation packages or other kind of data packages. To
me, this is a significant difference. Even if the model may be
influenced in a hostile way (something we likely cannot check when
training data or program is unavailable), it typically cannot run
arbitrary code on our computers. I would appreciate if there was a way
to tell general non-free from this more limited form apart (e.g. using
a separate archive section). Do others agree that such a classification
of non-free would be useful?

Helmut

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Osamu Aoki@21:1/5 to All on Sun Apr 27 11:50:01 2025

Hi,

* Ansgar: you have very good points.
* Zo: can you address these points?

Although I like Zo's proposal in general, I also had similar concern for the "practical impact", too. It was answered by Zo to make me feel OK.

My concern for Japanese keyboard input method was addressed in "ToxicCandy Allowlist" by assessing it as non-AI model in ML-policy.

https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML- Policy.rst

The current policy proposal is vague at what is not "AI models" and it lacks direct reference to "ToxicCandy Allowlist". (Why missing? or did I overlook something?)

If Ansgar's cases for models are deemed not as "AI models" with good reason, we
can qualify them as "ToxicCandy Allowlist". Then, it is easier for us to vote yes.

But the practical effects of passing the GR is probably (among other things):

a) Removal of OCR software (like tesseract[1])
b) Removal of image recognition software (like opencv[2])
c) Possibly removal of text-to-speech software (like festival[3] or flite[4])

Regards,

Osamu

Ansgar

[1]: https://sources.debian.org/src/tesseract-lang/1%3A4.1.0-2/
[2]: https://sources.debian.org/src/opencv/4.10.0%2Bdfsg- 5/data/haarcascades/haarcascade_fullbody.xml/
[3]: https://sources.debian.org/src/festival-hi/0.1- 11/hindi_NSK_diphone/festvox/hindi_NSK_ene.scm/#L3
[4]: https://sources.debian.org/src/flite/2.2-7/lang/cmu_us_kal/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ansgar =?UTF-8?Q?=F0=9F=99=80?=@21:1/5 to Osamu Aoki on Sun Apr 27 12:30:02 2025

Hi,

On Sun, 2025-04-27 at 18:47 +0900, Osamu Aoki wrote:

My concern for Japanese keyboard input method was addressed in
"ToxicCandy Allowlist" by assessing it as non-AI model in ML-policy.

Could we stop using terms like "toxic" or "cancerous" or whatever in
technical discussions? (Unless we talk about toxic products or cancer
treatment or similar.)

The current policy proposal is vague at what is not "AI models" and
it lacks direct reference to "ToxicCandy Allowlist". (Why missing?
or did I overlook something?)

The GR proposal does not talk about this, but the notes in the proposal explicitly state:

| Note: While nowadays people use "AI" to refer to LLMs, it is a very broad term
| that covers much more than language models. AI models apart from language
| models must be considered as well, such as computer vision models, audio
| recognition models, etc.

So the intent seems to be a broad interpretation of what AI means, so
probably including models for input methods built from other source
data.

Ansgar

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Holger Levsen@21:1/5 to All on Sun Apr 27 13:40:01 2025

On Sun, Apr 27, 2025 at 12:24:28PM +0200, Ansgar 🙀 wrote:

"ToxicCandy Allowlist" by assessing it as non-AI model in ML-policy.

Could we stop using terms like "toxic" or "cancerous" or whatever in technical discussions? (Unless we talk about toxic products or cancer treatment or similar.)

very much +1

I'm also very uncomfortable speaking about AIs similar like I don't like
the term IP=intellectual property...

AIs have no intelligence.

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

I used to be scared for our grandchildren's future. Such optimism!

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEuL9UE3sJ01zwJv6dCRq4VgaaqhwFAmgOFwcACgkQCRq4Vgaa qhweNQ//TulkGO7O3Dl/GPIcJgM3aGMTbW+JSwmM/ASNUmvOEgNwWmw6RJrmdl/n YV/0hpJNQXQ9eAEOGkeHbiIXB4It1F3sRSI1KsJ1C4QMroqoC39E3TDlbosLpg3I MHdwWwS0oqa9rZkNYVCpXfjcARsW6kmH8DZdCXsQMsuma5/5dbs4ezGXltcLvyQU zUCaW9HILaUDlHHhd9Q9kYi8YIJOE3K/iIgspnxMjs6bWiDh70yTfVighyUOXu+0 vjTISR9zeBpJHTqZJUtgccLj9D3xU2/ECP1ye3uMQfPKLxvVpMFvd79OOqs74z8E myFiBL4jLlrKOc/XdVT6UJwfSzW2aRMzOktunfg7gqMnmNSHLZ+vc/sfnfpXHYQo drU8CK2/nbhXfONkNiBhKltmwO8zix+LpLVD0NgzVsSL3+3pQ2d0KjZnqUIdHGGX Y7yn8+Dlq9vYq96o3wGVKfLXtdyqKt4bVsiD83gnin4zLq4yayG+ZqBjn+0LWeqg HPu0x7K0BmFJkIlmmNY9t65uJF9p/Ck681IjpIrxIVoHJqp49BovgojT3hFVno9d FLwtEYdfoW4nMFRLtc3e7Gf0zP3TIqvYEQPX2f2c/zRdFt

From Russ Allbery@21:1/5 to Stefano Zacchiroli on Mon Apr 28 00:10:01 2025

Stefano Zacchiroli <[email protected]> writes:

FWIW, I looked specifically in the gnubg case a while ago, because it
was an interesting test case for this discussion.

Oh, thank you! I very much appreciate you doing the work to uncover actual facts as opposed to my mostly uninformed speculations.

Here's what I found out:

- The training program (using the language from the GR draft) is
allegedly available and licensed under GPL3.

- The training data is allegedly available as well, but comes without
any declared license. I tend to concur with you, Russ, that it's very
likely non-copyrightable material. But that's only partly reassuring
to me, because I'm not sure how Debian would practically go about
ruling that certain stuff that comes without copyright/license is fine
for main, whereas other stuff in the same situation is not.

Yes, this is the tricky part for any sort of general "AI" policy (I agree
with Holger that this term is annoying propaganda, but we're probably
stuck with it). Right now, people are mostly thinking about LLMs, which
are trained on large amounts of writing, which is almost always
copyrighted because it's one of the core types of artistic creativity recognized by copyright laws. (Likewise for image generators, which are
trained on art.)

There are a bunch of other things that fall into the AI bucket, however,
and many of them predate the invention of LLMs. Some of them will have
similar challenges with training data (translation software is probably
also trained on writing, for instance, and voice recognition software is probably trained on voice samples that are often copyrighted). Some of
them, however, will be trained on things that are widely recognized to be non-copyrightable facts, such as records of backgammon, chess, or go
games.

However, even that is tricky, because the *annotations* on chess games can
be copyrighted. What is the line beyond which the game annotations are copyrighted material? I personally have no idea; I don't know if tagging
moves with !, !!, ?, and ?? but no other commentary would constitute copyrightable material. I also don't know if chess engines use such
annotations in their training.

The simplest and most ideologically consistent position that we could
take, at least from my perspective, would be to decide that any data file
in the form of distilled neural network weights or similar encoded
training data is the "binary" output of a "compilation" process and the training data is the source code for that binary, which means that under
the DFSG the source code not only has to be free software but has to be included in the archive. This is pleasingly ideologically coherent and
mostly avoids weird and uncomfortable ethical compromises.

However, I'm not sure it's very *practical* unless our position is that
we're simply not going to package software that uses machine learning
models (a decision that we could certainly make, but which seems a bit
contrary to our normal desire to be a universal operating system).
Problems just off the top of my head include:

1. This data is often huge and also of very little interest to anyone
other than people attempting to confirm the free software status of the
resulting model. Unlike the more typical forms of source code, I
suspect it's rare to want to tweak the training data to fix some bug or
add some feature and then "recompile." I certainly had never considered
doing such a thing when maintaining gnubg, but I patched the more
conventional source code quite frequently.

2. Using the data to reproduce the model often takes significant amounts
of computing resources, quite possibly more than we would like to spend
on such a task. But if we don't do that work, we don't really know if
we have the real sources.

3. It's quite likely, as I understand it, that the training process is not
going to be deterministic, so we may not easily be able to process the
training data and get back the original weights. My understanding is
that training tends to involve some randomization for technical
reasons. Also, even if it's *possible* to design a reproducible
training process, I suspect many upstreams will not have bothered.

4. As you discovered, finding the training data, even when upstream has
retained it (which I suspect will not always be the case, since I
expect in at least some cases upstream would just start over if they
wanted to retrain the model and therefore would view at least some of
the training data as equivalent to ephemeral object files they would
discard), is not going to be easy since almost no one cares. This is of
course not a new problem in free software, and we have long experience
with telling upstreams that no, we really do care about all of the
source code, but it is incrementally more work of a type that most
Debian packagers truly dislike doing.

I'm a bit worried that people have the specific case of LLMs in mind,
which are almost always going to pose copyright problems and derivative
work problems. I'm sure I'm not the only one here who is a general LLM
skeptic who has been underwhelmed by the quality of the output LLM
advocates claim to find useful, and therefore would find it quite easy to
say no to LLMs in Debian without feeling like the project was missing
anything of significance.

But machine learning is a lot older than LLMs and has a lot of useful applications other than mediocre text generation, and training data for at least some of those models doesn't look anything like LLM training data
and may have entirely different licensing properties. It feels likely to
me that there are some babies in that bathwater.

Maybe we've been ethical hypocrites all along about machine learning applications packaged in Debian, and the current LLM craze is a good opportunity to clean house and reaffirm a strict free software policy
including training data. I'm rather sympathetic to that argument, frankly,
just because the simplicity of the "source code for everything, no
exceptions" position is comfortable in my brain. But we should be fairly
sure about what we're agreeing to before making that decision.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefano Zacchiroli@21:1/5 to Russ Allbery on Sun Apr 27 23:20:01 2025

On Tue, Apr 22, 2025 at 02:14:34PM -0700, Russ Allbery wrote:

I suspect no source technically exists for those weights anywhere, since upstream's training work was fairly manual as I recall and is not
something they tend to work iteratively on, but it's possible that one of
the upstream developers has all of the data and scripts somewhere.

FWIW, I looked specifically in the gnubg case a while ago, because it
was an interesting test case for this discussion. Here's what I found
out:

- The training program (using the language from the GR draft) is
allegedly available and licensed under GPL3.

- The training data is allegedly available as well, but comes without
any declared license. I tend to concur with you, Russ, that it's very
likely non-copyrightable material. But that's only partly reassuring
to me, because I'm not sure how Debian would practically go about
ruling that certain stuff that comes without copyright/license is fine
for main, whereas other stuff in the same situation is not.

Links to the above two element are public, but not very well documented.
I had to dig them up discussing on the gnubg mailing list and some links
were provided to me only in private mail (I don't know why).

I wrote "allegedly available" above because while I have been able to
retrieve all of it, I was not able to reproduce/rerun it. (Just to
mention the *first* hurdle one would encounter to do so in modern
Debian: it's all Python 2.)

My gut feeling is that gnubg might be salvagable to remain in main if
this GR is adopted (provided significant work is poured into it to test reproducibility at least once), but it'd be wise to have a broader
impact analysis, for the good reasons mentioned by Simon.

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmgOnSkACgkQfH5Cj5NB J5k7Vg//cDQOK3sJ/ONwfeVB1gpPTAUpjoUUJ6IfZipg+1YyN52/JMFkV8bNovUu Q/MqoyHxCNNCPTycDw0ZgbLXtMhpnvXgAEiCexxUrmJFm3NZva8YtGdlOMqYTQ1o 80dbkSnKh8KKnMrrM7IduEUwgBSgQOUb60hbByM81nOY9pk0hFh4Ih66ke4uBxDl nBOh5/UaaG+AkYj7aTmLpqs98euaHThhHnZfCQpXVK5QKJBEmz7VSDyLnCX1+DZe YztvH8ge5pb6TLP/XIJk0uSph6AJqQKPF7tYUIsvO4FbluR2mhpmwjEX1KxWMjDE TzL1zn/YMQaE9VbW2Q1WV38+YC3uLKILNGiqbQao4w1fSEN8Ma0Rmc6ieTX0b3c0 NI9qClQSmXMpVf4WqqiV7yUEq1khZllwN5Wbt/zZO0TWd4RlT3oETc0u/CA5IJ2F cw237sTqLg6cLqW/PygpTP6IzRsE//punVxrKLEIrmvIFvZYXYn/E9G+DsoeAlAs Kltfv0fihxvKW/92ua5onF

From Matthias Urlichs@21:1/5 to All on Mon Apr 28 03:00:01 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------cNbXu2Yrs4iefd0qa27kpqWo
Content-Type: multipart/mixed; boundary="------------3bG1qfqKabRQnDNnBaPhBn00"

--------------3bG1qfqKabRQnDNnBaPhBn00
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMjguMDQuMjUgMDA6MDIsIFJ1c3MgQWxsYmVyeSB3cm90ZToNCj4gU29tZSBvZg0KPiB0 aGVtLCBob3dldmVyLCB3aWxsIGJlIHRyYWluZWQgb24gdGhpbmdzIHRoYXQgYXJlIHdpZGVs eSByZWNvZ25pemVkIHRvIGJlDQo+IG5vbi1jb3B5cmlnaHRhYmxlIGZhY3RzLCBzdWNoIGFz IHJlY29yZHMgb2YgYmFja2dhbW1vbiwgY2hlc3MsIG9yIGdvDQo+IGdhbWVzLg0KSG93ZXZl ciwgdGhlIG1hbmlmZXN0ZWQgY29sbGVjdGlvbiBvZiBzdWNoIGdhbWVzIG1pZ2h0IGJlIGNv dmVyZWQgYnkgDQpzb21lIGRhdGFiYXNlIGFnZ3JlZ2F0ZSBjb3B5cmlnaHQtb3Itc29tZXRo aW5nLWxpa2UtaXQgbGF3LCB3aGljaCANCmRvZXNuJ3QgaGVscC4NCj4gSG93ZXZlciwgZXZl biB0aGF0IGlzIHRyaWNreSwgYmVjYXVzZSB0aGUqYW5ub3RhdGlvbnMqIG9uIGNoZXNzIGdh bWVzIGNhbg0KPiBiZSBjb3B5cmlnaHRlZC4NCg0K4oCmIHdoaWNoIGlzIG9ubHkgYSBwcm9i bGVtIGlmIHRoZSBjaGVzcyBlbmdpbmUgY29uc2lkZXJzIHRoZW0gZm9yIA0KdHJhaW5pbmcs IHdoaWNoIEkgYXNzdW1lIHRoZXkgZG9uJ3QsIHNvIHdlIGNvdWxkIGNvbmNlaXZhYmxlIERG U0ctaXplIA0KdGhlIGRhdGEuDQoNClRoZSBmYWN0IHJlbWFpbnMgdGhhdCBvdXIgYnVpbGRl cnMgd2lsbCBiZSB1bmFibGUgdG8gcmVwcm9kdWNlIHRoZSANCnJlc3VsdGluZyBuZXR3b3Jr LCBmb3Igd2VsbC1rbm93biBwcmFjdGljYWwgcmVhc29ucy4gVGh1cyB3ZSANCm1vc3RseS1o YXZlLXRvLXRydXN0IHRoZSBvcmlnaW5hbCBwdWJsaXNoZXIgdGhhdCB0aGVpciBuZXR3b3Jr IGhhcyBiZWVuIA0KYnVpbHQgYXMgZG9jdW1lbnRlZCAob3IgZXZlbiAiZG9jdW1lbnRlZCIg Z2l2ZW4gdGhlIHN0YXR1cyBvZiBnbnViZykuIEluIA0KcHJhY3RpY2UgdGhpcyBpcyBub3Qg YSBwcm9ibGVtIGZvciBhIEJhY2tnYW1tb24gZW5naW5lLCBvciBldmVuIGZvciANClRlc3Nl cmFjdCBiZWNhdXNlIGFueSBzZXJpb3VzIHVzZSBjYXNlIHN1cHBvcnRzLCBpZiBub3QgcmVx dWlyZXMsIGh1bWFuIA0KdmVyaWZpY2F0aW9uIG9mIHRoZSByZXN1bHQg4oCUIGJ1dCBob3cg c3VyZSBjYW4gSSBiZSB0aGF0IGEgTExNIGludGVuZGVkIA0KZm9yIGhvbWUgYXV0b21hdGlv biBkb2Vzbid0IGNvbnRhaW4gYW4gT3BlbiBTZXNhbWUgYmFja2Rvb3IgdGhhdCB1bmxvY2tz IA0KbXkgKmhvbWUqJ3MgYmFjayBkb29yPw0KDQotLSANCi0tIHJlZ2FyZHMNCi0tIA0KLS0g TWF0dGhpYXMgVXJsaWNocw0KDQo=
--------------3bG1qfqKabRQnDNnBaPhBn00
Content-Type: text/vcard; charset=UTF-8; name="matthias.vcf" Content-Disposition: attachment; filename="matthias.vcf" Content-Transfer-Encoding: base64

QkVHSU46VkNBUkQNClZFUlNJT046NC4wDQpOOlVybGljaHM7TWF0dGhpYXM7OzsNCk5JQ0tO QU1FOlNtdXJmDQpFTUFJTDtQUkVGPTE6bWF0dGhpYXNAdXJsaWNocy5kZQ0KVEVMO1RZUEU9 d29yaztWQUxVRT1URVhUOis0OSA5MTEgNTk4MTggMA0KVVJMO1RZUEU9aG9tZTpodHRwczov L21hdHRoaWFzLnVybGljaHMuZGUNCkVORDpWQ0FSRA0K

--------------3bG1qfqKabRQnDNnBaPhBn00--

--------------cNbXu2Yrs4iefd0qa27kpqWo--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgO0IsFAwAAAAAACgkQcs+OXiW0wpNL whAAj50g/haIVHDiyGzf252JTCwVDuHjT2C9hka6Qn5WMi6fP6xP3e9frvQmIQtNW7/jS1D/AF/w hiMsWhmtY/KCA5KFobRJrw+y6kp/bKFNp8oHrth5UXa7B/Zf7UggWdV1w7TDxPK9Op3t9bpZ1aKv aIuiU0BEWqKUt1cqxe9sJZGJqJot7LAnFV6eN9fe0OYM0V/Ov1f37vc4w1Irr+eXKAPaydOIrsRA cRLOYaZv+QmpDne0K1aScEKc1LYPI6NMRguSdJda5D8QEOryZ4LsA5ujZmBwrnKnjBbxR0DCyilu i1A8bkSlihlYjCwzyxJ8IhazQe9W2y0UR7Cg+DDR1zAxKw5S/A68qQ8GQww/Y1l7Y3zMUT6KzUZB XGUfmtQTzQ9z/Jc3Tl12hSpZ7edLQ+vy0LfPpiMyO8GWRBApQAq+nKJK7tFE8XyRWfGNGpGPADXW ao18EwryagHWeNsQmaV23NJoQNJH0rP41OS317au63lIVj42WqSV9edsy9b5ZPE9H8Ln+grCzzZK cA349LBKT2hVSBWe0lzXXPulTrkbRonMY3h1DrtKwSgg4F0FikQOq+8UvId9ulhqwJGtfB6uZoN+ cpbgEwzKeOXSHpWSkoZmOUgkG2eX1vduDkiPTQxdIHp/SGwA6lohbaEgB866Sf4rO6J/LKhzcaCd Bl0=
=l4Fk
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Matthias Urlichs on Mon Apr 28 03:50:01 2025

Matthias Urlichs <[email protected]> writes:

The fact remains that our builders will be unable to reproduce the
resulting network, for well-known practical reasons. Thus we mostly-have-to-trust the original publisher that their network has been
built as documented (or even "documented" given the status of gnubg). In practice this is not a problem for a Backgammon engine, or even for
Tesseract because any serious use case supports, if not requires, human verification of the result — but how sure can I be that a LLM intended
for home automation doesn't contain an Open Sesame backdoor that unlocks
my *home*'s back door?

Right, this is a known attack in the security literature with some
research behind it already. See, for example:

https://arxiv.org/abs/2204.06974

We could get some protection by retraining the model from the base
training data and substituting our constructed model for the
upstream-provided one, but (a) that puts a lot more weight on our ability
to rebuild the model than just verification, and (b) I would not assume
it's impossible to hide backdoor construction in the training data either, particularly if the training data is voluminous. See, for example:

https://nisos.com/research/building-trustworthy-ai/

and that's just the first of many links that I found in a quick search.

Obviously there are a bunch of use cases for these things that will never involve adversarial data, and not everything needs to be robust in order
to be included in Debian, but it's one of the things to be thinking about
if accepting even our current status quo position of including some ML
models in Debian main without being able to verify the model construction.

LLMs in particular are nascent techology with novel security flaws that researchers are only starting to explore. I think the chances are high
that their security will get much, much worse before it gets better. It is
one of the many reasons why I am generally an LLM skeptic.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From [email protected]@21:1/5 to All on Mon Apr 28 06:50:01 2025

CgpPbiBBcHIgMjgsIDIwMjUgMDk6NDcsIFJ1c3MgQWxsYmVyeSA8cnJhQGRlYmlhbi5vcmc+IHdy b3RlOgoKPiBXZSBjb3VsZCBnZXQgc29tZSBwcm90ZWN0aW9uIGJ5IHJldHJhaW5pbmcgdGhlIG1v ZGVsIGZyb20gdGhlIGJhc2UgCgo+IHRyYWluaW5nIGRhdGEKCgpGb3IgdGhlIG1vbWVudCwgbm9i b2R5IGluIHRoaXMgdGhyZWFkIGNvbnNpZGVyZWQgcmV0cmFpbmluZy4gSSBzYXkgdGhhdCBpbiBz b21lIGNhc2UsIGdpdmVuIHNvbWUgc3BvbnNvcnMsIGl0IHNob3VsZCBiZSBkb2FibGUuIEkgb2Yg Y291cnNlIGRvIG5vdCBlbnZpc2lvbiBydW5pbmcgMzAwaysgR1BVIGxpa2UgRmFjZWJvb2sgZG9l cywgdGhvdWdoIG5vZGVzIHdpdGggOHggSDEwMCBhcmVuJ3QgY29tcGxldGVseSBvdXQgb2YgcmVh Y2guIFRoZXJlIGFyZSBhIGZldyB3YXlzIHdlIGNvdWxkIGdldCBvdXIgaGFuZHMgb24gdGhlbS4g RWl0aGVyIGFza2VkIGZvciBWTXMgc3BvbnNvcmVkLCBvciBydW4gYSBjcm93ZWQgZnVuZGluZyB0 byBnZXQgdGhlIGNhc2ggb3duZWQgYnkgRGViaWFuLgoKClRoaXMgbWF5IGJlIHRoZSBzYW1lIHBy b2plY3QgYXMgYnVpbGRpbmcgYSBjbG91ZCBmb3IgRGViaWFuLgoKClBsZWFzZSBkb24ndCBzYXkg aXQgcmVxdWlyZXMgdG9vIG11Y2ggc2tpbGxzIG9yIHdvcmssIElUIElTIGRvYWJsZSB3aXRoIGVu b3VnaCBzcG9uc29ycy4gSU1PIHdlIHNob3VsZCBnaXZlIGl0IGEgdHJ5IGlmIHdlIHJlYWxseSBu ZWVkIGl0LgoKClRob21hcyBHb2lyYW5kICh6aWdvKQoKCg== PGh0bWw+PGJvZHk+PGJyPjxkaXYgZGlyPSJsdHIiPk9uIEFwciAyOCwgMjAyNSAwOTo0NywgUnVz cyBBbGxiZXJ5ICZsdDtycmFAZGViaWFuLm9yZyZndDsgd3JvdGU6PC9kaXY+CjxkaXYgZGlyPSJs dHIiPiZndDsgV2UgY291bGQgZ2V0IHNvbWUgcHJvdGVjdGlvbiBieSByZXRyYWluaW5nIHRoZSBt b2RlbCBmcm9tIHRoZSBiYXNlIDwvZGl2Pgo8ZGl2IGRpcj0ibHRyIj4mZ3Q7IHRyYWluaW5nIGRh dGE8L2Rpdj4KPGJyPjxkaXYgZGlyPSJsdHIiPkZvciB0aGUgbW9tZW50LCBub2JvZHkgaW4gdGhp cyB0aHJlYWQgY29uc2lkZXJlZCByZXRyYWluaW5nLiBJIHNheSB0aGF0IGluIHNvbWUgY2FzZSwg Z2l2ZW4gc29tZSBzcG9uc29ycywgaXQgc2hvdWxkIGJlIGRvYWJsZS4gSSBvZiBjb3Vyc2UgZG8g bm90IGVudmlzaW9uIHJ1bmluZyAzMDBrKyBHUFUgbGlrZSBGYWNlYm9vayBkb2VzLCB0aG91Z2gg bm9kZXMgd2l0aCA4eCBIMTAwIGFyZW4mIzM5O3QgY29tcGxldGVseSBvdXQgb2YgcmVhY2guIFRo ZXJlIGFyZSBhIGZldyB3YXlzIHdlIGNvdWxkIGdldCBvdXIgaGFuZHMgb24gdGhlbS4gRWl0aGVy IGFza2VkIGZvciBWTXMgc3BvbnNvcmVkLCBvciBydW4gYSBjcm93ZWQgZnVuZGluZyB0byBnZXQg dGhlIGNhc2ggb3duZWQgYnkgRGViaWFuLjwvZGl2Pgo8YnI+PGRpdiBkaXI9Imx0ciI+VGhpcyBt YXkgYmUgdGhlIHNhbWUgcHJvamVjdCBhcyBidWlsZGluZyBhIGNsb3VkIGZvciBEZWJpYW4uPC9k aXY+Cjxicj48ZGl2IGRpcj0ibHRyIj5QbGVhc2UgZG9uJiMzOTt0IHNheSBpdCByZXF1aXJlcyB0 b28gbXVjaCBza2lsbHMgb3Igd29yaywgSVQgSVMgZG9hYmxlIHdpdGggZW5vdWdoIHNwb25zb3Jz LiBJTU8gd2Ugc2hvdWxkIGdpdmUgaXQgYSB0cnkgaWYgd2UgcmVhbGx5IG5lZWQgaXQuPC9kaXY+ Cjxicj48ZGl2IGRpcj0ibHRyIj5UaG9tYXMgR29pcmFuZCAoemlnbyk8L2Rpdj4KPGJyPjwvYm9k eT48L2h0bWw+

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefano Zacchiroli@21:1/5 to Russ Allbery on Mon Apr 28 09:20:01 2025

On Sun, Apr 27, 2025 at 03:02:57PM -0700, Russ Allbery wrote:

However, I'm not sure it's very *practical* unless our position is that
we're simply not going to package software that uses machine learning
models (a decision that we could certainly make, but which seems a bit contrary to our normal desire to be a universal operating system).
Problems just off the top of my head include:

[...]

Let me just add one more to your list, which I hinted at in the previous message, but would like to make more explicit.

- It's more likely with training data than with (source) code that we
will encounter situations where material is not copyrightable. (Not in
most LLM cases, for the reasons you mentioned, but we will in other
cases, like very probably the gnubg one.) Hence we will have to make
an explicit decision that "this material, without an associated
copyright notice and license" is DFSG-free. That is a very different
kind of decisions than the ones we are used to make (via ftpmasters),
because it is a *case by case* one, rather than *license by license*.
As such, it scales much less.

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmgPKpIACgkQfH5Cj5NB J5loVhAAnjqdhKKJ0fNiETF2sRp9LT6JkvmaQl2+8fZHdkCdNqqedz+h0yrd0/j2 G0NrRql/ZqfHZgHpCoCDAaYTkfnpdH/NmgcBRGRe6j7QDmtZpT3Ef8w4E7MVU9EV eSRcDuRzo9JjcVghf1XYg9gA9w+CTxGBxSGhIVbkcdha8jd4p/3xiYim28GloZ3m SaPgZLvcBO/C4n0/gOHR2JISDJhfYI5tz0oSyBMCtF8gUEPt36rt/u/jvsdFMwiO n45BS3uefT62i1okr+sJ00om3x0GdIr6ZnJ5XkrAc+hOLpky/BahJgzwanR0Xz7o CWKJzKgt80I1j79ysAJOksubcxo/6cnulwOKOPYvbUVmF0WRr4Ad370w4RZjFlpT c/6Zi+nrW1xKHV11EIphbLmkd8OXQ+eI6pR7R8xeHldsCzaZU4Ye1qVUsRvkf9x7 gQkjs6SO6gp2X4p71hCG5ud5n1L+nCsFbB8cfn3BHniYL47OP+dTef75RzQbYB0R quOjbmdh8CoOjE5Zlj8Fz7

From Jonathan Dowland@21:1/5 to Holger Levsen on Mon Apr 28 14:20:01 2025

On Sun Apr 27, 2025 at 12:37 PM BST, Holger Levsen wrote:

I'm also very uncomfortable speaking about AIs similar like I don't
like the term IP=intellectual property...

It would be worthwhile to restrict ourselves to "LLM" for these things
since "AI" is a much broader term and many other technologies (past or
future) may be described as "AI" whilst not being LLMs.

--
Please do not CC me for listmail.

👱🏻 Jonathan Dowland
✎ [email protected]
🔗 https://jmtd.net

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Aigars Mahinovs on Mon Apr 28 18:50:02 2025

Aigars Mahinovs <[email protected]> writes:

If we take as a given that copyright does *not* survive the learning
process of a (sufficiently complex) AI system, then it is *not* necessary that all training *data* for training a DFSG-free AI to also be DFSG-free.
It is however necessary that:
* software needed for inference (usage) of the AI model to be DFSG-free
* software needed for the training process of the AI model to be DFSG-free
* software needed to gather, assemble and process the training data to be DFSG-free or the manual process for it to be documented

Without necessarily disagreeing with this, I want to highlight that
licensing is only *one* of the considerations behind the DFSG and we
shouldn't fixate only on it. The other question is whether the training
data constitutes source code in the sense of DFSG 2. I think there's at
least a prima facie case that it is: The final training model is quite
clearly not the preferred form of modification, and anyone who wanted to retrain the model would normally prefer to start with the existing
training data set (and then possibly augment or filter it).

Historically, we have not done this analysis, and we've basically ignored
this problem. I packaged gnubg for years and never included the training
data and treated the model weights like they were the source code, and no
one really noticed or complained. But I'm not sure that was a defensible position. It was just something I did by default without really thinking
about it. Now that the topic has come up and I've had a chance to think
about it properly, I'm not at all sure that was correct.

DFSG 2 is an independent requirement. Even if the source code to a package
is clearly DFSG-free, we still require that the source code be in main,
not off somewhere else where we promise it exists, really (but which is
not under our control). We have historically not applied that to the
training data for models, and maybe that's correct, but the correctness of
that position is certainly not obvious to me from the wording of the DFSG.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Jonathan Dowland on Mon Apr 28 18:50:02 2025

"Jonathan Dowland" <[email protected]> writes:

On Sun Apr 27, 2025 at 12:37 PM BST, Holger Levsen wrote:

I'm also very uncomfortable speaking about AIs similar like I don't
like the term IP=intellectual property...

It would be worthwhile to restrict ourselves to "LLM" for these things
since "AI" is a much broader term and many other technologies (past or future) may be described as "AI" whilst not being LLMs.

The GR as proposed would apply to a lot of things that are not LLMs,
though. I think the right terminology for what we're currently talking
about might be "machine learning model," which encompasses a wider set of onstructions from processed training data without limiting them to only large-language models.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gunnar Wolf@21:1/5 to All on Mon Apr 28 19:20:01 2025

Russ Allbery dijo [Mon, Apr 28, 2025 at 09:46:41AM -0700]:

I'm also very uncomfortable speaking about AIs similar like I don't
like the term IP=intellectual property...

It would be worthwhile to restrict ourselves to "LLM" for these things
since "AI" is a much broader term and many other technologies (past or
future) may be described as "AI" whilst not being LLMs.

The GR as proposed would apply to a lot of things that are not LLMs,
though. I think the right terminology for what we're currently talking
about might be "machine learning model," which encompasses a wider set of >onstructions from processed training data without limiting them to only >large-language models.

This is an important point, which I subscribe. Since its inception over 60 years ago, "Artificial Intelligence" is fluffy marketspeak (even if spoken
and embraced by reknown scientists). I have adopted the term "Apparent Intelligence", proposed by Offray Luna (as the generated complex answers/behavior that _are seen from the outside_ as something intelligent.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gunnar Wolf@21:1/5 to All on Mon Apr 28 21:50:01 2025

Aigars Mahinovs dijo [Mon, Apr 28, 2025 at 12:03:19PM +0200]:

IMHO we are here having a very annoying mixture of technical, legal and >philosophical problems.

Hypothetical 1: Bob reads all programming manuals and all DFSG-free code in >Debian and GitHub and teaches themselves Python programming. They are asked >to solve a simple problem. Their answer basically matches sample solutions >from a few Python coding manuals.

Can Bob release this solution as DFSG-free code? Does it matter if the >specific programming manual or Python course manual was DFSG-free licensed
or not? Does it matter if the manual had a GPL license? What if they
learned in a university setting?

Hypothetical 2: An abstract AI Alice does the same learning process as Bob >and produces the same output in an answer to the same request.

Do the conditions on the output of Alice change? Is the change technical or >legal/philosophical? You could call this a Turing test for copyright.

The main difference between Bob and Alice is not judgeable(?) based on the different sets of outputs they will emit, but on the legal recognition of
the (alleged?) author's personhood: Bob will be recognized as a person, and
as such, the code he emits will be copyrighted to him. Alice lacks
personhood; everything it emits is just the output of a machine. And, yes, attribution is very hard to assert for whatever code it produces.

Naturally, if Alice were to be trained by the works of Shakespeare, it is
very unlikely she would be able to output proper Python, unlike the
situation you describe.

Note, of course, that Bob's personhood does not exempt him from plagiarism,
or from re-creating non-copyrighteable trivial code. For example, I am
writing the following from my internal neural network:

#include <stdio.h>
void main() {
printf("Hello world!\n");
}

Is that code mine? No. Is that code copyrightable to me, given it is a fact
I emitted it from my previous training? No. That's too trivial to
copyright.

Processing of experiences into expert opinion is IMHO not directly
comparable with compilation of source to a binary. Regardless if it's done
by a human or a software system. The copyright law makes a distinction here >for humans. And while no explicit legal precedent is yet set for any kinds
of AI (including LLMs), the very lack of massive copyright violation
lawsuits from very sue-happy corporations, like Disney, is already a >noteworthy precedent. If LLMs from Meta and OpenAI (and others) are not
being sued for massive copyright violations, then it is the consensus of
our society and of our legal system that the same kind of expert opinion / >learning protections that humans enjoy also seem to apply to complex-enough >artificial expert systems. One hand-wavy legal loophole could be that the >learning process splits the copyrighted works into chunks small enough that >none of those chunks would legally retain the copyright protection anymore. >But that is just one of many speculations until a law or a court
establishes such guidelines.

Right. Oh, but we are very, very, very good at extending our "knowledge" (training? Inexpert training?) of legal uses of our favored licensing
schemes that we want to look at everything base on our learnt patterns...

What does that mean in terms of this proposal (or a potential alternative >proposal)?

If we take as a given that copyright does *not* survive the learning
process of a (sufficiently complex) AI system, then it is *not* necessary >that all training *data* for training a DFSG-free AI to also be DFSG-free.
It is however necessary that:
* software needed for inference (usage) of the AI model to be DFSG-free
* software needed for the training process of the AI model to be DFSG-free
* software needed to gather, assemble and process the training data to be
DFSG-free or the manual process for it to be documented

In this perspective, we would be seeing the training data itself as
immutable and uncopyrightable facts of world and nature, like positions and >spectra of stars in the sky (because its copyright does not survive the >learning process). It is data that can be gathered again, maybe with slight >variation in results and it does not really change based on who does the >gathering (assuming similar resources get invested).

Wait — Training data are chunks of software. I understand where you are getting to, but in order to redistribute it, we must have the right to. How
do we say that training data are "immutable and uncopyrightable facts of
world and nature"? The heavily trained machine didn't learn from objects randomly happening in nature...

Greetings,

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Aigars Mahinovs on Mon Apr 28 22:10:02 2025

Aigars Mahinovs <[email protected]> writes:

*However*, models again are substantially different from regular
software (that gets modified in source and then compiled to a binary)
because such a model can be *modified* and adapted to your needs
directly from the end state. In fact, for adjusting a LLM for use in a particular domain or a particular company it actually *is* the "binary"
that is the *preferred* form to be modified - you take a model that
"knows" a lot in general and "knows" how your language works and you
train the model further by doing specialisation training for your,
specific data set. And a result you get from one "generic" binary
another - "specialized" binary.

I have to say that I'm not convinced by this argument that models are any different than other types of software. To me, this type of "modification"
is akin to using code as a library without modifying it. Yes, that is a
thing that people often want to do. It is by far the most common way to
use a library, because this is the whole point of a library. But we still
hold libraries to the DFSG. It's very, very rare for me to want to modify
libc, or for that to be a good idea, but we wouldn't ship libc without
source code, because sometimes we really do want to modify the library
itself.

One of the reasons why I'm so leery of a theoretical argument that tries
to say that a machine learning model isn't really software in the sense
that we think of it is that this conclusion is appealingly convenient.
It's very impractical and difficult to treat the training data as source
code, so I have a subconscious temptation to find some reason to justify
why it's not, which can lead to magnifying differences to a point that I
worry isn't justified.

I think I'd personally be more comfortable with tackling the real problem
head on: We're probably not capable, in general, of treating the training
data like source code, so now what? But I am one of those people who
prefers a system of broad and conflicting rights that require thoughtful balancing, rather than a system of narrow and absolute rights.

So, very precisely speaking, modification of a LLM does *not* require
the original training data. Recreating a LLM does. Also developing a new
LLM with different training methods or training conditions does need
some training data (ideally the original training data, especially to
compare end performance). But all in all a developer on a Desert Island
would be better off with a "binary" model to be modified than without
it.

This last argument is true of all proprietary software, though. One is
always better off, at least in some immediate practical sense, having
something with severe usage restrictions than not having anything at all.
This isn't the test we use for the DFSG, though. Debian's position is that
if we can't offer you all of the DFSG freedoms, we don't put the software
in main, even if it would still be very useful within those restrictions.

Say for example that an IDE saves its configuration state not in a
common text file, but as a binary memory dump. Say the maintainer of
such a package would use their experience of the IDE and years of
development to go through the GUI of this software to assemble a great
setup configuration that is great for anyone starting to use the IDE and
also has clues left around it how to tailor it further for your needs.
This configuration (as a binary memory dump of the software state) is
then distributed to the users as the default configuration. What is "the source" of it?

I agree that in this case there is no separate source code and this binary
data structure is the preferred form of modification. But that's because
this data structure was created by a human directly, not by an automated process. It is a configuration file that a user wrote via an editor (the
IDE).

Isn't this binary (that the GUI can both read and write) not the
preferred form for modification? The maintainer can describe how he
created the GUI state (document the training process), but not really
include all his relevant experience (training data) that led him to
believe that this state is the best for the new users.

I guess all I can say is that I disagree with this way of analyzing the situation on a whole lot of levels, philosophical, practical, and legal.
To me, this is making the unwarranted leap to assuming that machine
learning models are like Commander Data from Star Trek: independent life
forms that are morally equivalent to a human being and therefore should
receive the same special treatment in free software ethics as human
beings. To me, this is just obviously not the case, and I have absolutely
no qualms about treating human activity as fundamentally and completely different than computer activity in our ethics and in our free software guidelines.

Or Debian could go the MS TTF route - have the software in the archive,
but no models at all. And to get the software to work users would get
used to run a script that would be always pulling a model from
huggingface.co either manually or even during package installation.
Possible with a barely functional placeholder model in the package that
99% of users would replace in real usage. That would keep the "evil" AI
away from the archive, but will that benefit our users?

I would echo the pleas elsewhere to avoid loaded terms like "evil" or
"toxic" or whatever, because we don't have to agree on a morality in order
to agree on an ethical structure for deciding what is and isn't free
software.

I personally do not believe proprietary software is evil in some greater
moral sense. I know there are people in the free software community who
believe this, but I do not, and I am not required to believe this to participate in Debian. All that I'm required to do is to agree that Debian
is for a specific type of software that meets a set of ethical
requirements, and that software that does not meet those ethical
requirements, whether good or bad, useful or not useful, should not be
part of Debian. If I want to work on such software, I am free to do that,
just not here. Debian provides a general-purpose computing platform that I
can (and do) use to do all sorts of things that fall outside the scope of
the Debian Project.

We don't need to, and should not, decide that everything that falls
outside of Debian's DFSG is evil. That's not the purpose of our
guidelines. The purpose is to set the boundaries of what the project is
for. Different people in the project will agree to those boundaries for different reasons and with entirely different personal perspectives on the morality of them. We don't have, or need, conformity here.

My goal in this discussion is to advocate for clearly defining the
boundaries of Debian so that people can rely on those definitions when
deciding whether to do their work inside Debian or elsewhere. It's
perfectly fine for us to ask people to do some kinds of work elsewhere.
Debian is quite far from the only worthwhile software organization in the world. It's fine for us to limit our scope for many different reasons, including to avoid disruptive internal conflict, and that does not carry
any project-wide judgment on the things we have decided to not actively support.

Will that benefit the development of a freer and more accessible AI landscape?

This is not a goal of the Debian Project at present. It of course could be
if we decided to adopt it, but it's not at all clear to me that we would
choose to do so. (It may, of course, be a goal of some individuals within
the Debian Project, and that's fine, but that doesn't carry as much weight
in our project-wide decision-making process.)

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefano Zacchiroli@21:1/5 to Gunnar Wolf on Mon Apr 28 22:20:01 2025

On Mon, Apr 28, 2025 at 01:47:09PM -0600, Gunnar Wolf wrote:

Wait — Training data are chunks of software. I understand where you are getting to, but in order to redistribute it, we must have the right to. How do we say that training data are "immutable and uncopyrightable facts of world and nature"? The heavily trained machine didn't learn from objects randomly happening in nature...

The current running theory among most FOSS legal scholars I've spoken to
is that works solely generated by AI are non-copyrightable, at least in
the US, and hence in the public domain. (Not because they are "facts",
but because they are generated by a machine.) Under that hypotheses,
both we and our users will enjoy all traditional DFSG freedoms on
generated LLM content.

Exceptions to the above are: (1) when there is significant creative contribution by the AI user, but in that case the generated work is
copyrighted by the AI *user*, not by the copyright owners of material
present in the training dataset, and (2) when the output is a verbatim
copy of some training input (sometimes referred to as "recitation").
Note that recitation is nowadays something that commercially deployed
LLMs like Copilot are really good at *not* doing anymore, by applying plagiarism/code-clone detection techniques between the generated output
and the training dataset *before* returning any output to the user.

AFAIK this legal theory has not been tested in court yet. But the big commercial players (who, remember, have vetted interests in being
copyright absolutists) believe in it so much, that they go as far as
offering legal indemnity promises to users of their LLMs who encounter
legal issues due to the use of generated output. (Copilot does this,
provided that the protections against recitation are not disabled; they
are enabled by default.)

So I strongly advise that we do *not* base our voting decisions, or
strategic considerations for free software, on the hypothesis that LLM
outputs are derived works under copyright law of the training datasets
in the general case. Doing so is currently at high risk of exploding in
our hands spectacularly. (Note: we might decide to take the stance that
we *treat* LLM output as if it were derived work under copyright of
training datasets. What we should not do is anchoring that decision into copyright law determinations about this specific point.)

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmgP4PIACgkQfH5Cj5NB J5mNPBAAor0ctTB/gn0SV4QnkswkeK6gJxGuPsejGEGAO199bCD/99TDZEDQtN5x VfmmsRU5ydpwPVGDtCkPKYlIfDt79B0OyglYIY6SRh7bLwJoZ2uurGrxtDH54YJ7 ikFnMdbHc7q+3Snd1/fAok6hFy+zy/7lF+pGRLTZlBPv96/i9/vzyqwl2tYAnQBw FUgn13TXO1dKISIQNF9ZFf5en1qB5sCsYFU/CeILDLR5ti5DcQIupzYEM5ZBFdcD MKYgAnIovxc3IiL6GnfHQdHZ/yNIUAoYqmbpnwODSPRq4bzauPbZGnhYsEpCJchv QelJX6SihWxlThXphLx7AllgsTkVfe8A91OFsSY4razbPex0NjEyfFRBo7hSeLAk bEbZqfdMPJtsWCzqAEjblsV7/XsEsDQlQPYwMHk2dl91rqN0G3mC+8VClx6ZjZTG GEojPrPL8jNc7L0A+pdDjDLJ86GM2JE0KW/JypaRlrutRS51DAXHGhErPX1cdG7x Yr+WZzBWcYeY8jsJmDAiWz

From Gunnar Wolf@21:1/5 to All on Tue Apr 29 02:10:01 2025

Aigars Mahinovs dijo [Mon, Apr 28, 2025 at 09:24:04PM +0200]:

(...)
So, very precisely speaking, modification of a LLM does *not* require the >original training data. Recreating a LLM does. Also developing a new LLM
with different training methods or training conditions does need some >training data (ideally the original training data, especially to compare
end performance). But all in all a developer on a Desert Island would be >better off with a "binary" model to be modified than without it.

Say for example that an IDE saves its configuration state not in a common >text file, but as a binary memory dump. Say the maintainer of such a
package would use their experience of the IDE and years of development to
go through the GUI of this software to assemble a great setup configuration >that is great for anyone starting to use the IDE and also has clues left >around it how to tailor it further for your needs. This configuration (as a >binary memory dump of the software state) is then distributed to the users
as the default configuration. What is "the source" of it? Isn't this binary >(that the GUI can both read and write) not the preferred form for >modification? The maintainer can describe how he created the GUI state >(document the training process), but not really include all his relevant >experience (training data) that led him to believe that this state is the >best for the new users. So what is LLama if not a **very** complex nvim >configfile focused on autocomplete? :D Quite a few of those questions also >apply to fonts (IMO).

Exactly what I tried to present in my latest mail. I think you did a better
job at explaining than myself. Thanks!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Matthias Urlichs@21:1/5 to All on Tue Apr 29 08:10:01 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------QRo08sVFQEaziRqY5P3eSTBT
Content-Type: multipart/mixed; boundary="------------lv8SDqyT0M3sZk0sqpeYtBDR"

--------------lv8SDqyT0M3sZk0sqpeYtBDR
Content-Type: multipart/alternative;
boundary="------------H3u5GONcT4Yeqlxh0PMeehgH"

--------------H3u5GONcT4Yeqlxh0PMeehgH
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMjguMDQuMjUgMjE6MjQsIEFpZ2FycyBNYWhpbm92cyB3cm90ZToNCj4gU28sIHZlcnkg cHJlY2lzZWx5IHNwZWFraW5nLCBtb2RpZmljYXRpb24gb2YgYSBMTE0gZG9lcyAqbm90KiBy ZXF1aXJlIA0KPiB0aGUgb3JpZ2luYWwgdHJhaW5pbmcgZGF0YS4gUmVjcmVhdGluZyBhIExM TSBkb2VzLg0KDQpJTUhPIHRoYXQncyBhIHJhdGhlciBhY2FkZW1pYyBkaXN0aW5jdGlvbi4g WWVzICpzb21lKiBtb2RpZmljYXRpb25zIA0KZG9uJ3QgcmVxdWlyZSBvcmlnaW5hbCB0cmFp bmluZyBkYXRhLCBtdWNoIGxpa2Ugc29tZSBtb2RpZmljYXRpb25zIHRvIA0KbGliYyBkb24n dCByZXF1aXJlIHNvdXJjZSBjb2RlIChJJ20gZG9pbmcgaXQgbXlzZWxmOyBpbiBvbmUgb2Yg bXkgDQpwcm9qZWN0cyBJIHBhdGNoIHRoZSBsaWJjIGxvYWRlciB0byBzZWFyY2ggaW4gL3Yv dS9saWIgaW5zdGVhZCBvZiANCi91c3IvbGliIGJlY2F1c2Ugb2YgcHNldWRvIG11bHRpLWFy Y2gpIGJ1dCBtb3N0IGRvLg0KDQpIb3dldmVyLCBhbmQgcmV0dXJuaW5nIHRvIHRoZSByb290 IG9mIHRoaXMgZGlzY3Vzc2lvbjogV2hlbiB3ZSB0YWxrIA0KYWJvdXQgc291cmNlIGFzIHRo ZSBwcmVmZXJyZWQgd2F5IG9mIG1vZGlmeWluZyBzb21ldGhpbmcsIHdlIG5lZWQgdG8gYXNr IA0KKndob3NlKiBwcmVmZXJyZWQgd2F5LiBUaGUgdXNlcidzPyBDZXJ0YWlubHkgbm90LCBv dGhlcndpc2Ugd2Ugd291bGRuJ3QgDQpuZWVkIHRvIHNoaXAgbnZpbSdzIHNvdXJjZSBjb2Rl Lg0KDQpUaHVzIGl0J3MgdGhlIGRldmVsb3BlcnMnIHByZWZlcnJlZCBzb3VyY2UsIHdoaWNo IGxlYXZlcyBwcmUtYnVpbHQgDQptb2RlbHMgb3V0IGluIHRoZSBjb2xkLg0KDQpIb3dldmVy wrIgSU1ITyB3ZSBuZWVkIHRvIGRpc3Rpbmd1aXNoIGJldHdlZW4gdGhpbmdzIGxpa2UgZ251 Ymcgb3IgDQp0ZXNzZXJhY3QsIGFuZCB0b2RheSdzIExMTXMgb3Igc2ltaWxhciAibGFyZ2Ui IG1vZGVscy4NCg0KV2UgY2FuLCBhYnNlbnQgbm8gY29weXJpZ2h0IHJlc3RyaWN0aW9ucywg bW9yZS1vci1sZXNzLWVhc2lseSByZWNyZWF0ZSANCnRoZSBmb3JtZXIncyBtb2RlbHMgZnJv bSB0aGVpciB0cmFpbmluZyBkYXRhLg0KDQpXZSBjYW4ndCBkbyB0aGF0IHdpdGggTExNcyBv ciBzaW1pbGFyLXNpemVkIG1vZGVscywgZXZlbiBpZiB3ZSBoYWQgDQpzb3VyY2UgY29kZS4N Cg0KVGhlaXIgZGV2ZWxvcGVycyBjcmVhdGUgYSBtb2RlbCdzIGFyY2hpdGVjdHVyZSwgcHJl c3VtYWJseSBzb21lIA0KUHl0aG9uLW9yLXdoYXRldmVyIHNvdXJjZSBjb2RlIGFuZC9vciBh IGRlc2NyaXB0aXZlIGxhbmd1YWdlLCB3aGljaCAqaXMqIA0KdGhlaXIgc291cmNlLiBXZSBk b24ndCBnZXQgdGhhdC4gVGhpcyBzb3VyY2UgZ2V0cyBjb21waWxlZCB0byB3aGF0ZXZlciAN Cih3ZSBhbHNvIGRvbid0IGdldCB0aGVzZSBiaW5hcmllcykuIFRoZSByZXN1bHQgaXMgdGhl biBydW4gaW4gdHJhaW5pbmcgDQptb2RlIG9uIGEgbGFyZ2UgY29ycHVzIHdoaWNoIERlYmlh biBjYW4ndCBkaXN0cmlidXRlIChhKSBmb3IgY29weXJpZ2h0IA0KcmVhc29ucyBidXQgYWxz byAoYikgYmVjYXVzZSBpdCdzIHRvbyBkYW1uIGxhcmdlLCBlbmQgdXAgd2l0aCBhIGJhc2Ug DQptb2RlbCB3aGljaCB0aGV5IGRvbid0IGdpdmUgdXMgZWl0aGVyIGFuZCB3aGljaCBnZXRz IHR3ZWFrZWQgYnkgZnVydGhlciANCnRyYWluaW5nIGFuZCBodW1hbiBmZWVkYmFjayAocGFy dGx5IGJ5IHBvb3JseS1wYWlkIGdpZyB3b3JrZXJzIGluIA0KZGV2ZWxvcGluZyBjb3VudHJp ZXMpLCB0aGVuIGRpc3RpbGxlZCBkb3duIHRvIG1hbmFnZWFibGUgc2l6ZSAoYnV0IHN0aWxs IA0KdG9vIGxhcmdlIGZvciB1cyB0byBkaXN0cmlidXRlIGluIG1hbnkgY2FzZXMpLg0KDQpT byBvdXIgY2hvaWNlIGlzIGJhc2ljYWxseSBiZXR3ZWVuIHNoaXBwaW5nIHNvbWV0aGluZyB3 ZSBkb24ndCBjb250cm9sIA0KYW5kIGNhbid0IGludHJvc3BlY3QsIGFuZCwgd2VsbCwgbm90 IGRvaW5nIHNvLg0KDQpUaGVyZSBpcyBubyB0aGlyZCBjaG9pY2Ugb2YgZGlzdHJpYnV0aW5n IGEgZnJlZSBhbHRlcm5hdGl2ZSwgYmVjYXVzZSANCmV2ZW4gaWYgd2UgZ2V0IHRoZSBhcmNo aXRlY3R1cmUncyBzb3VyY2UgY29kZSBhbmQgYXNpZGUgZnJvbSB0aGUgDQpjb3B5cmlnaHQg aXNzdWUgYW5kIHRoZSBodW1vbmdvdXMtc2l6ZSBpc3N1ZSBhbmQgdGhlIA0KbXVsdGlwbGUt bWFudWFsLWJ1aWxkLXN0ZXBzIGlzc3VlIGFuZCB0aGUgDQpzaG91bGRuJ3Qtd2Utc2F2ZS1l bmVyZ3ktZGFtbWl0IGlzc3VlIHRoZXJlJ3MgdGhlIGxvb21pbmcgcHJvYmxlbSB0aGF0IA0K YWxtb3N0KD8pIG5vbmUgb2YgdXMgaGF2ZSBldmVuIHJlbW90ZWx5IGVub3VnaCBHUFVzIHRv IHJlcHJvZHVjZSB0aGUgDQpyZXN1bHRpbmcgbW9kZWwgaW4gdGhlIGZpcnN0IHBsYWNlLg0K DQpNeSB2b3RlIGlzIG9uIG5vdCBkb2luZyBzby4gV2UgbWlnaHQgd2FudCB0byBzaGlwIHRo ZSByZXF1aXNpdGUgdG9vbHMgaW4gDQpjb250cmliIGFuZCBsZXQgcGVvcGxlIGRvd25sb2Fk IHRoZSBtb2RlbHMgZnJvbSBodWdnaW5nZmFjZSwgYnV0IHRoYXQncyANCmFzIGZhciBhcyBJ IHdhbnQgdG8gdGFrZSBEZWJpYW4gaW4gdGhhdCBkaXJlY3Rpb24uDQoNCi0tIA0KLS0gcmVn YXJkcw0KLS0gDQotLSBNYXR0aGlhcyBVcmxpY2hzDQoNCg== --------------H3u5GONcT4Yeqlxh0PMeehgH
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 28.04.25 21:24, Aigars Mahinovs
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CABpYwDUBjmsaED7KRCscQCz9V4apZesYKeyJwpAq2UDcn6UKYQ@mail.gmail.com">So,
very precisely speaking, modification of a LLM does *not* require
the original training data. Recreating a LLM does.</blockquote>
<p>IMHO that's a rather academic distinction. Yes *some*
modifications don't require original training data, much like some
modifications to libc don't require source code (I'm doing it
myself; in one of my projects I patch the libc loader to search in
/v/u/lib instead of /usr/lib because of pseudo multi-arch) but
most do.</p>
<p>However, and returning to the root of this discussion: When we
talk about source as the preferred way of modifying something, we
need to ask *whose* preferred way. The user's? Certainly not,
otherwise we wouldn't need to ship nvim's source code.</p>
<p>Thus it's the developers' preferred source, which leaves
pre-built models out in the cold.<br>
</p>
<p>However² IMHO we need to distinguish between things like gnubg or
tesseract, and today's LLMs or similar "large" models.</p>
<p>We can, absent no copyright restrictions, more-or-less-easily
recreate the former's models from their training data.<br>
</p>
<p>We can't do that with LLMs or similar-sized models, even if we
had source code.</p>
<p>Their developers create a model's architecture, presumably some
Python-or-whatever source code and/or a descriptive language,
which *is* their source. We don't get that. This source gets
compiled to whatever (we also don't get these binaries). The
result is then run in training mode on a large corpus which Debian
can't distribute (a) for copyright reasons but also (b) because
it's too damn large, end up with a base model which they don't
give us either and which gets tweaked by further training and
human feedback (partly by poorly-paid gig workers in developing
countries), then distilled down to manageable size (but still too
large for us to distribute in many cases).</p>
<p>So our choice is basically between shipping something we don't
control and can't introspect, and, well, not doing so.</p>
<p>There is no third choice of distributing a free alternative,
because even if we get the architecture's source code and aside
from the copyright issue and the humongous-size issue and the
multiple-manual-build-steps issue and the
shouldn't-we-save-energy-dammit issue there's the looming problem
that almost(?) none of us have even remotely enough GPUs to
reproduce the resulting model in the first place.</p>
<p>My vote is on not doing so. We might want to ship the requisite
tools in contrib and let people download the models from
huggingface, but that's as far as I want to take Debian in that
direction.<br>
</p>
<pre class="moz-signature" cols="72">--
-- regards
--
-- Matthias Urlichs</pre>
</body>
</html>

--------------H3u5GONcT4Yeqlxh0PMeehgH--

--------------lv8SDqyT0M3sZk0sqpeYtBDR
Content-Type: text/vcard; charset=UTF-8; name="matthias.vcf" Content-Disposition: attachment; filename="matthias.vcf" Content-Transfer-Encoding: base64

QkVHSU46VkNBUkQNClZFUlNJT046NC4wDQpOOlVybGljaHM7TWF0dGhpYXM7OzsNCk5JQ0tO QU1FOlNtdXJmDQpFTUFJTDtQUkVGPTE6bWF0dGhpYXNAdXJsaWNocy5kZQ0KVEVMO1RZUEU9 d29yaztWQUxVRT1URVhUOis0OSA5MTEgNTk4MTggMA0KVVJMO1RZUEU9aG9tZTpodHRwczov L21hdHRoaWFzLnVybGljaHMuZGUNCkVORDpWQ0FSRA0K

--------------lv8SDqyT0M3sZk0sqpeYtBDR--

--------------QRo08sVFQEaziRqY5P3eSTBT--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgQZVcFAwAAAAAACgkQcs+OXiW0wpNV 7xAAvZLv3TvhU//F+2rfyfwjTLERRBmaIrhztPb+gLc+WDsnRo88VqmAlyojS4ww90b/t1WMc0cB Bqj3Gv3wfwa9QqrS4HHfTtjeDy17j9+8g/X19ZNRLPnPySelzSV+qrIeQJXloEnCgF5TVWiXKGTW 2LZW7+83s7KT7oF6vIdmqpjZaK+8M5ogubf5uKRbtGZh+498Od//+31Ic3wt4sDDr44yIJYDfB0j nv5ZLEQKDk/CElSoRTgewdrWXT

From Andrea Pappacoda@21:1/5 to Aigars Mahinovs on Thu May 1 17:10:02 2025

--e1353cb13c21fc9d4b5feb4247270e42bbf56b1132cb4ff16bc01f7c40d8 Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8; format=Flowed

On Mon Apr 28, 2025 at 9:24 PM CEST, Aigars Mahinovs wrote:

In fact, for adjusting a LLM for use in a particular domain or
a particular company it actually *is* the "binary" that is the
*preferred* form to be modified - you take a model that "knows" a lot
in general and "knows" how your language works and you train the model further by doing specialisation training for your, specific data set.
And a result you get from one "generic" binary another - "specialized" binary.

While this may be the case for extremely big LLMs, I wouldn't call the
trained model the preferred form of modification. You just want to do
fine tuning? Sure, go ahead, but you aren't modifying the model in the
real sense, and cannot study how it was implemented.

Moreover, this proposal is about "AI", and not LLMs in particular. Yes,
"AI" is too broad, but I'd say that deep learning models should be in.
And, speaking of deep learning, I recently had to play a bit with
a computer vision model for object recognition, and fine-tuning wasn't
enough --- I had to re-train the model from scratch. If the model were
not free, I couldn't have completed my project.

That *could* be the technical difference in definitions between what
is "DFSG-free AI" and what is "Debian-main-grade-free AI".

What? Isn't main supposed to represent DFSG-free stuff?

Bye :)

--e1353cb13c21fc9d4b5feb4247270e42bbf56b1132cb4ff16bc01f7c40d8
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iIoEABYKADIWIQS6VuNIvZRFHt7JcAdKkgiiRVB3pwUCaBOLxxQcYW5kcmVhQHBh cHBhY29kYS5pdAAKCRBKkgiiRVB3p4PuAP9ERKCpnlf2SJ94dqng6ci+pHi08lBN xVvKESf8fLukGgD9GlA2pO8Lj8s7Wvm3LLGPs3lPz1XC8bjcQPQ1GH2muAg=iWC8
-----END PGP SIGNATURE-----

--e1353cb13c21fc9d4b5feb4247270e42bbf56b1132cb4ff16bc01f7c40d8--

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon Richter@21:1/5 to Aigars Mahinovs on Fri May 2 10:30:01 2025

Hi,

On 4/28/25 19:03, Aigars Mahinovs wrote:

Processing of experiences into expert opinion is IMHO not directly
comparable with compilation of source to a binary. Regardless if it's
done by a human or a software system.

LLMs do not build a knowledge representation that is separate from the
natural language representation.

The copyright law makes a
distinction here for humans.

The distinction is that the expert opinion is covered by patents, not copyright. We've fought for the non-patentability of algorithms, and I
have very little objection to tools that can flawlessly reproduce an
algorithm in any programming language. However, there is no need for
such a tool, because software libraries already exist.

And while no explicit legal precedent is
yet set for any kinds of AI (including LLMs), the very lack of massive copyright violation lawsuits from very sue-happy corporations, like
Disney, is already a noteworthy precedent.

The LLM operators have gone to great lengths to avoid violating the
copyrights of anyone with a massive legal department (try asking ChatGPT
for pictures of Mickey Mouse). That they are choosing to ignore the
copyrights of anyone without the resources to go after them legally is
evidence of bad faith.

One hand-wavy
legal loophole could be that the learning process splits the copyrighted works into chunks small enough that none of those chunks would legally
retain the copyright protection anymore.

The chunks are small enough to make it difficult to bring a successful
suit, which discourages smaller entities and individual copyright
holders from asserting their rights.

Simon

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From =?utf-8?Q?Pierre-Elliott_B=C3=A9cue@21:1/5 to Gunnar Wolf on Sat May 3 00:30:01 2025

Gunnar Wolf <[email protected]> wrote on 28/04/2025 at 19:10:35+0200:

Russ Allbery dijo [Mon, Apr 28, 2025 at 09:46:41AM -0700]:

I'm also very uncomfortable speaking about AIs similar like I don't
like the term IP=intellectual property...

It would be worthwhile to restrict ourselves to "LLM" for these things
since "AI" is a much broader term and many other technologies (past or
future) may be described as "AI" whilst not being LLMs.

The GR as proposed would apply to a lot of things that are not LLMs, >>though. I think the right terminology for what we're currently talking >>about might be "machine learning model," which encompasses a wider set of >>onstructions from processed training data without limiting them to only >>large-language models.

This is an important point, which I subscribe. Since its inception over 60 years ago, "Artificial Intelligence" is fluffy marketspeak […]

To me, it's actually quite accurate: the intelligence is totally
artificial. :D

--
PEB

--=-=-Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQJDBAEBCgAtFiEE5CQeth7uIW7ehIz87iFbn7jEWwsFAmgVRwgPHHBlYkBkZWJp YW4ub3JnAAoJEO4hW5+4xFsLVa4QAMCbIwJOPVsuyYi0eY3J3y1ewJCtvBPFMOxN IyQOGTUuzOmwPpwrBeuxsUDlNr7oRdR+smYBgxGhLjlN7dj97KIYUvwvK2qhvPhB 9ZDmvDC7yG6oNzO1Bvg+mQuT+i+Li1LiOvVIBL4/Rjqf8nmRGlORet8h87URyNGR VRNWE22fDF0bsI5b5RrS2Z9ETV7gq5Xj0dpHFvYJR81rK3mzUK2Ru69RYPiv+Gmn YwCDZuLAFoevc3ljH1SsscZgVg+kaX2GeOcGw4VBI52FVMtNAI2bafxK71g/SIkY PCOJHj9TdFo+xtrDkQlXIhZc1wUmgpKeHaFBfdIUGaI466x1tuqO6WyGSwDHRBzA yErY8XAD+SSlWjMvJS3qa9U1R4E/zdrfetfu5qdurOgcC7TF4pJNGbPMvJ56Hhux 0SC4l1d9/IlPr/F1GEgbk5KOrJ/SKTrBf13Mk6t91fJCp7fbohQwm/bzOMCCxngv StTRt7pNGFU9Xx5TlyXNYDIzSgBOfItnIO5ti5sJlIysD42orOYfq+MSKZxbjo80 ceR1rKzx91aiVB/GpvTr0m1rE1h0hS1iSJ6UhxCZWP/nkC5OI6Jb6qILao1nUuoV MWbyqazm+Nn2EtSqc5Pn2YSvA1sMVG2fKe2QHSZcLSGyE5cXW6++8NJ7hSGhy2yQ
r726juBW
=XNYG
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Wouter Verhelst@21:1/5 to Aigars Mahinovs on Sun May 4 13:20:01 2025

On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:

However, here we have a clear and fundamental change happening in the
copyright law level - there is a legal break/firewall that is happening
during training. The model *is* a derivative work of the source code of
the training software, but is *not* a derivative work of the training
data.

I would disagree with this statement. How is a model not a derivative
work of the training data? Wikipedia defines it as

In copyright law, a derivative work is an expressive creation that
includes major copyrightable elements of a first, previously created
original work (the underlying work). [1]

Which, as models are often able to regurgitate copyrighted works
(largely) verbatim, is to me a definition that applies to models.

[1] https://en.wikipedia.org/wiki/Derivative_work

This means that we also have to consider what exactly is training
data and how to deal with it, without automatically falling back to
equating it with source code.

We have a very wide definition of "source code" in Debian. To us, source
code is not limited to software written in a common programming
language; instead, our definition considers various things such as SVG
files, libreoffice documents, gimp XCF files, etc, to be source code
too. In this context, I don't think that equating training data to
source code is too wild a thing to do.

--
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Matthias Urlichs@21:1/5 to All on Sun May 4 15:30:01 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------c3G7QF3M42AsxO9gzkWIn6jN
Content-Type: multipart/mixed; boundary="------------HPTfUe6ZfgNDnjtaVtKprwEZ"

--------------HPTfUe6ZfgNDnjtaVtKprwEZ
Content-Type: multipart/alternative;
boundary="------------h9VIL9xJjV7YTKEa9Vtc12HC"

--------------h9VIL9xJjV7YTKEa9Vtc12HC
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMDQuMDUuMjUgMTQ6MjcsIEFpZ2FycyBNYWhpbm92cyB3cm90ZToNCj4gVGhlIHNpbXBs ZSBmYWN0IHRoYXQgbm9uZSBvZiB0aGUgTExNcyBoYXZlIGJlZW4gc3VlZCBvdXQgb2YgDQo+ IGV4aXN0ZW5jZcKgYnkgKmFueSAqY29weXJpZ2h0IG93bmVyIGlzIGRlIGZhY3RvIHByb29m IHRoYXQgaXQgZG9lcyBub3QgDQo+IHdvcmsgdGhhdCB3YXkgaW4gdGhlIGV5ZXMgb2YgdGhl IGp1ZGljaWFswqBzeXN0ZW0uDQoNClRoYXQgbWF5IG9yIG1heSBub3QgYmUgY29ycmVjdCBp biB0aGUgbG9uZyBydW4sIElBTkFMIGFuZCBhbGwgdGhhdC4NCg0KSG93ZXZlci4gQ29weXJp Z2h0IGlzIG9ubHkgb25lIGFzcGVjdCBvZiB3aGV0aGVyIG9yIG5vdCBtb2RlbHMgc2hvdWxk IA0KZW5kIHVwIGluIG1haW4uIFBsYWluIG9sZCByZXByb2R1Y2liaWxpdHkgaXMgaW1wb3J0 YW50IHRvIHVzIHRvby4NCg0KSWYgd2UgY2FuJ3QgaW5jbHVkZSB0aGUgdHJhaW5pbmcgZGF0 YSwgZm9yIG9idmlvdXMgY29weXJpZ2h0IHJlYXNvbnMsIA0KdGhlbiB0aGUgcXVlc3Rpb24g d2hldGhlciB0aGUgcmVzdWx0aW5nIG1vZGVsIGl0c2VsZiBpcyBjb3B5cmlnaHQtY2xlYW4g DQpkb2Vzbid0IG1hdHRlci4NCg0KLS0gDQotLSByZWdhcmRzDQotLSANCi0tIE1hdHRoaWFz IFVybGljaHMNCg0K
--------------h9VIL9xJjV7YTKEa9Vtc12HC
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 04.05.25 14:27, Aigars Mahinovs
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CABpYwDXJ4oMtNyLVynJqB7+upjWzVJuqugRFtBczHbeKwV-mwQ@mail.gmail.com">The
simple fact that none of the LLMs have been sued out of
existence by <b>any </b>copyright owner is de facto proof that
it does not work that way in the eyes of the judicial system.</blockquote>
<p>That may or may not be correct in the long run, IANAL and all
that.</p>
<p>However. Copyright is only one aspect of whether or not models
should end up in main. Plain old reproducibility is important to
us too.</p>
<p>If we can't include the training data, for obvious copyright
reasons, then the question whether the resulting model itself is
copyright-clean doesn't matter.<br>
</p>
<pre class="moz-signature" cols="72">--
-- regards
--
-- Matthias Urlichs</pre>
</body>
</html>

--------------h9VIL9xJjV7YTKEa9Vtc12HC--

--------------HPTfUe6ZfgNDnjtaVtKprwEZ
Content-Type: text/vcard; charset=UTF-8; name="matthias.vcf" Content-Disposition: attachment; filename="matthias.vcf" Content-Transfer-Encoding: base64

QkVHSU46VkNBUkQNClZFUlNJT046NC4wDQpOOlVybGljaHM7TWF0dGhpYXM7OzsNCk5JQ0tO QU1FOlNtdXJmDQpFTUFJTDtQUkVGPTE6bWF0dGhpYXNAdXJsaWNocy5kZQ0KVEVMO1RZUEU9 d29yaztWQUxVRT1URVhUOis0OSA5MTEgNTk4MTggMA0KVVJMO1RZUEU9aG9tZTpodHRwczov L21hdHRoaWFzLnVybGljaHMuZGUNCkVORDpWQ0FSRA0K

--------------HPTfUe6ZfgNDnjtaVtKprwEZ--

--------------c3G7QF3M42AsxO9gzkWIn6jN--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgXarAFAwAAAAAACgkQcs+OXiW0wpOb wBAAjhtPrRpbmjJaSRhPlrZzbBxrzPdz804mS98WRb8D7A3+J6e2RVc7fFxRa9hiohzoQtbDRorZ x6EuaxRSzY0tac7W2tG1gPV58QvWJ0w7ypr6eraDLbu2pp+qFdWktiT2CRQ26hBxe22pD/KHE4YY AZ/crzGFfpFtz4Hz/SnxVe339Cu7RXBsAzZYXs0BFwv2qZqecs7ohmQ/+pOga5l1YyWpWHY7Uob2 qXvvH8Hg3EPTkYU1WT68It4Ozg

From Ansgar =?UTF-8?Q?=F0=9F=99=80?=@21:1/5 to Matthias Urlichs on Sun May 4 15:50:01 2025

Hi,

On Sun, 2025-05-04 at 15:25 +0200, Matthias Urlichs wrote:

On 04.05.25 14:27, Aigars Mahinovs wrote:

The simple fact that none of the LLMs have been sued out of
existence by any copyright owner is de facto proof that it does not
work that way in the eyes of the judicial system.

That may or may not be correct in the long run, IANAL and all that.

However. Copyright is only one aspect of whether or not models should
end up in main. Plain old reproducibility is important to us too.

What is not reproducible (in the reproducible build sense Debian uses)
about, say, the Tesseract OCR models? Compared to say a pre-processed photograph (using non-free in-camera firmware) of a building or
landscape (which can't be shipped in main).

You can change the models against a different one, just as you can
replace a photo with a different one. But without the building readily available, it is hard to change perspective or other changes that would
be possible if the "source" was available.

Ansgar

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Matthias Urlichs@21:1/5 to =?UTF-8?B?QW5zZ2FyIPCfmYA=?= on Sun May 4 16:00:01 2025

To: [email protected]

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------HADiPvQe34EPTBGwA0Y9iyT4
Content-Type: multipart/mixed; boundary="------------ZLxGm80ZAUmog544WagKtRAy"

--------------ZLxGm80ZAUmog544WagKtRAy
Content-Type: multipart/alternative;
boundary="------------Zb5sGCUySQlDhW3O8jbFMVJM"

--------------Zb5sGCUySQlDhW3O8jbFMVJM
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMDQuMDUuMjUgMTU6NDQsIEFuc2dhciDwn5mAIHdyb3RlOg0KPiBXaGF0IGlzIG5vdCBy ZXByb2R1Y2libGUgKGluIHRoZSByZXByb2R1Y2libGUgYnVpbGQgc2Vuc2UgRGViaWFuIHVz ZXMpDQo+IGFib3V0LCBzYXksIHRoZSBUZXNzZXJhY3QgT0NSIG1vZGVscz8NCg0KTXkgcG9p bnQgaXMgdGhhdCByZXByb2R1Y2luZyBhIG1vZGVsIHJlcXVpcmVzIGlucHV0IGRhdGEsIHdo aWNoIHJlcXVpcmVzIA0KdXMgdG8gZGlzdHJpYnV0ZSBzYWlkIGRhdGEsIHdoaWNoIHJlcXVp cmVzIHRoZW0gdG8gYmUgb2Ygc3VpdGFibGUgY29weXJpZ2h0Lg0KDQpMYXN0IEkgaGVhcmQs IHRoaXMgd2FzIG5vdCB0aGUgY2FzZSBmb3IgVGVzc2VyYWN0LiBJZiB0aGF0J3MgaW5jb3Jy ZWN0LCANCnNvIG11Y2ggdGhlIGJldHRlcjsgaW4gYW55IGNhc2UsIHRoaXMgR1IgaXNuJ3Qg YWJvdXQgVGVzc2VyYWN0IHNwZWNpZmljYWxseS4NCg0KLS0gDQotLSByZWdhcmRzDQotLSAN Ci0tIE1hdHRoaWFzIFVybGljaHMNCg0K
--------------Zb5sGCUySQlDhW3O8jbFMVJM
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 04.05.25 15:44, Ansgar 🙀 wrote:<br>
</div>
<blockquote type="cite" cite="mid:[email protected]">
<pre class="moz-quote-pre" wrap="">What is not reproducible (in the reproducible build sense Debian uses)
about, say, the Tesseract OCR models?</pre>
</blockquote>
<p>My point is that reproducing a model requires input data, which
requires us to distribute said data, which requires them to be of
suitable copyright.</p>
<p>Last I heard, this was not the case for Tesseract. If that's
incorrect, so much the better; in any case, this GR isn't about
Tesseract specifically.<br>
</p>
<pre class="moz-signature" cols="72">--
-- regards
--
-- Matthias Urlichs</pre>
</body>
</html>

--------------Zb5sGCUySQlDhW3O8jbFMVJM--

--------------ZLxGm80ZAUmog544WagKtRAy
Content-Type: text/vcard; charset=UTF-8; name="matthias.vcf" Content-Disposition: attachment; filename="matthias.vcf" Content-Transfer-Encoding: base64

QkVHSU46VkNBUkQNClZFUlNJT046NC4wDQpOOlVybGljaHM7TWF0dGhpYXM7OzsNCk5JQ0tO QU1FOlNtdXJmDQpFTUFJTDtQUkVGPTE6bWF0dGhpYXNAdXJsaWNocy5kZQ0KVEVMO1RZUEU9 d29yaztWQUxVRT1URVhUOis0OSA5MTEgNTk4MTggMA0KVVJMO1RZUEU9aG9tZTpodHRwczov L21hdHRoaWFzLnVybGljaHMuZGUNCkVORDpWQ0FSRA0K

--------------ZLxGm80ZAUmog544WagKtRAy--

--------------HADiPvQe34EPTBGwA0Y9iyT4--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgXcYoFAwAAAAAACgkQcs+OXiW0wpP+ 0BAAupquDnEIbearL3NUMeRb/rZu43zB2GxYQ/gtsg8FPjWdrYuzH/yevM8KkIM4BpM2Md58ovhm zhS89u+D60Q24XVdC5A72JuTLallgN4BEXL2pVuYtBn8mlqkxWd+BDYumLZt7BssJ/fV1VVI9IFg roDyCENnwxnqb8Cu9Et0WiLq233xbdouTkF5VS4JtfPy6RKXJbQpDCJ/r2e8U02kSKmIF/64vB9d TKXxD2Q+ZSHEi09Rk8lJ940JsA

From Ansgar =?UTF-8?Q?=F0=9F=99=80?=@21:1/5 to Matthias Urlichs on Sun May 4 17:00:01 2025

Hi,

On Sun, 2025-05-04 at 15:54 +0200, Matthias Urlichs wrote:

On 04.05.25 15:44, Ansgar 🙀 wrote:

What is not reproducible (in the reproducible build sense Debian uses) about, say, the Tesseract OCR models?

My point is that reproducing a model requires input data, which requires us to distribute said data, which requires them to be of suitable copyright.

Ah, you mean in the sense of a from-scratch rebuilding of statistical
data including the possibility to do different analysis?

Debian doesn't require all data for a from-scratch reimplementation for packages to be available though. It would also run in many problems as
relevant documents (RFCs, ISO standards, design documents,
publications, ...) or cloned originals (UNIX or Windows APIs, games,
...) are often non-free.

This has so far also been the case for statistical data in Debian, such
as simple aggregates such as the number of packages in Debian, which
might be included in Debian without also including the entire Debian
archive as source, data about word or character frequencies in natural
language texts, and so on. I guess proponents of the original GR would
also find this problematic?

Ansgar

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Allombert@21:1/5 to All on Sun May 4 18:10:01 2025

Le Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou a �crit :

===============================================================================
Proposal A: "AI models released under open source license without original
training data or program" are not seen as DFSG-compliant. ===============================================================================

The "AI models released under open source license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

Could we avoid using the term 'Artificial intelligence' in the text of
the proposal (not in the appendix)? This term dates for 1970 and has had different meaning for eachdecades since then. In ten years it is likely
that, while the question this GR addresses will still be relevant, the
term 'Artificial intelligence' will refer to something quite different.

Wikipedia includes this citation:
"" However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being
called AI because once something becomes useful enough and common enough
it's not labeled AI anymore."[2][3] "" <https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=1286364868>

Cheers,
--
Bill. <[email protected]>

Imagine a large red swirl here.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefano Zacchiroli@21:1/5 to All on Sun May 4 19:00:01 2025

On Sun, May 04, 2025 at 04:49:49PM +0200, Ansgar 🙀 wrote:

This has so far also been the case for statistical data in Debian, such
as simple aggregates such as the number of packages in Debian, which
might be included in Debian without also including the entire Debian
archive as source, data about word or character frequencies in natural language texts, and so on. I guess proponents of the original GR would
also find this problematic?

There might be even simpler examples, such as PDFs or raster images
released under DFSG-free licenses (e.g., CC-BY-SA), which are freely
editable with Inkscape, Gimp, or the like, but were initially generated
by other means (like using Adobe Illustrator or Photoshop, which is unfortunately quite likely for professionally authored content these
days). I'm not sure nowadays we consistently require, check, and
enforce, that the "preferred form of modification" for these works are
present in the archive.

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmgXnJwACgkQfH5Cj5NB J5nWURAAh0lMVn+QkiNXRuqSjs6j9JRMZ/mvVF4AGKSB8DrXUKH/HxV5mW9SKZ2I SKa9uq8keiFwid0p4Mv04eNJuus6RknycPv1aGxUB5loWt+7tjEVkW5GPVbCHKQO K/S5n2i9GAqP4psJ2z7qMw9Hx68aTuew93v+ywnZEtKl7FQDWXCWOQpjfvqM7jZ8 ilmAmGPaN/k1dysESoFHd/WdE+y54U7UvhMJMFiRLV1Moezb+kMtwB7xYWP8LuAg gg44CfwWRW4khVl/bov9HoOsk7RjJ7CrhyZ7GhXsrsfepTOT/3E1fxYESopMoj+5 DJcFBxLSlokJQSD4vAUzaBoN4DBWNY9Q6NUmlHi+qX3x1NsyqgNE418DhmQkrZiA Kb6T0yU50PxMDtyc0PmUA+DcULVROr/Cl1XRbBRjo9M2qkurTsXc+2V8vjeLYx5z n2M0XxR9I8w4956fsba48+wPqUnW0MD8umvHAIHA5zelGMSJK4r69kOV1pMsygQm ztM05nFBLo1P3S6fvsWCYh

From Wouter Verhelst@21:1/5 to Aigars Mahinovs on Sun May 4 18:20:02 2025

Hi Aigars,

On Sun, May 04, 2025 at 02:27:46PM +0200, Aigars Mahinovs wrote:

On Sun, 4 May 2025 at 13:12, Wouter Verhelst <[1][email protected]>
wrote:

On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
> However, here we have a clear and fundamental change happening
in the
> copyright law level - there is a legal break/firewall that is
happening
> during training. The model *is* a derivative work of the source
code of
> the training software, but is *not* a derivative work of the
training
> data.
I would disagree with this statement. How is a model not a
derivative
work of the training data? Wikipedia defines it as

The simple fact that none of the LLMs have been sued out of
existence by any copyright owner is de facto proof that it does not
work that way in the eyes of the judicial system.

This statement is inaccurate, incorrect, and irrelevant.

It is inaccurate, because the legal system does not work that way: the
legality of an action is not defined by the presence or absense of a
lawsuit pertaining to that action. If it were, then any cold case in the history of mankind must therefore by definition have been legal. More
to the point, in this particular case the lack of lawsuits could be
explained by a variety of factors, including but not limited to the indifference of the grieved party; the inability to finance a lawsuit
against "big tech" companies such as microsoft or facebook; or the
believe on the side of the grieved party that they may not have a case
in the first place, even when they might have won would they have filed
suit.

It is incorrect, because the New York Times did in fact file suit
against Microsoft, OpenAI, and other parties related to copyright
infringement of their large library of news articles in creating
ChatGPT[1]. The case is still in court.

It is irrelevant, because in a Debian context, the law is relevant only
to the point that we must obey it in relevant jurisdictions[2]. It does not have any say over how we define our own rules and ethics. If we decide
as Debian that we believe the training data is in fact part of the
source of a model, then we can in fact set such a rule. We do not just
follow the law in deciding what to distribute and how to do it; if this
were the case, then there would never have been any need for a non-US, non-free, or non-free-firmware section of our archive, and the DFSG
would have been just this little thing, you know.

[1] https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/
[2] where I define "relevant" as "any jurisdiction where not obeying the
law could result in significant problems for Debian", which in
practice probably means the US and most of Europe.

Wikipedia definition is a layman's simplification.

It may be a simplification, but that in and of itself does not make it incorrect.

I do think that a model is in fact a derivative work of the training
data, because of the fact that you use the training data to build the
model, and that without that training data, the model would be different
and it would not act the same.

Is that a legal definition? No. Is it a basis on which we could define
our own rules and ethics? Sure is.

Thanks,

--
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gunnar Wolf@21:1/5 to All on Sun May 4 19:10:02 2025

Bill Allombert dijo [Sun, May 04, 2025 at 04:01:37PM +0000]:

Le Sat, Apr 19, 2025 at 01:56:17PM -0400, M. Zhou a écrit :

===============================================================================
Proposal A: "AI models released under open source license without original >> training data or program" are not seen as DFSG-compliant.
===============================================================================

The "AI models released under open source license without original training >> data or program", a particular type of files as explained above, are not seen
as DFSG-compliant. Hence, they can not be included in the "main" section of the
Debian archive. This proposal does not specify whether the "non-free" section
of Debian archive can include those files.

Could we avoid using the term 'Artificial intelligence' in the text of
the proposal (not in the appendix)? This term dates for 1970 and has had >different meaning for eachdecades since then. In ten years it is likely
that, while the question this GR addresses will still be relevant, the
term 'Artificial intelligence' will refer to something quite different.

Nitpicking... the term was coined around 1955. By 1966¹, when reports from various government agencies admit that automated translation is further
away than originally envisioned ("just a few years", estimated in 1954,
after the Georgetown-IBM experiment²³), funds were cut and it is
acknowledged the "first IA winter" began.

¹ https://web.archive.org/web/20110409070141/http://www.mt-archive.info/ALPAC-1966.pdf
² https://open.unive.it/hitrade/books/HutchinsFirst.pdf
³ https://en.wikipedia.org/wiki/Georgetown%E2%80%93IBM_experiment

Wikipedia includes this citation:
"" However, many AI applications are not perceived as AI: "A lot of cutting >edge AI has filtered into general applications, often without being
called AI because once something becomes useful enough and common enough
it's not labeled AI anymore."[2][3] "" ><https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=1286364868>

I agree with you. However, it is a term firmly set in the mind of too many people. Keep in mind Mo Zhou's proposal is in a large way an answer to
OSI's OSAID⁴, which many among us feel to be a gross mistake.

⁴ https://opensource.org/blog/the-open-source-initiative-announces-the-release-of-the-industrys-first-open-source-ai-definition

Too many people (both "in the trade" and not) recognize the term AI. I have several times called for "resignifying" it as "Apparent Intelligence",
because its outputs are good enough to fool some people into thinking there
is intelligence where there is none. But, given the scope where this
statement (if the GR passes) will apply, I think we should stick with the stupid AI moniker.

– Gunnar.

-----BEGIN PGP SIGNATURE-----

wr0EABYKAG8FgmgXns8JEOL2O0NT9FmJRxQAAAAAAB4AIHNhbHRAbm90YXRpb25z LnNlcXVvaWEtcGdwLm9yZwCUjS/1E/S4xKaQryzHqwh4ztBZizUGWzQwpi7YjHTp FiEEYLMJPZYQjly5cULv4vY7Q1P0WYkAAIv4AQCVuRvRfqA9SjDaPCOmAa9RGsN3 K9qmcfum6MXercobPAD+J9N+hEooLWYa4llBJ3mSgBC5NeSAopozCGdLwSiEpgo=
=21z6
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon McVittie@21:1/5 to Wouter Verhelst on Sun May 4 19:10:02 2025

On Sun, 04 May 2025 at 17:30:21 +0200, Wouter Verhelst wrote:

We do not just
follow the law in deciding what to distribute and how to do it

I think it's important to distinguish between what we can't allow even
if we wanted to, because the law or other external factors don't allow it
(such as the need to avoid copyright infringement), and what we don't
allow because our own self-imposed rules don't allow it (such as the distinction between main and contrib).

The things we can't allow because of external factors are a constraint
that we mostly share with other OS distributions, although there can be differences if one distribution is more risk-averse than another (for
example I think Fedora is more cautious about patent risks than we are).

The things we don't allow as a self-imposed rule are entirely our
choice: for instance Fedora has chosen to allow non-Free firmware and
non-code objects (like documentation) in their equivalent of main, but
we have chosen not to allow that. These are a trade-off. If we are
overly strict with our self-imposed rules, we risk excluding important functionality from our distribution. Conversely, if we are overly
permissive with our self-imposed rules, our users don't consistently get
all of the benefits of FOSS, because they can't assume that all of our
packages actually have all of those benefits.

if this
were the case, then there would never have been any need for a non-US, >non-free, or non-free-firmware section of our archive

For non-free and non-free-firmware (and contrib), yes you're correct:
not allowing those packages to be included in main is a self-imposed
rule, which we have chosen to follow.

But as far as I'm aware, non-US *was* about legal restrictions imposed
on us from the outside (specifically the USA's export rules at the time,
as applied to cryptography), and it isn't a distinction that anyone in
Debian would have chosen to make if external factors didn't force it on
us.

smcv

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Wouter Verhelst@21:1/5 to Aigars Mahinovs on Mon May 5 01:10:01 2025

On Sun, May 04, 2025 at 07:08:00PM +0200, Aigars Mahinovs wrote:

On Sun, 4 May 2025 at 17:30, Wouter Verhelst <[1][email protected]> wrote:

> Wikipedia definition is a layman's simplification.
It may be a simplification, but that in and of itself does not make
it
incorrect.

I have specifically addressed this point with examples in my reply.
Copyright very clearly does not survive learning and then generation of
new solutions. In humans that is a given.

Indeed.

For software I would assume the equivalence, unless proven
differently.

This is not a fact; this is your opinion. You base the rest of your
argument on it, so I'll call it an axiom: something to accept in order
for the rest of the argument to hold.

The problem is, I disagree with your axiom.

To me, software and humans are two very different things. We know how
computers work; we can therefore reason what the output of a software
program is going to be based on the input that you give it. Whether that program is a compiler or a trainer program for a deep learning model is
just a detail in that context. One computer chip of a given model and
stepping is 100% equivalent to another, and so any process that runs on
one of these chips will produce the same output on another.

The same is not true for human brains; we do not fully understand how
they work, we cannot predict what the resulting experience of a given
person is going to render based on the training that person has
received, and therefore we cannot predict how a given person is going to
write a particular piece of software. Different brains will result in
different programming styles given the same training. In fact, I may
write a solution to the same problem differently on two different days.
An LLM will not do that; when given the same inputs, it will produce the
same output (as long as we consider "the internal state of its
randomizer" as part of its inputs).

Given that I don't agree with your initial axiom, and given that the
rest of your argument is based on that axiom, I'm not surprised that I
didn't agree with your full argument. That is also why I did not see the
need to reply to that part of your argumentation.

--
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Sam Hartman@21:1/5 to All on Mon May 5 21:30:01 2025

"M" == M Zhou <[email protected]> writes: ===================================

M> Proposal A: "AI models released under open source license without
M> original training data or program" are not seen as
M> DFSG-compliant.
M> ===============================================================================

M> The "AI models released under open source license without
M> original training data or program", a particular type of files as

I find the use of Open Source License in a Debian context problematic.
The DFSG is not the OSD, and we should care whether a license is DFSG
free not OSI approved.

I hope that you would be willing to accept an amendment to replace all
uses of open source in your proposal.

-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQSj2jRwbAdKzGY/4uAsbEw8qDeGdAUCaBkP4wAKCRAsbEw8qDeG dBxMAP0QiSTldlJd9gpDmDnrjgNYPYocoNC8LvJ0p9NJiE1G3wD/TZLH43WtfBbK knSTrBfsyPP2kSVdPHUOHWVuV0wKFQ0=
=ipZX
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From M. Zhou@21:1/5 to Sam Hartman on Mon May 5 21:50:01 2025

The issue is also discussed here:
https://lwn.net/Articles/1019028/

A better wording goes:
s/open source license/DFSG-compatible license/g

It is fixed in the git repo: https://salsa.debian.org/lumin/gr-ai-dfsg/-/commit/9496f9fb6405db5a99fff1672cd4bad66c925c24

The proposal after amendament:

=============================================================================== Proposal A: "AI models released under DFSG-compatible license license without
original training data or program" are not seen as DFSG-compliant. ===============================================================================

The "AI models released under DFSG-compatible license license without original training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

On Mon, 2025-05-05 at 13:22 -0600, Sam Hartman wrote:

"M" == M Zhou <[email protected]> writes:

===================================
    M> Proposal A: "AI models released under open source license without     M> original training data or program" are not seen as
    M> DFSG-compliant.
    M> ===============================================================================

    M> The "AI models released under open source license without
    M> original training data or program", a particular type of files as

I find the use of Open Source License in a Debian context problematic.
The DFSG is not the OSD, and we should care whether a license is DFSG
free not OSI approved.

I hope that you would be willing to accept an amendment to replace all
uses of open source in your proposal.

-----BEGIN PGP SIGNATURE-----

iQJFBAABCgAvFiEEY4vHXsHlxYkGfjXeYmRes19oaooFAmgZFFoRHGx1bWluQGRl Ymlhbi5vcmcACgkQYmRes19oaoplOg//YJwwyedj8ZyLHPvdLEQQxcriHOHBb4ON eS9Bn4vaoPbejecWOppNEjf+mERKbWVmVpx4H6/gsXK29ylDYwUhhrQM4qsAH4A2 QOL7qyodebcyBtYsnNby96ygIXIShyXqe8rcud+c5qaY5zkhyZ08y88N4LZWGlyn EQK00BG6cfpaDHuaZNmc0AMdA0VvhziT8eDqDjgjw82Jaoj/JWaUHDhWxvYxfc6k uBO/RN5wP4eMWJQ1DJly+yNaI3h5syNaANm4e8UA2ujMpiFTeAD+f9VwPQ2ajiAs iEMDfQ7gnr7cyUyGgatDfdzPOthfYsIuoqdgMpyIOxCEesLTPErUgHwOOktfNTf0 H8qHHJSWiRJHayDr/RiwpL0XjlegjZOvqJWLffLBsqc23joMIy61YXLr2rqd14d1 pLy5E2DXci6epreMDRoAdKlCUOczXdcDU6j9Q8SXmeuTkdned4s8wyGoTZsdiTqG k5+0cbV93hVHH1e94ntqjAuSkKKtsn8qs6VfjrcwAXweUHF+b4Q6OE1inx4dZNDb uQi3G9Hbeefybp0sxbh5XNM/WqssQgoZ/eQMmmhxDcytRZBr4oRjzA4aABFnMUZL W2xxuwiQ0Wtph85PNWAhmqJV/zZtGlDYbTua/7sB5SQzqATRbgsyrLB00+R6VQiE
s1oH34fv5cM=
=uO5v
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gunnar Wolf@21:1/5 to All on Mon May 5 22:00:02 2025

M. Zhou dijo [Mon, May 05, 2025 at 03:41:14PM -0400]:

The issue is also discussed here:
https://lwn.net/Articles/1019028/

A better wording goes:
s/open source license/DFSG-compatible license/g

It is fixed in the git repo: >https://salsa.debian.org/lumin/gr-ai-dfsg/-/commit/9496f9fb6405db5a99fff1672cd4bad66c925c24

The proposal after amendament:

===============================================================================
Proposal A: "AI models released under DFSG-compatible license license without
original training data or program" are not seen as DFSG-compliant.
===============================================================================

The "AI models released under DFSG-compatible license license without original >training data or program", a particular type of files as explained above, are >not seen as DFSG-compliant. Hence, they can not be included in the "main" >section of the Debian archive. This proposal does not specify whether the >"non-free" section of Debian archive can include those files.

Thank you very much. In case it is in doubt, I sponsored the original
wording, and I confirm I'm sponsoring the amended version.

-----BEGIN PGP SIGNATURE-----

wr0EABYKAG8FgmgZFooJEOL2O0NT9FmJRxQAAAAAAB4AIHNhbHRAbm90YXRpb25z LnNlcXVvaWEtcGdwLm9yZ0dsgZP52gm/e23FXLHX94EQIu8LxSyjJxMgrA+fY2V2 FiEEYLMJPZYQjly5cULv4vY7Q1P0WYkAAGyMAQDJQJr1wlQKYf0hhRXsiIcErZdp BDwcrtEVHWChEGAAVQEApxx1myi/LCYK7Hxc6Tqa7t/fQKDI00bolhsNj1RrNQY=
=7O2+
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Sam Hartman@21:1/5 to All on Mon May 5 22:20:02 2025

"Gunnar" == Gunnar Wolf <[email protected]> writes:

Gunnar> M. Zhou dijo [Mon, May 05, 2025 at 03:41:14PM -0400]:
>> The issue is also discussed here:
>> https://lwn.net/Articles/1019028/
>>
>> A better wording goes: s/open source license/DFSG-compatible
>> license/g
>>
>> It is fixed in the git repo:
>> https://salsa.debian.org/lumin/gr-ai-dfsg/-/commit/9496f9fb6405db5a99fff1672cd4bad66c925c24
>>
>> The proposal after amendament:
>>
>> ===============================================================================
>> Proposal A: "AI models released under DFSG-compatible license
>> license without original training data or program" are not seen
>> as DFSG-compliant.
>> ===============================================================================
>>
>> The "AI models released under DFSG-compatible license license
>> without original training data or program", a particular type of
>> files as explained above, are not seen as DFSG-compliant. Hence,
>> they can not be included in the "main" section of the Debian
>> archive. This proposal does not specify whether the "non-free"
>> section of Debian archive can include those files.

The amended wording addresses the issue I raised; thanks.

--=-=-Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQSj2jRwbAdKzGY/4uAsbEw8qDeGdAUCaBkcVwAKCRAsbEw8qDeG dLWtAQC6yeJGH1LiwObkZ772/P4BV7+OTmsQ7E+XtqbhCawAqwEA5bZnHenBhhx9 HwQ12A8hdzL29a5+auerNyZn9A8u3gI=hBSL
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Allombert@21:1/5 to All on Tue May 6 00:00:02 2025

Wikipedia includes this citation:
"" However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being
called AI because once something becomes useful enough and common enough it's not labeled AI anymore."[2][3] "" <https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&oldid=1286364868>

I agree with you. However, it is a term firmly set in the mind of too many people.

But for how long ?

Keep in mind Mo Zhou's proposal is in a large way an answer to
OSI's OSAID⁴, which many among us feel to be a gross mistake.

⁴ https://opensource.org/blog/the-open-source-initiative-announces-the-release-of-the-industrys-first-open-source-ai-definition

Too many people (both "in the trade" and not) recognize the term AI.

But in 5 years, they will associate AI to something quite different from today LLM,
but Debian will be stuck with this term in its policy documents, causing confusion.

Cheers,
--
Bill. <[email protected]>

Imagine a large red swirl here.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andrea Pappacoda@21:1/5 to All on Tue May 6 00:10:01 2025

Il 5 maggio 2025 21:50:45 CEST, Gunnar Wolf <[email protected]> ha scritto: >Thank you very much. In case it is in doubt, I sponsored the original >wording, and I confirm I'm sponsoring the amended version.

Same for me. Thanks all!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Allombert@21:1/5 to All on Tue May 6 00:20:01 2025

Le Mon, May 05, 2025 at 11:44:30PM +0200, Aigars Mahinovs a �crit :

On Sun, 4 May 2025 at 17:30, Wouter Verhelst <[email protected]> wrote:

It is incorrect, because the New York Times did in fact file suit
against Microsoft, OpenAI, and other parties related to copyright infringement of their large library of news articles in creating ChatGPT[1]. The case is still in court.

[1] https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/

Thanks for this link, it has been a very interesting read.

Another one:

https://arstechnica.com/information-technology/2023/07/book-authors-sue-openai-and-meta-over-text-used-to-train-ai/
https://arstechnica.com/tech-policy/2025/02/meta-torrented-over-81-7tb-of-pirated-books-to-train-ai-authors-say/

Cheers,
Bill.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From M. Zhou@21:1/5 to All on Tue May 6 04:00:02 2025

On Tue, 2025-04-22 at 22:59 +0200, Ansgar 🙀 wrote:

b) Removal of image recognition software (like opencv[2])

Not likely. I'm the uploader of src:opencv.
This is a pretty large library that contains lots of functionalities
that does not require a "model" to function. For opencv, it is at
most adding one more file for the +dfsg file exclusion, or spliting
the model to maybe a non-free package and set Recommends to pull
the package.

The thing is similar for src:nltk (a natural language processing toolkit).
It's models are not packaged, and I marked the bug requesting the model
package as wontfix. This is also a large library where lots of useful
functions do not need a model to run.

There are lots of software upstreams who do not release
model inside the source code tarball. Instead, downloading
is triggered when the user calls the API that relies on the
particular model, and calling the model is usually not the
sole functionality of the software.

As explained by the "Scope" part of the original proposal email,
traditional software written in C++/Python like OpenCV is not involved.
What involved is just the "model" file itself.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Clint Adams@21:1/5 to M. Zhou on Tue May 6 16:10:01 2025

On Mon, May 05, 2025 at 09:59:12PM -0400, M. Zhou wrote:

Not likely. I'm the uploader of src:opencv.
This is a pretty large library that contains lots of functionalities
that does not require a "model" to function. For opencv, it is at
most adding one more file for the +dfsg file exclusion, or spliting
the model to maybe a non-free package and set Recommends to pull
the package.

Is it possible that none of the packages that depend upon or
transitively build-depend upon opencv-data, including
libopencv-dev, don't actually depend upon any of the contents
of the opencv-data package? If not, it seems to me that the
impact is much greater than what is implied above.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bill Allombert@21:1/5 to All on Tue May 6 22:20:01 2025

Le Tue, May 06, 2025 at 01:26:19AM +0200, Aigars Mahinovs a �crit :

This one is much simpler. Maybe because the lawyers being used are not too good.

https://www.courtlistener.com/docket/67538258/tremblay-v-openai-inc/

Authors claim a lot of stuff, basically a generic shotgun of copyright claims, but all secondary claims get dismissed by the court at pre-trial stage due to bad legal reasoning and failing to detail or prove any actual wrongdoing. And specifically a claim that all outputs from a LLM are
derived works of all inputs is dismissed based on already decided case law.

Only the claim of direct copyright infringement of using a text of a book
in the training process of a model still stands to avait the actual trial. And there OpenAI is citing a lot of good reasons why that does not
constitute distribution at all and why the result of the work is transformative and thus is protected by fair use. Just the fact of
accessing some data at some point does not create copyright infringement.
The whole lawsuit is very sloppy IMHO, IANAL.

If you want to know how the case is going, look at the second link I
provided (this is the same case).

Cheers,
Bill.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon Josefsson@21:1/5 to Stefano Zacchiroli on Wed May 7 14:20:02 2025

Stefano Zacchiroli <[email protected]> writes:

AFAIK this legal theory has not been tested in court yet. But the big commercial players (who, remember, have vetted interests in being
copyright absolutists) believe in it so much, that they go as far as
offering legal indemnity promises to users of their LLMs who encounter
legal issues due to the use of generated output.

I think that this analogy is generally not a good one: commercial
players do not have vetted interest in being copyright absolutists, on
the contrary: commercial players do what's in the best interest of their shareholders, normally profits. If there is low commercial risk in
doing something, and some commercial gains, that's what should be done,
modulo bugs in implementation and changing external factors.

This motivation better explains many actions related to software
licensing better than any legal theory.

Another example is the practice to drop copyright years from copyright
notices. Some commercial players do this because they save developer
time, and believe that the likelyhood that a copyright claim will have commercial effects depending on the presence or lack of the copyright
year is low. This is more true if you are a large commercial player:
then you can use other arguments to win your case, even if the other
side would have a valid argument in a missing copyright year.

The copyright absolutist approach would be to look at laws and prior
cases and recommend what is the safest and most conservative approach.
As far as I know, that is still to do increment copyright years.

Comparing the situation to LLM outputs here, commercial players have a
vetted interest in being able to offer LLM products, and that depends on
good models (trained on good inputs) being legally distributable. So
they make the legal idemnity promise to be able to end up in court with copyright holders and take the fight there, and assumes that they will
reach a commercially beneficial deal in the end. I don't think they
reason that they necessarily have the legal right to do what they do,
that's not particulary relevant for making commercial decisions.

/Simon

-----BEGIN PGP SIGNATURE-----

iQNoBAEWCAMQFiEEo8ychwudMQq61M8vUXIrCP5HRaIFAmgbTIMUHHNpbW9uQGpv c2Vmc3Nvbi5vcmfCHCYAmDMEXJLOtBYJKwYBBAHaRw8BAQdACIcrZIvhrxDBkK9f V+QlTmXxo2naObDuGtw58YaxlOu0JVNpbW9uIEpvc2Vmc3NvbiA8c2ltb25Aam9z ZWZzc29uLm9yZz6IlgQTFggAPgIbAwULCQgHAgYVCAkKCwIEFgIDAQIeAQIXgBYh BLHSvRN1vst4TPT4xNc89jjFPAa+BQJn0XQkBQkNZGbwAAoJENc89jjFPAa+BtIA /iR73CfBurG9y8pASh3cbGOMHpDZfMAtosu6jbpO69GHAP4p7l57d+iVty2VQMsx +3TCSAvZkpr4P/FuTzZ8JZe8BrgzBFySz4EWCSsGAQQB2kcPAQEHQOxTCIOaeXAx I2hIX4HK9bQTpNVei708oNr1Klm8qCGKiPUEGBYIACYCGwIWIQSx0r0Tdb7LeEz0 +MTXPPY4xTwGvgUCZ9F0SgUJDWRmSQCBdiAEGRYIAB0WIQSjzJyHC50xCrrUzy9R cisI/kdFogUCXJLPgQAKCRBRcisI/kdFoqdMAQCgH45aseZgIrwKOvUOA9QfsmeE 8GZHYNuFHmM9FEQS6AD6A4x5aYvoY6lo98pgtw2HPDhmcCXFItjXCrV4A0GmJA4J ENc89jjFPAa+wUUBAO64fbZek6FPlRK0DrlWsrjCXuLi6PUxyzCAY6lG2nhUAQC6 qobB9mkZlZ0qihy1x4JRtflqFcqqT9n7iUZkCDIiDbg4BFySz2oSCisGAQQBl1UB BQEBB0AxlRumDW6nZY7A+VCfek9VpEx6PJmdJyYPt3lNHMd6HAMBCAeIfgQYFggA JgIbDBYhBLHSvRN1vst4TPT4xNc89jjFPAa+BQJn0XTSBQkNZGboAAoJENc89jjF PAa+0M0BAPPRq73kLnHYNDMniVBOzUdi2XeF32idjEWWfjvyIJUOAP4wZ+ALxIeh is3Uw2BzGZE6ttXQ2Q+DeCJO3TPpIqaXDAAKCRBRcisI/kdFopvxAP98H9GLr5qi bbVqVlu/IxYBdHFceQ4OYulrSAx5Uy9B9AEAqMfhRVarqvHDqRqdPvoqyWFXSi8v ahZAWHGTKCXZqA4=
=3UaF
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tiago Bortoletto Vaz@21:1/5 to All on Thu May 8 04:10:01 2025

Hi,

[...]

===============================================================================
Proposal A: "AI models released under DFSG-compatible license license without
original training data or program" are not seen as DFSG-compliant.
===============================================================================

The "AI models released under DFSG-compatible license license without original
training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

Could you just fix a double 'license' here in 'DFSG-compatible license license'?

Thanks!

--
Tiago Bortoletto Vaz

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEOAYLMZqeqHbTW+jfgVxjVIQAXyEFAmgcEgcACgkQgVxjVIQA XyF0jg//ZhTVv3m76LYZdayXatr68oeNySFwehtiW5FnLmpTdZ5Uz3F3a6aaiQZW wPEHBb6VR3p7vQq92VfFY3XXE37zDGz+9QIJMA4/Ci0+NMk75o+Yj4crJeW+DWWe VaPyWjtex1q/tpgc/ugqAsCN+slQuZWYUrkPPFI8Otcl3n+HbNwMFdoU8Nx28kgh 8Mvlrr2b5dYairIOjf6rw3mIKfdEs693gUcMEjCd5zZqPiGlfGdQOeENkmZxEJuk JEBNjYpHI2Pdv7WhXjhkLvuvcVh8O/SnHT9KVUgmLRq3+EqfEVpk38PGIM2EelX6 GTa0rrJyJ8aaFtbxBlc58TmNsRs9GufiBFZpazkl4W22IVfyStzSUkrcTUpzrbOO Opt6XSnSbarQEheagEtA/YvHEgewKeP6N1A1DeyeMw5/oszCaasPkzRc+exdGYTV huKV/pFZNUmxCbteb+PpkEKe7wBCpCs4sPTF/x8KvmyC1LQPMQZcORWTzlzzaoik OfsC4yvHts6FDKmO160KGSeVEtRfRhSuQjiVA1ozme8/CmnmRo36tYK7DcIj21ib 916EvjHcF4zkzqcLe90EsC21ZACov6QVErWB7Ltrc0cPitYFcibCJ0ySjlRevqEa e5XkkGMp21c/iJwa0Yqrbbsg3VCoVxZ/D/OEEGP7zftryY8MPSA=
=gLu3
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Wouter Verhelst@21:1/5 to Aigars Mahinovs on Thu May 8 12:50:02 2025

On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:

The transformative criteria here is that the resulting work needs to be
transformed in such a way that it adds value. And generating new texts
from a LLM is pretty clearly a value-adding transformation compared to
the original articles. Even more so than the already ruled-on Google
Books case.

OK, let me change it around a bit, because I don't think this discussion
is going in any direction that is relevant for Debian.

The only way in which you can build a model is by taking loads and loads
of data, running some piece of software over it, and storing the result somewhere.

How can we do this legally, reproducibly, and openly if we do not have
the rights to redistribute the said "loads and loads of data"?

The answer is, we can't.

Therefore, I conclude that, practically, we cannot include models in
Debian if we want them to be reproducible.

I think we have a goal, as a project, to make Debian reproducible. I
think the reproducibility of our software -- *all* our software, not
just the programs and libraries but also the data -- is an important
goal with important repercussions.

Dropping that goal would be required if we were to accept models in
main.

We also declared, over 20 years ago, that "Debian will remain 100%
free". Not just the programs and libraries, but *everything* -- also the
data.

Ergo, either we need to drop our dual goals of becoming reproducible and remaining Free, or models without training data can't go into main.

This to me is one practical reason as to why training data is part of
the source code of a model: because without it, we can't build the model
and we can't build it reproducibly.

The fact that the model does something vaguely and remotely similar to a biological process of training and learning in humans, and that
therefore some people have taken to naming the process of running
advanced statistical analysis over data to build such a model also
"training" is a red herring. The two processes are very different and
cannot be compared as a practical matter.

I have noticed that you have gone ahead and proposed an alternative
ballot option to accept this misguided idea that the training data is
not source. I can't tell you how much I disagree with that option. The
OSI has already taken steps to legitimize this blatantly obviously wrong
idea; if Debian were to follow down those tracks (and this would
definitely do that, IMO), I have to seriously reconsider whether I want
to still be a part of this.

Thanks.

--
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

-----BEGIN PGP SIGNATURE-----

iJUEABMJAB0WIQQZhIYJILYM7Y0TCTdH038p5i64/wUCaByK4gAKCRBH038p5i64 /6rMAYCop7giLOpBoGaKH/BL2sDTg97sv+pki9pd6P8LlnXiITe2ap1viO72aozs jIxI4FEBgMyxi5A/8SPF9/aAMyJyK++6WBOyYI358zOMWwPrqQk9N6G6sk91aXzo
VumxWkfNbw==
=z26n
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gunnar Wolf@21:1/5 to All on Fri May 9 03:30:01 2025

Wouter Verhelst dijo [Thu, May 08, 2025 at 12:43:50PM +0200]:

On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:

The transformative criteria here is that the resulting work needs to be >> transformed in such a way that it adds value. And generating new texts
from a LLM is pretty clearly a value-adding transformation compared to
the original articles. Even more so than the already ruled-on Google
Books case.

OK, let me change it around a bit, because I don't think this discussion
is going in any direction that is relevant for Debian.

The only way in which you can build a model is by taking loads and loads
of data, running some piece of software over it, and storing the result >somewhere.

How can we do this legally, reproducibly, and openly if we do not have
the rights to redistribute the said "loads and loads of data"?

The answer is, we can't.
(...)

I agree with your conclusions, except for one point: Currently, Debian
*aims* at being fully reproducible, but _has never achieved it_ so far (although we have a quite high degree of reproducibility).

I am not saying we should ignore this — only that this IMO would not be as compelling an argument as you position it to be.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Matthias Urlichs@21:1/5 to All on Fri May 9 10:10:02 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------j0rqiUJLj1LQJsTvcAakFmGj
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMDguMDUuMjUgMTI6NDMsIFdvdXRlciBWZXJoZWxzdCB3cm90ZToNCj4gVGhlcmVmb3Jl LCBJIGNvbmNsdWRlIHRoYXQsIHByYWN0aWNhbGx5LCB3ZSBjYW5ub3QgaW5jbHVkZSBtb2Rl bHMgaW4NCj4gRGViaWFuIGlmIHdlIHdhbnQgdGhlbSB0byBiZSByZXByb2R1Y2libGUuDQoN Ck5vdGUgdGhhdCB0aGVyZSBhcmUgb3RoZXIgb2JzdGFjbGVzIHRvIHJlcHJvZHVjaWJpbGl0 eSBiZXNpZGVzICJ3ZSBjYW4ndCANCmRpc3RyaWJ1dGUgdGhlIGRhdGEgJ2NhdXNlIHRoZXkn cmUgbm9uLWZyZWUiLg0KDQpXZSBtaWdodCBub3QgYmUgYWJsZSB0byBiZWNhdXNlIHRoZXkg YXJlIHRvbyBkYW1uIGJpZy4NCg0KUmVidWlsZGluZyB0aGUgbmV1cmFsIG5ldCBtaWdodCBy ZXF1aXJlIHNwZWNpYWxpemVkIGhhcmR3YXJlIChvciBqdXN0IA0KY29tbW9kaXR5IGhhcmR3 YXJlIGJ1dCB0b28gbXVjaCBvZiBpdCkgd2hpY2ggbmVpdGhlciBvdXIgYnVpbGRlcnMgbm9y IA0Kb3VyIGRldmVsb3BlcnMgaGF2ZS4NCg0KUmVidWlsZGluZyBtaWdodCB0YWtlIHRvbyBs b25nIHRvIGJlIHVzZWZ1bCB0byBhbnlib2R5Lg0KDQpUaGUgYnVpbGQgcHJvY2VzcyBtaWdo dCBiZSBub25kZXRlcm1pbmlzdGljIChob3cgKmRvKiB5b3UgY29vcmRpbmF0ZSANCnVtcHRl ZW4gR1BVcyB0aGF0IGFsbCB3b3JrIHRvIGluY3JlbWVudGFsbHkgYWRqdXN0IHRoZSB3ZWln aHRzIG9mIGEgDQpsYXJnZSBuZXR3b3JrPyBBbnN3ZXI6IHlvdSBkb24ndCkgYW5kIHRodXMg bm90IGJlIHJlcHJvZHVjaWJsZSBpbiBwcmluY2lwbGUuDQoNCg0KTkIgSSBhZ3JlZSB0aGF0 IHBlb3BsZS1pbnRlbGxpZ2VuY2UgaXNuJ3QgdGhlIHNhbWUgYXMgDQpjb21wdXRlci1pbnRl bGxlbmNlLCBidXQgdGhhdCdzICdjYXVzZSBjdXJyZW50bHkgdGhlcmUncyBzb21ldGhpbmcg DQptaXNzaW5nIGZyb20gdGhlIGxhdHRlcjsgdGhlIG1pc3NwZWxsaW5nIGlzIGludGVudGlv bmFsLiBDYWxsIGl0IGNvbW1vbiANCnNlbnNlLCBncm91bmRpbmcgaW4gcGh5c2ljYWwgcmVh bGl0eSwgbGVhcm5pbmcgZnJvbSBhZHZlcnNlIGFzIHdlbGwgYXMgDQpleGVtcGxhcnkgdHJh aW5pbmcgZGF0YSwgb3IgaW5oZXJpdGluZyBhIGJyYWluIHN0cnVjdHVyZSB0YWh0J3MgYmVl biANCnJlZmluZWQgYnkgYSBjb3VwbGUgbWlsbGlvbiBnZW5lcmF0aW9ucyB2aWEgZXZvbHV0 aW9uIGluc3RlYWQgb2YgYSBmZXcgDQp0ZW4gaXRlcmF0aW9ucyBvZiBuZXVyYWwgbmV0d29y ayBlbmdpbmVlcmluZyBieSBodW1hbnMNCg0KVGhlIHByb2JsZW0gaXMgdGhhdCBhbGwgdGhv c2UgbWlzc2luZyBmYWN0b3JzIGFyZSBkZXN0aW5lZCB0byBnbyANCnVuLW1pc3Npbmcg4oCU IGFuZCB0aGVuIHdoYXQ/IFdlIGNhbid0IGJhc2Ugb3VyIHJ1bGVzIG9uIGJpb2xvZ2ljYWwg DQpleGNlcHRpb25hbGlzbS4NCg0KLS0gDQotLSByZWdhcmRzDQotLSANCi0tIE1hdHRoaWFz IFVybGljaHMNCg0K

--------------j0rqiUJLj1LQJsTvcAakFmGj--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgdsUAFAwAAAAAACgkQcs+OXiW0wpP4 Hg/9GAXIoUDVxRNO13OPOyJX0fCorMwiuZlIbgJeAl0IMSndZ46bjKWC96tNqZrAYND3fLCfTnqF IKZ/G60HOClTSV1XfPGiXxYs5wRMLG9xkN2IWCrVyC0gDRNnpo4wYpb85Cl+Obx1a27B0YpqrTuK CnGeXzSME6ziS5oCBeclVF434FPFlX1M0F60zHsRq86d1+OVJKkxtzuuczsvG618vFEYdoRHuz9+ keGtwLksvcrKIiwGLGvZZgT6V38VaLlaDc5B1atbmKi11q1oRijC5SNf7CQyDVaBIYM73kuzH6/1 i/ZnQ9HkUe8iW8F/AHgLMAncSXOsN7QduwzuruGsQykkJ3Wi9WMQwKbnx5kNYBduFA+j5ddod9Bp X28597KnzzXjvNFxdgu52II9uY5Cu0W9mahFqZPorkowHGNBeLWyTOjVPpaxqD2QAYsa3I+1xOjp BnM9Ei0+Am78yMfgFbZV1jgWLrLInZMxnQJzw+AGet3e+gvemUY5XJpBb4qUiGHSwwceZ2w1JUlm 2umkouatLR4Htv8o5cLDasJkiQVGqTFuVY9V//WeUDZjD78tGew/hXJ+0WsbgygzShcxBfgXCL6Z h6Cqc9tY8AV2cSw6zW6kzuwjxcr/oTPorW8AOL4SRAzYDxZc0iSZl1ioCpzVQWU9wg1AbPEm7oDI GEc=
=t4hb
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Wouter Verhelst@21:1/5 to Aigars Mahinovs on Fri May 9 16:30:01 2025

On Fri, May 09, 2025 at 12:11:25PM +0200, Aigars Mahinovs wrote:

On Thu, 8 May 2025 at 12:46, Wouter Verhelst <[email protected]> wrote:

On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:

The transformative criteria here is that the resulting work needs to be
transformed in such a way that it adds value. And generating new texts
from a LLM is pretty clearly a value-adding transformation compared to
the original articles. Even more so than the already ruled-on Google
Books case.

OK, let me change it around a bit, because I don't think this discussion
is going in any direction that is relevant for Debian.

The only way in which you can build a model is by taking loads and loads
of data, running some piece of software over it, and storing the result somewhere.

How can we do this legally, reproducibly, and openly if we do not have
the rights to redistribute the said "loads and loads of data"?

The answer is, we can't.

Sure we can. It is a technical problem, actually. As long as the data
is still available, you can store and redistribute information about
which data you gathered, from where and how it looked like - hashes of copyrigthed content are not copyrighted ;)

The fact that we don't need to do something technically doesn't mean
it's not a good idea.

We don't *need to* distribute source packages to build software, but we
choose to anyway.

We don't *need to* distribute the latex, docbook, or libreoffice sources
for PDF documentation in our packages, but we choose to anyway.

In a similar vein, yes, you're right that technically we don't need to distribute the input data for the models, but that doesn't make it a bad
idea.

I mean. Honestly. If you're going to use "the law" as an argument one
more time, I'm going to *scream*.

I shouldn't even have to explain this to you; "the law" has no bearing
on the difference between "main" and "non-free". Yes, the decision on
whether something can go into our non-free repository is purely and
simply "is it legal for us to put that there". If the answer to that
question is "yes", then it can, and if the answer is "no", then it
can't. It's as simple as that.

But for our main repository, the story is different. So even if "the
law" states that something is fine and legal to do, that doesn't mean we
*have* to state that it therefore automatically satisfies *our*
standards of "free software".

This is what I'm trying to say, and you're not going to convince me that something can go into main because of any argument that is based on "the
law".

In my opinion, a model is not free if we don't have the rights to build
that model, and if we don't have the rights to redistribute everything
that is needed to build that model. Anything else fails DFSG1, DFSG2,
DFSG7 and DFSG8, and it *does not matter* whether copyrights attached to
those files transfer to the model or not.

--
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Wouter Verhelst@21:1/5 to Aigars Mahinovs on Fri May 9 19:20:01 2025

On Fri, May 09, 2025 at 06:28:40PM +0200, Aigars Mahinovs wrote:

Training data is not source code,

Opinion, not fact. I am not saying that it is an invalid opinion for you
to hold, but it is an *opinion*, and one I disagree with.

--
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Matthias Urlichs on Fri May 9 19:30:01 2025

Matthias Urlichs <[email protected]> writes:

The problem is that all those missing factors are destined to go
un-missing — and then what? We can't base our rules on biological exceptionalism.

Why not? The entirety of law, politics, and civilization is designed by
humans, for humans. Free software is a movement of humans that attempts to provide other humans with specific freedoms and guarantees around the
software they use. I don't work on free software because I want to make something easier for Google's LLM. I work on free software because I want
to give freedom and control to human beings.

We're the ones building the system. Why should we not design the system
for us, to help us, to make our lives better?

The LLMs are by and large the creations of corporations because they have collective resources that dwarf the resources of nearly all individual
humans. Where this line of reasoning goes in practice is to (further)
create a legal system that treats corporations and their tools as the most important actors and humans as secondary material for corporations to
consume. We already have too much of that.

We *absolutely* should base our rules on what's best for human beings, not corporate constructs. That is the entire point of the free software
movement.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ian Jackson@21:1/5 to Russ Allbery on Fri May 9 20:00:01 2025

Russ Allbery writes ("Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models"):

Why not? The entirety of law, politics, and civilization is designed by humans, for humans. Free software is a movement of humans that attempts to provide other humans with specific freedoms and guarantees around the software they use. I don't work on free software because I want to make something easier for Google's LLM. I work on free software because I want
to give freedom and control to human beings.

We're the ones building the system. Why should we not design the system
for us, to help us, to make our lives better?

The LLMs are by and large the creations of corporations because they have collective resources that dwarf the resources of nearly all individual humans. Where this line of reasoning goes in practice is to (further)
create a legal system that treats corporations and their tools as the most important actors and humans as secondary material for corporations to consume. We already have too much of that.

We *absolutely* should base our rules on what's best for human beings, not corporate constructs. That is the entire point of the free software
movement.

*applause*

--
Ian Jackson <[email protected]> These opinions are my own.

Pronouns: they/he. If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Holger Levsen@21:1/5 to Ian Jackson on Fri May 9 20:10:01 2025

On Fri, May 09, 2025 at 06:54:46PM +0100, Ian Jackson wrote:

Russ Allbery writes ("Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models"):

Why not? The entirety of law, politics, and civilization is designed by humans, for humans. Free software is a movement of humans that attempts to provide other humans with specific freedoms and guarantees around the software they use. I don't work on free software because I want to make something easier for Google's LLM. I work on free software because I want to give freedom and control to human beings.

We're the ones building the system. Why should we not design the system
for us, to help us, to make our lives better?

The LLMs are by and large the creations of corporations because they have collective resources that dwarf the resources of nearly all individual humans. Where this line of reasoning goes in practice is to (further) create a legal system that treats corporations and their tools as the most important actors and humans as secondary material for corporations to consume. We already have too much of that.

We *absolutely* should base our rules on what's best for human beings, not corporate constructs. That is the entire point of the free software movement.

*applause*

/me joins the cheering!

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

No mas pobres en un pais rico!

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEuL9UE3sJ01zwJv6dCRq4VgaaqhwFAmgeQ7EACgkQCRq4Vgaa qhxOhw//eUFfs+yIF3Tpn3bWtdT/R5d++Alx4tR1OHCLksM1unxkvRV+IxldEIUS E+YxiTcOC6pRhAMpcYEKOrXO7itj/UY89Hw0Zy5BHTlIkbWak3iNHd66xiA0viVI fedB8TEtROsK7xzuUYY1O1Zg+r+B2/7PJHC7tBN/8kAKykG9bvDzcho6k43k2HYq D3aAM00eKigV466iRjG5yi5G002iqmeRaKcfXwaoisq+BhrvIzbfoJaQ4ExHFCbi hvZRksz2Ua7mWRQDVdQbQER2sQ0fyKIDd3HzUNqRv7UhD0nmj7OAjzNR/QSePmvd Pywtim3oNNhnkog4roJJep0Wen6qECu1ri6/263y9obBwG3izo6pDKsOdJje8JxM bazkMAPb+KffDLJnV0Xfr4XJRObbFojU0k3d8smaiefMdjjHz++Dxcs1p6K/JOSr IrKCjf8Hfr3XLf4btppnoWIRHoZrZhzIu1bu8MLDJ/v3e6plCvFmJaojpJ8QF1DC /tBtrLUInlp8A2INfASq4+Bp3Tm23sANw4JsYBmFPhmev1La257sdh4C8KG6+MdK urclJr9Q40rXRC/pvSVCmMbRUftnf6tf8dT0X8M3fMUAD91MqKJ0aChynMfNs0/V q7DlNapTDLBbN+MD

From Clint Adams@21:1/5 to Holger Levsen on Fri May 9 20:30:02 2025

On Fri, May 09, 2025 at 06:04:34PM +0000, Holger Levsen wrote:

We *absolutely* should base our rules on what's best for human beings, not >> > corporate constructs. That is the entire point of the free software
movement.

*applause*

/me joins the cheering!

With all this enthusiasm, maybe it's time to resurrect https://www.debian.org/vote/2004/vote_002

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andrey Rakhmatullin@21:1/5 to Clint Adams on Fri May 9 20:50:01 2025

On Fri, May 09, 2025 at 06:22:35PM +0000, Clint Adams wrote:

We *absolutely* should base our rules on what's best for human beings, not >>>> corporate constructs. That is the entire point of the free software
movement.

*applause*

/me joins the cheering!

With all this enthusiasm, maybe it's time to resurrect >https://www.debian.org/vote/2004/vote_002

Why not, it makes sense to ask the project fundamental questions every 20 years.

--
WBR, wRAR

-----BEGIN PGP SIGNATURE-----

iQJhBAABCgBLFiEEolIP6gqGcKZh3YxVM2L3AxpJkuEFAmgeTKotFIAAAAAAFQAP cGthLWFkZHJlc3NAZ251cGcub3Jnd3JhckBkZWJpYW4ub3JnAAoJEDNi9wMaSZLh jdQP/0Hl/Rk3EH16BEYwX+YK5oNGK1tjU5adtLSjw+0ZJ0UL8vlJA3c7y2j9P9hh ZiOcE8uekzci4+8qH9coNZVDKPMKMpP+SO+cG4eLpteHf7bBdTix7/nuSrpVwr/M +wq62I6ZZOz150deNXs2jt0vzDz1FwXpoUVIDZ6JJRzeu2HBzEB1AH5VQamqqEwp efVkVkHFc35reBq6OqPfNMpv7vpPbXE8DEqV/pLYsb2Oo5y2YiPgP8fwnsn4ps4u mQil1+HmfMviGsxz63/2n5pxhe9WQo1+e/7wjBGHoK7QK51gp54ZeCrhF1xputth ywggcd9eibv3P5UIZrxgJNNCBYPFq3gJe1L457DYFhvoB2q9Z6FzSLB1QMMNF/uo F8FFiGxdEgcguZvp83JAxK9OAf5pGIrHB+HjiTaXaCaaPr5SJbFXMVIPajm2zoL1 lu+KzvIUalB7zNgte/IPgxGi9xQtTbQ9Cu7l8lG2Xelpi1IawhCqq5UbMAchcj9O sn6+56AmPojgIB6vOmQ9DsShk5B0QpKFEd803d89btTnN6YrZIoLPW6nuLK+iljD 91RbnesQLfwC0Sq+v38QItzI86hnKsy2TfgkRl6LWBqR4RyYn+U9qsBG5sC1T5/k SH5UnpcH/tP6yhmZiTxRway31PvM0cruS16fmYgT+CTSGQGM
=Y/Mb
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Matthias Urlichs@21:1/5 to All on Fri May 9 23:40:01 2025

This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------jQ47LBIsvp7TGtNEikgM8sxZ
Content-Type: multipart/alternative;
boundary="------------GiVMKHEDVQthPM0guQQA63su"

--------------GiVMKHEDVQthPM0guQQA63su
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: base64

T24gMDkuMDUuMjUgMTk6MjAsIFJ1c3MgQWxsYmVyeSB3cm90ZToNCj4gTWF0dGhpYXMgVXJs aWNoczxtYXR0aGlhc0B1cmxpY2hzLmRlPiB3cml0ZXM6DQo+DQo+PiBUaGUgcHJvYmxlbSBp cyB0aGF0IGFsbCB0aG9zZSBtaXNzaW5nIGZhY3RvcnMgYXJlIGRlc3RpbmVkIHRvIGdvDQo+ PiB1bi1taXNzaW5nIOKAlCBhbmQgdGhlbiB3aGF0PyBXZSBjYW4ndCBiYXNlIG91ciBydWxl cyBvbiBiaW9sb2dpY2FsDQo+PiBleGNlcHRpb25hbGlzbS4NCj4gV2h5IG5vdD8gVGhlIGVu dGlyZXR5IG9mIGxhdywgcG9saXRpY3MsIGFuZCBjaXZpbGl6YXRpb24gaXMgZGVzaWduZWQg YnkNCj4gaHVtYW5zLA0KDQpJJ20gbm90IGRpc3B1dGluZyBhbnkgb2YgdGhhdC4gKk9mIGNv dXJzZSogd2Ugc2hvdWxkIHdyaXRlIG91ciBydWxlcyBhbmQgDQpsYXdzIHRvIGJlbmVmaXQg aHVtYW5zIC8gaHVtYW5pdHksIG5vdCByb2JvdHMgb3IgQUlzIG9yIGNvcnBvcmF0ZSANCnBy b2ZpdGVlcmluZyBvciB3aGF0LWhhdmUteW91Lg0KDQpBbGwgSSdtIHNheWluZyBpcyB0aGF0 IHRoZSBpZGVhICJhIGh1bWFuIGNhbiBleGFtaW5lIGEgbG90IG9mIA0KY29weXJpZ2h0ZWQg c3R1ZmYgYW5kIHRoZW4gcHJvZHVjZSBub24tY29weXJpZ2h0ZWQgb3V0cHV0IGJ1dCBhIGNv bXB1dGVyIA0KY2Fubm90IiBtaWdodCBzdGlsbCBob2xkIHNvbWUgd2F0ZXIgdG9kYXksIGJ1 dCB0aGUgYnVja2V0IGlzIGxlYWt5IGFuZCANCmdldHRpbmcgbGVha2llciBldmVyeSBjb3Vw bGUgb2YgbW9udGhzLCBpZiBub3Qgd2Vla3MuDQoNCi0tIA0KLS0gcmVnYXJkcw0KLS0gDQot LSBNYXR0aGlhcyBVcmxpY2hzDQoNCg==
--------------GiVMKHEDVQthPM0guQQA63su
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 09.05.25 19:20, Russ Allbery wrote:<br>
</div>
<blockquote type="cite" cite="mid:[email protected]">
<pre class="moz-quote-pre" wrap=""><pre wrap=""
class="moz-quote-pre">Matthias Urlichs <a
class="moz-txt-link-rfc2396E" href="mailto:[email protected]"
moz-do-not-send="true"><[email protected]></a> writes:

</pre><blockquote type="cite" style="color: #007cff;"><pre wrap=""
class="moz-quote-pre">The problem is that all those missing factors are destined to go
un-missing — and then what? We can't base our rules on biological exceptionalism.
</pre></blockquote><pre wrap="" class="moz-quote-pre">Why not? The entirety of law, politics, and civilization is designed by
humans,</pre></pre>
</blockquote>
<p>I'm not disputing any of that. *Of course* we should write our
rules and laws to benefit humans / humanity, not robots or AIs or
corporate profiteering or what-have-you.</p>
<p>All I'm saying is that the idea "a human can examine a lot of
copyrighted stuff and then produce non-copyrighted output but a
computer cannot" might still hold some water today, but the bucket
is leaky and getting leakier every couple of months, if not weeks.</p>
<pre class="moz-signature" cols="72">--
-- regards
--
-- Matthias Urlichs</pre>
</body>
</html>

--------------GiVMKHEDVQthPM0guQQA63su--

--------------jQ47LBIsvp7TGtNEikgM8sxZ--

-----BEGIN PGP SIGNATURE-----

wsF5BAABCAAjFiEEr9eXgvO67AILKKGfcs+OXiW0wpMFAmgebYcFAwAAAAAACgkQcs+OXiW0wpM5 Hw//R8IR9mFE6qXqLemKKrCKfJ6bgy7f3/tN9bpF1hjlGp1t8bw1avy1CN9FhYxPKkpSgtcmn0E/ J0uo7ZyAmDP+dtWRFmXanEzoY5utXzCLBykRNN9OZRYPwD0eubETEchC1zFOPbh28LldHpWQmT9U snLQzXSzzDt4aqN3dNBbQcTrzeg0W/VQhIOA2ql7BgFJ2buIKmFH8SB8/gCQw7W8b/Oxs9BNvv3e McfusyPA9Gew+9qd2cKtsf9Onb/7kNF8Ew7jUO/Ow9cs3v6Dll8PgbK61q+J7DkNB6KPq+RPtLou wFCtxOMYCoKTvYqnbt4vsfhJPpkUA/LFFupDrsejESS9LRaIdNwBgCAvSU9lveKy5VA8NBGP3U/f MSvv9qKyiGnmMBeUDDQvRS3uZ6n7F4t7L9T6Pu0NZ4NM67lTHOAVg2tRcl19L1oD5Q29MiWjGDQv 3RHSY0lBQzzebWn933CggoEM+NVp2pXIsLTw7YfiTNv5h283mext3BWbLhN40b+u0+PDiQqFyD3B YZ2MTzUkodQjcR5c6vRzWrn6R6fNTKZlXv+aK0l7lo3IjA752VpkP379uIAUDeuinUiqCWyhQ3JB wpTfSsKCDpVMlJM4R/3bVsAIVlW1MnOp2UOm1FRH7SEdWXrYYQV1L4dA6XcbaOskTpl4rCoO37U6 +2E=
=6zES
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05

From Aigars Mahinovs@21:1/5 to Russ Allbery on Sat May 10 00:30:01 2025

On Fri, 9 May 2025 at 19:21, Russ Allbery <[email protected]> wrote:

Matthias Urlichs <[email protected]> writes:

The problem is that all those missing factors are destined to go
un-missing — and then what? We can't base our rules on biological exceptionalism.

Why not? The entirety of law, politics, and civilization is designed by humans, for humans. Free software is a movement of humans that attempts to provide other humans with specific freedoms and guarantees around the software they use. I don't work on free software because I want to make something easier for Google's LLM. I work on free software because I want
to give freedom and control to human beings.

I find that thinking to be rather limited. LLM are not self-aware or self-operating entities. There is always a human that uses an LLM.
It's their freedom that you are discounting.

Moreover - there are *far* more people that can use an LLM to benefit
from its gathered knowledge compared to the number of people that have
spent decades learning programming like we have. Hating on LLMs hurts
the freedom of a lot more people.

I can get the idea that an LLM "avoiding" copyright by learning a top
of GPL code and then generating new code that (according to LLM
proponents) no longer has to be GPL-licensed, feels unfail to the
authors of that GPL code. But this, hypothetical, damage needs to be
balanced by the other side of the equation - with the use of that LLM
_several magnitudes_ more people can create new, useful (to them)
software. These are the human beings that only get their freedom and
control from the use of such work. These are the human beings being
helped by making LLM more available (and more free).

And in addition to that, if only 1% of people using LLM to generate
code like what they see and what they get and continue learning and go
back to contributing to the source projects ... we would eventually
have a steady and massive flow of new developers contributing to open
source. Closing the loop.

Explore the HuggingFace or other resources. There are literally
thousands of all kinds of LLMs and other AI models out there. Focusing
on Google or Meta or another big player is missing the point. For each corporate AI model there is a hundred AI models created by a hobbyist.
Maybe not as large, as expensive, as generic, but in some specific
cases just as useful.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Matthias Urlichs on Sat May 10 01:30:01 2025

Matthias Urlichs <[email protected]> writes:

I'm not disputing any of that. *Of course* we should write our rules and
laws to benefit humans / humanity, not robots or AIs or corporate profiteering or what-have-you.

All I'm saying is that the idea "a human can examine a lot of
copyrighted stuff and then produce non-copyrighted output but a computer cannot" might still hold some water today, but the bucket is leaky and getting leakier every couple of months, if not weeks.

This is why I'm trying to plug the holes in the bucket as best I can, in
the small part of the world that I can affect, at least until someone
deploys a better legal system than copyright for protecting human
interests.

I know I am trying to occupy an awkward middle position. I participate in
free software because I don't like the way copyright (among other tools, including simple secrecy) is abused in software to prevent people from modifying their own tools, and then within that free software community
I'm arguing in favor of some uses of copyright to protect artists from exploitation, even if that restricts some things a user may wish to do
with their work. This is going to sound contradictory to purists in both
camps. But I think this is one of the cases where there are real competing interests that need to be balanced, not simply dismissed by declaring one
of the interests superior.

I am in part making an argument in favor of Chesterton's fence [1] and
those arguments are never very popular (and, I admit, are also often
misused). But I do think it's worth opposing the ethos of "move fast and
break things."

[1] https://en.wikipedia.org/wiki/G._K._Chesterton#Chesterton's_fence

We do not have to simply accept the direction society appears to be going
in. We can try to change it, and we can label and add nuance and
selectively decline to participate in the portions of it that we find
harmful. That's how the free software movement started, in considerably
more hostile terrain than we face today, and we still made such
significant gains that we fundamentally changed the entire software
industry.

I think there is a very strong gravitational well in technical communities
that pulls people towards the idea that if something is possible, it will happen anyway, and therefore we may as well embrace it because there's no
way to stop it. But this is just not true. Societies have outlawed all
sorts of things and thereby significantly reduced their frequency because
they were harmful to people. Human ingenuity is not a god; we do not owe
it passive obedience. We can weigh new developments against our morals and ethical judgments and find some of them wanting.

The ironic part is that this makes me sound like some sort of
conservative, when I am probably on the left radical side of most of the
folks here. But I want to argue for changing things thoughtfully and
arguing seriously about a sense of shared ethics, not just assuming we
have to accept what other people decide to do.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Holger Levsen@21:1/5 to Aigars Mahinovs on Tue May 13 10:30:01 2025

On Sat, May 10, 2025 at 12:21:26AM +0200, Aigars Mahinovs wrote:

I find that thinking to be rather limited. LLM are not self-aware or self-operating entities. There is always a human that uses an LLM.
It's their freedom that you are discounting.

this freedom needs to be valued against what it costs. sure i'm free to
fly my private jet whereever I want, yet it has some costs for everyone.
same with LLMs.

Moreover - there are *far* more people that can use an LLM to benefit
from its gathered knowledge compared to the number of people that have
spent decades learning programming like we have. Hating on LLMs hurts
the freedom of a lot more people.

citation needed, I could also say: this sounds like a human hallucination to me.

LLMs don't work, all companies building them loose enourmous amounts of
money so far and their best plan how to make them profitable is to make
LLMs figure out that part. LOL!

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

Every time you see the word "smart" used to describe a device, replace it with "surveillance." Surveillance watch. Surveillance streetlights. Surveillance oven. Surveillance toilet. Surveillance car. Surveillance city. (@mollyali)

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEuL9UE3sJ01zwJv6dCRq4VgaaqhwFAmgjAcEACgkQCRq4Vgaa qhwvcg/8DghmQKr1zE41Qyjzs0Xeh9kfxhDdPz4kJtCWgqJKT2xwKGXxUAGlOgqI iwHSvPhpsgtkiVhyDZbGP5IOWeYQkSLFHHtfrhD6+SbuYnyrp+XZuCaHzY/hW86i +U2eL0nJeOjGpD1yhOqwGrqM6LqFSkRa3A8Y0yQ9uhZ9TTkXh8uiybLuvADpTwrE GIeMwNNfeLzdHW6u3r76rD+hsFYpSjckMLioaZQT0palazsIJMP91k9DI8EK1YsL UgP6TIK1hUPAjTueLpX8AUTTjH0Fpn3+DdWkTW8uSoKqNMQLNv4dTAqWBzflZdGE EaI4Gl2PMOYDQJavYzLPYQI+aY3P7vMeJH3V1GrvyKJn1xoSfr7cH/buiH1YuN4R aWWdSgLGW0S+fkJolSFCk5nrL702/u2bxfj7qqdBGuE0sDfi0y+32IfihwFPZvkS
vRq4hNTyg60o

From Russ Allbery@21:1/5 to Aigars Mahinovs on Tue May 13 19:00:01 2025

Aigars Mahinovs <[email protected]> writes:

This was in response to Russ articulating that: "I don't work on free software because I want to make something easier for Google's LLM. I
work on free software because I want to give freedom and control to
human beings."

The false assumption here being that making "something easier for LLMs"
will only benefit Google (who are nowhere near top in terms of AI development, btw) and not "human beings", which quite obviously fails to
take in account any freedom and control that a LLMs *does* in fact give
its users, who are also human beings.

Aigars, it would be a lot easier to have this conversation with you if you
pay somewhat closer attention to what other people are really arguing.
first you launched into extended tours of current legal thinking about
this for people who could not possibly care less what the law says because their arguments were ethical and moral and law is not a reliable guide to either, and now you're trying to pick a fight with me over the message
where I was *actively agreeing* with your motives.

This is not the point of our disagreement. We are on the same side of
this: we are both wanting to make ethical choices that benefit humans and
that help human flourishing. I was arguing against the concern that we
might have to avoid something like biological exceptionalism, meaning that
we have to consider treating some future LLMs as agents with moral rights
of their own. I never said you agreed with this; I don't think you *do*
agree with this.

Our disagreement is not about motives, at least in that sense. Our
disagreement is over consent, and your willingness to break the social
contract with artists and override their consent in order to get the
supposed benefits of LLMs. This is not an anti-human-being position; it's
a sort of populism that sacrifices the economic model that artists
currently rely on to survive in the name of empowering the masses. You
went so far as to start talking about LLMs as magic copyright-washing
machines that would let you destroy the basis of copyright and use other people's works for whatever you want. This is where we disagree.

I'm sure that what you *think* you're doing is democratizing art. People
who hold this position always think that. I think you have been duped into
a politics of art without the artist, a belief that one can decouple the creation of meaningful art from artist control of their work by turning artistic work into a sort of factory work-for-hire regime, and that this
is somehow a good thing for the world. I don't think that *you personally* would classify your position that way; I suspect you just think of it as expanding the commons and giving artists better tools. But in our current society this is, to me, the very obvious and predictable outcome of your political position.

I'm not particularly interested in arguing about this with you either,
which is why I've not responded to your last few messages. I've heard all
the arguments here before and I find them very tedious. But now you've apparently decided to pick a fight with me over things I'm not even
saying, so I guess I have to clarify. We are indeed vigorously opposed on
this topic, but it's not because I think your position is anti-human. It's because I think you are naive about how the creation of art happens in
human society, and that the implications of your position stand a good
chance of cheapening and undermining whole fields of human endeavor and seriously hurting the lives of people who have made things that have given
me a great deal of joy. I don't want to see Debian participate in this.

Debian will not in any way be pivotal to this fight, so I guess in some
sense it doesn't really matter what we do, but still, I feel some
obligation to try to argue for an ethical position in the places where I
may have some influence.

But I certainly don't think your position is anti-human! Quite the
contrary: I think it's techno-populist in all of the worst senses of
populism, the sort of populism that is being used as an unwitting tool by
the richest people on the planet who resent that art currently (unevenly, imperfectly) favors labor over capital, and whose goal is to ensure
capital reigns surpreme in all areas of society.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ian Jackson@21:1/5 to Russ Allbery on Tue May 13 20:00:01 2025

Russ Allbery writes ("Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models"):

Aigars, it would be a lot easier to have this conversation with you if you pay somewhat closer attention to what other people are really arguing.

This seems to be a CoC-approved way to say much the same thing as
the insulting message we read the other day. [1]

I don't condone that other poster's way of expressing themselves
there. But it is important that we recognise that harsh words are not
the only form of abuxive conduct, nor of CoC violation.

We should also recognise as problematic patterns of mailing list
posting which are characterised by many of the following:
* Posting very many messages
* Not really properly reading what other people say
* Arguing against misrepresentations of other peoples' positions
* Repeating many similar points
* Making spurious analogies and other spurious arguments

I realise that Debian has chosen to have almost no effective mailing
list moderation, and therefore we tolerate almost every kind and
quantity of bad behaviour. So it seems likely that those of us who
participate here will see more of these messages.

But, I don't like it, and also I want to point out that there are
forms of abuse other than finally snapping and producing an outburst.

(While I'm here: deeds can also be abusive, even if they fall within a maintainer's discretion and aren't forbidden by any formal policy.)

Oh, and: once again I agree with everything Russ has said.

Ian.

[1] <[email protected]>

--
Ian Jackson <[email protected]> These opinions are my own.

Pronouns: they/he. If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kurt Roeckx@21:1/5 to M. Zhou on Tue May 13 23:30:01 2025

On Mon, May 05, 2025 at 03:41:14PM -0400, M. Zhou wrote:

The issue is also discussed here:
https://lwn.net/Articles/1019028/

A better wording goes:
s/open source license/DFSG-compatible license/g

It is fixed in the git repo: https://salsa.debian.org/lumin/gr-ai-dfsg/-/commit/9496f9fb6405db5a99fff1672cd4bad66c925c24

The proposal after amendament:

===============================================================================
Proposal A: "AI models released under DFSG-compatible license license without
original training data or program" are not seen as DFSG-compliant.
===============================================================================

The "AI models released under DFSG-compatible license license without original
training data or program", a particular type of files as explained above, are not seen as DFSG-compliant. Hence, they can not be included in the "main" section of the Debian archive. This proposal does not specify whether the "non-free" section of Debian archive can include those files.

I did a s/license license/license/ on that.

Kurt

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Soren Stoutner@21:1/5 to All on Tue May 13 15:03:17 2025

On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:

2. What is the preferred form of modification? This is IMHO the
deciding, relevant question.
Aigars says weights and I've heard that from several other people active
in machine learning. OSI says the same.
Mo Zhu says training data is. I haven't heard that from anybody else.

I thought several other people besides Mo Zhu had also said that on this list, but just in case they haven’t, I would like to go on the records that I also feel that training data is one of the preferred forms of modification in machine learning and should be thus considered for anything being included in main.

In my opinion, it is fine to include otherwise distributable ML applications without available training data in non-free.

--
Soren Stoutner
[email protected]
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEJKVN2yNUZnlcqOI+wufLJ66wtgMFAmgjwaYACgkQwufLJ66w tgN61g/+Mxk35CCQwZDH2GTleD5D2MQwxnuFL29lkfrZfE9YFj1cnpNXsfzVzKTO tmXVq86IDKchTraHp5gZ6YMv6MqMi+9jZoEumYYMJXinA6lCnnTHS+HP9T4nUoDt seImz6GCkhO/o9/gYX+kld8Ti9P5qn7h9TxQhgqQ26eJJBpGMgcN12zi6/VT+MA2 t+TOeU3BfH9jghjFr/mwISdow0oR62PLTgUtczgnzMH7a73t50tSf2ghU42U4eRF Cea802o8+6rC18JFbwL74/HozJX95swq8+8FpUjiqZgf5ps7WlrfDl28e9cZijjk ShqwoPcDzcxfSm8LVgits7KNPwNJCoTLz5HwqiH3ATmJa6dn67gROXjmx7JRYxX0 lAKB9+FPdFKZmxRgZSLpG/NAnzlHG3Sjnpu9gIgSwXlL9tO+oMbmv+ZCklW8mUP3 1GgEZEBf2yYFvsedc3+pLhMoMxiqAp4k9Q4Fuvah9ZvlXxocdooaBhk3pp8hxmGJ BZvr/G8MDNvW9mkzqQ3e0PoHJNg54M2IE3zxdmLPDdn2m4GzOrVukZHc3kHYKYZL 4N8YphE2KZkXZe9ozjknCH+O+yMArWwe3TfzZCu+Ggyi/H7Ue0HgWWuZdHNbtyej p8ilSVQdVzkBLIKnHIAcfWa3G75ZgtHh6ozFbz455APgH7OIRdk=
=5l2/
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Aigars Mahinovs on Wed May 14 07:10:01 2025

Aigars Mahinovs <[email protected]> writes:

It would be a lot easier to have a conversation with you, if you would
spend more time articulating and detailing *your own* position, instead
of guessing about the positions of others (and then talking down to
those positions). Ideally in the actual manner that matters to you.

I feel like I did this already, and even reiterated part of it in the
message that you are responding to. I'll go over this one more time, and
then I'm going to stop responding to this thread until such time as we
have another GR before us because I find this specific political debate incredibly annoying and am not going to voluntarily subject myself to more
of it if there is no active GR.

This is going to be a really long mail message and I'm sorry. I'm not
making it long to try to browbeat people; I'm making it long because I
don't know how to express how I feel in fewer words and still try to
capture the nuance and complications.

I care about three different things when it comes to machine learning
models in Debian.

1. Source

I care about the standard principle of free software ethics that we should include the source code for the software that we ship (and I consider AI
models to be software). My personal definition of source code is
expansive: I think that it not only should satisfy the "preferred form of modification" test that we take from the GPL, but that it should also be auditable and transparent and should reveal the way the software is
constructed to humans or their tools.

I have read the arguments that weights constitute source and am not
convinced by any of them. They are clearly sufficient for *some*
modifications that people want to make, but I am unconvinced that they are sufficient for *all* modifications people want to make. More profoundly,
they don't appear to serve any of the other purposes of source. I cannot analyze them to understand how the model was constructed, what choices
were made in labeling, or any of the other things I think constitute the
normal expectations one has of free software.

The strongest argument that I see against providing source for machine
learning models is that the training data is voluminous and we don't
really know how to archive or provide it. I agree that this poses serious practical issues, but I don't think it's a good enough excuse to abandon
our ethical policy of providing source.

I continue to hold this position even if upstream never retained the
training data, because I think source in free software should mean more
than a theoretically even footing with upstream (and also because
constantly scraping the web for training data is actively hostile to the Internet and is making it increasingly difficult to run independent web
sites [1], but that's another rant).

[1] https://lwn.net/Articles/1008897/

2. Malicious or doctored models

This is to some extent an extension of the previous point, but it's
important enough to me that I want to break it out separately.

There is already some literature on how to implant a sort of "back door"
in an LLM that controls responses to specific types of prompts if you have control over a small portion of its training data. This joins an already substantial and long-standing literature on how to do the same with
simpler machine learning models such as image classifiers. As use of
machine learning models grows, these sorts of attacks will become more sophisticated, and the motives for doing this sort of tampering will
expand. To take an obvious example, there is a clear financial incentive
for companies releasing open weights LLMs to find ways to embed
advertising bias in those models, so that the models will generate
positive descriptions of that company's products or the products of people
who paid them for this service. I'm sure there are many other variations
on this that I haven't thought of (and I assume the concerns about racial, religious, and ethnic bias are so obvious as to not need discussion in
detail).

Detecting this sort of tampering is not going to be easy; detecting back
doors in source code never is. Having source code (i.e., training data) at least makes it *possible*. Even if that data is very large, I am sure that people will build tools to try to find it, quite possibly using machine learning models! (Hopefully tools with free training data.)

If we don't have source code for the models, then detecting this sort of tampering becomes a sort of reverse engineering problem. I am sure that
people will also develop tools to do that work, but it's a far harder
problem, precisely because of the violation of free software ethics that
hides much of the available information that otherwise could be used to
look for problems.

Free software's position on this has always been that you are allowed to
embed advertising and similar properties into your free software, but
everyone else is allowed to remove them if they want. That is one of the
rights that free software protects. This principle is maintained by having training data available, so that hopefully the relevant training data can
be found and isolated and, at least for smaller machine learning models,
the model can be rebuilt without the bias.

Abandoning our commitment to source availability makes it more difficult
to remove this sort of back door, if it is even possible without
compromising the utility of the model. (The details of this are obviously
going to vary wildly depending on the specific nature of the model.)

3. Consent

This is going to be really long becuse my position has a lot of nuance. I apologize for that in advance.

I'm also going to say up-front that I'm not willing to try to respond to line-by-line criticism of the specifics of the argument (indeed, I'm
probably not going to continue the discussion on the list at all), because that's just not where I want to spend my energy. Some of the details are doubtless wrong in some specifics and I do not have hours to fact-check
myself or I'll never write this at all. Please take the thrust of the
argument as a whole in the spirit in which I'm trying to offer it: an
answer to your question about what my ethical and moral vision is in this debate.

3.1. The current compromise

I believe that we (as in "most countries of the world" we) have agreed to
a tricky and complicated, conditional and awkward, flawed and abused, but nonetheless very widely implemented compromise between two competing
interests: the right of creators [1] to profit from and maintain what they consider to be the artistic integrity of their work, and the right of
society to reuse, rework, and build upon common cultural artifacts. The
nature of that compromise varies in some details from place to place, but
we have been quite earnest about attempting to universalize it with laws
and treaties.

[1] Ugh, I don't really like that word, but it seems to be the one that
the English-speaking world has standardized on to refer to artists
broadly, including writers, musicians, etc.

This message is, as requested, a discussion of the moral and ethical
principles that I personally am arguing for, so I want to be clear that
I'm talking about the general principle that backs popular support for the copyright system, not about the specific legal details of that system.
There are things that I do not like about the Berne Convention
specifically, and there are far more things about copyright law writ large
that I think are actively abusive and unethical. I truly do not care at
all, for the purposes of this argument, whether my ethics exactly match
what is currently written down in copyright law. I know that they do not;
I did not arrive at my ethical or moral position because of what the law
says.

We have a fairly good idea of what would happen if we simply had no
copyright because, as others have pointed out, that compromise is
relatively recent in some places and, even where it has been available domestically for a long time, was often not enforced internationally. We
can therefore look at history to see what happens: Corporations take any
work that they find interesting and capture the market with their products based on that work, usually squeezing the original creator out of the
market and ensuring that most or all of the profits from that work flow to
the corporation that had no part in creating it. This is "good for
consumers" in some sense because the corporation is driven only by supply
and demand and is usually happy to make lots of cheap copies and thus make copies of the art quite affordable. Nonetheless, we decided that wasn't
the society that we wanted to live in (and again, the "we" here is pretty
broad and encompasses most countries in the world to at least some
extent).

A central property of this compromise is that it is time-limited. The
creator gets control of their work for a limited period of time, and then
it stops being property and is freely available to all. A lot could be
said about the *length* of that time period (the point at which I most vigorously disagree with the specifics of the current compromise is that I think this period is much too long), but I agree with this basic
principle, or at least I consider it a workable compromise.

3.2. Fair use

Another very complex part of the current compromise is around "fair use,"
[2] which is, in essence, the rules for when you have to ask the consent
of the creator and when you don't. I completely agree with the ethical principle behind the existence of fair use: no one should own an idea, I
should be able to review books and include plot summaries and quotes (and indeed I do, quite reguarly), I should be able to sing a song to a friend
or learn to play it on a guitar, and people should not have to constantly
worry about incidental use of portions of things they've read or heard. We
live in a society and art is part of society.

[2] This is US terminology. Different countries call this different
things, but my understanding is that there is some similar principle
of boundaries around what requires creator consent in pretty much all
legal systems with copyright, although the boundaries are put in
different places.

The rules around fair use are absurdly complicated, vary between countries
even more than most aspects of the current compromise, and are, in short,
a confusing mess. I do not know of anyone, including practicing copyright attorneys, who likes the current rules or thinks they're coherent or sane.
But the *idea* is sound.

One of the most common claims in this debate is that training a model on artistic work should be fair use. This is not, on the surface, an
obviously unreasonable claim, particularly as a legal claim given the complexity and vagueness of the current compromise here.

I do think *some* training of models on an artistic work is fair use, but
I do not believe training models on artistic works in the way that LLMs do
is fair use. This is not a position based on an analysis of the current
legal framework (like I said, I disagree with a lot of the current legal framework around fair use). It's an ethical and a political position based
on what I see as the likely outcomes of allowing machine learning models
to be freely trained on artistic work within the period of creator
control.

I also think there are some ways to train models on artistic work that
should be legal (fair use) but which are not free software. This is
primarily cases where the amount of data extracted by the model is a very
small portion of the information in the work, and the model itself is
intended for an entirely separate area of endeavor and in no way competes
with the creator's market for their work. [3]

[3] Yes, these are two of the standard US criteria for free use, and I
have probably been influenced by that, but I do separately think both
of these principles make sense ethically.

For example, I think training an image classifier to recognize cats is
probably fair use of photographs of cats, regardless of the license of the photographs. However, I don't think such a model can be free software
unless the consent of the photographers has been obtained because the
labeled training data cannot otherwise be redistributed under free
software terms, which means the result does not have source. This is, to
me, a form of proprietary software, just as if some useful piece of
software had a chunk of source code under a proprietary license that
prevented it from being distributed under free software terms.

However, I think LLMs, and some other machine learning model cases, fail
even this test, and I think it should be illegal (and certainly is
unethical) to train them in that way for anything other than private,
personal use without the permission of the creators of the works they are trained on. I think this breaks the copyright compromise. This is because
LLMs do not extract only limited information from the work. They do deep statistical data mining of the work with an explicit goal of being able to create similar works. They therefore directly compete with the creator's
market for their work. I consider this an unethical and hostile act
towards the creator in our current copyright compromise.

3.3. Ethics of consent

There are a lot of different ways to arrive at this conclusion, but fundamentally my argument is that creating an artistic work is a special
human endeavor that does, and should, hold a special place in our system
of morality. Artistic creation is fundamental to so many things that are integral to being human: communication, empathy, creativity, originality,
and self-expression. Even if you disagree with my moral position below, I
think there are problems of politics and practical ethics that argue substantive use of the work within the time-limited span of copyright
should require the consent of hte artist.

One reason is that the number of people (and collectives of people, such
as corporations) who will use other people's works maliciously is
significant, ranging from plagiarism through conterfeiting to fraud. I
know that the counterargument here is that each of those malicious
activities can be outlawed independently without needing the copyright compromise, but I think that position is wildly unrealistic. Society is
not suffering from an excess of tools to prevent people from using other people's work maliciously; quite the opposite at the moment. As a matter
of practical politics, I am opposed to discarding existing tools for doing
so, as long as those tools are ethical, and I think the copyright
compromise is.

The other reason, which I've talked about a lot, is that this is how
creators can afford to make art as their job without being independently wealthy or using only their spare time. That in turn provides the world
with better artistic works, not to mention being the mechanism whereby
society demonstrates its value that art is important and should be
encouraged. This is, for example, the stated reason for copyright law in
the United States, and I believe in other countries.

I think there was a comment in this thread that copyright enforcement is
only a tool for rich people because only rich people can sue. This is definitely not the case. The structure of copyright law is the entire
reason why, for example, the book publishing market exists in its current
form and forms the legal framework behind the author compensation model
(which is very much *not* work for hire in the common case). People, even people who are not rich, do sue to enforce their rights (and win), and it doesn't take very much of that to create a deterrence effect, which is how
the law is generally designed to work.

I've also seen comments that all art should be funded by payment for the creation of the art, not by royalties after it's complete. This is
equivalent to arguing all art should be funded using a Kickstarter model,
and I hope it's obvious why that isn't viable. (Among many other problems,
this mostly only works if you're already famous.) This is one of the
places where I have to urge people to go listen to what creators say about their funding models. They're not shy about how important the financial structures enabled by copyright are, or why the alternatives would often
make it impossible for them to continue to make art. I am personally the
most familiar with the book publishing industry, and I could list many,
many writers who are not in any way wealthy (who probably have less money
than most of the people reading this) who are quite eloquent on exactly
how they rely on the copyright compromise to be able to write at all.

Now, I am from the US, and in another part of this thread I've been told
that the EU has a totally different funding model for creators and
supports them without the need for them to sell their work for money, and therefore these are more US-specific problems that we should fix by fixing
our societal method for supporting artists. I freely admit that I don't
know very much about EU law or about EU creator support mechanisms, so
maybe this is correct, and if so, that's fantastic. I am all in favor of
the US fixing all sorts of things we're doing poorly, including that one.
I am a *little* dubious because I have followed the work a lot of
creators, including ones from the EU, and I've never heard this from a
creator. They all still seem quite concerned with the income they can
derive from their work. But I will freely admit that a better economic
support model for creators would remove a chunk of this argument.

However, I don't think it removes all of the argument. In addition to the
point above about misuse of artistic work, I also consider it a moral imperative to obtain the consent of the creator if the intent of the use
is to make substantial use of the work. This is my personal moral belief
and I don't expect everyone to share it, nor do I think it's necessary for
the rest of my argument, but, well, you asked for my moral position. I
believe that someone's artistic work is often a deeply meaningful personal communication and that should be treated with respect. I consider this to
be part of a basic obligation to respect human dignity and the special
place of art in human society. This does fade with time; eventually, the
work has become part of the culture and gains some separation from the
artist. But I don't think this should happen immediately.

(How does this apply to corporations? Hang on, I'm getting to that.)

3.4. Computers are not the same as humans

One of the arguments that has come up in this discussion is that one can
model the human brain as a type of computer and therefore "doing deep satistical data mining of the work with an explicit goal to be able to
create similar works" is just a description of a huge percentage of normal human activity throughout all of history. This is what we all do when we
learn something artistic: we look at the work of people who already know
how to do it and we figure out how they did it and we learn by copying
them. And therefore it should also be permitted to do this with a
computer; it's just the automation of a normal human activity.

I mostly agree with all of this except the last sentence. It is not, in
fact, moral to automate any and all human activity. It is, in fact,
different when a human does something, because often our laws and even the basic principles of our society are designed for human capabilities and
would catastrophically fail under corporate capabilities.

Furthermore, my morality and ethics are centered around humans and, to a somewhat lesser extent, other sentient beings. I care about human
learning, human flourishing, human art. I could not possibly care less
about computer flourishing or computer art because computers are not
sentient, they don't feel pain, and they are not moral actors. Computers
are a complicated tool that we make for human activities.

I am not one of the people who thinks it is theoretically impossible to
*ever* make a sentient being on some sort of silicon platform. If we ever invent Commander Data, I agree that we will have some challenging ethical decisions to make. But I think LLMs are so far away from that today that
they are not in the same galaxy, and I am quite confident this will not
happen in my lifetime. I don't believe it's even theoretically possible to
do so with LLM technology due to how LLMs work, so if we do someday
manage, it will be with some different technology than this.

This is not a point on which I'm going to try to convince people. I know
that some people disagree with me on this, and I think those people are
quite obviously wrong, and I'm afraid that's all you're going to get from
me on that topic because to me the differences are so obvious that I don't think I have enough of a common frame of reference with people who
disagree to have an intelligent debate.

The relevant point for consent is that allowing humans to learn from
artistic work is part of our copyright bargain (and has, indeed, been part
of human understanding of art for as long as we have had art), but
allowing computers to be trained on artistic work is *not*. Training
computers does not automatically generalize from training humans because
humans get special status, not only in law but also in morality and
ethics.

One of the concrete practical reasons for this is that humans have rights:
they have to be paid a fair wage for their work, they cannot be enslaved
[5], and they are legally and morally independent of their employers. None
of this is true of computers, and therefore allowing computers to do
things is practically and politically equivalent to allowing corporations
to do those things at scale. Differences in scale often become differences
in kind, and this is one of them. Corporations (and states, and other collectives that are not individual humans) wield levels of economic power
far beyond that of individual humans and can crush individual humans if
that power is not balanced. One of the ways that human societies balance
that power is by extending special rights to humans that are not available
to corporate machinery. I believe learning from art with or without the
consent of the artist is, and should be, one of those special rights.

[5] I know, I know, I know, I'm again laying out my moral beliefs, not the
horrific abuses "my" government participates in.

3.5. Corporate abuse of copyright

I grew up on the Internet, I copied music from my friends and shared music
with my friends, I remember the RIAA lawsuits against college students,
and I am quite viscerally aware that all of the principles I am talking
about are abused by corporations, often (but not always) directly against
the wishes of the humans who made that art.

I'm also involved in the free softare movement and therefore am of course
aware of the ways that copyright has been abused to take away power from
people over their own devices and tools of everyday living (even medical devices that are literally keeping them alive). I obviously do not support
that or I wouldn't be here to write this.

I have been sympathetic to the argument that we should throw out copyright entirely for multiple decades. I understand where it comes from, and I do
think there is a moral argument there. But I don't think it's entirely
correct; I think it's too absolute of a position and will hurt individual
human creators, ones who are not wealthy and who are not abusing their copyright, more than it will hurt corporations.

If you don't agree with that, and I realize many people here won't, I'm
not sure that I can give you a compelling argument that will convince you.
I can only state the basis of my own moral position, which is that I know
a whole lot of people who make art of various kinds, many of whom hate corporate abuses of copyright as much as any of you do, and I have
listened to them talk about how central the (broken, badly-designed,
always in need of reform) copyright compromise is to their ability to
continue to make art.

The way I personally reconcile these two positions is two-fold.

First, just like I don't consider computers to be the moral equivalent of humans, I *certainly* do not consider corporations to be the moral
equivalent of humans, and I would be quite happy to substantially diminish their ability to hold or enforce copyrights as long as the rights of the underying humans are preserved. Our current legal system is not set up to
do this, but I can imagine ones that would be, and I would be
wholeheartedly in favor of those. I'm of course also opposed to the
excesses of the corporate copyright system, such as disproportional
penalties and intrusive software limitations. Just because I agree in
principle with requiring the consent of the human creator does not mean I
agree with many of the mechanisms that are used to, in theory, enforce
that consent, but in practice to line the pockets of corporations with
little or no benefit to the creator.

Second, in the specific case of *software*, I think our current compromise
is over-broad in what it protects. Software is frequently *not* a deeply meaningful creative human communication that reflects its creator. It's
often algorithmic, mechanical, and functional, attributes that, elsewhere
in our copyright compromise, define works that are not protected by
copyright. I don't consider protecting every software program as strongly
as a novel or painting to be morally justifiable.

I am running out of energy for this write-up (this is just absurdly long),
so I'm not going to go back and show how I would test software against all
of the principles I talked about earlier, but my summary is that I think
these types of creator rights should only apply to artistic works that
are, well, artistic. Some programs qualify; most probably don't. It
matters, very deeply, for my moral position whether the work of art is a
work of personal expression, and I do think that our current copyright compromise has this balance badly wrong for software in particular.

On top of that, there is the very strong argument that people should have
a right of control over objects they have purchased and other sorts of
personal property, and most obviously medical devices that are keeping
them alive. This is a powerful moral principle and to me it overrides some
of the rights of the creator when it comes to software because, again,
software is functional and we cannot allow the protection of the creative component to destroy people's right to control their own lives. This
argument does not apply to things like novels or paintings in the same
way; it is much harder to construct a scenario where one must be able to
make copies of some specific novel in order to exercise personal freedom, because those types of art are not functional in the same way.

3.6. Opt-out

I can't help myself -- I have to say something about this absurd idea that
an opt-out mechanism is sufficient to establish consent, because I find
the entire idea morally repugnant.

Opt-out systems are often the first refuge of scoundrels who had no intent
of honoring consent at all but got caught and realized that position was
too unpopular to be viable. In practice, they are almost always frauds.
The point is usually to defuse the moral argument of the small minority in
any debate who have the time, energy, and knowledge to vociferously
object, while continuing to ignore the wishes of everyone else. There is a
very good reason why corporations almost immediately turn to opt-out
systems as their "compromise" position; they know that they largely don't
work and will give them effective carte blanche to do nearly all of the
things they wanted to do.

"I can do whatever you want unless you yell 'no' loudly enough and in
precisely the right way" is not a defensible moral position.

I think one should be highly suspicious of even *out-in* systems when they involve significant imbalances of power, because often the consent,
although explicit, is partly coerced in all sorts of subtle ways. But
opt-in is the bare minimum; opt-out is just a public relations campaign
for ignoring consent.

3.7. Conclusion

Probably no one is still reading, because this is an exercise in "be
careful what you wish for." :) But for those who jumped to the bottom,
I'll try to sum up my third concern.

Creation of art is a special and morally significant aspect of humanity
that I believe warrants respect and careful ethical treatment. For works
of artistic personal expression (often *not* the case for software), I
think the best ethical path is to start from an assumption that consent
from the creator is required for substantive use within some time-limited period of protection. We can then carve out some sensible exceptions, but
this should be the default. I personally do not particularly care about corporate consent, only about human consent, but for Debian's purposes we probably don't have a reasonable way to draw that distinction.

The practical impact for machine learning models and free software is
that, under this moral principle, models that make substantive use of the
work (including but not limited to the kind of statistical extraction done
by LLM training) should be trained on consentual training data. The
license of the training data is how the free software community
establishes creator consent.

There is space here for machine learning models that I consider ethical
with respect to creator consent, but do not consider free software. For example, a creator could consent to the use of their work to train the
model but not consent to that work becoming publicly available; that's an entirely reasonable thing that I could see a creator doing (in exchange
for money, presumably), and that's their choice. I don't see anything
unethical about that. The result just wouldn't be free software or
therefore eligible for Debian main due to other free software principles discussed above.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Jonas Smedegaard@21:1/5 to All on Wed May 14 07:50:01 2025

Quoting Russ Allbery (2025-05-14 07:00:01)

This is going to be a really long mail message and I'm sorry.

Don't be: no need.

I care about three different things when it comes to machine learning
models in Debian.

Thank you so, so much for writing and sharing this elaborate reflection,
Russ. Very good read, all of it, and in my opinion far more relevant for
the discussion than several other shorter contributions in this thread.

- Jonas

--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/
* Sponsorship: https://ko-fi.com/drjones

[x] quote me freely [ ] ask before reusing [ ] keep private

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon Josefsson@21:1/5 to Russ Allbery on Wed May 14 09:00:02 2025

Russ Allbery <[email protected]> writes:

This is going to be a really long mail message and I'm sorry. I'm not
making it long to try to browbeat people; I'm making it long because I
don't know how to express how I feel in fewer words and still try to
capture the nuance and complications.

Thanks for writing this! I find myself in agreement with most (if not
all) of it, but what is puzzling me is how this differs from the other proposals presented earlier.

Is it possible give a short summary of principles that differs in your
thinking from Thorsten Glaser's proposal? I find find myself agreeing
with both of you.

To me I think we have at least two camps:

1) We must have DFSG-compliant licensing of source code for everything
in main, and that source code should encompass everything needed for a
skilled person to re-create identical (although possibly not bit-by-bit identical) artifacts.

2) It is acceptable to not have DFSG-compliant licensing for things that
aren't important for Debian and still ship those, because doing so helps
our users and helping users is more important than DFSG-licensing. My perception is that this is already the practical defacto situation for
Debian today, given the exceptions people prefer to ignore. I further
think that many DD's are okay with this situation and are frustrated
when we even concern ourselves with this problem (witness the reaction
from several DD's on the GPLv2-git vs OpenSSL license issue).

I think your thinking and Thorsten's proposals are in the first camp,
and Aigar and Sam Hartman's proposal are in the second camp.

What other positions are there? Maybe there are some nuanced camps
between the two above, muddling things.

Of course, I may also be completely misunderstanding the positions
people have here. I'm just trying to summarize for my own understanding
so I appreciate clarifications.

Neither position has much to do with AI models as far as I can tell. Is
there any complication beyond size and infrastructure to recreate models
that are a factor here? Or is this "just" a re-hash of the perpetual
main vs non-free discussion?

/Simon

--=-=-Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQNoBAEWCAMQFiEEo8ychwudMQq61M8vUXIrCP5HRaIFAmgkPyIUHHNpbW9uQGpv c2Vmc3Nvbi5vcmfCHCYAmDMEXJLOtBYJKwYBBAHaRw8BAQdACIcrZIvhrxDBkK9f V+QlTmXxo2naObDuGtw58YaxlOu0JVNpbW9uIEpvc2Vmc3NvbiA8c2ltb25Aam9z ZWZzc29uLm9yZz6IlgQTFggAPgIbAwULCQgHAgYVCAkKCwIEFgIDAQIeAQIXgBYh BLHSvRN1vst4TPT4xNc89jjFPAa+BQJn0XQkBQkNZGbwAAoJENc89jjFPAa+BtIA /iR73CfBurG9y8pASh3cbGOMHpDZfMAtosu6jbpO69GHAP4p7l57d+iVty2VQMsx +3TCSAvZkpr4P/FuTzZ8JZe8BrgzBFySz4EWCSsGAQQB2kcPAQEHQOxTCIOaeXAx I2hIX4HK9bQTpNVei708oNr1Klm8qCGKiPUEGBYIACYCGwIWIQSx0r0Tdb7LeEz0 +MTXPPY4xTwGvgUCZ9F0SgUJDWRmSQCBdiAEGRYIAB0WIQSjzJyHC50xCrrUzy9R cisI/kdFogUCXJLPgQAKCRBRcisI/kdFoqdMAQCgH45aseZgIrwKOvUOA9QfsmeE 8GZHYNuFHmM9FEQS6AD6A4x5aYvoY6lo98pgtw2HPDhmcCXFItjXCrV4A0GmJA4J ENc89jjFPAa+wUUBAO64fbZek6FPlRK0DrlWsrjCXuLi6PUxyzCAY6lG2nhUAQC6 qobB9mkZlZ0qihy1x4JRtflqFcqqT9n7iUZkCDIiDbg4BFySz2oSCisGAQQBl1UB BQEBB0AxlRumDW6nZY7A+VCfek9VpEx6PJmdJyYPt3lNHMd6HAMBCAeIfgQYFggA JgIbDBYhBLHSvRN1vst4TPT4xNc89jjFPAa+BQJn0XTSBQkNZGboAAoJENc89jjF PAa+0M0BAPPRq73kLnHYNDMniVBOzUdi2XeF32idjEWWfjvyIJUOAP4wZ+ALxIeh is3Uw2BzGZE6ttXQ2Q+DeCJO3TPpIqaXDAAKCRBRcisI/kdFon2YAP448Hs7BdmN N97YsNknubSuTH04PyFWKbkRRnabiAy9LAEA+8GpunbVKURA7Fhuq3UuqCjEPPLK l9a9gbmLtQkldAQ=H5Hq
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon McVittie@21:1/5 to Russ Allbery on Wed May 14 11:50:01 2025

On Tue, 13 May 2025 at 22:00:01 -0700, Russ Allbery wrote:

Second, in the specific case of *software*, I think our current compromise
is over-broad in what it protects. Software is frequently *not* a deeply >meaningful creative human communication that reflects its creator. It's
often algorithmic, mechanical, and functional, attributes that, elsewhere
in our copyright compromise, define works that are not protected by >copyright. I don't consider protecting every software program as strongly
as a novel or painting to be morally justifiable.

I think this is a good thing to think about for a lot of the content
that we distribute, and how we apply FOSS principles to it. I don't
think this is a binary, I think it's a spectrum, with purely functional
things at one end and purely artistic/expressive things at the other. We
could say that engineering is closer to the functional end of that scale
than art is, but neither are actually at the extremes - a lot of
engineering has some amount of creativity and even aesthetics involved
(a bridge needs to stay up, but a *good* bridge also doesn't look ugly)
and a lot of art requires some amount of necessary pragmatism to make it something that can exist in the real world (it doesn't matter how
beautiful your statue would hypothetically be if it collapses under its
own weight).

Many of the executable programs we ship are mostly functional and only a
little bit artistic/expressive, but for example the recently-introduced Ceratopsian themes for Debian 13 are mostly artistic and only a little
bit functional. I don't think we should take it for granted that the
same self-imposed rules for both of those are necessarily appropriate:
the closer something is to the functional end of the scale, the more
important I think the DFSG's principles are for it.

One of the reasons I'm more comfortable with packaging non-Free games
than on other classes of non-Free software is that I consider games -
and in particular the non-executable parts of games, like the levels and textures and so on - to be closer to the artistic/expressive end of that
scale than the pragmatic/functional end. Game engines and scripts are
more pragmatic/functional than the non-executable data they act on, but
not as much so as for example the text editor I'm using to compose this
email; and as a result I'm more willing to tolerate missing source code
for a game than missing source code for a text editor.

(As with elsewhere in this thread, I'm carefully avoiding saying
"software" because of the ambiguity between "software is any work in a
digital format" and "software specifically means executable programs".)

smcv

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Simon Josefsson@21:1/5 to Aigars Mahinovs on Wed May 14 12:10:01 2025

Aigars Mahinovs <[email protected]> writes:

On Wed, 14 May 2025 at 08:58, Simon Josefsson <[email protected]> wrote:

To me I think we have at least two camps:

1) We must have DFSG-compliant licensing of source code for everything
in main, and that source code should encompass everything needed for a
skilled person to re-create identical (although possibly not bit-by-bit
identical) artifacts.

2) We must have DFSG-compliant licensing of source code for everything
in main, but training data is not part of source code. Instead source code for
training models would be code and protocol describing how to generate
or gather training data in such a way that a skilled person would be able to re-create functionally the same (although not identical) artifacts. If re-creation
is impractical (due to compute costs) then the model must also be modifiable after training by a skilled person with tooling in the archive.

Thank you for articulating this! How is that any different from my 2)?

2) It is acceptable to not have DFSG-compliant licensing for things that aren't important for Debian and still ship those, because doing so helps
our users and helping users is more important than DFSG-licensing.

I don't see any real difference between the positions above. Consider replacing 'aren't important' in my definition with 'aren't source' to
see how they map. In both situations, there would be some things (e.g.,
AI models) in 'main' that aren't licensed DFSG-compliant. How to get
there seems mostly about changing the right definition of some words.
But maybe your position is enough different in that there are at least
three clear camps.

/Simon

-----BEGIN PGP SIGNATURE-----

iQNoBAEWCAMQFiEEo8ychwudMQq61M8vUXIrCP5HRaIFAmgkasEUHHNpbW9uQGpv c2Vmc3Nvbi5vcmfCHCYAmDMEXJLOtBYJKwYBBAHaRw8BAQdACIcrZIvhrxDBkK9f V+QlTmXxo2naObDuGtw58YaxlOu0JVNpbW9uIEpvc2Vmc3NvbiA8c2ltb25Aam9z ZWZzc29uLm9yZz6IlgQTFggAPgIbAwULCQgHAgYVCAkKCwIEFgIDAQIeAQIXgBYh BLHSvRN1vst4TPT4xNc89jjFPAa+BQJn0XQkBQkNZGbwAAoJENc89jjFPAa+BtIA /iR73CfBurG9y8pASh3cbGOMHpDZfMAtosu6jbpO69GHAP4p7l57d+iVty2VQMsx +3TCSAvZkpr4P/FuTzZ8JZe8BrgzBFySz4EWCSsGAQQB2kcPAQEHQOxTCIOaeXAx I2hIX4HK9bQTpNVei708oNr1Klm8qCGKiPUEGBYIACYCGwIWIQSx0r0Tdb7LeEz0 +MTXPPY4xTwGvgUCZ9F0SgUJDWRmSQCBdiAEGRYIAB0WIQSjzJyHC50xCrrUzy9R cisI/kdFogUCXJLPgQAKCRBRcisI/kdFoqdMAQCgH45aseZgIrwKOvUOA9QfsmeE 8GZHYNuFHmM9FEQS6AD6A4x5aYvoY6lo98pgtw2HPDhmcCXFItjXCrV4A0GmJA4J ENc89jjFPAa+wUUBAO64fbZek6FPlRK0DrlWsrjCXuLi6PUxyzCAY6lG2nhUAQC6 qobB9mkZlZ0qihy1x4JRtflqFcqqT9n7iUZkCDIiDbg4BFySz2oSCisGAQQBl1UB BQEBB0AxlRumDW6nZY7A+VCfek9VpEx6PJmdJyYPt3lNHMd6HAMBCAeIfgQYFggA JgIbDBYhBLHSvRN1vst4TPT4xNc89jjFPAa+BQJn0XTSBQkNZGboAAoJENc89jjF PAa+0M0BAPPRq73kLnHYNDMniVBOzUdi2XeF32idjEWWfjvyIJUOAP4wZ+ALxIeh is3Uw2BzGZE6ttXQ2Q+DeCJO3TPpIqaXDAAKCRBRcisI/kdFotIPAP4wVtPjw6nb dx1sM5MssMII/v+pIqmEjlw2necmMq8fMgEAsirzP+Ek7bCSscERRkToRPqUVo7g 4so2l0YuD1HHvAk=
=NkYg
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Simon Josefsson on Wed May 14 17:50:02 2025

Simon Josefsson <[email protected]> writes:

Thanks for writing this! I find myself in agreement with most (if not
all) of it, but what is puzzling me is how this differs from the other proposals presented earlier.

Is it possible give a short summary of principles that differs in your thinking from Thorsten Glaser's proposal? I find find myself agreeing
with both of you.

I don't know that it does differ! To be honest, I was trying to reduce the amount of time I was spending on this discussion (I certainly failed at
that), and therefore was doing the old debian-vote hack of only paying
close attention to proposals that had reached the required number of
seconds and relying on other people to winnow down the number of things I
had to examine in detail. So I have not taken a close look at the
specifics of Thorston's proposal (other than the side point that I don't
think we know whether non-free firmware contains machine learning models
and therefore aren't really in a position to make rules about that).

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Russ Allbery@21:1/5 to Aigars Mahinovs on Wed May 14 19:10:01 2025

Aigars Mahinovs <[email protected]> writes:

Thank you very much for the write-up. I do highly appreciate the time
taken to express your position. Now we do have a clear and coherent
moral position that could be read and understood.

I clearly disagree on a few key things, but, as you said, there is
little point discussing it if no decision is to be made right now.

Thank you, and also I owe you an apology. The discussion here has been
helpful to me in figuring out what my position is (I didn't have a firm position at the start of the discussion), but once I figured that out, I
should have stopped trying to argue and just laid it out. It's the trap
that I keep falling for where it feels like it would take less energy to
just make this one point, but it makes the conversation less coherent and,
with replies and misunderstandings and the temptation to turn up the
rhetoric, doesn't take up less energy either.

You are correct that I started characterizing how your arguments came
across to me rather than just stating my own position, and in the process failed at disagreeing without being disagreeable. I am sorry for doing
that. My annoyance clearly broke through in ways that weren't productive
or particularly respectful.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Soren Stoutner@21:1/5 to All on Wed May 14 12:12:53 2025

On Wednesday, May 14, 2025 5:31:53 AM Mountain Standard Time Aigars Mahinovs wrote:

On Wed, 14 May 2025 at 00:03, Soren Stoutner <[email protected]> wrote:

On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:

2. What is the preferred form of modification? This is IMHO the
deciding, relevant question.
Aigars says weights and I've heard that from several other people active in machine learning. OSI says the same.
Mo Zhu says training data is. I haven't heard that from anybody else.

I thought several other people besides Mo Zhu had also said that on this list, but just in case they haven’t, I would like to go on the records

that

I also feel that training data is one of the preferred forms of modification in machine learning and should be thus considered for

anything

being included in main.

Could you expand a bit on this topic, so I can understand this position better?

Say that we are talking about an otherwise-free LLM model trained on a multi-gigabyte data set. Data from the dataset may be downloaded from
the Internet (but may not redistributed by Debian). Let's assume that
the source code of the LLM also includes a script that would, if
executed, do all the downloading and formatting of the training data
from Internet sources for you. The data *may* even be binary identical
to the original training data (if it is only trained on snapshotted
data mining collections that one can download from torrent via a
magnet link for example), or it may be in a newer state than when it
was trained originally (if you choose to switch to newer snapshots or
if data collection happens directly from source servers or their
proxies). You can add, remove or filter data sources to modify the
contents of the training data on a high or granular level.

Would that be a sufficient definition of training data to satisfy the preferred form of modification criteria for you?

If Debian cannot redistribute the training dataset (part of your description above), then it cannot be in main. If the LLM model source code is DFSG-free but depends on this non-DFSG free training data or weights derived from it, then it is fine if it goes in contrib. The weights derived from this non-DFSG free training data can go in non-free as long as Debian can redistribute them.

If there is a scenario where the LLM can work with several different data sets derived from different training data, and some of those data sets are DFSG- free while others are not, then the free data sets and the model can go in main. It can depend on, recommend, or suggest the DFSG-free weights and data sets in main. But it can only suggest those in non-free.

I find this understanding to accomplish two things.

1. It is a consistent application of DFSG principles to machine learning applications.

2. It makes the benefits of non-DFSG ML applications available in non-free to those who would like to use them.

If any use of the original training data (or of its description as
above) requires 100 000 Nvidia H100 cards running for a month using a
few billion USD of investment and several million dollars of
electricity, does that training data *still* satisfy the criteria for "preferred form of modification"?

I find discussions about how much hardware it takes to process the training data to be orthogonal to a discussion of whether a ML training dataset is DFSG-free, so I don’t feel it is useful to discuss here.

And, to ask explicitly, is raw training data a better form of
modification for you compared to a description of that same training
data, in automated form that would generate the training data for you
on request?

1. Raw training data is non-negotiably required for me to consider a ML application DFSG-free.

2. Additionally, a description of that training data would also be nice, but I don’t think it would be non-negotiably required. However, I might be open to arguemnts that both should be required.

Is it important for you if the training data *only* comes to you from
Debian mirrors? Or is the same data coming to you from other sources
also fine?

For main, yes, I think it must come from Debian mirrors.

For non-free, I don’t see a difference between Debian mirrors or a script that
downloads the data from some other source on the internet, as your describe in your example above.

I should note that I do not feel as strongly about this point as I do about the training data being available under a DFSG-free license to be in main. So, if Debian decides that hosting the training data on some Debian approved location that is not an official Debian mirror is acceptable, I wouldn’t push
back against that as long as the training data itself was DFSG-free could be included in main in the future if we ever decided to do so.

In my opinion, it is fine to include otherwise distributable ML

applications

without available training data in non-free.

Technically - yes, and I would be fine to include OSI-free AI in
Debian non-free, but IMHO it does nothing to resolve ethical concerns.
If we limit that to only OSI-free AI then that would also be giving
the same kind of guidance to the AI community - with both upsides and downsides.

I would go beyond that to say that we can host things on non-free that even OSI does not consider free as long as we have the rights to distribute it. We already do that for a number of other things in non-free.

I think the practical result of the policy I describe above would be that most ML applications would end up in non-free. But I also think that a smaller number of ML applications would end up in main, especially as developers start intentionally creating fully DFSG-free ML applications and training data sets.

Also, as has been mentioned on this list, the discussion about LLMs has caused us to look more closely at other MLs already in main, like some games or image and audio processing applications. In the past, I, along with others, didn’t
think too deeply about the training data used to create the weights used by those MLs. If the above standards are adopted, it would require moving some of these games and image and audio processing applications to contrib with their weights in non-free. In other cases, it might be possible to retrain them on DFSG-free training data sets, especially if upstream is interested in doing so.

I think this would actually benefit the free-software movement in the long term, even if it requires a bunch of work to address it now. If such a change were made, I would be in favor of doing so at the beginning of a release cycle so that Debian and upstream developers have a couple of years to figure out how to either keep the software in main or move it to contrib and non-free in such a way that it is not disruptive to users.

--
Soren Stoutner
[email protected]
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEJKVN2yNUZnlcqOI+wufLJ66wtgMFAmgk6zUACgkQwufLJ66w tgOwCQ/7BHmfva901XBtEgcqQC+cAuEMow768eNhJg9GGMju3Xa3YLC32drCILFH KQx0TvZw5MHGFzhpmyVLTa3hRxaGxfdtnN8ZvlcblxcqioGEWXdEaRZEB4z9LHMt VlI6rs8RUK9rUaQxQkzU7h47CV5h2pYM8yyF8girLBkLYeomeDbXXC41Xef4odqF 6xtBVAXIZbUxXRgtTlTkP9+47Vqu+c+6kN5XeWgMscJqW8/dVwxKB6wvCaCnJX36 UkM61dv5zQhG5t0cJr4ch18SBNKC2Gx9iBMW9ONJsLomiFkQVhOE9HhBesQd+p05 1YFIzskcG6NOjrZy4bE39ZZ8fyurm3BTVkBruRTThuF/2Z5UDEbAnnN3antVoTDN XLi0DUPNEfyRv5EObKdcehVuM1sPlI5/go8c+Nu54lnqtasOkURgIleR6iBD4hu5 3NGRZWekmiln1acUj4SgUDTOT3pTa8gsU7/pIR4MkdKiqVyR72bL7uHXFoz8Pm6T 2ieAZe+OTl/jASc18/m4ND6uD6sTak2lT4APUVn/cX79zvz2KpGqxKTIqrGxDzR8 orN4BjBXsZ+NF5hKePSeDwgZH+ieTdTCgYI3md4qY/RcL1HuQzl4xTuJeyjv7JmD JWtIMH+bPSC9nhg8JV7JAaJTXOdv78h/0v1db+WLRJy68f6Nl70=
=UjTi
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Soren Stoutner@21:1/5 to All on Wed May 14 14:12:35 2025

On Wednesday, May 14, 2025 1:51:27 PM Mountain Standard Time Aigars Mahinovs wrote:

That is not what I asked. Redistributing is a completely different
question from a different point of DFSG and even from interpretation
of whether DFSG even applies to the training data as such. And that in
turn very specifically depends on a very isolated question - what is
the preferred form of modification. And that is why I am
*specifically* asking how your opinion that "training data is the
prefered form of modification" works in real world examples.

Only that specific criteria. Not about Debian, not about main or
non-main. Not for other people or for the project.

What does "preferable form of modification" mean for *you*? For
example in that case above. Is the raw training data *really* _the_ preferable form of modification? Or is it the data definition? Which
would you *prefer* to *modify*?

In my opinion, the preferred form of modification is the raw training data. I apologize if I did not make this clear in my previous email. I thought I had.

For an ML to be included in main, the raw training data must also be included in main.

To be more specific, to be included in main, the ML program source code, the weights (which can be thought of as the compiled training data), and the raw training data (which can be thought of as the training data source code) must all be DFSG-free and must all be included in main.

Otherwise, the DFSG-free parts can be included in contrib and the non-DFSG- free parts can be included in non-free (assuming we have the right to distribute them).

I know there are people who disagree with this assessment of the raw training data being the preferred form of modification, and I can understand and respect those who feel that way. That is, after all, the core of this entire discussion, which would not exist if there weren’t disagreements on this issue. However, as you asked my opinion of what Debian’s policy should be, I
endorse the above.

--
Soren Stoutner
[email protected]
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEJKVN2yNUZnlcqOI+wufLJ66wtgMFAmglB0MACgkQwufLJ66w tgOSCw/8DoiTDlrhyIIp6xlGgdm8s1HHmMFIeEZpTGwPts8llgFcaDB6PSdayIUd b4bthmhcSmeuOrNcGckRkcW4O8O/doIQ+Zafs1VMb0N1KmJXv0pzlG25vMhib8CX gNmszuUse8WFsDbjiQU+tHl9wISqX4x/O5ZNjVM8kCWUdC0aQHe15HCGEnTIqvwl flsxzqxN2nmBPkbnMjViOrVFOMO69z/txie7xxBq4nkB438St42IkTXp8DKckNjg JgN3MpSR1hdakL3qP+Ppt9PbszwgB1TCWCV25QaBWTZsy6ASX9vhjmXpAsa3fRol Xx7fUOhJb+QorU15ylUOo32AO9oCv21KBTtHjavt4CJBRQzbxnixwwKlLiBTKDVb EHocYNE9pFsqJHszqalQUnIQwmjks53yRBlTxk6cA8ItXj/8MKrw1ucXQrwSU4Qu KhGjbDSyuvQwv27Ufmph2c8iyMLetMCq5SvZ+P7modVrQ7c4gjVanLHCduHa3O+C T71V8KRB/cQ2cZpZGXfTtkwKAaUHI/VidI9VuQptEoZFWaemLS0ZaxiPbp/ZeREP 0LhkPinn+SLbOYZIQLwGhQAEUaSbDd6KpB9+edG74wehzRHOW8U9x7X8ckIdnUPW eeW/B8SIa0/Ol9KzBu2jDsUsKYEwjWHvbJMa/TpUbZAZEh9Ul4g=
=aVxr
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Arian Ott@21:1/5 to [email protected] on Thu May 15 02:10:01 2025

Good afternoon,

I have followed the ongoing discussion concerning the interpretation of the DFSG in relation to AI models and would like to contribute my personal perspective. To provide a coherent framework for my argument, I will begin
by outlining how I perceive Debian’s role in the broader technological ecosystem, before elaborating on the implications for AI and data openness.

Debian is widely regarded as a cornerstone in the open source landscape.
Its influence spans from individual desktop users to global hyperscale deployments. The project is trusted not only by private users and
educational institutions, but also by major corporations such as Amazon and Microsoft. This widespread adoption is a testament to Debian’s technical reliability and its principled commitment to software freedom.

At the heart of Debian lies a shared philosophy: to promote the
distribution of free and reproducible software. While the DFSG provides a
more detailed policy framework, this philosophy is fundamentally rooted in
the ideals of FLOSS. It is precisely this commitment that has enabled
Debian to serve as the foundation for numerous derivative systems and
research initiatives.

In my dual studies in Business Informatics with a specialisation in Data Science, I have consistently observed a strong alignment between the FLOSS philosophy and academic best practices, particularly in the context of transparency and reproducibility.

Today, many domains contend with vast and complex data landscapes.
Analysing such data, whether for research, operational insight, or
innovation, is increasingly beyond the scope of manual effort alone.
Machine learning and AI methods are indispensable tools in this process. However, they must be applied in a manner that is methodologically sound
and ethically responsible.

During the course of my semester thesis on Retrieval-Augmented Generation (RAG), I encountered a compelling example wherein an AI model identified a previously unknown biomarker associated with cancer. This discovery was
only possible because the researchers had access to the underlying dataset. Without that access, the model’s findings would have been opaque and potentially unverifiable.

This brings me to a central concern: when data scientists are given a model
to work with, their first question is often:
“What data was used to train it?”
This question is not incidental. It is fundamental to understanding the model’s behaviour, biases, and limitations. It is also essential for scientific reproducibility.

In the course of the earlier email exchange, it was argued that the
hardware requirements for training large-scale models place them out of
reach for anyone without a budget in the range of 100 M€. While this may be true for frontier-scale models, I believe it overlooks a significant
portion of real-world use cases.

In my undergraduate work, we frequently relied on publicly available
datasets from sources such as Kaggle. These enabled us to train our own
models, interpret results, and explore data-driven questions in a hands-on manner. Providing access to training data empowers researchers,
institutions, and independent developers to create models adapted to their specific needs. Moreover, it facilitates the composability of data, an essential feature in interdisciplinary research and real-world applications.

Debian’s commitment to reproducibility and openness logically extends to
the realm of AI. Distributing a model without its corresponding training
data violates this principle and undermines the ability of users to
validate, audit, or adapt the model for their own contexts.

If Debian were to allow AI models to be packaged without the accompanying
data, it would risk reducing its standards to those of existing platforms
such as Hugging Face, where reproducibility is often not enforced. In
contrast, requiring training data to be available fosters trust, academic rigour, and long-term sustainability.

*The strategic value of Debian enforcing open data is clear:*

Data scientists and developers can rely on Debian-hosted datasets being
legally sound and freely reusable.

This lowers the barrier to entry for high-quality, ethical AI development.

It also positions Debian as a trusted ecosystem for research-grade and production-ready AI tooling.

There are, of course, multiple perspectives within this discourse, and I am fully open to engaging with alternative views or refining these points
further.

*In summary, my vision would include:*

Making datasets available through apt or similar tools

Treating AI models as first-class citizens in Debian’s packaging ecosystem

Enforcing that models included in Debian main must be accompanied by the training data that enables their reproducibility

Kind regards,
Arian Ott
Student in Business Informatics – Data Science
Member | IEEE
Email: [email protected]
LinkedIn: in/arian-ott

On Wed, 14 May 2025, 23:38 Aigars Mahinovs, <[email protected]> wrote:

On Wed, 14 May 2025 at 23:13, Soren Stoutner <[email protected]> wrote:

On Wednesday, May 14, 2025 1:51:27 PM Mountain Standard Time Aigars

Mahinovs

wrote:

That is not what I asked. Redistributing is a completely different question from a different point of DFSG and even from interpretation
of whether DFSG even applies to the training data as such. And that in turn very specifically depends on a very isolated question - what is
the preferred form of modification. And that is why I am
*specifically* asking how your opinion that "training data is the prefered form of modification" works in real world examples.

Only that specific criteria. Not about Debian, not about main or non-main. Not for other people or for the project.

What does "preferable form of modification" mean for *you*? For
example in that case above. Is the raw training data *really* _the_ preferable form of modification? Or is it the data definition? Which would you *prefer* to *modify*?

In my opinion, the preferred form of modification is the raw training

data. I

apologize if I did not make this clear in my previous email. I thought

I had.

You would *actually* technically, in reality, prefer digging through gigabytes of text files and do some kind of manual modifications in
that sea of raw data? Modifications that are basically impossible to
track in any kind of change tracker. That are excessively hard and
time consuming to actually do and check. Instead of just adjusting
input parameters on the ingest script? *That* is what I consider to be frankly very hard to believe.

I rather get the impression that you prefer expressing this position
because of the logical consequences on the discussion. Especially if
you immediately change the topic from prefered form of modification to redistribution and DFSG and main and other things that are entirely irrelevant to the question of what is the prefered form of
modification. Technically. In practice. Not morally or spiritually.

However, as you asked my opinion of what Debian’s policy should be, I endorse the above.

That is *very* explicitly *not* what I asked your opinion on. I asked
you to consider very specific examples and what is the prefered form
of modification in those cases. Really consider.

--
Best regards,
Aigars Mahinovs

---
Arian
[email protected]

<div dir="auto"><div dir="auto">Good afternoon,</div><div dir="auto"><br></div><div dir="auto">I have followed the ongoing discussion concerning the interpretation of the DFSG in relation to AI models and would like to contribute my personal perspective.
To provide a coherent framework for my argument, I will begin by outlining how I perceive Debian’s role in the broader technological ecosystem, before elaborating on the implications for AI and data openness.</div><div dir="auto"><br></div><div dir="
auto">Debian is widely regarded as a cornerstone in the open source landscape. Its influence spans from individual desktop users to global hyperscale deployments. The project is trusted not only by private users and educational institutions, but also by
major corporations such as Amazon and Microsoft. This widespread adoption is a testament to Debian’s technical reliability and its principled commitment to software freedom.</div><div dir="auto"><br></div><div dir="auto">

From Soren Stoutner@21:1/5 to All on Wed May 14 17:22:46 2025

On Wednesday, May 14, 2025 5:04:03 PM Mountain Standard Time Arian Ott wrote:

During the course of my semester thesis on Retrieval-Augmented Generation (RAG), I encountered a compelling example wherein an AI model identified a previously unknown biomarker associated with cancer. This discovery was
only possible because the researchers had access to the underlying dataset. Without that access, the model’s findings would have been opaque and potentially unverifiable.

This brings me to a central concern: when data scientists are given a model to work with, their first question is often:
“What data was used to train it?”
This question is not incidental. It is fundamental to understanding the model’s behaviour, biases, and limitations. It is also essential for scientific reproducibility.

That is a good, concrete example. It is interesting that access to the original training data has value that goes beyond a desire to retrain the model and extends into *using* the model to its fullest extent.

In the course of the earlier email exchange, it was argued that the
hardware requirements for training large-scale models place them out of
reach for anyone without a budget in the range of 100 M€. While this may be true for frontier-scale models, I believe it overlooks a significant
portion of real-world use cases.

In my undergraduate work, we frequently relied on publicly available
datasets from sources such as Kaggle. These enabled us to train our own models, interpret results, and explore data-driven questions in a hands-on manner. Providing access to training data empowers researchers,
institutions, and independent developers to create models adapted to their specific needs. Moreover, it facilitates the composability of data, an essential feature in interdisciplinary research and real-world applications.

Out of curiosity, how much hardware did you need to train your own models on these data sets? I think sometimes we forget that many of the MLs use data sets that are much smaller than LLMs scraping the entire web.

--
Soren Stoutner
[email protected]
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEJKVN2yNUZnlcqOI+wufLJ66wtgMFAmglM9YACgkQwufLJ66w tgPDlRAAp3CpyL1/lBgFmkBulTkzI9xEfPPAmLaTfdVCiz4tIaiA5Io1R/OjAlBW RUlcspO0d36MI2I/Gxh4az7FGOAURbaC5Nw0sNYWP+ibSBJ/P4Qa8LwCxBPykAr4 d+5O8/q50J508PmQ2QABmcBUNPA0dRcBOrlElroNup7DOl4N51OxgPxU+7zxuk67 AfbZF66B6imTDCEQpPwElAdgK4HO3fV6M4dND1YpLVr5nEuK8vIgxIWl2uVuiDBE k0Lig4rll1OG+/myu99QgL/9R1XhTkHatm8WR61S1nm/taHDiM/JbrAGJEbzo9/W N308lN5DLH0gZmIjOasdMooZ98WWtPzGkgcsox8lhrk8LqmRXOU0vUiVzc36U4cx uu9cESHlfU2tDqveY/Pbb2e3+FZktjvSh4CKRaT54pvWH/poCuoLRw/szh/AxOOR byHxnApVDc3NGHjTHfTPIO1IBdngxvQ4BYF/IY5tLeg4ABjxyiagoo1tWvXKySk+ H28jCFLEWXA7vaLiUxJPcpWBjfS12ak7QNLh1bW+cV/+v1d+EEG6C6WtUPPjisbY r3A0HYXVmG26GA44QbVk0uyvW61Mh5mX8Zc8mUxhTDaUsm4+OmR0K+RE5ylkmxEg D36xTOY6+xTWgushFEQWViJMYN/TG9Eb0y9t5HsttoY/Uw8lI1A=
=hwOQ
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefano Zacchiroli@21:1/5 to Aigars Mahinovs on Thu May 15 10:00:01 2025

On Wed, May 14, 2025 at 10:51:27PM +0200, Aigars Mahinovs wrote:

It is absolutely critical to a very specific DFSG question on what is
the "prefered form of modification".

This is maybe a side terminological note, but note that the DFSG talks
about "source code", not "preferred form of modification". The latter expression comes historically from the GPL, rather than DFSG (or OSD).

I'm personally attached to the notion of "preferred form". I also
believe Debian is currently interpreting "source code" as equating the
stronger notion of "preferred form"; e.g., we require to have the
non-minified versions of JavaScript bundle in main. But ML models might
be the first case in which there can conceivably be multiple "source
codes" for the same artifact (ML weights), modifiable in different ways
(e.g., fine-tuning vs re-training), which are "preferred" by different
users for different use cases.

I don't know if this justify reconsidering our stance on what "source
code" means, or our stance on the fact that we consider all types of
artifacts (source code, images, data, etc.) as being equal. I just
wanted us to keep in mind that "source code" and "preferred form of modification" are not *necessarily* the same thing.

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmglnWkACgkQfH5Cj5NB J5kHdA//YEfeRhka7HgZfUrjdkD4fwJJrsEA1ssoTMILx0G9V4KMMdGdIo8IRE4q WTSHBjbhPCYj1eh76djM0YKpiuABfcMxtvqyN/KjiE3ppSsviPIQGhdaV4vNc5bv LRg9aaFq9OJfAFRjBN9kAyp/e/A+vRepSqHkVH3Mjyo9xd49XIvVc+vwsxo7AIZU PvysDb8zyY8LbGaWPAKQmcowYuCGZwcGhhZNot05TGJovSCKX9ovWcprLjhxA3Wt wWihoZgdA/yfZyhsHfgAPZ43JYfivpWKQaq+MzzCezF5SYXEk+7QZAKLpfWbngIg Mt5FyejO05AyYuOItiJEG55Nqtg5dG7Y9hUQOGXiqm/e1Ete9nbNn+877cC1VCeM GF6JZtAaTDRPqF2iwOQH8M6pOnC3pRv3bYx4TfX13jIwSdU3ZxM4AlK0XqO1szJR +17ppMPOHtYBLtVcJujfwQ0HbGRp/Q0xs28kpnsPQx7qT0rPq3Yfq/ydO2hho588 BG+XPLoQX4DT3YCnLk4skO

From Stefano Zacchiroli@21:1/5 to Aigars Mahinovs on Thu May 15 10:10:01 2025

On Wed, May 14, 2025 at 11:38:02PM +0200, Aigars Mahinovs wrote:

You would *actually* technically, in reality, prefer digging through gigabytes of text files and do some kind of manual modifications in
that sea of raw data? Modifications that are basically impossible to
track in any kind of change tracker. That are excessively hard and
time consuming to actually do and check. Instead of just adjusting
input parameters on the ingest script? *That* is what I consider to be frankly very hard to believe.

Aigars, I'm sympathetic to your general stance in this debate, but I
think you push it too far, in the following sense.

It is undeniable that *some* modifications of a trained ML models are
possible starting directly from the model weights. I also personally
agree that, at least for big models, *most* modifications (counted in
terms of use cases and/or users actually doing them) will happen
starting from model weights via techniques like fine tuning.

But I don't think it is disputable that the *most general* way of
modifying an ML model is achievable only starting from the full training dataset and pipeline. There are simply things that you cannot do
starting from the trained model. You are right that, for big models at
least, it will be unpractical to do those changes, and that most actors (including Debian) will not have the resources to do the re-training.
But that should not lead us to equate the scenario in which training
data is available to that in which it isn't. I think our debate in
Debian should be about where do we put the bar of what is *required* to
be in main, without dismissing the fact that it *is* better to have
training data than not having it.

Do you agree with the above?

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmgloIYACgkQfH5Cj5NB J5m6CRAAj6LaV92aK7hMZcM7v6n/9fk5/XuoE+/HGzxzK56HDSMThbO6PqB7RYP6 2JuS/foxrOqUX99GUcksr8MswzolkTIMgYSKLvTXqRlhF1/2vJ+d4BNOkyDYsaCQ UPsuDiMJ/wPApTHuPynLISC6t3OnkOMtrJh1dvQdYNh4JXMaR5uKlezO4xPvLiLl TVEydjKhIlwOkbJllGqoUmF11OgS5PqGv2maNQtBPFI+aar5rc+6L74UPHMgmZpp ERsPVXtEuVBldzmgN+rshX5aEjKGp94f8ERZ2N/MFDtYm0nE/IWI7nnn11Fz8+ry 4WWF1eaDH/AMZTACzjwy49e5i2SJlhh2DSL6981Fm6r/z/nu2YLcaD5m4HmIsgvG NTP+dsyPd4Kcl75zhSN3FdAsZHwS959WXDzw9rxGHRFHWDii5JLCKS6bR50REH7U p0N+7RiW7OfghLlDFOv/kp3KyhF2WsP7hxQ6lKiXZeEg0JO0icBcQc6D6ORfU/eV 9OEbJXRLcMbE3PkrUBTfZX

From Jonas Smedegaard@21:1/5 to All on Thu May 15 10:20:01 2025

Quoting Stefano Zacchiroli (2025-05-15 09:53:14)

On Wed, May 14, 2025 at 10:51:27PM +0200, Aigars Mahinovs wrote:

It is absolutely critical to a very specific DFSG question on what is
the "prefered form of modification".

Nitpick: It is "preferred form *for* modification".

That nit aside, the notion of preferred form also resonates well with my understanding of what we aim for, despite that specific choice of words
not originating in our camp.

- Jonas

--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/
* Sponsorship: https://ko-fi.com/drjones

[x] quote me freely [ ] ask before reusing [ ] keep private

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andrey Rakhmatullin@21:1/5 to Stefano Zacchiroli on Thu May 15 10:20:01 2025

On Thu, May 15, 2025 at 09:53:14AM +0200, Stefano Zacchiroli wrote:

On Wed, May 14, 2025 at 10:51:27PM +0200, Aigars Mahinovs wrote:

It is absolutely critical to a very specific DFSG question on what is
the "prefered form of modification".

This is maybe a side terminological note, but note that the DFSG talks
about "source code", not "preferred form of modification". The latter >expression comes historically from the GPL, rather than DFSG (or OSD).

Debian sometimes talks about "preferred form of modification" outside of
the context of specific licenses. E.g. https://ftp-master.debian.org/REJECT-FAQ.html does that, I don't remember
if there are more official docs doing that.

--
WBR, wRAR

-----BEGIN PGP SIGNATURE-----

iQJhBAABCgBLFiEEolIP6gqGcKZh3YxVM2L3AxpJkuEFAmglotstFIAAAAAAFQAP cGthLWFkZHJlc3NAZ251cGcub3Jnd3JhckBkZWJpYW4ub3JnAAoJEDNi9wMaSZLh U7AP/jg2PuyzqNyhE2dCxHC6cXATLa3biLqz5D8fhmQqBgBve00k1H0ba4FA1pIz oZt0XjDDzePCtnNVomKIxyzZpvRubZk1NptfJqGa0GlWCPMx/0pBPs/QwHe52/TP 55grM+sQ2FsmJhAouq7m7xDP06KxWk6xEjcoh/r/R1OEKfp53RzcHB9sNsY/T6GS Kz4rFqiDPbHJqswERYZGdBi50AIAEP5t2q4RKo8NcuxN1roPF3uPVq4RIRzpKmG7 HyW7twB2jKgxKiiEBPn8wDcNvlVXcJFJj7t3ov20ICtBASHoPUA3KT8/DEeEoOTB xsBYq3OgjGUNMN73IohULLqgMIJziOHRDbpFgf1HoHGeu0mrcek0TY/ZAQQhVwWE /MGNTxOFOJohiHrs6RXW9gDXeEUw8m1NoBPvHXR25ggJj89N78Kg1dbUScLO/iEI Kl1aDak+P0negbKjFwJcy0Sb8j/zzaH6tU13QMtZ+jaK3jfoh4E8+Oqis48Bzr3Y /43QypBj+45YJCkfFMZpwEjs2YIAuBrrgAXx1TEyfSN42+QGuEVKw6DTSx7RA5tl Uks7qjxpQC7TBukNValhZimK6daS5V+B1Ph68nuGCAXV++MTzz4xg0bu5tOpAXWr ioIlnJVDbN27TcZlDEstHcNT51j7UKzweYze+TEICpB5xL6C
=I1QG
-----END PGP SIGNATURE-----

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefano Zacchiroli@21:1/5 to Aigars Mahinovs on Fri May 16 08:10:01 2025

On Thu, May 15, 2025 at 11:36:22AM +0200, Aigars Mahinovs wrote:

On Thu, 15 May 2025 at 10:06, Stefano Zacchiroli <[email protected]> wrote:

But I don't think it is disputable that the *most general* way of
modifying an ML model is achievable only starting from the full training dataset and pipeline. There are simply things that you cannot do
starting from the trained model.

This is not quite the point I was trying to make in this specific
thread. I was pointing out the difference between raw blob of training
data and pipeline that creates/gathers that raw blob of training data.

[..]

But I do think that it should be perfectly fine to have an ingest
pipeline that simply downloads " https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz
" for example.

Oh, I see. Thanks for clarifying, I indeed did not get that this was the
main point you were raising in this sub-thread.

FWIW, I agree that "where is it hosted?" is a less important question
wrt the one of whether the full/pristine training dataset is available,
for our users, *somewhere* in the first place. But note that if Debian
accepts not to host datasets on its own infrastructure, then a number of practical issues arises, e.g., what do we do with the package in main
if/when the data disappears from the external hosting place? (Yes, I
know those datasets are hosted by archives, whose mission is to preserve
data in the long run, but even archives can fail, might be forced to
delete data, etc. As long as we are not in control, anything goes.)

Cheers
--
Stefano Zacchiroli . [email protected] . https://upsilon.cc/zack _. ^ ._
Full professor of Computer Science o o o \/|V|\/ Télécom Paris, Polytechnic Institute of Paris o o o </> <\> Co-founder & CSO Software Heritage o o o o /\|^|/\ Mastodon: https://mastodon.xyz/@zacchiro '" V "'

-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEE8ZooXsFA+JEz681OfH5Cj5NBJ5kFAmgm1ScACgkQfH5Cj5NB J5mPsg//fbMbRp2v2ZF/rhEhJFRIMhqE3KVxT//OrTQLAx15c3El4yhikenNeNXb +gtfo+LBsHnrKcGFF83UwhwWqV1Jw1juNjITj1cZWd4FFuMtCakC1olBsK90vcdQ cj7DIRwhGbgeYc725PL29SZoCJKwSAnGGeLZnWlnXszdzCTRx0Dl8/U/37+zTQrG Ta5Ar8rA+S6N7gVc35q5pWKXfgTiyEq1tGAALFnfzddnUcTSg/HuICctJ4QotD8d dLH+CTqXk3Tv75Weafrqesmx4X+U3ClxPoTJrkIfWVUDBKLtUGCd2X2x1ip7yuWV B384QH0EtLZLc+eVh1I6a8VptHNe6hy9i2DKnSlz1LjiWopfrJc6f6MbKoQR1zq0 woJ3MoNOfERXFNtF1ylUSPmw/EyD/mSL5QWkxV3KeruJCIwCZgO1uYZ49ljvxdXU 8ZDdQNX9fEvpOU+tv+9Hc5P/3FzUV5lOwaQKK1qroywxNeNDeIjum6mhdZztFQ6f u8KquKAVJGPiftPQEXH5Ii

From Andrea Pappacoda@21:1/5 to Stefano Zacchiroli on Fri May 23 17:50:02 2025

Hi all,

On Fri May 16, 2025 at 8:03 AM CEST, Stefano Zacchiroli wrote:

FWIW, I agree that "where is it hosted?" is a less important question
wrt the one of whether the full/pristine training dataset is
available, for our users, *somewhere* in the first place. But note
that if Debian accepts not to host datasets on its own infrastructure,
then a number of practical issues arises, e.g., what do we do with the package in main if/when the data disappears from the external hosting
place?

This might have been asked before, but: wouldn't this be the perfect
use case for the contrib archive area? The model complies with the DFSG,
but requires software outside of the distribution to build.

Also, I had the impression that sometimes in this discussion "DFSG-free
model" and "model in the main archive area" have been used as
synonymous, while they are not. We can decide that a model is DFSG-free
if its training data is provided by their authors, but still keep data
outside of our archive and have the model live in contrib for as long as
also hosting the training data is inconvenient for us.

Oh, and lastly: not all models are huge! (and we already have some
fairly big packages in our archives)

Does it make sense?

Bye!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet
- Centurion
  Mon Jul 27 22:50:42 2026
  from Berea, Ohio via Telnet
- Ataricrypt
  Mon Jul 27 19:19:17 2026
  from England via Telnet
- Bob Worm
  Mon Jul 27 15:19:55 2026
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	51:51:02
Calls:	12,445
Calls today:	5
Files:	15,192
Messages:	6,537,256

Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Mode

Who's Online

Recent Visitors

System Info