If I wanted to package up my classifier state and distribute it under a
free software license, I think it should be DFSG free.
I think that to satisfy the DFSG I would need to include all the
training data I still had and any scripts I used.
"Ansgar" == Ansgar 🙀 <[email protected]> writes:
On Mon, 2025-05-05 at 14:27 -0600, Sam Hartman wrote:
If I wanted to package up my classifier state and distribute it under a
free software license, I think it should be DFSG free. I think that to
satisfy the DFSG I would need to include all the training data I still
had and any scripts I used.
And the training data would have to be under a DFSG-free license. I
doubt phishing or spam mail comes with proper licensing; even ham
doesn't do this (what are the license terms of this mail?). So if you
were required to include training data it wouldn't be possible even for fairly boring classifiers.
I guess even under my proposed option, packaging the classifier might be tricky. If I deleted the training data and no longer had it, then I
think under my option, the classifier could be DFSG free.
However, I am very leery about extending that exception to cases where
people are intentionally creating that situation by deleting the input
data on purpose.
"Stefano" == Stefano Zacchiroli <[email protected]> writes:
FWIW, in terms of free software ethics, I consider non-open data to be
"less nasty" than non-free code.
Debian is unusual in the way we interpret our mission statement as
extending to everything we distribute being Free, not just our
executable code.
I don't think Debian is perfectly consistent in applying that principle:
for example, the text of the Developer Certificate of Origin (DCO) is included in Debian packages (in 'main') and has a clearly non-free
license, and IIRC sometimes not even in debian/copyright.
There are other examples historically of similar content too, e.g., IETF RFCs and Unicode tables.
Simon Josefsson wrote:
I don't think Debian is perfectly consistent in applying that
principle: for example, the text of the Developer Certificate of
Origin (DCO) is included in Debian packages (in 'main') and has a
clearly non-free license, and IIRC sometimes not even in
debian/copyright.
Same for the text of GPL.
On Mon, May 05, 2025 at 02:13:58PM -0700, Russ Allbery wrote:
However, I am very leery about extending that exception to cases where
people are intentionally creating that situation by deleting the input
data on purpose.
I agree with you on this. I do wonder however where you would place the
case where the training data is available (possibly: publicly
available), and the model trainers would even want to distribute it, but cannot due to unclear licensing terms. Would you say that it is a "less nasty" case than that where training data is deleted on purpose, or
would you consider it as bad?
FWIW, in terms of free software ethics, I consider non-open data to be
"less nasty" than non-free code.
The ability to exploit non-open-data to serve the needs of free software
(as would be the case with DFSG-free models trained on non-DFSG-free data) is something I hesitate to give up.
On Tue, 06 May 2025 at 13:58:57 +0200, Stefano Zacchiroli wrote:
FWIW, in terms of free software ethics, I consider non-open data to be "less nasty" than non-free code.
Debian is unusual in the way we interpret our mission statement as
extending to everything we distribute being Free, not just our
executable code.
Simon McVittie <[email protected]> writes:
Debian is unusual in the way we interpret our mission statement as
extending to everything we distribute being Free, not just our
executable code.
I don't think Debian is perfectly consistent in applying that principle:
for example, the text of the Developer Certificate of Origin (DCO) is included in Debian packages (in 'main') and has a clearly non-free
license, and IIRC sometimes not even in debian/copyright.
There are other examples historically of similar content too, e.g., IETF RFCs and Unicode tables.
Andrey Rakhmatullin <[email protected]> writes:
Simon Josefsson wrote:
I don't think Debian is perfectly consistent in applying that
principle: for example, the text of the Developer Certificate of
Origin (DCO) is included in Debian packages (in 'main') and has a
clearly non-free license, and IIRC sometimes not even in
debian/copyright.
Same for the text of GPL.
License texts have always been a special exception, and I kind of wish we would amend the DFSG to make that clear. Not because I think the status of license texts is somehow in question, but because having one undeclared exception makes people think we should have other undeclared exceptions. I would much prefer to take the time to enumerate all of our major
exceptions.
Are you suggesting that the DCO is a license text that has to be part of
the licensing information for a piece of work, and mentioned in debian/copyright?
Debian is unusual in the way we interpret our mission statement as
extending to everything we distribute being Free, not just our
executable code. Many other FOSS distributions apply the DFSG, the OSD,
the FSF's guidelines or similar principles to executable code (only),
and do not see a problem with having non-executable data that Debian
would consider to be non-Free.
Well, first, I continue to object to the idea that a model can be
DFSG-free if it's trained on non-DFSG-free data. I think that makes it definitionally non-free. (I have read Aigars's arguments to the contrary
and do not find them at all persuasive.)
On Tue, May 06, 2025 at 08:36:50AM -0700, Russ Allbery wrote:
Well, first, I continue to object to the idea that a model can be
DFSG-free if it's trained on non-DFSG-free data. I think that makes it
definitionally non-free. (I have read Aigars's arguments to the
contrary and do not find them at all persuasive.)
We appear to have plenty of pre-trained models, apparently trained on non-DFSG-free data, in main right now, which strikes me as a violation
of our current "preferred form of modification" rule.
On Wed, 7 May 2025 at 02:56, Russ Allbery <[email protected]> wrote:
I think if any of the options in the current GR except Aigars's (and maybe Sam's?) passes, that would effectively be a change in our current policy,
even if the current policy is not precisely intentional.
IMHO my option will also be a change in our current policy, but, instead of requiring the training data itself, my option would just require adding a documentation section describing how to create/gather and process data required to train such models *if* someone would want to reproduce them.
(While I find the tone of the email a bit exasperated, I will try to
reply factually and I hope it will be received as such.)
Thanks for answers! Surprisingly I now find myself agreeing that your approach is reasonable and is consistent with existing Debian practices.
I just wish that the existing practices were more libre and more
consistent with documented policies, but I also think this is not the
popular opinion.
[email protected] wrote:
Debian is unusual in the way we interpret our mission statement as extending to everything we distribute being Free, not just our
executable code. Many other FOSS distributions apply the DFSG, the OSD,
the FSF's guidelines or similar principles to executable code (only),
and do not see a problem with having non-executable data that Debian
would consider to be non-Free.
I have been a Debian developer for almost 30 years, and I remember that
when I joined the project we had no plans to apply the DFSG to e.g. documentation.
Then the "editorial changes" (not) GR happened, and some people were
very surprised by the practical outcome.
On Wed, May 07, 2025 at 02:20:44PM +0200, Simon Josefsson wrote:
Thanks for answers! Surprisingly I now find myself agreeing that your approach is reasonable and is consistent with existing Debian practices.
I just wish that the existing practices were more libre and more
consistent with documented policies, but I also think this is not the popular opinion.
So, let's delve deeper on the practical impact of such consistency
or not. Let's say we have a hypothetical package called
gnipgnop-rattrap. It's an accessibility tool which tracks elements
of your face using pretrained Haar cascade classifier models, and
based on where you look, moves the "mouse" pointer. The models
we ship it with have been trained solely on 75 gigabytes of images
captured from Disney films, which are not available anywhere
because the people who trained the models are afraid of being sued.
What should Debian do? Remove the package from the archive so no
one can use it? Patch it to download the models from a random
URL which may or may not be accessible? Construct 75 gigabytes of
DFSG-free annotated training data to stuff into the source package?
On Wed, May 07, 2025 at 06:10:37PM +0200, Simon Josefsson wrote:
That is not my preference nor what I would want to see happen, but I
think it is consistent with how Debian approaches including non-free
firmware in the official installer images, and how Debian approaches licensing on other non-source files inside packages.
So what is your preference and what would you want to see happen?
I ask because I see no good options here. I am thinking about
this from the perspective of a user who wants to use the models
unmodified and from the perspective of a user who wants to
modify the models to work better with a face that the models
"consider" an outlier.
That is not my preference nor what I would want to see happen, but I
think it is consistent with how Debian approaches including non-free
firmware in the official installer images, and how Debian approaches licensing on other non-source files inside packages.
I think Thorsten Glaser's proposal on the surface looks more in line
with what I would want to see, but I don't think we understand the full implications of any of the proposals right now.
https://lists.debian.org/debian-vote/2025/04/msg00118.html
Some approach to have LLM tools in 'main' when they can work with models
that would be appropriate for inclusion in 'main' seems fine to me.
Then we can ship models for that tool in 'non-free', for people who want
to work with some larger model. I don't see a need to permit LLM tools
in 'main' that are unable to work with any libre LLM model, those tools
could go into 'contrib'.
My general point is that we will need to have some exceptions for various reasons, and I'd rather document them explicitly in the DFSG rather than having a set of well-understood exceptions within Debian that aren't
recorded where people would expect that information to be.
But, more directly to your point, I agree with you, but I don't understand why this implies that it's necessary to put non-free data in Debian main.
I can exploit all sorts of non-open data from my Debian computer by
obtaining it from any number of other sources. I don't see the need for Debian to host it.
So what is your preference and what would you want to see happen?
I ask because I see no good options here. I am thinking about
this from the perspective of a user who wants to use the models
unmodified and from the perspective of a user who wants to
modify the models to work better with a face that the models
"consider" an outlier.
"Stefano" == Stefano Zacchiroli <[email protected]> writes:
Answering Russ upthread, I understand very well how such a situation
will make us Debian people feel good, because we are not hosting it. But
I fail to see how this helps in delivering software freedom to our
users. First, they will have the models in question anyway, probably automatically, so we will really not be "protecting" them from this evil OSAID-but-not-DFSG-free stuff. (Or are we going to rule that free
software that does this cannot be in main too?)
Second, it will be more work for our maintainers, and deliver an overall worse experience in terms of security, mirroring, etc.
Finally, we will also be making things harder for people that are fine
with the limited modifications that are possible without the training
data (e.g., fine tuning) as they will not be able to find the full
sources (that are enough for their needs) within the Debian archive.
What I strongly suspect would happen, if proposal A wins (which I also consider quite likely), is that Debian maintainers of free software
products that use trained ML models that lack DFSG-free training data
will have to go down the rabbit hole of patching that software to systematically download the models on first use. Or just give up on maintaining those packages, of course.
I don't understand why machine learning models are any different. Or,
rather, I understand why they're different to people who truly believe
they really are free software. That argument makes sense to me; I just
don't agree with it. But I don't understand the argument if one agrees
that models without training data are non-free.
Maybe the answer is that they're just too useful to the distribution to
not package regardless of our opinions about whether they're free
software. User experience and free software principles *are* often in
tension and it's fine for us to shift that balance, in my opinion. But I guess I would have expected us to do that via a mechanism similar to non-free-firmware if we wanted to make it easy for users to use software
that is OSAID-approved but not DFSG-free, at least if we have a lot of it.
I'm not sure that these are quite the right terms. This email itself
is non-free software, but if Sam wants to train some kind of deep
learning model on it and release the model, without training data,
under the Expat license, I definitely would not refer to the model
as non-free. Would I prefer that copyright law be abolished and
there be no impediments to providing the training data as well?
Of course I would. But, absent that, there would be no way for Sam
to distribute the training data as free software.
"Clint" == Clint Adams <[email protected]> writes:
Just because something can be done more cheaply or at scale with the help of automation does not make the method of automation
morally wrong. See torrents, see mass manufacturing techniques that allow factories in China to make millions of knock-offs of known toys.
Here we have a *monumental* movement in the development of both software
and the entire copyright landscape as a whole - a movement that could, finally, permanently wound the corporate silos keeping the lid on the
boiling pot of human knowledge. We finally have a legal tool that could free all that knowledge that is currently locked behind
copyright walls and make it available for everyone to use freely and automatically.