Forum: >>> Magnum BBS <<<

UDD contains names where spaces are not stripped

From Andreas Tille@21:1/5 to All on Thu Dec 7 10:00:01 2023

Hi,

by chance I realised that the uploaders table contains some names where names are not stripped:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)

This causes slight errors when counting uploads of people. My guess is this
is due to some old importer code (I've checked the hit for my name which
is a pretty old upload). Thus I wonder whether it might be the easiest
fix to simply fix this with some proper UPDATE statement to remove unneeded spaces. This statement is doing the trick in my local clone:

UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

If I'm not misleaded historic uploads will not importet from scratch so
this would cure the situation. Otherwise users need to always remember
adding some trim(name) when dealing with the uploaders.name column not
to mention that it gets even harder to deal with the uploader column
that might feature extra spaces in the middle.

What do you think?

Kind regards
Andreas.

--
http://fam-tille.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lucas Nussbaum@21:1/5 to Andreas Tille on Thu Dec 7 20:10:01 2023

On 07/12/23 at 09:58 +0100, Andreas Tille wrote:

Hi,

by chance I realised that the uploaders table contains some names where names are not stripped:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)

This causes slight errors when counting uploads of people. My guess is this is due to some old importer code (I've checked the hit for my name which
is a pretty old upload). Thus I wonder whether it might be the easiest
fix to simply fix this with some proper UPDATE statement to remove unneeded spaces. This statement is doing the trick in my local clone:

UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

If I'm not misleaded historic uploads will not importet from scratch so
this would cure the situation. Otherwise users need to always remember adding some trim(name) when dealing with the uploaders.name column not
to mention that it gets even harder to deal with the uploader column
that might feature extra spaces in the middle.

What do you think?

Hi,

Uploaders is refreshed every few hours from archive data, so a one-time
UPDATE would not help. UDD usually tries to preserve inaccuracies, so
those might be interesting for QA work.
In your case, why don't you use the email address to identify uploaders? (possibly combining it with the carnivore data to identify different emails belonging to the same person ?)

Lucas

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andreas Tille@21:1/5 to All on Thu Dec 7 20:30:01 2023

Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:

On 07/12/23 at 09:58 +0100, Andreas Tille wrote:

Hi,

by chance I realised that the uploaders table contains some names where names
are not stripped:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

Uploaders is refreshed every few hours from archive data, so a one-time UPDATE would not help. UDD usually tries to preserve inaccuracies, so
those might be interesting for QA work.

OK.

In your case, why don't you use the email address to identify uploaders?

Since this also does not work:

udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
count | uploader
-------+------------------------------------
1 | Andreas Tille <[email protected]>
1 | Andreas Tille <[email protected]>
8785 | Andreas Tille <[email protected]>
(3 Zeilen)

(possibly combining it with the carnivore data to identify different emails belonging to the same person ?)

I could fiddle around with carnivore but that's overkill for thst
purpose and I insist that not stripping blanks from names does not make
any sense, IMHO. (1 Zeile)

BTW: I found

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

Is my assumption wrong that both values should match (modulo some wrongly spelled names)

Kind regards
Andreas.

--
http://fam-tille.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lucas Nussbaum@21:1/5 to Andreas Tille on Thu Dec 7 20:40:01 2023

On 07/12/23 at 20:24 +0100, Andreas Tille wrote:

Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:

On 07/12/23 at 09:58 +0100, Andreas Tille wrote:

Hi,

by chance I realised that the uploaders table contains some names where names
are not stripped:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

Uploaders is refreshed every few hours from archive data, so a one-time UPDATE would not help. UDD usually tries to preserve inaccuracies, so
those might be interesting for QA work.

OK.

In your case, why don't you use the email address to identify uploaders?

Since this also does not work:

udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
count | uploader
-------+------------------------------------
1 | Andreas Tille <[email protected]>
1 | Andreas Tille <[email protected]>
8785 | Andreas Tille <[email protected]>
(3 Zeilen)

(possibly combining it with the carnivore data to identify different emails belonging to the same person ?)

I could fiddle around with carnivore but that's overkill for thst
purpose and I insist that not stripping blanks from names does not make
any sense, IMHO. (1 Zeile)

BTW: I found

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

Is my assumption wrong that both values should match (modulo some wrongly spelled names)

If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data

udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
uploader | name | email | count
------------------------------------+-----------------+------------------+-------
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 8785
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1

So, just use name and/or email?

Lucas

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andreas Tille@21:1/5 to All on Fri Dec 8 07:40:01 2023

Am Thu, Dec 07, 2023 at 08:36:12PM +0100 schrieb Lucas Nussbaum:

On 07/12/23 at 20:24 +0100, Andreas Tille wrote:

Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:

On 07/12/23 at 09:58 +0100, Andreas Tille wrote:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

BTW: I found

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

???

Is my assumption wrong that both values should match (modulo some wrongly spelled names)

Could you please comment on these different results?

If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data

udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
uploader | name | email | count
------------------------------------+-----------------+------------------+-------
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 8785
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1

So, just use name and/or email?

Well, I do not seek for a solution for this (non-)problem. I simply
think that not stripping values from spaces before injecting these into
UDD is wrong. I simply stumbled upon this when I did the query above.

I stumbled upon another reason which might be even worse:

select distinct done, done_name, done_email, owner, owner_name, owner_email from archived_bugs where done_name like '%"%' or owner_name like '%"%' order by done_name;
done | done_name | done_email |
owner | owner_name | owner_email
---------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------------------------------------------+---------------
--------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+----------------------------------------------
<[email protected]> | | [email protected] | "vanecgs@
gmail.com" <[email protected]> | "[email protected]" | [email protected]
<[email protected]> | | [email protected] | "Varun
Hiremath" <[email protected]> | "Varun Hiremath" | [email protected]
[email protected] (Alexander L. Belikoff) | | [email protected] | "Alexander L.
Belikoff" <[email protected]> | "Alexander L. Belikoff" | [email protected]
[email protected] (Andreas B. Mundt) | | [email protected] | "Andreas B.
Mundt" <[email protected]> | "Andreas B. Mundt" | [email protected]
[email protected] (Antoine R. Dumont (@ardumont)) | | [email protected] | "Antoine R.
Dumont" <[email protected]> | "Antoine R. Dumont" | [email protected]
[email protected] (Antoine R. Dumont) | | [email protected] | "Antoine R.
Dumont" <[email protected]> | "Antoine R. Dumont" | [email protected]
[email protected] (Artur R. Czechowski) | | [email protected] | "Artur R.
Czechowski" <[email protected]> | "Artur R. Czechowski" | [email protected]
...

We have lots of names in probably more than archived_bugs which are not stripped from '"'. You always find the very same names without the
quotes inside the same table. I think this is similarly wrong and even
more annoying than the spaces.

I wonder where we could sensibly discuss those issues which I consider
bugs in UDD. Would it make sense to add some udd category in
`reportbug other` ?

Kind regards
Andreas.

--
http://fam-tille.de

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lucas Nussbaum@21:1/5 to Andreas Tille on Fri Dec 8 09:00:01 2023

On 08/12/23 at 07:39 +0100, Andreas Tille wrote:

Am Thu, Dec 07, 2023 at 08:36:12PM +0100 schrieb Lucas Nussbaum:

On 07/12/23 at 20:24 +0100, Andreas Tille wrote:

Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:

On 07/12/23 at 09:58 +0100, Andreas Tille wrote:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

BTW: I found

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

???

upload_history contains all uploads ever made to Debian.
uploaders contains packages currently in the suite (not superseded by
another upload)

So, just use name and/or email?

Well, I do not seek for a solution for this (non-)problem. I simply
think that not stripping values from spaces before injecting these into
UDD is wrong. I simply stumbled upon this when I did the query above.

It has been like that for about 15 years. I'm not sure changing the API
because you think is wrong is a good idea.

I stumbled upon another reason which might be even worse:

select distinct done, done_name, done_email, owner, owner_name, owner_email from archived_bugs where done_name like '%"%' or owner_name like '%"%' order by done_name;
done | done_name | done_email |

owner | owner_name | owner_email

---------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------------------------------------------+-------------

----------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+----------------------------------------------

<[email protected]> | | [email protected] | "vanecgs@

gmail.com" <[email protected]> | "[email protected]" | [email protected]

<[email protected]> | | [email protected] | "Varun

Hiremath" <[email protected]> | "Varun Hiremath" | [email protected]

[email protected] (Alexander L. Belikoff) | | [email protected] | "Alexander

L. Belikoff" <[email protected]> | "Alexander L. Belikoff" | [email protected]

[email protected] (Andreas B. Mundt) | | [email protected] | "Andreas B.

Mundt" <[email protected]> | "Andreas B. Mundt" | [email protected]

[email protected] (Antoine R. Dumont (@ardumont)) | | [email protected] | "Antoine R.

Dumont" <[email protected]> | "Antoine R. Dumont" | [email protected]

[email protected] (Antoine R. Dumont) | | [email protected] | "Antoine R.

Dumont" <[email protected]> | "Antoine R. Dumont" | [email protected]

[email protected] (Artur R. Czechowski) | | [email protected] | "Artur R.

Czechowski" <[email protected]> | "Artur R. Czechowski" | [email protected]

...

We have lots of names in probably more than archived_bugs which are not stripped from '"'. You always find the very same names without the
quotes inside the same table. I think this is similarly wrong and even
more annoying than the spaces.

Use done_email ? It's there for exactly that reason.

I wonder where we could sensibly discuss those issues which I consider
bugs in UDD. Would it make sense to add some udd category in
`reportbug other` ?

There is a udd category in qa.debian.org. See https://wiki.debian.org/UltimateDebianDatabase

Lucas

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	130:47:30
Calls:	12,470
Files:	15,200
Messages:	6,538,514

UDD contains names where spaces are not stripped

Who's Online

System Info