Hi,
by chance I realised that the uploaders table contains some names where names are not stripped:
udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
This causes slight errors when counting uploads of people. My guess is this is due to some old importer code (I've checked the hit for my name which
is a pretty old upload). Thus I wonder whether it might be the easiest
fix to simply fix this with some proper UPDATE statement to remove unneeded spaces. This statement is doing the trick in my local clone:
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
If I'm not misleaded historic uploads will not importet from scratch so
this would cure the situation. Otherwise users need to always remember adding some trim(name) when dealing with the uploaders.name column not
to mention that it gets even harder to deal with the uploader column
that might feature extra spaces in the middle.
What do you think?
On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
Hi,
by chance I realised that the uploaders table contains some names where names
are not stripped:
udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
Uploaders is refreshed every few hours from archive data, so a one-time UPDATE would not help. UDD usually tries to preserve inaccuracies, so
those might be interesting for QA work.
In your case, why don't you use the email address to identify uploaders?
(possibly combining it with the carnivore data to identify different emails belonging to the same person ?)
Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
Hi,
by chance I realised that the uploaders table contains some names where names
are not stripped:
udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
Uploaders is refreshed every few hours from archive data, so a one-time UPDATE would not help. UDD usually tries to preserve inaccuracies, so
those might be interesting for QA work.
OK.
In your case, why don't you use the email address to identify uploaders?
Since this also does not work:
udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
count | uploader
-------+------------------------------------
1 | Andreas Tille <[email protected]>
1 | Andreas Tille <[email protected]>
8785 | Andreas Tille <[email protected]>
(3 Zeilen)
(possibly combining it with the carnivore data to identify different emails belonging to the same person ?)
I could fiddle around with carnivore but that's overkill for thst
purpose and I insist that not stripping blanks from names does not make
any sense, IMHO. (1 Zeile)
BTW: I found
udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)
So why do I have 8707 uploads per uploaders but 16524 per upload_history?
Is my assumption wrong that both values should match (modulo some wrongly spelled names)
On 07/12/23 at 20:24 +0100, Andreas Tille wrote:
Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
BTW: I found
udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)
So why do I have 8707 uploads per uploaders but 16524 per upload_history?
Is my assumption wrong that both values should match (modulo some wrongly spelled names)
If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data
udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
uploader | name | email | count
------------------------------------+-----------------+------------------+-------
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 8785
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1
So, just use name and/or email?
Am Thu, Dec 07, 2023 at 08:36:12PM +0100 schrieb Lucas Nussbaum:
On 07/12/23 at 20:24 +0100, Andreas Tille wrote:
Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader --------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" St�phane Glondu" | St�phane Glondu <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
"Andreas Tille " | Andreas Tille <[email protected]>
" LI Daobing" | LI Daobing <[email protected]>
" David Paleino" | David Paleino <[email protected]>
" Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
"Colin Tuckley " | Colin Tuckley <[email protected]>
(20 rows)
...
UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
BTW: I found
udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)
So why do I have 8707 uploads per uploaders but 16524 per upload_history?
???
So, just use name and/or email?
Well, I do not seek for a solution for this (non-)problem. I simply
think that not stripping values from spaces before injecting these into
UDD is wrong. I simply stumbled upon this when I did the query above.
I stumbled upon another reason which might be even worse:owner | owner_name | owner_email
select distinct done, done_name, done_email, owner, owner_name, owner_email from archived_bugs where done_name like '%"%' or owner_name like '%"%' order by done_name;
done | done_name | done_email |
---------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+----------------------------------------------
<[email protected]> | | [email protected] | "vanecgs@gmail.com" <[email protected]> | "[email protected]" | [email protected]
<[email protected]> | | [email protected] | "VarunHiremath" <[email protected]> | "Varun Hiremath" | [email protected]
[email protected] (Alexander L. Belikoff) | | [email protected] | "AlexanderL. Belikoff" <[email protected]> | "Alexander L. Belikoff" | [email protected]
[email protected] (Andreas B. Mundt) | | [email protected] | "Andreas B.Mundt" <[email protected]> | "Andreas B. Mundt" | [email protected]
[email protected] (Antoine R. Dumont (@ardumont)) | | [email protected] | "Antoine R.Dumont" <[email protected]> | "Antoine R. Dumont" | [email protected]
[email protected] (Antoine R. Dumont) | | [email protected] | "Antoine R.Dumont" <[email protected]> | "Antoine R. Dumont" | [email protected]
[email protected] (Artur R. Czechowski) | | [email protected] | "Artur R.Czechowski" <[email protected]> | "Artur R. Czechowski" | [email protected]
...
We have lots of names in probably more than archived_bugs which are not stripped from '"'. You always find the very same names without the
quotes inside the same table. I think this is similarly wrong and even
more annoying than the spaces.
I wonder where we could sensibly discuss those issues which I consider
bugs in UDD. Would it make sense to add some udd category in
`reportbug other` ?
| Sysop: | Keyop |
|---|---|
| Location: | Huddersfield, West Yorkshire, UK |
| Users: | 715 |
| Nodes: | 16 (2 / 14) |
| Uptime: | 145:08:48 |
| Calls: | 12,089 |
| Calls today: | 2 |
| Files: | 15,000 |
| Messages: | 6,517,496 |