On 7/13/02 3:33 PM, "Nick Simicich" <njs @
>>> If someone wants to vote manually a couple hundred times I do not care,
>> But I do.
> Do you think that will skew the numbers that much?
Will it? Maybe, maybe not. Can it? Sure. All it takes is one script kiddie
determined to make "reply-to coercion" win to make the numbers useless. So I
need to design the system to protect against that possibility.
>>> I would think that if you simply recorded ip
>>> addresses (or even an MD5 of each octet) that would settle automated voting
>> And screw over most users from AOL, and any other place that has a
>> significant number of addresses sharing a small IP range through timesharing
>> or firewalls and proxies. It doesn't work in the general case.
> My point was not to automatically throw those things out. It was to allow
> someone who was a third party to judge the reliability of the data, and to
> do some selection based on addresses and commonality of addresses while
> preserving the actual value of the addresses.
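For concreteness, here's a minimal sketch of the per-octet-MD5 idea being proposed, and of exactly why it breaks for users behind a shared proxy or NAT (the function name and example addresses are mine, purely illustrative):

```python
import hashlib

def octet_digest(ip):
    """Hash each octet of a dotted-quad IP separately, as proposed,
    so a third party can compare addresses without seeing them raw."""
    return ".".join(hashlib.md5(octet.encode()).hexdigest()[:8]
                    for octet in ip.split("."))

# Two *different* users voting through the same proxy present the
# same source address, so their digests are identical:
voter_a = octet_digest("152.163.200.10")  # hypothetical proxy address
voter_b = octet_digest("152.163.200.10")
print(voter_a == voter_b)  # True

# A dedup filter keyed on this digest would silently discard one of
# the two legitimate votes -- the AOL/firewall problem above.
```

The digest preserves commonality of addresses, as intended, but it can't distinguish many real users sharing one address from one user voting many times.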
The problems here are legion. First, someone makes a subjective decision
about which votes are correct and which are not, so you potentially introduce
all sorts of bias. Scientists who get caught choosing data to fit the graph
tend to lose their grants, because that's, well, fraud. And whether or not
this third party actually does that, the data will always carry the suspicion
that they might have.
Second, looking for, finding, and resolving these problems takes someone's
time and energy. Right now, that someone is me. If I can design fraud out of
the system in the first place, that's a far more effective use of my time and
energy, because you end up with higher-quality, better-trusted data. Will I
stop 100% of the fraud? Probably not. But if I can push it below the point of
statistical significance, that's good enough, and a lot better than saying
"we'll clean it up later". Dirty data is dirty data, and you run the risk of
never actually cleaning it up, just moving the dirt around.
If I open the data to users to evaluate, I want them all analyzing the same
set of data. I don't want all 12 of them deciding which data ought to be
excluded by their own idea of "fair", because you end up with data that will
say whatever people want it to say, and with analyses that can't be compared
to each other, even though the data source is the same. That makes it
basically useless.
Chuq Von Rospach, Architech
com -- http://www.chuqui.com/
Very funny, Scotty. Now beam my clothes down here, will you?