Ideas about user rating systems (c) 2007-2011 Chris Gahan
- Ratings only sparsely cover the space of things to rate
- First-Come, First-Rate Effect
- The Experience of the Rater Doesn't Count (eg. expert vs. novice)
- Incremental Improvement Problem
- Meme (momentum) effect
- The Hyperbole effect
- The bad judgement effect
- Meaningless numbers
- The Lowest Common Denominator Effect
- Voting without actually reading
- User's subjective context while evaluating the rateable items
- User's evaluation criteria change over time
- Exposing statistics which change the statistics (statistical feedback loops)
- Reddit's "Fluff"-oriented Voting System
- Related Research
- Recommendation Services
I'm sure you've seen the user rating systems that many sites have (news aggregators, desktop wallpapers, skins/themes, shareware, etc.) which let users rank things with stars or numbers or thumbs-up, and you can then sort the list of stuff on the site by highest rating. I'm sure you've also noticed that there's no real correlation between an item's user rating and its quality.
Sites with the problem:
So, what are the problems?
Ratings only sparsely cover the space of things to rate
Sometimes the amount of stuff to rate is so much bigger than the population rating them that no real inference can be drawn about the rated things.
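One standard mitigation (not from the original notes, but a common fix): rank by the lower bound of a confidence interval on the positive-rating fraction instead of the raw average, so an item with two perfect ratings can't outrank one with ninety good ratings out of a hundred. A sketch using the Wilson score interval:

```python
from math import sqrt

def wilson_lower(pos, n, z=1.96):
    """Lower bound of the Wilson 95% confidence interval on the true
    positive-rating fraction. Few ratings => wide interval => low bound."""
    if n == 0:
        return 0.0
    p = pos / n
    return (p + z*z/(2*n) - z * sqrt((p*(1 - p) + z*z/(4*n)) / n)) / (1 + z*z/n)

# 2-for-2 looks perfect but carries almost no evidence:
assert wilson_lower(2, 2) < wilson_lower(90, 100)
```

Sparsely rated items then sort low until enough evidence accumulates, rather than dominating the list on a handful of votes.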
First-Come, First-Rate Effect
I've noticed that there's a "first-come, first-rate" phenomenon. What the hell am I talking about? Well, imagine the site just opened, and someone uploaded something cool. Relative to all the crap on the site, this is the coolest thing, so people will rate it highly. Also, it's easier to judge it since there's not much stuff on the site to go through.
As time passes, the site fills up with content; so much that the average user doesn't have enough time to sift through all of it. As a result, people will just default to "sort by popularity" and the only things that get a chance to be rated are the ones that they actually get a chance to try out (eg: the stuff at the top of the popularity list). The things that had high ratings when the site first opened continue to have high ratings, so there is very little movement amongst the top 10.
This effect is very prominent amongst sites that rank files by number-of-downloads, because people tend to download the "most popular thing"; additionally, the oldest thing will naturally have the most downloads. These high-rated items will have natural momentum just because they're near the top.
The Experience of the Rater Doesn't Count (eg. expert vs. novice)
Experts and novices judge things differently. Experts and novices also have different tastes.
The amount of experience a rater has should be incorporated into both the ratings and the filtering.
Incremental Improvement Problem
There's another kind of problem which manifests itself on sites where you're picking the best thing among multiple items in a single class. For example, competing definitions on urbandictionary.com.
The problem is that, again, the first entry to be submitted gets rated the highest, and then always gets listed first. Since most readers are lazy, they won't read more than 3 or 4 definitions down.
However, people tend to make incremental improvements to previous posts and re-post them; this means that a definition posted later is likely to be better than one posted earlier.
So, in this case, the ratings need to be recomputed to be relative to each other. If you add something later, it's probably going to be rated relative to the earlier stuff. So, something that was added recently and got 10 positive ratings in the first 5 minutes is likely to be better than whatever is currently ranked #1.
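The idea above can be sketched as a vote-velocity metric (hypothetical; the function and numbers are illustrative): compare how fast items accumulate positive ratings rather than their raw totals, so a strong late entry can overtake an entrenched #1.

```python
def vote_velocity(up_votes, age_minutes):
    """Positive votes per minute since submission (hypothetical metric)."""
    return up_votes / max(age_minutes, 1)

# 10 upvotes in 5 minutes beats 500 upvotes accumulated over a week:
assert vote_velocity(10, 5) > vote_velocity(500, 7 * 24 * 60)
```

A real system would probably decay or window this (a post's first burst shouldn't carry it forever), but even this crude ratio lets later, better entries surface.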
Meme (momentum) effect
You see this kind of rating effect on YouTube. On YouTube, there is a bar on the right side that shows the tippy-top of popular YouTube videos. Any videos that are shown in the sidebar will naturally get clicked a trillion times.
As a result, those videos stay there forever. They all get really stale.
Here's a scientific study of this effect: Role of Luck in Popular Music
The Hyperbole effect
People generally only bother to rate something if they really like it or if they really hate it. This can lead to a vote schism.
- Perhaps there is a way of detecting these votes and renormalizing them?
The bad judgement effect
Imagine a user has just logged onto a site, and hasn't seen any of the content yet. The first good item they see will seem to be really good to them, so they'll rate it highly. Then, as they see better items, they'll realize that they didn't rate the previous item as accurately as they could've. Or, the opposite will happen -- they'll rate something good with a moderate ranking because they've seen too many good things, and eventually realize that the item they ranked moderately was the best item on the site.
Meaningless numbers
Ratings are usually something like "4/5 stars". What does that mean? Neither the rater nor the rating-viewer knows what that number means. It's very difficult to extract meaning from "4".
Verification is also impossible -- you can't look at a rating of "4" and say "That's not accurate! That movie wasn't 4 at all!!!"
What would be better are qualitative, descriptive ratings. (See: Slashdot, or the International Baccalaureate.)
Qualitative ratings let raters judge whether a given rating criterion is applicable. They also let rating-viewers judge the raters by weighing the rating against the thing being rated.
The Lowest Common Denominator Effect
It's common that unique things are polarizing -- many people love them, and many people hate them. If you average the scores, a unique thing will have a neutral score. It'll appear average, even though it's unique. Finding the item whose average score is the highest will give you the item that angers the fewest people while pleasing the most. This results in lots of "meh" things.
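A toy illustration of why averaging hides polarization: two items with identical means but very different spreads. Keeping the spread (or the full histogram) alongside the mean preserves the "unique but divisive" signal:

```python
from statistics import mean, stdev

polarizing = [5, 5, 5, 1, 1, 1]  # loved or hated
bland      = [3, 3, 3, 3, 3, 3]  # universally "meh"

# Identical averages...
assert mean(polarizing) == mean(bland) == 3

# ...but the spread tells them apart:
controversy = stdev(polarizing) - stdev(bland)
```

Sorting purely by mean treats these two items as interchangeable; a "controversy" dimension like the standard deviation lets a site surface divisive items deliberately instead of burying them.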
Voting without actually reading
User's subjective context while evaluating the rateable items
Users rate things differently depending on the mood they're in when evaluating the rateable item.
For example, if you watch a comedy when you're in the mood for an action movie, or a haughty art film when you want to watch something stupid, then you're going to give it a low rating. But, if you were to watch it again in a different mood, you might give it a high rating.
The reason this problem exists is because rateable things can have multiple quality dimensions, and people can favor these dimensions differently depending on their mental state (mental "needs" or "desires").
User's evaluation criteria change over time
The more times you see (or hear) the same thing (or the same kind of thing), the less you enjoy it. The first time I saw an action movie, I thought it was amazing! Then as I saw more of them, they got boring. By the time I was 17 years old, my interest had waned significantly.
If you can't model a person's "taste fatigue" towards something, then ratings that you get from users will be biased, and suggestions that you give to users won't be very good.
Assigning proper ratings gets even more complex when you account for the way that art evolves; a field of art will usually start with a creative individual doing some trailblazingly original work which inspires others to duplicate and improve on that work. After many incremental improvements, the later works that were inspired by the earlier ones are usually of much higher quality than the original.
Now there are multiple kinds of people:
- people who have only seen the original
- people who have only seen the latest iterations
- people who have seen all of the works
People in group 1 would rate the original very high. People in group 2 would rate the new iterations high and the original low. People in group 3 would probably be sick of the whole genre by now, but might have nostalgic feelings for the original work because it was more raw and brings back fond memories, so they'd dislike the newer works.
Modelling people's tastes over time, as well as the influence of older art on newer art, needs to be taken into account when crafting a good solution.
Exposing statistics which change the statistics (statistical feedback loops)
I just saw an interesting app called "The Ruby Toolbox" which works like this:
The Ruby Toolbox gives you an overview of these tools, sorted in categories and rated by the amount of watchers and forks in the corresponding source code repository on GitHub so you can find out easily what options you have and which are the most common ones in the Ruby community.
GitHub "watches" and "forks" are influenced by nothing but people inspecting the product and "word of mouth" recommendations. GitHub repositories aren't listed by popularity. Therefore, the statistics should be fairly clean.
This application, however, will create a feedback loop, since it will be pushing people towards the most popular repos, making them more popular. This will lead to statistics taking on chaotic and nonlinear behaviour.
A possible solution is to separate the statistics into: "people following because of 'The Ruby Toolbox'" and "people following because they actually looked into the project and think it's good".
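That separation could be as simple as tagging each new watch with its referrer and ranking only on the organic ones. A sketch (the event log, repo names, and referrer labels are all made up for illustration):

```python
from collections import Counter

# hypothetical event log: (repo, referrer that drove the watch)
events = [
    ("rails",   "ruby-toolbox"),
    ("rails",   "search"),
    ("sinatra", "search"),
    ("sinatra", "word-of-mouth"),
    ("rails",   "ruby-toolbox"),
    ("rails",   "ruby-toolbox"),
]

# rank only by watches NOT driven by the ranking site itself
organic = Counter(repo for repo, ref in events if ref != "ruby-toolbox")
```

Here "rails" has more raw watches, but once the self-referred ones are excluded, "sinatra" leads -- the feedback loop is kept out of the statistic that feeds the ranking.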
Reddit's "Fluff"-oriented Voting System
originally from this thread
There's one huge problem that reddit suffers, which I think is the cause of almost all the problems it's facing, and that's the fluff principle, which I've also heard called "the conveyor belt problem". Basically it is reddit's root of all terrible.
Here's reddit's ranking algorithm. I only want you to notice two things about it: submission time matters hugely (new threads push old threads off the page aggressively), and upvotes are counted logarithmically (the first ten matter as much as the next 100). So, new threads get a boost, and new threads that have received 10 upvotes quickly get a massive boost. The effect of this is that anything that is easily judged and quickly voted on stands a much better chance of rising than something that takes a long time to judge and decide whether it's worth your vote. Reddit's algorithm is objectively and hugely biased towards fluff, content easily consumed and speedily voted on. And it's biased towards the votes of people who vote on fluff.
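For reference, the ranking function reddit open-sourced works roughly like this (a Python paraphrase; the magic constants come from reddit's published code). The log makes the first 10 votes worth as much as the next 100-ish, and every 45,000 seconds (12.5 hours) of age is worth one factor-of-10 in votes -- which is exactly the fluff bias described above:

```python
from datetime import datetime
from math import log10

def hot(ups, downs, date):
    """Paraphrase of reddit's open-sourced 'hot' ranking."""
    s = ups - downs
    order = log10(max(abs(s), 1))
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = (date - datetime(1970, 1, 1)).total_seconds() - 1134028003
    return sign * order + seconds / 45000

# A 13-hour-newer post with 10 net upvotes outranks an older one with 100:
old = hot(100, 0, datetime(2011, 1, 1, 0, 0))
new = hot(10,  0, datetime(2011, 1, 1, 13, 0))
assert new > old
```

Since age enters linearly but votes only logarithmically, quick early votes on easily judged content dominate everything else.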
When I submit a long, good, thought provoking article to one of the defaults, I don't get downvoted. I just don't get voted on at all. I'll get two or three upvotes, but it won't matter, because by the time someone's read through the article and thought about it and whether it was worth their time and voted on it, the thread has fallen off the first page of /new/ and there's no saving it, while in the same amount of time an image macro has received hundreds of votes, not all upvotes but that doesn't matter, what matters is getting the first 10 while it's still got that youth juice.
This single problem explains so much of reddit's culture:
- It's why image macros are huge here, and why those which can be read from the thumbnail are even more popular.
- It's why /r/politics and /r/worldnews and /r/science are suffocated by articles which people have judged entirely from their titles, because an article that was so interesting that people actually read it would be disadvantaged on reddit, and the votes of people who actually read the articles count less.
- It's a large part of why small subreddits are better than big ones. More submissions means old submissions get pushed under the fold faster, shortening the time that voting on them matters.
Reposts also have an advantage -- people, having already seen them, can vote on them that much quicker.
It's really shitty! And it's hard to reverse now, because this fluff-biased algorithm has attracted people who like fluff and driven away those that don't.
But changing the algorithm would give long, deep content at least a fighting chance.
edit: one good suggestion I've seen
It's often easier to do a quick, rough classification first, and then refine it later.
It ends up being easier because, to do a precise classification in one pass, the user has to remember all the entities in the space that are being classified.
For example, if I'm rating Beatles songs, to do a precise, one-pass rating, for every single song I'm rating, I have to remember every other Beatles song and how that song rates relative to it.
On the other hand, when doing multi-pass rating, I only have to remember that this particular song is better than half of the songs.
You can kill two birds with one stone by encoding both certainty and coarseness in the same rating value. If the user picks "1 star out of 2", that has a low certainty. If the user picks "7.5 stars out of 10", that has a very high certainty.
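One way to encode that (a hypothetical scheme, not an existing system): treat the coarseness of the scale the user chose as a certainty weight attached to the normalized score.

```python
def decode_rating(stars, out_of):
    """Normalized score plus a certainty weight derived from scale coarseness.
    Hypothetical encoding: a finer scale implies a more certain rater,
    with a 10-point scale taken as full certainty."""
    score = stars / out_of           # 0.0 .. 1.0
    certainty = min(out_of / 10.0, 1.0)
    return score, certainty

assert decode_rating(1, 2) == (0.5, 0.2)      # coarse rating, low certainty
assert decode_rating(7.5, 10) == (0.75, 1.0)  # fine rating, high certainty
```

Downstream aggregation could then weight each rating by its certainty, so early rough passes contribute less than later refined ones.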
Stored context and history
The context in which each rating is made is very important to future inference and analysis. For example, an inexperienced music listener will not give the same ratings as an experienced one. A listener at work on Monday morning won't rate music the same way as the same listener on a Friday night, getting excited to go out and party.
Therefore, the context and history of all ratings contain a lot of valuable information which can be used to renormalize all the ratings, or to assist recommendations for other users (ie: budding music aficionados).
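A sketch of what a context-carrying rating record might look like (the schema and every field in it are invented for illustration):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Rating:
    """A rating plus the context it was made in (hypothetical schema)."""
    user_id: str
    item_id: str
    stars: float
    rated_at: datetime       # mood proxy: Friday night vs. Monday morning
    ratings_so_far: int      # rater's experience at the time of rating
    # further context could include: session length, device, genre history, ...

r = Rating("alice", "song-42", 4.0, datetime(2011, 5, 6, 22, 30), 12)
```

Because the timestamp and experience level are stored per rating, all ratings can later be renormalized (e.g. discounting a novice's early scores) without throwing any data away.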
Clustering people based on tastes
Naturally, people are biased; it's just a side-effect of the way we compress the information from our experiences. As such, they have different tastes. After rating a large number of items, patterns in their tastes should become clear, and it should be possible to correlate those patterns with clusters of individuals.
Users may like to see the clusters they belong to as well, so that they could see new memes propagating through their cluster.
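A minimal building block for that clustering is a similarity measure over users' rating vectors; cosine similarity is a common choice (the users and ratings below are made up):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two users' rating vectors (0 = unrated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# ratings for the same 4 items:
alice = [5, 4, 1, 0]
bob   = [4, 5, 2, 0]   # similar taste to alice
carol = [1, 1, 5, 5]   # roughly opposite taste

assert cosine(alice, bob) > cosine(alice, carol)
```

Any standard clustering algorithm run over such pairwise similarities would yield the taste clusters described above, and a new meme spreading inside one cluster shows up as correlated rating activity among its members.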
Differential Ratings & Hypothesis Testing
Humans can be used as elements in a ratings calculation system if they are automatically assigned pairs of items to differentially compare based on an inference engine and hypothesis tester.
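A common way to turn such pairwise "A is better than B" judgements into a global ranking is an Elo-style update (a sketch; k=32 is an arbitrary learning rate, and the 400-point logistic scale is the convention from chess ratings):

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo-style update after a single pairwise judgement."""
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

# Two items start equal; one pairwise judgement separates them:
a, b = elo_update(1500, 1500)
assert a == 1516 and b == 1484
```

The inference engine would pick which pair each user is shown next -- ideally the pair whose outcome is currently most uncertain, which is exactly where a hypothesis tester plugs in.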
Which leads to...
Identifying User Traits
The resulting network of cross-movie and cross-user relative rankings can be used to construct sets of user traits, which in turn enhance prediction of ratings for new items.
RBMs/hessian-free dimensionality reduction/HTMs could be used to cluster people.
Unfortunately, humans may present a challenge to trait identification since their interests can change over time, may have emotional noise, and could even be cyclic. Therefore, the hypothesis tester AND the machine learning algorithm must work together to detect and test for temporal patterns.
(Other possibility: RBMs learning about other RBMs is discussed in Geoff Hinton's talk -- it was called "bridging". Perhaps an RBM can first model the cyclic or adaptive nature of brains in general, perhaps from EEG data.)
StumbleUpon already has a very good ranking algorithm, especially considering that the items it's ranking come from a supremely massive data set (all the pages on the web). Their video ranking system is also particularly good -- the categories that the videos are in are very accurate (perhaps because it comes from a much smaller dataset -- a testament to the quality of their algorithms).
Related Research
Rubrics And The Bimodality Of 1-5 Ratings (by Zed Shaw)
Attack resistant collaborative filtering (by Bhaskar Mehta, Google Inc., Zurich, Switzerland)
Collaborative Filtering and the Missing at Random Assumption (PDF) (by Ben Marlin, June 26, 2007)
Kevin Regan's research:
Novelty-based recommendation algorithm (Netflix is implementing this)
Recommendation Services
Foundd lets each user rate movies; then, when a group gets together to watch something, it combines their ratings to find things that everyone will like.