Exploiting Latent Information in Recommender Systems
This thesis exploits latent information in personalised recommendation, and investigates how this information can be used to improve recommender systems. The investigations span three directions: scalar rating-based collaborative filtering, distributional rating-based collaborative filtering, and distributional rating-based hybrid filtering.
In the first investigation, data analysis reveals three problems in nearest-neighbour collaborative filtering (item irrelevance, preference imbalance, and biased average) and points to a solution: incorporating “target awareness” into the computation of user similarity and rating deviation. Two new algorithms are subsequently proposed. Quantitative experiments show that the new algorithms, especially the first, significantly improve performance under normal conditions. They do not, however, excel in cold-start situations, because they require more data.
The second investigation builds upon the experimental analysis of the first, and examines the use of discrete probabilistic distributional modelling throughout the recommendation process. It encompasses four ideas: 1) distributional input rating, which enables the explicit representation of noise patterns in user inputs; 2) distributional voting profile, which preserves not only the shift but also the spread and peaks in a user’s rating habits; 3) distributional similarity, which allows similarity over likes and similarity over dislikes to be computed separately; and 4) distributional prediction, which communicates the uncertainty, granularity, and ambivalence in the recommendation results. Quantitative experiments show that this model improves the effectiveness of recommendation compared to the scalar model and other published discrete probabilistic models, especially in terms of binary and list recommendation accuracy.
The third investigation is based on an analysis of the relationships among rating, item content, item quality, and “intangibles”, and is enabled by the discrete probabilistic model proposed in the second investigation. Based on this analysis, a fundamentally different hybrid filtering structure is proposed, in which the hybridisation strategy is neither linear nor sequential, but takes a divide-and-conquer form backed by probabilistic derivation. Experimental results show that it outperforms the standard linear and sequential hybridisation structures.