
Anomalous Topic Discovery in High Dimensional Discrete Data

ABSTRACT

We propose an algorithm for detecting patterns exhibited by anomalous
clusters in high dimensional discrete data. Unlike most anomaly detection (AD)
methods, which detect individual anomalies, our proposed method detects groups
(clusters) of anomalies; i.e. sets of points which collectively exhibit abnormal
patterns. In many applications this can lead to better understanding of the nature of
the atypical behavior and to identifying the sources of the anomalies. Moreover,
we consider the case where the atypical patterns are exhibited on only a small (salient)
subset of the very high dimensional feature space. Individual AD techniques and
techniques that detect anomalies using all the features typically fail to detect such
anomalies, but our method can detect such instances collectively, discover the
shared anomalous patterns exhibited by them, and identify the subsets of salient
features. In this paper, we focus on detecting anomalous topics in a batch of text
documents, developing our algorithm based on topic models. Results of our
experiments show that our method can accurately detect anomalous topics and
salient features (words) under each such topic in a synthetic data set and two
real-world text corpora, and achieves better performance than both standard
group AD and individual AD techniques.

Conflict-Aware Event-Participant Arrangement and Its Variant for Online Setting

ABSTRACT

With the rapid development of Web 2.0 and the Online To Offline (O2O)
marketing model, various online event-based social networks (EBSNs) are getting
popular. An important task of EBSNs is to facilitate the most satisfactory event-
participant arrangement for both sides, i.e. events enroll more participants and
participants are arranged with personally interesting events. Existing approaches
usually focus on the arrangement of each single event to a set of potential users, or
ignore the conflicts between different events, which leads to infeasible or
redundant arrangements. In this paper, to address the shortcomings of existing
approaches, we first identify a more general and useful event-participant
arrangement problem, called the Global Event-participant Arrangement with Conflict
and Capacity (GEACC) problem, which focuses on the conflicts between different
events and makes event-participant arrangements from a global view. We find that the
GEACC problem is NP-hard due to the conflicts among events. Thus, we design two
approximation algorithms with provable approximation ratios and an exact
algorithm with a pruning technique to address this problem. In addition, we propose
an online setting of GEACC, called OnlineGEACC, which is also practical in
real-world scenarios. We further design an online algorithm with a provable
performance guarantee. Finally, we verify the effectiveness and efficiency of the proposed
methods through extensive experiments on real and synthetic datasets.
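
To make the constraints in this problem concrete, the following is a minimal sketch of a simple greedy heuristic that respects event capacities, user capacities, and event conflicts. It is not one of the approximation or exact algorithms from the paper, whose details the abstract does not give. The function name `greedy_arrangement`, the score dictionary, and the example data are all hypothetical.

```python
def greedy_arrangement(scores, event_cap, user_cap, conflicts):
    """Greedily pick (event, user) pairs in order of descending interest score.

    scores:    dict mapping (event, user) -> interest score (hypothetical input)
    event_cap: dict mapping event -> maximum number of participants
    user_cap:  dict mapping user -> maximum number of events
    conflicts: set of frozenset({e1, e2}) for events that one user cannot both attend
    """
    event_left = dict(event_cap)
    user_left = dict(user_cap)
    attended = {u: set() for u in user_cap}
    arrangement = []
    # Ties are broken by the (event, user) key so the result is deterministic.
    for (e, u), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if event_left[e] == 0 or user_left[u] == 0:
            continue  # capacity exhausted on one side
        if any(frozenset((e, other)) in conflicts for other in attended[u]):
            continue  # e conflicts with an event already given to u
        arrangement.append((e, u))
        attended[u].add(e)
        event_left[e] -= 1
        user_left[u] -= 1
    return arrangement


# Toy instance: e1 and e2 conflict (for example, they overlap in time).
scores = {
    ("e1", "u1"): 0.9, ("e2", "u1"): 0.8, ("e2", "u2"): 0.7,
    ("e3", "u1"): 0.5, ("e1", "u2"): 0.4,
}
event_cap = {"e1": 1, "e2": 1, "e3": 1}
user_cap = {"u1": 2, "u2": 1}
conflicts = {frozenset(("e1", "e2"))}
arrangement = greedy_arrangement(scores, event_cap, user_cap, conflicts)
```

In this toy instance the pair (e2, u1) is skipped because u1 already holds the conflicting event e1, which is the kind of infeasible arrangement that conflict-unaware approaches can produce.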

Incremental and Decremental Max-flow for Online Semi-supervised Learning

ABSTRACT

Max-flow has been adopted for semi-supervised data modelling, yet existing
algorithms were derived only for learning from static data. This paper proposes
an online max-flow algorithm for semi-supervised learning from data streams.
Consider a graph learned from labelled and unlabelled data that is updated
dynamically to accommodate the online addition and retirement of data. In
learning from the resulting non-stationary graph, we augment and de-augment
paths to update max-flow with a theoretical guarantee that the updated max-flow
equals that from batch retraining. For classification, we compute min-cut over
the current max-flow, so that a minimal number of similar sample pairs are classified
into distinct classes. Empirical evaluation on real-world data reveals that our
algorithm outperforms state-of-the-art stream classification algorithms.
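
The batch baseline that the online algorithm is guaranteed to match can be sketched as follows: build a similarity graph, attach labelled samples to a source or sink, run max-flow, and read the classes off the min-cut. This is a plain Edmonds-Karp implementation on a static graph, not the incremental and decremental update from the paper. The node names `s` and `t`, the helper `build_graph`, and the toy similarities are assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_graph(similarities, positives, negatives, big=10**6):
    """Undirected similarity edges plus high-capacity links to source 's' and sink 't'."""
    capacity = defaultdict(dict)
    for (u, v), w in similarities.items():
        capacity[u][v] = w
        capacity[v][u] = w
    for p in positives:
        capacity["s"][p] = big
    for n in negatives:
        capacity[n]["t"] = big
    return capacity


def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp. Returns (max-flow value, set of nodes on the source side of a min-cut)."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0.0  # ensure the reverse edge exists

    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No augmenting path left: the reachable nodes form the source side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then augment.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Toy data: 'a' is labelled positive, 'd' is labelled negative, 'b' and 'c' are unlabelled.
similarities = {("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 3}
graph = build_graph(similarities, positives=["a"], negatives=["d"])
flow, source_side = max_flow_min_cut(graph, "s", "t")
labels = {n: ("positive" if n in source_side else "negative") for n in "abcd"}
```

Here the cut severs only the weak b-c edge, so b follows a and c follows d. An online variant would update the residual graph as samples arrive or retire instead of rebuilding it from scratch.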

Trust-but-Verify: Verifying Result Correctness of Outsourced Frequent Itemset Mining in Data-mining-as-a-service Paradigm

ABSTRACT

Cloud computing is popularizing the computing paradigm in which data is
outsourced to a third-party service provider (server) for data mining. Outsourcing,
however, raises a serious security issue: how can a client with weak computational
power verify that the server returned the correct mining result? In this paper, we focus
on the specific task of frequent itemset mining. We consider a server that is
potentially untrusted and tries to escape verification by using its prior
knowledge of the outsourced data. We propose efficient probabilistic and
deterministic verification approaches to check whether the server has returned
correct and complete frequent itemsets. Our probabilistic approach can catch
incorrect results with high probability, while our deterministic approach measures
the result correctness with 100% certainty. We also design efficient verification
methods for the cases in which the data and the mining setup are updated. We
demonstrate the effectiveness and efficiency of our methods using an extensive set
of empirical results on real datasets.
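
To illustrate what "correct and complete" means here, the sketch below is a brute-force check that a client with full access to the data could run. It is not the paper's probabilistic or deterministic approach, which is designed to avoid exactly this cost. Correctness is checked by recounting the support of each returned itemset; completeness is checked by confirming that every itemset just outside the returned set (all of its immediate subsets were returned, but it was not) is infrequent. The function names and the toy transactions are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def verify_result(returned, transactions, min_support):
    """Return (correct, complete) for a claimed collection of frequent itemsets."""
    returned = {frozenset(x) for x in returned}
    # Correctness: every returned itemset must really reach the support threshold.
    correct = all(support(x, transactions) >= min_support for x in returned)

    # Completeness: any missing frequent itemset has a minimal missing subset,
    # and that subset appears among the border candidates built below.
    items = set().union(*transactions)
    known = returned | {frozenset()}
    border = set()
    for x in known:
        for i in items - x:
            cand = x | {i}
            if cand in returned:
                continue
            subsets = combinations(cand, len(cand) - 1)
            if all(frozenset(sub) in known for sub in subsets):
                border.add(cand)
    complete = all(support(c, transactions) < min_support for c in border)
    return correct, complete


transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "abc")]
honest = ["a", "b", "c", "ab", "ac", "bc"]        # true answer for min_support = 3
dropped = ["a", "b", "c", "ab", "ac"]             # server omits the frequent itemset bc
padded = honest + ["abc"]                          # server adds the infrequent itemset abc

honest_check = verify_result(honest, transactions, 3)
dropped_check = verify_result(dropped, transactions, 3)
padded_check = verify_result(padded, transactions, 3)
```

The recount touches every transaction for every candidate, which is why a client with weak computational power needs cheaper verification than this baseline.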

User Preference Learning for Online Social Recommendation

ABSTRACT

Social recommendation systems have attracted a lot of attention recently in the
research communities of information retrieval, machine learning and data mining.
Traditional social recommendation algorithms are often based on batch machine
learning methods, which suffer from several critical limitations, e.g., extremely
expensive model retraining whenever new user ratings arrive and an inability to
capture changes in user preferences over time. Therefore, it is important to make
social recommendation systems suitable for real-world online applications where
data often arrives sequentially and user preferences may change dynamically and
rapidly. In this paper, we present a new framework of online social
recommendation from the viewpoint of online graph regularized user preference
learning (OGRPL), which incorporates both the collaborative user-item relationship
and item content features into a unified preference learning process. We
further develop an efficient iterative procedure, OGRPL-FW, which utilizes the
Frank-Wolfe algorithm to solve the proposed online optimization problem. We
conduct extensive experiments on several large-scale datasets, in which the
encouraging results demonstrate that the proposed algorithms obtain significantly
lower errors (in terms of both RMSE and MAE) than the state-of-the-art online
recommendation methods when receiving the same amount of training data in the
online learning process.
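
For readers unfamiliar with the Frank-Wolfe algorithm mentioned above, the following is a generic sketch of it on a toy problem: minimizing a squared distance over the probability simplex. It is not OGRPL-FW, and the abstract does not state the objective or constraint set used in the paper, so the simplex constraint, the quadratic objective, and the name `frank_wolfe_simplex` are assumptions chosen only to show the update rule.

```python
def frank_wolfe_simplex(grad, dim, steps=2000):
    """Frank-Wolfe over the probability simplex {x >= 0, sum(x) = 1}.

    grad: function returning the gradient of the objective at x (a list of floats).
    Each step moves toward the simplex vertex that minimizes the linearized
    objective, so iterates stay feasible without any projection.
    """
    x = [1.0 / dim] * dim
    for t in range(steps):
        g = grad(x)
        # Linear minimization oracle: on the simplex this is the vertex with the
        # smallest gradient coordinate.
        i = min(range(dim), key=lambda k: g[k])
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        x = [(1.0 - gamma) * xk for xk in x]
        x[i] += gamma
    return x


# Toy objective: f(x) = sum_k (x_k - target_k)^2 with the target inside the simplex.
target = [0.2, 0.5, 0.3]


def toy_grad(x):
    return [2.0 * (xk - tk) for xk, tk in zip(x, target)]


solution = frank_wolfe_simplex(toy_grad, dim=3)
```

The appeal of this family of methods is that each iteration needs only a linear minimization over the constraint set rather than a projection, which can be much cheaper for some constraint sets.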
