
Topic Models from Twitter Hashtags

December 2, 2013

Problem Definition

Topic modelling is about capturing knowledge from a set of semantically related documents in a mathematical framework that represents an abstraction of what each topic means. A typical topic modelling problem involves a document collection and a procedure to discover which topics occur in it. Its output is statistical information about the topics that appear in the documents, normally as a probability distribution (or a set of distributions). Several machine learning tasks take advantage of topic models, using them as alternative or complementary representations of items. Topic modelling also allows discovering groups of documents, or groups of features, that contain related information. Groups of terms represent extensional descriptions of what the topics are, while document collections act much like implicit descriptions to learn from. Some methods try to discover how many topics a collection contains, while for others the number of topics the algorithm must consider has to be set manually.

Topic modelling techniques are usually applied to large collections of long documents, such as scientific literature or news, where topics stay relatively unchanged and the effect of time can be ignored. However, the popularization of user-generated real-time streams of textual information raises new problems that known procedures do not solve completely. On the other hand, this kind of sparse and unstructured data is rich in information that could be exploited if messages that potentially refer to the same subjects can be gathered together. The most common applications range from training message classifiers to user preference prediction, item and news recommendation, and personalized search. The high rate at which messages are generated, and their availability in real time, also motivates the exploitation of such a source of news, interests and opinions.
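As a concrete illustration of the setting above, the following sketch fits a topic model over a toy corpus and recovers a per-document probability distribution over topics. It uses scikit-learn's LatentDirichletAllocation purely as an example; the corpus, the topic count, and the choice of library are illustrative assumptions, not part of the proposal.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two documents per latent subject (hypothetical data).
docs = [
    "new phone release camera battery",
    "phone battery life review",
    "election vote campaign debate",
    "campaign debate poll results",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# The number of topics is fixed manually here, as the text notes
# some methods require.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row of doc_topics is a probability distribution P(topic | document).
print(doc_topics.shape)
```

Each row sums to one, so downstream tasks can use it directly as an alternative, low-dimensional representation of the document.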
Some of these documents also contain semi-structured information, such as hashtags or social tags, which is usually integrated into models as evidence of relevance to a given topic; but since many people use these tags indiscriminately, they also represent a potential source of noise. The big picture of the scenario we are trying to mine information from can be summarized in the following assertions:

- Socially generated messages, such as tweets or SMS, are short, normally with a fixed maximum length.
- They are noisy in both grammatical and semantic terms.
- When they contain social labels or hashtags, these are also susceptible to noise, because some users attach tags that do not correspond to the semantic content of the message.
- The number of messages about a given topic or event is a function of time and of the social impact of the topic.
- The content of the messages about a given topic changes over time according to public interest and opinion.

The last two points relate to the dynamic aspect of user-generated content and mean that information in social environments is not static, as it is in closed document collections: it grows and is updated as the events it refers to develop. In our model of the problem this is called the model decay, and it is clearly a function of time.

Having stated the typical scenario for topic modelling and the characteristics of the document domain we are trying to capture, our work is concerned with developing a solution that captures models which better describe the information in such dynamic social environments. The general and specific questions that guide our work can be stated as follows: Can higher-order relations between words be used as features to represent items? Do they capture more time-persistent information, allowing us to identify, with acceptable accuracy, relevant documents generated longer after training than traditional methods can?
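One simple way to make the notion of model decay operational is to train a model on an early window of messages and track its accuracy on later windows, where vocabulary has drifted. The sketch below does this with a naive Bayes message classifier; the messages, labels, and window contents are invented for illustration and stand in for real time-sliced stream data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical "week 0" training window.
train_msgs = ["goal match striker", "goal penalty referee",
              "stock market shares", "market trading shares"]
train_labels = ["sports", "sports", "finance", "finance"]

# Later windows: week1 reuses the training vocabulary, week2 has drifted
# towards new terms the model has never seen.
windows = {
    "week1": (["striker scores goal", "shares rally market"],
              ["sports", "finance"]),
    "week2": (["transfer rumour manager", "crypto coin rally"],
              ["sports", "finance"]),
}

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

# Accuracy per window; the drop over time is the decay being measured.
decay = {week: clf.score(vec.transform(msgs), labels)
         for week, (msgs, labels) in windows.items()}
print(decay)
```

In this toy run the later window scores lower because its messages are built almost entirely from out-of-vocabulary terms, mimicking how topic content changes with public interest.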

How can we evaluate the performance of a model in dynamic environments such as real-time user-generated streams of messages? How do these new attributes relate to the intrinsic attributes of the document corpora? To answer these questions, we develop an experimental framework that lets us test our hypotheses about latent associations by implementing an algorithm that performs a feature space transformation. Using this transformation, we follow the temporal evolution of topics and analyse the model decay for various topics. The next sections give a detailed explanation of the proposed methods, the working hypotheses and the expected results.
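One plausible reading of the higher-order relations mentioned above is a second-order co-occurrence transformation: instead of representing a message by its raw terms, represent it by the co-occurrence neighbourhoods of those terms. The sketch below is one such transformation under that assumption; the proposal does not commit to this exact method, and the corpus is invented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["python code bug", "bug fix code", "match goal team", "team goal win"]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()  # first-order: doc-term counts

# Term-term co-occurrence accumulated over documents.
cooc = X.T @ X
np.fill_diagonal(cooc, 0)  # discard self co-occurrence

# Second-order representation: each document becomes the sum of the
# co-occurrence rows of the terms it contains.
X_second_order = X @ cooc
print(X_second_order.shape)
```

A document now carries weight on words it never contains, provided they co-occur with its words elsewhere in the corpus; this indirect evidence is what could make such features more persistent over time than raw terms.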
