Social media is growing at an explosive rate, with millions of people all over the world generating and sharing content on a scale barely imaginable a few years ago. This has resulted in massive participation, with countless updates, opinions, news items, comments and product reviews being constantly posted and discussed on social websites such as Facebook, Digg and Twitter, to name a few. Compared with other ways of communicating with the rest of the world, tweets (Twitter posts) are among the fastest and most successful at reaching people in different parts of the world.
Twitter was founded by Jack Dorsey, Evan Williams, Biz Stone and Noah Glass. It was created in March 2006 and launched in July 2006. In a very short period of time, Twitter gained popularity worldwide and became one of the ten most visited websites. According to the available data, as of May 2015 Twitter had 500 million users, of which 332 million were active.
Twitter is a micro-blogging platform that allows users to broadcast short messages to other Twitter users. Broadcasting such a message amounts to updating one's status on Twitter, and the update is known as a tweet. Twitter limits the length of a tweet to 140 characters. Users' main motivations for tweeting are to share knowledge of their field, to discuss and shed light on current events, and to build a network. A tweet can be almost anything: personal information about the user, discussion of current local or global events, or a link to photos, videos or an article. Sometimes an event becomes popular, and many users then want to forward or re-post a tweet originally posted by another user; this is called a re-tweet. Re-tweets play an important role in determining trends. For instance, a photo of Ellen DeGeneres with 11 other Hollywood celebrities, which she tweeted at the 2014 Oscars (#Oscars), was one of the most popular tweets of its time and, with about 3.3 million re-tweets, held the record as the most re-tweeted post on Twitter at the time of writing.
Tweets and trends are closely associated with time and hash-tags. Trends are presented as a list on the main twitter.com site and are determined by an algorithm. Trending topics are the topics that have the ‘maximum number of tweets’, a ‘frequently used hash-tag’ or are ‘most tweeted about in that time period’. A topic is said to be trending on the basis of the number of tweets it receives over a certain period of time rather than the number it has received since its inception. A trending topic gains popularity over a short period. Trending topics play a significant role in spreading breaking news stories, and they also allow other users to offer opinions on those stories.
Some Twitter trends are tailored to the user and some are based on user-defined locations. Users can customize their trends by selecting a particular location or the whole world. Twitter then shows the trending topics for the selected location. The list of trending topics is updated every few minutes as new topics become popular.
If a trend were based only on raw counts of words or hash-tags, the list would feature ‘and’, ‘the’ and other words that are used very often, and the trend box would never change, because those words are always frequent. So Twitter trends do not depend only on how often words, phrases or hash-tags occur; Twitter also uses an algorithm to determine or predict trends. This algorithm is based on the tweets posted in a certain time period (temporal tweets) as well as the number of users tweeting (or re-tweeting) on the same topic in that period.
Working of Twitter:
A user composes a message, that is, a tweet of up to 140 characters. The user can attach a hash-tag related to the topic being introduced in the tweet. To address a specific user, the symbol ‘@’ is used, followed by that user's username. To re-tweet, the user clicks the symbol shown below the tweet.
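The tweet conventions described above (a 140-character limit, ‘#’ for hash-tags and ‘@’ for addressing users) can be illustrated with a small parser. This is only a sketch: the function name `parse_tweet` and the regular expressions are assumptions made for illustration, not Twitter's own rules for what counts as a valid hash-tag or username.

```python
import re

MAX_TWEET_LENGTH = 140  # character limit stated in the text

# Simplified patterns (assumption): a tag or username is a run of word characters.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def parse_tweet(text):
    """Return the hash-tags and addressed usernames found in a tweet."""
    if len(text) > MAX_TWEET_LENGTH:
        raise ValueError("tweet exceeds 140 characters")
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }


example = parse_tweet("@alice great talk on #Elections2016 today #news")
print(example)
```

A real parser would also have to deal with e-mail addresses, punctuation and non-Latin scripts, which this sketch ignores.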
Factors contributing to a trend:
Some of the important factors that contribute to a trend are:
a) The topic itself contributes to a trend, depending on its popularity.
b) Hash-tags related to a particular topic, for example ‘#Elections2016’ for the 2016 presidential election in the USA or ‘#prayforjapan’ for the 2011 earthquake in Japan.
c) Tweets that include debates, arguments and personal opinions of users.
d) The re-tweeting rate.
e) The network of users who are tweeting on that particular topic.
f) The geographical area that the user selected for trends and the geographical area of the event.
g) The length of time for which the topic is tweeted about, since trends are determined by the number of tweets a topic receives in a certain time period.
Tweet counts show normalized multivariate behavior, which means that the distribution has zero mean for each of the several random variables involved. The cumulative tweet count for topic $q$ up to time interval $t_i$ can therefore be defined as
$$C_q(t_i) = \sum_{\tau=1}^{i} N_q(t_\tau)$$
that is, the sum over time intervals of $N_q(t_\tau)$, the number of tweets for topic $q$ in time interval $t_\tau$.
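The cumulative count described above is simply a running sum of the per-interval tweet counts. A minimal sketch, using made-up counts for a single topic (the function name and the numbers are illustrative assumptions):

```python
from itertools import accumulate


def cumulative_counts(tweets_per_interval):
    """Running total of tweets for one topic, interval by interval."""
    return list(accumulate(tweets_per_interval))


# Hypothetical number of tweets on one topic in four consecutive intervals.
per_interval = [5, 12, 30, 18]
print(cumulative_counts(per_interval))  # [5, 17, 47, 65]
```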
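Some topics gain popularity and stay popular for a long time. If $N_q(t)$ denotes the number of tweets on topic $q$ in time interval $t$ and $N_q(0)$ the initial number, the number of tweets on the topic over a long time is defined as:
$$N_q(t) = \prod_{s=1}^{t} \bigl(1 + \chi(s)\bigr)\, N_q(0)$$
After some time a topic loses its popularity; this loss is described by the decay factor $\gamma(t)$.
The random variable $\chi(s)$ stands for any factor contributing to the trend.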
Figure 1: In graph (a), the histogram (grey region) shows the actual number of tweets and the black curve is the normal distribution of the tweet numbers. In graph (b), the red line is the theoretically derived number of tweets and the black line is the actual number of tweets with respect to time, shown as a cumulative distribution function (CDF).
Figure 1 shows that:
· The number of tweets is a relative quantity.
· The number of tweets in a particular time slot t is related to the number in the previous time slot. In other words, it is a multiple of the number of tweets in the previous time slot.
· The growth of tweets over time is inversely proportional to the decay factor $\gamma(t)$: as the number of tweets increases with time, the decay factor decreases, and vice versa.
Posting multiple updates or tweets that are unrelated to a trending or popular topic under that topic gives rise to Twitter noise. Malicious tweets increase ambiguity in the Twitter stream. Because of noise in the stream, the behavior of a trend becomes unpredictable, in other words stochastic.
Detecting Malicious Tweets - Twitter provides several ways for users to report spam. Twitter investigates these reports and, where spam is confirmed, suspends or blocks the reported accounts.
Figure 2: Architecture of Twitter Noise Identification and Removal
The Twitter Noise Identification and Removal system consists of five processes:
· Trending topics collection - First, the system obtains a set of tweets associated with a trending topic.
· Spam labeling - The second process labels spam within the collected trending topics: the system uses several blacklists to detect spam URLs in tweets and labels the affected tweets in the collection. The labeled collection is then used to train the system and to detect new spam tweets.
· Feature extraction - In this step, each labeled tweet is represented by a set of features obtained with natural language processing and content analysis techniques.
· Classifier training - The final data set, consisting of the labeled tweets and their feature representations, is used to train the classifier so that the model obtains the information it needs to detect spam.
· Spam detection - When the classifier receives a tweet from a user as input, it tells the user whether the tweet is spam or non-spam. If the user determines that the tweet has been misclassified, all tweets affected by the same URL may be re-labeled and updated in the data set.
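The five processes listed above can be sketched end to end in a few lines. Everything specific here is an assumption made for illustration: the blacklisted domains, the sample tweets, the bag-of-words features and the choice of a naive Bayes classifier (the text does not say which classifier or features the system uses).

```python
import math
import re
from collections import Counter

# Hypothetical blacklist; the text only says that several blacklists are used.
BLACKLISTED_DOMAINS = {"spam.example", "malware.example"}
DOMAIN_RE = re.compile(r"https?://([^/\s]+)")
URL_RE = re.compile(r"https?://\S+")


def label_tweet(text):
    """Spam labeling: spam if the tweet links to a blacklisted domain."""
    return any(d in BLACKLISTED_DOMAINS for d in DOMAIN_RE.findall(text))


def extract_features(text):
    """Feature extraction: lower-cased word tokens with URLs removed."""
    return re.findall(r"[a-z0-9#@']+", URL_RE.sub(" ", text.lower()))


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()

    def train(self, labeled_tweets):
        for text, is_spam in labeled_tweets:
            self.doc_counts[is_spam] += 1
            self.word_counts[is_spam].update(extract_features(text))

    def is_spam(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in extract_features(text):
                count = self.word_counts[label][word]
                # The extra +1 reserves probability mass for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return scores[True] > scores[False]


# Trending topics collection (made-up tweets on one topic).
collected = [
    "Win a free phone now http://spam.example/win #Oscars",
    "Free followers click http://malware.example/x #Oscars",
    "What a great selfie at the #Oscars tonight",
    "Best picture winner announced #Oscars http://news.example/story",
]
labeled = [(text, label_tweet(text)) for text in collected]

model = NaiveBayes()
model.train(labeled)
print(model.is_spam("free phone click now #Oscars"))
print(model.is_spam("great picture from the #Oscars"))
```

The feedback step from the list (re-labeling all tweets that share a URL when a user reports a misclassification) would amount to updating `labeled` and calling `train` again on a fresh model.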
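The number of tweets on a trending topic $q$ in a particular time slot $t$ is given by:
$$N_q(t) = \bigl(1 + \chi(t)\bigr)\, N_q(t-1)$$
where $N_q(t-1)$ is the number of tweets in the previous time slot and $\chi(t)$ is a random variable describing the growth in that slot.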
From the above it can be seen that, at each time step the number of new tweets on a topic is a multiple of the tweets that we already have (i.e., number of past tweets is a proxy for the number of users that are aware of the topic up to that point). These users discuss the topic on different forums, including Twitter, essentially creating an effective network through which the topic spreads. As more users talk about a particular topic, many others are likely to learn about it, thus giving the multiplicative nature of the spreading.
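The multiplicative spreading described above can be sketched by repeatedly multiplying the previous count by a growth factor. The name `chi` for the per-step growth and the sample values are assumptions made for illustration; in the model the growth values would be random rather than fixed.

```python
import math


def grow(initial_tweets, growth_values):
    """Apply N(t) = (1 + chi(t)) * N(t-1) once for each chi in growth_values."""
    counts = [initial_tweets]
    for chi in growth_values:
        counts.append((1 + chi) * counts[-1])
    return counts


chis = [0.5, 0.2, -0.1]          # hypothetical per-step growth values
counts = grow(100, chis)
print(counts)

# Taking logarithms turns the product of factors into a sum:
# log N(t) - log N(0) = sum of log(1 + chi(s)).
log_growth = math.log(counts[-1]) - math.log(counts[0])
sum_of_logs = sum(math.log(1 + chi) for chi in chis)
print(math.isclose(log_growth, sum_of_logs))
```

Because the logarithm of the count is a sum of many random terms, it is natural to compare it with a normal distribution, which is the comparison the text describes for Figure 1.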
Persistence of Trend - An important reason to study trending topics on Twitter is to understand why some of them remain at the top while others dissipate quickly. Examining the general pattern of behavior on Twitter shows that most topics trend for one continuous stretch, while around 34% of topics appear in more than one sequence; that is, they stop trending for a certain period of time before beginning to trend again. One reason for this behavior may be the time zones involved. For instance, if a topic is a piece of news relevant to North American readers, a trend may first appear in the Eastern time zone and 3 hours later in the Pacific time zone. Likewise, a topic that was trending in the evening may return the next morning, when more users check their accounts again after the night. Given that many topics do not trend continuously, the distribution of sequence lengths over all topics can be examined; it follows a power law, which means that most topic sequences are short and a few last for a very long time. This could be because many topics compete for attention, so the topics that make it to the top (the trend list) stay there only briefly. In many cases, however, topics return to trend for more time, which is captured by the number of sequences.
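The notion of a topic trending in more than one sequence can be made concrete by splitting the timestamps at which a topic was on the trend list into runs of consecutive values. This is a minimal sketch; the integer timestamps and the `step` parameter are assumptions made for illustration.

```python
def trending_sequences(timestamps, step=1):
    """Split timestamps into runs in which consecutive values differ by at most step."""
    runs = []
    for t in sorted(timestamps):
        if runs and t - runs[-1][-1] <= step:
            runs[-1].append(t)   # continues the current sequence
        else:
            runs.append([t])     # a gap: the topic started trending again
    return runs


# Hypothetical sampling times at which one topic appeared on the trend list.
seen_at = [1, 2, 3, 7, 8, 20]
sequences = trending_sequences(seen_at)
print(sequences)                    # [[1, 2, 3], [7, 8], [20]]
print([len(s) for s in sequences])  # [3, 2, 1]
```

Collecting the sequence lengths over all topics gives the distribution that the text describes as following a power law.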
Relation to authors and activity - Examining the authors who tweet about a given trending topic shows whether the authors change over time or whether the same people keep tweeting and cause the trend. The correlation between the number of unique authors for a topic and the duration (number of timestamps) for which the topic trends is very strong (0.80). This indicates that the lifetime of a topic increases with the number of authors, suggesting that propagation through the network causes the topic to trend.
The impact of authors can be computed for each topic $q$ by the active-ratio $a_q$ as:
$$a_q = \frac{\text{number of tweets}}{\text{number of unique authors}}$$
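As a small illustration, the sketch below computes an active ratio for one topic under the assumption that it is the number of tweets divided by the number of unique authors. The `(author, text)` data layout and the sample tweets are hypothetical.

```python
def active_ratio(tweets):
    """tweets: list of (author, text) pairs for a single topic."""
    authors = {author for author, _ in tweets}
    if not authors:
        raise ValueError("no tweets for this topic")
    return len(tweets) / len(authors)


topic_tweets = [
    ("alice", "Watching the #Oscars"),
    ("bob", "That selfie! #Oscars"),
    ("alice", "Best picture soon #Oscars"),
    ("alice", "What a night #Oscars"),
]
print(active_ratio(topic_tweets))  # 2.0
```

A ratio close to 1 means almost every tweet comes from a different author, while a high ratio means a few authors account for most of the tweets.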
Trend detection over the twitter stream:
Twitter-Monitor performs trend detection in two steps and analyzes trends in a third step. First, it identifies ‘bursty’ keywords, i.e. keywords that suddenly appear in tweets at an unusually high rate. It then groups bursty keywords into trends based on their co-occurrences. In other words, a trend is identified as a set of bursty keywords that frequently occur together in tweets. After a trend is identified, Twitter-Monitor extracts additional information from the tweets that belong to the trend, aiming to discover interesting aspects of it.
Detecting Bursty Keywords: A keyword is identified as bursty when it is encountered at an unusually high rate in the stream. Twitter-Monitor treats bursty keywords as ‘entry points’ for trend detection. In other words, whenever a keyword exhibits bursty behavior, Twitter-Monitor considers this an indication that a new topic has emerged and seeks to explore it further. Effective and efficient detection of bursty keywords is thus crucial to Twitter-Monitor’s performance. To detect bursty keywords, Twitter-Monitor uses an algorithm called Queue-Burst, which has the following characteristics: (i) one-pass, (ii) real-time, (iii) adjustable against ‘spurious’ bursts, (iv) adjustable against spam, and (v) theoretically sound.
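The idea of a keyword arriving at an unusually high rate can be illustrated with a simple baseline comparison. This is not the Queue-Burst algorithm, whose details are not given here; it is a swapped-in sketch that flags a keyword when its count in the current window lies several standard deviations above its average in earlier windows. The function name, the thresholds and the sample counts are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def bursty_keywords(history, current, threshold=3.0, min_count=5):
    """history: non-empty list of Counters for past windows; current: Counter."""
    bursty = []
    for word, count in current.items():
        past = [window[word] for window in history]  # Counter gives 0 if absent
        mu = mean(past)
        # Floor of 1.0 avoids division by zero for words with a flat history.
        sigma = max(pstdev(past), 1.0)
        if count >= min_count and (count - mu) / sigma > threshold:
            bursty.append(word)
    return sorted(bursty)


history = [
    Counter({"the": 50, "oscars": 1}),
    Counter({"the": 52, "oscars": 0}),
    Counter({"the": 49, "oscars": 2}),
]
current = Counter({"the": 51, "oscars": 40, "selfie": 30, "lunch": 2})
print(bursty_keywords(history, current))  # ['oscars', 'selfie']
```

A common word such as "the" is frequent in every window and is therefore not flagged, and the `min_count` floor is one crude way of guarding against spurious bursts from rare words.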
From Bursty Keywords to Trends: To group bursty keywords, the Group-Burst algorithm assesses their co-occurrences in recent tweets. For this purpose, a few minutes’ history of tweets is retrieved for each bursty keyword, and keywords that are found to co-occur in a relatively large number of recent tweets are placed in the same group.
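Grouping by co-occurrence can be sketched as follows: two bursty keywords are linked when they appear together in enough recent tweets, and linked keywords are merged into one group. This is a hypothetical illustration rather than the Group-Burst implementation; the `min_cooccurrences` cut-off and the sample tweets are assumptions.

```python
from itertools import combinations


def group_bursty_keywords(bursty, recent_tweets, min_cooccurrences=2):
    """bursty: iterable of keywords; recent_tweets: list of sets of words."""
    bursty = sorted(set(bursty))
    parent = {word: word for word in bursty}  # union-find forest

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in combinations(bursty, 2):
        together = sum(1 for tweet in recent_tweets if a in tweet and b in tweet)
        if together >= min_cooccurrences:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for word in bursty:
        groups.setdefault(find(word), []).append(word)
    return sorted(groups.values())


recent = [
    {"oscars", "selfie", "ellen"},
    {"oscars", "selfie"},
    {"ellen", "selfie", "wow"},
    {"earthquake", "japan"},
    {"earthquake", "japan", "pray"},
    {"oscars", "ellen"},
]
trends = group_bursty_keywords(
    ["oscars", "selfie", "ellen", "earthquake", "japan"], recent
)
print(trends)  # [['earthquake', 'japan'], ['ellen', 'oscars', 'selfie']]
```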
Trend Analysis: After a trend is identified, Twitter-Monitor attempts to compose a more accurate description of it. It collects data on the keywords associated with the trend. It also identifies frequently cited sources and frequent geographical origins of the tweets that belong to the trend and adds them to the description. Finally, a chart is produced for each trend that depicts the evolution of its popularity over time and is updated for as long as the trend remains popular.
Detecting Length of Long Trending Topics:
We assume that if the relative growth rate of tweets, denoted by
$$R_q(t) = \frac{N_q(t)}{N_q(t-1)},$$
falls below a certain threshold $\theta$, the topic stops trending. When we consider long-trending topics, as they grow in time they overcome the initial novelty decay (the decay factor thus becomes constant), so we can measure the change over time using only the random variable $\chi$:
$$\log R_q(t) = \log \frac{\prod_{s=1}^{t} \bigl(1 + \chi(s)\bigr)}{\prod_{s=1}^{t-1} \bigl(1 + \chi(s)\bigr)} = \log\bigl(1 + \chi(t)\bigr) \approx \chi(t)$$
where the approximation holds for small $\chi(t)$. Since the $\chi(s)$ are independent and identically distributed random variables, the $R_q(t)$ are independent of each other. Thus the probability $p$ that a topic stops trending in a time interval $s$, where $s$ is large, is equal to the probability that $R_q(s)$ is lower than the threshold $\theta$, which can be written as:
$$p = \Pr\bigl(R_q(s) < \theta\bigr) = \Pr\bigl(\log R_q(s) < \log\theta\bigr) = \Pr\bigl(\chi(s) < \log\theta\bigr) = F(\log\theta)$$
$F(x)$ is the cumulative distribution function of the random variable $\chi$. Given that distribution, we can determine the threshold for survival as:
$$\log\theta = F^{-1}(p)$$
From the independence property of the random variable, the duration or lifetime of a trending topic, denoted by $L$, follows a geometric distribution, which in the continuum case becomes the exponential distribution.
Thus, the probability that a topic survives the first $k$ time intervals and fails in the $(k+1)$-th time interval, given that $k$ is large, can be written as:
$$\Pr(L = k) = (1 - p)^{k}\, p$$
The expected length of the trending duration $L$ would thus be:
$$\langle L \rangle = \sum_{k=0}^{\infty} k\,(1 - p)^{k}\, p = \frac{1}{p} - 1$$
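The geometric lifetime can be checked numerically. The sketch below assumes that a topic stops trending in each interval independently with probability p, so that Pr(L = k) = (1 - p)^k p, and, purely for illustration, that the growth variable is normally distributed with mean `mu` and standard deviation `sigma`. The function names and all numbers are hypothetical.

```python
import math
from statistics import NormalDist


def stop_probability(theta, mu, sigma):
    """p = F(log theta), with F taken to be a normal CDF (assumption)."""
    return NormalDist(mu, sigma).cdf(math.log(theta))


def survival_threshold(p, mu, sigma):
    """Inverse relation: theta = exp(F^-1(p))."""
    return math.exp(NormalDist(mu, sigma).inv_cdf(p))


def length_probability(k, p):
    """Probability of surviving k intervals and stopping in the next one."""
    return (1 - p) ** k * p


def expected_length(p, max_k=10_000):
    """Truncated sum of k * Pr(L = k); approaches 1/p - 1 as max_k grows."""
    return sum(k * length_probability(k, p) for k in range(max_k))


p = 0.2
print(expected_length(p), 1 / p - 1)

theta = 0.9  # hypothetical threshold on the relative growth rate
p_stop = stop_probability(theta, mu=0.0, sigma=0.1)
print(p_stop, survival_threshold(p_stop, mu=0.0, sigma=0.1))
```

With p = 0.2 the truncated sum agrees with the closed form 1/p - 1 = 4, so a topic that stops trending with probability 0.2 per interval is expected to trend for about four intervals.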
We considered the trending duration of topics that trended for more than 10 timestamps on Twitter. The comparison between the geometric distribution and the observed trending duration is shown in Figure 4:
Figure 4: Comparison between the trend data (black) and the geometric distribution (red); trend duration in minutes.
Trend Prediction Model:
The behavior of a trend can be predicted on the basis of the number of tweets at a given moment. The total number of tweets on a trending topic shows either a stable or an exponential increase.
The Twitter web server uses a simple algorithm to predict whether or not a topic will trend. This algorithm involves a Crawler, a web tool that gathers and categorizes information from the World Wide Web; a SolrINDEX; and Data Services that create graphical representations used to predict a trend.
1. The Crawler extracts tweet data from the Twitter API; in other words, it gets live tweet updates on topics. The Crawler then identifies specific taxonomy items such as keywords, nouns, frequently occurring words, hashtags and the title of the topic itself, as well as the time at which the tweet was posted. Either of the two, or a combination of both, is considered.
2. The Crawler then cleans the data using the Twitter noise removal algorithm discussed earlier. Once the data is free of noise, POS tagging is applied. POS (Part-of-Speech) tags are the tags assigned to the data on the basis of the keywords and times extracted.
3. A SolrINDEX is then created, and all of the tweet data, its metadata and the associated tags are stored under this index.
4. The data from the SolrINDEX is then fed to the Data Services, which create Stream-Graph and Weighted-Graph visualizations. The Data Services take into account (a) the search/query/topic, (b) the time period, (c) the maximum number of re-occurring terms, and (d) the type of re-occurring term.
Figure 5: Trend Prediction Architecture
Stream-Graph Visualization (Figure 6) is a graphical representation of tweet count versus time. The time period of interest is split into a specified number of intervals. For each interval, the graph returns the most frequent topics that co-occur with the query term within that interval. An example of a stream graph for the top trending topics worldwide is given in Figure 6.
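The data behind such a graph can be sketched as follows: the time period is split into equal intervals, and for each interval the terms that most often appear together with the query term are counted. The function name, the `(timestamp, terms)` layout and the sample tweets are assumptions made for illustration, not the actual data service.

```python
from collections import Counter


def stream_graph_data(tweets, query, start, end, n_intervals, top_n=2):
    """tweets: list of (timestamp, list_of_terms); returns top terms per interval."""
    width = (end - start) / n_intervals
    buckets = [Counter() for _ in range(n_intervals)]
    for timestamp, terms in tweets:
        if not (start <= timestamp < end) or query not in terms:
            continue
        # min() guards against rounding pushing a late tweet past the last bucket.
        index = min(int((timestamp - start) / width), n_intervals - 1)
        buckets[index].update(t for t in terms if t != query)
    return [bucket.most_common(top_n) for bucket in buckets]


# Hypothetical tweets: (seconds since start, extracted terms).
sample = [
    (5, ["#boston", "police"]),
    (20, ["#boston", "police", "marathon"]),
    (70, ["#boston", "suspect"]),
    (90, ["#boston", "suspect", "police"]),
    (100, ["weather", "rain"]),
]
layers = stream_graph_data(sample, "#boston", start=0, end=120, n_intervals=2)
print(layers)
```

Each inner list holds the (term, count) pairs for one interval, which is the information a plotting library would need to draw the layers of the graph.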
Weighted-Graph Visualization (Figure 7) is a graphical representation based on the ranking of keywords. This graph returns the most important topics that co-occur with the query terms and the number of times each co-occurred.
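The ranking behind a weighted graph can be sketched by counting, for each term, the number of tweets in which it appears together with the query term. The function name and the sample tweets are hypothetical, and counting each term once per tweet is an assumption about how co-occurrence is measured.

```python
from collections import Counter


def weighted_cooccurrences(tweets, query):
    """tweets: list of term lists; returns (term, count) pairs, highest count first."""
    weights = Counter()
    for terms in tweets:
        if query in terms:
            # A set ensures a term repeated within one tweet is counted once.
            weights.update(set(terms) - {query})
    return weights.most_common()


sample_terms = [
    ["#boston", "police", "marathon"],
    ["#boston", "police"],
    ["#boston", "police", "suspect", "suspect"],
    ["weather"],
]
edges = weighted_cooccurrences(sample_terms, "#boston")
print(edges[0])     # ('police', 3)
print(dict(edges))
```

The counts can serve as edge weights between the query term and each co-occurring term when the graph is drawn.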
Figure 6: Stream-Graph Visualization of top Twitter trends by users
Figure 7: Weighted-Graph Visualization of #Hashtags for topic: Boston Terror Attacks
Kraker, P., Wagner, C., Jeanquartier, F., & Lindstaedt, S. (2011). On the way to a science intelligence: visualizing TEL tweets for trend detection. In Towards Ubiquitous Learning (pp. 220-232). Springer Berlin Heidelberg.
Asur, S., Huberman, B. A., Szabo, G., & Wang, C. (2011). Trends in social media: Persistence and decay. Available at SSRN 1755748.
Dergiades, T., Milas, C., & Panagiotidis, T. (2014). Tweets, Google trends, and sovereign spreads in the GIIPS. Oxford Economic Papers, gpu046.
Irani, D., Webb, S., Pu, C., & Li, K. (2010). Study of trend-stuffing on Twitter through text classification. In Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS).
Mathioudakis, M., & Koudas, N. (2010, June). TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 1155-1158). ACM.