Methodology - mse2010

To collect enough data for statistically significant results, we asked Twitter to whitelist our account.

We then extracted the data ourselves through the Twitter API. We used a Python library called Twython to collect ~200K sample tweets, along with information such as the creation time, the user's number of friends, followers, and tweets, the number of retweets, etc. The crawl maintained a set of users, initially consisting of the users in the public timeline as well as our own friends. The algorithm randomly selected a user from that set, extracted a random tweet from that user, and then added a random sample of the user's friends to the set. This process was repeated until we reached the target number of tweets (200K).
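The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not our original script: get_random_tweet and get_friends are hypothetical stand-ins for the Twython calls, and friends_per_step and max_steps are assumed parameters that the description above does not specify.

```python
import random

def crawl(seed_users, get_random_tweet, get_friends, target=200_000,
          friends_per_step=5, max_steps=1_000_000, rng=None):
    """Collect tweets by repeatedly sampling users from a growing user set.

    get_random_tweet(user) and get_friends(user) are hypothetical wrappers
    around the API; friends_per_step and max_steps are assumed values.
    """
    rng = rng or random.Random()
    users = list(seed_users)   # a list allows a fast uniform random choice
    seen = set(users)          # a set keeps each user in the pool only once
    tweets = []
    steps = 0
    while len(tweets) < target and users and steps < max_steps:
        steps += 1
        user = rng.choice(users)
        tweet = get_random_tweet(user)   # may return None (e.g. no tweets)
        if tweet is not None:
            tweets.append(tweet)
        friends = list(get_friends(user))
        sample_size = min(friends_per_step, len(friends))
        for friend in rng.sample(friends, sample_size):
            if friend not in seen:
                seen.add(friend)
                users.append(friend)
    return tweets
```

Note that this sketch does not remove duplicate tweets; the same tweet can be drawn twice if a user is selected more than once.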

Given the way the algorithm was designed, we obtained what we consider a representative sample of the Twitter universe. The tweets came from users all over the world and from different types of users (businesses, personal users). The tweets were fairly recent (~85% were from this year, a few from 2009, and even fewer from 2008). The reason is that the GetUserTimeline() method from the API returns a user's most recent tweets, and the users in the public timeline are generally active users. However, we don't think this is a concern, as the most recent tweets are the most relevant for our purposes in the first place.

We then analyzed this data using tools such as Excel and Matlab to test our initial hypothesis, and discovered some interesting new insights along the way. For the analysis we separated the data into two parts: one consisting of a random sample of tweets (~200K) and another consisting solely of tweets that had been retweeted (4K).

To measure the effect of including a particular feature, we compared the probability of a retweet given the feature against the probability of a retweet given the absence of that feature, and used the ratio of those two probabilities as our standard metric. Separating the data into two parts, a normal sample and a sample of retweeted tweets, essentially gives us these probabilities, since we obtain P(feature) and P(feature|retweet) simply from raw counts.

Recall that, by Bayes' rule:
P(retweet|feature) = P(feature|retweet) * P(retweet) / P(feature).
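As an illustration of how the ratio metric follows from raw counts, here is a small sketch. Applying Bayes' rule to both the feature and its absence, the P(retweet) term cancels in the ratio, so only P(feature) and P(feature|retweet) are needed. The function name and argument names are ours, chosen for this example.

```python
def retweet_ratio(n_feature_all, n_all, n_feature_rt, n_rt):
    """Return P(retweet|feature) / P(retweet|no feature) from raw counts.

    n_feature_all / n_all : tweets with the feature / total, normal sample
    n_feature_rt  / n_rt  : tweets with the feature / total, retweeted sample
    """
    p_f = n_feature_all / n_all      # P(feature)
    p_f_rt = n_feature_rt / n_rt     # P(feature|retweet)
    if not 0 < p_f < 1 or p_f_rt >= 1:
        raise ValueError("ratio is undefined for these counts")
    # Bayes' rule: P(retweet|feature)    = P(feature|retweet)    * P(retweet) / P(feature)
    #              P(retweet|no feature) = P(no feature|retweet) * P(retweet) / P(no feature)
    # P(retweet) appears in both and cancels when we divide.
    with_feature = p_f_rt / p_f
    without_feature = (1 - p_f_rt) / (1 - p_f)
    return with_feature / without_feature
```

For example, with made-up counts of 200 out of 1,000 tweets having the feature in the normal sample and 40 out of 100 in the retweeted sample, retweet_ratio(200, 1000, 40, 100) is about 2.67: a tweet with the feature would be roughly 2.7 times as likely to be retweeted as one without it.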