Which topic on Twitter is more popular, Lady Gaga or Oprah Winfrey? This may not seem like an important question, depending upon your view of popular culture, but if we can make the comparison for these two topics, we can make it for any two topics. Certainly in the case of presidential elections, or a corruption scandal in the local news, or an international crisis, it could be a worthwhile goal to be able to analyze social media in a systematic way. On the surface, the answer to the question seems trivial: just add up who has more tweets. Surprisingly, that approach won't answer the question accurately or reliably, at least not very well. Instead, one must consider many of the vexing questions that made inferential statistics necessary.
Let's say we retrieved one hour's worth of Lady Gaga tweets and a similar amount of Oprah Winfrey tweets and just counted them up. What if it just happened to be a slow news day for Oprah? It really wouldn't be a fair comparison. What if most of Lady Gaga's tweets happen at midnight or on Saturdays? We could expand our sampling time, maybe to a day or a week. This could certainly help: generally speaking, the bigger the sample, the more representative it is of the whole population, assuming it is not collected in a biased way. This approach defines popularity as the number of tweets over a fixed period of time. Its success depends upon the choice of a sufficiently large period of time, on the tweets being collected for the two topics at the same time, and on the chosen span of time being equally favorable for both topics.
Another approach to the popularity comparison would build upon what we learned in the previous chapter about how arrival times (and the delays between them) fit into the Poisson distribution. In this alternative definition of the popularity of a topic, we could suggest that if the arrival curve is "steeper" for the first topic in contrast to the second topic, then the first topic is more active and therefore more popular. Another way of saying the same thing is that for the more popular topic, the likely delay until the arrival of the next tweet is shorter than for the less popular topic. You could also say that for a given interval of time, say ten minutes, the number of arrivals for the first topic would be higher than for the second topic. Assuming that the arrival delays fit a Poisson distribution, these are all equivalent ways of capturing the comparison between the two topics.
Just as we did in the chapter entitled "Sample in a Jar," we can use a random number generator in R to illustrate these kinds of differences more concretely. The relevant function for the Poisson distribution is rpois(), for "random Poisson." The rpois() function will generate a stream of random numbers that roughly fit the Poisson distribution; the fit gets better as you ask for a larger and larger sample. The first argument to rpois() is how many random numbers you want to generate, and the second is the mean, in our case the average delay between arrivals, that you want the random number generator to try to come close to. We can look at a few of these numbers and then use a histogram function to visualize the results:
> rpois(10,3)
[1] 5 4 4 2 0 3 6 2 3 3
> mean(rpois(100,3))
[1] 2.99
> var(rpois(100,3))
[1] 3.028182
> hist(rpois(1000,3))
In the first command above, we generate a small sample of n=10 arrival delays, with a hoped-for mean of 3 seconds of delay, just to see what kind of numbers we get. You can see that all of the numbers are small integers, ranging from 0 to 6. In the second command we double-check these results with a slightly larger sample of n=100 to see if rpois() will hit the mean we asked for. In that run it came out to 2.99, which was pretty darned close. If you run this command yourself you will find that your result will vary a bit each time: it will sometimes be slightly larger than three and occasionally a little less than three (or whatever mean you specify). This is normal behavior for a random number generator. In the third command we run yet another sample of 100 random data points, this time analyzing them with the var() function (which calculates the variance; see the chapter entitled "Beer, Farms, and Peas"). It is a curious fact of Poisson distributions that the mean and the variance of the "ideal" (i.e., the theoretical) distribution are the same. In practice, for a small sample, they may be different.
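This equality of mean and variance is easy to check for yourself. Here is a small sketch of our own (not part of the chapter's original commands) showing both statistics converging toward the requested mean of 3 as the sample grows:

# Mean and variance of Poisson samples both approach
# the requested mean (here, 3) as the sample gets larger
for (n in c(10, 100, 1000, 10000)) {
  x <- rpois(n, 3)
  cat("n =", n, " mean =", mean(x), " var =", var(x), "\n")
}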
In the final command of the original set, we ask for a histogram of an even larger sample of n=1000. The histogram shows the most common value hanging right around three seconds of delay, with a nice tail that points rightwards out to about 10 seconds of delay. You can think of this as one possible example of what you might observe if the average delay time between tweets was about three seconds. Note how similar the shape of this histogram is to what we observed with real tweets in the last chapter.
Compare that histogram to the one generated with this command:
hist(rpois(1000,10))
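If you would like to see the two shapes at the same time, one option (a small addition of our own, using base R's par() function) is to stack them in a single plot window:

# Stack the two histograms to make the shift in position
# and spread easier to compare
par(mfrow = c(2, 1))      # two plotting rows in one window
hist(rpois(1000, 3), main = "Mean delay of 3")
hist(rpois(1000, 10), main = "Mean delay of 10")
par(mfrow = c(1, 1))      # restore the default layout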
It is pretty easy to see the different shape and position of this second histogram, which has a mean arrival delay of about ten seconds. First of all, there are not nearly as many zero-length delays. Secondly, the most frequent value is now about 10 (as opposed to two in the previous histogram). Finally, the longest delay is now over 20 seconds (instead of 10 for the previous histogram). One other thing to try is this:
> sum(rpois(1000,10)<=10)
[1] 597
This command generated 1000 new random numbers following the Poisson distribution, also with a hoped-for mean of 10, just like in the histogram above. Using the "<=" inequality test and the sum() function, we then counted up how many events were less than or equal to 10, and this turned out to be 597 events. As a fraction of the total of n=1000 data points that rpois() generated, that is 0.597, or 59.7%.
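Incidentally, there is an equivalent shortcut worth knowing (our own aside, not one of the chapter's original commands): because R treats TRUE as 1 and FALSE as 0, taking the mean() of a logical vector yields the proportion directly.

# mean() of a logical vector is the proportion of TRUEs,
# which skips the manual division by 1000
mean(rpois(1000, 10) <= 10)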
Review 11.1 Popularity Contest (Mid-Chapter Review)

The Poisson distribution has a characteristic shape that would be described as:
A. Negatively (left) skewed
B. Positively (right) skewed
C. Symmetric (not skewed)
D. None of the above
We can look at the same kind of data in terms of the probability of arrival within a certain amount of time. Because rpois() generates delay times directly (rather than us having to calculate them from neighboring arrival times), we will need a slightly different function than the ArrivalProbability() that we wrote and used in the previous chapter. We'll call this function DelayProbability() (the code is at the end of this chapter):
> DelayProbability(rpois(100,10),1,20)
 [1] 0.00 0.00 0.00 0.03 0.06 0.09 0.21 0.33 0.48 0.61 0.73 0.82 0.92
[14] 0.96 0.97 0.98 0.99 1.00 1.00 1.00
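In case you want to experiment before reaching the end of the chapter, here is a minimal sketch of what DelayProbability() might look like, assuming it simply tallies the proportion of delays at or below each successive threshold (the authoritative version appears at the end of the chapter):

# For each threshold from increment up to max, report the
# cumulative proportion of delays at or below that threshold
DelayProbability <- function(delays, increment, max) {
  probs <- numeric(0)
  for (t in seq(increment, max, by = increment)) {
    probs <- c(probs, sum(delays <= t) / length(delays))
  }
  round(probs, 2)
}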
At the heart of the command we just ran is the rpois() function, requesting 100 points with a mean of 10. The other two parameters are the increment, in this case one second, and the maximum delay time, in this case 20 seconds. The output from this function is a sorted list of cumulative probabilities for the times ranging from 1 second to 20 seconds. Of course, what we would really like to do is compare these probabilities to those we would get if the average delay was three seconds instead of ten seconds. We're going to use two cool tricks for creating this next plot. First, we will use the points() command to add points to an existing plot. Second, we will use the col= parameter to specify two different colors for the points that we plot. Here's the code that creates a plot and then adds more points to it:
> plot(DelayProbability(rpois(100,10),1,20), col=2)
> points(DelayProbability(rpois(100,3),1,20), col=3)
Again, the heart of each of these lines of code is the rpois() function that is generating random Poisson arrival delays for us. Our parameters for increment (1 second) and maximum (20 seconds) are the same for both lines. The first line uses col=2, which gives us red points, and the second uses col=3, which yields green points:
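One optional refinement, not part of the original example, is to add a legend so the colors are labeled on the plot itself:

# Optional: label the two colors (col=2 is red, col=3 is green);
# pch=1 is the default open-circle plotting symbol
legend("bottomright", legend = c("mean delay 10", "mean delay 3"),
       col = c(2, 3), pch = 1)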
This plot clearly shows that the green points have a "steeper" profile. We are more likely to have earlier arrivals for the 3-second delay data than we are for the 10-second data. If these were real tweets, the green tweets would be piling in much faster than the red tweets. Here's a reminder on how to read this plot: look at a value on the X-axis, for example "5," then look where the dot is and trace leftward to the Y-axis. For the red dot, the probability value at time x = 5 is about 0.10. So for the red data there is about a 10% chance that the next event will occur within five time units (we've been calling them seconds, but they could really be anything, as long as you use the units consistently throughout the whole example). For the green data there is about an 85% chance that the next event will occur within five time units. The fact that the green curve rises more steeply than the red curve means that, for these two samples only, the green stuff is arriving much more often than the red stuff.
The reason we emphasized the point "for these samples only" is that we know from prior chapters that every sample of data you collect varies by at least a little bit and sometimes by quite a lot. A sample is just a snapshot, after all, and things can and do change from sample to sample. We can illustrate this by running and plotting multiple samples, much as we did in the earlier chapter:
> plot(DelayProbability(rpois(100,10),1,20))
> for (i in 1:15) {points(DelayProbability(rpois(100,10),1,20))}
This is the first time we have used the "for loop" in R, so let's walk through it. A "for loop" is one of the basic constructions that computer scientists use to "iterate," or repeatedly run, a chunk of code. In R, a for loop runs the code that is between the curly braces a certain number of times. The number of times R runs the code depends on the expression inside the parentheses that immediately follow the "for."

In the example above, the expression "i in 1:15" creates a new data object, called i, and puts the number 1 in it. The expression "1:15" tells R to start with one and count up to 15; each time the loop adds one to i, it runs the code between the curly braces again. The data object i, which is just a plain old integer, could also have been used within the curly braces if we had needed it, but in this case we didn't. The code inside the curly braces just runs a new random sample of 100 Poisson points with a hoped-for mean of 10 (a small sketch of how i could be put to work appears below).
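Here is one illustration of using the loop counter inside the braces: the value of i can, for example, select a different color for each added curve. This is a variation of our own, not a command from the chapter:

# Use the loop counter i to give each of the 15 extra
# curves its own color
plot(DelayProbability(rpois(100, 10), 1, 20))
for (i in 1:15) {
  points(DelayProbability(rpois(100, 10), 1, 20), col = i)
}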
When you consider the two original command lines (the plot() followed by the for loop) together, you can see that we initiate a plot() on the first line of code, using similar parameters to before (random Poisson numbers with a mean of 10, fed into our probability calculator, which goes in increments of 1 second up to 20 seconds). In the second line we add more points to the same plot, by running exactly 15 additional copies of the same code. Using rpois() ensures that we have new random numbers each time:
Now instead of just one smooth curve we have a bunch of curves, and these curves vary quite a lot. In fact, if we take the example of 10 seconds (on the X-axis), we can see that in one case the probability of a new event within 10 seconds could be as low as 0.50, while in another case the probability is as high as about 0.70.
This shows why we can't just rely on one sample for making our judgments. We need to know something about the uncertainty that surrounds a given sample. Fortunately, R gives us additional tools to help us figure this situation out. First of all, even though we had loads of fun programming the DelayProbability() function, there is a quicker way to get information about what we ideally expect from a Poisson distribution. The function ppois() gives us the theoretical (cumulative) probability of observing a delay of a certain time or less, given a particular mean. For example:
> ppois(3, lambda=10)
[1] 0.01033605
So you can read this as: there is a 1% chance of observing a delay of 3 or less in a Poisson distribution with a mean of 10. Note that in statistical terminology, "lambda" is the term used for the mean of a Poisson distribution. We've provided the named parameter "lambda=10" in the example above just to make sure that R does not get confused about which parameter we are controlling when we say "10." The ppois() function does have other parameters that we have not used here. Now, using a for loop, we could get a list of several of these theoretical probabilities:
> plot(1,20,xlim=c(0,20),ylim=c(0,1))
> for (i in 1:20) {points(i,ppois(i,lambda=10)) }
We are using a little code trick in the first command line above: creating a nearly empty set of axes with the plot() function, and then filling in the points in the second line using the points() function. This gives the following plot:
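As a side note (our own, not part of the chapter's sequence of commands), ppois() accepts a whole vector as its first argument, so the same plot can be made without a loop; its lower.tail parameter also lets you ask for the complementary upper-tail probability:

# ppois() is vectorized, so no loop is needed
plot(1:20, ppois(1:20, lambda = 10), xlim = c(0, 20), ylim = c(0, 1))

# lower.tail = FALSE asks for P(X > 3) instead of P(X <= 3)
ppois(3, lambda = 10, lower.tail = FALSE)    # about 0.99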
You may notice that the resulting plot looks a lot like the ones earlier in this chapter, as well as somewhat similar to the probability plot in the previous chapter. When we say the "theoretical distribution," we are talking about the ideal Poisson distribution that would be generated by the complex equation that Mr. Poisson invented a couple of centuries ago. Another way to think about it is this: instead of just having a small sample of points, which we know has a lot of randomness in it, what if we had a truly humongous sample with zillions of data points? The curve in the plot above is just about what we would observe for a truly humongous sample (where most of the biases up or down cancel themselves out because of the large number of points).
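You can convince yourself of this with a quick overlay, a sketch of our own that reuses DelayProbability() from earlier: plot the theoretical ppois() curve, then lay the cumulative proportions from a very large rpois() sample on top and watch the two nearly coincide.

# Theoretical curve (black circles) versus a humongous
# empirical sample (red); the red points land almost exactly
# on top of the theoretical ones
plot(1:20, ppois(1:20, lambda = 10), ylim = c(0, 1))
points(DelayProbability(rpois(100000, 10), 1, 20), col = 2)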
So this is the ideal, based on the mathematical theory of the Poisson distribution, or what we would be likely to observe if we created a really large sample. We know that real samples of reasonable amounts of data, like 100 points or 1000 points or even 10,000 points, will not hit the ideal exactly, because some samples will come out a little higher and others a little lower.
We also know, from the histograms and output earlier in the chapter, that we can look at the mean of a sample, or the count of events less than or equal to the mean, or the arrival probabilities in the graph above, and in each case we are looking at different versions of the same information. Check out these five commands:
> mean(rpois(100000,10))
[1] 10.01009
> var(rpois(100000,10))
[1] 10.02214
> sum(rpois(100000,10)<=10)/100000
[1] 0.58638
> ppois(10,lambda=10)
[1] 0.58303
> qpois(0.58303,lambda=10)
[1] 10
In the first command, we confirm that for a very large random sample of n=100,000 with a desired mean of 10, the actual mean of the random sample is almost exactly 10. Likewise, for another large random sample with a desired mean of 10, the variance is 10. In the next command, we use the inequality test and the sum() function again to learn that the probability of observing a value of 10 or less in a very large sample is about 0.59 (note that the sum() function yielded 58,638 and we divided by 100,000 to get the reported value of 0.58638). Likewise, when we ask with ppois() for the theoretical probability of observing 10 or less in a sample with a mean of 10, we get a probability of 0.58303, which is darned close to the empirical result from the previous command. Finally, if we ask qpois() what the threshold value for a probability of 0.58303 is in a Poisson sample with a mean of 10, we get back the answer: 10. You may see that qpois() does the reverse of what ppois() does. For fun, try this formula on the R command line:
qpois(ppois(10, lambda=10), lambda=10)
Here's one last point to cap off this thinking. Even with a sample of 100,000 there is some variation in samples. That's why the 0.58638 from the sum() function above does not exactly match the theoretical 0.58303 from the ppois() function. We can ask R to tell us how much variation there is around one of these probabilities using the poisson.test() function, like this:
> poisson.test(58638, 100000)
95 percent confidence interval:
0.5816434 0.5911456
We've truncated a little of the output in the interests of space: what you have left is the upper and lower bounds of a 95% confidence interval. Here's what a confidence interval is: for 95% of the samples that we could generate using rpois(), using a sample size of 100,000 and a desired mean of 10, we will get a result that lies between 0.5816434 and 0.5911456 (remember that this resulting proportion is calculated as the number of events whose delay time is 10 or less). So we know what would happen for 95% of the rpois() samples, but statisticians also make the assumption that if a natural phenomenon, like the arrival time of tweets, fits the Poisson distribution, then this same confidence interval would be operative. So while we know that we got 0.58638 in the sample above, it is likely that future samples will vary by a little bit (about 1%). Just to get a feel for what happens to the confidence interval with smaller samples, look at these:
> poisson.test(5863, 10000)
95 percent confidence interval:
0.5713874 0.6015033
> poisson.test(586, 1000)
95 percent confidence interval:
0.5395084 0.6354261
> poisson.test(58, 100)
95 percent confidence interval:
0.4404183 0.7497845
Notice the parameters that changed in each of the three commands above: in each case we've reduced the sample size by a factor of 10. By the time we get to the bottom, look how wide the confidence interval gets. With a sample of 100 events, of which 58 had delays of 10 seconds or less, the confidence interval around the proportion of 0.58 ranges from a low of 0.44 to a high of 0.75! That's huge! The confidence interval gets wider and wider as we become less and less confident about the accuracy of our estimate. In the case of a small sample of 100 events, the confidence interval is very wide, showing that we have a lot of uncertainty about our estimate that 58 events out of 100 will have arrival delays of 10 or less. Note that you can filter out the rest of the stuff that poisson.test() generates by asking specifically for the "conf.int" in the output that is returned:
> poisson.test(58, 100)$conf.int
[1] 0.4404183 0.7497845
attr(,"conf.level")
[1] 0.95
The command line above shows how we used the $ notation to get a report of just the bit of output that we wanted from poisson.test(). This output reports the exact same confidence interval that we saw before, along with a reminder in the final two lines that we are looking at a 95% confidence interval.
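If you want to see for yourself what "95 percent" means here, the following simulation (a sketch of our own, not from the chapter) draws many samples, computes each confidence interval, and counts how often the interval captures the theoretical proportion from ppois(). The answer should land near 0.95:

# Coverage check: how often does the 95% interval from
# poisson.test() capture the theoretical proportion?
truth <- ppois(10, lambda = 10)       # about 0.58303
hits <- 0
for (i in 1:1000) {
  count <- sum(rpois(1000, 10) <= 10) # events with delay <= 10
  ci <- poisson.test(count, 1000)$conf.int
  if (ci[1] <= truth && truth <= ci[2]) {
    hits <- hits + 1
  }
}
hits / 1000                           # roughly 0.95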
At this point we have all of the knowledge and tools we need to compare two sets of arrival rates. Let's grab a couple of sets of tweets and extract the information we need. First, we will use the function we created in the last chapter to grab the first set of tweets:
tweetDF <- TweetFrame("#ladygaga",500)
Next, we need to sort the tweets by arrival time. That is, of course, unless you accepted the Chapter Challenge in the previous chapter and built the sorting into your TweetFrame() function.
sortweetDF <- tweetDF[order(as.integer(tweetDF$created)), ]
Now, we'll extract a vector of the time differences. In the previous chapter the use of the diff() function occurred within the ArrivalProbability() function that we developed. Here we will use it directly and save the result in a vector:
eventDelays <- as.integer(diff(sortweetDF$created))
Now we can calculate a few of the things we need in order to get a
picture of the arrival delays for Lady Gaga’s tweets:
> mean(eventDelays)
[1] 30.53707
> sum(eventDelays<=31)
[1] 333
So, for Lady Gaga tweets, the mean arrival delay for the next tweet is just short of 31 seconds. Another way of looking at that same statistic is that 333 out of 500 tweets (0.666, about two thirds) arrived within 31 seconds of the previous tweet. We can also ask poisson.test() to show us the confidence interval around that value:
> poisson.test(333,500)$conf.int
[1] 0.5963808 0.7415144
attr(,"conf.level")
[1] 0.95
So, this result suggests that for 95% of the Lady Gaga samples of tweets that we might pull from the Twitter system, the proportion arriving in 31 seconds or less would fall in this confidence band. In other words, we're not very likely to see a sample with a proportion under 59.6% or over 74.1%. That's a pretty wide band, so we do not have a lot of exactitude here.
Now let’s get the same data for Oprah:
> tweetDF <- TweetFrame("#oprah",500)
> sortweetDF <- tweetDF[order(as.integer(tweetDF$created)), ]
> eventDelays <- as.integer(diff(sortweetDF$created))
> mean(eventDelays)
[1] 423.01
Hmm, I guess we know who is boss here! Now let’s finish the job:
> sum(eventDelays<=31)
[1] 73
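To close the loop on the comparison (a sketch, assuming your Oprah sample resembles this one), ask poisson.test() for the confidence interval around 73 out of 500 and check whether it overlaps the Lady Gaga band:

# Confidence interval for the Oprah proportion: 73 of 500
# tweets arrived within 31 seconds of the previous one
poisson.test(73, 500)$conf.int
# This band should sit somewhere near 0.11 to 0.19, far below
# the Lady Gaga band of roughly 0.596 to 0.742, so the two
# intervals do not overlap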