Sunday, February 28, 2010

Twitter Forecasting: How likely is it that I'll tweet tomorrow?

For a while now I've been swirling around the notion of forecasting (or simulating) the activities of members in social media communities, especially Twitter. (Think of the weather. Now think of hurricane predictions. Now replace the hurricane with a twitter member and replace land with different twitter activities. Now you're thinking like I'm thinking. I'm on a horse.)

In my quick glances across the web, I have seen a bit about social forecasting in general (and some really really smart people talking about it), and I've seen lots of forecasts in regard to the industry of social media, but not really anything addressing the forecasting of specific activities within a specific community.

Trending is something that the industry is starting to get a hold of, but what I'm talking about goes beyond trending. Trending tells us what has happened. I'm interested in what will (or at least is likely to) happen. In my mind forecasting presents some exciting possibilities in being able to help everyone involved in social media communities from the community managers to the system admins.

For community managers, the benefit lies in being able to work off of indicators that signal the rise of a community star or the trailing off of one. Whole simulations could be run based on activity in a member's first 24 hours.

For systems admins it could help in identifying spikes before they happen or identifying the best days (or hours) to perform maintenance and updates.

Maybe I'm overstating the possibilities, but I think it's big and it's exciting stuff to me. Let's get started.

How likely is it that I'll tweet tomorrow?

It's a simple question but one that has lots of implications. It can help a community manager know whether I'm engaged or slipping away. It can help a systems admin know if I'll be requiring any computing resources.

If we were to start by asking, "What's the probability I will tweet on any given day?" we'd find out that over the last 6 months I've tweeted on 57% of the days, just a little over half. But our question isn't about "any given day"; it's about tomorrow.
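Here's a minimal sketch of that baseline calculation, in Python rather than the Ruby used for the original number crunching. The `base_rate` name and the `sample_counts` data are made up for illustration; they are not my real tweet history.

```python
def base_rate(daily_counts):
    """Fraction of days with at least one tweet."""
    if not daily_counts:
        raise ValueError("need at least one day of data")
    days_with_tweets = sum(1 for count in daily_counts if count > 0)
    return days_with_tweets / len(daily_counts)


# Hypothetical tweet counts for ten consecutive days (illustrative only).
sample_counts = [0, 3, 1, 0, 12, 2, 0, 4, 0, 1]
sample_rate = base_rate(sample_counts)
print(f"Tweeted on {sample_rate:.0%} of days")
```

Note that this number ignores the order of the days entirely, which is exactly the limitation the rest of this post tries to get around.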

We know "tomorrow" is one day in a series of my twitter engagement. So what if we knew whether or not I had already tweeted today? Would that make a difference in our probability? Am I more likely to tweet tomorrow if I've tweeted today? Less likely?

Math modeling (no...not that kind of modeling)

What if we could develop a simple model that helps us predict my future tweets based on my current state? Turns out, we can...or at least we can try. In this case the key is to focus on the probability of transitioning from one state to the next: Given that I tweeted today, what's the likelihood that I tweet tomorrow? What about for the next 3 days?

So that's what I did. I chose a Markov model (which I've looked at before, but never used...so please...if you see this and are some kind of Markov model expert, please let me know how I did) and started working away at trying to come up with one that was more than just throwing numbers up in the air.

Now...my model is currently based on a very small sample size (just my twitter activity), and therefore has a very high margin of error, but it's a start. (There are obvious problems with the "super" predictions due to my small sample.)

Here's what I have:


          dor  act  sup
dormant [ 0.46 0.54 0.00 ]
active  [ 0.40 0.59 0.01 ]
super   [ 0.00 1.00 0.00 ]


I've created three states called "dormant", "active", and "super". A dormant user will not tweet in a given day. An active user will tweet 1 to 9 times. A super user will tweet 10 or more times in a day. The matrix represents the probability of moving from one state to the next: each row is today's state, each column is tomorrow's state, and each row adds up to 1.
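For the curious, here's a rough sketch of how a matrix like this could be estimated from a list of daily tweet counts: label each day with a state, count how often each state is followed by each other state, and divide by the row total. This is Python, not the Ruby I actually used, and the function names and `sample_counts` data are invented for illustration.

```python
from collections import Counter

STATES = ("dormant", "active", "super")


def classify(count):
    """Map a day's tweet count to a state using the thresholds above."""
    if count == 0:
        return "dormant"
    if count < 10:
        return "active"
    return "super"


def transition_matrix(daily_counts):
    """Estimate P(tomorrow's state | today's state) from consecutive days."""
    states = [classify(c) for c in daily_counts]
    # Count each (today, tomorrow) pair of adjacent days.
    pair_counts = Counter(zip(states, states[1:]))
    matrix = {}
    for today in STATES:
        total = sum(pair_counts[(today, tomorrow)] for tomorrow in STATES)
        # A state never seen as "today" has no estimate; its row stays all zeros.
        matrix[today] = {
            tomorrow: (pair_counts[(today, tomorrow)] / total if total else 0.0)
            for tomorrow in STATES
        }
    return matrix


# Hypothetical daily tweet counts (illustrative only, not real data).
sample_counts = [0, 3, 1, 0, 12, 2, 0, 4, 0, 1]
estimated = transition_matrix(sample_counts)
for today in STATES:
    print(today, [round(estimated[today][t], 2) for t in STATES])
```

With only one "super" day in the sample data, that row is decided by a single observation, which is the same small-sample problem my real "super" row has.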

For example, if I tweet today, there is actually only a 40% chance that I won't tweet tomorrow. (In contrast, there is a 60% chance I tweet at least once.) If I'm dormant today, there's actually only a 54% chance that I tweet tomorrow.

Let's look at this again. Our simple average says there's a 57% chance I tweet tomorrow, but our model says there could be as high as a 60% chance and as low as a 54% chance. While those all sound close together, when you are talking thousands or millions of users, those small percentage points can mean a lot.
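To look more than one day ahead (the "next 3 days" question from earlier), one option is to keep applying the matrix to the current distribution over states, once per day. The sketch below does that in plain Python. The matrix values are the ones shown above; the function names are hypothetical, and the whole thing leans on the Markov assumption that tomorrow depends only on today's state.

```python
STATES = ("dormant", "active", "super")

# Transition matrix from the post: rows are today's state, columns tomorrow's.
P = [
    [0.46, 0.54, 0.00],  # dormant
    [0.40, 0.59, 0.01],  # active
    [0.00, 1.00, 0.00],  # super
]


def step(distribution, matrix):
    """Advance a probability distribution over states by one day."""
    size = len(matrix)
    return [
        sum(distribution[i] * matrix[i][j] for i in range(size))
        for j in range(size)
    ]


def forecast(today, days, matrix=P):
    """Distribution over states `days` days from now, given today's state."""
    distribution = [1.0 if s == today else 0.0 for s in STATES]
    for _ in range(days):
        distribution = step(distribution, matrix)
    return distribution


def chance_of_tweeting(today, days):
    """Probability of being in any non-dormant state `days` days out."""
    return 1.0 - forecast(today, days)[STATES.index("dormant")]


for today in STATES:
    for days in (1, 3):
        print(f"{today:8s} +{days}d: {chance_of_tweeting(today, days):.3f}")
```

One day out this reproduces the 54% and 60% figures. Three days out, the dormant and active starting points both land near 58%, close to the simple 57% average, which suggests that with this matrix today's state mostly matters for the very near term.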

More to come...

To me this is only the beginning. Here's what I'm looking at for next steps:

1) A model based off of all the people I follow on twitter. It gives me a larger sample size (that's not ginormous) and could be really cool to take a gander at.

2) A functional programming model. I did all of my initial fetching and computation using Ruby, which is my hammer in this programming world. However, one of the sparks that got me to actually start on this project was the opportunity to have something concrete to apply a functional language to. I believe the crunching involved in making predictions over large datasets would benefit greatly from a functional design, and given their current hotness I'd love to let one loose on this thing.

3) A model based off of a large twitter sampling. Lots of users. Lots of crunching.

4) A web service? I'm still working on the best way to practically expose this data and work with it. I've got lots of ideas but none that are shining right now...

5) Other models. I'd love to try some other models and test their effectiveness...

That'll do it for now. I'd love to hear your feedback on this.

----

Also, if you're interested, I've posted the code I wrote this weekend up in a gist. It was two files, but I've combined them for sharing. I did it in a style I'm going to call narrative programming, where the main concern is not reuse or architecture, but instead telling a story and following a train of thought from beginning to end.