Twitter Data Conundrum

Twitter data was specifically cited as problematic by several presenters at the Sentiment Analysis Symposium.

It probably goes without saying that complete and accurate data makes for better analysis. While good data is relatively easy for a brand to obtain from its owned media, assuring the integrity of data from earned, yet-to-be-earned and just plain “in-the-wild” text documents is considerably harder.

After the symposium, I asked some questions of those who are “in the know” about Twitter data to get some clarity on the issue.

What’s the problem with Twitter? Or is there a problem?

Along with the global considerations of sentiment analysis discussed in the last post in this series, tweets are written in a “language” that is not English, with a structure that may confound discovery and analysis techniques. One comment at the symposium was that the Twitter firehose (a source that, in theory, provides a real-time stream of 100% of all tweets) yields only about 85% of tweets. Participants who are actively analyzing firehose data largely agreed with that view.

I sent a query to Gnip, a company that has a partnership with Twitter to provide data streams to the industry. Gnip does not offer the full firehose to customers, instead filtering the stream based on clients’ requirements. According to the company, Gnip customers have access to 100% coverage of all public Tweets through its Power Track product, which provides full-fidelity firehose filtering with no rate limiting or volume restrictions.

Rob Johnson, Director of Business and Strategy at Gnip, was generous in his response explaining Twitter data access.

For social media monitoring and social media analytics, there are Twitter APIs relevant to the industry — Twitter’s Search API and Twitter’s Streaming API. Both of these APIs start with coverage of public statuses only and do not include protected tweets. Additionally, before tweets are made available to either of these APIs, Twitter applies a quality filter to weed out spam.

The Search API provides coverage of 100% of Tweets, with one big qualification: IF you can poll it fast enough to keep up with the stream of interest. Because of rate limiting and Twitter’s ever-increasing volumes, it’s becoming impossible to do this for more and more topics (e.g., Justin Bieber or any trending topic), and this issue grows proportionally to the number of things you’re trying to track. The reality is that relying on the Search API will deliver very poor coverage of Twitter, and if you’re working with a vendor who claims full Twitter coverage and is using the Search API exclusively, you’re being misled.

The Streaming API provides two types of data streams in realtime to developers and companies. The first is sampled streams. This is where a statistically valid percentage of the stream is delivered to a customer. The default, free access to sampled streams is called the “spritzer” and it’s currently 1% of the full 100% firehose.

Additionally, the Streaming API provides keyword-based streams that deliver all the tweets that match a set of keywords that developers or companies upload. However, there’s a key restriction on this as well: Twitter “volume rate limits” these streams so that your stream can never be above a certain percentage of the full firehose. This percentage is not publicly shared by Twitter.

Then we come to what is known as the full firehose. Twitter does not make the full 100% firehose publicly available. This is the product that Google, Microsoft, Yahoo, etc. have access to. Many companies who claim to have the full firehose actually do not, and instead have access to the Streaming API as described above.
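To make the coverage arithmetic in Rob’s explanation concrete, here is a minimal sketch of how a rate limit or a volume cap turns into missed tweets. Every number in it is hypothetical: the request quota, the results per request, the firehose volume and the 1% cap are all assumptions chosen for illustration (Twitter does not publish the cap), and the function names are mine rather than anything in Twitter’s or Gnip’s APIs.

```python
def polling_coverage(matching_per_hour, requests_per_hour, results_per_request):
    """Fraction of matching tweets a rate-limited Search API poller can retrieve.

    Assumes every request returns a full page of new, non-overlapping results,
    which is the best case for the poller.
    """
    if matching_per_hour <= 0:
        return 1.0
    capacity = requests_per_hour * results_per_request
    return min(1.0, capacity / matching_per_hour)


def capped_stream_coverage(matching_per_hour, firehose_per_hour, cap_fraction):
    """Fraction of matching tweets a keyword stream delivers under a volume cap.

    cap_fraction is the (unpublished) share of the full firehose that a single
    keyword stream may not exceed; the value used below is a guess.
    """
    if matching_per_hour <= 0:
        return 1.0
    ceiling = firehose_per_hour * cap_fraction
    return min(1.0, ceiling / matching_per_hour)


# Hypothetical quota: 150 requests per hour, 100 tweets per request.
quiet_topic = polling_coverage(5_000, 150, 100)
trending_topic = polling_coverage(60_000, 150, 100)

# Hypothetical firehose of 10 million tweets per hour with an assumed 1% cap.
capped_topic = capped_stream_coverage(200_000, 10_000_000, 0.01)

print(f"quiet topic via polling:    {quiet_topic:.0%}")
print(f"trending topic via polling: {trending_topic:.0%}")
print(f"busy topic via capped feed: {capped_topic:.0%}")
```

Under these made-up figures the quiet topic is fully covered, while the trending topic and the capped keyword stream each lose a large share of matching tweets. Someone testing either access path could easily conclude that “the firehose” is incomplete.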

I asked Rob to speculate why so many at the symposium felt access to the tweet stream via the firehose was less than complete:

My guess without speaking to them directly and asking some clarifying questions is that their statements were based on a misunderstanding of what they were evaluating/testing. It is factually true that 100% of public Tweets are sent through the firehose (where “firehose” is precisely this, by definition). Most likely these analysts are not consuming the full firehose and were instead consuming data from one of the APIs.

Datasift also provides tweet data sourced from the firehose through partnership with Twitter. Datasift CEO Nick Halstead confirmed that his company is getting 100% of the tweets.

Assuming all tweets are available, the lack of confidence expressed by many sentiment analysis experts remains reason for concern. In sentiment analysis, the tweet stream may be processed by two or three companies before presentation as a user tool. If it turns out that not all of the data is present, there are multiple points where it may get lost or compromised.
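A quick illustration of why a multi-vendor pipeline matters: small losses at each hand-off multiply. The sketch below is not a measurement of any real vendor. The three stages and their 95% retention rates are invented for the example, and the helper name is hypothetical.

```python
from math import prod


def end_to_end_retention(stage_retention):
    """Overall fraction of tweets surviving a chain of processing stages.

    Each entry is the fraction of incoming tweets that one stage passes on;
    losses are assumed to be independent, so the fractions multiply.
    """
    return prod(stage_retention)


# Hypothetical pipeline: three companies, each passing along 95% of what it receives.
overall = end_to_end_retention([0.95, 0.95, 0.95])
print(f"{overall:.1%}")  # prints 85.7%
```

Under that assumption, a tool at the end of the chain would see roughly 86% of the original tweets, even though the source delivered every one of them.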

After reviewing the responses from Gnip and Datasift, Lexalytics’ Jeff Catlin commented, “The stream is definitely weak, but that’s really just the nature of the beast. It’s a separate the wheat from the chaff sort of problem.” Alta Plana’s Seth Grimes suggests that, while all tweets may be present, a lack of tweet-associated metadata may interfere with search queries, yielding results that create the perception of missing tweets.

How this concern relates to other decisions brands should make when choosing a sentiment analysis solution will be discussed in the next part in this series.