Gathering Tweets in Python

By Anand Bisen

The amount of data freely available for playing around has increased by leaps and bounds. Among the various data sources, I feel Twitter is one of the most interesting and richest. First, they have made the system very approachable (documentation & tools), and second, the data is interesting in the sense that anybody can find a subset of it that interests them.

Occasionally I gather tweets, either for playing around or for creating visualizations that identify trends. One use-case for me is to gather tweets during a conference I am interested in and see what the most popular trends/technologies were during the days of the conference. I do this several times every year, especially for Hadoop Summit, Strata and PyData.

Below is a small snippet that I use to collect the tweets. It captures tweets in raw form across multiple bite-sized compressed files, which can be processed incrementally while the tweets are still being captured. Download the code for twcapture.py.

I: Capture Tweets

$ ./twcapture.py -h
usage: twcapture.py [-h] [--data-dir DATA_DIR] [--chunk-size CHUNK_SIZE]
                    api_key api_secret access_token access_secret watch

positional arguments:
  api_key               Twitter API Key
  api_secret            Twitter API Secret
  access_token          Twitter Access Token
  access_secret         Twitter Access Secret
  watch                 Comma delimited keywords to watch

optional arguments:
  -h, --help            show this help message and exit
  --data-dir DATA_DIR   Directory location to capture tweets
  --chunk-size CHUNK_SIZE
                        Number of chunks per data file (Default: 1000)
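
The full twcapture.py is linked above; as a rough sketch of how such a script can be put together with tweepy's streaming API (the class name, file-naming scheme and chunk rotation below are illustrative simplifications, not the exact code from twcapture.py):

import gzip
import os
import time

import tweepy


class GzipChunkListener(tweepy.StreamListener):
    """Write each raw tweet (one JSON string per line) to gzipped
    files, rotating to a new file every chunk_size tweets."""

    def __init__(self, data_dir=".", chunk_size=1000):
        super(GzipChunkListener, self).__init__()
        self.data_dir = data_dir
        self.chunk_size = chunk_size
        self.count = 0
        self.fp = None

    def _rotate(self):
        # Close the current chunk and start a new timestamped file
        if self.fp:
            self.fp.close()
        fname = "tweets-%d.json.gz" % int(time.time())
        self.fp = gzip.open(os.path.join(self.data_dir, fname), "w")

    def on_data(self, data):
        # tweepy hands us the raw JSON string for every tweet
        if not data.strip():
            return True        # skip keep-alive newlines
        if self.count % self.chunk_size == 0:
            self._rotate()
        self.fp.write(data.strip() + "\n")
        self.count += 1
        return True

    def on_error(self, status):
        print "Stream error:", status
        return True            # keep the stream alive


# Credentials map to the positional arguments shown above
api_key, api_secret = "API_KEY", "API_SECRET"
access_token, access_secret = "ACCESS_TOKEN", "ACCESS_SECRET"

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)

stream = tweepy.Stream(auth, GzipChunkListener(chunk_size=1000))
stream.filter(track=["pydata"])   # the comma-delimited "watch" keywords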

II: Data prep: Read the tweets

Once tweets have been captured in these gzipped files, it's very easy to perform some quick analysis in Python and identify trends. The snippet below reads the JSON strings from all of the gzipped files into a single list of JSON objects. In a large-scale analysis with a huge data set this would not be the most efficient method due to memory constraints, but here the data set is fairly small, so simplicity trumps complexity :)

The statistics performed in this post are the most basic ones; there is some really interesting analysis that could be done on this data, from sentiment analysis to quantifying the influence of individual users. If you would like to perform more analysis, you may download the raw data here.

import json, gzip, os.path

import pygal
import nltk
import pandas as pd
from collections import Counter


# Quick and dirty list of files containing tweets
base_path = "../../code/tweepy-pydata/"
lst = [i for i in os.listdir(base_path) if i.endswith(".gz")]

# Get all tweets converted to json and in a single list
# Not the most efficient method for large number of tweets but okay for this data

pytweets = []
for filename in lst:
    with gzip.open(os.path.join(base_path, filename), "r") as f:
        for line in f:
            jsn = json.loads(line)

            # Keep only English tweets that contain the word "pydata"
            if jsn['lang'] == 'en' and "PYDATA" in jsn['text'].upper():
                pytweets.append(jsn)
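
If the capture ever outgrew memory, the same loop could be restructured as a generator that yields one tweet at a time instead of building a list. A minimal sketch of that alternative (iter_tweets is a name made up here for illustration):

def iter_tweets(base_path):
    # Yield matching tweets one at a time from every .gz file
    for filename in os.listdir(base_path):
        if not filename.endswith(".gz"):
            continue
        with gzip.open(os.path.join(base_path, filename), "r") as f:
            for line in f:
                jsn = json.loads(line)
                if jsn.get('lang') == 'en' and "PYDATA" in jsn['text'].upper():
                    yield jsn

# Usage: e.g. Counter(t['user']['screen_name'] for t in iter_tweets(base_path))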

III: Plot the most prolific tweeters

Now that we have all the tweets in the list pytweets, we can perform some quick analysis and generate some pretty charts using pygal. In the snippet below we find the most prolific users.

Every time I look around I find a new graphing/charting module for Python. For the snippets below I have used a neat module called pygal. It produces clean-looking SVG graphs and has a very simple interface.

# Use Python collection for counting frequency
user_count = Counter()
for i in pytweets:
    user_count[ i['user']['screen_name'] ] += 1


# Prepare the SVG Plot

barplot = pygal.HorizontalBar( style=pygal.style.SolidColorStyle )

topnum = 10
for name, cnt in user_count.most_common(topnum):
    barplot.add(name, [{'value': cnt, 'label': name}])

barplot.config.title = "Top " + str(topnum) + " Most Prolific Tweeters"
barplot.config.legend_at_bottom = True

barplot.render_to_file("Top_Tweeters.svg")
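
(render_to_file writes the SVG to disk; calling barplot.render() instead returns the raw SVG string, which is handy for embedding in an IPython notebook.)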

[Chart: Top 10 Most Prolific Tweeters]

IV: Get the list of the most popular tweets

# Count occurrences of identical tweet text (retweets repeat the text verbatim)
count = Counter([i['text'] for i in pytweets])

# Keep only tweets that appear more than 10 times
frdf = []
for i, j in count.iteritems():
    if j > 10:
        frdf.append([j, i])

df = pd.DataFrame(frdf, index=None, columns=["Count", "Tweet"])
df.sort(columns="Count", inplace=True, ascending=False)

# Print the count and text of each popular tweet
for i, j, k in df.itertuples():
    print j, "\t", k
46  RT @raymondh: Heard at #pydata:  When someone says "big data" what they usually mean is "insufficient memory" :-)
45  RT @oceankidbilly: Here is the demo IPy notebook from my #pydata talk: http://t.co/0ObJ0OOXRS mpld3, vincent, ggplot, bokeh, seaborn, IPyth…
41  RT @jiffyclub: I talked to like 5 people yesterday at #pydata who are thinking about leaving academia. Get your shit together, academia.
31  RT @jsundram: "Math is to physics what python is to data science" -- @pwang #pydata
28  RT @jsundram: .@wesmckinn: forget about big data, "medium data is where it's at" #pydata
27  RT @SciPyTip: RT @astrobiased: Quote of the day at #PyData "I'm a recovering R user" - @DataJunkie
23  RT @ellisonbg: Wow @NASAJPL run ipython nbviewer internally to share notebooks @PyDataConf @rwitoff http://t.co/aBEI78Y7LJ
16  RT @jsteeleeditor: "The way a violin is different from a kazoo, Python is different from Excel." -@pwang #PyDataSV
16  RT @amontalenti: #PyData is Python's most important conference, in my opinion, because data analysis is Python's killer app. Follow them at…
15  RT @neilkod: "Being able to publish your paper as a reproduce-able AMI is where we're headed." --@profjsb at #pydata
14  RT @teoliphant: Great talk on PyToolz at PyData.  You should check out this library: http://t.co/x4cHp2yPbd
14  RT @rgrrl: #pydata #wit panel with the superheroes of data science (all women)!! http://t.co/nCrNGu9Me3
13  RT @sarah_guido: Visualizing algorithms in the browser by @laffra is awesome. See here: http://t.co/BQ0eTmElRJ @PyDataConf #pydata #pydatasv
13  RT @mrocklin: Slides from my talk: Functional Performance with Core Data Structures  http://t.co/upYwoy8SJq  #PyData #PyDataConf #PyDataSV
12  RT @astrobiased: Meet Sticky, the awsm D3 + IPython widgets package that @oceankidbilly talked about yesterday at #pydata: https://t.co/w7t…
12  RT @sarah_guido: From @pwang's keynote: data comprehension is key. @PyDataConf #pydata http://t.co/R2P0F5fmrV
11  RT @genetics_blog: Learn #DataScience - open course in an #IPython notebook http://t.co/YyrqzUHb6A #pydata
11  RT @aminusfu: Slides for the talk I just gave at #pydata on optimization: https://t.co/O5ns93H5wX
11  RT @rgrrl: #pydata daniel moisset querying your database in natural language in "son of ping an ping"!  Now!!! http://t.co/p4AEqwXhL8
11  RT @oceankidbilly: ...and here's all of the data + notebook styling to run my #pydata talk in a live ipynb: https://t.co/qvDuIyJN8J
11  RT @pwang: So much win. #PyData cookies @PyDataConf, thanks @facebook! http://t.co/XGpcHMnNHZ
11  RT @jiffyclub: Idea: build nap times into conferences. #pydata
11  RT @OReillyUG: Author @wesmckinn talking about DataPad: Python-powered Business Intelligence at #pydata http://t.co/ri60ZY5Yr0
11  RT @teoliphant: Periodic table and more with Bokeh. #pydata #bokeh. Super proud of this team @BokehPlots @bryvdv @pwang http://t.co/xmVq191…

V: What platform are users tweeting from?

# Get the source field from each tweet
source = []
for i in pytweets:
    source.append(i['source'])

# Reduce the source and count
src = Counter(source)

# Convert the "Counter" container to a Pandas dataframe for easy manipulation
# (nltk.clean_html strips the HTML anchor tag wrapped around each source name)
frame = []
for i, j in src.iteritems():
    frame.append([j, nltk.clean_html(i)])
sourcedf = pd.DataFrame(frame, columns=["COUNT", "SOURCE"])


# A lookup table to normalize the data in the containers we want
#   - all iOS Platforms (iPad, iPhone et. al. goes into iOS etc.)
sourcelookup = { "web": "Web",                              "Twitter for iPhone": "iOS",
                "Twitter for Android": "Android",           "TweetDeck": "TweetDeck",
                "Tweetbot for iOS": "iOS",                  "Twitter for iPad": "iOS",
                "Twitter for Mac": "Mac",                   "Tweetbot for Mac": "Mac",
                "Twitter for Android Tablets": "Android",   "Twitterrific": "iOS",
                "iOS": "iOS",                               u"Plume\xa0for\xa0Android": "Android",
                "YoruFukurou": "Mac",                       "TweetCaster for Android": "Android",
                "Guidebook on iOS": "iOS",                  "UberSocial for iPhone": "iOS",
                "Twitterrific for Mac": "Mac"
                }


# A helper function for looking up the table defined above;
# any source not in the table is bucketed as "Other"
def translate(txt):
    return sourcelookup.get(txt, "Other")

# Create a new column with the normalized source
sourcedf['NSOURCE'] = sourcedf.SOURCE.apply(translate)

# Groupby the normalized field "NSOURCE"
grouped = sourcedf.groupby(by=["NSOURCE"])

# Create the chart (PieChart)
chart = pygal.Pie( style=pygal.style.SolidColorStyle )

# Each normalized source becomes one slice of the pie
for name, group in grouped:
    chart.add(name, group.COUNT.tolist())

chart.config.title = "Twitter Source for PyData-SV Users"
chart.render_to_file('pie_chart_twitter_pydatasv.svg')

[Chart: Twitter Source for PyData-SV Users]

twcapture.py: Code for capturing tweets in small gzipped files
