Lecture 3 – API

This lecture serves three purposes:

  • Turn you into a data junkie! Sorry: make you aware of the endless possibilities of data on the web
  • Show you that APIs are actually easy to use
  • Let you use the Twitter API to obtain data and work with it

    Reddit API:


    The first part is about Reddit for which we need a couple of links:
    MovieKills
    There is a blog post explaining it here: http://www.theswarmlab.com/r-vs-python-round-2/
    And a video showing quick and dirty web scraping in Python here: https://www.youtube.com/watch?v=qfGthiqwaZo
    Read more about Reddit here: http://www.reddit.com/about

    We want to use Reddit’s API to scrape information about posts on Reddit. We will be using the Python Reddit API Wrapper (PRAW) to do that.
    To install PRAW:

    easy_install praw

    Mac and Linux users may need to put “sudo” in front of the above command for install permissions, or add the --user flag instead.
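    If easy_install is not available on your machine, pip (assuming you have it installed) works just as well:

    pip install praw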

    Detailed PRAW documentation here: https://praw.readthedocs.org/en/latest/

    A bunch of example scripts that you can learn from: https://praw.readthedocs.org/en/latest/pages/useful_scripts.html

    In-class code practice

    Before using PRAW, you must identify your client to Reddit with a descriptive user agent:

    import praw
    reddit = praw.Reddit(user_agent="MSU data analysis class bot")

    This lets Reddit know that you plan to use their API to scrape some data. This check-in is important: if you are dishonest about who you are or abuse the API, Reddit admins may ban you!
    If you plan to use Reddit for your project, be sure to read the API usage terms: https://github.com/reddit/reddit/wiki/API#rules
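    Among other things, the rules ask for a unique and descriptive user agent; something along these lines (the username is a placeholder) is in their spirit:

    reddit = praw.Reddit(user_agent="MSU data analysis class bot by /u/your_username")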

    To get posts on the front page of reddit.com:

    for post in reddit.get_front_page():
        print post

    Each “post” object has attributes related to the post, like the time it was posted (created_utc), where it was posted (subreddit), and how many upvotes it has (ups). Here’s how to print the linked URL for each of the posts on the front page:

    for post in reddit.get_front_page():
        print post.url
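
    And, to see some of the other attributes mentioned above in one go (a quick sketch using the same loop):

    for post in reddit.get_front_page():
        print post.created_utc, post.subreddit, post.ups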

    To get the 10 newest posts in a particular subreddit (note the “limit” parameter):

    for post in reddit.get_subreddit("dataisbeautiful").get_new(limit=10):
        print post

    To get the 25 newest posts and comments by a particular user:

    for entry in reddit.get_redditor("Deimorz").get_overview(limit=25):
        print entry

    Challenge: Go through a redditor’s profile with PRAW and print out the last 50 subreddits they have posted to.

    Twitter API:


    You need to obtain Twitter credentials; go to: https://dev.twitter.com/apps
    You also need to install the twitter Python package, using either of:
    easy_install twitter
    pip install twitter
    For Windows users, go to: https://code.google.com/p/python-twitter/
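
    The authentication code below reads your four credentials from plain-text files next to the notebook; here is a minimal sketch of creating them (paste in your own values from dev.twitter.com):

    #write each credential into its own file (one line, no trailing newline)
    credentials = {
        "consumerKey.txt": "YOUR_CONSUMER_KEY",
        "consumerSecret.txt": "YOUR_CONSUMER_SECRET",
        "authToken.txt": "YOUR_OAUTH_TOKEN",
        "authTokenSecret.txt": "YOUR_OAUTH_TOKEN_SECRET",
    }
    for filename, value in credentials.items():
        with open(filename, "w") as f:
            f.write(value)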

    The full IPython notebook can be found here: cse891 twitter API.ipynb

    The first piece of code deals with authenticating yourself:

    import twitter
     
    def readStrFromFile(filename):
        #read a credential from a file, stripping any trailing newline/whitespace
        with open(filename,'r') as f:
            return f.read().strip()
     
    #load all keys, secrets, tokens, ...
    CONSUMER_KEY=readStrFromFile("consumerKey.txt")
    CONSUMER_SECRET=readStrFromFile("consumerSecret.txt")
    OAUTH_TOKEN =readStrFromFile("authToken.txt")
    OAUTH_TOKEN_SECRET =readStrFromFile("authTokenSecret.txt")
     
    #authenticate yourself with OAuth
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
     
    #get a Twitter API instance called twitter_api
    twitter_api = twitter.Twitter(domain='api.twitter.com', 
                                  api_version='1.1',
                                  auth=auth
                                 )
     
    print twitter_api

    Then we want to search for tweets:

    import json
    search_res=twitter_api.search.tweets(q="MSU",count=100)
    #print search_res
    #L=search_res["statuses"]
    #for l in L:
    #    print l['text']
    print json.dumps(search_res,indent=1)
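
    Before dumping the whole JSON it can help to check how many tweets came back and which fields each one carries (a small sketch reusing search_res, assuming at least one tweet was returned):

    print len(search_res["statuses"])       #number of tweets returned
    print search_res["statuses"][0].keys()  #fields available on each tweet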

    Next we obtain trending data. For that we need WOE IDs (Where On Earth IDs), which you can look up here: http://woeid.rosselliot.co.nz/

    WORLD_WOE_ID=1
    US_WOE_ID=23424977
    MICHIGAN_WOE_ID=2347581
    print json.dumps(twitter_api.trends.place(_id=WORLD_WOE_ID),indent=1)
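
    To pull just the trend names out of that response (a sketch assuming the usual trends/place structure: a one-element list whose "trends" entry is a list of dicts with a "name" key):

    world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
    for trend in world_trends[0]['trends']:
        print trend['name']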

    I will elaborate more on JSON and dictionaries.
    First we concatenate the text of all tweets in the search result into one large string and split it into a list of words:

    L=search_res["statuses"]
    S=""
    for l in L:
        #print l['text']
        S=S+" "+l['text']
    words=S.split()   #no argument: split on any whitespace and drop empty strings

    Next we use a dictionary to count how often each word appears:

    count=dict()
    for word in words:
        if word in count:   #'in' is the idiomatic membership test (dict.has_key is deprecated)
            count[word]=count[word]+1
        else:
            count[word]=1

    And find the word with the highest count:

    maxCount=0
    maxWord=""
    for key in count.keys():
        if(count[key]>maxCount):
            maxCount=count[key]
            maxWord=key
    print maxWord+" "+str(maxCount)
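
    For reference, the standard library's collections.Counter does the counting and the max-finding in a couple of lines; a sketch over the same words list:

    from collections import Counter
    counter = Counter(words)                      #word -> number of occurrences
    maxWord, maxCount = counter.most_common(1)[0] #the single most frequent word
    print maxWord+" "+str(maxCount)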

    The exercise can be found here: Exercise3.

2 comments on “Lecture 3 – API”

  1. Sriram Kovai V January 21, 2014 9:06 pm

    Arend, can we have the slides from Lecture 3?

    • ArendHintze January 21, 2014 9:41 pm

      Sure, I uploaded them to the downloads section. I am also working on getting all the links online as well as the IPython notebooks. Cheers, Arend
