Month: January 2014

Lecture 6

Today's lecture is about data preprocessing. We will discuss different forms of data and how they can be turned into numerical values, measurement errors, noise and distributions, duplicate or missing data, privacy issues, aggregation, sampling, and discretization.

You can find the iPython notebook here: cse891 data preprocessing.ipynb
In addition there is a fictional data file about bank customers: bank-data
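
Here is a minimal sketch of the kind of preprocessing we will do, using pandas. The file name and the column names (age, income, married) are only my assumptions about the bank data; the notebook has the real version:

import pandas as pd
 
#load the bank customer file (assumed name and columns)
df = pd.read_csv("bank-data.csv")
 
#turn a categorical yes/no column into numerical values
df["married_num"] = (df["married"] == "YES").astype(int)
 
#fill missing incomes with the median (one simple strategy among many)
df["income"] = df["income"].fillna(df["income"].median())
 
#discretize age into three equally wide bins
df["age_group"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "old"])
 
print df.head()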

The exercise can be found here: Exercise 6
In this exercise there is a bonus question for those interested; the solution for it can be found here: cse891 Ex6 solution.ipynb

Cheers Arend

Lecture 5

Today we looked into SQL queries and how to get data out of tables. Here is a detailed list of the commands we used; a sketch of running such queries from Python follows the list:

SELECT * FROM employees
SELECT * FROM employees WHERE name="Amy Wong"
SELECT * FROM employees WHERE name="Amy Wong" OR salary
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0 ORDER BY salary
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0 ORDER BY salary DESC
SELECT * FROM employees WHERE name="Amy Wong" UNION SELECT * FROM employees WHERE salary=1.0
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0
 
SELECT * FROM employees LIMIT 4
SELECT * FROM employees ORDER BY salary LIMIT 4
 
SELECT SUM(salary) FROM employees
SELECT SUM(salary) FROM employees WHERE roomNumber=1
SELECT SUM(salary) FROM employees GROUP BY roomNumber
 
SELECT COUNT(salary) FROM employees
SELECT COUNT(*) FROM employees
 
SELECT MAX(salary) FROM employees
SELECT MIN(salary) FROM employees
 
SELECT AVG(salary) FROM employees
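
The same queries can be issued from Python through the sqlite3 module we use in this course. Here is a minimal sketch, assuming the employees table lives in a file called employees.db (the file name is just a placeholder):

import sqlite3
 
#open the database file and get a cursor
conn = sqlite3.connect("employees.db")
cur = conn.cursor()
 
#parameterized query: the ? placeholders avoid quoting problems
cur.execute("SELECT * FROM employees WHERE name=? OR salary=?", ("Amy Wong", 1.0))
for row in cur.fetchall():
    print row
 
#an aggregate, grouped by room
cur.execute("SELECT roomNumber, SUM(salary) FROM employees GROUP BY roomNumber")
print cur.fetchall()
 
conn.close()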

In addition, I started to talk about distance measures, something we will deal with much more when we talk about clustering.
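As a small preview, here is my own sketch (not the lecture code) of two common distance measures, Euclidean and Manhattan distance, between two points:

import math
 
def euclidean(a, b):
    #square root of the summed squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
 
def manhattan(a, b):
    #sum of the absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))
 
p = (1.0, 2.0)
q = (4.0, 6.0)
print euclidean(p, q)  #5.0
print manhattan(p, q)  #7.0
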
Here are the slides: Lecture 5
Cheers Arend

Individual Projects

The idea behind giving you individual projects is to challenge you with an open problem in big data. However, because you can propose your own project, the hope is that your level of motivation is higher than in a project that I would assign you, because you will most likely choose something you are already interested in. In addition, your curiosity can lead you to places where projects that I design would not lead you.

The project proposals that you are writing right now will be evaluated against a couple of criteria:

  • feasibility – we estimate the time, effort, and computational resources required; we don’t want you to over- or under-scope, or end up fighting problems that distract from the actual project
  • complexity – think about the three Vs of big data: Volume, Velocity, Variety. Your project should ideally address all of these, but it is very unrealistic that you will find a problem that does, so please identify which of the three Vs is your primary issue. Simply scraping a website and visualizing the data doesn’t cut it. Please check this project, which is interesting but in my opinion too simple: http://alproductions.us/blog/2013/11/26/the-end-of-flash-gaming/
  • method – we haven’t talked about the computational methods we will teach you, so this is really hard to eyeball, but in general your project should either detect interesting patterns in data (like clusters, modules, trends), allow predictions (trends, time series, consumer basket), or allow classifications (who will win, who will buy XYZ, this object belongs in this category)
  • originality – if you come up with a project that is mind-blowing, or at least very interesting, but does not really conform to the above, we will probably still try to make it work.

I have been asked about the business aspect, and ideally your project should answer a business question or have commercial value. However, I think it is more important to perform an analysis accurately on a toy problem than to use a real-world problem that is either not interesting or does not teach you the necessary skills. Therefore I am relaxing this constraint, but would of course be happy if you chose a business problem.

Here are a couple of ideas that I had, but that are also already inspired by conversations with you:

  • money ball – you download some form of sports stats and try to derive a model that is predictive of a game’s outcome; watch the movie “Moneyball” or read the book if you are curious about this. In essence you can make your baseball team much better if you optimize for players that get you on first base.
  • interest biases – I am not talking about money but about what people are interested in. I crawled bloggers.com and tried to find gender biases and stereotypes per state: http://alproductions.us/blog/2013/11/14/gender-bias-and-stereotypes/. Bloggers.com is easy to crawl, has geo tags, and shows the blogs that people read and follow.
  • interest clusters – it is not clear how interests relate to each other; one could cluster interests and find categories of interests that belong together
  • interest profiling – if you know one or two interests, can you predict what else the person might be interested in?
  • social networks – get data from which you can derive a social (or other type of) network and identify clusters (groups); more interesting would be to see how and why the network changes
  • Diet Coke and Fries – I guess this is a stupid title, but there is the idea that when you order fries it doesn’t matter anymore whether you order diet coke or not: calorie-wise you are over your limit already. However, people still order diet coke with their fries – or do they? Data can reveal such contradictions, or open opportunities: people invest in either risky or conservative funds, but bet-hedging suggests that you should do both.
  • recommendation system – regardless of the web service, there is always the option to improve how data is found or accessed; better clustering or better classification, as well as a totally new approach, is conceivable

Please feel free to add your suggestions and ideas; the more we move ideas around the better. Cheers Arend

Lecture 4

This and the next lecture are about databases. The first will focus on what databases are and how they are set up; the second will deal with using databases and search queries. We will use the sqlite3 module for Python, which wraps standard SQL and as such is mostly compatible with it. Alternatively you can use arctic and the SQL server installed there.
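
A minimal sketch of how such a database is set up with sqlite3 (the table and column names follow the lecture 5 example queries, the row values are made up; the linked notebook has the full version):

import sqlite3
 
#create (or open) a database file and get a cursor
conn = sqlite3.connect("example.db")
cur = conn.cursor()
 
#define a table and insert a few rows
cur.execute("CREATE TABLE IF NOT EXISTS employees (name TEXT, salary REAL, roomNumber INTEGER)")
cur.execute("INSERT INTO employees VALUES (?, ?, ?)", ("Amy Wong", 1.0, 1))
cur.execute("INSERT INTO employees VALUES (?, ?, ?)", ("Philip Fry", 2.5, 2))
 
#changes must be committed before they are visible to other connections
conn.commit()
conn.close()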

The iPython notebook: cse891 sqlite3.ipynb

The slides: Lecture 4

The exercise: Exercise4

Lecture 3 – API

This lecture serves three purposes:

  • Turn you into a data junkie! Sorry, make you aware of the endless possibilities of data on the web
  • Show you that APIs are actually easy to use
  • Let you use the Twitter API to obtain data and use it

    Reddit API:

    The first part is about Reddit, for which we need a couple of links:
    MovieKills
    There is a blog post explaining it here: http://www.theswarmlab.com/r-vs-python-round-2/
    And a video showing quick and dirty web scraping in Python here: https://www.youtube.com/watch?v=qfGthiqwaZo
    Read more about Reddit here: http://www.reddit.com/about

    We want to use Reddit’s API to scrape information about posts on Reddit. We will be using the Python Reddit API Wrapper (PRAW) to do that.
    To install PRAW:

    easy_install praw

    Mac and Linux users may need to use “sudo” before the above command to get install permissions, or use the --user flag

    Detailed PRAW documentation here: https://praw.readthedocs.org/en/latest/

    A bunch of example scripts that you can learn from: https://praw.readthedocs.org/en/latest/pages/useful_scripts.html

    In-class code practice

    Before using PRAW, you must register your client with Reddit:

    import praw
    reddit = praw.Reddit(user_agent="MSU data analysis class bot")

    This lets Reddit know that you plan to use their API to scrape some data. This check-in process is important. If you’re dishonest about who you are or abuse the API, Reddit admins may ban you!
    If you plan to use Reddit for your project, be sure to read the API usage terms: https://github.com/reddit/reddit/wiki/API#rules

    To get posts on the front page of reddit.com:

    for post in reddit.get_front_page():
        print post

    Each “post” object has attributes related to the post, like the time it was posted (created_utc), where it was posted (subreddit), and how many upvotes it has (ups). Here’s how to print the linked URL for each of the posts on the front page:

    for post in reddit.get_front_page():
        print post.url
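
    Purely as an illustration, the same loop can print the attributes mentioned above:

    for post in reddit.get_front_page():
        #posting time, subreddit, and upvote count
        print post.created_utc, post.subreddit, post.ups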

    To get the 10 newest posts in a particular subreddit (note the “limit” parameter):

    for post in reddit.get_subreddit("dataisbeautiful").get_new(limit=10):
        print post

    To get the 25 newest posts and comments by a particular user:

    for entry in reddit.get_redditor("Deimorz").get_overview(limit=25):
        print entry

    Challenge: Go through a redditor’s profile with PRAW and print out the last 50 subreddits they have posted to.

    Twitter API:


    You need to obtain Twitter credentials; go to: https://dev.twitter.com/apps
    You also need to install the twitter API package, using either:
    easy_install twitter
    or
    pip install twitter
    Windows users can go to: https://code.google.com/p/python-twitter/

    The full iPython notebook can be found here: cse891 twitter API.ipynb

    The first piece of code deals with authenticating yourself:

    import twitter
     
    def readStrFromFile(filename):
        f=open(filename,'r')
        value=f.read().strip() #strip a possible trailing newline from the key file
        f.close()
        return value
     
    #load all keys, secrets, tokens, ...
    CONSUMER_KEY=readStrFromFile("consumerKey.txt")
    CONSUMER_SECRET=readStrFromFile("consumerSecret.txt")
    OAUTH_TOKEN =readStrFromFile("authToken.txt")
    OAUTH_TOKEN_SECRET =readStrFromFile("authTokenSecret.txt")
     
    #authenticate yourself
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)
     
    #get a twitter instance called twitter_api
    twitter_api = twitter.Twitter(domain='api.twitter.com', 
                                  api_version='1.1',
                                  auth=auth
                                 )
     
    print twitter_api

    Then we want to search for tweets:

    import json
    search_res=twitter_api.search.tweets(q="MSU",count=100)
    #print search_res
    #L=search_res["statuses"]
    #for l in L:
    #    print l['text']
    print json.dumps(search_res,indent=1)

    Next we obtain trending data; for that we need to understand WOE IDs (Where On Earth IDs), which can be found here: http://woeid.rosselliot.co.nz/

    WORLD_WOE_ID=1
    US_WOE_ID=23424977
    MICHIGAN_WOE_ID=2347581
    print json.dumps(twitter_api.trends.place(_id=WORLD_WOE_ID),indent=1)

    I will elaborate more on JSON and dictionaries:
    Concatenate the tweet texts from the search result into one large string and split it into a list of words

    L=search_res["statuses"]
    S=""
    for l in L:
        #print l['text']
        S=S+" "+l['text']
    words=S.split(" ")

    Next we use a dictionary to count how often each word appears:

    count=dict()
    for word in words:
        if count.has_key(word):
            count[word]=count[word]+1
        else:
            count[word]=1

    And find the word with the highest count:

    maxCount=0
    maxWord=""
    for key in count.keys():
        if(count[key]>maxCount):
            maxCount=count[key]
            maxWord=key
    print maxWord+" "+str(maxCount)
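
    As an aside (we did not use it in class), the standard library’s collections.Counter can do the counting and the maximum lookup over the same words list in fewer lines:

    from collections import Counter
     
    count = Counter(words)
    #most_common(1) returns a list with the single most frequent (word, count) pair
    maxWord, maxCount = count.most_common(1)[0]
    print maxWord+" "+str(maxCount)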

    The exercise can be found here: Exercise3.

Lecture 2 summary

This lecture introduced how to get data from the web. Once you download an HTML page you need to get information out of it. For that we use regular expressions, or, if the data is in JSON format, a JSON parser.

The terminology for all these methods is not carved in stone yet, but I call getting data from a website scraping. If you use a program to download websites in a systematic manner in order to get data from them, I call that crawling. A program that explores a network of sites in a non-systematic way in order to collect data is called a spider.

In order to get data from a website we used somewhat unintuitive (complicated) regular expressions. We used a specific RE pattern that returns the pieces of the website that are flanked by specific phrases. The following piece of code, for example:

geoLoc=re.findall('''"latLng"(.+?)"locality"''',theSite,re.DOTALL)

takes the string stored in “theSite” and returns all substrings that are flanked by “latLng” to the left and “locality” to the right. Similarly, if you want to get all links on a site you would use the following RE search term:

links=re.findall('''<a href="(.+?)"''',theSite,re.DOTALL)

An alternative to all this is to use an API, a dedicated interface for obtaining data directly from web services. The data is usually organized in a JSON string, which can be parsed using the JSON library. The advantage is clear: you don’t need to parse the website yourself, and you also reduce web traffic, which makes everything faster. Also, this approach is usually supported by the ones running the web service. The only “downside” is that you need to understand how to translate the JSON string into palatable data (more on this on Wednesday).

As you can see, the exercise required you to take a code snippet, copy it into an iPython notebook, change some variables in it, and of course run the snippet. For the rest of the course we will keep this level, where you are required to adapt code snippets for your own problems. Of course nothing keeps you from writing more sophisticated scripts; in fact I encourage you to do so, and I will help you with your own project.

For further reading you might be interested in the following links:
Tutorial on HTML
Tutorial on Regular Expressions
Tutorial on JSON

The lecture slides can be downloaded here: Lecture 2

Cheers Arend

Lecture 2

This lecture is about data collection from the internet. We will use wget as an option from the shell, direct URL downloads using urllib in Python, parse data using regular expressions and JSON, and combine everything into a web spider.

The pdf for the exercise is here: Exercise2

The urllib example

import urllib
myUrl="http://www.mapquest.com/maps?cat=starbucks&zipcode=48823"
theSite=urllib.urlopen(myUrl).read()
print theSite

parsing code using regular expressions

import re
geoLoc=re.findall('''"latLng"(.+?)"locality"''',theSite,re.DOTALL)
allAdd=re.findall('''"singleLineAddress":"(.+?)"''',theSite,re.DOTALL)
for i in range(len(allAdd)):
    lat=re.findall('''"lat":(.+?),''',geoLoc[i],re.DOTALL)
    lng=re.findall('''"lng":(.+?)}''',geoLoc[i],re.DOTALL)
    print "Lat: "+str(lat)+" Lng: "+str(lng)+" Addr: "+str(allAdd[i])

JSON

import json
A=json.loads('{"lat":42.68464,"lng":-84.43375}')
print A
print A["lat"]
B=json.loads('''{"geocodeQualityCode":"L1","latLng":{"lat":42.68464,"lng":-84.43375},"locality":"Okemos","postalCode":"48864","quality":"ADDRESS","region":"MI","regionLong":"Michigan","singleLineAddress":"3552 Meridian Crossings Dr, Okemos, MI 48864","street":"3552 Meridian Crossings Dr"}''')
print B
print B["locality"]
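
Nested fields are reached by chaining keys, for example the coordinates inside "latLng" from the string above:

print B["latLng"]["lat"]
print B["latLng"]["lng"]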

mapquest API

import mapq
f=open("MapQuestKey.txt")
theMapQuestKey=f.read()
mapq.key(theMapQuestKey)
print mapq.batch('starbucks','48823')

the webspider

import urllib
import re
seedPage="http://www.technicpack.net"
 
maxNrOfPages=10
currentCuePosition=0
theQue=[]
theQue.append(seedPage)
 
recordGraph=True
if(recordGraph):
    G=dict()
 
while(currentCuePosition<maxNrOfPages):
    try:
        urlToLoad=theQue[currentCuePosition]
        theSite=urllib.urlopen(urlToLoad).read()
        #insert code to analyse "theSite" here:
        links=re.findall('''<a href="(.+?)"''',theSite,re.DOTALL)
        print str(currentCuePosition)+": "+theQue[currentCuePosition]
        for l in links:
            if l not in theQue:
                #additional filtering to be added here
                theQue.append(l)
            if(recordGraph):
                if(G.has_key(urlToLoad)):
                    G[urlToLoad].append(l)
                else:
                    G[urlToLoad]=[l] #start the adjacency list with this first link
    except Exception:
        pass
    currentCuePosition=currentCuePosition+1

visualizing the link network using networkx

import networkx as nx
import matplotlib.pyplot as plt
 
#build a directed graph from the link dictionary G recorded by the spider
g = nx.DiGraph()
for P1 in G:
    for P2 in G[P1]:
        g.add_edge(P1,P2)
 
print nx.info(g)
 
#draw the network and save it as a pdf
plt.figure(figsize=(80, 80))
nx.draw(g)
plt.savefig("theGraph.pdf")

Lecture 1

Today's lecture was about getting your bearings on the arctic server, finding your home directory, getting your web folder set up, and dabbling around in Python. In the theory part we defined what big data is – it is big data when at least one of the three Vs applies (volume, velocity, variety) – and we briefly discussed different approaches to big data analysis.

For the exercise we needed to get access to arctic. Please find the pdf file of the exercise here: Exercise 1. For that you need either ssh (on OS X) or PuTTY on Windows. If you want graphical content forwarded to you, and to use xemacs properly, you need to either use ssh -X username@arctic.cse.msu.edu or enable X11 forwarding in PuTTY. However, for this to work on a Mac you need to install XQuartz first.

To remember from today: moving around in the unix/linux filesystem is done with ls (list the directory content), and folders are changed with cd (.. is the folder above). Files are copied with cp and moved with mv. In the exercise we used gzip and tar to unpack a file, and we also used ln to create a symbolic link instead of copying files. To make folders you use mkdir, and you can remove files and folders using rm. There is of course a very detailed and good command line (shell) tutorial to be studied here.

For next time I ask you to install the iPython notebook, which is easiest done with the Enthought Canopy package. Please install the free version. You should also get an account at sagemath.cloud so you can do everything there in case the installation has issues.

If you want to check out the slides of today you can find them here: Lecture 1.

I think at times the exercise must have been confusing. You needed to do things that sometimes didn't work, and sometimes needed more time to sink in. I also think that especially the shell is rather unintuitive, because we are used to graphical interfaces and suddenly we have to do everything by hand, using strange abbreviations, with a system that doesn't allow mistakes. The next lectures, which deal with the iPython notebook, will be way more intuitive and will not require you to jump through hoops. I don't know why handin didn't work properly, but you know my email address now and can just mail me the exercise.

Please create an account on this blog so that you can also comment. In case your comment contains a link it will end up in a queue for approval; this is spam protection. I encourage you to use the comments to ask, point out, question, criticize, demand, or compliment the lecture. Feedback will improve the quality.

We haven't talked about the schedule and outline of the class; we will do that on Monday. Looking forward to seeing you again,

Cheers Arend