Lecture 2

This lecture is about collecting data from the internet. We cover wget as an option for the shell, direct URL downloads with urllib in Python, parsing the downloaded data using regular expressions and JSON, and finally combine everything into a web spider.

The PDF for the exercise is here: Exercise2

The urllib example

import urllib
# a MapQuest search for Starbucks locations in zip code 48823
myUrl="http://www.mapquest.com/maps?cat=starbucks&zipcode=48823"
# download the page and keep its source as one long string
theSite=urllib.urlopen(myUrl).read()
print theSite
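
If the search terms change often, the query string can also be built with urllib.urlencode instead of being typed into the URL by hand. A minimal sketch of the same request, with the parameters kept in a dictionary:

import urllib
# assemble the query string ("cat=starbucks&zipcode=48823") from a dictionary
params=urllib.urlencode({"cat":"starbucks","zipcode":"48823"})
myUrl="http://www.mapquest.com/maps?"+params
theSite=urllib.urlopen(myUrl).read()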

parsing the page using regular expressions

import re
# everything between each "latLng" entry and the following "locality" entry
geoLoc=re.findall('''"latLng"(.+?)"locality"''',theSite,re.DOTALL)
# every formatted store address on the page
allAdd=re.findall('''"singleLineAddress":"(.+?)"''',theSite,re.DOTALL)
for i in range(len(allAdd)):
    # pull the latitude and longitude out of the matching latLng block
    lat=re.findall('''"lat":(.+?),''',geoLoc[i],re.DOTALL)
    lng=re.findall('''"lng":(.+?)}''',geoLoc[i],re.DOTALL)
    print "Lat: "+str(lat)+" Lng: "+str(lng)+" Addr: "+str(allAdd[i])

JSON

import json
# json.loads turns a JSON string into the matching Python object
A=json.loads('{"lat":42.68464,"lng":-84.43375}')
print A
print A["lat"]
# a complete geocoding record, parsed into a nested dictionary
B=json.loads('''{"geocodeQualityCode":"L1","latLng":{"lat":42.68464,"lng":-84.43375},"locality":"Okemos","postalCode":"48864","quality":"ADDRESS","region":"MI","regionLong":"Michigan","singleLineAddress":"3552 Meridian Crossings Dr, Okemos, MI 48864","street":"3552 Meridian Crossings Dr"}''')
print B
print B["locality"]

mapquest API

import mapq
# read the personal MapQuest API key from a local text file
f=open("MapQuestKey.txt")
theMapQuestKey=f.read()
# register the key with the mapq wrapper, then run a batch lookup
mapq.key(theMapQuestKey)
print mapq.batch('starbucks','48823')
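
mapq is only a thin wrapper around MapQuest's geocoding web service, so the same kind of lookup can be done by hand with urllib and json. The endpoint below is an assumption based on MapQuest's geocoding API documentation and should be checked against the current docs; theMapQuestKey is the key read in above.

import urllib
import json
# assumed MapQuest geocoding endpoint (verify against the current API documentation)
base="http://www.mapquestapi.com/geocoding/v1/address"
query=urllib.urlencode({"key":theMapQuestKey,"location":"48823"})
answer=json.loads(urllib.urlopen(base+"?"+query).read())
print answer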

the web spider

import urllib
import re
seedPage="http://www.technicpack.net"

maxNrOfPages=10
currentCuePosition=0
theQue=[]                 # the queue of pages to visit, seeded with the start page
theQue.append(seedPage)

recordGraph=True
if(recordGraph):
    G=dict()              # adjacency list: page -> list of pages it links to

while(currentCuePosition<maxNrOfPages):
    try:
        urlToLoad=theQue[currentCuePosition]
        theSite=urllib.urlopen(urlToLoad).read()
        #insert code to analyse "theSite" here:
        links=re.findall('''<a href="(.+?)"''',theSite,re.DOTALL)
        print str(currentCuePosition)+": "+theQue[currentCuePosition]
        for l in links:
            if l not in theQue:
                #additional filtering to be added here (one possible filter is sketched below)
                theQue.append(l)
            if(recordGraph):
                if urlToLoad in G:
                    G[urlToLoad].append(l)
                else:
                    G[urlToLoad]=[l]
    except Exception:
        pass              # skip pages that cannot be loaded or parsed
    currentCuePosition=currentCuePosition+1
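
One possibility for the spot marked "additional filtering to be added here" is to keep only absolute links that stay on the seed site, which drops relative paths and links to other domains. The helper name keepLink is made up for this sketch; inside the loop the test would then become if l not in theQue and keepLink(l):

def keepLink(link,seed=seedPage):
    # hypothetical filter: only follow links that start with the seed address
    return link.startswith(seed)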

visualizing the link network using networkx

from pylab import figure, savefig
import networkx as nx

g = nx.DiGraph()
# add one directed edge for every link the spider recorded in G
for P1 in G:
    for P2 in G[P1]:
        g.add_edge(P1,P2)

print nx.info(g)
figure(figsize=(80, 80))
nx.draw(g)
savefig("theGraph.pdf")
