Author: ArendHintze

Lecture 9

Now we start talking about data analysis methods, beginning with two sessions about classification.

There are two pieces of software that help you do this. The first is the Java program Weka, and the second is the Python library scikit-learn.
Today's exercise will be about Weka, which you can find here: http://www.cs.waikato.ac.nz/ml/weka/
scikit-learn, which we will start using today if there is time, can be found here: http://scikit-learn.org/stable/install.html

The exercise: Exercise 9

Cheers Arend

Lecture 6

Today's lecture is about data preprocessing. We will discuss different forms of data and how they can be turned into numerical values, as well as measurement errors, noise and distributions, duplicate or missing data, privacy issues, aggregation, sampling, and discretization.
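To make a few of these steps concrete, here is a minimal sketch in plain Python that removes exact duplicates, fills a missing value with the column mean, and discretizes a numeric field into labeled bins. The records, field names, and bin edges are invented for illustration and are not taken from the bank-data file or the notebook; mean imputation is only one of several ways to handle missing data.

```python
from statistics import mean

# Hypothetical customer records (field names and values are made up).
records = [
    {"id": 1, "age": 23, "income": 18000.0},
    {"id": 2, "age": 47, "income": None},     # missing value
    {"id": 3, "age": 35, "income": 42000.0},
    {"id": 3, "age": 35, "income": 42000.0},  # exact duplicate
    {"id": 4, "age": 61, "income": 30000.0},
]


def drop_duplicates(rows):
    """Keep the first occurrence of every identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_with_mean(rows, field):
    """Replace None in `field` with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = mean(known)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds `value`."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


clean = fill_missing_with_mean(drop_duplicates(records), "income")
for row in clean:
    # Assumed bin edges: below 30, 30 to 49, 50 and above.
    row["age_group"] = discretize(row["age"], [30, 50], ["young", "middle", "senior"])
    print(row)
```

Note that the duplicate is removed before the mean is computed, so the repeated record does not bias the imputed value.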

You can find the iPython notebook here: cse891 data preprocessing.ipynb
In addition, there is a fictional data file about bank customers: bank-data

The exercise can be found here: Exercise 6
This exercise includes a bonus question for the interested; the solution can be found here: cse891 Ex6 solution.ipynb

Cheers Arend

Lecture 5

Today we looked into SQL queries and how to get data from the tables. Here is a list of commands we used in detail:

SELECT * FROM employees
SELECT * FROM employees WHERE name="Amy Wong"
SELECT * FROM employees WHERE name="Amy Wong" OR salary
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0 ORDER BY salary
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0 ORDER BY salary DESC
SELECT * FROM employees WHERE name="Amy Wong" UNION SELECT * FROM employees WHERE salary=1.0
SELECT * FROM employees WHERE name="Amy Wong" OR salary=1.0
 
SELECT * FROM employees LIMIT 4
SELECT * FROM employees ORDER BY salary LIMIT 4
 
SELECT SUM(salary) FROM employees
SELECT SUM(salary) FROM employees WHERE roomNumber=1
SELECT SUM(salary) FROM employees GROUP BY roomNumber
 
SELECT COUNT(salary) FROM employees
SELECT COUNT(*) FROM employees
 
SELECT MAX(salary) FROM employees
SELECT MIN(salary) FROM employees
 
SELECT AVG(salary) FROM employees
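
The commands above can be tried out from Python with the sqlite3 module. The following is a minimal sketch: the table layout (name, salary, roomNumber) is inferred from the queries, while the rows other than the name "Amy Wong" are made up for illustration. It uses an in-memory database and passes values as `?` parameters instead of writing them into the query string.

```python
import sqlite3

# Hypothetical sample rows: (name, salary, roomNumber).
ROWS = [
    ("Amy Wong", 1.0, 1),
    ("Employee B", 2.0, 1),
    ("Employee C", 4.0, 2),
    ("Employee D", 3.0, 2),
]


def make_db():
    """Create an in-memory database with a small employees table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary REAL, roomNumber INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", ROWS)
    conn.commit()
    return conn


def run(conn, query, params=()):
    """Execute a query and return all result rows as a list of tuples."""
    return conn.execute(query, params).fetchall()


conn = make_db()

# Filtering with WHERE.
by_name = run(conn, "SELECT * FROM employees WHERE name = ?", ("Amy Wong",))

# Sorting and limiting: the two highest salaries.
top_two = run(conn, "SELECT name FROM employees ORDER BY salary DESC LIMIT 2")

# Aggregation per group.
per_room = run(
    conn,
    "SELECT roomNumber, SUM(salary) FROM employees GROUP BY roomNumber ORDER BY roomNumber",
)

# Several aggregates in one query.
total, count, average = run(conn, "SELECT SUM(salary), COUNT(*), AVG(salary) FROM employees")[0]

print(by_name)
print(top_two)
print(per_room)
print(total, count, average)
```

Selecting roomNumber alongside SUM(salary) in the GROUP BY query makes it visible which sum belongs to which room.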

In addition, I started to talk about distance measures, something we will deal with much more when we talk about clustering.
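
As a small illustration of what a distance measure is, here is a sketch of two common ones, Euclidean and Manhattan distance, in plain Python. Which measures the lecture covers is not stated in this post, so the choice of these two is an assumption; see the slides for the actual list.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences ("city block" distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```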
Here are the slides: Lecture 5
Cheers Arend

Individual Projects

The idea behind giving you individual projects is to challenge you with an open problem in big data. Because you propose the project yourself, the hope is that you will be more motivated than by a project that I would give you, since you will most likely choose something you are already interested in. In addition, your curiosity can lead you to places where projects that I design would not.

The project proposals that you are writing right now will be evaluated by a couple of criteria:

  • feasibility – we estimate the time, effort, and computational resources required. We don’t want you to over- or under-scope, or to end up fighting problems that distract from the actual proposal.
  • complexity – think about the three Vs of big data: Volume, Velocity, and Variety. Ideally your project would address all of them, but it is unrealistic to expect that you will find a problem that does, so please identify which of the three Vs is your primary issue. Simply scraping a website and visualizing the data doesn’t cut it. Please check this project, which is interesting but in my opinion too simple: http://alproductions.us/blog/2013/11/26/the-end-of-flash-gaming/
  • method – we haven’t yet talked about the computational methods we will teach you, so this is really hard to eyeball. In general, your project should either detect interesting patterns in data (like clusters, modules, trends), allow predictions (trends, time series, consumer basket), or allow classifications (who will win, who will buy XYZ, this object belongs in this category).
  • originality – if you come up with a project that is mind-blowing, or at least very interesting, but does not really conform to the above, we will probably still try to make it work, simply because.

I have been asked about the business aspect. Ideally your project should answer a business question or have commercial value. However, I think it is more important to perform an analysis accurately on a toy problem than to use a real-world problem that is either not interesting or does not teach you the necessary skills. Therefore I am relaxing this constraint, but would of course be happy if you chose a business problem.

Here are a couple of ideas that I had, but that are also already inspired by conversations with you:

  • moneyball – download some form of sports stats and try to derive a model that predicts game outcomes. Watch the movie “Moneyball” or read the book if you are curious about this. In essence, you can make your baseball team much better if you optimize for players that get on first base.
  • interest biases – I am not talking about money but about what people are interested in. I crawled bloggers.com, which is easy to crawl and has geo tags as well as the blogs that people read and follow, and tried to find gender biases and stereotypes per state: http://alproductions.us/blog/2013/11/14/gender-bias-and-stereotypes/
  • interest clusters – it is not clear how interests relate to each other. One could cluster interests and find categories of interests that belong together.
  • interest profiling – if you know one or two of a person’s interests, can you predict what else that person might be interested in?
  • social networks – get data from which you derive a social (or other type of) network and identify clusters (groups). Even more interesting would be to see how and why the network changes.
  • Diet Coke and Fries – I guess this is a stupid title, but the idea is that once you order fries it no longer matters whether you order Diet Coke or not, because calorie-wise you are already over your limit. However, people still order Diet Coke with their fries – or do they? Data can reveal such contradictions, or open opportunities: people invest in either risky or conservative funds, whereas bet-hedging suggests that you should do both.
  • recommendation system – regardless of the web service, there is always room to improve how data is found or accessed. Better clustering, better classification, or a totally new approach are all conceivable.

Please feel free to add your suggestions and ideas; the more we move ideas around, the better. Cheers Arend

Lecture 4

This and the next lecture are about databases. The first will focus on what databases are and how they are set up; the second will deal with using databases and search queries. We will use the sqlite3 module for Python, which provides an interface to SQLite, whose dialect is mostly compatible with standard SQL. Alternatively, you can use arctic and the SQL server installed there.
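
As a first taste of how a database is set up with sqlite3, here is a minimal sketch that creates a table, inserts a few rows, and inspects the resulting schema. The table and column names are hypothetical and chosen for illustration; the notebook may use a different layout. The database lives in memory here; passing a filename to `sqlite3.connect` would store it on disk instead.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the
# connection is closed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Define the table: every column gets a name and a type.
cur.execute(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        salary REAL,
        roomNumber INTEGER
    )
    """
)

# Insert rows; the ? placeholders are filled in by sqlite3.
cur.executemany(
    "INSERT INTO employees (name, salary, roomNumber) VALUES (?, ?, ?)",
    [("Employee A", 1.0, 1), ("Employee B", 2.5, 2)],
)
conn.commit()  # make the inserts permanent

# Inspect what was created.
tables = [row[0] for row in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = [row[1] for row in cur.execute("PRAGMA table_info(employees)")]
row_count = cur.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(tables)
print(columns)
print(row_count)
```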

The iPython notebook: cse891 sqlite3.ipynb

The slides: Lecture 4

The exercise: Exercise 4