First steps
					in Data Mining
					with Weka
					Łukasz Kobyliński & Radosław Szmit
					Codepot 2015
				What is Data Mining?
					Data Mining is a process of discovering hidden information in data.
					
					https://visualisingadvocacy.org/blog/disinformation-visualization-how-lie-datavis
				Typical applications
					Customer analysis
					- which customers are likely to increase their purchases?
					- which products are more likely to sell to my customers?
Typical applications
						Text mining
						
						- what is the category of this email we have received?
						- is this product review positive or negative?
						- what are they saying about me on Twitter?
Typical applications
					Image mining
					
					- which images in my collection contain cats?
					- which of my contacts are visible in these photos?
					- what are the sex and age of these people?
Data Mining Methods
					- Regression analysis
					- Classification
					- Cluster analysis
					- Association rule mining
					- Sequence mining
					- Anomaly detection
					- ...
Task #1
					Assign names to flowers
					How do they differ?
					iris-versicolor, iris-setosa, iris-virginica
					iris-
					iris-
					iris-
					iris-setosa has low values for all parameters except sepal-width (wide sepals).
					iris-versicolor has medium values for all parameters.
					iris-virginica has high values for all parameters except sepal-width.
				    Fisher's iris dataset
						petals and sepals
Task #2
					Answer the questions:
					- iris-setosa has:
						- low values of 
						- high values of 
					- iris-virginica has:
						- low values of 
						- high values of 
					- the three classes are best separated by:
						- sepallength
						- sepalwidth
						- petalwidth
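One way to approach these questions outside the Weka Explorer is to compare per-class attribute means. The following is only a rough sketch under stated assumptions: it uses a hand-typed nine-row excerpt of Fisher's iris data (three rows per class) rather than the full dataset, and the helper name `class_means` is hypothetical, not part of Weka.

```python
from collections import defaultdict
from statistics import mean

ATTRIBUTES = ("sepallength", "sepalwidth", "petallength", "petalwidth")

# Assumption: a hand-typed excerpt (three rows per class), not the full dataset.
SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "iris-versicolor"),
    (6.9, 3.1, 4.9, 1.5, "iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "iris-virginica"),
    (7.1, 3.0, 5.9, 2.1, "iris-virginica"),
]


def class_means(rows):
    """Return {class label: {attribute name: mean value}} (hypothetical helper)."""
    grouped = defaultdict(list)
    for *values, label in rows:
        grouped[label].append(values)
    return {
        label: {
            name: round(mean(column), 2)
            for name, column in zip(ATTRIBUTES, zip(*value_rows))
        }
        for label, value_rows in grouped.items()
    }


if __name__ == "__main__":
    for label, means in class_means(SAMPLE).items():
        print(label, means)
```

An attribute whose class means are far apart is a good candidate for separating the classes; with only nine rows this is a hint, not a conclusion.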
Task #3
					Use the rules.PART classifier on the iris dataset and answer the questions:
					- which is better on this dataset: J48 or PART?
					- how many examples of iris-versicolor have been classified as iris-setosa?
					- how many examples of iris-virginica have been classified as iris-versicolor?
					- what is the accuracy of the classifier on the training set?
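The misclassification questions above are answered by reading a confusion matrix, where each row is the actual class and each column is the class the classifier assigned. The sketch below shows how such a matrix and the accuracy are computed. The label lists are made up for illustration and are not results from PART or J48; the function names are hypothetical.

```python
from collections import Counter

CLASSES = ("iris-setosa", "iris-versicolor", "iris-virginica")


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of examples whose predicted label equals the actual one."""
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / sum(matrix.values())


if __name__ == "__main__":
    # Assumption: made-up labels, NOT output of any Weka classifier.
    actual = ["iris-setosa"] * 2 + ["iris-versicolor"] * 3 + ["iris-virginica"] * 3
    predicted = [
        "iris-setosa", "iris-setosa",
        "iris-versicolor", "iris-versicolor", "iris-virginica",
        "iris-virginica", "iris-virginica", "iris-versicolor",
    ]
    matrix = confusion_matrix(actual, predicted)
    for a in CLASSES:  # rows: actual class, columns: predicted class
        print(a.ljust(16), [matrix[(a, p)] for p in CLASSES])
    print("accuracy:", accuracy(matrix))
```

The off-diagonal cell for (iris-virginica, iris-versicolor) is the count of iris-virginica examples classified as iris-versicolor, which is the kind of number the task asks for.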
Task #4
					Use the J48 classifier on the iris dataset and answer the questions:
					- use the tree visualization pane to manually classify the following example:
						- sepallength=6.7, sepalwidth=3.0
						- petallength=5.0, petalwidth=1.7
					- are all the attributes used in the classifier?
					- how many instances of iris-versicolor were misclassified as iris-virginica? (use the Visualize classifier errors panel)
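Classifying an example by hand means starting at the root of the tree and following, at each node, the branch whose test the example satisfies until a leaf is reached. The sketch below illustrates that walk on a small hand-written tree. The tree structure and its thresholds are illustrative assumptions, not the tree J48 builds; read the real tests from the Weka visualization. The names `classify` and `used_attributes` are hypothetical.

```python
# Internal node: (attribute, threshold, subtree_if_<=, subtree_if_>)
# Leaf: a class label string.
# Assumption: hand-written tree with placeholder thresholds, not J48 output.
TREE = (
    "petalwidth", 0.6,
    "iris-setosa",
    ("petalwidth", 1.7, "iris-versicolor", "iris-virginica"),
)


def classify(tree, example):
    """Follow the branches from the root until a leaf label is reached."""
    while not isinstance(tree, str):
        attribute, threshold, low, high = tree
        tree = low if example[attribute] <= threshold else high
    return tree


def used_attributes(tree):
    """Collect the attributes that are tested somewhere in the tree."""
    if isinstance(tree, str):
        return set()
    attribute, _, low, high = tree
    return {attribute} | used_attributes(low) | used_attributes(high)


if __name__ == "__main__":
    example = {"sepallength": 5.1, "sepalwidth": 3.5,
               "petallength": 1.4, "petalwidth": 0.2}
    print(classify(TREE, example))
    print(sorted(used_attributes(TREE)))
```

Comparing `used_attributes` with the full attribute list mirrors the question of whether every attribute appears in the classifier.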
Data: the evolution
					- Big Data: "What’s likely to happen to online sales, considering 1M visits/day?"
					- Data Mining and Knowledge Discovery: "What’s likely to happen to Boston unit sales next month? Why?"
					- Data Warehousing: "What were unit sales in New England last March? Drill down to Boston."
					- Data Access: "What were unit sales in New England last March?"
					- Data Collection: "What was my total revenue in the last five years?"
		