K-Means and association rules in Weka

K-Means

We analyze agricultural datasets from New Zealand with K-Means (https://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar), in particular, we will group the eucalyptus dataset.

% Attribute Information:
%  1.  Abbrev - site abbreviation - enumerated
%  2.  Rep - site rep - integer
%  3.  Locality - site locality in the North Island - enumerated
%  4.  Map_Ref - map location in the North Island - enumerated
%  5.  Latitude - latitude approximation - enumerated
%  6.  Altitude - altitude approximation - integer
%  7.  Rainfall - rainfall (mm pa) - integer
%  8.  Frosts - frosts (deg. c) - integer
%  9.  Year - year of planting - integer
%  10. Sp - species code - enumerated
%  11. PMCno - seedlot number - integer
%  12. DBH - best diameter base height (cm) - real
%  13. Ht - height (m) - real
%  14. Surv - survival - integer
%  15. Vig - vigour - real
%  16. Ins_res - insect resistance - real
%  17. Stem_Fm - stem form - real
%  18. Crown_Fm - crown form - real
%  19. Brnch_Fm - branch form - real

Since K-Means’ input are continuous features, our analysis starts by choosing the best diameter base height (DBH) and height (Ht) features. However, they have different units and missing values first, we’ll apply the filter “remove missing values”, “a mathematical expression” and “normalization” in order to keep away bad results.

Analysis

69% of clustered instances belong to cluster 1 and 31% of clustered instances belong to cluster 0, but our plot said another thing, instances are really close, so, a simple K-Means algorithm is unuseful, it doesn’t help to segment instances.

Association rules

We start from zero opening again our dataset.

Association rules algorithms’ inputs are categorical variables, so we discretize first.

We’re associating attributes by choosing the Apriori algorithm.

The rules are the following:

1. Frosts='(-inf--2.9]' 430 ==> Rep='(-inf-3.1]' 430    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.58)
2. Frosts='(-inf--2.9]' 430 ==> DBH='(-inf-4209.022]' 430    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.58)
3. Frosts='(-inf--2.9]' DBH='(-inf-4209.022]' 430 ==> Rep='(-inf-3.1]' 430    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.58)
4. Rep='(-inf-3.1]' Frosts='(-inf--2.9]' 430 ==> DBH='(-inf-4209.022]' 430    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.58)
5. Frosts='(-inf--2.9]' 430 ==> Rep='(-inf-3.1]' DBH='(-inf-4209.022]' 430    <conf:(1)> lift:(1) lev:(0) [1] conv:(1.17)
6. Brnch_Fm='(2.5-3]' 340 ==> Rep='(-inf-3.1]' 340    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.46)
7. Brnch_Fm='(2.5-3]' 340 ==> DBH='(-inf-4209.022]' 340    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.46)
8. DBH='(-inf-4209.022]' Brnch_Fm='(2.5-3]' 340 ==> Rep='(-inf-3.1]' 340    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.46)
9. Rep='(-inf-3.1]' Brnch_Fm='(2.5-3]' 340 ==> DBH='(-inf-4209.022]' 340    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.46)
10. Brnch_Fm='(2.5-3]' 340 ==> Rep='(-inf-3.1]' DBH='(-inf-4209.022]' 340    <conf:(1)> lift:(1) lev:(0) [0] conv:(0.92)

Since real datasets have thousands of instances, we are showing you how to calculate association rules metrics.

Suppose you’re developing a recommendation system for a streaming company with authors Itemset={Mozart, Chopin, Handel, Bach, Queen, Chuck Mangione}. You associate them by extracting the association rule

Antecedent\to Consequent

where $Antecedent\subset Itemset,Consequent\in Itemset$ .

Suppose the following dataset

	Mozart	Handel	Bach	Chopin	Queen	Chuck Mangione
User 1	X			X
User 2	X	X	X	X
User 3		X
User 4					X
User 5					X	X
User 6	X	X	X

If a person listens Mozart and Handel, then he or she listens to Bach. Other words,

\{Mozart, Handel\}\to Bach

How do we measure this association rule's strengths?

Support.

support(Antecedent\to Consequent) =\dfrac{\text{Instances with Antecedent and Consequent}}{\text{Instances}}

So,

support(\{Mozart, Handel\}\to Bach ) =\dfrac{2}{6}=33.33\%

Confidence.

support(Antecedent\to Consequent) =\dfrac{\text{Instances with Antecedent and Consequent}}{\text{Instances with Antecedent}}

So,

support(\{Mozart, Handel\}\to Bach ) =\dfrac{2}{2}=100\%

Lift.

lift(Antecedent\to Consequent) =\dfrac{\text{Confidence}}{\text{Consequent support}}

So,

lift(\{Mozart, Handel\}\to Bach ) =\dfrac{1}{2}=50.00\%