ChiMerge

💡
ChiMerge is a discretization algorithm that transforms quantitative (numeric) data into qualitative (categorical) data using a bottom-up merging approach.

The algorithm uses the $\chi^2$ statistic to discretize numeric (continuous) attributes, so it performs discretization automatically.

The author proposes it as an improvement over intervals that a user chooses by hand from domain understanding, which requires user interaction and can produce poorly chosen intervals, and over other discretization methods such as equal-width intervals, equal-frequency intervals, and those used in C4, CART, and PVM.

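ChiMerge treats a discretization as a concise summarization of a numeric attribute by a set of intervals. A high-quality discretization shows intra-interval uniformity (class frequencies are fairly consistent within an interval) and inter-interval difference (adjacent intervals have different class frequencies). ChiMerge operationalizes this notion of quality with the $\chi^2$ statistic, which tests whether two discrete attributes are statistically independent; here the two attributes are the interval an example falls in and its class.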

An outline is presented below.

repeat
    foldr (\x y ->
              if (x, y) is the adjacent pair with the lowest chi-square value
              then merge x y
              else keep x and y
          )
          (map chiSquare over adjacent pairs of intervals)
               -- initial intervals: equal-width or equal-frequency intervals
until the lowest chi-square value exceeds the threshold
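The merge loop can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the author's implementation: the names `chi_square` and `chimerge` are hypothetical, the initial intervals hold one distinct value each (rather than equal-width or equal-frequency intervals), and classes that are absent from both intervals are skipped when computing the statistic.

```python
def chi_square(a, b):
    """Chi-square statistic for two adjacent intervals.

    a, b: per-class example counts of the two intervals (same length).
    """
    n = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in range(len(a)):
            expected = sum(row) * (a[j] + b[j]) / n
            if expected > 0:  # assumption: skip classes absent from both intervals
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, classes, threshold):
    """Return the lower bounds of the final intervals for one numeric attribute."""
    # Assumption for this sketch: start with one interval per distinct value.
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, [0] * len(classes))[classes.index(y)] += 1
    bounds = sorted(counts)
    intervals = [counts[v] for v in bounds]

    while len(intervals) > 1:
        # Chi-square value of every pair of adjacent intervals.
        chis = [chi_square(intervals[i], intervals[i + 1])
                for i in range(len(intervals) - 1)]
        lowest = min(chis)
        if lowest > threshold:
            break  # every adjacent pair is now significantly different
        # Merge the adjacent pair with the lowest chi-square value.
        i = chis.index(lowest)
        intervals[i] = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
        del intervals[i + 1]
        del bounds[i + 1]
    return bounds
```

For example, `chimerge([1, 2, 3, 4, 5, 6, 7, 8], list("aaaabbbb"), ["a", "b"], 2.706)` returns `[1, 5]`: the values 1 to 4 (all class `a`) collapse into one interval and the values 5 to 8 (all class `b`) into another.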

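The $\chi^2$ value for a pair of adjacent intervals is

$$\chi^2=\sum_{i=1}^{m}\sum_{j=1}^{k}\dfrac{(A_{ij}-E_{ij})^2}{E_{ij}}$$

where $m=2$ is the number of intervals being compared, $k$ is the number of classes, $A_{ij}$ is the number of examples in the $i$-th interval and $j$-th class, $R_i=\sum_{j=1}^{k}A_{ij}$ is the number of examples in the $i$-th interval, $C_j=\sum_{i=1}^{m}A_{ij}$ is the number of examples in the $j$-th class, $N=\sum_{j=1}^{k}C_j$ is the total number of examples, and $E_{ij}=\dfrac{R_i\,C_j}{N}$ is the expected frequency of $A_{ij}$.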
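To make the formula concrete, here is a small Python sketch that computes the statistic for two adjacent intervals from their per-class counts. The function name `chi_square_pair` is hypothetical, and skipping cells whose expected frequency is zero is an assumption of this sketch.

```python
def chi_square_pair(interval_1, interval_2):
    """Chi-square value of two adjacent intervals.

    interval_1, interval_2: lists of length k holding the number of
    examples of each class in the interval (the rows A_1j and A_2j).
    """
    counts = [interval_1, interval_2]                              # A_ij, m = 2
    k = len(interval_1)
    row_totals = [sum(row) for row in counts]                      # R_i
    col_totals = [counts[0][j] + counts[1][j] for j in range(k)]   # C_j
    n = sum(row_totals)                                            # N
    chi2 = 0.0
    for i in range(2):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n           # E_ij
            if expected == 0:
                continue  # assumption: class j occurs in neither interval
            chi2 += (counts[i][j] - expected) ** 2 / expected
    return chi2
```

With two classes, `chi_square_pair([2, 2], [2, 2])` gives 0.0 because the class frequencies are identical, `chi_square_pair([3, 1], [1, 3])` gives 2.0, and `chi_square_pair([4, 0], [0, 4])` gives 8.0 because the two intervals contain entirely different classes. Low values therefore mark pairs that are candidates for merging.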

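The $\chi^2$ threshold is determined by selecting a desired significance level and looking up the corresponding $\chi^2$ value, with the number of degrees of freedom equal to one less than the number of classes. For example, with 3 classes (2 degrees of freedom) the value at the 0.90 level is about 4.6.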
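The threshold lookup can also be done numerically instead of with a table. The following is a minimal standard-library sketch: the names `chi2_cdf` and `chi2_threshold` are hypothetical, and it only covers 1 degree of freedom or an even number of degrees of freedom, where the $\chi^2$ distribution function has a simple closed form. A statistics library would normally be used for the general case.

```python
import math


def chi2_cdf(x, df):
    """Chi-square distribution function for df == 1 or even df."""
    if x <= 0:
        return 0.0
    if df == 1:
        return math.erf(math.sqrt(x / 2.0))
    if df % 2 == 0:
        half = x / 2.0
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i          # (x/2)^i / i!
            total += term
        return 1.0 - math.exp(-half) * total
    raise ValueError("this sketch supports only df == 1 or even df")


def chi2_threshold(level, num_classes):
    """Chi-square value at the given level, with num_classes - 1 degrees of freedom."""
    df = num_classes - 1
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < level:   # grow the bracket until it contains the answer
        hi *= 2.0
    for _ in range(100):              # bisection on the monotone distribution function
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For 3 classes at the 0.90 level, `chi2_threshold(0.90, 3)` is approximately 4.605. A higher level gives a higher threshold, so more pairs are merged and fewer intervals remain.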

References

Randy Kerber. 1992. ChiMerge: discretization of numeric attributes. In Proceedings of the tenth national conference on Artificial intelligence (AAAI'92). AAAI Press, 123–128.