Class Projects for Intro to Data Mining

Project 1: Data Pre-Processing

 

    In this project, students apply data pre-processing techniques to a gene expression dataset for lung cancer.

  1. Discretize the data using equi-width binning with 5 intervals for every attribute.

  2. Apply the entropy-based method to select the top-k genes, ranked by information gain.

    Students should first do the above for k=30 genes, then for k=100 genes, and finally for the entire set of genes.
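
    As an illustration of step 1, here is a minimal sketch of equi-width binning for a single attribute. It is not the required implementation: the function name equal_width_bins, the handling of constant attributes, and the sample values are assumptions made for this example only.

```python
def equal_width_bins(values, n_bins=5):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is placed entirely in the first bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical expression levels for one gene across eight samples.
gene = [0.0, 1.0, 2.5, 4.9, 5.0, 7.5, 9.9, 10.0]
print(equal_width_bins(gene))  # [0, 0, 1, 2, 2, 3, 4, 4]
```

    With a range of 0 to 10 and 5 intervals, each bin is 2 units wide. The same function would be applied to every attribute (gene) independently, so each gene gets its own interval boundaries.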

Note: The gene expression dataset was obtained from the Kent Ridge Bio-medical Data Set Repository. Its dimensionality was reduced by keeping only the first 500 attributes of the original dataset.
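
For step 2, the following sketch shows one way to rank discretized genes by information gain with respect to the class label and keep the top k. The function names, the gene names g1 to g3, and the toy labels are hypothetical and chosen only for illustration. The sketch assumes each gene has already been discretized into bin indices.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(binned, labels):
    """Entropy of the labels minus the weighted entropy after splitting by bin."""
    n = len(labels)
    groups = {}
    for b, y in zip(binned, labels):
        groups.setdefault(b, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_genes(columns, labels, k):
    """Return the names of the k genes with the highest information gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Toy data: four samples, three already-discretized genes (hypothetical values).
labels = ["cancer", "cancer", "normal", "normal"]
columns = {
    "g1": [0, 0, 4, 4],  # separates the classes perfectly
    "g2": [0, 4, 0, 4],  # carries no information about the class
    "g3": [0, 0, 0, 4],  # partially informative
}
print(top_k_genes(columns, labels, k=2))  # ['g1', 'g3']
```

Calling top_k_genes with k=30, k=100, and k equal to the total number of genes would produce the three gene sets the project asks for.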