Data Exploration

DATA PREPROCESSING

Most analysis methods cannot run on data that contain missing values. If you plan to study your data by clustering, you may also need to put different genes on a common scale of variation. Missing values can prevent proper classification, and a poor substitution scheme can introduce classification errors of its own. For example, if every missing value is replaced by the single most likely value, the substituted values are less likely to help define class (cluster) boundaries. Missing values can also degrade discovery results, and errors or skews in the data can propagate through subsequent runs and accumulate into a larger error.
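The two basic ways of dealing with missing values, removing incomplete genes or substituting an estimate, can be sketched in plain Python. This is a minimal illustration and not CodeLinker's own implementation: the function names, the dictionary layout (gene name mapped to a list of measurements), and the choice of the per-gene mean as the substitute are all assumptions made for the example.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_genes(table):
    """Keep only genes (rows) that have no missing values."""
    return {
        gene: values
        for gene, values in table.items()
        if not any(is_missing(v) for v in values)
    }


def impute_row_mean(table):
    """Replace each missing value with the mean of that gene's observed values.

    Genes with no observed values at all are dropped, because there is
    nothing to estimate a substitute from.
    """
    result = {}
    for gene, values in table.items():
        observed = [v for v in values if not is_missing(v)]
        if not observed:
            continue
        fill = mean(observed)
        result[gene] = [fill if is_missing(v) else v for v in values]
    return result


# Hypothetical expression table: gene name -> measurements across samples.
expression = {
    "geneA": [2.0, 4.0, None, 6.0],
    "geneB": [1.0, 1.5, 2.0, 2.5],
    "geneC": [None, None, None, None],
}

complete_only = drop_incomplete_genes(expression)
imputed = impute_row_mean(expression)
```

Dropping rows loses geneA entirely, while mean substitution keeps it but fills the gap with a value that carries no information about class boundaries, which is the trade-off described above.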

CodeLinker provides tools to remove missing values, filter, and normalize your data. Filtering provides a number of gene prioritization options. These processes generally take a large number of genes and apply selection criteria so that the output contains fewer genes. Some methods remove all genes that do not meet specified criteria, while others let you specify the number of genes that remain after filtering. In CodeLinker, the term normalization describes scaling, translation, or any other numerical transformation of the data besides filtering. Such normalizations include logarithm, standardization, division by maximum, and scaling between 0 and 1.
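The two styles of filtering and the four named normalizations can be sketched as follows. This is an illustration under stated assumptions, not CodeLinker's implementation: the function names are hypothetical, variance is used as the selection criterion only as an example, the logarithm is taken base 2, and standardization uses the population standard deviation.

```python
import math
from statistics import mean, pstdev, pvariance


def filter_by_threshold(table, min_variance):
    """Criterion-based filter: drop every gene whose variance is too low."""
    return {g: v for g, v in table.items() if pvariance(v) >= min_variance}


def filter_top_n(table, n):
    """Count-based filter: keep the n genes with the highest variance."""
    ranked = sorted(table, key=lambda g: pvariance(table[g]), reverse=True)
    return {g: table[g] for g in ranked[:n]}


def log2_transform(values):
    """Logarithm (base 2 assumed here); every value must be positive."""
    return [math.log2(v) for v in values]


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (needs non-constant data)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def divide_by_max(values):
    """Divide by the largest value, so the maximum becomes 1 (maximum must be nonzero)."""
    top = max(values)
    return [v / top for v in values]


def scale_0_1(values):
    """Map the minimum to 0 and the maximum to 1 (needs non-constant data)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical expression table: gene name -> measurements across samples.
expression = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [5.0, 5.0, 5.0, 5.0],
    "geneC": [10.0, 20.0, 30.0, 40.0],
}

variable_genes = filter_by_threshold(expression, min_variance=1.0)
top_gene = filter_top_n(expression, n=1)
logged = log2_transform(expression["geneA"])
z_scores = standardize(expression["geneC"])
by_max = divide_by_max(expression["geneC"])
unit_range = scale_0_1(expression["geneC"])
```

The threshold filter returns however many genes pass the criterion, whereas the top-n filter fixes the output size in advance. The constant geneB is removed by the threshold filter here, which also keeps it away from the normalizations that would divide by zero on constant data.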