Best Split Algorithm—Gini Impurity Measure
When you minimize impurity, you want the observations in each node to have the same value of the target variable. The homogeneity or purity of a partition increases with the proportion of observations that share the same target value. The Best Split algorithm in Xpress Insight uses Gini impurity, which measures the heterogeneity or impurity of a node. When the Gini impurity value is 0.0 (the minimum), the partition is homogeneous or pure. When the Gini impurity value is at its maximum, the node is as heterogeneous or impure as it can be. The maximum Gini impurity value depends on the number of target classes, so it differs for binary and multinomial target variables.
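As an illustration only, the following sketch computes Gini impurity using the standard textbook definition (one minus the sum of squared class proportions). It assumes Xpress Insight follows this definition, which this page does not state explicitly; the function name `gini_impurity` and the sample labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity: 1 - sum of squared class proportions.

    Assumption: this is the textbook formula, not a documented
    Xpress Insight internal.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# A pure node: every observation has the same target value.
pure = gini_impurity(["yes"] * 10)

# A binary target split evenly between its two values.
binary_even = gini_impurity(["yes", "no"] * 5)

# A three-class target split evenly across its values.
three_even = gini_impurity(["a", "b", "c"] * 4)

print(pure, binary_even, round(three_even, 4))
```

Under this definition, an evenly mixed binary node scores 0.5 and an evenly mixed three-class node scores about 0.667, which shows why the maximum depends on the number of target classes.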
Settings for Target-Driven Decision Trees
When you define a decision tree with a target variable, you can specify the decision tree settings when you define the tree, or you can adjust your original settings while you are editing the tree. After the tree is created and you select Best Split from the Tree View, Xpress Insight uses these settings to decide when to stop searching for the optimal split.
The resulting list of predictors, sorted by gain in purity, helps inform your decisions about the predictors you want to insert in your decision tree.
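The idea of ranking predictors by gain in purity can be sketched as follows. This is a minimal illustration, not the Xpress Insight implementation: it assumes gain in purity is the parent node's Gini impurity minus the size-weighted impurity of the child partitions, and it splits on every distinct predictor value rather than grouping values the way Best Split may. All function names, column names, and records are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Textbook Gini impurity of a list of target values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain_in_purity(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted


def rank_predictors(rows, target, predictors):
    """Return (predictor, gain) pairs sorted by gain, highest first."""
    parent = [r[target] for r in rows]
    ranked = []
    for p in predictors:
        # One child partition per distinct value of the predictor.
        groups = {}
        for r in rows:
            groups.setdefault(r[p], []).append(r[target])
        ranked.append((p, gain_in_purity(parent, list(groups.values()))))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical records: "region" separates the target perfectly,
# "channel" does not separate it at all.
rows = [
    {"region": "north", "channel": "web", "churn": "yes"},
    {"region": "north", "channel": "store", "churn": "yes"},
    {"region": "south", "channel": "web", "churn": "no"},
    {"region": "south", "channel": "store", "churn": "no"},
]
ranking = rank_predictors(rows, "churn", ["channel", "region"])
print(ranking)
```

In this toy data, `region` comes first because splitting on it yields two pure partitions, while `channel` leaves both partitions as mixed as the parent.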
Default Settings and Descriptions for Target-Driven Decision Trees
Setting | Default Value | Description |
---|---|---|
Impurity is greater than | 0 | The valid values for this setting range from 0.0 to 1.0. The Best Split algorithm does not attempt to split a node if the impurity value of the node is less than or equal to this value. For example, if the Impurity value is set to 0.5 and there is a node with an impurity value of 0.4, the algorithm stops trying to separate the remaining records so as to reduce overfitting. Tip: An impurity value of 0.0 always allows splitting, while high values, above 0.5, may inhibit splitting completely. |
Gain in purity is greater than | 0.0001 | The valid values for this setting range from 0.0 to 0.5. The Best Split algorithm stops when any further splitting would not improve the purity by more than this value. After the Best Split algorithm runs, the predictors are ranked in order of their gain in purity. Tip: Lower values lead to more splitting, while higher values may inhibit splitting completely. |
Raw counts are greater than | 100 | The valid values for this setting are greater than 0. The Best Split algorithm stops searching for the optimal split when the raw counts are less than the value specified in this setting. |
Splits are less than or equal to | 4 | The valid values for this setting range from 2 through 256. The Best Split algorithm stops searching for the optimal split when the number of splits is greater than the value specified in this setting. |
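
The way the four settings work together as stopping criteria can be sketched as a single check. This is a hedged illustration, not Xpress Insight code: the class `TreeSettings`, its field names, and the function `may_split` are hypothetical, and the sketch assumes that "raw counts" means the number of records in the node being split. Comparisons follow the setting labels (for example, a strict "greater than" for raw counts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSettings:
    # Defaults copied from the settings table; field names are hypothetical.
    min_impurity: float = 0.0    # "Impurity is greater than"
    min_gain: float = 0.0001     # "Gain in purity is greater than"
    min_raw_count: int = 100     # "Raw counts are greater than"
    max_splits: int = 4          # "Splits are less than or equal to"


def may_split(node_impurity, node_count, best_gain, n_splits, settings=None):
    """Return True only if every setting permits splitting the node."""
    s = settings or TreeSettings()
    return (
        node_impurity > s.min_impurity     # node is impure enough to split
        and best_gain > s.min_gain         # split improves purity enough
        and node_count > s.min_raw_count   # assumption: count of node records
        and n_splits <= s.max_splits       # split is not too wide
    )


# With the defaults, an impure, well-populated node may be split.
print(may_split(node_impurity=0.4, node_count=500, best_gain=0.05, n_splits=2))

# The example from the table: threshold 0.5, node impurity 0.4, no split.
strict = TreeSettings(min_impurity=0.5)
print(may_split(0.4, 500, 0.05, 2, settings=strict))
```

Combining the checks with `and` reflects the descriptions in the table, where failing any one setting is enough to stop the search for a split.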