Entropy and Information Gain
Neither 'Entropy' nor 'Information' is a concept with a very intuitive definition. Most people learn about entropy in chemistry class, where it is used to describe how ordered or disordered a system is. But how do you translate 'order' into a mathematical equation? And what about information?
In data science the terms 'Entropy' and 'Information Gain' are usually used in the context of decision trees. Here entropy describes the 'purity' of a set: the lower the entropy, the purer the set, which is analogous to the order of a system in chemistry. Decision trees split a dataset on a single feature such that the split results in the 'purest' branches, meaning the lowest amount of variation in the target variable. The branches are then split again according to the same criterion, until all branches are pure or we decide the model is strong enough. The entropy of a set (you will often also see it called cross-entropy or deviance, D) is defined as follows:
...
where ... is the probability of class k in the set m, that is, the fraction of the data points in m that belong to class k (out of K classes). The lower the entropy, the purer the set. For each node, the split that results in the lowest sum of child entropies, each weighted by the proportion of data points it receives, is chosen. This is combined in the definition of information gain, which describes how much purer the children are than their parent node:
...
This is simply the difference between the entropy of the parent p at split s and the weighted average entropy of its children C.
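For reference, the two definitions are commonly written as below. This is a sketch in assumed textbook notation: the symbols \hat{p}_{mk}, N_c and N_p are illustrative choices, and the base of the logarithm only changes the unit (base 2 gives bits).

```latex
% Entropy (deviance) of a set m with K classes.
% \hat{p}_{mk}: fraction of the points in m that belong to class k
% (assumed notation); terms with \hat{p}_{mk} = 0 are taken to be 0.
D_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

% Information gain of split s at parent p with children C.
% N_c / N_p: fraction of the parent's points that end up in child c.
IG(p, s) = D_p - \sum_{c \in C} \frac{N_c}{N_p} D_c
```

A pure set has a single class with \hat{p}_{mk} = 1, so every term vanishes and D_m = 0.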
Some intuition:
The more homogeneous a set, the lower its entropy: drawing a data point from it tells you little that you did not already know. You want each split to gain as much information as possible, meaning the largest reduction in entropy from parent to children, to get accurate classifications.
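To make this intuition concrete, here is a minimal Python sketch. It assumes the usual Shannon entropy with a base-2 logarithm and children weighted by their share of the parent's data points; the function names `entropy` and `information_gain` and the toy labels are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Sum -p * log2(p) over the classes that actually occur in the set.
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data (hypothetical): 4 'yes' and 4 'no' labels in the parent node.
parent = ["yes"] * 4 + ["no"] * 4
clean_split = [["yes"] * 4, ["no"] * 4]          # both children are pure
poor_split = [["yes", "yes", "no", "no"]] * 2    # children look like the parent

print(entropy(["yes"] * 4))                   # pure set: 0.0
print(entropy(parent))                        # 50/50 mix: 1.0
print(information_gain(parent, clean_split))  # 1.0
print(information_gain(parent, poor_split))   # 0.0
```

The clean split removes all of the parent's entropy, so its information gain equals the parent's entropy. The poor split leaves the children as mixed as the parent, so nothing is gained and a decision tree would prefer the clean split.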