
17/05/2019 Workshop 5

1608275 Tamur Khan


In [1]: import pandas as pd
from math import log2
import math as m

In [2]: df = pd.read_csv("recipe_1608275.csv")

Showcasing whole dataset

In [3]: df

Out[3]:
meat carbs veg fruit spice outcome

0 pork pasta cabbage orange lavender yumyum

1 pork turnip artichoke plum rosemary yukyuk

2 duck pasta avacado fig fenugreek yukyuk

3 duck dumplings avacado pineapple jalapeno yumyum

4 pork pasta artichoke peach capsicum yukyuk

5 pork dumplings artichoke banana fennel yukyuk

6 sausage pasta artichoke cherry rosemary yumyum

7 sausage dumplings avacado apple capsicum yumyum

8 duck turnip artichoke cherry jalapeno yumyum

9 sausage pasta avacado plum dill yumyum

All good outcomes

In [4]: pos = df[df.outcome == "yumyum"]


pos

Out[4]:
meat carbs veg fruit spice outcome

0 pork pasta cabbage orange lavender yumyum

3 duck dumplings avacado pineapple jalapeno yumyum

6 sausage pasta artichoke cherry rosemary yumyum

7 sausage dumplings avacado apple capsicum yumyum

8 duck turnip artichoke cherry jalapeno yumyum

9 sausage pasta avacado plum dill yumyum

All bad outcomes

localhost:8888/notebooks/Desktop/AI %26 Machine Learning/Workshop 5.ipynb#Fraction-of-each-value-in-the-attribute-against-the-total-of-all-val… 1/7



In [5]: neg = df[df.outcome == "yukyuk"]


neg

Out[5]:
meat carbs veg fruit spice outcome

1 pork turnip artichoke plum rosemary yukyuk

2 duck pasta avacado fig fenugreek yukyuk

4 pork pasta artichoke peach capsicum yukyuk

5 pork dumplings artichoke banana fennel yukyuk

In [6]: pos.shape #shape of the positives dataframe

Out[6]: (6, 6)

In [7]: neg.shape #shape of the negatives dataframe

Out[7]: (4, 6)

Whole dataset entropy

In [8]: def entropy(full_size, n_positive):
            if n_positive == 0 or n_positive == full_size:
                return 0
            n_negative = full_size - n_positive
            p = n_positive / full_size
            n = n_negative / full_size
            return -p * m.log2(p) - n * m.log2(n)

        # 6 of the 10 outcomes are yumyum (entropy is symmetric, so 4 gives the same value)
        print('entropy of the whole dataset:', entropy(10, 6))

entropy of the whole dataset: 0.9709505944546686
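The same value can be obtained directly from the label counts rather than from a positive/total pair. A small standalone sketch (this helper is my own generalisation, not code from the workshop), which also extends to more than two outcome labels:

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy of a label distribution, given raw counts per label."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Outcome counts from the dataset above: 6 "yumyum", 4 "yukyuk".
print(entropy_from_counts([6, 4]))  # ~0.9709505944546686
```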

Meat Attribute

In [9]: pork = df[df.meat == "pork"]


duck = df[df.meat == "duck"]
sausage = df[df.meat == "sausage"]
print(pork)
print(duck)
print(sausage)

meat carbs veg fruit spice outcome


0 pork pasta cabbage orange lavender yumyum
1 pork turnip artichoke plum rosemary yukyuk
4 pork pasta artichoke peach capsicum yukyuk
5 pork dumplings artichoke banana fennel yukyuk
meat carbs veg fruit spice outcome
2 duck pasta avacado fig fenugreek yukyuk
3 duck dumplings avacado pineapple jalapeno yumyum
8 duck turnip artichoke cherry jalapeno yumyum
meat carbs veg fruit spice outcome
6 sausage pasta artichoke cherry rosemary yumyum
7 sausage dumplings avacado apple capsicum yumyum
9 sausage pasta avacado plum dill yumyum


In [10]: def entropy(n_p, n_n):
             if n_p == 0 or n_n == 0:
                 return 0
             p_plus_n = n_p + n_n
             p_ratio = n_p / p_plus_n
             n_ratio = n_n / p_plus_n
             return -(p_ratio * log2(p_ratio)) - (n_ratio * log2(n_ratio))

         print('Entropy for pork:', entropy(1, 3))    # pork: 1 yumyum, 3 yukyuk
         print('Entropy for duck:', entropy(2, 1))    # duck: 2 yumyum, 1 yukyuk
         print('Entropy for sausage:', entropy(3, 0)) # sausage: 3 yumyum, 0 yukyuk

Entropy for pork: 0.8112781244591328


Entropy for duck: 0.9182958340544896
Entropy for sausage: 0

Take the fraction of rows holding each value of the attribute, out of all rows in the attribute. Multiplying each fraction by the entropy of that value, and summing over the values, gives the entropy of the whole attribute.

So for the meat attribute the entropy is: pork appears in 4 of the 10 rows, so multiply 4/10 by the entropy of pork. Do the same for the remaining two values and add the three terms:

4/10 × 0.811 + 3/10 × 0.918 + 3/10 × 0
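That weighted sum can be computed generically instead of by hand. A sketch using a pandas groupby (these helpers are my own, not part of the workshop), with the meat and outcome columns rebuilt inline so it runs on its own:

```python
import pandas as pd
from math import log2

def value_entropy(labels):
    """Entropy of the outcome labels within one subset."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in labels.value_counts() if c > 0)

def attribute_entropy(df, attribute, target="outcome"):
    """Weighted average entropy over the values of `attribute`."""
    return sum(len(g) / len(df) * value_entropy(g[target])
               for _, g in df.groupby(attribute))

# Meat and outcome columns copied from the table above.
df = pd.DataFrame({
    "meat": ["pork", "pork", "duck", "duck", "pork",
             "pork", "sausage", "sausage", "duck", "sausage"],
    "outcome": ["yumyum", "yukyuk", "yukyuk", "yumyum", "yukyuk",
                "yukyuk", "yumyum", "yumyum", "yumyum", "yumyum"],
})
print(attribute_entropy(df, "meat"))  # ~0.6, so info gain ~0.371
```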

In [11]: MeatEntropy = 4/10 * 0.811 + 3/10 * 0.918 + 3/10 * 0
         infoGainMeat = 0.970 - 0.599 # entropy of whole dataset - entropy of this attribute
         print('meat entropy:', MeatEntropy)
         print('meat info gain:', infoGainMeat)

meat entropy: 0.5998
meat info gain: 0.371

Carbs Attribute

In [12]: pasta = df[df.carbs == "pasta"]


turnip = df[df.carbs == "turnip"]
dumplings = df[df.carbs == "dumplings"]
print(pasta)
print(turnip)
print(dumplings)

meat carbs veg fruit spice outcome


0 pork pasta cabbage orange lavender yumyum
2 duck pasta avacado fig fenugreek yukyuk
4 pork pasta artichoke peach capsicum yukyuk
6 sausage pasta artichoke cherry rosemary yumyum
9 sausage pasta avacado plum dill yumyum
meat carbs veg fruit spice outcome
1 pork turnip artichoke plum rosemary yukyuk
8 duck turnip artichoke cherry jalapeno yumyum
meat carbs veg fruit spice outcome
3 duck dumplings avacado pineapple jalapeno yumyum
5 pork dumplings artichoke banana fennel yukyuk
7 sausage dumplings avacado apple capsicum yumyum


In [13]: def entropy(n_p, n_n):
             if n_p == 0 or n_n == 0:
                 return 0
             p_plus_n = n_p + n_n
             p_ratio = n_p / p_plus_n
             n_ratio = n_n / p_plus_n
             return -(p_ratio * log2(p_ratio)) - (n_ratio * log2(n_ratio))

         print('Entropy for pasta:', entropy(3, 2))     # pasta
         print('Entropy for turnip:', entropy(1, 1))    # turnip
         print('Entropy for dumplings:', entropy(2, 1)) # dumplings

Entropy for pasta: 0.9709505944546686


Entropy for turnip: 1.0
Entropy for dumplings: 0.9182958340544896

In [14]: CarbsEntropy = 5/10 * 0.97 + 2/10 * 1.0 + 3/10 * 0.918
         infoGainCarbs = 0.970 - 0.960
         print('Carbs entropy:', CarbsEntropy)
         print('Carbs Info Gain ', infoGainCarbs)

Carbs entropy: 0.9604
Carbs Info Gain  0.010000000000000009

Veg Attribute

In [15]: cabbage = df[df.veg == "cabbage"]


artichoke = df[df.veg == "artichoke"]
avacado = df[df.veg == "avacado"]
print(cabbage)
print(artichoke)
print(avacado)

meat carbs veg fruit spice outcome


0 pork pasta cabbage orange lavender yumyum
meat carbs veg fruit spice outcome
1 pork turnip artichoke plum rosemary yukyuk
4 pork pasta artichoke peach capsicum yukyuk
5 pork dumplings artichoke banana fennel yukyuk
6 sausage pasta artichoke cherry rosemary yumyum
8 duck turnip artichoke cherry jalapeno yumyum
meat carbs veg fruit spice outcome
2 duck pasta avacado fig fenugreek yukyuk
3 duck dumplings avacado pineapple jalapeno yumyum
7 sausage dumplings avacado apple capsicum yumyum
9 sausage pasta avacado plum dill yumyum

In [16]: def entropy(n_p, n_n):
             if n_p == 0 or n_n == 0:
                 return 0
             p_plus_n = n_p + n_n
             p_ratio = n_p / p_plus_n
             n_ratio = n_n / p_plus_n
             return -(p_ratio * log2(p_ratio)) - (n_ratio * log2(n_ratio))

         print('Entropy for cabbage:', entropy(1, 0))   # cabbage
         print('Entropy for artichoke:', entropy(2, 3)) # artichoke
         print('Entropy for avacado:', entropy(3, 1))   # avacado

Entropy for cabbage: 0


Entropy for artichoke: 0.9709505944546686
Entropy for avacado: 0.8112781244591328


In [17]: VegEntropy = 1/10 * 0 + 5/10 * 0.970 + 4/10 * 0.811
         infoGainVeg = 0.970 - 0.809
         print('Entropy for veg', VegEntropy)
         print('Info gain for veg', infoGainVeg)

Entropy for veg 0.8094
Info gain for veg 0.16099999999999992

Fruit attribute

In [18]: df[['fruit','outcome']]

Out[18]:
fruit outcome

0 orange yumyum

1 plum yukyuk

2 fig yukyuk

3 pineapple yumyum

4 peach yukyuk

5 banana yukyuk

6 cherry yumyum

7 apple yumyum

8 cherry yumyum

9 plum yumyum

In [19]: def entropy(n_p, n_n):
             if n_p == 0 or n_n == 0:
                 return 0
             p_plus_n = n_p + n_n
             p_ratio = n_p / p_plus_n
             n_ratio = n_n / p_plus_n
             return -(p_ratio * log2(p_ratio)) - (n_ratio * log2(n_ratio))

         print('Entropy for orange:', entropy(1, 0))    # orange
         print('Entropy for plum:', entropy(1, 1))      # plum
         print('Entropy for fig:', entropy(0, 1))       # fig
         print('Entropy for pineapple:', entropy(1, 0)) # pineapple
         print('Entropy for peach:', entropy(0, 1))     # peach
         print('Entropy for banana:', entropy(0, 1))    # banana
         print('Entropy for cherry:', entropy(2, 0))    # cherry
         print('Entropy for apple:', entropy(1, 0))     # apple

Entropy for orange: 0


Entropy for plum: 1.0
Entropy for fig: 0
Entropy for pineapple: 0
Entropy for peach: 0
Entropy for banana: 0
Entropy for cherry: 0
Entropy for apple: 0


In [20]: FruitEntropy = 1/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 1/10 * 0 + 1/10 * 0 + 1/10 * 0 + 2/10 * 0 + 1/10 * 0
         infoGainFruit = 0.970 - 0.2
         print('Entropy for the fruit attribute is', FruitEntropy)
         print('Information gain for the fruit attribute is', infoGainFruit)

Entropy for the fruit attribute is 0.2
Information gain for the fruit attribute is 0.77

Spice Attribute

In [21]: df[['spice','outcome']]

Out[21]:
spice outcome

0 lavender yumyum

1 rosemary yukyuk

2 fenugreek yukyuk

3 jalapeno yumyum

4 capsicum yukyuk

5 fennel yukyuk

6 rosemary yumyum

7 capsicum yumyum

8 jalapeno yumyum

9 dill yumyum

In [22]: def entropy(n_p, n_n):
             if n_p == 0 or n_n == 0:
                 return 0
             p_plus_n = n_p + n_n
             p_ratio = n_p / p_plus_n
             n_ratio = n_n / p_plus_n
             return -(p_ratio * log2(p_ratio)) - (n_ratio * log2(n_ratio))

         print('Entropy for lavender:', entropy(1, 0))  # lavender
         print('Entropy for rosemary:', entropy(1, 1))  # rosemary
         print('Entropy for fenugreek:', entropy(0, 1)) # fenugreek
         print('Entropy for jalapeno:', entropy(2, 0))  # jalapeno
         print('Entropy for capsicum:', entropy(1, 1))  # capsicum
         print('Entropy for fennel:', entropy(0, 1))    # fennel
         print('Entropy for dill:', entropy(1, 0))      # dill

Entropy for lavender: 0

Entropy for rosemary: 1.0
Entropy for fenugreek: 0
Entropy for jalapeno: 0
Entropy for capsicum: 1.0
Entropy for fennel: 0
Entropy for dill: 0

In [23]: SpiceEntropy = 1/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 2/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 1/10 * 0
         infoGainSpice = 0.970 - 0.4
         print(SpiceEntropy, 'is the entropy for the spice attribute')
         print(infoGainSpice, 'is the information gain for the spice attribute')

0.4 is the entropy for the spice attribute
0.57 is the information gain for the spice attribute


In [24]: print('Meat Entropy is:', MeatEntropy)
         print('Carbs Entropy is:', CarbsEntropy)
         print('Veg Entropy is:', VegEntropy)
         print('Fruit Entropy is:', FruitEntropy)
         print('Spice Entropy is:', SpiceEntropy)

Meat Entropy is: 0.5998
Carbs Entropy is: 0.9604
Veg Entropy is: 0.8094
Fruit Entropy is: 0.2
Spice Entropy is: 0.4

In [25]: print('Meat Information Gain =',(infoGainMeat))


print('Carbs Information Gain =',(infoGainCarbs))
print('Veg Information Gain =',(infoGainVeg))
print('Fruit Information Gain =',(infoGainFruit))
print('Spice Information Gain =',(infoGainSpice))

Meat Information Gain = 0.371


Carbs Information Gain = 0.010000000000000009
Veg Information Gain = 0.16099999999999992
Fruit Information Gain = 0.77
Spice Information Gain = 0.57
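The gains above were built from rounded intermediate values. As a cross-check, a standalone sketch of my own that recomputes every information gain exactly from the raw table (plain lists, no pandas):

```python
from math import log2

rows = [  # (meat, carbs, veg, fruit, spice, outcome), copied from the table above
    ("pork", "pasta", "cabbage", "orange", "lavender", "yumyum"),
    ("pork", "turnip", "artichoke", "plum", "rosemary", "yukyuk"),
    ("duck", "pasta", "avacado", "fig", "fenugreek", "yukyuk"),
    ("duck", "dumplings", "avacado", "pineapple", "jalapeno", "yumyum"),
    ("pork", "pasta", "artichoke", "peach", "capsicum", "yukyuk"),
    ("pork", "dumplings", "artichoke", "banana", "fennel", "yukyuk"),
    ("sausage", "pasta", "artichoke", "cherry", "rosemary", "yumyum"),
    ("sausage", "dumplings", "avacado", "apple", "capsicum", "yumyum"),
    ("duck", "turnip", "artichoke", "cherry", "jalapeno", "yumyum"),
    ("sausage", "pasta", "avacado", "plum", "dill", "yumyum"),
]
names = ["meat", "carbs", "veg", "fruit", "spice"]

def H(labels):
    """Entropy of a list of outcome labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

outcomes = [r[-1] for r in rows]
gains = {}
for i, name in enumerate(names):
    attr_H = 0.0
    for v in set(r[i] for r in rows):           # each value of this attribute
        subset = [r[-1] for r in rows if r[i] == v]
        attr_H += len(subset) / len(rows) * H(subset)
    gains[name] = H(outcomes) - attr_H
    print(f"{name}: info gain = {gains[name]:.3f}")
```

The exact figures agree with the rounded ones to three decimal places.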

The attribute with the lowest entropy and highest info gain is Fruit

The best attribute to use as the root node of the decision tree would normally be the one with the lowest entropy, which in turn has the highest information gain. From the results we can see that the fruit attribute has the lowest entropy and therefore the highest info gain. But it would not be suitable as the root node, because the dataset is small and the fruit attribute contains 8 different values across only 10 rows; most values appear just once, so their zero entropies tell us little. Splitting on fruit at the root would require 8 branches, producing a complex decision tree that simply memorises the data. With a larger dataset we could get a reliable entropy estimate for the fruit attribute. Therefore the best attribute to use as the root node is meat. If we split meat into three subsets of the values it contains (pork, sausage and duck), we can look at further attributes to decide whether the outcome is "yumyum" or "yukyuk". For sausage the outcome is always "yumyum" (entropy is 0), so it doesn't matter what the ingredients are in the other attributes.

The next value in the meat attribute with the lowest entropy is pork, but the outcome for pork is not uniform, so another attribute is needed to decide it. For that we look at the attribute with the highest information gain after meat, which is veg; within the pork subset the outcome follows it directly: cabbage gives "yumyum" and artichoke gives "yukyuk". Similarly, for the duck subset the deciding attribute is carbs, which takes three values: dumplings = yumyum, turnip = yumyum, pasta = yukyuk.
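The tree just described can be written down as a small predict function. A sketch of my own (not part of the workshop notebook), using exactly the splits derived above:

```python
def predict(meat, veg=None, carbs=None):
    """Decision tree from the discussion: meat at the root, then
    veg inside the pork branch and carbs inside the duck branch."""
    if meat == "sausage":
        return "yumyum"  # all three sausage rows are yumyum (entropy 0)
    if meat == "pork":
        return "yumyum" if veg == "cabbage" else "yukyuk"
    if meat == "duck":
        return "yukyuk" if carbs == "pasta" else "yumyum"
    return None  # meat value not seen in the training data

print(predict("pork", veg="cabbage"))  # yumyum
print(predict("duck", carbs="pasta"))  # yukyuk
```

This function reproduces the outcome of all 10 training rows, which is expected for so small a dataset and is exactly why the overfitting concern above matters.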
