A Logistic Regression Model To Identify Factors Influencing Cassava Productivity in The Southern Part of Sierra Leone

Journal of Agricultural Economics and Rural Development
Vol. 5(2), pp. 592-604, September, 2019. © www.premierpublishers.org, ISSN: 2167-0477
Research Article
*1Regina Baby Sesay, 2Ahmed Koneh
1,2Department of Mathematics and Statistics, School of Technology, Njala University, Njala, Sierra Leone
The role of cassava as a food and cash crop in Sierra Leone has contributed immensely to the country's economic development, including by providing employment for Sierra Leoneans. Cassava is the second largest food crop grown across the country. Despite its importance and its contributions to the country's economic development, its production faces several constraints. This work therefore used a statistical modeling technique to identify the major factors influencing cassava productivity in the southern part of Sierra Leone, and to measure the effect of each factor on cassava productivity. A multiple binary logistic regression modeling technique was used in the empirical analysis. Two hundred cassava farmers were randomly selected from the communities in the study area. Cassava productivity was measured by the level of cassava yield. Initially, several factors were considered as possible determinants of the level of cassava yield. However, the empirical analysis showed that farm size, educational level, and the interaction of age with farming experience are the main factors influencing cassava productivity in the study area. An increase in farm size can increase cassava yield, while an increase in educational level may decrease cassava productivity. Older people with more farming experience can contribute significantly to cassava productivity.
Key words: Predictors, Sensitivity, Farmers, Yield, Sierra Leone
*Corresponding Author: Regina Baby Sesay, Department of Mathematics and Statistics, School of Technology, Njala University, Njala, Sierra Leone. E-mail: regisesay@yahoo.com; Tel: +23279235912. Co-Author Email: ahmedkonneh@gmail.com, Tel: +23279782822

INTRODUCTION
Cassava, with the botanical name Manihot esculenta Crantz, originated from South America. It is extensively propagated as an annual crop in the tropical and subtropical regions for its edible starchy tuber (FAO, 2003a). Cassava is a perennial shrub grown throughout lowland tropical regions.

The role of cassava as a food and cash crop has contributed immensely to the economic development of Sierra Leone. Cassava is the second largest food crop grown across the country, with an annual yield of 350,000 tons in 2006 (Sanni et al., 2009). The main areas of production are the south-west, the centre and the far north of the country. It is one of the most important food crops in Sierra Leone, as it serves as a major source of carbohydrate (FAO, 2004).

According to FAO estimates, 172 million tons of cassava were produced worldwide in 2000, of which Africa accounted for 54% (FAO, 2003b). In 2002, world production of cassava tuber was estimated at 184 million tons; the majority of production was in Africa, where 99.1 million tons were grown (FAO, 2003b).

In Sierra Leone, the significance of cassava cannot be overemphasized, as it is the main supplement to rice, the well-known staple food of Sierra Leoneans. Nearly 90% of the cassava produced is for human consumption; less than 10% is semi-processed for on-farm animal feed (Sanni et al. 2009). This is clearly seen in the provinces during the rainy season, when the demand for food shifts from the staple food, rice, to cassava because of an increase in the price of rice.

Moreover, annual population growth is about 2.8% in most West African countries, while urban growth is generally significantly higher than rural growth. An annual urban growth rate of 5% for a 10-year period implies a 63%
increase in the urban population and the demand for food (Essers et al. 2005). To feed the urban dwellers, food supply from every farm household has to increase by at least 63% in 10 years (Sanni et al. 2009). This clearly points to the need for increased production of a supplementary food crop like cassava.

Despite its importance and tremendous contributions, cassava production in Sierra Leone faces several constraints. Some of these constraints are: inadequate funding; lack of farming experience; lack of availability of land for farming; and the educational level of cassava farmers.

This study, therefore, aims to identify the major factors influencing cassava productivity in the Moyamba District, Southern Province of Sierra Leone. It used a logistic regression modeling technique to identify the key determinants of cassava productivity and to measure the effect of each determinant on the yield of cassava grown in the study area.

MATERIALS AND METHODS

Theoretical Frameworks

This section reviews the theoretical and conceptual frameworks for using a logistic regression method to analyze a categorical outcome. It also points out the main statistics used in checking the logistic regression model.

Logistic Regression

Regression analysis is a predictive modeling technique. It investigates and estimates the relationship between a variable of interest, called the dependent or target variable, and one or more variables that may influence the dependent variable, called predictor(s). Depending on the type of dependent variable(s), the number of independent variables and the shape of the regression line, different regression techniques exist for investigating relevant relationships and making valuable predictions. Among these numerous regression techniques, this work used a multiple binary logistic regression modeling technique to investigate the factors influencing cassava production in the Moyamba District. Cassava productivity was measured by the level of cassava yield. The regression was ‘multiple’ because there was more than one independent variable, ‘binary’ because the variable of interest, called the dependent variable, was dichotomous (high or low yield), and ‘logistic’ because of the lack of linearity between the dependent variable and the independent variable(s).

In building the logistic regression model to achieve the purpose of this research work, the following concepts and statistics were considered:

The Binomial Distribution

The binomial distribution is appropriate to use as an error distribution in logistic regression because:

1. the outcome of interest is dichotomous (a success or a failure); and
2. a number of independent trials are considered.

Let:

$$y_i = \begin{cases} 1 & \text{if the } i\text{th farm's cassava yield is high} \\ 0 & \text{if the } i\text{th farm's cassava yield is low} \end{cases}$$
Equation (1)

where $y_i$ is the level of the yield for farm $i$. Here, $y_i$ is considered as a realization of a random variable $Y_i$ that can take the values one and zero with probabilities $p_i$ and $1 - p_i$ respectively. The distribution of $Y_i$ is called a Bernoulli distribution with parameter $p_i$ and can be written as

$$\Pr(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$
Equation (2)

for $y_i = 0, 1$. If $y_i = 1$, the expression gives $p_i$; if $y_i = 0$, it gives $1 - p_i$.

Logistic Regression Model

From the above discussion of the binomial distribution, the logistic regression model can be understood as a means of finding the $\beta$ parameters that best fit:

$$y_i = \begin{cases} 1 & \text{if } \beta_0 + \beta_1 x + \varepsilon > 0 \\ 0 & \text{otherwise} \end{cases}$$
Equation (3)

where $\varepsilon$ is an error term.

In short, if $\hat{p}$ is the predicted probability that $Y = 1$, given the values of $x_1, \ldots, x_k$, the model assumes that

$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
Equation (4)

where $Y \sim \text{Binomial}(\hat{p})$.

Parameter Interpretation

In the simple linear model $Y = \beta_0 + \beta_1 x_1$, $Y$ increases by $\beta_1$ when $x$ increases by 1. In a logistic regression model, by contrast, it is $\log\left(\frac{\hat{p}}{1 - \hat{p}}\right)$ that increases by $\beta_1$. To see this, let the predicted probability of the event of interest be $\hat{p}_0$ when $x = 0$ and $\hat{p}_1$ when $x = 1$. Then

$$\log\frac{\hat{p}_0}{1 - \hat{p}_0} = \beta_0$$
$$\log\frac{\hat{p}_1}{1 - \hat{p}_1} = \beta_0 + \beta_1$$

$$\log\frac{\hat{p}_1}{1 - \hat{p}_1} = \log\frac{\hat{p}_0}{1 - \hat{p}_0} + \beta_1$$

Taking the exponent on both sides of this equation gives:

$$e^{\log\frac{\hat{p}_1}{1 - \hat{p}_1}} = e^{\log\frac{\hat{p}_0}{1 - \hat{p}_0} + \beta_1}$$

This gives

$$\frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{\hat{p}_0}{1 - \hat{p}_0} \times e^{\beta_1}$$
Equation (5)

This means that when $x$ increases by 1, the odds of a positive outcome are multiplied by a factor of $e^{\beta_1}$. Therefore, $e^{\beta_1}$ is called the odds ratio for a unit increase in $x$.

To be specific, the odds ratio for a continuous independent variable, $OR_c$, can be defined as:

$$OR_c = \frac{odds(x+1)}{odds(x)} = \frac{F(x+1) / (1 - F(x+1))}{F(x) / (1 - F(x))} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$
Equation (6)

In the case of a binary independent variable, the odds ratio can be defined as $\frac{ad}{bc}$, where a, b, c and d are the cells in a 2×2 contingency table.

Measures of fit for Logistic Regression

As with any classical linear model, a vital part of logistic regression analysis is assessing how well the model fits the data. Before trusting the result of a model to make valid conclusions or predict future outcomes, it is important to check the model thoroughly to make sure that the model assumed is correctly specified and that the data at hand do not conflict with the assumptions made by the model.

The residuals, or differences between observed and fitted values, were the raw materials used in these tests.

Deviance Goodness-of-Fit Test

The deviance goodness-of-fit test assesses the discrepancy between the current model and the full model. The deviance statistic, denoted $D^2$, is:

$$D^2 = 2\left[\log L_s(\hat{\beta}) - \log L_m(\hat{\beta})\right]$$
Equation (7)

where
$\log L_m(\hat{\beta})$ = maximized log-likelihood of the fitted model
$\log L_s(\hat{\beta})$ = maximized log-likelihood of the saturated model

Evidence of model lack-of-fit occurs when the value of $D^2$ is large.

Pearson Goodness-of-Fit Test

The Pearson goodness-of-fit test also assesses the discrepancy between the current model and the full model. The test statistic is:

$$\chi^2 = \sum_{j=1}^{n} \frac{(O_j - E_j)^2}{E_j} = N \sum_{j=1}^{n} \frac{(O_j/N - p_j)^2}{p_j}$$
Equation (8)

where
$\chi^2$ = Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution.
$O_j$ = the number of observations of type j.
$N$ = total number of observations.
$E_j = N p_j$ = the expected (theoretical) frequency of type j, asserted by the null hypothesis that the fraction of type j in the population is $p_j$.
$n$ = the number of cells in the table.

Hosmer Lemeshow

This goodness-of-fit test was used to determine whether the predicted probabilities deviate from the observed probabilities in a way that the binomial distribution does not predict. If the p-value for the goodness-of-fit test is lower than the chosen significance level, the predicted probabilities deviate from the observed probabilities in a way that the binomial distribution cannot predict.

Hosmer and Lemeshow (2000) recommended partitioning the observations into 10 equal-sized groups according to their predicted probabilities, so that

$$G_{HL}^2 = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{E_j (1 - E_j / n_j)} \sim \chi_8^2$$
Equation (9)

where
$n_j$ = number of observations in the jth group
$O_j = \sum_i y_{ij}$ = observed number of cases in the jth group
$E_j$ = expected number of cases in the jth group

Measures of the Predictive Power of the Logistic Regression Model

The $R^2$ statistic for logistic regression was used to measure the predictive power of the model. There are different versions of $R^2$ in the statistics literature, but this work used the Nagelkerke and the Cox and Snell $R^2$ statistics produced by SPSS.

When using $R^2$, adding any variable tends to increase its value, even if that variable is irrelevant. For this reason, the adjusted $R^2$ is preferred for assessing the predictive power of the logistic regression model.

Cox and Snell $R^2$: This is sometimes referred to as a “pseudo” $R^2$.

The Cox and Snell $R^2$ is
$$R_{C\&S}^2 = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}$$
Equation (10)

where n is the sample size, $L_0$ is the likelihood of the intercept-only model and $L_M$ is the likelihood of the fitted model.

Nagelkerke R Square: The Nagelkerke R Square adjusts the Cox and Snell R Square so that the range of possible values extends to 1. This is achieved by dividing the Cox and Snell R-squared by its maximum possible value,

$$1 - L(M_{intercept})^{2/N}$$
Equation (11)

So, if the full model perfectly predicts the outcome and has a likelihood of 1, the Nagelkerke R-squared equals 1. That is, when $L(M_{full}) = 1$, $R^2 = 1$, and when $L(M_{full}) = L(M_{intercept})$, $R^2 = 0$. The Nagelkerke R Square is:

$$R_N^2 = \frac{1 - \left(\frac{L(M_{intercept})}{L(M_{full})}\right)^{2/N}}{1 - L(M_{intercept})^{2/N}}$$
Equation (12)

Methodology

This section introduces the stages involved in the data analysis. It also points out the type of data analysis adopted at each stage, together with the need for each analysis.

Description of study area

This study was carried out in the Moyamba District, in the southern part of Sierra Leone, which had a population of 318,064 in the 2015 population and housing census (Statistics Sierra Leone, 2015). Moyamba District has seasonal variation like other parts of the country. It has a rainy season that starts in May and ends in October, and a dry season that starts in November and ends in April. One of the main occupations of people living in this part of the country is farming.

Sampling Technique and Data Collection

In line with the work of Peduzzi et al. (1996) (for sample size consideration in logistic regression), a random sampling technique was employed to select two hundred (200) cassava farmers from the communities in the study area. Questionnaires containing questions relating to the level of cassava output, together with potential factors that might influence the level of the output, were administered to all the selected cassava farmers. The data obtained provided information on the socioeconomic characteristics of the cassava farmers, the output or yield of cassava, and other factors such as farming experience, farm size, sources of labour, source of farm power, control means of pest and disease, credit facilities, and extension contacts.

Measurement of Variables

The study used data on the technical coefficient (input-output) of cassava production. The input factors include labour, control means of pest and disease, credit facility, extension services, land and socioeconomic factors. The socioeconomic factors/variables were the farmer's gender, age, level of education, marital status, religious background and family size. The land variable was the total land area in acres cultivated by the farmer, which indicates the size of the farm. Labour calculations were based on the total number of people employed to work on a given farm land in a particular crop season. The educational level of the farmers was determined by the number of years spent in school. Family size was determined by the number of people living in the household during the crop year. The output factor was the level of cassava yield, which is the total cassava yield in bags per acre per crop season. For example, if the expected yield per acre is 10 bags of cassava during the crop year, then a yield below 5 bags was considered low, while a yield of 5 bags and above was considered high.

Descriptive and Exploratory Data Analysis

The first stage of the analysis was used to gain an understanding of the distributions of both continuous and categorical variables.

For the continuous variables, a bivariate exploratory data analysis was carried out to find out whether there was a relationship between each continuous independent variable and the categorical outcome variable.

The independent samples t-test was used as an exploratory tool. Like any exploratory analysis, the independent samples t-test helped to determine whether it was worth fitting a logistic regression model for the continuous variables. A significant difference in means was taken as an indication that fitting a logistic regression model would be worthwhile, as the results were likely to be significant.

Variable Selection

The following steps were taken to select variables to enter the logistic regression model.

Step 1. Univariable Analyses

Univariate logistic regression was used to test the association of each explanatory variable (one at a time) with the outcome variable. This step helped to eliminate insignificant variables from the model (i.e. variables that do not show any significant association with the dependent variable by themselves), as such variables are not likely to be associated with the outcome variable even after all the other variables are added to the model.

The results of this univariate analysis include: Wald and likelihood ratio chi-square test statistics and their p-values; parameter estimates and standard errors; and odds ratios and their confidence limits. Each of these results was considered.
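The univariable screening step can be illustrated with a minimal sketch. The study itself used SPSS; the code below is only an illustration that fits a one-predictor logistic regression by Newton-Raphson and reports the Wald chi-square statistic, its p-value on 1 degree of freedom, and the odds ratio. The function name `fit_univariable_logit` and the twelve observations are hypothetical and are not the study's data.

```python
import math

def fit_univariable_logit(x, y, iterations=25):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 is the standard error of b1.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (information matrix)^-1 * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Variance of b1 is the (2,2) element of the inverse information matrix,
    # taken from the last iterate, where the final update is negligible.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

# Hypothetical data: farm size in acres and yield level (1 = high, 0 = low)
farm_size = [1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9]
high_yield = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1]

b0, b1, se_b1 = fit_univariable_logit(farm_size, high_yield)
wald = (b1 / se_b1) ** 2                     # Wald chi-square statistic
p_value = math.erfc(math.sqrt(wald / 2.0))   # upper tail of chi-square, 1 df
odds_ratio = math.exp(b1)                    # odds ratio per extra acre
print(f"b1={b1:.3f}  SE={se_b1:.3f}  Wald={wald:.3f}  p={p_value:.3f}  OR={odds_ratio:.3f}")
```

Running the same function once per candidate predictor gives the kind of per-variable p-values and odds ratios that a screening step compares against the chosen significance threshold.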
Furthermore, since the parameters of the logistic regression are estimated on a log scale, odds ratios were examined. The odds ratios were calculated by exponentiating the parameter estimates. An odds ratio greater than one (>1) indicates a positive association, an odds ratio less than one (<1) indicates a negative association, and an odds ratio equal to one (=1) indicates no association of the tested variable with the outcome.

Step 2. Multivariable Analyses

The next step was to carry out the multiple logistic regression analysis on the selected independent variables. At the end of the multiple logistic regression analysis, those variables found to be insignificant were not included in the final model.

Test for Parameters

After the multiple logistic regression analysis, the importance of each explanatory variable was assessed by carrying out statistical tests of the significance of the coefficients. Parameter estimates and standard errors of the variables in the model were assessed after the addition or deletion of a variable. This was done using the Wald and likelihood ratio test statistics and their associated p-values.

The Wald statistic

The Wald $\chi^2$ statistic was used to test the significance of individual coefficients in the model. The statistic is calculated as follows:

$$\left(\frac{\text{coefficient}}{\text{SE coefficient}}\right)^2$$
Equation (13)

Each Wald statistic was compared to a $\chi^2$ distribution with 1 degree of freedom. Wald statistics are easy to calculate, but their reliability is questionable, particularly for small samples. For data that produce large estimates of the coefficient, the standard error is often inflated, resulting in a lower Wald statistic, and therefore the explanatory variable may be incorrectly assumed to be unimportant in the model. Likelihood ratio tests (see below) are generally considered to be superior.

Likelihood ratio test: The likelihood ratio test for a particular parameter compares the likelihood ($L_0$) of obtaining the data when the parameter is zero with the likelihood ($L_1$) of obtaining the data evaluated at the MLE of the parameter. The test statistic is calculated as follows:

$$-2 \times \ln(\text{likelihood ratio}) = -2 \times \ln(L_0 / L_1) = -2 \times (\ln L_0 - \ln L_1)$$
Equation (14)

Measures of Fit for Logistic Regression Model

As already mentioned under the theoretical framework section, before trusting the result of a model to make valid conclusions or predict future outcomes, the model should be checked thoroughly to make sure that the model assumed is correctly specified and that the data at hand do not conflict with the assumptions made by the model. In this work, the Hosmer and Lemeshow goodness-of-fit test was used to check whether the logistic model assumed was correctly specified.

Model Discrimination

How well the model distinguished between the two groups of the binary outcome was assessed using the area under the receiver operating characteristic (ROC) curve. This curve was obtained by plotting sensitivity against 1 − specificity. The diagonal line represents chance. A curve that is far above the diagonal line shows that an indicator is accurate. This measure varies between 0.5 and 1. An area of 0.5 represents the diagonal, attained when no discrimination exists. An area closer to 1 represents a good indicator, whereas an area of 1 represents a perfect indicator.

Measures of the Predictive Power of the Model

The $R^2$ statistic for logistic regression was used to measure the predictive power of the model.

Test for Model Assumptions

In binary logistic regression, the fact that the probability lies between 0 and 1 imposes a constraint. Therefore, the assumptions of constant variance and normality made in multiple linear regression no longer apply. However, as with every statistical test, certain assumptions need to be met if the result of the multiple binary logistic regression model is to be useful. The model was checked to make sure that the data did not violate those assumptions.

Multicollinearity

Multicollinearity occurs when the model includes multiple independent variables that are correlated with each other. This normally occurs when some independent variables are redundant. It is a type of disturbance that may be present in the data. If this disturbance is not eliminated from the data, any statistical inferences made about the data may not be reliable. There are a number of ways of detecting multicollinearity in a data set. Among these are two collinearity diagnostics: the tolerance and its reciprocal, called the variance inflation factor (VIF). If the value of the tolerance is less than 0.2 or 0.1 and, correspondingly, the value of the VIF is 10 or above, then multicollinearity is problematic.

The tolerance of a variable is $1 - R^2$, where $R^2$ is obtained by regressing that variable on the other independent variables. Generally, a small tolerance value indicates that the variable under consideration is almost a perfect linear combination of the independent variables already in the equation and that it should not be added to the model.
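The tolerance and VIF diagnostics can be sketched for the simplest case of two predictors, where the $R^2$ from regressing one predictor on the other equals their squared Pearson correlation. This is an illustrative sketch only; the function names and the age and experience values are hypothetical, and with more than two predictors the $R^2$ would come from a multiple regression of each predictor on all the others.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def tolerance_and_vif(a, b):
    """Two-predictor case: R^2 of a regressed on b equals r^2."""
    r_squared = pearson_r(a, b) ** 2
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance

# Hypothetical values: farmer age and years of farming experience
age = [25, 30, 35, 40, 45, 50, 55]
experience = [3, 9, 8, 16, 15, 24, 22]

tolerance, vif = tolerance_and_vif(age, experience)
print(f"tolerance={tolerance:.3f}  VIF={vif:.2f}")
```

Because the VIF is the reciprocal of the tolerance, a tolerance of 0.1 and a VIF of 10 describe the same degree of collinearity.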
Also, if the standard errors of the regression coefficients are large, then multicollinearity is an issue.

In addition to the standard errors of the regression coefficients, this work used the tolerance statistics and the variance inflation factor (VIF) to test for multicollinearity.

Interaction

To test for interaction, the logistic regression analysis was carried out with an interaction term, and the p-value in the regression output determined whether or not to include the interaction term in the model. A significant p-value led to the retention of the interaction term in the model.

Influential Observation and Outliers

The final step was to find out whether there were observations that do not fit the model well (outliers), that have strange values for any variable (leverage), or that have undue influence on the model (influence).

This ended the variable selection for the final model.

Final Model

After the variable selection stage, the next step was to fit and assess the final logistic regression model. Most of the diagnostic steps taken during the variable selection stage were again applied to the final model. This was done to ensure the appropriateness, adequacy and usefulness of the final model upon which our conclusion was based.
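One of the diagnostics applied to a fitted model, the area under the ROC curve, can be sketched without plotting. The area equals the proportion of (high-yield, low-yield) pairs in which the high-yield farm receives the larger predicted probability, with ties counted as one half. This is a minimal sketch, not the SPSS procedure used in the study; the function name `roc_auc` and the predicted probabilities are hypothetical.

```python
def roc_auc(scores_high, scores_low):
    """Area under the ROC curve from pairwise comparisons.

    scores_high: predicted probabilities for farms observed with high yield.
    scores_low:  predicted probabilities for farms observed with low yield.
    """
    wins = 0.0
    for s_high in scores_high:
        for s_low in scores_low:
            if s_high > s_low:
                wins += 1.0      # correctly ordered pair
            elif s_high == s_low:
                wins += 0.5      # tie counts as half
    return wins / (len(scores_high) * len(scores_low))

# Hypothetical predicted probabilities of a high yield
predicted_high = [0.9, 0.8, 0.4]
predicted_low = [0.7, 0.3, 0.2]

auc = roc_auc(predicted_high, predicted_low)
print(f"AUC = {auc:.3f}")
```

An area of 0.5 corresponds to the chance diagonal and an area of 1 to perfect separation of the two groups.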
EMPIRICAL ANALYSIS
Descriptive Statistics/Exploratory Data Analysis

Table 1. Descriptive Analysis: Dependent (DV) and Independent (IV) Variables to be Modeled

| Variable Name | IV/DV | Valid Range | Variable Type |
|---|---|---|---|
| Cassava Yield/Outcome | DV | High, Low | Character, Categorical |
| Educational Level | IV | No Formal Education, Primary School, Secondary School, Tech-Voc. | Character, Categorical |
| Gender | IV | Male, Female | Numeric, Categorical |
| Land Owner | IV | Self, Communal, Lease, Rent | Character, Categorical |
| Family Size | IV | 1-17 | Numeric, Categorical |
| Farm Size | IV | 1-10 acres | Numeric, Continuous |
| Age | IV | 17-59 years | Numeric, Continuous |
| Farming Experience | IV | 1-29 years | Numeric, Continuous |
| Source of Labour | IV | Family, hired, communal | Numeric, Categorical |
| Pesticides | IV | Yes, No | Numeric, Categorical |
| Credit Facility | IV | Yes, No | Numeric, Categorical |
| Extension Services | IV | Yes, No | Numeric, Categorical |
Descriptive Statistics for Categorical Variables

Table 2: Descriptive Statistics

| Variable | N | Range | Minimum | Maximum |
|---|---|---|---|---|
| EDUCATIONAL LEVEL | 200 | 3 | 0 | 3 |
| FAMILY SIZE | 200 | 16 | 1 | 17 |
| OWNERSHIP OF THE FARM LAND | 200 | 4 | 0 | 4 |
| SOURCES OF LABOUR | 200 | 2 | 0 | 2 |
| CREDIT FACILITIES | 200 | 1 | 0 | 1 |
| SOURCE OF FARM POWER | 200 | 2 | 0 | 2 |
| PESTS AND DISEASES CONTROL | 200 | 0 | 0 | 0 |
| Valid N (listwise) | 200 | | | |
Descriptive Statistics for Continuous Variables

Table 3: Descriptive Statistics

| Variable | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|
| FARM SIZE OF THE RESPOND | 200 | 0 | 10 | 5.29 | 3.086 |
| AGE OF RESPONDENT | 200 | 17 | 59 | 41.55 | 8.689 |
| FARMING EXPERIENCE OF RESPONDENT | 200 | 1 | 29 | 14.51 | 8.460 |
| Valid N (listwise) | 200 | | | | |
Exploratory Data Analysis

A bivariate exploratory analysis was carried out to find out whether there was a relationship between the continuous independent variables and the categorical outcome variable. The independent samples t-test was used to explore the relationship between each of the continuous independent variables and the outcome variable, cassava yield.

As with any statistical test, the common assumptions of the independent samples t-test were considered before it was used. The assumptions of the t-test for independent means concern sampling, research design, measurement, population distributions and population variance. The t-test for independent means is typically considered robust to violations of the normal distribution assumption (with a larger sample size). This work used the Q-Q plot to see whether the assumption of normality was satisfied before using the t-test for independent means.

Quantile-Quantile (Q-Q) plot for continuous independent variables

The Q-Q plots for the continuous variables are presented in Figure 1. The Q-Q plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. A concave departure from the straight line in the Q-Q plot is an indication of a heavy-tailed distribution, whereas a convex departure is an indication of a thin tail.

From the Q-Q plots in Figure 1, it is evident that the continuous independent variables are not perfectly normally distributed. However, because the sample size is greater than 30, so that the Central Limit Theorem applies, and because the data were obtained randomly, the t-test was carried out.
Figure 1: QQ-plot of continuous variables
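The points of a normal Q-Q plot can be computed with a short sketch: the ordered sample values are paired with the quantiles of a normal distribution that has the sample's mean and standard deviation. The function name `normal_qq_points`, the plotting positions (i − 0.5)/n, and the age values are assumptions made for illustration; they are not the study's data or the exact convention used by SPSS.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical quantile, ordered observation) pairs."""
    n = len(sample)
    ordered = sorted(sample)
    reference = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n, one common convention
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical ages of ten respondents
ages = [17, 28, 33, 38, 41, 43, 45, 48, 52, 59]

qq_points = normal_qq_points(ages)
for theoretical_q, observed in qq_points:
    print(f"{theoretical_q:6.2f}  {observed:3d}")
```

If the sample were exactly normal, the printed pairs would fall on the line where the two columns are equal; systematic curvature away from that line signals a departure from normality.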

Independent Samples Test

The independent samples t-test was carried out for each of the continuous independent variables, to determine whether:
(1) there is a statistically significant difference in the mean experience gained by cassava farmers with high cassava yield and those with low cassava yield.
(2) there is a statistically significant difference in the mean farm size used by cassava farmers with high cassava yield and those with low cassava yield.
(3) there is a statistically significant difference in the mean age of cassava farmers with high cassava yield and those with low cassava yield.

The independent samples t-test acted as an exploratory tool. Like all exploratory analyses, the independent samples t-tests helped to determine whether it was worth fitting a logistic regression model for these variables. A significant difference in means suggests that fitting a logistic regression would be worthwhile, as the results are likely to be significant. Below are the outputs of the independent samples t-tests for the continuous variables used in the final model.

The significance level in the independent samples t-test (in Table 4) for farm size in relation to cassava yield is far below the threshold significance level of 0.05. This means that the difference in mean farm size between cassava farmers with high cassava yield and those with low cassava yield is statistically significant. This further implies that there is a relationship between farm size and cassava yield. The logistic regression model was used to further explore this relationship.

Similarly, the significance level in the independent samples t-test (presented in Table 6) for farming experience is far below the threshold significance level of 0.05. This means that the difference in mean farming experience between cassava farmers with high yield and those with low yield is statistically significant. This further implies that there is a relationship between farming experience and cassava yield. The logistic regression model was used to further explore this relationship.

However, the significance level in the independent samples t-test for the continuous variable age is above the threshold significance level of 0.05. This means that the difference in the mean age of cassava farmers with high yield and those with low yield is not statistically significant.
Table 4: Independent Samples Test

| Variable | Variances | Levene's Test F | Levene's Test Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI of the Difference, Lower | 95% CI of the Difference, Upper |
|---|---|---|---|---|---|---|---|---|---|---|
| FARM SIZE OF THE RESPOND | Equal variances assumed | 2.130 | .146 | 4.738 | 198 | .000 | 1.979 | .418 | 1.155 | 2.803 |
| FARM SIZE OF THE RESPOND | Equal variances not assumed | | | 4.745 | 188.042 | .000 | 1.979 | .417 | 1.156 | 2.802 |
| AGE OF RESPONDENT | Equal variances assumed | .070 | .792 | 1.914 | 198 | .057 | 2.353 | 1.230 | -.072 | 4.778 |
| AGE OF RESPONDENT | Equal variances not assumed | | | 1.917 | 188.006 | .057 | 2.353 | 1.228 | -.069 | 4.775 |
| FARMING EXPERIENCE OF RESPONDENT | Equal variances assumed | .456 | .500 | 7.544 | 198 | .000 | 8.033 | 1.065 | 5.933 | 10.133 |
| FARMING EXPERIENCE OF RESPONDENT | Equal variances not assumed | | | 7.612 | 192.494 | .000 | 8.033 | 1.055 | 5.952 | 10.115 |
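The relationship between the columns of the independent samples output can be checked with a short sketch: the t statistic is the mean difference divided by its standard error, and the 95% confidence interval is the mean difference plus or minus the critical t value times the standard error. The mean differences and standard errors below are the rounded "equal variances assumed" values reported above; the critical value 1.972 for 198 degrees of freedom is an assumed approximation, and small discrepancies from the reported figures reflect rounding of the inputs.

```python
# Assumed approximate two-sided 5% critical value of Student's t with 198 df
T_CRIT_198 = 1.972

# (mean difference, standard error of the difference), equal variances assumed
reported = {
    "farm size": (1.979, 0.418),
    "age": (2.353, 1.230),
    "farming experience": (8.033, 1.065),
}

def t_and_ci(mean_diff, se_diff, t_crit=T_CRIT_198):
    """Return (t statistic, CI lower bound, CI upper bound)."""
    t_stat = mean_diff / se_diff
    half_width = t_crit * se_diff
    return t_stat, mean_diff - half_width, mean_diff + half_width

results = {name: t_and_ci(*values) for name, values in reported.items()}
for name, (t_stat, lower, upper) in results.items():
    print(f"{name:20s} t={t_stat:6.3f}  95% CI=({lower:6.3f}, {upper:6.3f})")
```

The interval for age spans zero, which matches its two-tailed significance of .057 being above the 0.05 threshold, while the intervals for farm size and farming experience lie entirely above zero.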
Variable Selection

This involves two stages of analysis, the univariate stage and the multivariable stage.

Univariate Analysis

This is the first stage of the variable selection procedure. Each of the variables was investigated separately using univariate logistic regression. Table 7 gives a combined summary of all the univariate outputs.

Table 7: P-Values and Odds Ratios of Independent Variables from Univariate Analysis

| Factor | P-values (Wald test) | P-values (LR test) | Odds Ratio (OR) |
|---|---|---|---|
| Age | 0.010 | 0.009 | 1.673 |
| Educational Level | 0.00 | 0.00 | 2.256 |
| Family Size | 0.331 | 0.307 | 0.410 |
| Farm Experience | 0.001 | 0.000 | 0.031 |
| Land Owner | 0.474 | 0.474 | 0.923 |
| Farm Size | 0.00 | 0.000 | 0.174 |
| Source of Labour | 0.00 | 0.00 | 0.137 |
| Pesticides | 0.00 | 0.00 | 0.171 |
| Credit Facility | 0.001 | 0.00 | 0.329 |
| Extension Services | 0.193 | 0.191 | 0.680 |
| Gender | 0.078 | 0.078 | 1.680 |

From Table 7, all the independent variables with p-values less than the threshold value of 0.05 were found to be significant and hence associated with the dependent variable. At the second stage of the variable selection procedure, all the significant independent variables were investigated simultaneously using multivariable logistic regression.

Multivariate Analysis

The multivariate output and the goodness-of-fit test result for the multivariate analysis are presented in Tables 8 and 9 respectively. From Table 8, the Wald statistic p-values for some of the independent variables are greater than the chosen significance threshold of 0.05. The statistically significant independent variables based on the p-values are: farming experience, educational level, credit facility, source of labour and control means of pest and disease. This implies that some of the variables that entered the model during the multivariable analysis stage were found to be insignificant. The Hosmer-Lemeshow goodness-of-fit test (in Table 9) shows that, at this multivariate analysis stage, the model is not a good fit to the data, as p = 0.004 < 0.05.

In addition, following further statistical investigation of each of the statistically significant independent variables listed in Table 8, some of them did not enter the final model. The reason is that further statistical tests on these variables showed that some of them influenced the outcome variable in such a way that their inclusion in the model violated the assumption of ‘no outlier’. For example, when the variable credit facility entered the model as an independent variable with an extremely high level of significance, the maximum Cook's distance exceeded one (1). It even attained the value of two (2), which is a clear violation of the assumption of ‘no outlier’ or influential observation required for the validity of the result of the logistic regression model.

Some of the findings of the statistical investigations on the independent variables are in line with reality. For example, very few farmers have access to credit facilities. The few that have access may tend to have big farm lands, more laborers, and improved planting materials, leading to very high cassava yield/output. On the other hand, some unfaithful cassava farmers may use the money received from the credit for something other than the cassava production for which it was obtained (the odds ratio for credit facility is less than 1, meaning that a higher credit grant decreases the odds of a high cassava yield). So it was not surprising to see that, when credit facility entered the equation, the incidence of influential/outlier observations was alarming. Nevertheless, we still acknowledge that credit facility is a very strong determinant of a high or low level of cassava yield (the outcome variable) in the study area.
Table 8: Variables in the Equation

                                  B        S.E.    Wald     df   Sig.   Exp(B)
Step 1a  FARMING_EXPERIENCE       .104     .032    10.690   1    .001   1.109
         AGE                      -.009    .026    .120     1    .729   .991
         EDUCATIONAL                               14.967   3    .002
         EDUCATIONAL(1)           -1.874   1.189   2.486    1    .115   .153
         EDUCATIONAL(2)           .304     1.329   .052     1    .819   1.355
         EDUCATIONAL(3)           -.181    1.250   .021     1    .885   .834
         FARM_SIZE                .127     .069    3.371    1    .066   1.136
         SOURCES_OF_LABOUR                         6.532    2    .038
         SOURCES_OF_LABOUR(1)     -1.370   .538    6.474    1    .011   .254
         SOURCES_OF_LABOUR(2)     -.490    .604    .658     1    .417   .612
         CREDIT(1)                -3.085   .661    21.803   1    .000   .046
         CONTROL_MEAN(1)          -1.816   .572    10.083   1    .001   .163
         Constant                 4.142    1.911   4.697    1    .030   62.952
a. Variable(s) entered on step 1: FARMING_EXPERIENCE, AGE, EDUCATIONAL, FARM_SIZE, SOURCES_OF_LABOUR, CREDIT, CONTROL_MEAN.
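The Exp(B) column of Table 8 is the exponential of the B column, so each odds ratio can be recomputed from its coefficient. The short Python sketch below checks this for a few rows copied from Table 8; the function and variable names are illustrative and not part of the original analysis, and small gaps are expected because the table reports B to three decimals only.

```python
import math

# (B, reported Exp(B)) pairs copied from Table 8.
TABLE_8 = {
    "FARMING_EXPERIENCE": (0.104, 1.109),
    "AGE": (-0.009, 0.991),
    "FARM_SIZE": (0.127, 1.136),
    "SOURCES_OF_LABOUR(1)": (-1.370, 0.254),
    "CREDIT(1)": (-3.085, 0.046),
    "CONTROL_MEAN(1)": (-1.816, 0.163),
}


def odds_ratio(b):
    """Odds ratio implied by a logistic regression coefficient B."""
    return math.exp(b)


def max_abs_gap(table):
    """Largest absolute gap between exp(B) and the reported Exp(B)."""
    return max(abs(odds_ratio(b) - reported) for b, reported in table.values())


if __name__ == "__main__":
    for name, (b, reported) in TABLE_8.items():
        print(f"{name:22s} exp(B) = {odds_ratio(b):.3f}  reported = {reported:.3f}")
```

An odds ratio below one (for example CREDIT(1)) corresponds to a negative B, and one above one (for example FARMING_EXPERIENCE) to a positive B.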
Table 9: Hosmer and Lemeshow Test

Step Chi-square df Sig.
1 22.670 8 .004
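The Sig. value in Table 9 is the upper-tail probability of a chi-square distribution with 8 degrees of freedom at the reported statistic. As a minimal sketch, assuming only the standard closed form for even degrees of freedom (the function name is illustrative), it can be recomputed with the Python standard library:

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if x < 0 or df <= 0 or df % 2 != 0:
        raise ValueError("x must be non-negative and df a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # Hosmer-Lemeshow statistic from Table 9 (multivariate stage).
    print(round(chi2_sf_even_df(22.670, 8), 3))
```

Because the p-value falls below 0.05, the null hypothesis of adequate fit is rejected at this stage. The same function applied to the final-model statistic in Table 14 (5.309 with 8 degrees of freedom) gives a p-value of about 0.724.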
Now that the significant independent variables in relation to the output variable have been selected, the next step was to fit the final model for the logistic regression analysis.

Final Model
This is the last stage of the analysis. After the variables had been selected in the first two stages of the logistic regression modeling, the following analytical procedures were taken to build and confirm the final model, so as to achieve our objective of identifying the main factors that influence cassava productivity and determining the effect of each factor on cassava produced in the study area.

The categorical variable coding presented in Table 10 shows that the majority of cassava farmers were illiterate, with no formal education.
Table 10: Categorical Variables Codings

EDUCATIONAL LEVEL OF RESPONDENT
                                    Parameter coding
                        Frequency   (1)     (2)     (3)
NO FORMAL EDUCATION     119         1.000   .000    .000
PRIMARY SCHOOL          16          .000    1.000   .000
SECONDARY SCHOOL        55          .000    .000    1.000
TECH - VOC.             10          .000    .000    .000
The model coefficients are contained in the column headed B in Table 11. A negative coefficient means that the odds of an increase in cassava yield decrease.

The output in Table 11 helped to identify the key determinants of an increase or decrease in cassava productivity, that is, those independent variables that contributed significantly to the level of cassava yield. It also helped to determine how each determinant influenced cassava yield.

From Table 11, it is clear that, among the independent variables that entered the final model, farm size, with a significance level (for Wald) far below the threshold significance level of 0.05, is the main factor that influenced the level of cassava yield. The odds ratio (Exp(B)) associated with farm size is 1.188, which is greater than one (>1), meaning that an increase in farm size will increase the probability of an increase in cassava yield. In other words, the probability of a high cassava yield occurring with a unit (acre) increase in farm size is higher than at the original farm size.

Also, from Table 11, educational level is seen as a significant factor in determining the level of cassava yield. Its odds ratio (Exp(B)) is less than one (<1) for all three coded levels (no formal education, primary school and secondary school); under the coding in Table 10, tech-voc is the reference level. This means that the probability of a high cassava yield with a unit increase in educational level is lower than at the original level (or no increase). In other words, the odds of an increase in cassava yield are lower for farmers with a high educational level than for those with no or a low educational level.

Lastly, from Table 11, the interaction term, farming experience by age, is a highly significant factor (Sig. = .000) in determining the level of cassava yield. Its odds ratio is greater than one. This implies that the odds of an increase in cassava yield are higher for older people with more farming experience than for younger people with less farming experience. That is, the probability of an increase in cassava yield is higher with a unit (year) increase in age by farming experience than at the original level.
Table 11: Variables in the Equation

                                                                          95% C.I. for EXP(B)
                                    B        S.E.    Wald     df   Sig.   Exp(B)   Lower   Upper
Step 1a  EDUCATIONAL                                 21.921   3    .000
         EDUCATIONAL(1)             -2.099   1.217   2.975    1    .085   .123     .011    1.331
         EDUCATIONAL(2)             -.971    1.322   .539     1    .463   .379     .028    5.051
         EDUCATIONAL(3)             -.085    1.263   .005     1    .946   .918     .077    10.908
         FARM_SIZE                  .172     .060    8.357    1    .004   1.188    1.057   1.335
         AGE by FARMING_EXPERIENCE  .003     .001    26.598   1    .000   1.003    1.002   1.004
         Constant                   -.809    1.275   .402     1    .526   .445
a. Variable(s) entered on step 1: EDUCATIONAL, FARM_SIZE, AGE * FARMING_EXPERIENCE.
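To make the coefficients in Table 11 concrete, the sketch below plugs them into the logistic function to obtain a predicted probability of high yield, and recomputes the 95% Wald interval for the farm-size odds ratio as exp(B ± 1.96 × S.E.). It is an illustration only: the function names and the example farmer are hypothetical, the education categories follow the dummy coding in Table 10 (tech-voc coded all zeros), farm size is taken in acres, and age and farming experience in years.

```python
import math

# Coefficients (column B) copied from Table 11.
CONSTANT = -0.809
EDUCATION = {              # dummy coding from Table 10
    "none": -2.099,        # EDUCATIONAL(1)
    "primary": -0.971,     # EDUCATIONAL(2)
    "secondary": -0.085,   # EDUCATIONAL(3)
    "tech_voc": 0.0,       # reference level (all dummies zero)
}
B_FARM_SIZE = 0.172
B_AGE_BY_EXPERIENCE = 0.003


def predicted_probability(education, farm_size, age, experience):
    """Predicted probability of high cassava yield for one farmer."""
    logit = (CONSTANT
             + EDUCATION[education]
             + B_FARM_SIZE * farm_size
             + B_AGE_BY_EXPERIENCE * age * experience)
    return 1.0 / (1.0 + math.exp(-logit))


def wald_ci(b, se, z=1.96):
    """Wald confidence interval for the odds ratio exp(B)."""
    return math.exp(b - z * se), math.exp(b + z * se)


if __name__ == "__main__":
    # Hypothetical farmer: secondary education, 4 acres, aged 45, 20 years farming.
    print(round(predicted_probability("secondary", 4, 45, 20), 3))
    lower, upper = wald_ci(B_FARM_SIZE, 0.060)
    print(round(lower, 3), round(upper, 3))
```

The recomputed interval differs from the reported 1.057 to 1.335 only in the third decimal, which is consistent with B and S.E. being rounded in the table.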
Model Checking

Chi-square goodness of fit test for model coefficients

The test in Table 12 was used to check whether the present (new) model, with explanatory variables included, is an improvement over the baseline model. This test uses the chi-square test to see whether there is a significant difference between the -2 log-likelihoods (-2LLs) of the baseline model and the present model. A significantly reduced -2LL suggests that the new model explains more of the variation in the outcome variable than the baseline model, that is, that the new model is an improvement over the baseline model. From Table 12, the chi-square statistic is highly significant (chi-square = 87.395, df = 5, p < .001). This shows that the present (new) model is significantly better than the baseline model.
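The omnibus p-value reported as .000 can be made explicit. The sketch below evaluates the chi-square upper tail for an odd number of degrees of freedom using the standard closed form built from the complementary error function; the function name is illustrative, and only the statistic (87.395) and its degrees of freedom (5) are taken from the paper.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with odd df.

    For df = 2k + 1 and y = x / 2 the tail is
    erfc(sqrt(y)) + exp(-y) * sum_{i=1}^{k} y^(i - 1/2) / Gamma(i + 1/2).
    """
    if x < 0 or df < 1 or df % 2 == 0:
        raise ValueError("x must be non-negative and df a positive odd integer")
    y = x / 2.0
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


if __name__ == "__main__":
    # Omnibus chi-square statistic and degrees of freedom from the text above.
    p_value = chi2_sf_odd_df(87.395, 5)
    print(f"{p_value:.3e}")
```

The resulting probability is many orders of magnitude below 0.001, which is why statistical software prints it as .000.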
Table 12: Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1   Step    87.395       5    .000
         Block   87.395       5    .000
         Model   87.395       5    .000

Outcome Classification

From the classification table presented in Table 13, the present logistic regression model correctly classified the outcome for 77% of the cases.

Table 13: Classification Table (a)

                                      Predicted
                                      OUTPUT OR YIELD    Percentage
Observed                              LOW      HIGH      Correct
Step 1   OUTPUT OR YIELD   LOW        66       22        75.0
                           HIGH       24       88        78.6
         Overall Percentage                              77.0
a. The cut value is .500

Model chi-square goodness of fit test

The hypotheses tested for the model goodness of fit were stated as:

H0: The model is a good fitting model.
Ha: The model is not a good fitting model.

From Table 14, the test of goodness of fit shows that the model is a good fit to the data, as p = 0.724 > .05.

Table 14: Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      5.309        8    .724

Measures of the Predictive Power of the Model

From the model summary result presented in Table 15, it is clear that between 35% and 47% of the variation in cassava yield was explained by the logistic regression model.

Table 15: Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      186.977a            .354                   .474
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Influential Observation and Outliers

Again, it is good to find out whether there are observations that do not fit the model well (outliers), have unusual predictor values (leverage) or have undue influence on the model (influential). In this work, Cook's distance, denoted Di, was used to find influential observations for the set of predictor variables used in the analysis. In other words, it was used to identify points that negatively affect the logistic regression model. The measurement is a combination of each observation's leverage and residual values; the higher the leverage and residuals, the higher the Cook's distance. A Di value of more than 1 indicates that an influential observation is present.

The maximum and minimum values of the Cook's distance for our analysis are presented in the summary table (Table 16) below. From Table 16, the maximum value of Di is 0.40012, which is less than one (<1). Therefore, the issue of influential observations or outliers is not alarming.

Table 16: Analog of Cook's influence statistics

N   Valid     200
    Missing   0
Mode          .00028
Range         .40010
Minimum       .00003
Maximum       .40012

MODEL DISCRIMINATION

How well the model distinguishes between the two groups of the binary outcome was assessed using the area under the receiver operating characteristic (ROC) curve.

The two basic measures of diagnostic accuracy are the sensitivity and specificity (Zhou et al 2002). When sensitivity is plotted against 1-specificity, we obtain the receiver operating characteristic (ROC) curve. The diagonal line in the plot represents chance. The curve in Figure 2 is well above the diagonal line. In addition, from Table 16 (Area Under the Curve), the area under the curve (AUC) is 0.814. This represents a high predictive accuracy of the chosen model. In other words, an AUC value of 0.814 (which is close to 1) indicates that the model reliably distinguished between cassava farmers with high and low cassava yields.

Figure 2: Receiver Operating Characteristic (ROC) curve
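Several of the model-checking figures reported in this section can be recomputed from the published tables alone. The sketch below derives the percentages in Table 13 from its four cell counts, and the Cox & Snell and Nagelkerke R-square values in Table 15 from the model chi-square (87.395), the model -2 log-likelihood (186.977) and the sample size (200). The function names are illustrative, treating HIGH yield as the positive class is an assumption made here, and the baseline -2 log-likelihood is not reported in the paper, so it is derived as the model value plus the model chi-square.

```python
import math


def classification_summary(low_low, low_high, high_low, high_high):
    """Accuracy, sensitivity and specificity from observed-by-predicted counts.

    HIGH yield is treated as the positive class (an assumption).
    """
    total = low_low + low_high + high_low + high_high
    return {
        "accuracy": (low_low + high_high) / total,
        "sensitivity": high_high / (high_low + high_high),
        "specificity": low_low / (low_low + low_high),
    }


def pseudo_r2(model_chi2, neg2ll_model, n):
    """Cox & Snell and Nagelkerke R-square from likelihood-ratio quantities."""
    neg2ll_baseline = neg2ll_model + model_chi2        # derived, not reported
    cox_snell = 1.0 - math.exp(-model_chi2 / n)
    upper_bound = 1.0 - math.exp(-neg2ll_baseline / n)  # maximum of Cox & Snell
    return cox_snell, cox_snell / upper_bound


if __name__ == "__main__":
    summary = classification_summary(66, 22, 24, 88)   # counts from Table 13
    print({name: round(100 * value, 1) for name, value in summary.items()})
    cox_snell, nagelkerke = pseudo_r2(87.395, 186.977, 200)
    print(round(cox_snell, 3), round(nagelkerke, 3))
```

Under these assumptions the specificity corresponds to the 75.0% correct for LOW, the sensitivity to the 78.6% correct for HIGH, and the two R-square values agree with Table 15 to three decimals.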
Table 16: Area Under the Curve

Test Result Variable(s):
Area
.814

Multicollinearity
As already mentioned under the methodology section, among the number of ways of detecting multicollinearity in a data set, this work used the value of the tolerance and its reciprocal, called the variance inflation factor (VIF), to detect multicollinearity in the data. The variable's tolerance is 1 − R². If the value of tolerance is less than 0.2 or 0.1 and, simultaneously, the value of VIF is 10 or above, then multicollinearity is problematic.

From our analysis, the highest value of R², which is the Nagelkerke R², is equal to 0.474. Hence the tolerance is calculated as 1 − R² = 1 − 0.474 = 0.526, and its reciprocal, the VIF, is 1/0.526 ≈ 1.901. The tolerance is far above 0.1 and the value of VIF is far below 10. It is therefore concluded that multicollinearity is not problematic. In addition, the standard errors of the coefficients are not unduly large. This further suggests that multicollinearity is not an issue here.

RESULTS AND DISCUSSION

A logistic regression analysis was carried out to find the main factors influencing cassava productivity in the Moyamba District, southern province of Sierra Leone. The level of cassava productivity was measured by the level (high or low) of cassava yield. At the initial stage of the analysis, many factors were considered as potential determinants of cassava productivity in the study area. However, further statistical investigation showed that some of those factors were not significant determinants of a high or low yield of cassava. Insignificant factors were dropped from the analysis. The variables (factors) that entered the final model are: farm size, educational level and the interaction term, age by farming experience.

At the final stage of the analysis, the logistic regression model was significant, as the test of the full model against a model with only the constant was significant. This shows that the predictors as a set reliably distinguished between a high and low yield of cassava (chi-square = 87.395, p < .05 with df = 5). The model explained between 35% and 47% (Cox and Snell R² and Nagelkerke R² respectively) of the variation in the cassava yield.

The Wald criteria showed that, among the independent variables that entered the final model, farm size, with a significance level (for Wald) far below the threshold significance level of 0.05, was the main factor that influenced the level of cassava yield. The odds ratio (Exp(B)) associated with farm size is 1.188, which is greater than one (>1), meaning that an increase in farm size will increase the probability of a high cassava yield. In other words, the probability of a high cassava yield occurring with a unit (acre) increase in farm size will be higher than at the original farm size. This result is in conformity with the research result documented by Ren et al (2019), that farm size plays a critical role in agricultural sustainability.

Educational level was also shown to be a significant factor in determining the level (high or low) of cassava yield. However, in line with the view of the majority of Sierra Leoneans that subsistence farming is an option for those who failed to go to school or dropped out of school, its odds ratio (Exp(B)) is 0.123 (the EDUCATIONAL(1) row of Table 11), which is less than one (<1). This means that the probability of a high cassava yield with a unit increase in educational level is lower than at the original level (no increase). In other words, the odds of an increase in cassava yield are lower for a higher educational level. This result is similar to that obtained by Malte Reimers and Stephan Klasen (2013), who detected insignificant or even surprisingly negative effects of schooling on agricultural productivity.

Finally, the interaction term, farming experience by age, is a highly significant factor (Sig. = .000) in determining the level of cassava yield. Its odds ratio is greater than one. This implies that the odds of an increase in cassava yield are higher for older people with more farming experience than for younger people with less farming experience. In other words, the probability of an increase in cassava yield is higher with a unit (year) increase in age by experience. This is not surprising, as extension services for disseminating information on farm technologies are not common in the rural areas. Farmers only gain experience after long years of farming. A study conducted by Gideon Danso-Abbeam et al (2018) reaffirmed the critical role of extension programmes in enhancing farm productivity and household income.

Credit facility, though it did not enter the final model (as it exhibited extreme behavior), was still recognized as a significant determinant of a high level of cassava yield. This is because the Wald p-value associated with credit facility was significant at both the univariate and multivariable stages of the variable selection in the logistic regression modeling. This result is supported by Ekwere et al (2014), in their article titled "Effects of agricultural credit facility on the agricultural production and rural development". In that article, they documented that the independent variables loan size, farm size, and inputs explained the variation in the total value of farmers' output.

CONCLUSION

The purpose of this work was to identify the main factors influencing cassava productivity and to determine the effect of each factor on cassava yield/output. The empirical evidence showed that farm size, educational level, and age by farming experience are the main factors influencing cassava productivity in the study area. An
increase in farm size can increase cassava yield, while an increase in educational level can decrease cassava yield. In fact, most of the cassava farmers were illiterate, with no formal education. Older people with more farming experience contributed significantly to cassava production in the study area.

REFERENCES

Ekwere et al (2014). Effects of agricultural credit facility on the agricultural production and rural development. International Journal of the Environment, Vol. 3(2), 192-204.
Essers MA, de Vries-Smits LM, Barker N et al. (2005). Functional interaction between beta-catenin and FOXO in oxidative stress signalling. Science 308: 1181-1184.
FAO (2003a). The state of food insecurity in the World: monitoring progress towards the food summit and millennium development goals. Rome, Italy: pp. 24-26.
FAO (2003b). Cassava production data (2002). (http://www.fao.org).
FAO (2004). Proposals for a definition and methods of analysis for dietary fibre content. CX/NFSDU 04/3 Add 1. Codex Committee on Nutrition and Foods for Special Dietary Uses. Codex Alimentarius Commission.
Gideon Danso-Abbeam, Dennis Sedem Ehiakpor and Robert Aidoo (2018). Agricultural extension and its effects on farm productivity and income: insight from Northern Ghana. Agriculture & Food Security. https://doi.org/10.1186/s40066-018-0225-
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. New York: Wiley.
Malte Reimers and Stephan Klasen (2013). Revisiting the Role of Education for Agricultural Productivity. American Journal of Agricultural Economics, vol. 95, issue 1, 131-152.
Peduzzi P., Concato J., Kemper E., Holford T. R., Feinstein A. R. (1996). Simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology.
Ren, Chenchen, Liu, Shen et al (2019). The impact of farm size on agricultural sustainability. Journal of Cleaner Production, vol. 22
Sanni, L.O., Onadipe, P. Ilona, M.D. Mussagy, A. Abass, and A.G.O. Dixon (2009). Successes and challenges of cassava enterprises in West Africa: a case study of Nigeria, Bénin, and Sierra Leone. IITA, Ibadan, Nigeria. 19 pp.
Statistics Sierra Leone (2015). Population and Housing Census.
Zhou, X. H., Obuchowski, N. A., and McClish, D. K. (2002). Statistical methods in diagnostic medicine. Wiley & Sons: New York.

Accepted 29 August 2019

Citation: Sesay RB, Koneh A (2019). A Logistic Regression Model to Identify Factors Influencing Cassava Productivity in the Southern Part of Sierra Leone. Journal of Agricultural Economics and Rural Development, 5(2): 592-604.

Copyright: © 2019: Sesay and Koneh. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.
