Summary
The oil-and-gas industry is entering an era of “big data” because of the huge number of wells drilled with the rapid development of
unconventional oil-and-gas reservoirs during the past decade. The massive amount of data generated presents a great opportunity for
the industry to use data-analysis tools to help make informed decisions. The main challenge is the lack of effective and efficient data-analysis tools that can extract useful information for decision making from the enormous amount of data available. In developing tight shale reservoirs, it is critical to have an optimal drilling strategy that minimizes the risk of
drilling in areas that would result in low-yield wells. The objective of this study is to develop an effective data-analysis tool capable of
dealing with big and complicated data sets to identify hot zones in tight shale reservoirs with the potential to yield highly productive
wells. The proposed tool is developed on the basis of nonparametric smoothing models, which are superior to the traditional multiple-
linear-regression (MLR) models in both the predictive power and the ability to deal with nonlinear, higher-order variable interactions.
This data-analysis tool is capable of handling one response variable and multiple predictor variables. To validate our tool, we used two
real data sets—one with 249 tight oil horizontal wells from the Middle Bakken and the other with 2,064 shale gas horizontal wells from
the Marcellus Shale. Results from the two case studies revealed that our tool not only achieves much better predictive power than the traditional MLR models in identifying hot zones in the tight shale reservoirs but also provides guidance on developing the optimal drilling and completion strategies (e.g., well length and depth, amount of proppant and water injected). By comparing results from the two data sets, we found that, with only four predictor variables, our tool achieves model performance on the big data set (2,064 Marcellus wells) similar to that on the small data set (249 Bakken wells) with six predictor variables. This implies that, for
big data sets, even with a limited number of available predictor variables, our tool can still be very effective in identifying hot zones
that would yield highly productive wells. The data sets that we have access to in this study contain very limited completion, geological, and petrophysical information. Even so, results from this study suggest that the data-analysis tool is powerful and flexible enough to take advantage of additional engineering and geology data, allowing operators to gain insight into the impact of these factors on well performance.
Introduction
During the past decade, the technological advancements in horizontal drilling and multistage hydraulic fracturing enabled the production boom in unconventional oil-and-gas resources worldwide. The US Energy Information Administration (EIA 2013a) reported that the worldwide technically recoverable shale oil and gas reserves are 345 billion barrels and 7,299 Tcf, respectively. In addition, according to the EIA forecast in 2014, the share of shale gas production will increase from 40% of total US dry-gas production in 2012 to 53% in 2040 (EIA 2014a), and that of onshore tight oil production will increase from 33% of total lower-48-state onshore oil production to 51% in 2040 (EIA 2013b).
The vast amount of data generated from drilling and completion activities in unconventional tight oil-and-gas reservoirs has led the
oil-and-gas industry to a “big-data” era (Willigers et al. 2014; Gupta et al. 2014; Grujic et al. 2015; Zhong et al. 2015). As more
well data became available, significant variations in well performance in different areas of a tight shale reservoir were also observed
(Schuetter et al. 2015; Esmaili and Mohaghegh 2016). To be a commercial success in developing a tight shale reservoir, it is necessary
to identify and drill in the most productive areas in the reservoir. However, it remains a major challenge for the industry to distinguish
areas of poor economic potential from areas of high economic potential. Even after numerous wells are drilled, without an effective
quantitative approach to evaluate and analyze the data generated, high uncertainty in well performance of newly drilled wells in the
same general area remains (Willigers et al. 2014).
Physics-based models are often limited in their ability to solve this problem because of the numerical complexity and computational
effort required. Statistical models, on the other hand, rely on analyzing available well data from a reservoir to quantify the correlations
between well performance and completion as well as geologic variables. The results can then be used to identify drillsites with high
economic potential and to develop drilling and completion strategies to maximize well productivity.
The artificial neural network (ANN) has been used in a variety of applications in the oil-and-gas industry; see Esmaili and Mohaghegh (2016) and Mohaghegh (2016). The ANN mimics the biological neural network of a human brain (Pham and Liu 1995; Picton 2000). It matches the output parameter (response) by adjusting the weights assigned to the input parameters (predictors) through an iterative process. Basically, it is a pattern-recognition method. When the input parameters are properly tuned, it can provide good predictive power, especially in complex cases. However, it is not very effective in modeling the relationships between the predictors and the response.
MLR techniques (Neter et al. 1989) have been widely used to determine the influence of various completion and geologic variables
on well performance. MLR builds an empirical model to describe the relationship between predictor variables and an objective response
variable through linear regression. Voneiff et al. (2013) in their study found very weak correlations between the individual predictor
Copyright © 2017 Society of Petroleum Engineers
Original SPE manuscript received for review 10 February 2017. Revised manuscript received for review 10 August 2017. Paper (SPE 189440) peer approved 13 August 2017.
variable and the response variable using 2D linear regression. They then used an MLR model to try to identify the effects of four completion variables (number of fracture stages, perforation clusters per stage, lateral length, and fluid volume) on the average gas rate during the first year of production in the Montney Formation. This model closely matched the mean of the observed well performance. However, it failed to provide a reasonable match to the range of performance of individual wells (Voneiff et al. 2014). Gao and Gao
(2013) used MLR with the multivariate-adaptive-regression-splines (MARS) algorithm to determine the relationship of the early-time
well performance with nine completion and geologic variables in the Eagle Ford Formation. King and Wray (2014) used multivariate
analysis with nonlinear regression to analyze the effects of lateral length, proppant volumes, stage length, proppant type, treatment rate,
and water cut on well performance in the Bakken and Three Forks Formations. However, they did not address the potential issue of
overfitting the data. Also, no attempt was made in their study to use well location (longitude and latitude) as a predictor variable.
LaFollette et al. (2014) used a gradient-boosting method and geographic-information systems (GISs) to analyze well productivity in
the Eagle Ford Formation. They concluded that the well location is an important predictor variable for well productivity, which is
linked to the variation in fundamental reservoir parameters such as shale permeability, thickness, reservoir pressure, and reservoir-fluid
viscosity. Eburi et al. (2014) used the ordination technique called detrended correspondence analysis (DCA) to identify the key varia-
bles affecting well performance in Haynesville Shale. They reported that subsurface variables are the most significant drivers of well
performance, followed by completion variables.
Lolon et al. (2016) compared MLR without interaction terms (ITs), MLR with ITs determined by Bayesian information criterion
(BIC) and Akaike information criterion (AIC), random forests, and gradient-boosting machine (GBM) to evaluate the relationship
between well parameters and the production in the Bakken and Three Forks Formations. Both random forests and GBM are tree-based
methods (Zhong et al. 2015; Schuetter et al. 2015; Singh 2015). On the basis of cross-validation comparison, they concluded that the models chosen by BIC and AIC are simpler and predict well performance better than random forests and GBM. Although MLR is widely applied to perform data analysis in shale reservoirs, it remains a challenge to choose the right algorithm to avoid the issue of overfitting and the resulting inaccuracy in model prediction (Schuetter et al. 2015). In addition, MLR is a parametric method that requires the data to meet some restrictive assumptions such as linearity, normality, homogeneity of variance, and independence. It also becomes more complicated when higher-order variable interactions are included (Maučec et al. 2015).
In the following sections, we will demonstrate how we develop an effective and efficient data-analysis tool by use of nonparametric smoothing models to identify hot zones in the tight shale reservoirs and to provide guidance on developing optimal drilling and completion strategies. Unlike parametric models (e.g., MLR models), the nonparametric models used in this study do not require assumptions of variable distribution and linear relationships between the predictor variables and response variable. Such models have broad applications in modeling, especially when the data set is large and the relationships between the response and the predictors are complex. This powerful
tool includes three nonparametric smoothing methods: local linear smoother, cubic B-spline smoother, and nonparametric additive
models. The first two models are used to explore the one-on-one correlations between the response variable and individual predictor
variables, whereas the third technique is applied to quantify the correlation between the response variable and all predictor variables.
This correlation can then be used to identify hot zones in the target reservoir and to help develop drilling and completion strategies that
would yield highly productive wells. However, when the data are too sparsely populated, it would become difficult to accurately capture
the nonlinear patterns among variables, and the algorithms for calibrating the smoothing parameters could become unstable. In this
study, we used two real data sets—one with 249 tight oil horizontal wells from the Middle Bakken and the other with 2,064 shale gas
horizontal wells from the Marcellus Shale—to train and validate our tool. Results from the two case studies will be presented and dis-
cussed in the second half of the paper.
Methodology
In this section, we introduce the following nonparametric smoothing models used in this study: local linear smoother, cubic B-splines,
and the nonparametric additive models. The models introduced in this section are applied in two case studies in the following sections.
All modeling procedures are implemented in R (Version 3.3.3), a popular free statistical-programming environment. The main packages used are “mgcv,” “splines,” and “ggplot2.”
Local Linear Smoother. The local linear smoother (Fan 1993) is a nonparametric smoothing method to model the relationship between predictor variables (usually fewer than three) and a response variable. The general idea is to fit a locally linear regression in a small region by giving more weight to nearer data points and less weight to farther data points. To illustrate, suppose $x$ is a 1D predictor variable with $n$ observations $(x_1, \ldots, x_n)^T$ and $y$ is a 1D response variable with $n$ observations $(y_1, \ldots, y_n)^T$. The goal is to estimate the mapping function $f$ for which $y = f(x) + \varepsilon$, where $\varepsilon$ is a random error. Fig. 1 shows how the local linear smoother (red curve) fits the sample data points (black dots). The sample data points in Fig. 1 are from simulation. As illustrated in Fig. 1, when the goal is to estimate the response variable at $x = 7.4$, the observations close to 7.4 are assigned more weight (dots in the lasso-shaped region on the right), and the observations relatively far away from 7.4 are assigned less weight (dots in the lasso-shaped region on the left). The true curve (blue dash) is plotted in the graph as well to show the goodness of fit for the model. Mathematically, the prediction of a data point at $x_0 \in [\min x_i, \max x_i]$ is $y_0 = a_1 + a_2 x_0$, where $(a_1, a_2)$ minimizes the following weighted sum of squared errors:
$$L_1(a_1, a_2) = \sum_{i=1}^{n} K_h(x_0 - x_i)\,(y_i - a_1 - a_2 x_i)^2, \qquad (1)$$

where $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$ is a symmetric kernel function with bandwidth $h$. The solution is

$$y_0 = a_1 + a_2 x_0 = (1, x_0)\,[X^T W(x_0) X]^{-1} X^T W(x_0)\,y, \qquad (2)$$

where $X^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}$ and $W(x_0) = \mathrm{diag}[K_h(x_0 - x_i)]$. The kernel function is a weight function that assigns a larger weight when $x_0 - x_i$ is closer to zero and a smaller weight when $x_0 - x_i$ is farther from zero. The most frequently used kernel function is the Gaussian kernel, $K(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)$, which is also used in this study. When the kernel equals 1.0, Eq. 1 reduces to the standard least-square-error loss. Minimizing the standard least-square-error loss leads to the simple linear-regression method. Therefore, the local linear smoother is a weighted linear-regression method with weight $K_h(x_0 - x_i)$ for each data point $(x_i, y_i)$ when making a prediction at $x_0$. The bandwidth $h$ controls the smoothness of the fitted curve. If $h$ is too large, the fitted curve flattens out, and it cannot discover the hidden pattern of the true function $f$. If $h$ is too small, the fitted curve is too wiggly, leading to overfitting. The optimal bandwidth can be determined by cross validation (Fan 1993).
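As a minimal sketch of Eqs. 1 and 2 (not the authors' R implementation), the weighted least-squares problem can be solved in closed form for a single predictor. The function names `gaussian_kernel` and `local_linear_predict`, the simulated sine data, and the bandwidth value are assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear_predict(x, y, x0, h):
    """Local linear prediction at x0 with bandwidth h (Eqs. 1 and 2)."""
    # Kernel weights K_h(x0 - x_i) = K((x0 - x_i) / h) / h.
    w = [gaussian_kernel((x0 - xi) / h) / h for xi in x]
    # Entries of X^T W X and X^T W y for a design matrix with rows (1, x_i).
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    # Solve the 2x2 weighted normal equations for (a1, a2).
    det = s0 * s2 - s1 * s1
    a1 = (s2 * t0 - s1 * t1) / det
    a2 = (s0 * t1 - s1 * t0) / det
    return a1 + a2 * x0


# Hypothetical simulated data: a sine curve sampled on a grid (no noise).
xs = [0.25 * i for i in range(41)]          # 0, 0.25, ..., 10
ys = [math.sin(xi) for xi in xs]
estimate = local_linear_predict(xs, ys, 7.4, h=0.5)
```

A smaller `h` makes the fit follow the data more closely and a larger `h` flattens it, in line with the bandwidth discussion above; in practice `h` would be chosen by cross validation rather than fixed by hand.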
[Fig. 1 plot: y-axis Response; annotations "Assign more weights," "Assign less weights," and "Local linear fit in a small region"; legend: local linear smoother, true curve.]
Fig. 1—The illustration of local linear smoother fitting the sample data at x = 7.4. The true curve is plotted for baseline reference.
The main advantage of the local linear-smoothing estimator is that no strong assumptions are needed to fit the sampled data. Further-
more, it can capture the complicated nonlinear relationship between the predictor variables and the response variable. The main disad-
vantage is that when there are more than two predictors, it is difficult to find the optimal multidimensional bandwidth matrix for the
local linear smoother. Also, the algorithm will become unstable. The tendency of falling into local optimal bandwidths increases as the
number of the predictor variables increases. In light of this, in this paper, we only apply the local linear smoother to explore the relation-
ship between a single predictor variable and the response variable.
Smoothing Cubic B-Spline. Polynomial regression is an alternative to linear regression when there is a nonlinear pattern in the sample. However, it suffers from overfitting with low prediction power as the polynomial degree increases, because the real relationship between the predictor variable and the response variable may not be polynomial in the whole domain. In contrast, the smoothing spline (Cook and Peters 1981) is a local estimation method that joins piecewise polynomials, each fitted in a small region. The cutoff points between regions are called knots. At the knots, the polynomial pieces satisfy smoothness conditions that make the fitted curve smooth globally. There is a variety of spline functions with different orders from which to choose. Among all splines, the B-spline of degree three, called the cubic B-spline, is one of the most widely used. The main advantages of the B-spline are its ability to avoid overfitting and the flexibility to model data with complex nonlinear relationships between the response variable and the predictors with no assumption of normality required. As is known, basis vectors of a vector space $V$ are mutually linearly independent, and they can be used to construct any vector in $V$. Similarly, the cubic B-spline basis functions are linearly independent. For a given set of knots, every cubic spline can be represented as a linear combination of the cubic B-spline basis functions. To illustrate this, we simulated the cubic B-spline basis in the range [0, 10] in Fig. 2. It shows that, in the interval [0, 10], if we set three knots at points 3, 5, and 8, there are seven linearly independent cubic spline basis functions, shown in different colors in the figure. By calculating the coefficients for the spline basis, we can estimate any unknown spline curve within the range [0, 10].
[Fig. 2 plot: normalized b value (0.00 to 1.00) vs. x (0.0 to 10.0), showing the cubic B-spline basis functions b1 through b7 for three knots.]
Fig. 2—Cubic B-spline basis illustration: In the range [0, 10], manually set three internal knots. There are seven cubic spline basis:
b1 through b7 .
With the assumption that $t_1 < t_2 < \cdots < t_k$ are the user-defined $k$ internal knots, the B-spline basis functions can be expressed recursively:
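One standard way to write such a recursion is the Cox-de Boor formula. The sketch below illustrates that formula rather than the authors' exact notation; the function names and the choice to repeat each boundary knot four times (a common convention for cubic splines) are assumptions.

```python
def full_knot_vector(lower, upper, internal_knots, degree=3):
    """Pad the internal knots with (degree + 1) copies of each boundary."""
    return [lower] * (degree + 1) + list(internal_knots) + [upper] * (degree + 1)


def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    # Degree 0: indicator functions of the half-open knot spans.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    # Close the last non-empty span so that x == upper boundary is covered.
    if x == knots[-1]:
        for i in range(len(knots) - 2, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Raise the degree one step at a time; 0/0 terms are treated as 0.
    for d in range(1, degree + 1):
        raised = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis


# Three internal knots at 3, 5, and 8 on [0, 10], as in Fig. 2.
knots = full_knot_vector(0.0, 10.0, [3.0, 5.0, 8.0])
values_at_4 = bspline_basis(4.0, knots)
```

With these knots the function returns seven values, one per basis function, and the values are non-negative and sum to 1 at any point in [0, 10].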
Now, the problem of estimating $f$ is transformed to the problem of estimating $\beta_j$, $j = 1, \ldots, m$, which can be solved with the normal equations. However, the choice of knots controls the smoothness of the function, and choosing them is both subjective and computationally expensive. For this reason, the penalized smoothing spline method, which is not sensitive to knot selection, can be used in the B-spline regression (Hastie et al. 2005). In this case, the knots can be set at the observed predictor values or at a random sample of predictor values when the sample size is large. It only requires an estimation of a single smoothing parameter $\gamma$. To estimate a univariate smooth function $f(x) = \sum_{j=1}^{m} b_j(x)\beta_j$, we minimize the following penalized sum of squared errors:

$$L_2(\beta_1, \ldots, \beta_m; \gamma) = \sum_{i=1}^{n} [y_i - f(x_i)]^2 + \gamma \int [f''(x)]^2\,dx = \sum_{i=1}^{n} \Big[ y_i - \sum_{j=1}^{m} b_j(x_i)\beta_j \Big]^2 + \gamma \int [f''(x)]^2\,dx, \qquad (5)$$

where $\gamma$ controls the smoothness of the estimated function. Writing it in the matrix form,
Similarly, to avoid sensitivity to knot selection, we minimize the following penalized sum of squared errors:

$$L_3(\beta, \gamma) = \sum_{i=1}^{n} \Big[ y_i - \sum_{j_1=1}^{m_1} \sum_{j_2=1}^{m_2} g_{j_1 j_2}(x_{1i}, x_{2i})\,\beta_{j_1 j_2} \Big]^2 + \gamma J(f), \qquad (11)$$

where $J(f) = \iint \Big[ \Big(\frac{\partial^2 f}{\partial x_1^2}\Big)^2 + 2\Big(\frac{\partial^2 f}{\partial x_1 \partial x_2}\Big)^2 + \Big(\frac{\partial^2 f}{\partial x_2^2}\Big)^2 \Big]\,dx_1\,dx_2$. Similarly, the generalized cross-validation (GCV) method can be used to select the optimal value of $\gamma$. With the estimated $\gamma$, we get estimates of $\beta_{j_1 j_2}$, $j_1 = 1, \ldots, m_1$; $j_2 = 1, \ldots, m_2$. Then, the bivariate smooth function $f(x_1, x_2)$ is estimated.
Similar to the local linear smoothing method, the smoothing cubic B-spline method is generally applicable when the function $f$ has fewer than three dimensions. However, it can be used efficiently within nonparametric additive models (NAMs) to estimate the relationship between multiple predictor variables (three or more) and the response variable.
NAM Models. The NAM (Wood 2006) is especially useful when more than two predictors are nonlinearly related to the response variable and the response variable is not normally distributed. It has the form

$$y = \mu + \sum_{k=1}^{p} f_k(x_k) + \varepsilon, \qquad (12)$$

where $\mu$ is the mean of the response variable and $p$ is the number of predictor variables. The 1D or 2D function $f_k$ takes values at the 1D or 2D predictor variable $x_k$, and $\varepsilon$ is the error term with mean zero.
For example, if all $f_k$ are univariate functions, then the NAM can be written with the linear combination of spline basis as

$$y = \mu + \sum_{k=1}^{p} \sum_{j=1}^{m_k} b_j(x_k)\beta_{kj} + \varepsilon. \qquad (13)$$

When some $f_k$ are bivariate functions, the NAM can be expressed with the combination of univariate spline basis and bivariate tensor-product basis. The goal is to estimate all $\beta_{kj}$ for $k = 1, \ldots, p$; $j = 1, \ldots, m_k$.
The NAM can be estimated by means of smoothing spline methods iteratively with the back-fitting algorithm. In each iteration step, a cubic B-spline model is estimated. The detailed estimation algorithm is described as follows. We set the value of $\mu$ to be the sample mean $\bar{y}$ of $y_i$. In the initial step, we set starting values for $\beta_{kj}$, $k = 1, \ldots, p$; $j = 1, \ldots, m_k$, or equivalently, for all $f_k$, $k = 1, \ldots, p$. To make it simple, we can set $f_k = 0$ for all $k$. The second step is to update each $f_k$ sequentially with all other functions $f_l$, $l \neq k$, fixed. Suppose that we are at the stage to get an updated estimate of $f_1$. In this case, we use the smoothing cubic B-spline method described in the previous section to fit $\tilde{y} = y - \mu - \sum_{l \neq 1} \hat{f}_l$ on $x_1$, where all $\hat{f}_l$ are current estimates of $f_l$. After getting the optimal smoothing parameters by GCV, we have the estimated $\hat{f}_1$, which is the one-step update for $f_1$. The one-step update for $f_l$, $l = 2, \ldots, p$, follows the same procedure as the estimate of $\hat{f}_1$. With all updated $f_k$, $k = 1, \ldots, p$, we check the convergence status by calculating the differences for $f_k$, $k = 1, \ldots, p$, between the previous step and the updated step. If the differences are all under the preset threshold, the updated $f_k$, $k = 1, \ldots, p$, are our final estimates. Otherwise, $\hat{f}_k$, $k = 1, \ldots, p$, are set to be the initial values for $f_k$, $k = 1, \ldots, p$, in the next updating step. The flow chart of the back-fitting algorithm is shown in Fig. 3.
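The back-fitting loop described above can be sketched as follows. To keep the example short and dependency-free, it swaps in a kernel local linear smoother with a fixed bandwidth for the GCV-tuned smoothing cubic B-spline used in the paper; the function names, the bandwidth `h`, the tolerance, and the toy data are all assumptions.

```python
import math


def local_linear(x, y, x0, h):
    """Kernel-weighted straight-line fit evaluated at x0 (Gaussian kernel).

    The kernel's normalizing constant is omitted because it cancels out.
    """
    w = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in x]
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = s0 * s2 - s1 * s1
    return ((s2 * t0 - s1 * t1) + (s0 * t1 - s1 * t0) * x0) / det


def backfit(columns, y, h=1.0, tol=1e-8, max_iter=100):
    """Back-fitting for y = mu + sum_k f_k(x_k) + error.

    columns[k][i] is the value of predictor k for observation i.
    Returns mu and the fitted f_k evaluated at the observations.
    """
    n, p = len(y), len(columns)
    mu = sum(y) / n                           # mu is set to the sample mean
    f = [[0.0] * n for _ in range(p)]         # start from f_k = 0 for all k
    for _ in range(max_iter):
        largest_change = 0.0
        for k in range(p):
            # Partial residual: remove mu and every other current f_l.
            partial = [y[i] - mu - sum(f[l][i] for l in range(p) if l != k)
                       for i in range(n)]
            updated = [local_linear(columns[k], partial, xi, h)
                       for xi in columns[k]]
            # Center the update so that each f_k averages to zero.
            shift = sum(updated) / n
            updated = [v - shift for v in updated]
            change = max(abs(a - b) for a, b in zip(updated, f[k]))
            largest_change = max(largest_change, change)
            f[k] = updated
        if largest_change < tol:              # convergence check
            break
    return mu, f


# Toy data on a 5 x 5 grid with an additive truth: y = 1 + 2*x1 + 3*x2.
x1 = [float(a) for a in range(5) for _ in range(5)]
x2 = [float(b) for _ in range(5) for b in range(5)]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]
mu, (f1, f2) = backfit([x1, x2], y)
```

On this toy grid the two predictors are balanced against each other, so the loop settles after two passes and `mu + f1[i] + f2[i]` reproduces `y[i]`. Real well data would generally need more iterations and a data-driven choice of smoothing parameter.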
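[Fig. 3 flow chart of the back-fitting algorithm: initialize $f_k$, $k = 1, \ldots, p$; for $k = 1, \ldots, p$, calculate $\tilde{y} = y - \hat{\mu} - \sum_{l \neq k} \hat{f}_l$ and get a new $\hat{f}_k$; update $\hat{f}_k = \hat{f}_k - \frac{1}{n} \sum_{l=1}^{n} \hat{f}_k(x_{kl})$; check convergence.]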
In theory, when applying multiple regression models, the predictor variables need to be independent of one another. Unfortunately, this is not usually possible with real-world data. Multicollinearity exists when some predictors can be predicted from a linear combination of other predictors. This redundancy among predictors causes unstable coefficient estimates. The presence and severity of multicollinearity can be measured by the variance inflation factor (VIF). The VIF measures how much the variance of each regression coefficient is inflated by the presence of the other predictors in a linear model. The larger the VIF value, the more severe the multicollinearity. A VIF value close to 1.0 indicates almost no correlation among predictor variables. As long as multicollinearity is not severe (VIF < 5), we can safely apply the NAM. (See Appendix B for more-detailed discussions.)
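For reference, the VIF of predictor $j$ is conventionally defined from the coefficient of determination $R_j^2$ obtained by regressing that predictor on all the others. This is the standard textbook definition; Appendix B may use different notation.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
% R_j^2 = 0    gives VIF_j = 1  (predictor j is uncorrelated with the others)
% R_j^2 = 0.8  gives VIF_j = 5  (the threshold quoted above)
```

Under this definition, the VIF < 5 rule corresponds to $R_j^2 < 0.8$.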
The United States Geological Survey (USGS) reported that the Middle Bakken has an estimated average oil resource of 3.65 billion bbl (USGS 2013). In this study, we focus on a small data set of 249 tight oil horizontal wells from the Middle Bakken. Six predictor variables were selected and investigated: well depth, well length, sand injection per foot, water injection per foot, longitude, and latitude. Well depth refers to the true vertical depth, and well length represents the perforated lateral length from the first perforation cluster to the last perforation cluster of the wellbore. They are slightly correlated, but the multicollinearity is not severe. The response variable is the maximum oil-flow rate within 9 months, denoted as max oil-flow rate hereafter.
To begin, we first explore the distribution of the response variable. Unlike other multiple regression models, the NAM does not require the response variable to be normally distributed. All we need is a response variable with a symmetric distribution where the mean can accurately describe the center of the data distribution (mean ≈ median). However, the histogram in Fig. 4a shows that the response variable (maximum oil-flow rate) is heavily right-skewed. We can either transform the response variable to correct the skewness or use a generalized model from the exponential family. It turns out that the response variable, max oil-flow rate, becomes fairly symmetric after a natural log transformation (Fig. 4b), with a mean and median of 2.13 and 2.08, respectively. Results from the Shapiro-Wilk test show that the p-value for the log-transformed data is 0.03, which is less than 0.05, indicating that the log-transformed data fail the normality test. As discussed earlier in this paragraph, even though the transformed distribution fails the normality test, we can still safely apply the NAM. Hence, we will use the log-transformed max oil-flow rate as the new response variable in the model and transform it back when making model predictions.
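A minimal sketch of the transform and back-transform step, using hypothetical flow rates rather than the Bakken data; the function names are assumptions. It compares the mean and median before and after the natural log transformation, which is the symmetry check described above.

```python
import math
import statistics


def center_summary(values):
    """Return (mean, median) so the two can be compared for symmetry."""
    return statistics.mean(values), statistics.median(values)


def to_log_scale(rates):
    """Natural log transform of strictly positive flow rates."""
    return [math.log(r) for r in rates]


def from_log_scale(log_prediction):
    """Back-transform a prediction made on the log scale."""
    return math.exp(log_prediction)


# Hypothetical right-skewed rates (MSTB/month), not taken from the paper.
rates = [1.0, 2.0, 4.0, 8.0, 64.0]
raw_mean, raw_median = center_summary(rates)
log_mean, log_median = center_summary(to_log_scale(rates))
```

For these made-up values, the gap between mean and median shrinks from 11.8 on the raw scale to about 0.28 on the log scale.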
[Fig. 4 plots: (a) density vs. max oil-flow rate (MSTB/month); (b) density vs. log[max oil-flow rate (MSTB/month)].]
Fig. 4—Histogram and log-transformed histogram of the maximum oil-flow rate for Middle Bakken data: (a) Distribution of max oil-
flow rate; (b) distribution of max oil-flow rate with log transformation.
After the transformation of the response variable, we explore the marginal relationship between each predictor and the response.
Because longitude and latitude jointly determine the location, we treat these two as one predictor that represents the well-location infor-
mation, as shown in Fig. 5.
[Fig. 5 plot: longitude (x-axis) vs. latitude (y-axis), with a max oil-flow rate scale in MSTB/month (0 to 30).]
Fig. 5—Maximum oil-flow rate within a 9-month period for Middle Bakken data.
Figs. 6 through 9 show the scatter plots (black dots), local linear smoothing estimators (blue curves), and the 95% pointwise confidence band (gray areas) between each predictor variable and the log-transformed max oil-flow rate. Fig. 6 shows the relationship between the max oil-flow rate and the well depth. As shown in Fig. 6, the max oil-flow rate increases slowly as the well depth increases from 9,700 to 10,600 ft. The max oil-flow rate then dips slightly and starts to increase again when the wells go deeper from 10,600 to 10,800 ft. Fig. 7 shows the relationship between the max oil-flow rate and the well length. As shown in Fig. 7, the max oil-flow rate remains flat when the well length is shorter than 9,200 ft. As the well length increases from 9,200 to 9,500 ft, the max oil-flow rate increases sharply. For wells longer than 9,500 ft, the increase in the max oil-flow rate slows down but remains significant. Figs. 8 and 9 show the relationship between the max oil-flow rate and the amount of sand and water injected, respectively. As shown in Fig. 8, the max oil-flow rate increases almost linearly with the amount of sand injected. The relationship between the max oil-flow rate and the amount of water injected is, however, more complicated. Fig. 9 shows a general trend of the increased max oil-flow rate with an increasing amount of water injected when the amount of water injected exceeds 104 bbl/ft. A very sharp increase in the max oil-flow rate can be seen between 95 and 120 bbl/ft water injected. For all figures, the confidence band is wider when the data density is low and narrower when the data density is high. Figs. 6, 7, and 9 clearly show the complicated and nonlinear correlations between the response variable and the predictor variables. These findings suggest that nonparametric models are more suitable for this data set than MLR models.
[Fig. 6 plot: max oil-flow rate (MSTB/month) vs. well depth (ft), 9,500 to 11,000 ft.]
Fig. 6—Scatter plot and its local linear smoothing of max oil-flow rate with well depth for Middle Bakken data.
[Fig. 7 plot: y-axis Max Oil-Flow Rate (MSTB/month).]
Fig. 7—Scatter plot and its local linear smoothing of max oil-flow rate with well length for Middle Bakken data.
[Fig. 8 plot: max oil-flow rate (MSTB/month) vs. sand (lbm/ft), 0 to 400 lbm/ft.]
Fig. 8—Scatter plot and its local linear smoothing of max oil-flow rate with sand injection for Middle Bakken data.
The next step is to discover the relative importance of each predictor variable to the response variable. We apply GBM to find the relative variable importance. GBM is a boosting method that resamples the data set several times to generate results that form a weighted average of the resampled data set. It is rooted in growing a tree-based model in a greedy forward stage-wise approach, which iteratively minimizes the sum of squared errors. Similar to local linear smoothing or smoothing splines, it makes no assumptions about the distribution of the data. The relative importance of a predictor variable is calculated from the number of times it is selected for splitting, weighted by the impurity improvement of each split, and then averaged over all trees. As shown in Fig. 10, well length is the most important variable (42.6%), followed by the amount of water injected (25.6%), well depth (11.8%), and amount of sand injected (10.4%). In this case, well location has the least relative importance (9.7%) to the max oil-flow rate.
Fig. 9—Scatter plot and its local linear smoothing of max oil-flow rate with water injection for Middle Bakken data.
[Fig. 10 bar chart of relative importance (%); legible entries: Latitude 6.6, Longitude 3.1.]
Fig. 10—Order of relative importance of six predictor variables on the basis of the gradient-boosting method for Middle Bakken
data.
From the exploratory data analysis, we have the following findings: (1) the location variable is a bivariate predictor consisting of
longitude and latitude, which should be modeled with a bivariate function; (2) well length, depth, and water injection are nonlinearly
related to the well performance; and (3) the max oil-flow rate is heavily right-skewed while the log transformation makes the data more
symmetrically distributed. These findings suggest that the common MLR models are not suitable because the predictors are nonlinearly
related to the response variable and the response variable is not normally distributed whereas the NAM does not require these assump-
tions. Therefore, we decided to use the NAM with the cubic B-spline to model the relationship between log-transformed well perform-
ance and the location (longitude, latitude), well length, well depth, sand injection, and water injection.
Because of the small sample size, we use tenfold cross validation to perform the modeling. We randomly split the data into 10 folds; each time, nine folds are used to train the model and the remaining fold is used to test the goodness of the trained model. The model is thus evaluated 10 times, and the goodness and accuracy of the model performance are evaluated by averaging these 10 testing results. To formulate the model, let $y$ be the well performance and $x_1, x_2, x_3, x_4, x_5, x_6$ be well length, well depth, water injection, sand injection, longitude, and latitude, respectively. The NAM has the form

$$\log(y) = \mu + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_4(x_4) + f_5(x_5, x_6) + \varepsilon, \qquad (14)$$
where f1 ; f2 ; f3 , and f4 are four univariate smooth curves and f5 is a bivariate smooth surface. The model is estimated by minimizing the
following penalized sum of square error
X
n
L4 ðb1 ; …; bm ; cÞ ¼ ½logðyi Þ l f1 ðx1i Þ f2 ðx2i Þ f3 ðx3i Þ f4 ðx4i Þ f5 ðx5i ; x6i Þ2
i¼1
X
4 ð
þ ck ½ f 00k ðxk Þ2 dxk þ c5 Jj f 5 j; ð15Þ
k¼1
where
X
mk
fk ðxki Þ ¼ bkj ðxki Þbkj ; k ¼ 1; 2; 3; 4; . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ð16Þ
j¼1
X
m5 X
m6
f5 ðx5i ; x6i Þ ¼ b5j ðx5i Þb6l ðx6i Þb5jl ; . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ð17Þ
j¼1 l¼1
ðð " 2 2 2 2 #
@ 2 f5 @ 2 f5 @ f5
Jj f 5 j ¼ þ2 þ dx5 dx6 ; . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ð18Þ
@x25 @x5 x6 @x26
The knots are selected at the observed predictor values. We pick log(n) knots with equal quantile distance. The penalized sum of square
L4 can be minimized by the back-fitting algorithm described in Fig. 3. The smoothness parameters are determined by GCV.
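The back-fitting idea can be illustrated with a short sketch that fits an additive model by cycling over the components and smoothing the partial residuals of each one in turn. This is a sketch under stated assumptions rather than the estimation procedure used in this study: a Gaussian kernel smoother stands in for the penalized cubic B-spline, the bandwidths are fixed by hand instead of being chosen by GCV, and the data are synthetic. The function names `kernel_smooth` and `backfit` are hypothetical.

```python
import math
import random

def kernel_smooth(x, r, bandwidth):
    """Nadaraya-Watson smoother: Gaussian-weighted mean of r around each x value."""
    fitted = []
    for x0 in x:
        weights = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
        fitted.append(sum(w * ri for w, ri in zip(weights, r)) / sum(weights))
    return fitted

def backfit(columns, y, bandwidths, n_iter=10):
    """Fit y = mu + sum_k f_k(x_k) by cycling over the components."""
    n = len(y)
    mu = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for k, xk in enumerate(columns):
            # Partial residual: remove the intercept and every component except f_k.
            partial = [
                y[i] - mu - sum(f[j][i] for j in range(len(columns)) if j != k)
                for i in range(n)
            ]
            smoothed = kernel_smooth(xk, partial, bandwidths[k])
            center = sum(smoothed) / n  # center f_k so that mu stays identifiable
            f[k] = [v - center for v in smoothed]
    return mu, f

# Synthetic data (hypothetical, not the well data): two nonlinear effects plus noise.
rng = random.Random(1)
n = 150
x1 = [rng.random() for _ in range(n)]
x2 = [rng.random() for _ in range(n)]
y = [1.0 + math.sin(2 * math.pi * a) + 4 * (b - 0.5) ** 2 + rng.gauss(0, 0.1)
     for a, b in zip(x1, x2)]

mu, f = backfit([x1, x2], y, bandwidths=[0.08, 0.08])
fitted = [mu + f[0][i] + f[1][i] for i in range(n)]
sse_fit = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sse_mean = sum((yi - mu) ** 2 for yi in y)
print(f"share of variation explained: {1 - sse_fit / sse_mean:.2f}")
```

Replacing `kernel_smooth` with a penalized spline smoother would bring the sketch closer to the criterion in Eq. 15; the cycling and centering steps would stay the same.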
In Table 1, we compare the performance of the NAM with that of MLR, MLR+IT, and MLR with interaction and quadratic terms (MLR+Quad). The comparison is based on the following criteria: AIC, BIC, $R^2$, adjusted $R^2$ (Adj $R^2$), and the cross-validation mean squared error (CV.MSE). The CV.MSE is used to evaluate the predictive power, whereas the other criteria are used to evaluate the explanatory power of the models. As shown in Table 1, all criteria except BIC favor the NAM, which suggests that the NAM has better predictive and explanatory power than the other models.
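The CV.MSE criterion can be sketched as follows: the data are split into 10 folds, the model is refit on nine folds, the squared prediction error is averaged on the held-out fold, and the 10 fold errors are averaged. The sketch assumes a one-predictor least-squares line as a placeholder model and synthetic data; `kfold_indices`, `fit_line`, and `cv_mse` are hypothetical helper names, and any model with a fit and a predict step could be substituted.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle the row indices and deal them into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (placeholder model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def cv_mse(x, y, k=10, seed=0):
    """Average of the k held-out mean squared errors."""
    fold_mse = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k

# Synthetic data with 249 rows (hypothetical values, not the Bakken wells).
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(249)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 0.5) for xi in x]

score = cv_mse(x, y, k=10)
print(f"tenfold CV.MSE: {score:.3f}")
```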
Model      AIC     BIC     R^2   Adj R^2  CV.MSE
NAM        139.17  246.18  0.70  0.64     4.45
MLR        202.20  227.61  0.44  0.42     4.76
MLR+IT     181.43  254.48  0.58  0.52     5.22
MLR+Quad   178.38  270.49  0.61  0.54     5.58
Table 1—Comparison of NAM and linear-regression models with all six predictors for Middle Bakken data.
If we only use four predictors—longitude, latitude, well depth, and length, without water and sand injection—Table 2 shows that
the NAM is still the best in all criteria except for BIC. A comparison of model performance with four and six predictors (Tables 1 and
2) clearly shows that the model with six predictors outperformed the model with only four predictors. This indicates that more available
predictors should result in better predictive power for the model.
Model      AIC     BIC     R^2   Adj R^2  CV.MSE
Table 2—Comparison of NAM and linear-regression models with four predictors (without water and sand injection as predictors) for Middle Bakken data.
Figs. 11 through 13 are contour plots that can be used to visualize the relationship between the predicted max oil-production rate and two closely related predictor variables. These figures were generated by keeping the other predictor variables at their respective median values. The warmer the color in the contour plots, the higher the max oil-production rate. As shown in Fig. 11, the hot-zone location with the potential to yield highly productive wells can be easily identified. This plot can also provide an estimate of the predicted oil-production rate at a given well location. This information can be very useful in choosing future drillsites with high economic potential. From Fig. 12, we can identify the well length and depth that would yield the highest well productivity at a given location in the reservoir. Fig. 13 can be used to determine the amounts of water and proppant injected that would result in the highest well productivity. This information can be very useful in developing the optimal drilling and completion strategies.
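The construction behind such contour plots can be sketched in a few lines: two predictors are varied over a grid while every other predictor is held at its median, and the fitted model is evaluated at each grid point. The sketch below assumes a hypothetical `predict` function standing in for the fitted NAM and a toy data dictionary; the predictor names and values are illustrative only.

```python
from statistics import median

def prediction_grid(predict, data, x_name, y_name, n_steps=5):
    """Evaluate predict() on a grid of two predictors, others fixed at medians."""
    fixed = {name: median(values) for name, values in data.items()
             if name not in (x_name, y_name)}

    def axis(name):
        lo, hi = min(data[name]), max(data[name])
        return [lo + (hi - lo) * i / (n_steps - 1) for i in range(n_steps)]

    grid = []
    for xv in axis(x_name):
        for yv in axis(y_name):
            point = dict(fixed, **{x_name: xv, y_name: yv})
            grid.append((xv, yv, predict(point)))
    return grid

# Toy predictor table (hypothetical values, not the Bakken wells).
data = {
    "length": [5000, 9000, 13000, 7000],
    "depth": [9600, 11200, 10400, 10000],
    "water": [100, 150, 200, 120],
}

def predict(point):
    """Hypothetical stand-in for a fitted model, peaked at length 9000, depth 10400."""
    return (-((point["length"] - 9000) / 1000) ** 2
            - ((point["depth"] - 10400) / 100) ** 2
            + 0.001 * point["water"])

grid = prediction_grid(predict, data, "length", "depth")
hot = max(grid, key=lambda cell: cell[2])  # grid cell with the highest prediction
print(f"highest predicted value at length={hot[0]:.0f}, depth={hot[1]:.0f}")
```

The resulting list of (x, y, prediction) triples is what a contour-plotting routine would consume; the cell with the largest prediction marks the hot zone on that grid.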
Fig. 11—Contour plot of max oil-flow rate vs. longitude and latitude with well length, depth, water injection, and sand injection fixed at their respective median values for Middle Bakken data. [Axes: longitude (x), latitude (y); color scale: max oil-flow rate, MSTB/month.]
Fig. 12—Contour plot of max oil-flow rate vs. well depth and well length with location, water injection, and sand injection fixed at their respective median values for Middle Bakken data. [Axes: length, 5,000 to 12,000 ft (x); depth, 9,700 to 11,000 ft (y); color scale: max oil-flow rate, 0 to 20 MSTB/month.]
Fig. 13—Contour plot of max oil-flow rate vs. water and sand injection with location, well length, and well depth fixed at their respective median values for Middle Bakken data. [Axes: sand, 50 to 600 lbm/ft (x); water, 30 to 280 bbl/ft (y); color scale: max oil-flow rate, 0 to 30 MSTB/month.]
These contour plots are powerful tools that can be used for future drillsite selections as well as providing guidance in developing
optimal drilling and completion strategies to maximize well productivity.
well-length range. Figs. 16 and 17 clearly show nonlinear patterns between the predictor variables and the response variable. Therefore,
nonparametric models are more suitable for this data set than the MLR models.
Fig. 14—Histogram and log-transformed histogram of the max gas-flow rate for Marcellus data: (a) distribution of max gas-flow rate; (b) distribution of max gas-flow rate with log transformation. [Axes: (a) max gas-flow rate (MMscf/month) vs. density; (b) log[max gas-flow rate (MMscf/month)] vs. density.]
Fig. 15—Max gas-flow rate within a 9-month period for Marcellus data. [Axes: longitude, –78 to –75.5 (x); latitude, 41.5 to 42 (y); color scale: 0 to 600 MMscf/month.]
Fig. 16—Scatter plot and its local linear smoothing of max gas-flow rate with well depth for Marcellus data. [y-axis: max gas-flow rate (MMscf/month).]
Fig. 18 shows the relative variable importance from GBM of the four predictor variables. As we can see from the results summar-
ized in Fig. 18, longitude is the most important (38.8%), followed by the well length (25.8%). The other location variable, latitude, has
a similar relative importance (24.0%), and well depth has the lowest relative importance (11.4%) to the max gas-flow rate.
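A relative-importance ranking of this kind can be illustrated with a small gradient-boosting sketch. The sketch is a simplification under stated assumptions and is not the GBM implementation used in this study: each base learner is a one-split regression stump fitted to the current residuals, the importance of a predictor is the sum of the squared-error reductions of the splits made on it (scaled to 100%), and the data are synthetic. The helper names `best_stump` and `boosted_importance`, the learning rate, and the number of trees are hypothetical choices.

```python
import random

def best_stump(X, residual):
    """Find the split (feature, threshold) that most reduces the squared error."""
    n = len(residual)
    total = sum(residual)
    one_leaf = total * total / n
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(1, n):
            left_sum += residual[order[pos - 1]]
            lo, hi = X[order[pos - 1]][j], X[order[pos]][j]
            if lo == hi:
                continue  # no threshold can separate tied values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors from splitting one leaf in two.
            gain = left_sum ** 2 / pos + right_sum ** 2 / (n - pos) - one_leaf
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / pos, right_sum / (n - pos))
    return best

def boosted_importance(X, y, n_trees=50, learning_rate=0.1):
    """Boost stumps on squared loss and return per-feature importance in percent."""
    n = len(y)
    pred = [sum(y) / n] * n
    gains = [0.0] * len(X[0])
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        gain, j, threshold, left, right = best_stump(X, residual)
        gains[j] += gain
        for i in range(n):
            pred[i] += learning_rate * (left if X[i][j] < threshold else right)
    total = sum(gains)
    return [100.0 * g / total for g in gains]

# Synthetic data (hypothetical): the response depends strongly on the first
# predictor, weakly on the third, and not at all on the other two.
rng = random.Random(7)
names = ["longitude", "latitude", "well length", "well depth"]
X = [[rng.random() for _ in names] for _ in range(200)]
y = [3.0 * row[0] + 1.0 * row[2] + rng.gauss(0, 0.1) for row in X]

importance = boosted_importance(X, y)
for name, value in sorted(zip(names, importance), key=lambda t: -t[1]):
    print(f"{name:12s} {value:5.1f}%")
```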
Next, we use NAM with the cubic B-spline to model the relationship between log-transformed well performance and the location (longitude, latitude), well length, and well depth. Because of the big sample size, we randomly split the data into two parts: 70% of the data is used only for training the model, and the remaining 30% of the data is withheld only to test the trained model. Let $y$ be the well performance and $x_1, x_2, x_3, x_4$ be well length, well depth, longitude, and latitude, respectively. Similar to the data analysis in the previous section, NAM has the form
$$\log(y) = \mu + f_1(x_1) + f_2(x_2) + f_3(x_3, x_4) + \varepsilon,$$
where $f_1$ and $f_2$ are two univariate smooth curves and $f_3$ is a bivariate smooth surface. The model is estimated by minimizing the following penalized sum of squared errors:
$$L_5(\beta_1, \ldots, \beta_m, \gamma) = \sum_{i=1}^{n} \left[\log(y_i) - \mu - f_1(x_{1i}) - f_2(x_{2i}) - f_3(x_{3i}, x_{4i})\right]^2 + \gamma_1 \int \left[f_1''(x_1)\right]^2 dx_1 + \gamma_2 \int \left[f_2''(x_2)\right]^2 dx_2 + \gamma_3 J|f_3|, \tag{21}$$
where
$$f_1(x_{1i}) = \sum_{j=1}^{m_1} b_{1j}(x_{1i})\,\beta_{1j}, \tag{22}$$
$$f_2(x_{2i}) = \sum_{j=1}^{m_2} b_{2j}(x_{2i})\,\beta_{2j}, \tag{23}$$
$$f_3(x_{3i}, x_{4i}) = \sum_{j=1}^{m_3} \sum_{l=1}^{m_4} b_{3j}(x_{3i})\,b_{4l}(x_{4i})\,\beta_{3jl}, \tag{24}$$
$$J|f_3| = \iint \left[\left(\frac{\partial^2 f_3}{\partial x_3^2}\right)^2 + 2\left(\frac{\partial^2 f_3}{\partial x_3\,\partial x_4}\right)^2 + \left(\frac{\partial^2 f_3}{\partial x_4^2}\right)^2\right] dx_3\,dx_4. \tag{25}$$
Fig. 17—Scatter plot and its local linear smoothing of max gas-flow rate with well length for Marcellus data. [y-axis: max gas-flow rate (MMscf/month).]
Fig. 18—Order of relative importance of four predictor variables on the basis of the gradient-boosting method for Marcellus data. [Bar chart; recovered bar labels: longitude 38.8, latitude 24.0.]
Fig. 19 plots the actual well performance vs. the predicted well performance with NAM. Fig. 19a is for the training data set, and
Fig. 19b is for the test data set. For the data points falling below the red line in these plots, the model underpredicts the actual well per-
formance. The model overpredicts the actual well performance when the data points fall above the red line. A careful examination of
Figs. 19a and 19b reveals that, in both the training and test cases, the model did a very good job fitting the actual well performance
when the max gas-flow rate is less than 400 MMscf/month. When the actual well performance is more than 400 MMscf/month, the
model tends to underpredict the actual well performance. The difference could be attributed to fewer data points in the region with the
max gas-flow rate greater than 400 MMscf/month.
Fig. 19—Comparison of actual and predicted well performance with nonparametric additive model for Marcellus data: (a) training data set; (b) test data set. [x-axis of both panels: actual max gas-flow rate (MMscf/month).]
Next, we compare the model performance between NAM and other commonly used MLR models. The same 70:30 data split for training and test data sets is used for all models. In Table 3, we compare the performance of NAM with MLR, MLR+IT, and MLR+Quad (labeled LM, LM+IT, and LM+Quad in the table). We use the AIC, BIC, $R^2$, adjusted $R^2$ (Adj $R^2$), as well as the relative mean squared error of the training data set (Rel MSE$_{TR}$) to measure the explanatory power. The relative mean squared error of the test data set (Rel MSE$_{TE}$) is used to measure the predictive power. The relative mean squared error is defined as the ratio of the MSE of another model to the MSE of NAM, so it equals 1.00 for NAM and exceeds 1.00 for any model with a larger MSE. Results summarized in Table 3 show that NAM outperforms the other models in terms of explanatory power. On the test data set, the MSE of each of the other models is at least 18% higher than that of NAM (Rel MSE$_{TE}$ of 1.18 or more), indicating that NAM has better predictive power than the other models. These findings again suggest that the nonparametric models are superior to the MLR models for the field data used in this study, in which nonlinear and complicated correlations exist between the predictors and the response variable.
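The relative MSE on a 70:30 split can be sketched as follows. The sketch assumes synthetic data with a curved relationship, a straight-line fit as the competing model, and a binned-mean smoother as a simple stand-in for the nonparametric reference model; none of these are the models or data of this study, and all helper names are hypothetical.

```python
import random

def train_test_split(n, train_fraction=0.7, seed=0):
    """Shuffle the row indices and cut them into training and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def relative_mse(mse_other, mse_reference):
    """Ratio of another model's MSE to the reference model's MSE."""
    return mse_other / mse_reference

def fit_line(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return lambda v: my + slope * (v - mx)

def fit_binned_means(x, y, n_bins=10, lo=-1.0, hi=1.0):
    """Crude smoother: predict the training mean of y within each bin of x."""
    def bin_of(v):
        return min(n_bins - 1, max(0, int((v - lo) / (hi - lo) * n_bins)))
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x, y):
        sums[bin_of(xi)] += yi
        counts[bin_of(xi)] += 1
    overall = sum(y) / len(y)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda v: means[bin_of(v)]

# Synthetic data with 2,064 rows (hypothetical values, not the Marcellus wells).
rng = random.Random(3)
x = [rng.uniform(-1, 1) for _ in range(2064)]
y = [xi ** 2 + rng.gauss(0, 0.1) for xi in x]

train, test = train_test_split(len(y))
x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
x_te, y_te = [x[i] for i in test], [y[i] for i in test]

line, smooth = fit_line(x_tr, y_tr), fit_binned_means(x_tr, y_tr)
mse_line = mse(y_te, [line(v) for v in x_te])
mse_smooth = mse(y_te, [smooth(v) for v in x_te])
rel_te = relative_mse(mse_line, mse_smooth)
print(f"Rel MSE_TE of the line vs. the smoother: {rel_te:.2f}")
```

A value above 1 means that the competing model has a larger test error than the reference model, which is how the Rel MSE columns of Table 3 are read.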
Model      AIC    BIC    R^2   Adj R^2  Rel MSE_TR  Rel MSE_TE
NAM        2125   2369   0.64  0.63     1.00        1.00
LM         2540   2571   0.49  0.49     1.22        1.23
LM+IT      2450   2513   0.53  0.52     1.20        1.90
LM+Quad    2387   2471   0.55  0.54     1.20        1.18
Table 3—Comparison of the nonparametric additive model and other models for Marcellus data.
Adj $R^2$ can be used to compare the explanatory power of models across data sets with different sample sizes and numbers of predictors. Not surprisingly, a comparison of the Adj $R^2$ values in Tables 2 and 3 clearly demonstrates the advantage of having a bigger data set for the same number of predictors (four in this case).
A comparison of the Adj $R^2$ values in Tables 1 and 3 shows that the NAM can achieve explanatory power in the big-data-set field case (2,064 Marcellus wells) with four predictor variables (Adj $R^2$ = 0.63) that is similar to that in the small-data-set field case (249 Bakken wells) with six predictor variables (Adj $R^2$ = 0.64). This implies that when we have a big sample data set, even if the number of predictor variables is limited, the NAM can still achieve high explanatory power.
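The way Adj $R^2$ accounts for sample size and model size can be made concrete with the textbook formula for a linear model with an intercept, Adj $R^2$ = 1 − (1 − $R^2$)(n − 1)/(n − p − 1). The sketch below uses this formula with illustrative inputs; it is an assumption that the tabulated values were computed in this way, and for the NAM the count p would have to be replaced by an effective number of parameters.

```python
def adjusted_r2(r2, n_obs, n_params):
    """Textbook adjusted R^2 for a model with n_params predictors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_params - 1)

# Illustrative inputs: the same R^2 and predictor count at two sample sizes.
small = adjusted_r2(0.50, n_obs=249, n_params=4)
large = adjusted_r2(0.50, n_obs=2064, n_params=4)
print(f"n = 249:   Adj R^2 = {small:.4f}")
print(f"n = 2,064: Adj R^2 = {large:.4f}")
```

For a fixed $R^2$ and number of predictors, the penalty shrinks as the sample grows, so the adjusted value moves closer to the unadjusted one.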
Figs. 20 and 21 are contour plots that can be used to visualize the relationship between the predicted max gas-flow rate and two closely related predictor variables. These figures were generated by keeping the other predictor variables at their respective median values. The warmer the color in the contour plots, the higher the max gas-flow rate. As shown in Fig. 20, the hot-zone location with the potential to yield highly productive wells can be easily identified. This plot can also provide an estimate of the predicted gas-production rate at a given well location. This information can be very useful in choosing future drillsites in areas with high economic potential. From Fig. 21, we can identify the well length and depth that would yield the highest well productivity at a given location in the reservoir. This information can be used to develop the optimal drilling and completion strategy.
Conclusion
In this study, we developed an efficient and effective data-analysis tool with nonparametric smoothing models that are superior to the
traditional MLR models in both the predictive power and the ability to deal with nonlinear, higher-order variable interactions (see
Tables 1, 2, and 3).
To validate our tool, we used two real data sets—one with 249 tight oil horizontal wells from the Middle Bakken and the other with
2,064 shale gas horizontal wells from the Marcellus Shale. Results from the preliminary data analyses revealed that the interactions
between the predictor and response variables are highly nonlinear and very complicated.
With a nonparametric additive model with cubic B-spline to identify the relationship between log-transformed well performance
and the location (longitude, latitude), well length, well depth, sand injection, and water injection, our tool successfully explained
approximately 65 to 70% of the well-performance variation in the two data sets as opposed to 44 to 55% with other traditional MLR
models, representing a 30 to 50% increase in model performance.
Fig. 20—Contour plot of max gas-flow rate vs. well location (longitude and latitude) with well length and well depth fixed at their respective median values for Marcellus data. [Axes: longitude, –78 to –75.5 (x); latitude, 41.5 to 42 (y); color scale: 0 to 600 MMscf/month.]
Fig. 21—Contour plot of max gas-flow rate vs. well depth and well length with location variables (longitude and latitude) fixed at their respective median values [–76.2, 41.65] for Marcellus data. [Axes: x-axis ticks from 4,000 to 14,000; well length, 0 to 10,000 ft (y); color scale: 0 to 1,400 MMscf/month.]
With contour plots, we can identify hot zones in the reservoir with the potential to yield highly productive wells. They also can be
used to predict well performance at a selected future drillsite with different combinations of well length, well depth, and amount of
proppant and water to inject. These contour plots are powerful tools that can be used for future drillsite selections as well as providing
guidance in developing optimal drilling and completion strategies to maximize well productivity.
By comparing results from the two data sets, we found that our tool can achieve comparable model performance with the big data set (2,064 Marcellus wells) with only four predictor variables (Adj $R^2$ = 0.63) as with the small data set (249 Bakken wells) with six predictor variables (Adj $R^2$ = 0.64). This implies that for big data sets, even with a limited number of available predictor variables, our tool can still be very effective in identifying hot zones that would yield highly productive wells. If more predictors are available, we can expect better prediction results.
The data sets that we have access to in this study contain very limited completion, geological, and petrophysical information. Results
from this study clearly demonstrated that the data-analysis tool is certainly powerful and flexible enough to take advantage of any addi-
tional engineering and geology data to allow the operators to gain insights on the impact of these factors on well performance.
Nomenclature
$b_i^p(x)$ = the ith spline basis of polynomial order p
$E(y)$ = expected value of random variable y
$f(x)$ = a univariate smooth function
$f(x_1, x_2)$ = a bivariate smooth function
$\log(y)$ = the natural log transformation of y
$L_k$ = sum of square error, k = 1, …, 5
$K_h(x)$ = kernel function with bandwidth h
$R^2$ = coefficient of determination
$S_\gamma(i, i)$ = the ith diagonal element of the smoothing matrix
Acknowledgments
We would like to acknowledge financial support from Texas A&M Engineering Experiment Station. We would also like to acknowl-
edge that DrillingInfo provided the production data analyzed in the work.
References
Cook, E. R. and Peters, K. 1981. The Smoothing Spline: A New Approach to Standardizing Forest Interior Tree-ring Width Series for Dendroclimatic
Studies. Tree-Ring Bull. 41: 45–53.
Duchamp, T. and Werner, S. 2003. Spline Smoothing on Surfaces. Journal of Computational and Graphical Statistics 12 (2): 354–381. https://doi.org/
10.1198/1061860031743.
Eburi, S., Jones, S., Houston, T. et al. 2014. Analysis and Interpretation of Haynesville Shale Subsurface Properties, Completion Variables, and Produc-
tion Performance Using Ordination, A Multivariate Statistical Analysis Technique. Presented at the SPE Annual Technical Conference and Exhibi-
tion, Amsterdam, 27–29 October. SPE-170834-MS. https://doi.org/10.2118/170834-MS.
Esmaili, S. and Mohaghegh, S. D. 2016. Full-Field Reservoir Modeling of Shale Assets Using Advanced Data-Driven Analytics. Geoscience Frontiers 7
(1): 11–20. https://doi.org/10.1016/j.gsf.2014.12.006.
Fan, J. 1993. Local Linear Regression Smoothers and Their Minimax Efficiencies. The Annals of Statistics 21 (1): 196–216.
Gao, C. and Gao, H. 2013. Evaluating Early-Time Eagle Ford Well Performance Using Multivariate Adaptive Regression Splines (MARS). Presented at
the SPE Annual Technical Conference and Exhibition, New Orleans, 30 September–2 October. SPE-166462-MS. https://doi.org/10.2118/166462-
MS.
Golub, G. H., Heath, M., and Wahba, G. 1979. Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter. Technometrics 21 (2):
215–223.
Grujic, O., Silva, C. D., and Caers, J. 2015. Functional Approach to Data Mining, Forecasting, and Uncertainty Quantification in Unconventional Reser-
voirs. Presented at the SPE Annual Technical Conference and Exhibition, Houston, 28–30 September. SPE-174849-MS. https://doi.org/10.2118/
174849-MS.
Gupta, S., Fuehrer, F., and Jeyachandra, B. C. 2014. Production Forecasting in Unconventional Resources Using Data Mining and Time Series Analysis.
Presented at the SPE/CSUR Unconventional Resources Conference, Calgary, 30 September–2 October. SPE-171588-MS. https://doi.org/10.2118/
171588-MS.
Hastie, T., Tibshirani, R., and Friedman, J. 2005. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer: New York.
King, V. M. and Wray, L. 2014. Completion Optimization Utilizing Multivariate Analysis in the Bakken and Three Forks Formations. Presented at the
SPE Western North American and Rocky Mountain Joint Regional Meeting, Denver, 16–18 April. SPE-169534-MS. https://doi.org/10.2118/169534-
MS.
LaFollette, R. F., Izadi, G., and Zhong, M. 2014. Application of Multivariate Statistical Modeling and Geographic Information Systems Pattern-Recogni-
tion Analysis to Production Results in the Eagle Ford Formation of South Texas. Presented at the SPE Hydraulic Fracturing Technology Conference,
The Woodlands, Texas, USA, 4–6 February. SPE-168628-MS. https://doi.org/10.2118/168628-MS.
Lolon, E., Hamidieh, K., Weijers, L. et al. 2016. Evaluating the Relationship Between Well Parameters and Production Using Multivariate Statistical
Models: A Middle Bakken and Three Forks Case History. Presented at the SPE Hydraulic Fracturing Technology Conference, The Woodlands,
Texas, USA, 9–11 February. SPE-179171-MS. https://doi.org/10.2118/179171-MS.
Maučec, M., Singh, A. P., Bhattacharya, S. et al. 2015. Multivariate Analysis and Data Mining of Well-Stimulation Data by Use of Classification-and-
Regression Tree With Enhanced Interpretation and Prediction Capabilities. SPE Econ & Mgmt 7 (2): 60–71. SPE-166472-PA. https://doi.org/
10.2118/166472-PA.
Mohaghegh, S. D. 2016. Determining the Main Drivers in Hydrocarbon Production From Shale Using Advanced Data-driven Analytics—A Case Study
in Marcellus Shale. Journal of Unconventional Oil and Gas Resources 15: 146–157. https://doi.org/10.1016/j.juogr.2016.07.004.
Neter, J., Wasserman, W., and Kutner, M. 1989. Applied Linear Regression Model, second edition. Boston, Massachusetts: Richard D. Irwin, Inc.
Picton, P. 2000. Neural Networks. New York City: PALGRAVE.
Pham, T. D. and Liu, X. 1995. Neural Networks for Indentification, Prediction and Control. London: Springler-Verlag London Limited.
Schuetter, J., Mishra, S., Zhong, M. et al. 2015. Data Analytics for Production Optimization in Unconventional Reservoirs. Presented at the SPE/AAPG/
SEC Unconventional Resource Technology Conference, San Antonio, Texas, USA, 20–22 July. URTEC-2167005-MS. https://doi.org/10.15530/
URTEC-2015-2167005.
Singh, A. 2015. Root-Cause Identification and Production Diagnostic for Gas Wells With Plunger Lift. Presented at the SPE Reservoir Characterization
and Simulation Conference and Exhibition, Abu Dhabi, 14–16 September. SPE-175564-MS. https://doi.org/10.2118/175564-MS.
Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 58
(1): 267–288.
US Department of Energy (DOE). 2013. Modern Shale Gas Development in the United States: An Update, http://www.netl.doe.gov/File%20Library/
Research/Oil-Gas/shale-gas-primer-update-2013.pdf.
US Energy Information Administration 2013a. Technically Recoverable Shale Oil and Shale Gas Resources: An Assessment of 137 Shale Formations in
41 Countries Outside the United States. http://www.eia.gov/analysis/studies/worldshalegas/.
US Energy Information Administration. 2013b. Early Release Overview. http://www.eia.gov/forecasts/aeo/er/pdf/0383er%282013%29.pdf.
US Energy Information Administration 2014a. Annual Energy Outlook. http://www.eia.gov/forecasts/aeo/mt_naturalgas.cfm.
US Energy Information Administration. 2014b. Drilling Productivity Report, http://www.eia.gov/petroleum/drilling/#tabs-summary-1.
United States Geological Survey (USGS). 2013. USGS Releases New Oil and Gas Assessment for Bakken and Three Forks Formations. USGS
Webpost, “<https://www2.usgs.gov/blogs/features/usgs_top_story/usgs-releases-new-oil-and-gas-assessment-for-bakken-and-three-forks-formations/>”.
Voneiff, G., Sadeghi, S., Bastian, P. et al. 2013. A Well-Performance Model Based on Multivariate Analysis of Completion and Production Data From
Horizontal Well in the Montney Formation in British Columbia. Presented at the SPE Unconventional Resources Conference, Calgary, 5–7 Novem-
ber. SPE-167154-MS. https://doi.org/10.2118/167154-MS.
Voneiff, G., Sadeghi, S., Bastian, P. et al. 2014. Probabilistic Forecasting of Horizontal Well Performance in Unconventional Reservoirs Using Publicly-
Available Completion Data. Presented at the SPE Unconventional Resources Conference, The Woodlands, Texas, USA, 1–3 April. SPE-168978-MS.
https://doi.org/10.2118/168978-MS.
Willigers, B. J. A., Begg, S., and Bratvold, R. B. 2014. Combining Geostatistics With Bayesian Updating to Continually Optimize Drilling Strategy in
Shale-Gas Plays. SPE Res Eval & Eng 17 (4): 507–519. SPE-164816-PA. https://doi.org/10.2118/164816-PA.
Wood, S. 2006. Generalized Additive Models: An Introduction With R. New York: CRC Press.
Zhong, M., Schuetter, J., Mishra, S. et al. 2015. Do Data Mining Methods Matter?: A Wolfcamp Shale Case Study. Presented at the SPE Hydraulic Frac-
turing Technology Conference, The Woodlands, Texas, USA, 3–5 February. SPE-173334-MS. https://doi.org/10.2118/173334-MS.
$$\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon. \tag{A-1}$$
The following second model is a more-complicated version of MLR. It is the basic MLR+IT. The formula is
The third model to be compared with is the model MLR+IT with all quadratic terms, denoted as MLR+Quad. The formula is
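The term sets of the three linear models can be sketched under the usual reading of their names, which is an assumption here: MLR uses the main effects only, MLR+IT adds every pairwise product of two different predictors, and MLR+Quad further adds the square of each predictor. The helper names and the example row are hypothetical.

```python
import math
from itertools import combinations

def mlr_terms(names):
    """Main effects only: one term per predictor."""
    return [(name,) for name in names]

def mlr_it_terms(names):
    """Main effects plus all pairwise interaction terms (assumed reading of MLR+IT)."""
    return mlr_terms(names) + list(combinations(names, 2))

def mlr_quad_terms(names):
    """MLR+IT terms plus the square of every predictor (assumed reading of MLR+Quad)."""
    return mlr_it_terms(names) + [(name, name) for name in names]

def design_row(row, terms):
    """Intercept followed by the product of the variables in each term."""
    return [1.0] + [math.prod(row[v] for v in term) for term in terms]

names = ["x1", "x2", "x3", "x4"]
row = {"x1": 2.0, "x2": 3.0, "x3": 5.0, "x4": 7.0}  # illustrative values

for label, builder in [("MLR", mlr_terms), ("MLR+IT", mlr_it_terms),
                       ("MLR+Quad", mlr_quad_terms)]:
    terms = builder(names)
    print(f"{label:9s} {len(terms):2d} terms -> {design_row(row, terms)}")
```

With four predictors this reading gives 4, 10, and 14 regression terms, respectively, in addition to the intercept.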
Fig. B-1—Scatter-plot matrix for all four predictor variables in Marcellus data: Lat = Latitude, Lon = Longitude, Depth = Well Depth, and Length = Well Length.
Fig. B-2—Scatter-plot matrix for all six predictor variables in Middle Bakken data: Depth = Well Depth, Length = Well Length, Sand = Sand Injection, Water = Water Injection, Lat = Latitude, and Lon = Longitude.
Table B-2—Variance inflation factors (VIFs) for all four predictors in Marcellus data.
Table B-3—Correlation matrix for all six predictors in Middle Bakken data.
Quan Cai is a PhD-degree candidate in the Department of Statistics at Texas A&M University. His research interests include longi-
tudinal data analysis, nonparametric and semiparametric statistics, and statistical learning. Cai holds a BS degree in mathemati-
cal statistics from Zhejiang University, China.
Wei Yu (corresponding author) is a research associate in the Harold Vance Department of Petroleum Engineering at Texas A&M
University. His research interests include reservoir modeling and simulation of shale gas and tight oil production, CO2-enhanced
shale gas-and-oil recovery, assisted history matching, uncertainty quantification, data mining, and nanoparticles enhanced oil
recovery (EOR). Yu has authored or coauthored more than 50 technical papers and holds one patent. He holds a BS degree in
applied chemistry from the University of Jinan in China, an MS degree in chemical engineering from Tsinghua University in China,
and a PhD degree in petroleum engineering from the University of Texas at Austin. Yu is an active member of SPE.
Hwa Chi Liang is an instructional assistant professor in the Department of Statistics at Texas A&M University. Previously, she was an
associate professor in the Department of Mathematics and Statistics at Washburn University. Liang’s research interests include lin-
ear models, Bayesian analysis, and statistical education. She has authored or coauthored more than 15 technical papers. Liang
has been an associate editor in statistics for the Missouri Journal of Mathematical Sciences since 2013. She holds a PhD degree
in statistics from the University of New Mexico.
Jenn-Tai Liang is a professor in the Harold Vance Department of Petroleum Engineering at Texas A&M University and also the
holder of the John E. & Deborah F. Bethancourt endowed professorship. His current research focus is on developing promising uses
of nanotechnology to improve oil recovery in both conventional and unconventional reservoirs. Example applications include in-
depth conformance control, chemical and microbial EOR, flow assurance, and hydraulic-fracturing-fluid cleanup. Liang holds six
US patents, with five more pending for his work in this area. He holds a PhD degree in petroleum engineering from the University of
Texas at Austin. Liang is an SPE Distinguished Member and was selected as an SPE Distinguished Lecturer for the 2015–2016 season.
Suojin Wang is a professor of statistics and epidemiology and biostatistics at Texas A&M University. His research interests include
semi- and nonparametric statistical methodology, missing- and mismeasured-data analyses, asymptotic theory, and applied
statistics. Wang has authored or coauthored more than 150 refereed research papers. He was the editor-in-chief of the Journal
of Nonparametric Statistics from 2007 to 2012. Wang holds a PhD degree in statistics from the University of Texas at Austin. He is
an elected fellow of the American Statistical Association, an elected fellow of the Institute of Mathematical Statistics, and an
elected member of the International Statistical Institute.
Kan Wu is an assistant professor in the Harold Vance Department of Petroleum Engineering at Texas A&M University. Her research
interests include hydraulic fracturing in unconventional reservoirs, proppant transport in complex fracture networks, coupled
geomechanics/fluid-flow modeling, and optimization of well performance from unconventional gas and oil reservoirs. Wu holds
a PhD degree in petroleum engineering from the University of Texas at Austin. She is a member of SPE.