
J189440 DOI: 10.2118/189440-PA Date: 28-October-17 Stage: Page: 1 Total Pages: 18

Development of a Powerful Data-Analysis Tool Using Nonparametric Smoothing
Models To Identify Drillsites in Tight Shale Reservoirs With High Economic Potential
Quan Cai, Wei Yu, Hwa Chi Liang, Jenn-Tai Liang, Suojin Wang, and Kan Wu, Texas A&M University

Summary
The oil-and-gas industry is entering an era of “big data” because of the huge number of wells drilled with the rapid development of
unconventional oil-and-gas reservoirs during the past decade. The massive amount of data generated presents a great opportunity for
the industry to use data-analysis tools to help make informed decisions. The main challenge is the lack of the application of effective
and efficient data-analysis tools to analyze and extract useful information for the decision-making process from the enormous amount
of data available. In developing tight shale reservoirs, it is critical to have an optimal drilling strategy, thereby minimizing the risk of
drilling in areas that would result in low-yield wells. The objective of this study is to develop an effective data-analysis tool capable of
dealing with big and complicated data sets to identify hot zones in tight shale reservoirs with the potential to yield highly productive
wells. The proposed tool is developed on the basis of nonparametric smoothing models, which are superior to the traditional
multiple-linear-regression (MLR) models in both predictive power and the ability to deal with nonlinear, higher-order variable interactions.
This data-analysis tool is capable of handling one response variable and multiple predictor variables. To validate our tool, we used two
real data sets—one with 249 tight oil horizontal wells from the Middle Bakken and the other with 2,064 shale gas horizontal wells from
the Marcellus Shale. Results from the two case studies revealed that our tool not only can achieve much better predictive power than
the traditional MLR models in identifying hot zones in the tight shale reservoirs but also can provide guidance on developing the
optimal drilling and completion strategies (e.g., well length and depth, amount of proppant and water injected). By comparing results from
the two data sets, we found that our tool can achieve model performance with the big data set (2,064 Marcellus wells) with only four
predictor variables that is similar to that with the small data set (249 Bakken wells) with six predictor variables. This implies that, for
big data sets, even with a limited number of available predictor variables, our tool can still be very effective in identifying hot zones
that would yield highly productive wells. The data sets that we have access to in this study contain very limited completion, geological,
and petrophysical information. Results from this study indicate that the data-analysis tool is powerful and flexible
enough to take advantage of any additional engineering and geology data, allowing operators to gain insights into the impact of these
factors on well performance.

Introduction
During the past decade, the technological advancements in horizontal drilling and multistage hydraulic fracturing enabled the
production boom in unconventional oil-and-gas resources worldwide. The US Energy Information Administration (EIA 2013a) reported that the
worldwide technically recoverable shale oil and gas reserves are 345 billion barrels and 7,299 Tcf, respectively. In addition, according
to the EIA forecast in 2014, the share of shale gas production will increase from 40% of total US dry-gas production in 2012 to 53% in
2040 (EIA 2014a) and that of onshore tight oil production will increase from 33% of total lower-48-state onshore oil production to 51%
in 2040 (EIA 2013b).
The vast amount of data generated from drilling and completion activities in unconventional tight oil-and-gas reservoirs has led the
oil-and-gas industry to a “big-data” era (Willigers et al. 2014; Gupta et al. 2014; Grujic et al. 2015; Zhong et al. 2015). As more
well data became available, significant variations in well performance in different areas of a tight shale reservoir were also observed
(Schuetter et al. 2015; Esmaili and Mohaghegh 2016). To be a commercial success in developing a tight shale reservoir, it is necessary
to identify and drill in the most productive areas in the reservoir. However, it remains a major challenge for the industry to distinguish
areas of poor economic potential from areas of high economic potential. Even after numerous wells are drilled, without an effective
quantitative approach to evaluate and analyze the data generated, high uncertainty in well performance of newly drilled wells in the
same general area remains (Willigers et al. 2014).
Physics-based models are often limited in their ability to solve this problem because of the numerical complexity and computational
effort required. Statistical models, on the other hand, rely on analyzing available well data from a reservoir to quantify the correlations
between well performance and completion as well as geologic variables. The results can then be used to identify drillsites with high
economic potential and to develop drilling and completion strategies to maximize well productivity.
The artificial neural network (ANN) has been used in a variety of applications in the oil-and-gas industry; see Esmaili and
Mohaghegh (2016) and Mohaghegh (2016). ANN mimics the biological neural network of a human brain (Pham and Liu 1995; Picton
2000). It matches the output parameter (response) by adjusting the weights assigned to the input parameters (predictors) through an
iterative process. Basically, it is a pattern-recognition method. When the input parameters are properly tuned, it can provide good
predictive power, especially in complex cases. However, it is not very effective in modeling the relationships between the predictors and
the response.
MLR techniques (Neter et al. 1989) have been widely used to determine the influence of various completion and geologic variables
on well performance. MLR builds an empirical model to describe the relationship between predictor variables and an objective response
variable through linear regression. Voneiff et al. (2013) in their study found very weak correlations between the individual predictor

Copyright © 2017 Society of Petroleum Engineers

Original SPE manuscript received for review 10 February 2017. Revised manuscript received for review 10 August 2017. Paper (SPE 189440) peer approved 13 August 2017.

2017 SPE Journal 1


variable and the response variable using 2D linear regression. They then used an MLR model to try to identify the effects of four
completion variables (number of fracture stages, perforation clusters per stage, lateral length, and fluid volume) on the average gas rate
during the first year of production in the Montney Formation. This model closely matched the mean of the observed well performance.
However, it failed to provide a reasonable match to the range of performance of individual wells (Voneiff et al. 2014). Gao and Gao
(2013) used MLR with the multivariate-adaptive-regression-splines (MARS) algorithm to determine the relationship of the early-time
well performance with nine completion and geologic variables in the Eagle Ford Formation. King and Wray (2014) used multivariate
analysis with nonlinear regression to analyze the effects of lateral length, proppant volumes, stage length, proppant type, treatment rate,
and water cut on well performance in the Bakken and Three Forks Formations. However, they did not address the potential issue of
overfitting the data. Also, no attempt was made in their study to use well location (longitude and latitude) as a predictor variable.
LaFollette et al. (2014) used a gradient-boosting method and geographic-information systems (GISs) to analyze well productivity in
the Eagle Ford Formation. They concluded that the well location is an important predictor variable for well productivity, which is
linked to the variation in fundamental reservoir parameters such as shale permeability, thickness, reservoir pressure, and reservoir-fluid
viscosity. Eburi et al. (2014) used the ordination technique called detrended correspondence analysis (DCA) to identify the key
variables affecting well performance in the Haynesville Shale. They reported that subsurface variables are the most significant drivers of
well performance, followed by completion variables.
Lolon et al. (2016) compared MLR without interaction terms (ITs), MLR with ITs determined by Bayesian information criterion
(BIC) and Akaike information criterion (AIC), random forests, and gradient-boosting machine (GBM) to evaluate the relationship
between well parameters and the production in the Bakken and Three Forks Formations. Both random forests and GBM are tree-based
methods (Zhong et al. 2015; Schuetter et al. 2015; Singh 2015). On the basis of cross-validation comparison, they concluded that the
BIC- and AIC-chosen models are simpler and predict well performance better than random forests and GBM. Although MLR
is widely applied to perform data analysis in shale reservoirs, it remains a challenge to choose the right algorithm to avoid the issue of
overfitting and the resulting inaccuracy in model prediction (Schuetter et al. 2015). In addition, MLR is one of the parametric methods
that require the data to meet some restrictive assumptions such as linearity, normality, homogeneity, and independence. It also becomes
more complicated for higher-order variable interactions (Maučec et al. 2015).
In the following sections, we demonstrate how we developed an effective and efficient data-analysis tool using nonparametric
smoothing models to identify hot zones in the tight shale reservoirs and to provide guidance on developing optimal drilling and
completion strategies. Unlike parametric models (e.g., MLR models), the nonparametric models used in this study do not require assumptions
about variable distributions or linear relationships between the predictor variables and the response variable. They have broad
applications in modeling, especially when the data set is large and the relationships between the response and the predictors are complex. This powerful
tool includes three nonparametric smoothing methods: local linear smoother, cubic B-spline smoother, and nonparametric additive
models. The first two models are used to explore the one-on-one correlations between the response variable and individual predictor
variables, whereas the third technique is applied to quantify the correlation between the response variable and all predictor variables.
This correlation can then be used to identify hot zones in the target reservoir and to help develop drilling and completion strategies that
would yield highly productive wells. However, when the data are too sparsely populated, it would become difficult to accurately capture
the nonlinear patterns among variables, and the algorithms for calibrating the smoothing parameters could become unstable. In this
study, we used two real data sets (one with 249 tight oil horizontal wells from the Middle Bakken and the other with 2,064 shale gas
horizontal wells from the Marcellus Shale) to train and validate our tool. Results from the two case studies are presented and
discussed in the second half of the paper.

Methodology
In this section, we introduce the following nonparametric smoothing models used in this study: local linear smoother, cubic B-splines,
and the nonparametric additive models. The models introduced in this section are applied in two case studies in the following sections.
All modeling procedures are implemented in R (Version 3.3.3), a popular free statistical-programming environment. The main
packages used are “mgcv,” “splines,” and “ggplot2.”

Local Linear Smoother. The local linear smoother (Fan 1993) is a nonparametric smoothing method to model the relationship
between predictor variables (usually fewer than three) and a response variable. The general idea is to fit a linear regression locally in a
small region, giving more weight to nearer data points and less weight to farther data points. To illustrate, suppose $x$ is a 1D
predictor variable with $n$ observations $(x_1, \ldots, x_n)^T$ and $y$ is a 1D response variable with $n$ observations $(y_1, \ldots, y_n)^T$.
The goal is to estimate the mapping function $f$ for which $y = f(x) + \varepsilon$, where $\varepsilon$ is a random error. Fig. 1 shows how the
local linear smoother (red curve) fits the sample data points (black dots). The sample data points in Fig. 1 are from simulation. As
illustrated in Fig. 1, when the goal is to estimate the response variable at $x = 7.4$, the observations close to 7.4 are assigned more
weight (dots in the encircled, or lasso, region on the right), and the observations relatively far from 7.4 are assigned less weight
(dots in the encircled region on the left). The true curve (blue dash) is plotted in the graph as well to show the goodness of fit for
the model. Mathematically, the prediction at a point $x_0 \in [\min_i x_i, \max_i x_i]$ is $y_0 = a_1 + a_2 x_0$, where $(a_1, a_2)$ minimizes the
following weighted sum of squared errors:

$$L_1(a_1, a_2) = \sum_{i=1}^{n} K_h(x_0 - x_i)\,(y_i - a_1 - a_2 x_i)^2, \tag{1}$$

where $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$ is a symmetric kernel function with bandwidth $h$. The solution is

$$y_0 = a_1 + a_2 x_0 = (1, x_0)\,[X^T W(x_0) X]^{-1} X^T W(x_0)\, y, \tag{2}$$

where $X^T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}$ and $W(x_0) = \mathrm{diag}[K_h(x_0 - x_i)]$. The kernel function is a
weight function: the closer $x_0 - x_i$ is to zero, the larger the weight; the farther it is from zero, the smaller the weight. The most
frequently used kernel function is the Gaussian kernel, $K(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)$, which is also used in this study.
When the kernel equals 1.0, Eq. 1 reduces to the standard least-square-error loss. Minimizing the standard least-square-error loss
leads to the simple linear-regression method. Therefore, the local linear smoother is a weighted linear-regression method with weight
$K_h(x_0 - x_i)$ for each data point $(x_i, y_i)$ when making a prediction at $x_0$. The bandwidth $h$ controls the smoothness of the fitted
curve. If $h$ is too large, the fitted curve flattens out and cannot discover the hidden pattern of the true function $f$. If $h$ is too
small, the fitted curve is too wiggly, leading to overfitting. The optimal bandwidth can be determined by cross validation (Fan 1993).

Fig. 1—The illustration of local linear smoother fitting the sample data at x = 7.4. The true curve is plotted for baseline reference.
(Horizontal axis: predictor, 0 to 10; vertical axis: response. Annotations: "assign more weights," "assign less weights," and "local
linear fit in a small region." Legend: local linear smoother, true curve.)

The main advantage of the local linear-smoothing estimator is that no strong assumptions are needed to fit the sampled data.
Furthermore, it can capture complicated nonlinear relationships between the predictor variables and the response variable. The main
disadvantage is that when there are more than two predictors, it is difficult to find the optimal multidimensional bandwidth matrix for the
local linear smoother, and the algorithm becomes unstable: the tendency to fall into locally optimal bandwidths increases as the
number of predictor variables increases. In light of this, in this paper, we apply the local linear smoother only to explore the
relationship between a single predictor variable and the response variable.
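The single-predictor case described above can be sketched in a few lines. The study itself was implemented in R; the Python below is only an illustrative sketch, and the function names, the simulated sine data, and the candidate bandwidths are assumptions made for the example rather than anything taken from the paper. It evaluates the closed form of Eq. 2 with a Gaussian kernel and picks a bandwidth by leave-one-out cross validation.

```python
import math
import random

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def local_linear_predict(x0, xs, ys, h):
    """Local linear estimate at x0: minimize the weighted loss of Eq. 1
    and return a1 + a2 * x0 (Eq. 2 written out for one predictor)."""
    # Kernel weights K_h(x0 - x_i) = K((x0 - x_i) / h) / h
    w = [gaussian_kernel((x0 - xi) / h) / h for xi in xs]
    # Entries of X^T W X (s0, s1, s2) and X^T W y (t0, t1)
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, xs))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, xs))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    a1 = (s2 * t0 - s1 * t1) / det   # local intercept
    a2 = (s0 * t1 - s1 * t0) / det   # local slope
    return a1 + a2 * x0

def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    total = 0.0
    for i in range(len(xs)):
        pred = local_linear_predict(xs[i], xs[:i] + xs[i + 1:],
                                    ys[:i] + ys[i + 1:], h)
        total += (ys[i] - pred) ** 2
    return total / len(xs)

# Hypothetical simulated data: a sine curve plus Gaussian noise.
rng = random.Random(0)
xs = [10.0 * i / 59 for i in range(60)]
ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]

# Assumed candidate grid; a real analysis would search more finely.
candidates = [0.1, 0.3, 0.5, 1.0, 3.0, 10.0]
best_h = min(candidates, key=lambda h: loo_cv_score(xs, ys, h))
print("bandwidth chosen by leave-one-out CV:", best_h)
print("estimate at x0 = 7.4:", local_linear_predict(7.4, xs, ys, best_h))
```

A very large bandwidth makes all weights nearly equal, so the estimate approaches an ordinary straight-line fit, which is the flattening behavior described in the text.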

Smoothing Cubic B-Spline. Polynomial regression is an alternative to linear regression when there is a nonlinear pattern in the
sample. However, it suffers from overfitting and low predictive power as the polynomial degree increases, because the real relationship
between the predictor variable and the response variable may not be polynomial over the whole domain. In contrast, the smoothing spline
(Cook and Peters 1981) is a local estimation method that joins piecewise polynomials, one fitted in each small region. The
cutoff points between regions are called knots. The pieces satisfy smoothness conditions at the knots so that the fitted curve is smooth
globally. There is a variety of spline functions with different orders from which to choose. Among all splines, the B-spline of degree
three, called the cubic B-spline, is one of the most widely used. The main advantages of the B-spline are its ability to avoid overfitting
and its flexibility to model data with complex nonlinear relationships between the response variable and the predictors, with no
assumption of normality required. Just as the basis vectors of a vector space V are linearly independent and can be used to construct any
vector in V, the cubic B-spline basis functions are linearly independent, and, for a given set of knots, every cubic spline can be
represented as a linear combination of them. To illustrate this, we plot the cubic B-spline basis over the range [0, 10]
in Fig. 2. With three knots set at 3, 5, and 8 in the interval [0, 10], there are seven linearly independent cubic B-spline basis
functions, shown in different colors in the figure. By calculating the coefficients of the basis functions, we can estimate any unknown
spline curve within the range [0, 10].

Fig. 2—Cubic B-spline basis illustration: In the range [0, 10], manually set three internal knots. There are seven cubic B-spline
basis functions, $b_1$ through $b_7$. (Horizontal axis: 0 to 10; vertical axis: normalized b value, 0 to 1.)

With the assumption that $t_1 < t_2 < \cdots < t_k$ are the user-defined $k$ internal knots, the B-spline basis functions can be expressed
recursively:

$$b_i^1(x) = \begin{cases} 1, & t_i \le x < t_{i+1} \\ 0, & \text{otherwise} \end{cases}$$

$$b_i^p(x) = \frac{x - t_i}{t_{i+p-1} - t_i}\, b_i^{p-1}(x) + \frac{t_{i+p} - x}{t_{i+p} - t_{i+1}}\, b_{i+1}^{p-1}(x), \tag{3}$$

where $b_i^p(x)$ is the $i$th B-spline basis function of order $p$ (polynomial degree $p - 1$), and the knot sequence is extended with
repeated boundary knots at both ends of the range. The cubic B-spline basis is $b_i^4(x)$ for all possible $i$, denoted by $b_i(x)$ hereafter.
The number of basis functions is determined by the number of knots (number of basis functions = number of internal
knots + degree + 1). In the previous example, because we have three internal knots for the cubic (degree = 3) B-spline, there are
3 + 3 + 1 = 7 basis functions.
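The recursion and the basis count can be checked numerically. The sketch below is an illustration only: it uses the standard Cox-de Boor form of the recursion, in which a cubic spline has order 4 (degree 3), and it assumes the knot vector is padded with four copies of each boundary value, a common convention that the text does not spell out. The function names are hypothetical.

```python
def bspline_basis(i, order, t, x):
    """Value at x of the i-th B-spline basis function of the given order
    (order 1 = piecewise constant, order 4 = cubic) on knot vector t."""
    if order == 1:
        # Half-open intervals avoid double counting at interior knots.
        if t[i] <= x < t[i + 1]:
            return 1.0
        # Close the last non-empty interval so the right boundary is covered.
        if x == t[-1] and t[i] < t[i + 1] == t[-1]:
            return 1.0
        return 0.0
    left = right = 0.0
    d1 = t[i + order - 1] - t[i]
    if d1 > 0:  # terms over zero-width knot spans are dropped (0/0 := 0)
        left = (x - t[i]) / d1 * bspline_basis(i, order - 1, t, x)
    d2 = t[i + order] - t[i + 1]
    if d2 > 0:
        right = (t[i + order] - x) / d2 * bspline_basis(i + 1, order - 1, t, x)
    return left + right

def cubic_basis_row(x, internal_knots, lo, hi):
    """All cubic B-spline basis values at x for the given internal knots."""
    order = 4
    # Assumed padding: repeat each boundary value `order` times.
    t = [lo] * order + list(internal_knots) + [hi] * order
    n_basis = len(t) - order  # = number of internal knots + degree + 1
    return [bspline_basis(i, order, t, x) for i in range(n_basis)]

# Same setup as the illustration in the text: knots at 3, 5, 8 on [0, 10].
row = cubic_basis_row(4.0, [3.0, 5.0, 8.0], 0.0, 10.0)
print("number of basis functions:", len(row))
print("basis values at x = 4:", [round(v, 4) for v in row])
print("sum of basis values:", sum(row))
```

Stacking such rows for every observed predictor value gives the design matrix on which the response is regressed.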
When the internal knots are given, the cubic B-spline basis functions are known. We can regress the response variable $y$ on the basis
functions to estimate the smooth function $f$ in the equation $y = f(x) + \varepsilon$. Suppose that the cubic B-spline basis functions are
$\{b_1, \ldots, b_m\}$. Then, we can model a univariate smooth function $f$ by

$$y = f(x) + \varepsilon = \sum_{j=1}^{m} b_j(x)\,\beta_j + \varepsilon. \tag{4}$$

Now, the problem of estimating $f$ is transformed to the problem of estimating $\beta_j$, $j = 1, \ldots, m$, which can be solved with the
normal equations. However, the choice of knots controls the smoothness of the function, and choosing them is both subjective and
computationally expensive. For this reason, the penalized smoothing-spline method, which is not sensitive to knot selection, can be used in
the B-spline regression (Hastie et al. 2005). In this case, the knots can be set at the observed predictor values or, when the sample
size is large, at a random sample of predictor values. The method requires estimating only a single smoothing parameter $\gamma$. To
estimate a univariate smooth function $f(x) \approx \sum_{j=1}^{m} b_j(x)\,\beta_j$, we minimize the following penalized sum of squared errors:

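$$\begin{aligned} L_2(\beta_1, \ldots, \beta_m, \gamma) &= \sum_{i=1}^{n} [y_i - f(x_i)]^2 + \gamma \int [f''(x)]^2\,dx \\ &\approx \sum_{i=1}^{n} \Big[ y_i - \sum_{j=1}^{m} b_j(x_i)\,\beta_j \Big]^2 + \gamma \int [f''(x)]^2\,dx, \end{aligned} \tag{5}$$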

where $\gamma$ controls the smoothness of the estimated function. Writing it in matrix form,

$$L_2(\beta, \gamma) = (Y - B\beta)^T (Y - B\beta) + \gamma\,\beta^T \Omega\, \beta, \tag{6}$$

where $\beta = (\beta_1, \ldots, \beta_m)^T$, $Y = (y_1, \ldots, y_n)^T$, $B = [B_{ij}] = [b_j(x_i)]$, and $\Omega = [\Omega_{jk}] = \left[\int b_j''(x)\, b_k''(x)\,dx\right]$.
The minimizer of $L_2(\beta, \gamma)$ over $\beta$ can be written as

$$\hat{\beta} = (B^T B + \gamma\Omega)^{-1} B^T Y, \qquad \hat{Y} = B\hat{\beta} = S_\gamma Y, \tag{7}$$

where $S_\gamma = B\,(B^T B + \gamma\Omega)^{-1} B^T$ is known as the smoother matrix.


To find the optimal smoothing parameter $\gamma$, we apply the generalized cross validation (GCV) (Golub et al. 1979):

$$\mathrm{GCV}(\hat{f}_\gamma) = n^{-1} \sum_{i=1}^{n} \left[ \frac{y_i - \hat{f}(x_i)}{1 - \bar{S}} \right]^2, \tag{8}$$

where $\bar{S} = n^{-1} \sum_{i=1}^{n} S_\gamma(i, i)$. Here, $S_\gamma(i, i)$ is the $i$th diagonal element of the smoother matrix. The optimal value of
$\gamma$ is the one that minimizes $\mathrm{GCV}(\hat{f}_\gamma)$. The GCV outperforms cross validation in the sense that it underweights the
leverage points and outliers, so it is a more robust and stable model-selection method.
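As a short worked step, the closed-form minimizer in Eq. 7 follows from setting the gradient of Eq. 6 to zero. The lines below are a sketch that writes the coefficient vector as $\beta$, the smoothing parameter as $\gamma$, and the penalty matrix as $\Omega$, and that assumes $B^T B + \gamma\Omega$ is invertible (an assumption added here; $\Omega$ is symmetric by its definition):

```latex
\begin{aligned}
\frac{\partial L_2}{\partial \beta}
  &= -2\,B^T (Y - B\beta) + 2\gamma\,\Omega\beta = 0 \\
\Rightarrow\quad (B^T B + \gamma\Omega)\,\hat{\beta} &= B^T Y \\
\Rightarrow\quad \hat{\beta} &= (B^T B + \gamma\Omega)^{-1} B^T Y,
\qquad
\hat{Y} = B\hat{\beta} = B\,(B^T B + \gamma\Omega)^{-1} B^T\, Y .
\end{aligned}
```

With $\gamma = 0$ this reduces to the ordinary least-squares solution on the spline basis; as $\gamma$ grows, the penalty pulls the fit toward functions with zero second derivative, that is, toward a straight line. The $n \times n$ matrix that maps $Y$ to $\hat{Y}$ is the one whose average diagonal element enters Eq. 8.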
To estimate a bivariate smooth function $f(x_1, x_2)$ (Duchamp and Werner 2003), suppose that we have a spline basis of functions
$b_{1j_1}(x_1)$, $j_1 = 1, \ldots, m_1$, for the first predictor variable and a basis of functions $b_{2j_2}(x_2)$, $j_2 = 1, \ldots, m_2$, for the second
predictor variable. Then, the $m_1 \times m_2$ dimensional tensor-product basis is defined by

$$g_{j_1 j_2}(x_1, x_2) = b_{1j_1}(x_1)\, b_{2j_2}(x_2), \quad j_1 = 1, \ldots, m_1;\ j_2 = 1, \ldots, m_2. \tag{9}$$

The bivariate smooth function $f(x_1, x_2)$ then can be approximately represented as

$$f(x_1, x_2) \approx \sum_{j_1=1}^{m_1} \sum_{j_2=1}^{m_2} g_{j_1 j_2}(x_1, x_2)\, \beta_{j_1 j_2}. \tag{10}$$

Similarly, to avoid sensitivity to knot selection, we minimize the following penalized sum of squared errors:

$$L_3(\beta, \gamma) = \sum_{i=1}^{n} \Big[ y_i - \sum_{j_1=1}^{m_1} \sum_{j_2=1}^{m_2} g_{j_1 j_2}(x_{i1}, x_{i2})\, \beta_{j_1 j_2} \Big]^2 + \gamma\, J(f), \tag{11}$$

where $(x_{i1}, x_{i2})$ are the predictor values of the $i$th observation and
$J(f) = \iint \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1\, dx_2$.
Similarly, the GCV method can be used to select the optimal value of $\gamma$. With the estimated $\gamma$, we get estimates of
$\beta_{j_1 j_2}$, $j_1 = 1, \ldots, m_1$, $j_2 = 1, \ldots, m_2$. Then, the bivariate smooth function $f(x_1, x_2)$ is estimated.
Similar to the local linear smoothing method, the smoothing cubic B-spline method is generally applicable when the function $f$ has
fewer than three arguments. However, it can be applied efficiently within nonparametric additive models (NAMs) to estimate the
relationship between multiple predictor variables (three or more) and the response variable.


NAMs. The NAM (Wood 2006) is especially useful when more than two predictors are nonlinearly related to the response variable
and the response variable is not normally distributed. It has the form

$$y = \mu + \sum_{k=1}^{p} f_k(x_k) + \varepsilon, \tag{12}$$

where $\mu$ is the mean of the response variable and $p$ is the number of predictor variables. Each $f_k$ is a 1D or 2D function of the 1D
or 2D predictor variable $x_k$, and $\varepsilon$ is the error term with mean zero.
For example, if all $f_k$ are univariate functions, then the NAM can be written with the linear combination of spline bases as

$$y = \mu + \sum_{k=1}^{p} \sum_{j=1}^{m_k} b_j(x_k)\, \beta_{kj} + \varepsilon. \tag{13}$$

When some $f_k$ are bivariate functions, the NAM can be expressed with the combination of univariate spline bases and bivariate
tensor-product bases. The goal is to estimate all $\beta_{kj}$ for $k = 1, \ldots, p$, $j = 1, \ldots, m_k$.
The NAM can be estimated by applying smoothing-spline methods iteratively with the back-fitting algorithm. In each iteration step,
a cubic B-spline model is estimated. The detailed estimation algorithm is as follows. We set the value of $\mu$ to the sample
mean $\bar{y}$ of the $y_i$. In the initial step, we set starting values for $\beta_{kj}$, $k = 1, \ldots, p$, $j = 1, \ldots, m_k$, or equivalently, for all
$f_k$, $k = 1, \ldots, p$. For simplicity, we can set $f_k = 0$ for all $k$. The second step is to update each $f_k$ sequentially with all other
functions $f_l$, $l \ne k$, fixed. Suppose that we are at the stage of updating the estimate of $f_1$. In this case, we use the smoothing cubic
B-spline method described in the previous section to fit $\tilde{y} = y - \mu - \sum_{l \ne 1} \hat{f}_l$ on $x_1$, where all $\hat{f}_l$ are current estimates
of $f_l$. After getting the optimal smoothing parameter by GCV, we have the estimated $\hat{f}_1$, which is the one-step update for $f_1$. The
one-step update for $f_l$, $l = 2, \ldots, p$, follows the same procedure as the estimate of $\hat{f}_1$. With all updated $f_k$, $k = 1, \ldots, p$, we
check convergence by calculating the differences for $f_k$, $k = 1, \ldots, p$, between the previous step and the updated step. If the
differences are all under the preset threshold, the updated $f_k$, $k = 1, \ldots, p$, are our final estimates. Otherwise, $\hat{f}_k$, $k = 1, \ldots, p$,
are set to be the initial values for $f_k$, $k = 1, \ldots, p$, in the next updating step. The flow chart of the back-fitting algorithm is
shown in Fig. 3.

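Fig. 3—The back-fitting algorithm of the nonparametric additive model. Flow-chart steps: (1) set $\hat{\mu} = \bar{y}$ and initial guesses for
$f_k$, $k = 1, \ldots, p$; (2) for $k = 1, \ldots, p$, calculate $\tilde{y} = y - \hat{\mu} - \sum_{l \ne k} f_l$; (3) apply a cubic B-spline fit to $\tilde{y}$ on $x_k$ to get a
new $f_k$; (4) update $f_k = f_k - \frac{1}{n} \sum_{l=1}^{n} f_k(x_{kl})$; (5) check convergence, returning to the update step if not converged; (6) get
the estimate of $f_k$, $k = 1, \ldots, p$.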
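The loop in the flow chart can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation (which used R): to stay short and dependency-free, the inner smoother is a fixed-bandwidth local linear smoother swapped in for the GCV-tuned penalized cubic B-spline, and the data, bandwidth, tolerance, and function names are all hypothetical.

```python
import math
import random

def local_linear(x0, xs, ys, h):
    """Stand-in smoother: local linear fit with a Gaussian kernel at x0.
    (The kernel's normalizing constant cancels in weighted least squares.)"""
    w = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, xs))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, xs))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    return ((s2 * t0 - s1 * t1) + (s0 * t1 - s1 * t0) * x0) / det

def backfit(X, y, h=0.2, tol=1e-6, max_iter=50):
    """Back-fitting for y = mu + sum_k f_k(x_k) + error.
    X is a list of p predictor columns; returns mu and each f_k at the data."""
    n, p = len(y), len(X)
    mu = sum(y) / n                       # step 1: mu-hat = sample mean
    f = [[0.0] * n for _ in range(p)]     # step 1: initial guesses f_k = 0
    for _ in range(max_iter):
        max_change = 0.0
        for k in range(p):
            # step 2: partial residual with all other components held fixed
            resid = [y[i] - mu - sum(f[l][i] for l in range(p) if l != k)
                     for i in range(n)]
            # step 3: smooth the partial residual against x_k
            new = [local_linear(X[k][i], X[k], resid, h) for i in range(n)]
            # step 4: centre f_k so that mu stays identifiable
            shift = sum(new) / n
            new = [v - shift for v in new]
            max_change = max(max_change,
                             max(abs(a - b) for a, b in zip(new, f[k])))
            f[k] = new
        if max_change < tol:              # step 5: convergence check
            break
    return mu, f

# Hypothetical noise-free additive data with two predictors.
rng = random.Random(1)
n = 80
x1 = [rng.uniform(0.0, 3.0) for _ in range(n)]
x2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y = [1.0 + math.sin(a) + b * b for a, b in zip(x1, x2)]

mu, f = backfit([x1, x2], y)
fitted = [mu + f[0][i] + f[1][i] for i in range(n)]
rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, fitted)) / n)
print("in-sample RMSE:", rmse)
```

Because each update only ever smooths against one predictor at a time, the procedure avoids the multidimensional bandwidth or knot selection that makes direct multivariate smoothing unstable.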

In theory, when applying multiple regression models, the predictor variables need to be independent. Unfortunately, this is rarely
possible with real-world data. Multicollinearity exists when some predictors can be predicted from a linear combination of
other predictors. This redundancy among predictors causes inconsistency in the coefficient estimates. The presence and severity of
multicollinearity can be measured by the variance inflation factor (VIF), which measures how much the variance of each regression
coefficient is inflated by the presence of the other predictors in a linear model. The larger the VIF value, the more severe the
multicollinearity. A VIF value close to 1.0 indicates almost no correlation among predictor variables. As long as multicollinearity is not
severe (VIF < 5), we can safely apply the NAM. (See Appendix B for more-detailed discussions.)
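As an illustration of the VIF check, the sketch below uses the standard identity that the VIFs are the diagonal elements of the inverse of the predictors' correlation matrix, which is equivalent to computing 1/(1 - R²) from regressing each predictor on the others. The predictor values and function names are hypothetical, and this is not necessarily how the authors computed VIF.

```python
import math

def correlation_matrix(cols):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(cols[0])
    centred = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centred]
    p = len(cols)
    return [[sum(a * b for a, b in zip(centred[i], centred[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]

def invert(m):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    p = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(p)]
         for i, row in enumerate(m)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        d = a[c][c]
        a[c] = [v / d for v in a[c]]
        for r in range(p):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [row[p:] for row in a]

def vif(cols):
    """VIF of each predictor = diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(cols))
    return [inv[j][j] for j in range(len(cols))]

# Hypothetical predictor columns (not data from the study).
depth = [9700.0, 9900.0, 10100.0, 10300.0, 10500.0, 10700.0]
length = [9000.0, 9400.0, 9100.0, 9600.0, 9300.0, 9800.0]
sand = [150.0, 310.0, 220.0, 400.0, 180.0, 350.0]

for name, value in zip(["depth", "length", "sand"], vif([depth, length, sand])):
    # Threshold of 5 taken from the text above.
    print(name, round(value, 2), "severe" if value >= 5 else "acceptable")
```

With only two predictors the formula collapses to 1/(1 - r²), where r is their correlation, which gives a quick hand check.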

Case Study I—Middle Bakken


We now consider our first case study. The Bakken Formation, with multiple oil-bearing layers in the Williston Basin, is one of the most
productive tight oil reservoirs in North America. The Middle Bakken is one of the primary oil-producing layers. The United
States Geological Survey (USGS) reported that the Middle Bakken has an estimated average oil resource of 3.65 billion bbl (USGS
2013). In this study, we focus on a small data set of 249 tight oil horizontal wells from the Middle Bakken. Six predictor variables were
selected and investigated: well depth, well length, sand injection per foot, water injection per foot, longitude, and latitude. Well depth
refers to the true vertical depth, and well length represents the perforated lateral length from the first perforation cluster to the last
perforation cluster of the wellbore. They are slightly correlated but not to the degree of multicollinearity. The response variable is the
maximum oil-flow rate within 9 months, denoted as max oil-flow rate hereafter.
To begin, we first explore the distribution of the response variable. Unlike other multiple regression models, the NAM does not
require the response variable to be normally distributed. All we need is a response variable with a symmetric distribution,
for which the mean accurately describes the center of the data (mean ≈ median). However, the histogram in Fig. 4a
shows that the response variable (maximum oil-flow rate) is heavily right-skewed. We can either transform the response variable to
correct the skewness or use a generalized model from the exponential family. It turns out that the max oil-flow rate
becomes fairly symmetric after a natural-log transformation (Fig. 4b), with a mean and median of 2.13 and 2.08, respectively. The
Shapiro-Wilk test gives a p-value of 0.03 for the log-transformed data, which is less than 0.05, indicating that it failed
the normality test. As discussed earlier in this paragraph, even though the transformed distribution failed the normality test, we can
still safely apply the NAM. Hence, we use the log-transformed max oil-flow rate as the new response variable in the model and
transform it back when making model predictions.
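The symmetry check and the back-transformation can be illustrated with a few lines. The rates below are made-up values chosen to be right-skewed, not data from the study, and `back_transform` is a hypothetical helper name.

```python
import math
import statistics

# Hypothetical right-skewed max oil-flow rates (MSTB/month).
rates = [2.1, 3.4, 4.0, 5.2, 6.3, 7.9, 8.8, 11.5, 16.0, 33.0]
log_rates = [math.log(r) for r in rates]

# For a symmetric distribution the mean and median nearly coincide.
raw_gap = statistics.mean(rates) - statistics.median(rates)
log_gap = statistics.mean(log_rates) - statistics.median(log_rates)
print("mean - median, raw scale:", round(raw_gap, 3))
print("mean - median, log scale:", round(log_gap, 3))

def back_transform(pred_log):
    """Map a prediction made on the natural-log scale back to MSTB/month."""
    return math.exp(pred_log)

# A model fitted to log_rates predicts on the log scale; exponentiate to report.
print("back-transformed mean log rate:",
      round(back_transform(statistics.mean(log_rates)), 3))
```

One point to keep in mind: exponentiating the mean of the logs gives the geometric mean of the original values, which lies below the arithmetic mean for skewed data, so back-transformed predictions are closer to a typical (median-like) rate than to the arithmetic average.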

Fig. 4—Histogram and log-transformed histogram of the maximum oil-flow rate for Middle Bakken data: (a) Distribution of max
oil-flow rate (horizontal axis: max oil-flow rate, MSTB/month; vertical axis: density); (b) distribution of max oil-flow rate with log
transformation (horizontal axis: log[max oil-flow rate (MSTB/month)]; vertical axis: density).

After the transformation of the response variable, we explore the marginal relationship between each predictor and the response.
Because longitude and latitude jointly determine the location, we treat these two as one predictor that represents the well-location
information, as shown in Fig. 5.

Fig. 5—Maximum oil-flow rate within a 9-month period for Middle Bakken data. (Map of wells by longitude and latitude, with a scale
in MSTB/month.)

Figs. 6 through 9 show the scatter plots (black dots), local linear smoothing estimators (blue curves), and 95% pointwise confidence
bands (gray areas) between each predictor variable and the log-transformed max oil-flow rate. Fig. 6 shows the relationship between the
max oil-flow rate and the well depth. The max oil-flow rate increases slowly as the well depth increases from 9,700 to 10,600 ft. It then
dips slightly and starts to increase again as the wells go deeper, from 10,600 to 10,800 ft. Fig. 7 shows the relationship between the
max oil-flow rate and the well length. The max oil-flow rate remains flat when the well length is shorter than 9,200 ft and increases
sharply as the well length increases from 9,200 to 9,500 ft. For wells longer than 9,500 ft, the increase in the max oil-flow rate slows
down but remains significant. Figs. 8 and 9 show the relationship between the max oil-flow rate and the amount of sand and water
injected, respectively. As shown in Fig. 8, the max oil-flow rate increases almost linearly with the amount of sand injected. The
relationship between the max oil-flow rate and the amount of water injected is, however, more complicated. Fig. 9 shows a general trend
of increasing max oil-flow rate with an increasing amount of water injected when the amount of water injected exceeds 104 bbl/ft. A very
sharp increase in the max oil-flow rate can be seen between 95 and 120 bbl/ft of water injected. In all figures, the confidence band is
wider where the data density is low and narrower where the data density is high. Figs. 6, 7, and 9 clearly show complicated, nonlinear
relationships between the response variable and the predictor variables. These findings suggest that nonparametric models are more
suitable for this data set than MLR models.
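
For readers who want to see what a local linear smoothing estimator computes at a single point, the following is a minimal sketch. It assumes a Gaussian kernel and a hand-picked bandwidth `h`; the kernel, the bandwidth, and the example data are illustrative assumptions, not the settings used for Figs. 6 through 9.

```python
import math


def local_linear(x, y, x0, h):
    """Local linear estimate of E[y | x = x0] using a Gaussian kernel with bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # kernel weight for observation i
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    # Intercept of the kernel-weighted least-squares line centered at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Hypothetical data: an exactly linear relationship y = 2x + 1.
x = [i / 10 for i in range(21)]
y_line = [2 * xi + 1 for xi in x]
fit_line = local_linear(x, y_line, 0.55, h=0.3)
```

Because the estimator fits a weighted straight line around each evaluation point, it reproduces exactly linear data without bias; curvature in the smoothed curve therefore reflects the data rather than the method.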

[Fig. 6 graphic: scatter plot with smoothing curve; x-axis: well depth (ft); y-axis: max oil-flow rate (MSTB/month).]

Fig. 6—Scatter plot and its local linear smoothing of max oil-flow rate with well depth for Middle Bakken data.

[Fig. 7 graphic: scatter plot with smoothing curve; x-axis: well length (ft); y-axis: max oil-flow rate (MSTB/month).]

Fig. 7—Scatter plot and its local linear smoothing of max oil-flow rate with well length for Middle Bakken data.

[Fig. 8 graphic: scatter plot with smoothing curve; x-axis: sand (lbm/ft); y-axis: max oil-flow rate (MSTB/month).]

Fig. 8—Scatter plot and its local linear smoothing of max oil-flow rate with sand injection for Middle Bakken data.

The next step is to determine the relative importance of each predictor variable to the response variable. We apply GBM to find the
relative variable importance. GBM is a boosting method that resamples the data set several times to generate results that form a
weighted average of the resampled data set. It is rooted in growing tree-based models in a greedy, forward stage-wise approach that
iteratively minimizes the sum of squared errors. Similar to local linear smoothing or smoothing splines, it makes no assumptions
about the distribution of the data. The relative importance of a predictor variable is calculated from the number of times it is
selected for splitting, weighted by the impurity improvement of each split and averaged over all trees. As shown in Fig. 10, well length
is the most important variable (42.6%), followed by the amount of water injected (25.6%), well depth (11.8%), and the amount of sand
injected (10.4%). In this case, well location (latitude and longitude combined) has the least relative importance (9.7%) to the max
oil-flow rate.
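
The importance calculation described above can be sketched as follows. The split records are hypothetical, and the function is a simplified illustration of "count each split, weight it by its impurity improvement, average over trees, and normalize to 100%"; it is not the GBM implementation used in the study.

```python
from collections import defaultdict


def relative_importance(trees):
    """Relative importance (%) per feature.

    trees: list of trees; each tree is a list of (feature, improvement) pairs,
    one pair per split in that tree.
    """
    totals = defaultdict(float)
    for splits in trees:
        for feature, improvement in splits:
            totals[feature] += improvement  # each split counts, weighted by its gain
    averaged = {f: v / len(trees) for f, v in totals.items()}  # average over trees
    scale = 100.0 / sum(averaged.values())
    return {f: v * scale for f, v in averaged.items()}


# Hypothetical split records for three small trees.
trees = [
    [("well_length", 4.0), ("water", 2.0)],
    [("well_length", 3.0), ("depth", 1.0)],
    [("water", 2.0), ("well_length", 3.0), ("sand", 1.0)],
]
importance = relative_importance(trees)
```

With these made-up records, `well_length` receives the largest share because it is selected most often and with the largest improvements.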

[Fig. 9 graphic: scatter plot with smoothing curve; x-axis: water (bbl/ft); y-axis: max oil-flow rate (MSTB/month).]

Fig. 9—Scatter plot and its local linear smoothing of max oil-flow rate with water injection for Middle Bakken data.

[Fig. 10 graphic: bar chart of the influence of important parameters (%): well length 42.6; water injection 25.6; well depth 11.8; sand injection 10.4; latitude 6.6; longitude 3.1.]

Fig. 10—Order of relative importance of six predictor variables on the basis of the gradient-boosting method for Middle Bakken
data.

From the exploratory data analysis, we have the following findings: (1) the location variable is a bivariate predictor consisting of
longitude and latitude, which should be modeled with a bivariate function; (2) well length, well depth, and water injection are
nonlinearly related to the well performance; and (3) the max oil-flow rate is heavily right-skewed, whereas the log transformation makes
the data more symmetrically distributed. These findings suggest that the common MLR models are not suitable, because the predictors are
nonlinearly related to the response variable and the response variable is not normally distributed, whereas the NAM does not require
these assumptions. Therefore, we decided to use the NAM with the cubic B-spline to model the relationship between log-transformed well
performance and the location (longitude, latitude), well length, well depth, sand injection, and water injection.
Because of the small sample size, we use tenfold cross validation to evaluate the model. We randomly split the data into 10 folds;
each time, nine folds are used for training the model and the remaining fold is used to test the goodness of the trained model. The
model is thus evaluated 10 times, and its goodness and accuracy are assessed by averaging these 10 test results. To formulate the model,
let $y$ be the well performance and $x_1, x_2, x_3, x_4, x_5, x_6$ be well length, well depth, water injection, sand injection,
longitude, and latitude, respectively. The NAM has the form

$$\log(y) = \mu + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_4(x_4) + f_5(x_5, x_6) + \varepsilon, \tag{14}$$

where $f_1$, $f_2$, $f_3$, and $f_4$ are four univariate smooth curves and $f_5$ is a bivariate smooth surface. The model is estimated
by minimizing the following penalized sum of squared errors:

$$L_4(\beta_1, \ldots, \beta_m, \gamma) = \sum_{i=1}^{n} \left[\log(y_i) - \mu - f_1(x_{1i}) - f_2(x_{2i}) - f_3(x_{3i}) - f_4(x_{4i}) - f_5(x_{5i}, x_{6i})\right]^2 + \sum_{k=1}^{4} \gamma_k \int \left[f_k''(x_k)\right]^2 dx_k + \gamma_5 J|f_5|, \tag{15}$$

where

$$f_k(x_{ki}) = \sum_{j=1}^{m_k} b_{kj}(x_{ki})\,\beta_{kj}, \quad k = 1, 2, 3, 4, \tag{16}$$

$$f_5(x_{5i}, x_{6i}) = \sum_{j=1}^{m_5} \sum_{l=1}^{m_6} b_{5j}(x_{5i})\, b_{6l}(x_{6i})\,\beta_{5jl}, \tag{17}$$

$$J|f_5| = \iint \left[\left(\frac{\partial^2 f_5}{\partial x_5^2}\right)^2 + 2\left(\frac{\partial^2 f_5}{\partial x_5\,\partial x_6}\right)^2 + \left(\frac{\partial^2 f_5}{\partial x_6^2}\right)^2\right] dx_5\, dx_6, \tag{18}$$

$$\beta_k = (\beta_{k1}, \ldots, \beta_{k m_k}), \quad k = 1, 2, 3, 4. \tag{19}$$
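
To make the role of the univariate penalty terms more concrete, the following minimal sketch approximates the integral of the squared second derivative of a function numerically. The midpoint rule, the central second differences, and the two test functions are assumptions chosen for illustration; a B-spline implementation would normally evaluate this penalty in closed form from the basis.

```python
def roughness_penalty(f, a, b, n=200):
    """Approximate the integral of [f''(x)]^2 over [a, b].

    Uses a midpoint rule with n cells and a central second difference
    of half-cell width at each midpoint.
    """
    h = (b - a) / n
    d = h / 2.0
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        second = (f(x + d) - 2.0 * f(x) + f(x - d)) / (d * d)
        total += second * second * h
    return total


# A straight line has no curvature, so its penalty is (numerically) zero.
penalty_line = roughness_penalty(lambda x: 3.0 * x + 1.0, 0.0, 1.0)
# For f(x) = x^2 the second derivative is 2, so the penalty on [0, 1] is 4.
penalty_quad = roughness_penalty(lambda x: x * x, 0.0, 1.0)
```

The penalty leaves linear trends unpenalized and grows with curvature, which is why larger smoothing parameters pull each fitted curve toward a straight line.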

The knots are selected at the observed predictor values: we pick log(n) knots at equally spaced quantiles. The penalized sum of squared
errors $L_4$ can be minimized by the back-fitting algorithm described in Fig. 3. The smoothing parameters are determined by GCV.
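
One possible reading of the knot-selection rule above is sketched below. Rounding log(n) to the nearest integer, using only interior quantile levels, and the synthetic well lengths are all assumptions; the text does not specify these details.

```python
import math


def quantile_knots(values, n_knots=None):
    """Pick knots at equally spaced quantile levels, snapped to observed values."""
    ordered = sorted(values)
    n = len(ordered)
    if n_knots is None:
        # Assumption: log(n) is the natural log, rounded to the nearest integer.
        n_knots = max(1, round(math.log(n)))
    knots = []
    for j in range(1, n_knots + 1):
        level = j / (n_knots + 1)          # interior quantile levels only
        index = min(n - 1, int(level * n))
        knots.append(ordered[index])       # a knot is always an observed value
    return knots


# Synthetic well lengths (ft) for 249 wells; not the study data.
lengths = [5000 + 28 * i for i in range(249)]
knots = quantile_knots(lengths)
```

Placing knots at quantiles rather than at equally spaced values puts more knots where the data are dense, which matches the observation that confidence bands are narrowest there.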
In Table 1, we compare the performance of the NAM with that of MLR, MLR+IT, and MLR with interaction and quadratic terms
(MLR+Quad). The comparison is based on the following criteria: AIC, BIC, $R^2$, adjusted $R^2$ (Adj $R^2$), and the cross-validation
mean squared error (CV.MSE). The CV.MSE is used to evaluate the prediction power, whereas the other criteria are used to evaluate the
explanation power of the models. As shown in Table 1, all criteria except BIC favor the NAM, which suggests that the NAM has better
prediction and explanation power than the other models.
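
The tenfold cross-validation behind CV.MSE can be sketched as follows. A simple straight-line fit stands in for the NAM and the MLR variants, and the data are synthetic; the function names, the seed, and the round-robin fold assignment are illustrative assumptions.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mse(x, y, fit, predict, k=10, seed=0):
    """Average of the k held-out mean squared errors."""
    fold_mse = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(y[i] - predict(model, x[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k


def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in for the models compared in Table 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return my - slope * mx, slope


def predict_line(model, x0):
    intercept, slope = model
    return intercept + slope * x0


folds = kfold_indices(249)
x = [i / 248 for i in range(249)]
y = [1.0 + 2.0 * xi for xi in x]   # noiseless synthetic response
score = cv_mse(x, y, fit_line, predict_line)
```

With 249 wells, nine folds hold 25 wells and one holds 24, so every well is used for testing exactly once.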

Model       AIC      BIC      R^2    Adj R^2   CV.MSE
NAM         139.17   246.18   0.70   0.64      4.45
MLR         202.20   227.61   0.44   0.42      4.76
MLR+IT      181.43   254.48   0.58   0.52      5.22
MLR+Quad    178.38   270.49   0.61   0.54      5.58

Table 1—Comparison of NAM and linear-regression models with all six predictors for Middle Bakken
data.

If we use only four predictors (longitude, latitude, well depth, and well length, without water and sand injection), Table 2 shows that
the NAM is still the best model; with the values reported in Table 2, it leads on every criterion, including BIC. A comparison of model
performance with four and six predictors (Tables 1 and 2) clearly shows that the models with six predictors outperform those with only
four. This indicates that making more predictors available should result in better predictive power for the model.

Model       AIC      BIC      R^2    Adj R^2   CV.MSE
NAM         188.39   245.50   0.53   0.49      4.85
MLR         235.57   254.63   0.30   0.29      5.34
MLR+IT      237.96   276.07   0.34   0.30      5.43
MLR+Quad    232.59   283.40   0.39   0.33      6.47

Table 2—Comparison of NAM and linear-regression models with four predictors (without water and
sand injection as predictors) for Middle Bakken data.

Figs. 11 through 13 are contour plots that can be used to visualize the relationship between the predicted max oil-flow rate and two
closely related predictor variables. These figures were generated by keeping the other predictor variables at their respective median
values. The warmer the color in the contour plots, the higher the max oil-flow rate. As shown in Fig. 11, the hot-zone location with the
potential to yield highly productive wells can be easily identified. This plot can also provide an estimate of the predicted oil-flow
rate at a given well location. This information can be very useful in choosing future drillsites with high economic potential. From
Fig. 12, we can identify the well length and depth that would yield the highest well productivity at a given location in the reservoir.
Fig. 13 can be used to determine the amounts of water and proppant injected that would result in the highest well productivity. This
information can be very useful in developing the optimal drilling and completion strategies.
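
The construction of such a contour plot can be sketched as a prediction grid: two predictors are varied over their observed ranges while the others are held at their medians, and log-scale predictions are exponentiated. The data values and `toy_predict_log` below are hypothetical stand-ins for the study data and the fitted NAM.

```python
import math
from statistics import median


def contour_grid(data, var_x, var_y, predict_log, n=5):
    """Vary two predictors over their observed ranges; hold the others at their medians."""
    fixed = {name: median(col) for name, col in data.items()
             if name not in (var_x, var_y)}

    def axis(col):
        lo, hi = min(col), max(col)
        return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

    xs, ys = axis(data[var_x]), axis(data[var_y])
    grid = []
    for yv in ys:
        row = []
        for xv in xs:
            point = dict(fixed, **{var_x: xv, var_y: yv})
            row.append(math.exp(predict_log(point)))  # back-transform from the log scale
        grid.append(row)
    return xs, ys, grid, fixed


# Hypothetical predictor values for four wells.
data = {
    "length": [5000.0, 8000.0, 9500.0, 12000.0],
    "depth": [9700.0, 10200.0, 10600.0, 11000.0],
    "water": [30.0, 100.0, 150.0, 280.0],
}


def toy_predict_log(p):
    """Toy stand-in for a fitted model that returns log(max oil-flow rate)."""
    return 1e-4 * p["length"] + 1e-3 * (p["water"] - 100.0)


xs, ys, grid, fixed = contour_grid(data, "length", "depth", toy_predict_log)
```

The resulting `grid` can be passed to any contour-plotting routine; the cell with the largest value marks the predicted hot zone for the chosen pair of predictors.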

[Fig. 11 graphic: contour map; x-axis: longitude; y-axis: latitude; color scale: max oil-flow rate (MSTB/month).]

Fig. 11—Contour plot of max oil-flow rate vs. longitude and latitude with well length, well depth, water injection, and sand injection
fixed at their respective median values for Middle Bakken data.

[Fig. 12 graphic: contour map; x-axis: length (ft); y-axis: depth (ft); color scale: max oil-flow rate (MSTB/month).]

Fig. 12—Contour plot of max oil-flow rate vs. well depth and well length with location, water injection, and sand injection fixed at
their respective median values for Middle Bakken data.

[Fig. 13 graphic: contour map; x-axis: sand (lbm/ft); y-axis: water (bbl/ft); color scale: max oil-flow rate (MSTB/month).]

Fig. 13—Contour plot of max oil-flow rate vs. water and sand injection with location, well length, and well depth fixed at their
respective median values for Middle Bakken data.

These contour plots are powerful tools that can be used for future drillsite selections as well as providing guidance in developing
optimal drilling and completion strategies to maximize well productivity.

Case Study II—Marcellus Shale


In the second case study, we apply the methodology used for the small data set of 249 wells in the previous section to a much bigger
data set of 3,897 shale gas horizontal wells in Marcellus shale gas reservoirs. The Marcellus shale is one of six key tight oil and shale
gas regions in the United States (EIA 2014b). It covers a total area of more than 100,000 sq miles and ranges in depth from 4,000 to
8,500 ft (US Department of Energy 2013). In addition, it has approximately 1,500 Tcf of original gas in place and 141 Tcf of technically
recoverable gas (US Department of Energy 2013). To fully realize its huge economic potential, it is essential to be able to identify hot
zones in the reservoirs with the potential to yield highly productive wells.
This case study involves a big data set of 3,897 shale gas horizontal wells in an area with longitude range [–77.61, –75.52] and
latitude range [41.50, 42.00]. With the limited information available in the public data set, four predictor variables were selected
and investigated: well depth, well length, longitude, and latitude. The response variable is the maximum gas-flow rate within nine
months, denoted as max gas-flow rate hereafter. After reviewing the production data, we decided to focus on 2,064 horizontal dry-gas
wells with more than nine months of gas-production data.
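
The definition of the response variable and the well-selection rule can be sketched as follows. The production histories are hypothetical, and the sketch assumes one rate value per producing month; how partial months or shut-in periods were handled in the study is not stated.

```python
def max_rate_first_months(monthly_rates, months=9):
    """Maximum monthly rate within the first `months` months of production."""
    return max(monthly_rates[:months])


# Hypothetical monthly gas rates (MMscf/month) for three wells.
production = {
    "well_A": [120, 310, 280, 250, 230, 210, 190, 180, 170, 160, 150],
    "well_B": [90, 140, 130],                               # fewer than nine months of data
    "well_C": [50, 60, 75, 70, 66, 61, 58, 55, 52, 400],    # later peak falls outside the window
}

# Keep only wells with more than nine months of data, then compute the response.
response = {
    well: max_rate_first_months(rates)
    for well, rates in production.items()
    if len(rates) > 9
}
```

Restricting the maximum to a fixed nine-month window keeps the response comparable across wells with production histories of different lengths.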
Following the same procedure as in the previous section, we first explore the distribution of the response variable. The histogram in
Fig. 14a shows that the maximum gas-flow rate is also heavily right-skewed. We again try the natural log transformation, and the
transformed distribution turns out to be quite symmetric, as shown in Fig. 14b, with a mean and median of 4.90 and 4.93, respectively.
Therefore, we decided to use the log-transformed max gas-flow rate as the new response variable.
After the transformation of the response variable, we explore the marginal relationship between each predictor and the response.
Again, we treat longitude and latitude as one predictor representing the well location. From Fig. 15, we can see a clear pattern of high-
productivity gas wells mainly concentrating in the lower right-hand corner of the map, whereas the majority of the wells in other parts
of the map have relatively low gas production.
Figs. 16 and 17 show the scatter plots (black dots), local linear smoothing estimators (blue curves), and 95% pointwise confidence
bands (gray areas) between the predictor variables (well depth and well length) and the log-transformed max gas-flow rate. Fig. 16 shows
a sharp initial increase in the max gas-flow rate as the wells go deeper. The max gas-flow rate peaks at a depth of approximately 7,000
to 8,000 ft and then starts to decrease as the wells continue to go deeper. The 95% confidence band indicates that the model prediction
is more accurate for well depths from 5,000 to 10,000 ft, where we have the most data points. Fig. 17 shows a clear trend in which the
max gas-flow rate increases with the well length. Initially, the increase is moderate, with well length ranging from zero to 4,000 ft.
When the well is longer than 4,000 ft, the increase in max gas-flow rate starts to accelerate with increasing well length. The 95%
confidence band indicates that the estimated curve is generally accurate, with a fairly narrow band over the whole well-length range.
Figs. 16 and 17 clearly show nonlinear patterns between the predictor variables and the response variable. Therefore, nonparametric
models are more suitable for this data set than the MLR models.

[Fig. 14 graphic: (a) density vs. max gas-flow rate (MMscf/month); (b) density vs. log[max gas-flow rate (MMscf/month)].]

Fig. 14—Histogram and log-transformed histogram of the max gas-flow rate for Marcellus data: (a) Distribution of max gas-flow
rate; (b) distribution of max gas-flow rate with log transformation.

[Fig. 15 graphic: well locations plotted by longitude (x-axis, –78 to –75.5) and latitude (y-axis, 41.5 to 42), colored by max gas-flow rate (MMscf/month).]

Fig. 15—Max gas-flow rate within a 9-month period for Marcellus data.

[Fig. 16 graphic: scatter plot with smoothing curve; x-axis: well depth (ft); y-axis: max gas-flow rate (MMscf/month).]

Fig. 16—Scatter plot and its local linear smoothing of max gas-flow rate with well depth for Marcellus data.

Fig. 18 shows the relative variable importance from GBM of the four predictor variables. As we can see from the results summar-
ized in Fig. 18, longitude is the most important (38.8%), followed by the well length (25.8%). The other location variable, latitude, has
a similar relative importance (24.0%), and well depth has the lowest relative importance (11.4%) to the max gas-flow rate.
Next, we use NAM with the cubic B-spline to model the relationship between log-transformed well performance and the location
(longitude, latitude), well length, and well depth. Because of the big sample size, we randomly split the data into two parts: 70% of the
data is used only for training the model, and the remaining 30% of the data is withheld only to test the goodness of the trained model.

Let $y$ be the well performance and $x_1, x_2, x_3, x_4$ be well length, well depth, longitude, and latitude, respectively. Similar to
the data analysis in the previous section, the NAM has the form

$$\log(y) = \mu + f_1(x_1) + f_2(x_2) + f_3(x_3, x_4) + \varepsilon, \tag{20}$$

where $f_1$ and $f_2$ are two univariate smooth curves and $f_3$ is a bivariate smooth surface. The model is estimated by minimizing the
following penalized sum of squared errors:

$$L_5(\beta_1, \ldots, \beta_m, \gamma) = \sum_{i=1}^{n} \left[\log(y_i) - \mu - f_1(x_{1i}) - f_2(x_{2i}) - f_3(x_{3i}, x_{4i})\right]^2 + \gamma_1 \int \left[f_1''(x_1)\right]^2 dx_1 + \gamma_2 \int \left[f_2''(x_2)\right]^2 dx_2 + \gamma_3 J|f_3|, \tag{21}$$

where

$$f_1(x_{1i}) = \sum_{j=1}^{m_1} b_{1j}(x_{1i})\,\beta_{1j}, \tag{22}$$

$$f_2(x_{2i}) = \sum_{j=1}^{m_2} b_{2j}(x_{2i})\,\beta_{2j}, \tag{23}$$

$$f_3(x_{3i}, x_{4i}) = \sum_{j=1}^{m_3} \sum_{l=1}^{m_4} b_{3j}(x_{3i})\, b_{4l}(x_{4i})\,\beta_{3jl}, \tag{24}$$

$$J|f_3| = \iint \left[\left(\frac{\partial^2 f_3}{\partial x_3^2}\right)^2 + 2\left(\frac{\partial^2 f_3}{\partial x_3\,\partial x_4}\right)^2 + \left(\frac{\partial^2 f_3}{\partial x_4^2}\right)^2\right] dx_3\, dx_4, \tag{25}$$

$$\beta_k = (\beta_{k1}, \ldots, \beta_{k m_k}), \quad k = 1, 2. \tag{26}$$
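
The bivariate surface is a double sum over products of one-dimensional basis functions. A minimal sketch of evaluating such a tensor-product surface is given below; for brevity it swaps in the basis {1, t} in each direction instead of cubic B-splines, and the coefficient values are hypothetical.

```python
def tensor_surface(x3, x4, basis3, basis4, beta):
    """Evaluate f(x3, x4) = sum_j sum_l basis3[j](x3) * basis4[l](x4) * beta[j][l]."""
    total = 0.0
    for j, bj in enumerate(basis3):
        for l, bl in enumerate(basis4):
            total += bj(x3) * bl(x4) * beta[j][l]
    return total


# Stand-in basis {1, t} in each direction (not cubic B-splines).
basis = [lambda t: 1.0, lambda t: t]

# All coefficients equal to one: f = 1 + x4 + x3 + x3 * x4.
value = tensor_surface(2.0, 3.0, basis, basis, [[1.0, 1.0], [1.0, 1.0]])

# Only the cross term is active: f = x3 * x4, a pure interaction.
interaction = tensor_surface(2.0, 3.0, basis, basis, [[0.0, 0.0], [0.0, 1.0]])
```

The cross-term coefficients are what allow the surface to represent an interaction between longitude and latitude that a sum of two univariate curves cannot capture.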

[Fig. 17 graphic: scatter plot with smoothing curve; x-axis: well length (ft); y-axis: max gas-flow rate (MMscf/month).]

Fig. 17—Scatter plot and its local linear smoothing of max gas-flow rate with well length for Marcellus data.

[Fig. 18 graphic: bar chart of the influence of important parameters (%): longitude 38.8; well length 25.8; latitude 24.0; well depth 11.4.]

Fig. 18—Order of relative importance of four predictor variables on the basis of the gradient-boosting method for Marcellus data.

Fig. 19 plots the actual well performance vs. the well performance predicted with the NAM. Fig. 19a is for the training data set, and
Fig. 19b is for the test data set. For the data points falling below the red line in these plots, the model underpredicts the actual
well performance; for the data points falling above the red line, the model overpredicts it. A careful examination of Figs. 19a and 19b
reveals that, in both the training and test cases, the model fits the actual well performance very well when the max gas-flow rate is
less than 400 MMscf/month. When the actual well performance is more than 400 MMscf/month, the model tends to underpredict it. The
difference could be attributed to the smaller number of data points in the region with max gas-flow rates greater than 400 MMscf/month.
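
The visual check described above can also be summarized numerically as the mean prediction error below and above a threshold. The sketch below uses hypothetical actual and predicted rates; the 400 MMscf/month threshold is taken from the text, while the function name and the data are assumptions.

```python
def mean_of(values):
    return sum(values) / len(values) if values else float("nan")


def mean_error_by_range(actual, predicted, threshold=400.0):
    """Mean of (predicted - actual) for wells below and at or above the threshold."""
    low = [p - a for a, p in zip(actual, predicted) if a < threshold]
    high = [p - a for a, p in zip(actual, predicted) if a >= threshold]
    return mean_of(low), mean_of(high)


# Hypothetical rates in MMscf/month.
actual = [100.0, 200.0, 300.0, 500.0, 700.0]
predicted = [110.0, 190.0, 300.0, 420.0, 560.0]
low_bias, high_bias = mean_error_by_range(actual, predicted)
```

A negative mean error corresponds to points falling below the 45-degree line, that is, to underprediction.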

[Fig. 19 graphic: panels (a) and (b); x-axis: actual max gas-flow rate (MMscf/month); y-axis: predicted max gas-flow rate (MMscf/month).]

Fig. 19—Comparison of actual and predicted well performance with nonparametric additive model for Marcellus data: (a) training
data set; (b) test data set.

Next, we compare the model performance of the NAM with that of other commonly used MLR models. The same 70:30 data split for training
and test data sets is used for all models. In Table 3, we compare the performance of the NAM with MLR, MLR+IT, and MLR+Quad. We use the
AIC, BIC, $R^2$, adjusted $R^2$ (Adj $R^2$), and the relative mean squared error of the training data set (Rel MSE_TR) to measure the
explanation power. The relative mean squared error of the test data set (Rel MSE_TE) is used to measure the prediction power. The
relative mean squared error is defined as the ratio of the MSE of a given model to the MSE of the NAM, so the NAM itself has a value of
1.00. Results summarized in Table 3 show that the NAM outperforms the other models in terms of explanation power. The Rel MSE_TE of each
of the other models is at least 18% higher than that of the NAM, indicating that the NAM has better predictive power. These findings
again suggest that the nonparametric models are superior to the MLR models for the field data used in this study, in which nonlinear and
complicated correlations exist between the predictors and the response variable.
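
The 70:30 split and the relative MSE can be sketched as follows. The seed, the rounding of the training-set size, and the MSE values passed to `relative_mse` are assumptions for illustration and are not the values behind Table 3.

```python
import random


def train_test_split(n, train_fraction=0.7, seed=0):
    """Randomly split the indices 0..n-1 into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_mse(mse_by_model, reference="NAM"):
    """MSE of each model divided by the MSE of the reference model."""
    return {m: v / mse_by_model[reference] for m, v in mse_by_model.items()}


train_idx, test_idx = train_test_split(2064)

# Hypothetical test-set MSE values on the log scale.
rel = relative_mse({"NAM": 0.50, "MLR": 0.615})
```

A relative MSE of 1.23 means that the model's test error is 23% larger than that of the NAM, so values above 1.00 favor the NAM.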

Model       AIC     BIC     R^2    Adj R^2   Rel MSE_TR   Rel MSE_TE
NAM         2125    2369    0.64   0.63      1.00         1.00
MLR         2540    2571    0.49   0.49      1.22         1.23
MLR+IT      2450    2513    0.53   0.52      1.20         1.90
MLR+Quad    2387    2471    0.55   0.54      1.20         1.18

Table 3—Comparison of the nonparametric additive model and other models for Marcellus data.

Adj $R^2$ can be used to compare the model explanation power between data sets with different sample sizes and numbers of predictors.
Not surprisingly, a comparison of the Adj $R^2$ values in Tables 2 and 3 clearly demonstrates the advantage of having a bigger data set
for the same number of predictors (four in this case).
A comparison of the Adj $R^2$ values in Tables 1 and 3 shows that the NAM can achieve explanation power in the big-data-set field case
(2,064 Marcellus wells) with four predictor variables (Adj $R^2$ = 0.63) that is similar to that in the small-data-set field case (249
Bakken wells) with six predictor variables (Adj $R^2$ = 0.64). This implies that when we have a big sample data set, even if the number
of predictor variables is limited, the NAM can still achieve a high explanation power.
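
For reference, the usual adjusted $R^2$ formula for a linear model with $p$ predictors and an intercept is sketched below. How the degrees of freedom were counted for the NAM in this study is not stated, so the sketch should be read as the standard linear-model version only.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (plus an intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Same R^2 and number of predictors, different sample sizes.
small = adjusted_r2(0.5, n=11, p=2)
large = adjusted_r2(0.5, n=1001, p=2)
```

For a fixed $R^2$ and number of predictors, the adjustment shrinks as the sample size grows, which is why the statistic is suited to comparing data sets of different sizes.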
Figs. 20 and 21 are contour plots that can be used to visualize the relationship between the predicted max gas-flow rate and two
closely related predictor variables. These figures were generated by keeping the other predictor variables at their respective median
values. The warmer the color in the contour plots, the higher the max gas-flow rate. As shown in Fig. 20, the hot-zone location with the
potential to yield highly productive wells can be easily identified. This plot can also provide an estimate of the predicted gas-flow
rate at a given well location. This information can be very useful in choosing future drillsites in areas with high economic potential.
From Fig. 21, we can identify the well length and depth that would yield the highest well productivity at a given location in the
reservoir. This information can be used to develop the optimal drilling and completion strategy.

Conclusion
In this study, we developed an efficient and effective data-analysis tool with nonparametric smoothing models that are superior to the
traditional MLR models in both the predictive power and the ability to deal with nonlinear, higher-order variable interactions (see
Tables 1, 2, and 3).
To validate our tool, we used two real data sets—one with 249 tight oil horizontal wells from the Middle Bakken and the other with
2,064 shale gas horizontal wells from the Marcellus Shale. Results from the preliminary data analyses revealed that the interactions
between the predictor and response variables are highly nonlinear and very complicated.
Using a nonparametric additive model with cubic B-splines to capture the relationship between log-transformed well performance and the
location (longitude, latitude), well length, well depth, sand injection, and water injection, our tool successfully explained
approximately 65 to 70% of the well-performance variation in the two data sets, as opposed to 44 to 55% with traditional MLR models,
representing a 30 to 50% increase in model performance.

[Fig. 20 graphic: contour map; x-axis: longitude (–78 to –75.5); y-axis: latitude (41.5 to 42); color scale: max gas-flow rate (MMscf/month).]

Fig. 20—Contour plot of max gas-flow rate vs. well location (longitude and latitude) with well length and well depth fixed at their re-
spective median values for Marcellus data.

[Fig. 21 graphic: contour map; x-axis: well depth (ft); y-axis: well length (ft); color scale: max gas-flow rate (MMscf/month).]

Fig. 21—Contour plot of max gas-flow rate vs. well depth and well length with location variables (longitude and latitude) fixed at
their respective median values [–76.2, 41.65] for Marcellus data.

With contour plots, we can identify hot zones in the reservoir with the potential to yield highly productive wells. They also can be
used to predict well performance at a selected future drillsite with different combinations of well length, well depth, and amount of
proppant and water to inject. These contour plots are powerful tools that can be used for future drillsite selections as well as providing
guidance in developing optimal drilling and completion strategies to maximize well productivity.
By comparing results from the two data sets, we found that our tool can achieve similar model performance with the big data set
(2,064 Marcellus wells) and only four predictor variables as with the small data set (249 Bakken wells) and six predictor variables.
This implies that for big data sets, even with a limited number of available predictor variables, our tool can still be very effective
in identifying hot zones that would yield highly productive wells. If more predictors are available, we can expect better prediction
results.
The data sets that we have access to in this study contain very limited completion, geological, and petrophysical information. Results
from this study clearly demonstrated that the data-analysis tool is certainly powerful and flexible enough to take advantage of any addi-
tional engineering and geology data to allow the operators to gain insights on the impact of these factors on well performance.

Nomenclature
$b_i^p(x)$ = the $i$th spline basis of polynomial order $p$
$E(y)$ = expected value of random variable $y$
$f(x)$ = a univariate smooth function
$f(x_1, x_2)$ = a bivariate smooth function
$\log(y)$ = the natural log transformation of $y$
$L_k$ = sum of square error, $k = 1, \ldots, 5$
$K_h(x)$ = kernel function with bandwidth $h$
$R^2$ = coefficient of determination
$S_\gamma(i, i)$ = the $i$th diagonal element of the smoothing matrix
$\gamma$ = smoothing parameter for B-spline
$\mu$ = mean of the response variable

Acknowledgments
We would like to acknowledge financial support from Texas A&M Engineering Experiment Station. We would also like to acknowl-
edge that DrillingInfo provided the production data analyzed in the work.

References
Cook, E. R. and Peters, K. 1981. The Smoothing Spline: A New Approach to Standardizing Forest Interior Tree-ring Width Series for Dendroclimatic
Studies. Tree-Ring Bull. 41: 45–53.
Duchamp, T. and Werner, S. 2003. Spline Smoothing on Surfaces. Journal of Computational and Graphical Statistics 12 (2): 354–381. https://doi.org/
10.1198/1061860031743.
Eburi, S., Jones, S., Houston, T. et al. 2014. Analysis and Interpretation of Haynesville Shale Subsurface Properties, Completion Variables, and Produc-
tion Performance Using Ordination, A Multivariate Statistical Analysis Technique. Presented at the SPE Annual Technical Conference and Exhibi-
tion, Amsterdam, 27–29 October. SPE-170834-MS. https://doi.org/10.2118/170834-MS.
Esmaili, S. and Mohaghegh, S. D. 2016. Full-Field Reservoir Modeling of Shale Assets Using Advanced Data-Driven Analytics. Geoscience Frontiers 7
(1): 11–20. https://doi.org/10.1016/j.gsf.2014.12.006.
Fan, J. 1993. Local Linear Regression Smoothers and Their Minimax Efficiencies. The Annals of Statistics 21 (1): 196–216.
Gao, C. and Gao, H. 2013. Evaluating Early-Time Eagle Ford Well Performance Using Multivariate Adaptive Regression Splines (MARS). Presented at
the SPE Annual Technical Conference and Exhibition, New Orleans, 30 September–2 October. SPE-166462-MS. https://doi.org/10.2118/166462-
MS.
Golub, G. H., Heath, M., and Wahba, G. 1979. Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter. Technometrics 21 (2):
215–223.
Grujic, O., Silva, C. D., and Caers, J. 2015. Functional Approach to Data Mining, Forecasting, and Uncertainty Quantification in Unconventional Reser-
voirs. Presented at the SPE Annual Technical Conference and Exhibition, Houston, 28–30 September. SPE-174849-MS. https://doi.org/10.2118/
174849-MS.
Gupta, S., Fuehrer, F., and Jeyachandra, B. C. 2014. Production Forecasting in Unconventional Resources Using Data Mining and Time Series Analysis.
Presented at the SPE/CSUR Unconventional Resources Conference, Calgary, 30 September–2 October. SPE-171588-MS. https://doi.org/10.2118/
171588-MS.
Hastie, T., Tibshirani, R., and Friedman, J. 2005. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer: New York.
King, V. M. and Wray, L. 2014. Completion Optimization Utilizing Multivariate Analysis in the Bakken and Three Forks Formations. Presented at the
SPE Western North American and Rocky Mountain Joint Regional Meeting, Denver, 16–18 April. SPE-169534-MS. https://doi.org/10.2118/169534-
MS.
LaFollette, R. F., Izadi, G., and Zhong, M. 2014. Application of Multivariate Statistical Modeling and Geographic Information Systems Pattern-Recogni-
tion Analysis to Production Results in the Eagle Ford Formation of South Texas. Presented at the SPE Hydraulic Fracturing Technology Conference,
The Woodlands, Texas, USA, 4–6 February. SPE-168628-MS. https://doi.org/10.2118/168628-MS.
Lolon, E., Hamidieh, K., Weijers, L. et al. 2016. Evaluating the Relationship Between Well Parameters and Production Using Multivariate Statistical
Models: A Middle Bakken and Three Forks Case History. Presented at the SPE Hydraulic Fracturing Technology Conference, The Woodlands,
Texas, USA, 9–11 February. SPE-179171-MS. https://doi.org/10.2118/179171-MS.
Maučec, M., Singh, A. P., Bhattacharya, S. et al. 2015. Multivariate Analysis and Data Mining of Well-Stimulation Data by Use of Classification-and-
Regression Tree With Enhanced Interpretation and Prediction Capabilities. SPE Econ & Mgmt 7 (2): 60–71. SPE-166472-PA. https://doi.org/
10.2118/166472-PA.
Mohaghegh, S. D. 2016. Determining the Main Drivers in Hydrocarbon Production From Shale Using Advanced Data-driven Analytics—A Case Study
in Marcellus Shale. Journal of Unconventional Oil and Gas Resources 15: 146–157. https://doi.org/10.1016/j.juogr.2016.07.004.
Neter, J., Wasserman, W., and Kutner, M. 1989. Applied Linear Regression Model, second edition. Boston, Massachusetts: Richard D. Irwin, Inc.
Picton, P. 2000. Neural Networks. New York City: PALGRAVE.
Pham, T. D. and Liu, X. 1995. Neural Networks for Identification, Prediction and Control. London: Springer-Verlag London Limited.
Schuetter, J., Mishra, S., Zhong, M. et al. 2015. Data Analytics for Production Optimization in Unconventional Reservoirs. Presented at the SPE/AAPG/
SEC Unconventional Resource Technology Conference, San Antonio, Texas, USA, 20–22 July. URTEC-2167005-MS. https://doi.org/10.15530/
URTEC-2015-2167005.
Singh, A. 2015. Root-Cause Identification and Production Diagnostic for Gas Wells With Plunger Lift. Presented at the SPE Reservoir Characterization and Simulation Conference and Exhibition, Abu Dhabi, 14–16 September. SPE-175564-MS. https://doi.org/10.2118/175564-MS.
Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 58 (1): 267–288.
US Department of Energy (DOE). 2013. Modern Shale Gas Development in the United States: An Update. http://www.netl.doe.gov/File%20Library/Research/Oil-Gas/shale-gas-primer-update-2013.pdf.
US Energy Information Administration. 2013a. Technically Recoverable Shale Oil and Shale Gas Resources: An Assessment of 137 Shale Formations in 41 Countries Outside the United States. http://www.eia.gov/analysis/studies/worldshalegas/.
US Energy Information Administration. 2013b. Early Release Overview. http://www.eia.gov/forecasts/aeo/er/pdf/0383er%282013%29.pdf.
US Energy Information Administration. 2014a. Annual Energy Outlook. http://www.eia.gov/forecasts/aeo/mt_naturalgas.cfm.
US Energy Information Administration. 2014b. Drilling Productivity Report. http://www.eia.gov/petroleum/drilling/#tabs-summary-1.
United States Geological Survey (USGS). 2013. USGS Releases New Oil and Gas Assessment for Bakken and Three Forks Formations. USGS Webpost. https://www2.usgs.gov/blogs/features/usgs_top_story/usgs-releases-new-oil-and-gas-assessment-for-bakken-and-three-forks-formations/.
Voneiff, G., Sadeghi, S., Bastian, P. et al. 2013. A Well-Performance Model Based on Multivariate Analysis of Completion and Production Data From Horizontal Well in the Montney Formation in British Columbia. Presented at the SPE Unconventional Resources Conference, Calgary, 5–7 November. SPE-167154-MS. https://doi.org/10.2118/167154-MS.
Voneiff, G., Sadeghi, S., Bastian, P. et al. 2014. Probabilistic Forecasting of Horizontal Well Performance in Unconventional Reservoirs Using Publicly-Available Completion Data. Presented at the SPE Unconventional Resources Conference, The Woodlands, Texas, USA, 1–3 April. SPE-168978-MS. https://doi.org/10.2118/168978-MS.

Willigers, B. J. A., Begg, S., and Bratvold, R. B. 2014. Combining Geostatistics With Bayesian Updating to Continually Optimize Drilling Strategy in Shale-Gas Plays. SPE Res Eval & Eng 17 (4): 507–519. SPE-164816-PA. https://doi.org/10.2118/164816-PA.
Wood, S. 2006. Generalized Additive Models: An Introduction With R. New York: CRC Press.
Zhong, M., Schuetter, J., Mishra, S. et al. 2015. Do Data Mining Methods Matter?: A Wolfcamp Shale Case Study. Presented at the SPE Hydraulic Fracturing Technology Conference, The Woodlands, Texas, USA, 3–5 February. SPE-173334-MS. https://doi.org/10.2118/173334-MS.

Appendix A–MLR Models for Marcellus Shale Data


For the Marcellus data, we have max gas-flow rate $y$ and predictor variables $x_1$, $x_2$, $x_3$, and $x_4$. The existing MLR models that we compare with are listed as follows.
The first model is the MLR model, which has the form

$$\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon. \tag{A-1}$$

The back-transformed estimated model has expectation

$$E(y) = \exp(148.91 + 0.00021 x_1 - 0.00008 x_2 + 0.76 x_3 + 2.09 x_4). \tag{A-2}$$

The second model is a more-complicated version of MLR: the basic MLR+IT. The formula is

$$\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \beta_8 x_2 x_3 + \beta_9 x_2 x_4 + \beta_{10} x_3 x_4 + \varepsilon. \tag{A-3}$$

The back-transformed estimated model has expectation

$$\begin{aligned}
E(y) = \exp(&12694.96 - 0.025 x_1 - 0.033 x_2 + 161.98 x_3 - 300.47 x_4 + 0.000000030 x_1 x_2 - 0.000077 x_1 x_3 \\
&+ 0.00046 x_1 x_4 - 0.00012 x_2 x_3 + 0.00057 x_2 x_4 - 3.83 x_3 x_4).
\end{aligned} \tag{A-4}$$

The third model to be compared with is the MLR+IT model with all quadratic terms added, denoted as MLR+Quad. The formula is

$$\begin{aligned}
\log(y) = {}& \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \beta_8 x_2 x_3 + \beta_9 x_2 x_4 + \beta_{10} x_3 x_4 \\
&+ \beta_{11} x_1^2 + \beta_{12} x_2^2 + \beta_{13} x_3^2 + \beta_{14} x_4^2 + \varepsilon.
\end{aligned} \tag{A-5}$$

The back-transformed estimated model has expectation

$$\begin{aligned}
E(y) = \exp(&5909.30 + 0.0010 x_1 + 0.028 x_2 + 136.12 x_3 + 530.38 x_4 - 0.000000031 x_1 x_2 - 0.000028 x_1 x_3 - 0.000060 x_1 x_4 \\
&- 0.000036 x_2 x_3 - 0.000071 x_2 x_4 - 2.87 x_3 x_4 + 0.000000029 x_1^2 - 0.000000054 x_2^2 + 0.099 x_3^2 - 8.93 x_4^2).
\end{aligned} \tag{A-6}$$

For the Middle Bakken data, we have two more predictors: water injection and sand injection. The formulas for the corresponding MLR models are similar but more tedious, so the details are omitted.
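
The three model forms in this appendix (Eqs. A-1, A-3, and A-5) differ only in which columns enter the design matrix. The following Python sketch, which is not the authors' code, builds those columns with NumPy and fits each form to log(y) by ordinary least squares. The data are synthetic, and the names `design_matrix`, `fit_log_linear`, and `predict` are hypothetical helpers introduced only for illustration.

```python
import itertools
import numpy as np

def design_matrix(X, kind="mlr"):
    """Columns for MLR, MLR+IT (pairwise interactions), or MLR+Quad."""
    n, p = X.shape
    # Intercept and main effects (Eq. A-1).
    cols = [np.ones(n)] + [X[:, j] for j in range(p)]
    if kind in ("mlr+it", "mlr+quad"):
        # All pairwise products x_i * x_j with i < j (Eq. A-3).
        cols += [X[:, i] * X[:, j]
                 for i, j in itertools.combinations(range(p), 2)]
    if kind == "mlr+quad":
        # Squared terms x_j^2 (Eq. A-5).
        cols += [X[:, j] ** 2 for j in range(p)]
    return np.column_stack(cols)

def fit_log_linear(X, y, kind="mlr"):
    """Least-squares coefficients for log(y) on the chosen design."""
    A = design_matrix(X, kind)
    beta, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return beta

def predict(X, beta, kind="mlr"):
    """Back-transformed prediction exp(design @ beta)."""
    return np.exp(design_matrix(X, kind) @ beta)

# Synthetic example (assumed data, not the Marcellus wells):
# four standardized predictors and one true interaction, x1 * x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
log_y = (1.0 + X @ np.array([0.5, -0.3, 0.2, 0.1])
         + 0.4 * X[:, 0] * X[:, 1]
         + rng.normal(scale=0.1, size=200))
y = np.exp(log_y)

for kind in ("mlr", "mlr+it", "mlr+quad"):
    beta = fit_log_linear(X, y, kind)
    rss = np.sum((np.log(y) - design_matrix(X, kind) @ beta) ** 2)
    print(kind, "coefficients:", beta.size, "residual SS:", round(rss, 3))
```

With four predictors, the three designs have 5, 11, and 15 coefficients, matching the indices in Eqs. A-1, A-3, and A-5. With predictors in raw units (for example, depths in thousands of feet), centering and scaling the columns before forming products and squares generally improves the numerical conditioning of the fit.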

Appendix B–Test of Independence Among Predictive Variables


In theory, multiple-regression models require the predictive variables to be independent of one another. Unfortunately, this is rarely achievable with complicated real-world data. Multicollinearity exists when some predictors can be predicted from a linear combination of other predictors. This redundancy makes the coefficient estimates unstable. The presence and severity of multicollinearity can be measured by the variance inflation factor (VIF), which quantifies how much the variance of each regression coefficient is inflated by the presence of the other predictors in a linear model. The larger the VIF value, the more severe the multicollinearity. A VIF value close to 1.0 indicates almost no correlation among predictive variables. A VIF value of 5.0 is commonly used as the cutoff point for variable selection and dimension reduction. Therefore, as long as multicollinearity is not severe (VIF < 5), we can safely apply NAM.
In the following, we present the scatter-plot matrices, correlation matrices, and VIFs for all predictors from the two data sets used in this paper (Figs. B-1 and B-2; Tables B-1 through B-4). The scatter plots show the general patterns among predictors. The correlation coefficient (r) in the correlation matrix summarizes the linear association between two predictors and ranges from –1.0 to 1.0. A correlation coefficient of zero means no linear association, whereas coefficients of –1.0 and 1.0 indicate perfect negative and perfect positive linear associations, respectively. Because no VIF exceeds 2.0 for either data set, no severe multicollinearity exists among the selected predictors, although some pairs of predictors are weakly or moderately linearly associated (e.g., r = 0.624 for sand and water in Table B-3).
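
A common way to compute the VIF of predictor j is to regress it on the remaining predictors and take 1 / (1 − R_j²), where R_j² is the coefficient of determination of that regression. The following Python sketch illustrates this definition on synthetic data. It is not the authors' implementation, and the function name `vif` and the example variables are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n rows, p predictors)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (assumed data): a and b are independent,
# while c is nearly a rescaled copy of a.
rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = 0.95 * a + 0.05 * rng.normal(size=n)

X_indep = np.column_stack([a, b])
X_collinear = np.column_stack([a, b, c])

print("VIF, independent predictors:", np.round(vif(X_indep), 2))
print("VIF, with redundant column: ", np.round(vif(X_collinear), 2))
print("Correlation matrix:")
print(np.round(np.corrcoef(X_collinear, rowvar=False), 3))
```

In this sketch, the independent pair gives VIFs near 1, whereas the redundant column pushes the VIFs of `a` and `c` far above the cutoff of 5. The same values can be read from the diagonal of the inverse of the sample correlation matrix, which is a convenient cross-check.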


Fig. B-1. Scatter-plot matrix for all four predictor variables in Marcellus data: Lat = Latitude, Lon = Longitude, Depth = Well Depth, and Length = Well Length.

Fig. B-2. Scatter-plot matrix for all six predictor variables in Middle Bakken data: Depth = Well Depth, Length = Well Length, Sand = Sand Injection, Water = Water Injection, Lat = Latitude, and Lon = Longitude.

              Lat      Lon    Depth   Length
Lat         1.000    0.020    0.051   –0.047
Lon         0.020    1.000    0.285    0.024
Depth       0.051    0.285    1.000   –0.248
Length     –0.047    0.024   –0.248    1.000

Table B-1. Correlation matrix for all four predictors in Marcellus data.

              Lat      Lon    Depth   Length
VIF          1.48     1.16     1.73     1.14

Table B-2. Variance inflation factors (VIFs) for all four predictors in Marcellus data.


            Depth   Length     Sand    Water      Lat      Lon
Depth       1.000   –0.244   –0.340   –0.217   –0.096   –0.179
Length     –0.244    1.000    0.380    0.339    0.094    0.133
Sand       –0.340    0.380    1.000    0.624    0.307    0.222
Water      –0.217    0.339    0.624    1.000    0.137    0.040
Lat        –0.096    0.094    0.307    0.137    1.000    0.412
Lon        –0.179    0.133    0.222    0.040    0.412    1.000

Table B-3. Correlation matrix for all six predictors in Middle Bakken data.

            Depth   Length     Sand    Water      Lat      Lon
VIF          1.17     1.22     2.00     1.70     1.29     1.26

Table B-4. VIFs for all six predictors in Middle Bakken data.

SI Metric Conversion Factors


ft × 3.048* E-01 = m
ft³ × 2.832 E-02 = m³
bbl × 1.589 E-01 = m³
*Conversion factor is exact.

Quan Cai is a PhD-degree candidate in the Department of Statistics at Texas A&M University. His research interests include
longitudinal data analysis, nonparametric and semiparametric statistics, and statistical learning. Cai holds a BS degree in
mathematical statistics from Zhejiang University, China.
Wei Yu (corresponding author) is a research associate in the Harold Vance Department of Petroleum Engineering at Texas A&M
University. His research interests include reservoir modeling and simulation of shale gas and tight oil production, CO2-enhanced
shale gas-and-oil recovery, assisted history matching, uncertainty quantification, data mining, and nanoparticle-enhanced oil
recovery (EOR). Yu has authored or coauthored more than 50 technical papers and holds one patent. He holds a BS degree in
applied chemistry from the University of Jinan in China, an MS degree in chemical engineering from Tsinghua University in China,
and a PhD degree in petroleum engineering from the University of Texas at Austin. Yu is an active member of SPE.
Hwa Chi Liang is an instructional assistant professor in the Department of Statistics at Texas A&M University. Previously, she was an
associate professor in the Department of Mathematics and Statistics at Washburn University. Liang’s research interests include
linear models, Bayesian analysis, and statistical education. She has authored or coauthored more than 15 technical papers. Liang
has been an associate editor in statistics for the Missouri Journal of Mathematical Sciences since 2013. She holds a PhD degree
in statistics from the University of New Mexico.
Jenn-Tai Liang is a professor in the Harold Vance Department of Petroleum Engineering at Texas A&M University and also the
holder of the John E. & Deborah F. Bethancourt endowed professorship. His current research focus is on developing promising uses
of nanotechnology to improve oil recovery in both conventional and unconventional reservoirs. Example applications include in-
depth conformance control, chemical and microbial EOR, flow assurance, and hydraulic-fracturing-fluid cleanup. Liang holds six
US patents, with five more pending for his work in this area. He holds a PhD degree in petroleum engineering from the University of
Texas at Austin. Liang is an SPE Distinguished Member and was selected as an SPE Distinguished Lecturer for the 2015–2016 season.
Suojin Wang is a professor of statistics and epidemiology and biostatistics at Texas A&M University. His research interests include
semi- and nonparametric statistical methodology, missing- and mismeasured-data analyses, asymptotic theory, and applied
statistics. Wang has authored or coauthored more than 150 refereed research papers. He was the editor-in-chief of the Journal
of Nonparametric Statistics from 2007 to 2012. Wang holds a PhD degree in statistics from the University of Texas at Austin. He is
an elected fellow of the American Statistical Association, an elected fellow of the Institute of Mathematical Statistics, and an
elected member of the International Statistical Institute.
Kan Wu is an assistant professor in the Harold Vance Department of Petroleum Engineering at Texas A&M University. Her research
interests include hydraulic fracturing in unconventional reservoirs, proppant transport in complex fracture networks, coupled
geomechanics/fluid-flow modeling, and optimization of well performance from unconventional gas and oil reservoirs. Wu holds
a PhD degree in petroleum engineering from the University of Texas at Austin. She is a member of SPE.
