Abier Akiry
Asma Marghalani
ETR 790
Fall 2019
Data Set
The data set used for this project is the Ithaka faculty data set from www.icpsr.org.
The data are survey data. A total of 44,218 participants in the United States were selected via an
"every nth" selection from a list of faculty members by Schonfeld, Wulfson, and Housewright (2012). A
total of 5,261 individuals responded to the survey. The survey was conducted to explore faculty
attitudes and perceptions toward digital research, teaching, and communication.
Variables
Dependent variable
The predicted variable is a mean composite score measuring the faculty’s attitude toward
the value of improvements in the digital library. The composite has eight items. Each item has
ten-point Likert scale response options. In the descriptive output, the values 1, 2, 3, 4, and 5
are listed under "lowest" and the values 6, 7, 8, 9, and 10 under "highest." Specifically,
Q7_3) You may have the opportunity to read scholarly monographs in electronic format, either
through a library subscription database or as a standalone e-book. Certain changes in the future
may make digital versions more valuable to you. Use the scales below to rate how much more
valuable each of the following would make digital versions of scholarly monographs to you than
they are today, from 10 to 1 where 10 equals "Much more valuable than they are today" and 1
equals "Not at all more valuable than they are today." Please select one answer for each item.
A. Access to a wider range of materials in digital form
B. Improved ability to download and organize a personal collection of monographs
C. Improved ability to navigate through and among monographs
D. Improved ability to read scholarly monographs on my device of choice
E. Improved ability to highlight, annotate, and print materials as needed
F. Ability to perform computational analysis (text mining) over a corpus of electronic
monographs
G. More effective integration of images, multimedia, and graphs linked to the text
H. Certified preservation of digital scholarly monographs
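As a minimal sketch of how a mean composite over the eight items can be formed, the base R snippet below averages each respondent's answered items. The three respondents and their ratings are invented for illustration; only the item names come from the survey.

```r
# Toy responses for three hypothetical respondents on the eight items.
# The values are invented for illustration; NA marks a skipped item.
items <- data.frame(
  Q7_3_A = c(10, 8, NA),
  Q7_3_B = c(9, 7, 6),
  Q7_3_C = c(10, 8, 6),
  Q7_3_D = c(8, NA, 5),
  Q7_3_E = c(10, 9, 7),
  Q7_3_F = c(6, 5, NA),
  Q7_3_G = c(9, 8, 6),
  Q7_3_H = c(7, 6, 5)
)

# Mean composite per respondent, averaging over the items that were answered
composite <- rowMeans(items, na.rm = TRUE)
print(round(composite, 3))
```

With na.rm = TRUE, a respondent who skipped some items still receives a score based on the items they answered.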
Independent variables
The independent variables are gender and job title, both of which are categorical predictors.
Gender has two response options (1 = male, 2 = female), and job title has six (Professor,
Associate Professor, Assistant Professor, Adjunct Professor, Lecturer, and Other).
Descriptive statistics:
Summary statistics for the dependent variables (Table 1) include the mean of each survey
item: Q7_3_A, Q7_3_B, Q7_3_C, Q7_3_D, Q7_3_E, Q7_3_F, Q7_3_G, and Q7_3_H.
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 100 57 75 62 202 249 394 778 963 2257
Proportion 0.019 0.011 0.015 0.012 0.039 0.048 0.077 0.151 0.187 0.439
Q7_3_B : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-Improved ability to download and organize a personal collection of monographs Format:F2.0
n missing distinct Info Mean Gmd
5148 113 10 0.934 8.102 2.339
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 155 72 123 75 269 319 425 847 897 1966
Proportion 0.030 0.014 0.024 0.015 0.052 0.062 0.083 0.165 0.174 0.382
Q7_3_C : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-Improved ability to navigate through and among monographs Format:F2.0
n missing distinct Info Mean Gmd
5144 117 10 0.933 8.209 2.18
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 121 63 81 75 257 300 448 926 920 1953
Proportion 0.024 0.012 0.016 0.015 0.050 0.058 0.087 0.180 0.179 0.380
Q7_3_D : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-Improved ability to read scholarly monographs on my device of choice Format:F2.0
n missing distinct Info Mean Gmd
5146 115 10 0.957 7.568 2.804
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 276 128 160 96 405 384 470 792 756 1679
Proportion 0.054 0.025 0.031 0.019 0.079 0.075 0.091 0.154 0.147 0.326
Q7_3_E : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-Improved ability to highlight, annotate, and print materials as needed Format:F2.0
n missing distinct Info Mean Gmd
5151 110 10 0.918 8.133 2.409
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 179 83 117 87 291 267 414 691 856 2166
Proportion 0.035 0.016 0.023 0.017 0.056 0.052 0.080 0.134 0.166 0.421
Q7_3_F : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-Ability to perform computational analysis (text mining) over a corpus of electronic monographs
Format:F2.0
n missing distinct Info Mean Gmd
5129 132 10 0.973 6.826 3.375
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 544 216 233 168 424 390 437 688 652 1377
Proportion 0.106 0.042 0.045 0.033 0.083 0.076 0.085 0.134 0.127 0.268
Q7_3_G : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-More effective integration of images, multimedia, and graphs linked to the text Format:F2.0
n missing distinct Info Mean Gmd
5138 123 10 0.958 7.612 2.737
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 229 124 180 121 380 364 478 822 796 1644
Proportion 0.045 0.024 0.035 0.024 0.074 0.071 0.093 0.160 0.155 0.320
Q7_3_H : You / may have the opportunity to read scholarly monographs in electronic format, / either through a
lib...-Certified preservation of digital scholarly monographs Format:F2.0
n missing distinct Info Mean Gmd
5106 155 10 0.971 7.03 3.143
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 405 150 208 136 566 462 469 672 620 1418
Proportion 0.079 0.029 0.041 0.027 0.111 0.090 0.092 0.132 0.121 0.278
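The n, Proportion, and Mean entries in Table 1 follow directly from the Frequency rows. As a small check, the sketch below recomputes them for Q7_3_B from the frequencies reported above; the variable names are illustrative only.

```r
# Frequencies for Q7_3_B as reported in Table 1 (response values 1 to 10)
values <- 1:10
freq_B <- c(155, 72, 123, 75, 269, 319, 425, 847, 897, 1966)

# Number of valid responses is the sum of the frequencies
n_B <- sum(freq_B)

# Proportion of valid responses at each scale value
prop_B <- round(freq_B / n_B, 3)

# Item mean as a frequency-weighted average of the scale values
mean_B <- sum(values * freq_B) / n_B

print(n_B)
print(prop_B)
print(round(mean_B, 3))
```

The recomputed n (5148) and mean (8.102) agree with the values reported for Q7_3_B.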
Summary statistics for the independent variables (gender and title) (Table 2) show that
gender has 181 missing values and title has 96.
TitleFac
n missing distinct
5165 96 6
Value       Professor  Associate Professor  Assistant Professor  Adjunct Professor  Lecturer  Other
Frequency        1949                 1358                  877                365       270    346
Proportion      0.377                0.263                0.170              0.071     0.052  0.067
The missing data pattern (Figure 1) provides a graphical summary of the missing
patterns. There are 4,967 cases with no missing values on the variables. On the other hand, 26 cases
in the sample have a high percentage of missing values. In addition, the pattern shows the total
number of missing values for each variable. For instance, the total number of missing values for
title is 96 and the total number of missing values for gender is 181. The missing from the overall
Figure 1. Missing data patterns
Graphical Representations of Missing Data
This plot (Figure 2) provides a visualization of the amount of missing data,
showing in black the location of missing values (2.4%) and providing information on the
percentage of present values (97.6%) overall and in each variable. None of the variables are
The missing data plot (Figure 3) shows that all demographic variables have missing data. The
plot also shows that most of the missing data come from the gender variable.
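The counts and percentages summarized in these plots can also be computed directly. The following is a minimal base R sketch on a small invented data frame (not the survey data); it tallies missing values per variable, the overall percentage missing, and the number of complete cases.

```r
# Small invented data frame; NA marks a missing response
toy <- data.frame(
  GenderFac = c(1, NA, 2, 1, NA),
  TitleFac  = c(1, 2, NA, 4, 5),
  Q7_3_A    = c(10, 9, 8, NA, 7)
)

# Number of missing values in each variable
missing_per_variable <- colSums(is.na(toy))

# Percentage of all cells that are missing
percent_missing <- 100 * mean(is.na(toy))

# Number of cases with no missing values on any variable
complete_cases <- sum(complete.cases(toy))

print(missing_per_variable)
print(round(percent_missing, 1))
print(complete_cases)
```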
To handle missing values, we first removed the cases that were missing values for all the items
(Figure 4). The total number of missing values was 1,226; after removing these cases, it became
1,006. Then, we assessed whether the data were missing completely at
random (MCAR) or missing at random (MAR). The MCAR test showed χ²(252) = 288.48,
p = 0.056, so the null hypothesis was not rejected, and the missing values were
assessed to be MCAR.
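As a quick check on the reported test, the p-value can be recovered from the chi-square statistic and its degrees of freedom. This sketch uses only the two reported numbers; the 0.05 threshold is the conventional cutoff and is assumed here.

```r
# Reported MCAR test statistic and degrees of freedom
chi_sq <- 288.48
df <- 252

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Conventional significance threshold (an assumption of this sketch)
alpha <- 0.05
reject_mcar <- p_value < alpha

print(round(p_value, 3))
print(reject_mcar)
```

Because the p-value is above the threshold, the null hypothesis that the data are MCAR is not rejected.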
Before conducting the MAR test, the missing values for each variable were dummy coded
as 1 = value is missing and 0 = value is not missing. We then examined the relationship of this
“missingness” to the other variables using logistic regression. Table 3 below shows the variables
whose missingness is predicted by other variables, listed in the second row of the table (p < .05).
For example, the missingness of Q7_3_G, the item “more effective integration of images, multimedia,
and graphs linked to the text,” is predicted by the title “Professor” and the gender “Female.”
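The dummy coding and the logistic regression on missingness can be sketched in base R as follows. The data are simulated for illustration, so the coefficients carry no substantive meaning; the appendix code uses car::recode for the same 1/0 coding, and is.na() is swapped in here as a base R equivalent.

```r
set.seed(1)

# Simulated data standing in for the survey (not the real responses)
n <- 200
toy <- data.frame(
  GenderFac = sample(1:2, n, replace = TRUE),
  Q7_3_G    = sample(1:10, n, replace = TRUE)
)

# Knock out 20 responses at random to create missing values
toy$Q7_3_G[sample(n, 20)] <- NA

# Dummy code missingness: 1 = value is missing, 0 = value is not missing
toy$Q7_3_Gmiss <- as.integer(is.na(toy$Q7_3_G))

# Logistic regression predicting missingness from gender
fit <- glm(Q7_3_Gmiss ~ factor(GenderFac), data = toy, family = binomial)
print(coef(summary(fit)))
```

A predictor with p < .05 in such a model would suggest that missingness on the item depends on that variable.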
Imputation
Single imputation was used to handle missing values in this analysis, based on
the result of the MCAR test (p > .05). Before imputing, we
computed alpha, which assumes that the scale is measuring a single construct. The result after random
10 Variables 5261 Observations
------------------------------------------------------------------------------------------------------------------
Q7_3_A
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.897 8.487 1.997 4 6 8 9 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 100 57 75 62 202 249 394 778 963 2381
Proportion 0.019 0.011 0.014 0.012 0.038 0.047 0.075 0.148 0.183 0.453
------------------------------------------------------------------------------------------------------------------
Q7_3_B
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.938 7.95 2.538 1 4 7 9 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 268 72 123 75 269 319 425 847 897 1966
Proportion 0.051 0.014 0.023 0.014 0.051 0.061 0.081 0.161 0.170 0.374
------------------------------------------------------------------------------------------------------------------
Q7_3_C
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.935 8.227 2.152 3 5 7 9 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 121 63 81 75 257 300 448 926 1037 1953
Proportion 0.023 0.012 0.015 0.014 0.049 0.057 0.085 0.176 0.197 0.371
------------------------------------------------------------------------------------------------------------------
Q7_3_D
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.958 7.599 2.772 1 3 6 8 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 276 128 160 96 405 384 470 792 871 1679
Proportion 0.052 0.024 0.030 0.018 0.077 0.073 0.089 0.151 0.166 0.319
------------------------------------------------------------------------------------------------------------------
Q7_3_E
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.912 8.172 2.386 3 5 7 9 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 179 83 117 87 291 267 414 691 856 2276
Proportion 0.034 0.016 0.022 0.017 0.055 0.051 0.079 0.131 0.163 0.433
------------------------------------------------------------------------------------------------------------------
Q7_3_F
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.974 6.679 3.493 1 1 5 8 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 676 216 233 168 424 390 437 688 652 1377
Proportion 0.128 0.041 0.044 0.032 0.081 0.074 0.083 0.131 0.124 0.262
------------------------------------------------------------------------------------------------------------------
Q7_3_G
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.96 7.574 2.734 2 3 6 8 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 229 124 180 121 380 487 478 822 796 1644
Proportion 0.044 0.024 0.034 0.023 0.072 0.093 0.091 0.156 0.151 0.312
------------------------------------------------------------------------------------------------------------------
Q7_3_H
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
5261 0 10 0.972 7.088 3.105 1 2 5 8 10 10 10
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 405 150 208 136 566 462 469 672 775 1418
Proportion 0.077 0.029 0.040 0.026 0.108 0.088 0.089 0.128 0.147 0.270
------------------------------------------------------------------------------------------------------------------
GenderFac
n missing distinct Info Mean Gmd
5261 0 2 0.731 1.421 0.4877
Value 1 2
Frequency 3044 2217
Proportion 0.579 0.421
---------------------------------------------------------------------------------------------------------------------------
TitleFac
n missing distinct Info Mean Gmd
5261 0 6 0.919 2.334 1.565
lowest : 1 2 3 4 5, highest: 2 3 4 5 6
Value 1 2 3 4 5 6
Frequency 2045 1358 877 365 270 346
Proportion 0.389 0.258 0.167 0.069 0.051 0.066
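One simple form of random single imputation replaces each missing value with a value drawn at random from the observed responses on the same variable. The sketch below illustrates that idea on an invented vector; impute_random is a hypothetical helper, and the procedure actually used to produce the output above may differ.

```r
set.seed(42)

# Invented item responses with two missing values
x <- c(10, 9, NA, 8, 10, NA, 7, 10)

# Hypothetical helper: fill each NA with a randomly drawn observed value
impute_random <- function(v) {
  missing <- is.na(v)
  observed <- v[!missing]
  # Index into the observed values so a single observed value is handled correctly
  draws <- observed[sample.int(length(observed), sum(missing), replace = TRUE)]
  v[missing] <- draws
  v
}

x_imp <- impute_random(x)
print(x_imp)
```

Observed values are left unchanged, and imputed values always fall within the set of responses that actually occurred.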
Inferential Analysis
The inferential analysis used regression on the randomly imputed data
to determine the relationship between the independent variables (job title, gender) and
the dependent variable (the faculty’s attitude toward the value of improvements in the digital library).
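A regression with two categorical predictors can be sketched in base R as below. The data are simulated, and the gender effect of 0.5 is an invented value used only to give the model something to detect; the outcome name attitude and the reduced set of three titles are also assumptions of this sketch.

```r
set.seed(7)

# Simulated predictors (not the survey data)
n <- 300
titles <- c("Professor", "Associate Professor", "Assistant Professor")
toy <- data.frame(
  GenderFac = factor(sample(c("Male", "Female"), n, replace = TRUE),
                     levels = c("Male", "Female")),
  TitleFac  = factor(sample(titles, n, replace = TRUE), levels = titles)
)

# Simulated composite score with an invented gender effect of 0.5
toy$attitude <- 7 + 0.5 * (toy$GenderFac == "Female") + rnorm(n)

# Linear regression; R dummy codes the factors automatically
fit <- lm(attitude ~ GenderFac + TitleFac, data = toy)
print(coef(summary(fit)))
```

Each factor contributes one coefficient per non-reference level, so "Male" and "Professor" serve as the reference categories here.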
After random imputation, gender is a significant predictor (p = .0001). The job title is not
Discussion
A respondent is more likely to place a high value on improvements in e-libraries if they are a woman
with a lower-ranked job title. Male professors are most likely to place a low value on
improvements. Possible reasons why data were missing are that (1) some participants agreed to
participate but did not respond to all survey items, or (2) participants began the survey but
responded only to the first section of the online survey Likert scales (the demographic questions
References
Housewright, R., Schonfeld, R. C., & Wulfson, K. (2013). Ithaka S+R US faculty survey 2012 (pp.
45-80). Ithaka S+R.
Appendix: R Code
#Plotting the number of variables with missing values for each case
naniar::gg_miss_case(ithakaSubset1)
naniar::gg_miss_case(ithakaSubset1,
facet=GenderFac)
naniar::gg_miss_case(ithakaSubset1,
facet=TitleFac)
library(mice)
mice::md.pattern(ithakaSubset1)
If this occurs, remove those cases that have no valid responses from the dataframe
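#Removing cases that are missing values for all the items
#(the filter condition is cut off in the source; the condition below is an assumed reconstruction)
ithakaQ7_3items2<-dplyr::filter(ithakaSubset1,
!(is.na(Q7_3_A)&is.na(Q7_3_B)&is.na(Q7_3_C)&is.na(Q7_3_D)&
is.na(Q7_3_E)&is.na(Q7_3_F)&is.na(Q7_3_G)&is.na(Q7_3_H)))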
mice::md.pattern(ithakaQ7_3items2)
#Dummy coding missingness for each variable: 1 = value is missing, 0 = value is not missing
ithakaQ7_3items2$Q7_3_Amiss<-car::recode(ithakaQ7_3items2$Q7_3_A,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Bmiss<-car::recode(ithakaQ7_3items2$Q7_3_B,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Cmiss<-car::recode(ithakaQ7_3items2$Q7_3_C,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Dmiss<-car::recode(ithakaQ7_3items2$Q7_3_D,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Emiss<-car::recode(ithakaQ7_3items2$Q7_3_E,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Fmiss<-car::recode(ithakaQ7_3items2$Q7_3_F,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Gmiss<-car::recode(ithakaQ7_3items2$Q7_3_G,
"NA=1;
else=0")
ithakaQ7_3items2$Q7_3_Hmiss<-car::recode(ithakaQ7_3items2$Q7_3_H,
"NA=1;
else=0")
ithakaQ7_3items2$GenderFacmiss<-car::recode(ithakaQ7_3items2$GenderFac,
"NA=1;
else=0")
ithakaQ7_3items2$TitleFacmiss<-car::recode(ithakaQ7_3items2$TitleFac,
"NA=1;
else=0")
Logistic regression predicting “missingness” of Q7_3_A from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_A from job title, gender, and the other Q7_3 items
Q7_3_AMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Amiss~
TitleFac+GenderFac+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(Q7_3_AMARcheck)
Logistic regression predicting “missingness” of Q7_3_B from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_B from job title, gender, and the other Q7_3 items
Q7_3_BMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Bmiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(Q7_3_BMARcheck)
Logistic regression predicting “missingness” of Q7_3_C from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_C from job title, gender, and the other Q7_3 items
Q7_3_CMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Cmiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_B+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(Q7_3_CMARcheck)
Logistic regression predicting “missingness” of Q7_3_D from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_D from job title, gender, and the other Q7_3 items
Q7_3_DMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Dmiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_E+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(Q7_3_DMARcheck)
Logistic regression predicting “missingness” of Q7_3_E from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_E from job title, gender, and the other Q7_3 items
Q7_3_EMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Emiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(Q7_3_EMARcheck)
Logistic regression predicting “missingness” of Q7_3_F from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_F from job title, gender, and the other Q7_3 items
Q7_3_FMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Fmiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_G+Q7_3_H,
family=binomial)
summary(Q7_3_FMARcheck)
Logistic regression predicting “missingness” of Q7_3_G from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_G from job title, gender, and the other Q7_3 items
Q7_3_GMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Gmiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_H,
family=binomial)
summary(Q7_3_GMARcheck)
Logistic regression predicting “missingness” of Q7_3_H from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of Q7_3_H from job title, gender, and the other Q7_3 items
Q7_3_HMARcheck<-glm(data=ithakaQ7_3items2,
Q7_3_Hmiss~
TitleFac+GenderFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_G,
family=binomial)
summary(Q7_3_HMARcheck)
Logistic regression predicting “missingness” of Title from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of job title from gender and the Q7_3 items
TitleFacMARcheck<-glm(data=ithakaQ7_3items2,
TitleFacmiss~
GenderFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(TitleFacMARcheck)
Logistic regression predicting “missingness” of Gender from the remaining variables in the dataframe: R code
#Logistic regression predicting missingness of gender from job title and the Q7_3 items
GenderFacMARcheck<-glm(data=ithakaQ7_3items2,
GenderFacmiss~
TitleFac+Q7_3_A+Q7_3_B+Q7_3_C+Q7_3_D+Q7_3_E+Q7_3_F+Q7_3_G+Q7_3_H,
family=binomial)
summary(GenderFacMARcheck)
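The ten model formulas above follow one pattern: the missingness indicator of one variable regressed on all the other variables. As a more compact alternative, offered only as a sketch, the formulas can be built programmatically with base R's reformulate(). The helper name mar_formula is hypothetical.

```r
# Variable names used in the missingness checks above
vars <- c("TitleFac", "GenderFac", paste0("Q7_3_", LETTERS[1:8]))

# Hypothetical helper: formula predicting <target>miss from all other variables
mar_formula <- function(target, all_vars) {
  reformulate(setdiff(all_vars, target), response = paste0(target, "miss"))
}

# One formula per variable, named by the variable whose missingness is modeled
formulas <- lapply(vars, mar_formula, all_vars = vars)
names(formulas) <- vars

print(formulas[["Q7_3_A"]])
```

Each formula could then be passed to glm() with family=binomial and data=ithakaQ7_3items2, in the same way as the individual calls above.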
Create a composite (mean) “Faculty’s Attitude” score for each respondent: R code
#Compute a composite ‘Faculty’s Attitude’ (SUeB) score and add it to the dataframe
ithakaSubset1$SUeB<- rowMeans(cbind(ithaka$Q7_3_A,
ithaka$Q7_3_B,
ithaka$Q7_3_C,
ithaka$Q7_3_D,
ithaka$Q7_3_E,
ithaka$Q7_3_F,
ithaka$Q7_3_G,
ithaka$Q7_3_H),
na.rm=TRUE)
labels=c("Professor",
"Associate Professor",
"Assistant Professor",
"Adjunct Professor",
"Lecturer",
"Other"))
Hmisc::describe(ithakaRanImpute)
Missing data pattern plot using naniar package