Article

Science Teacher Motivation and Evaluation Policy in a High-Stakes Testing State

Mintz and Kelly

Educational Policy
1–38
© The Author(s) 2018
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/0895904818810520
journals.sagepub.com/home/epx
Abstract
This qualitative case study explored teachers’ and administrators’ perceptions of a newly implemented teacher evaluation policy in a high-stakes testing state, and how this policy impacted their motivation. Five science teachers and their immediate supervisors were interviewed, and their perceptions were analyzed through motivational theories of incentivizing career behaviors. Findings suggest the overarching goal of improving teacher practice through accountability was facilitated by intrinsic motivation and challenged by weaknesses in policy design. These tensions could be mediated by localized control that improves stakeholder agency, peer learning communities, and the adoption of more reliable evaluation metrics. Implications for teacher buy-in of evaluation policy are discussed.
Keywords
educational policy, high-stakes accountability, policy implementation,
qualitative research, science education, secondary education, state policies,
supervision, teacher-administrator relations, teacher quality
Corresponding Author:
Angela M. Kelly, Associate Professor, Institute for STEM Education, Stony Brook University,
092 Life Sciences, Stony Brook, NY 11794-5233, USA.
Email: angela.kelly@stonybrook.edu
Introduction
Teacher evaluation and accompanying educational reforms have received considerable attention across the United States in recent years. Research in educational policy has called for evaluating the practices, perceptions, and motivation of teachers to understand the successes and challenges of policy changes (Cuevas, Ntoumanis, Fernández-Bustos, & Bartholomew, 2018; Datnow & Castellano, 2000). The most promising evaluation reform initiatives involved multiple sources of data and actively cultivated teacher agency (Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012; Louis, Febey, & Schroeder, 2005). Successful implementation of new policy depends largely upon teacher acceptance and motivation; therefore, researchers and policy makers must be attentive to teachers’ responses to change (de Jesus & Lens, 2005).
National- and state-level policies have impacted local teacher evaluation systems by requiring measurement of teacher performance, pathways for professional improvement, and mechanisms for identifying underperforming teachers. In May 2011, the New York State Education Department (NYSED) initiated a plan to use state standardized assessment (“Regents”) results as a partial quantitative factor in measuring the effectiveness of science teachers (New York State Legislature, 2010). The decision to use these assessment scores was influenced by funds made available by Race to the Top, a program that offered competitive grant money to states that linked student achievement data to individual teachers (American Recovery and Reinvestment Act [ARRA], 2009).
New York’s Annual Professional Performance Review (APPR) system placed emphasis on evaluating science teachers’ abilities with high stakes attached. Teachers rated as developing or ineffective could be placed on a teacher improvement plan or terminated, rather than having their scores used to provide meaningful professional development (Pallas, 2012). Examining teacher and administrator insights regarding teacher evaluation is essential because they are the key players in the learning process (Goe, Bell, & Little, 2008). Because teacher quality has been identified as the primary influence on student achievement (Stronge, Ward, Tucker, & Hindman, 2007), teachers’ motivation and perceptions of the evaluation process impact pedagogical practice and, thus, student learning (Hopkins, 2016).
This study is one of the first to focus on the motivation and perceptions of science teachers as well as their direct supervisors in performance review. Research has indicated that New York science teachers have been impacted by state standards in the evaluation process more so than teachers in other subjects and science teachers in other states (Louis et al., 2005; NORC at the University of Chicago, 2018), though there has been little or no recent research targeting the evaluation of these teachers in light of recent legislation. This is consistent with research suggesting that studies on reform efforts lag behind policy implementation, which in turn shifts in response to constituent push-back (Coburn, Hill, & Spillane, 2016). In New York State, nearly all high school science educators teach curricula that culminate in high-stakes Regents exams, which is not the case for most teachers in other disciplines. These exams often determine graduation eligibility, as students must pass a biology and physical science exam to earn a Regents diploma (NYSED, 2015a). Secondary science teachers are also unique in that their disciplinary supervisors often do not share their areas of certification; research has suggested that subject-specific feedback is necessary to provide useful knowledge leading to pedagogical improvement (Hill & Grossman, 2013).
The purpose of this study is to provide insights on teacher motivation related to the implementation of science teacher evaluation reform in a high-stakes testing state, New York, with the ultimate goal of providing recommendations to inform future efforts to promote professional motivation, excellence in science teaching, and student learning. The results from this study have broader implications for the use of student performance data in teacher evaluations and reach beyond science teachers in New York, informing policy across disciplines and states. The findings contribute to the larger discussion of the use of high-stakes achievement measures in evaluating teachers, policy enactment and its impact on motivation, and mechanisms for effective accountability implementation.
This qualitative study explored connections among teachers’ and administrators’ motivation and perceptions, reform implementation factors, and consequences of a teacher evaluation system based upon classroom observations and student outcome measures. This study included five secondary science teachers with varied years of experience and the administrators responsible for their evaluations. Both teachers and administrators were included to reflect the varying perspectives of two groups of key stakeholders: those responsible for overseeing implementation of the policy at the departmental level, and those directly bearing the consequences of the policy in terms of their professional performance evaluations. The research questions were as follows:
Research Question 1: How did the APPR system affect the motivation, perceptions, and professional satisfaction of science teachers and their administrators?
Research Question 2: What challenges have been identified?
Background
This study employed a motivational lens to generate explanatory constructs for improving teacher evaluation policy implementation. Teacher evaluation requires effective measurement, capacity building, and appropriate incentives to motivate teachers to improve professional practice (Firestone, 2014). Research has shown that most teachers have exhibited a sense of agency regarding their students’ academic performance, and teachers’ individual motivation was dependent upon the alignment of policy incentives with their desire to see their students succeed (Finnigan & Gross, 2007). Accountability policies that overemphasized student testing performance often resulted in teachers experiencing pressure, stress, and diminished motivation (Cuevas et al., 2018). School leaders have been a critical factor in mitigating teachers’ responses to accountability policies, and their efforts have impacted teacher motivation to comply with educational reforms (Leithwood, Steinbach, & Jantzi, 2002). In the current climate of teacher performance measures gaining importance with respect to school-level measures, more research is needed on educators’ perceptions and responses to accountability reforms, and how such reforms influence motivation (Harris & Herrington, 2015).
Conceptual Framework
This qualitative study incorporated a priori theories of motivational incentives in exploring science teacher and administrator perspectives of APPR. Research has suggested that teachers must be motivated to implement reforms that are based upon top-down policy initiatives (de Jesus & Lens, 2005). Motivation has been defined as inspiration or reason for acting or behaving a certain way (Ryan & Deci, 2000). Two overarching theories of motivation are often evident in the design of teacher evaluation systems: intrinsic and extrinsic. Intrinsically motivated individuals engage in tasks because they experience inherent interest and joy in their work (Eccles & Wigfield, 2002). Competence and autonomy are basic psychological needs that often lead to intrinsic motivation. Extrinsic motivation involves performing a task because of anticipated separate consequences. Although extrinsic motivation has often been perceived as the weaker incentive for self-directed change, individuals may integrate external prompts if they are congruent with their values and beliefs (Ryan & Deci, 2000). Consequently, intrinsic and extrinsic motivations are not necessarily discrete entities.
Teachers with autonomy have often internalized the goals of administrators, principals, or educational leaders if they found those goals reasonable and within their power to achieve (de Jesus & Lens, 2005). When the goals came from a valid authority figure, such as a respected principal or administrator, specific and challenging goals produced greater effort. Even with the rapid proliferation of accountability policies, teachers still reported their work was less influenced by these policies and more influenced by student and peer validation (Finnigan & Gross, 2007; Firestone, Nordin, Shcherbakov, Blitz, & Kirova, 2014).
Since the passage of No Child Left Behind (NCLB; U.S. Department of Education, 2001) and Race to the Top (U.S. Department of Education, 2009), the connection between teacher effectiveness and student achievement has prompted reform efforts in teacher evaluation (Strong, Gargani, & Hacifazlioglu, 2011). States have developed and implemented new evaluation systems to improve upon prior methods that were unable to differentiate teacher quality (Weisberg, Sexton, Mulhern, & Keeling, 2009). The current practice of teacher compensation based upon years of experience and academic credentials has had little impact on student learning (Hanushek, Kain, O’Brien, & Rivkin, 2005). Further research has shown that teacher evaluations ranked most teachers as satisfactory or good, but failed to recognize great teachers and offered little professional development for poorly performing teachers (Forman & Markson, 2015; Kraft & Gilmour, 2017). Most researchers and policy makers have agreed that the evaluation process is more productive when it facilitates meaningful professional development, which promotes intrinsic motivation, rather than serving as a summative judgment of teacher performance, which may negatively influence motivation due to external pressure (Finnigan & Gross, 2007; Gigante & Firestone, 2008).
The most common way to measure teacher effectiveness has been through subjective observation. This type of formative evaluation, intended to develop pedagogical skills, has been criticized because it often relies upon one evaluator’s perception of teacher effectiveness without validating data to support his or her interpretation (Strong et al., 2011). School leaders have tended to be lenient in teacher evaluation scores (Kimball & Milanowski, 2009), which have rarely been used in making consequential personnel decisions (Murphy, Hallinger, & Heck, 2013). However, practice-based teacher evaluation is arguably the most direct evidence of a teacher’s ability to affect student learning, which would suggest that these observations be heavily weighted when calculating a teacher’s summative score, although multiple measures are needed (Darling-Hammond et al., 2012). Teacher observation is a complex process that requires appropriate instruments and training to yield valid and reliable results. The accuracy and fairness of evaluation metrics are essential for providing useful feedback and incentivizing excellent teaching (Firestone, 2014).
Teachers have also been evaluated based on value-added models (VAM) of student achievement; these models consider student growth in the teacher’s evaluation in an attempt to control for outside factors that affect student achievement, measuring teacher effectiveness through adjusted gains in standardized test scores (Harris, 2009). Student performance data have sometimes been associated with more accurate teacher evaluations (Hopkins, 2016). However, other researchers have questioned the usefulness of these models (Baker et al., 2010; Papay, 2012). Although research has shown that teacher evaluation policy offers promise for improving student learning (Louis et al., 2005), few studies have explored teacher and administrator views of these reforms and how the process impacts professional motivation. Teachers have resisted educational policy change if they felt that reforms did not match the views of their professional communities (Jiang, Sporte, & Luppescu, 2015). Their perceptions of new policies have been influenced by structural and social conditions of schools and relationships with school administration (Louis et al., 2005; Malen, 2003).
Educators generally believe that evaluation is a worthwhile practice (Clipa, 2015). Teachers have expressed that evaluations should be used to measure and develop their pedagogical skills, but development should be prioritized (Marzano, 2012). A recent study on teacher perceptions of evaluation reform in Chicago showed that teachers were concerned about the addition of student growth as a part of their evaluations but found the observation process provided useful feedback (Jiang et al., 2015). Principals voiced concerns about the perceived inequities of the teacher evaluation system and its impact on teachers, particularly when evaluations were not used for instructional improvement (Kimball & Milanowski, 2009). These concerns were intensified when their resource and time management issues were not addressed during transitions to new evaluation policy (Derrington & Campbell, 2015). The concerns of teachers and administrators are critical considerations when examining the potential of evaluation initiatives to motivate teachers and build capacity for student achievement.
Method
Research Design
The focus of this phenomenological case study was to explore and describe the shared meaning of lived experiences for a group of individuals (Creswell, 2013), and to analyze the impact of evaluation policy upon teacher motivation in a localized context. Case study is an appropriate approach for examining policy implementation with key stakeholders as units of analysis (Yin, 1994). The shared experience in this case study was the external implementation of a state-mandated teacher evaluation system that was primarily based upon student performance scores and teacher observations. Qualitative research has been recommended for evaluating educational policy because understanding district subsystems provides nuanced insights into the connection between policy and practice (Sadler, 1985). The researchers explored teacher and administrator perceptions of the APPR system.
Context
APPR in New York State. On May 28, 2011, New York’s Senate and Assembly voted to structure teacher evaluations with 40% of the composite score based upon student achievement and 60% based upon observations. A primary feature of the law was that every school district in the state prepared and implemented an APPR that began in the 2012-2013 academic year. Teachers were annually reviewed for performance based upon a composite score of student growth, student achievement, and other measures such as teacher observations. These categories are described in more detail below.
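Before the categories are detailed, the 40/60 weighting just described can be sketched as simple arithmetic. The subscores and the 0-100 scale here are invented for illustration, and districts varied in how subcomponents were combined:

```python
def composite_score(achievement, observation,
                    w_achievement=0.40, w_observation=0.60):
    """Weighted APPR-style composite on a common 0-100 scale, using the
    40% achievement / 60% observation split described in the text."""
    return w_achievement * achievement + w_observation * observation

# Invented subscores: strong observation ratings outweigh a weaker test result.
print(round(composite_score(70, 85), 2))  # 79.0
```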
Student growth. Student growth was the measure of the change in a student’s scores between two or more points in time (NYSED, 2015b). To measure student growth, objectives needed to be defined to show evidence that a student had learned more science. A Student Learning Objective (SLO) was an academic goal set by the educator at the beginning of each school year (Tyler, 2011). The SLO contained information on the student population, the learning content, the instructional time frame, the assessments used to measure the goal, the baseline level of the students in the class, the expected target by the end of the course, district-based HEDI (highly effective, effective, developing, ineffective) ratings, and a rationale as to why the teacher chose such targets. SLOs for Regents-level courses were required to use the Regents exam results as evidence of student learning during the instructional time frame. The living environment and chemistry courses measured student growth using a baseline exam and the state Regents exam as a summative assessment.
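The baseline-to-Regents growth computation implied by the SLO can be sketched as follows. The expected gain and the HEDI cut points here are hypothetical placeholders, since targets and rating bands were set district by district:

```python
def slo_rating(baseline, regents, expected_gain=15, cuts=(0.90, 0.75, 0.60)):
    """Fraction of students who grew at least `expected_gain` points from
    the baseline exam to the Regents exam, mapped to a HEDI-style label.
    Both the gain and the cut points are invented for illustration."""
    met = sum((r - b) >= expected_gain for b, r in zip(baseline, regents))
    fraction = met / len(baseline)
    if fraction >= cuts[0]:
        return "Highly Effective"
    if fraction >= cuts[1]:
        return "Effective"
    if fraction >= cuts[2]:
        return "Developing"
    return "Ineffective"

baseline = [55, 48, 62, 70]  # invented start-of-year pretest scores
regents = [72, 65, 80, 78]   # invented year-end Regents scores
print(slo_rating(baseline, regents))  # 3 of 4 students met the target
```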
Participants. The study included five secondary science teachers with varied years of experience and five administrators responsible for their evaluations across Suffolk County, New York. The geographical area comprises the eastern end of Long Island, measuring 2,400 square miles with 69 different school districts. The county encapsulates the characteristic features of a much larger school district in a small area and has a long tradition of state-mandated test-based accountability. There is a great range of variance in school tax revenue, expenditures, and educational attainment due to segregation, the fractured structure of education, and the major differences in property taxes (Long Island Index, 2009). However, schools in the area typically experienced organizational stability with low teacher turnover due to relatively high teacher salaries.
Participants were chosen based upon their years of experience, school district demographics, and position as a secondary living environment or chemistry teacher or school science administrator/supervisor. The goal of the maximum variation sampling process (Patton, 1990) was to gather teachers with a range of educational experience among different school districts to obtain a representative view of teacher and administrator perceptions. The use of such purposeful sampling allowed the researchers to employ cross-case analysis to identify key motivational themes that were consistent among a variety of school contexts.
The study required participants to have experienced the change in teacher evaluation law, and science teachers with more than 5 years of experience would have been teaching during this transitional period. Science teachers were also chosen based on their content certifications. Science teachers who taught the living environment course were chosen because the accompanying high-stakes Regents exam was required for graduation. To broaden the perspectives of the participants and to make the study more generalizable, science teachers who taught Regents chemistry were chosen, as their courses were not technically required for graduation but also culminated in a Regents exam. Consequently, qualitative data revealed unique descriptions of teacher and administrator perspectives, as well as shared patterns elicited from a heterogeneous sample (Patton, 1990).
The five participating pairs, with each pair including a teacher and the administrator responsible for his or her evaluation, were employed by the same school district. Because it was the administrator’s responsibility to explain the evaluation process to the teachers, these pairs were purposely chosen to analyze the relationship between the evaluator perceptions and the teacher perceptions. The participants had a variety of procedures for teacher evaluation. The administrators reported being directly responsible for supervising between 40 and 78 teachers. Most of the administrators had undergraduate majors and certifications in science, with one exception. The two chemistry teachers who participated in this study were observed and evaluated by administrators who did not have chemistry majors or certifications. The teacher, administrator, and district descriptions are summarized in Table 2.
Table 2. Administrator Descriptions.

Administrator 1: Science & Technology Director. Undergraduate degree: Science—Med Tech. Certifications: Biology, General Science, Chemistry, Administrator. Years of experience (teaching + administration): 36. Teachers directly supervised: 78.

Administrator 2: Assistant Principal. Undergraduate degree: Elementary Education. Certifications: Kindergarten and Grades 1-6, Administrator. Years of experience: 19. Teachers directly supervised: 50-60.

Administrator 3: Science & Technology Director. Undergraduate degree: Biology. Certifications: Biology, General Science, Administration. Years of experience: 28. Teachers directly supervised: 40-50.

Administrator 4: Science & Technology Chairperson. Undergraduate degree: Physics. Certifications: Physics, General Science, Math 7-12, Administrator. Years of experience: 17. Teachers directly supervised: 49.

Administrator 5: Science Director. Undergraduate degree: Environmental Science. Certifications: Earth Science, Biology, General Science, Administration. Years of experience: 24. Teachers directly supervised: 42.
Findings
Data revealed that the policy goal to foster teacher motivation and professional development through accountability met with mixed results. Teachers expressed evidence of intrinsic motivation; however, there were clear challenges related to accountability metrics and lack of stakeholder agency. The secondary science teachers in this study were particularly affected because of content specialization and the high-stakes exams their students were required to take.
I think it’s the passion for the subject in terms of what I’m teaching. I’ve taken
such an interest in chemistry. I’ve also taken such an interest in learning new
things through the science research program. The second part is the interaction
with the students and just working with kids—students—that interaction is
incredible. It’s fun and it’s the reason I like coming here every day. (Robert)
The data also indicated the science teachers recognized the importance of evaluation and accountability as extrinsic motivations, and felt the observation process was the most effective of the three APPR categories. They generally felt the dialogues with administrators were practical and constructive, and they demonstrated reasonable knowledge of the observation process and identified the rubric used to generate their observation scores. Both science teachers and administrators found the conversations about science lessons had improved as a result of APPR and the required observation rubrics. The teachers found that the observation rubrics provided the instrumentality, or the explicit means to succeed with clear lesson expectations, and they sometimes received more productive feedback when administrators used the rubric as a discussion guide after the lesson.
Christine felt as if the conversations during the post-conferences with her administrator, Charles, were useful and provided her with information that could improve her practice.
The expectations were made clear and she found the dialogue formative, which was an important aspect of her buying into the process. Despite their generally positive feelings about observations, all participating teachers desired more content-specific recommendations. It is notable that the science teachers observed by administrators who did not share the same teaching certification voiced this concern more often. For example, Annie, a teacher whose certification was different from her supervisor’s, stated, “I am a strong believer in [evaluation]. And I think it should always be external. I do think that it needs to be someone who has experience in teaching and possibly even experience teaching that subject.”
Some teachers in this study voiced concerns with extrinsic motivators, such as publicly available ratings, stating that they would lead to teacher competition rather than teacher collaboration. One of the goals of APPR was to develop excellent teachers; however, categorical rating comparisons to within-school and out-of-school colleagues had consequences. When external evaluation promotes competitiveness and ranks teachers in relation to their colleagues, these same teachers may compete for a higher composite score and demonstrate unwillingness to share pedagogical tools. Jack described this potential threat to the spirit of professional collaboration:
Colleagues should really work together. Part of teaching, part of the idea of
education is mentoring, learning from other individuals, sharing knowledge
and I feel as though when you put that competition aspect into it you’re not
going to inspire people to be better, you’re going to inspire people to kind of be
greedy and money always leads to that and when you bring that variable into
an equation, who knows where it goes, who knows how it’s going to fracture
education. (Jack)
Other teachers concurred that the APPR did not nurture their motivation; rather, the extrinsic incentives attached to APPR were not congruent with their teaching beliefs. The dynamic interaction between intrinsic and extrinsic motivation suggested that teachers were committed to educating children and improving their practice through collegial interactions, but they expressed some concerns about extrinsic rewards that might foster competition rather than collaboration. This and other constructs are explored in more detail in the next section.
Accountability Challenges
The teachers and administrators cited several challenges that impacted their perceptions of APPR and professional motivation, including time constraints, lack of clarity, perceptions of unfairness, and lack of agency. Administrators found the observations to be important; however, the increased volume of observations reduced the amount of time they could dedicate to being teacher leaders. They voiced concerns about the commitment required, which included a pre-observation meeting, the observation, and a post-observation conference. As Jane stated, “So I think with someone who has a large staff like myself it’s very difficult to get through. I think I do 78 observations.” The recommendations made by the administrators focused on reducing the amount of time associated with formal observations, as they believed the formal observation provided a limited view of the teacher’s overall practice; the science teachers also mentioned formal observations being a “show” of teaching. The administrators believed increasing the number of informal classroom visits was a more practical way to gain information about the professional needs of their teaching staffs. The time involved in the observation process was significant in terms of practicality, and they preferred devoting attention toward staff development and improvement rather than the typically rote and cursory observation process. Charles described the increase in workload on science administrators and how it shifted his priorities.
Some teachers voiced concerns about the observation process when they felt they were not getting productive feedback from observers. Annie commented that the delay between the observation and written feedback was too long for the evaluation to be useful. She also felt there was little clarity regarding what could have been improved in the lesson. In this sense, she could not understand how the observer differentiated between “effective” and “highly effective” and felt “short changed” to receive a lower rating:
. . . my observation, I didn’t get back for weeks later, even though I know it’s
supposed to be 48 hours . . . and I feel like they’re just—the administration is
so overworked with doing observations that I think they write just the basic
stuff that they I think it’s almost like a form letter. But yet it’s just like the old
way of observations . . . so I got an “effective,” but yet there was nothing in
there that you would’ve improved about the lesson. (Annie)
This teacher was not only disappointed with the excessive time to receive feedback, but also with the cursory manner in which the administrator used the rubric as a list of items or behaviors to verify. She believed the observer should have exerted greater effort in evaluating her work to produce insightful instructional guidance. Lack of differentiation was also problematic. Sarah explained that the science teachers within her district were all classified as “highly effective,” which painted an inaccurate picture of the true makeup of her colleagues. She felt that some teachers within her department were less skilled than others but still received a highly effective rating:
I could work my butt off, and someone could come in, leave, lecture every day,
do no activities, not even meet the lab requirements, they’re still highly
effective. So it’s just, it’s unfair. The whole system’s unfair. I don’t feel like it
achieved its goal. (Sarah)
Other teachers expressed confusion about how the components of their composite scores were calculated:

I have the observations, and then that growth score from my students based on the end of the year, their Regent scores—how they’ve grown. Then there’s another component. I don’t think it’s either the state or the local, and get very confused on this so I’m really sorry. This is the part that I don’t always—I’m not always sure of. But it’s component-based on I think something with the school. I don’t even—I don’t know. (Christine)
When Jack was questioned about the difference between the achievement on
exams and the improvement on exams, he seemed unclear and wondered
what the expectations were for the testing components, stating, “That’s one of
the areas I’ve actually always felt as though it was very muddled. I don’t see
much of a difference there at all and I don’t understand what the expectations
of the state or the expectations of my district are.”
Another issue was the use of baseline student performance data in the calculation of the SLO. Of the five teachers interviewed, all but one used a science assessment as a baseline. One district used the seventh-grade language arts exam score as a baseline, while another used an eighth-grade science exam score. The other teachers used practice Regents exams that were distributed at the start of the academic year. Students were aware that baseline exams were used to evaluate the teacher and did not affect their grades; consequently, many students were reportedly less serious about the pretests. In each case, teachers found the use of these scores problematic because the validity and reliability of the measures were called into question. The SLO was intended to measure the teacher’s influence on student growth, yet it became impractical and was perceived as unfair if it did not measure what was intended. The reliance on unreliable metrics often negatively impacted motivation. Science teachers generally expressed that the observation process was important in evaluation and development; however, they believed that using student test scores as a means of evaluating teacher performance should be eliminated or adjusted.
The variation in student population among different levels and from year to year was a commonly articulated concern. Jack pointed out equity issues with the APPR, as he taught the living environment course to a group of English as a second language (ESL) students. They were held to the same standards as matriculated English-speaking students in terms of Regents exam performance. He found this part of the evaluation troublesome because the performance of nontraditional students was not considered fairly when the evaluation system was changed, and he consequently returned his teaching to what he had done before the policy was implemented:
The curriculum is far too wide for ESL students to begin with and then that’s
never taken into account. And then the fact that I felt as though I was going to
be scored negatively on it I felt as though I had to rush through a curriculum
and I’m sure that led to my students really not enjoying the class and not even
comprehending things as well as they should have. From that perspective that
was really, really difficult for me and after the first year of APPR, I stepped
back, I re-evaluated myself, and I went back to some old strategies. (Jack)
Other teachers pointed out the inequity between classes and between different school years. When students were tracked, that is, placed in different levels of the same content course based on academic ability, some teachers taught higher level sections. An example would be an enriched living environment course, taught with additional topics not tested on the exam. A teacher with higher level students typically received a higher composite score. In addition, students in some academic years were higher performing than in others. Because of this variability, the teachers
felt that quantitative composite scores should not be used to judge teaching
performance. Several administrators also regretted there was no mechanism
to adjust for student differences when looking at aggregate test scores for
individual science teachers. Some teachers and administrators felt APPR tar-
gets were unattainable due to lack of consideration of student variability,
which lessened their buy-in and motivation.
A final theme that emerged was the rapid timeline for policy implementa-
tion that diminished teacher and administrator agency. As APPR was intro-
duced, Common Core Standards (Common Core State Standards Initiative,
2010) were implemented simultaneously. The top-down nature of this reform and its hurried enactment did not allow stakeholders to fully understand the process. Adam shared the impacts of this decision:
The rollout was a problem statewide because it came out when Common Core
was first introduced. And there was a lot of research that clearly outlined the
fact that you had so many initiatives under the Race to the Top—because APPR
was part of the Race to the Top—when you secured those funds, then you had
to implement or jump through all these hoops. And at the same time, Common
Core, we had the shift in the standards. And, because in New York State, we
were actually tested on the new standards and then evaluated a year before we
had to do it. So there was definitely a poor timeline as far as implementation.
There wasn’t enough time for the teachers to understand the new standards in
order to be evaluated correctly. (Adam)
He attributed his science teachers’ lack of clarity about the new evalua-
tion policy to the aggressive time frame. He believed that efforts to receive
federal funds minimized valuable discussions on how teachers should be
evaluated, suggesting that science teachers may have resisted the APPR
policy because they were not included in the preliminary discussions.
This lack of agency resulting from a focus on top-down directives was echoed by Rich, an administrator:
When we first started the transition to the new APPR, and there were some
initial committee meetings to discuss the negotiation of the new framework,
many of us, myself included, said let’s start thinking outside the box. Let’s
get a little more creative. Let’s work back to something that still satisfies
these mandates, but at the same time affords us the opportunity to work
more effectively with our staff, and instead of maintaining this performance
based show in some regards—but let’s free up that time so that perhaps we
can form—I would love to institute peer-to-peer observation. That’s talked
about all the time, rarely implemented effectively, and certainly not in our
district. (Rich)
Rich emphasized the level of professional trust that should exist between the
state and school administrators. He argued that although the administration should have some accountability to the state, the state should trust district professionals to do the right thing in evaluating and developing their staff. He also felt that the policy as written was not
congruent with his professional beliefs regarding instructional supervision,
as teachers were receiving evaluation scores based upon categories in a rubric
that were not possible in every lesson. This added to the problem of using
invalid metrics to quantify teacher impacts.
Most teachers also expressed that they would prefer another form of evaluation that would give them meaningful feedback. The science teachers discussed how targeted professional development would be more helpful to
motivate them intrinsically to refine their teaching practice. Robert com-
mented on creating a peer review system:
It’s almost like a peer review system would work very well where you can have
teachers come in and they can observe at any point in time and you can ask
them to evaluate you intently. I know that takes time, I know not everybody is
comfortable doing that, you have to have a rapport with that person but I think
that’s a much better evaluation system than having an independent consultant
come in or an administrator who hardly knows you come in and see you in either an observed or a planned observation. (Robert)
The data from this study suggested that student growth and achievement
should be carefully considered in a science teacher’s evaluation because of the
variability among students, lack of seriousness among students taking pre- and
posttests, and the limited reliability and validity of baseline measures. Some of
the participants believed that high-stakes state exams did motivate teachers to raise student achievement.
Limitations
The study has several limitations. The context where this research study took
place was unique compared with the organization of other school districts in
the United States. The research area captured the characteristic features of a
much larger school district and had a long tradition of state-mandated test-
based accountability. The results from this qualitative study may not be gen-
eralizable outside of the state of New York in school districts with different
standardized testing cultures, or in rural and urban school districts. Although
the perspectives of urban and rural classroom teachers and administrators
were not included, their views would likely present different understandings
and broader explanations of APPR policy implementation. The small number
of interviews conducted and the results gathered might not be generalizable
beyond the teachers and administrators interviewed. The interview protocol,
developed with a priori constructs, may have missed key variables related to
localized policy implementation.
The researchers took steps to minimize their biases during discussions
with each other and external colleagues, yet their views and judgments were
ever present during data collection and analysis. This could have influenced
participant responses during the interviews and further constrained generaliz-
ability. Several biases were uncovered through discussions with academic
colleagues through the critical friend model (Fetterman & Wandersman,
2005), yet additional misinterpretations were possible.
Discussion
This study attempts to fill the gap between educational research and practice
by generating findings that are geared to influence science teacher motivation
and practice through the evaluation process. Research has raised concerns about whether new evaluation systems will result in improved instruction and increased student learning despite the devotion of significant resources (Sipple, Killen, & Monk, 2003). Educational research has indicated that the perceptions of key stakeholders are important in understanding policy successes and failures (Datnow & Castellano, 2000). This
research study builds upon Firestone’s (2014) work with evaluation and
motivation theories by questioning whether the APPR policy was designed to
leverage teacher motivation to improve practice and student performance.
Louis et al. (2005) suggested that science and mathematics teachers are most
affected by accountability, and these findings shed light upon building the
capacity of science teachers to exercise agency in changing educational systems. This work also furthers the work of de Jesus and Lens (2005) by applying motivational constructs to the study of teacher evaluation policy.
Motivation
The APPR policy in New York State was designed to foster teacher profes-
sional development through accountability in an effort to increase student
achievement. The teacher-administrator pairs in this study revealed insights
regarding the tensions between intrinsic and extrinsic motivation in meeting
the policy goals. The science teachers in this study demonstrated a passion for teaching science and for having a positive impact on student learning, which were the main motivations for their careers. They agreed the reform should place more emphasis on the aspects of teaching that would intrinsically motivate them rather than on a quantified extrinsic motivator with unreasonable metrics. This may explain the teachers’ and administrators’ responses to the policy, and how the policy did not reflect an understanding of what motivates teachers.
Intrinsic motivation is influenced by teacher efficacy and trust (de Jesus &
Lens, 2005; Gigante & Firestone, 2008), so it is essential that assessments of
effectiveness are perceived as fair, clear, and reliable. If student test scores
are to be used in teacher evaluation, their weight should be given careful
consideration along with variations in student population. Student test scores
should be used formatively as a way to adjust teaching practice in areas in
which students have difficulty. Student performance is an important part of
educator accountability and should be utilized to inform teachers about suc-
cessful and unsuccessful teaching practices. However, this needs to be based
upon reliable and valid metrics. Research has suggested that value-added
measures are not sufficient alone and often are not aligned with observation
ratings (Hill et al., 2011; Kimball & Milanowski, 2009), and the participants
in this study concurred that data must be cumulative to moderate the effects
of student differences from year to year.
Teacher learning and accountability were often perceived as mutually
exclusive constructs, which may be construed as tension between intrinsic
(inherent desire for learning) and extrinsic (accountability) motivations.
Intrinsically motivating teachers through professional developmental strate-
gies would fulfill one of the goals of Race to the Top—maximizing student
learning through improved teacher effectiveness. The policy would have had a more positive reception if science teachers had been clear about the policy and cognizant of its usefulness and practicality. The extrinsic nature of a
quantified composite score that encapsulated a teacher’s professional practice
was viewed as disconnected from their intrinsic motivation; however, this did
not need to be the case. Many of the teachers in this study expressed sensitiv-
ity to their status in their professional community, and they were validated by
positive feedback from their supervisors. They wanted to become better edu-
cators, but were suspicious of a policy that lacked positive incentives. They
identified negative consequences such as the potential professional embarrassment of a low HEDI score. This compromised their sense of professionalism, as
the unfairness and uncertainty of certain metrics seemed to marginalize their
efforts. Science teachers expressed commitment to professional improvement,
although this was not because of the policy but in spite of it.
Lasting educational change occurs when centralized mandates and local initiatives unite. Fullan (1994) concluded that systems change when individuals and small groups find commonality both locally and centrally. In one sense, teacher evaluation is
already bottom-up as ineffective teachers are removed before granting tenure.
However, the top-down approach of the state evaluation system had limited
impact because administrators were not given the resources to manage
increased workload, and teachers often viewed their composite scores as
unfair. Teacher evaluations can only improve instruction and student learning
if there is trust between the teachers being evaluated and administration
(Firestone, 2014). Trust in professional relationships has been found to con-
tribute to increased motivation and collaboration, leading to improved student
learning in science (Smetana, Wenner, Settlage, & McCoach, 2016).
Implications
This research suggests strategies for the state to improve the evaluation
policy and process, and, consequently, science teacher motivation. A one-
size-fits-all approach to science teacher evaluations is not appropriate for
states with large numbers of school districts with varied ranges of socio-
economic diversity. The results and themes generated by this research lend
themselves to a bigger question, that is, how should science teachers be
evaluated? Structural changes in the format of science teacher evaluation
are necessary to accomplish the goals set by Race to the Top and New York
State, namely, to provide objective teacher evaluation results that foster
motivation, professional growth, and student learning. ESSA released states from reporting “highly effective” teachers and accompanying student performance metrics; however, New York maintained the requirement that student test scores be included in the teacher evaluation process.
State- and local-level experts in teacher evaluations and pedagogical con-
tent knowledge should be included in the revamping of science teacher
evaluation policy to meet the needs of specific stakeholders. Science
teachers and administrators, as practitioners of policy, have insightful rec-
ommendations to offer, and fostering their agency will facilitate profes-
sional engagement in the process.
When it comes to specialized content areas, such as the sciences, educa-
tional leadership loses some of its generalizability. Science teachers are con-
tent specialists, and their supervisors sometimes do not share the same content
expertise. Teachers need credible sources of knowledge to benefit from evalu-
ation (Murphy et al., 2013), so it is essential that administrators provide the
requisite content expertise to make effective recommendations. The certification of science administrators does not need to be changed or become more specialized; rather, teacher leaders and peer learning communities could share
disciplinary skills and strategies while lessening the time burden for science
administrators. The administrator’s role could be adjusted to become a facili-
tator for peer observations, teacher collaborative communities, and educa-
tional rounds. The administrator could change his or her role from strictly an
evaluator to an instructional leader, seeking out meaningful professional
development to fulfill the needs of faculty. These innovations would strengthen
the value of peer validation, a desirable component of extrinsic motivation
(Finnigan & Gross, 2007).
The results of this study indicate that the professional development of
science teachers should be the focus of teacher evaluations. The partici-
pants found accountability to be an important component of evaluation and
an extrinsic motivator, yet these evaluations may not need to be conducted
every year. Composite scores calculated over a longer period of time could
provide a more reliable representation of the teacher’s quality. Developing
individual professional learning plans is another mechanism for evaluating
science teachers. Multiple assessments with specific growth targets within
the academic year could provide the information needed to demonstrate
professional improvement.
Preservice teacher programs could also benefit from this study. These pro-
grams could introduce content-specific observation rubrics to preservice
teachers to familiarize them with metrics for lesson evaluations. Highly moti-
vated veteran teachers are calling for more content-specific professional
development, so it seems logical that preservice teachers would also require
the same training in disciplinary pedagogical content knowledge. Bridging
the gap between preservice science teacher training and professional science
teacher development could improve the quality and self-reflection of novice
science teachers as they enter the profession.
Conclusion
Based on the information gathered from this study, successful implementation of teacher evaluation reform should correspond with positive teacher perceptions and focus on intrinsically motivating science teachers to improve instruction. Overall, this study found positive potential in the evaluations of
teachers in New York under the APPR system, yet negative stakeholder reac-
tions regarding the perceived unfair, chaotic, and punitive nature of the pol-
icy. The goal of science teacher evaluations should be to provide the
framework for developing excellent science educators through localized con-
trol. This study has established some important considerations when design-
ing and implementing evaluation policy.
Several important points were learned as a result of this study: (a) science
teachers and administrators value the importance of teacher evaluations, (b)
conversations between administrators and teachers have improved as a result
of APPR’s implementation of observation rubrics, (c) the condensed timeline
and top-down approach of APPR policy may have contributed to teacher
resistance because of lack of policy clarity, (d) science teachers and administrators found that APPR’s use of student test scores did not provide practical and reliable information to improve science teaching practice,
and (e) science educators believed professional development should be the
main focus of teacher evaluations. The cautious nature of teacher perceptions
associated with student test scores and the positive perceptions associated
with the observation process indicate that policy makers should address this
dichotomy in future evaluation reform efforts. Motivational constructs pro-
vide an insightful lens for designing policy to leverage teachers’ commitment
to improving practice and raising student achievement.
The insights gathered from this study add to the limited literature regarding
science teacher and administrator perspectives of evaluation policy. As policy
makers have increased the focus on teacher evaluation, examining these per-
ceptions in a motivational context provides the foundation for further research
in this area. This study calls for future qualitative and large-scale randomized
controlled studies regarding the effects of professional development on teacher
motivation, as well as the impacts of science teacher collaborative networks,
peer review, and educational rounds on science teaching and learning. Future
reform efforts regarding science teacher evaluations should include educator
input, explicit and valid metrics, and possess practicality for science teachers
and administrators. The evaluation policy should emphasize disciplinary pro-
fessional development and teacher professionalism to motivate teachers
intrinsically and foster continuous pedagogical growth.
Appendix A
Science Teacher Semi-Structured Interview Protocol
Appendix B
Science Administrator Semi-Structured Interview Protocol
10. Do you think student test scores should be used to evaluate teacher
quality? Why or why not?
11. Discuss your general impression of the overall evaluation system.
12. Do you think teachers have changed anything about the way they
teach because of the new APPR [Annual Professional Performance
Review] system of evaluation?
13. Do you feel APPR fosters a culture of continuous professional
growth?
14. Discuss the qualities that characterize a “great” teacher.
15. What are your views on tenure?
16. Describe your feelings about the new Annual Professional
Performance Reviews.
17. What would be, in your opinion, the ideal way to evaluate teachers?
Funding
The author(s) received no financial support for the research, authorship, and/or publi-
cation of this article.
ORCID iDs
Jessica A. Mintz https://orcid.org/0000-0002-8789-9164
Angela M. Kelly https://orcid.org/0000-0003-1393-1296
References
American Recovery and Reinvestment Act (ARRA) of 2009, Pub. L. No. 111-5, 123
Stat. 115, 516 (February 19, 2009).
Amrein-Beardsley, A. (2008). Methodological concerns about the education value-
added assessment system. Educational Researcher, 37, 65-75.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn,
R. L., . . . Shepard, L. A. (2010). Problems with the use of student test scores to
evaluate teachers (vol. 278). Washington, DC: Economic Policy Institute.
Barusch, A., Gringeri, C., & George, M. (2011). Rigor in qualitative social work
research: A review of strategies used in published articles. Social Work Research,
35(1), 11-19.
Bill and Melinda Gates Foundation. (2011). Learning about teaching: Initial findings
from the measures of effective teaching project. Bellevue, WA: Author.
Firestone, W. A., Nordin, T. L., Shcherbakov, A., Blitz, C. L., & Kirova, D. (2014).
New Jersey’s Pilot Teacher Evaluation Program: Year 2 final. New Brunswick,
NJ: Center for Effective School Practices.
Forman, K., & Markson, C. (2015). Is “effective” the new “ineffective”? A crisis
with the New York state teacher evaluation system. Journal for Leadership and
Instruction, 14(2), 5-11.
Fullan, M. G. (1994). Coordinating top-down and bottom-up strategies for edu-
cational reform. In R. J. Anson (Ed.), Systemic reform: Perspectives on per-
sonalizing education (pp. 7-22). Washington, DC: U.S. Government Printing
Office.
Fullan, M. G. (2001). The new meaning of educational change (3rd ed.). New York,
NY: Teachers College Press.
Gigante, N. A., & Firestone, W. A. (2008). Administrative support and teacher lead-
ership in schools implementing reform. Journal of Educational Administration,
46, 302-331.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies
for qualitative research. Chicago, IL: Aldine.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effective-
ness: A research synthesis. Washington, DC: National Comprehensive Center
for Teacher Quality.
González, R. A., & Firestone, W. A. (2013). Educational tug-of-war: Internal and
external accountability of principals in varied contexts. Journal of Educational
Administration, 51, 383-406.
Guest, G., Bunce, A., & Johnson, L. (2006). How many interviews are enough? An
experiment with data saturation and variability. Field Methods, 18, 59-82.
Hanushek, E. A., Kain, J. F., O’Brien, D. M., & Rivkin, S. G. (2005). The market for teacher quality (No. w11154). Cambridge, MA: National Bureau of Economic Research.
Harris, D. N. (2009). Would accountability based on teacher value added be smart
policy? An examination of the statistical properties and policy alternatives.
Education Finance and Policy, 4, 319-350.
Harris, D. N., & Herrington, C. D. (2015). The use of teacher value-added measures in
schools: New evidence, unanswered questions, and future prospects. Educational
Researcher, 44, 71-76.
Harris, D. N., Ingle, W. K., & Rutledge, S. A. (2014). How teacher evaluation meth-
ods matter for accountability: A comparative analysis of teacher effectiveness
ratings by principals and teacher value-added measures. American Educational
Research Journal, 51, 73-112.
Hill, H., & Grossman, P. (2013). Learning from teacher observations: Challenges and
opportunities posed by new teacher evaluation systems. Harvard Educational
Review, 83, 371-384.
Hill, H., Kapitula, L., & Umland, K. (2011). A validity argument approach to evalu-
ating teacher value-added scores. American Educational Research Journal, 48,
794-831.
New York State Legislature. (2010). Article 61. Teachers and supervisory and administrative staff (Section 3012-c Annual professional performance review of classroom teachers and building principals). Retrieved from http://public.leginfo.state.ny.us/menuf.cgi
New York State United Teachers. (2011). The NYSUT teacher practice rubric.
Latham: Author.
NORC at the University of Chicago. (2018). State-administered HS end of course
(EOC) science assessments, intended uses, 2016-17. Retrieved from http://stem-assessment.org/table/pages/table10.aspx
Pallas, A. M. (2012). The fuzzy scarlet letter. Educational Leadership, 70, 54-57.
Papay, J. P. (2012). Refocusing the debate: Assessing the purposes and tools of teacher evaluation. Harvard Educational Review, 82, 123-141.
Patton, M. Q. (1990). Qualitative evaluation and research methods. Newbury Park,
CA: SAGE.
Patton, M. Q. (1999). Enhancing the quality and credibility of qualitative analysis.
Health Services Research, 34(5 Pt. 2), 1189-1208.
Polkinghorne, D. E. (1989). Phenomenological research methods. In R. S. Valle
& S. Halling (Eds.), Existential-phenomenological perspectives in psychology
(pp. 41-60). New York, NY: Plenum Press.
Ryan, R. M., & Deci, E. L. (2000). Intrinsic and extrinsic motivations: Classic defi-
nitions and new directions. Contemporary Educational Psychology, 25, 54-67.
Ryan, R. M., & Deci, E. L. (2006). Self-regulation and the problem of human auton-
omy: Does psychology need choice, self-determination, and will? Journal of
Personality, 74, 1557-1586.
Sadler, D. R. (1985). Evaluation, policy analysis, and multiple case studies:
Aspects of focus and sampling. Educational Evaluation and Policy Analysis,
7, 143-149.
Saldaña, J. (2013). The coding manual for qualitative researchers. Thousand Oaks,
CA: SAGE.
Sipple, J. W., Killen, K., & Monk, D. H. (2003). Adoption and adaptation: School
district responses to state imposed learning and graduation requirements.
Educational Evaluation and Policy Analysis, 26, 143-168.
Smetana, L. K., Wenner, J., Settlage, J., & McCoach, D. B. (2016). Clarifying and
capturing “trust” in relation to science education: Dimensions of trustworthiness
within schools and associations with equitable student achievement. Science
Education, 100, 78-95.
Springer, M. G., Ballou, D., & Peng, A. (2008). Impact of the teacher advancement
program on student test score gains: An independent appraisal. Nashville, TN:
National Center on Performance Incentives.
Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher
performance: What do teacher observation scores really measure? Educational
Evaluation and Policy Analysis, 38, 293-317.
Strauss, A., & Corbin, J. (1990). Basics of qualitative research: Grounded theory
procedures and techniques. Newbury Park, CA: SAGE.
Strong, M., Gargani, J., & Hacifazlioglu, O. (2011). Do we know a successful teacher
when we see one? Experiments in the identification of effective teachers. Journal
of Teacher Education, 62, 367-382.
Stronge, J. H., Ward, T. J., Tucker, P. D., & Hindman, J. L. (2007). What is the rela-
tionship between teacher quality and student achievement? An exploratory study.
Journal of Personnel Evaluation in Education, 20, 165-184.
Tuytens, M., & Devos, G. (2009). Teachers’ perception of the new teacher evalua-
tion policy: A validity study of the Policy Characteristics Scale. Teaching and
Teacher Education, 25, 924-930.
Tyler, J. H. (2011). Designing high quality evaluation systems for high school teach-
ers: Challenges and potential solutions. Washington, DC: Center for American
Progress.
U.S. Department of Education. (2001). The No Child Left Behind Act of 2001 [public
law]. Retrieved from https://www.congress.gov/bill/107th-congress/house-bill/1
U.S. Department of Education. (2009). Race to the Top Program Executive Summary.
Washington, DC: Author. Retrieved from https://www2.ed.gov/programs/racetothetop/executive-summary.pdf
Watt, D. (2007). On becoming a qualitative researcher: The value of reflexivity. The
Qualitative Report, 12, 82-101.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our
national failure to acknowledge and act on differences in teacher effectiveness.
Brooklyn, NY: New Teacher Project. Retrieved from https://tntp.org/publications/view/the-widget-effect-failure-to-act-on-differences-in-teacher-effectiveness
Weiss, J. A. (2012). Data for improvement, data for accountability. Teachers College
Record, 114, 110307.
Xu, X., Grant, L. W., & Ward, T. J. (2016). Validation of a statewide teacher evalua-
tion system: Relationship between scores from evaluation and student academic
progress. NASSP Bulletin, 100, 203-222.
Yin, R. K. (1994). Case study research: Design and methods (2nd ed.). Thousand Oaks, CA: SAGE.
Author Biographies
Jessica A. Mintz was awarded the PhD in Science Education from Stony Brook
University, NY, in 2017. She is a New York State Master Teacher and a high school
science teacher in Bay Shore, NY. Her research interests include science teacher
accountability practices and chemistry teacher professional development.
Angela M. Kelly is an associate director of Science Education at the Institute for
STEM Education, and an associate professor of Physics at Stony Brook University.
Her research interests include inequities in physical science and engineering educa-
tion; reformed teaching practices in science; and sociocognitive influences on STEM
access and participation.