A two-phase study examining perspectives and use of quantitative methods in PER
Alexis V. Knaub, John M. Aiken, and Lin Ding
Department of Physics and Astronomy, Michigan State University, East Lansing, MI, USA 48823
Center for Computing in Science Education & Department of Physics, University of Oslo, N-0316 Oslo, Norway
Department of Teaching and Learning, The Ohio State University, Columbus, OH, USA 43210
While other fields such as statistics and education have examined various issues with quantitative work, few studies in physics education research (PER) have done so. We conducted a two-phase study to identify and to understand the extent of these issues in quantitative PER. During Phase 1, we conducted a focus group of three experts in this area, followed by six interviews. Subsequent interviews refined our plan. Both the focus group and interviews revealed issues regarding the lack of detail in sample descriptions, lack of institutional/course contextual information, lack of reporting on limitations, and overgeneralization or overstatement of conclusions. During Phase 2, we examined 72 manuscripts that used four conceptual or attitudinal assessments (Force Concept Inventory, Conceptual Survey of Electricity and Magnetism, Brief Electricity and Magnetism Assessment, and Colorado Learning Attitudes about Science Survey). Manuscripts were coded on whether they featured various sample descriptions, institutional/course context information, and limitations, and whether they overgeneralized conclusions. We also analyzed the data to see if reporting has changed from earlier periods to more recent times. We found that not much has changed regarding sample descriptions and institutional/course context information, but reporting limitations and avoiding overgeneralized conclusions have improved over time. We offer some questions for researchers, reviewers, and readers in PER to consider when conducting or using quantitative work.
I. INTRODUCTION
Quantitative research can provide valuable insights and be compelling for audiences, as it often draws upon large sample sizes and uses compelling statistics [1]. In PER, quantitative work is typically employed to provide numeric "... observations through some statistical techniques in order to better describe, explain, and make inferences about certain events, ideas or actions in physics education" [2]. However, the usefulness of this work is only as good as its quality. Quantitative work is not immune to issues regarding data collection and analysis, as well as researcher bias [3]. Research in other fields indicates that peer-reviewed quantitative research has room for improvement in how current techniques are used. While few of these studies were within physics education research (PER), it is likely that PER also has areas ripe for improvement. However, without investigation, we cannot assume that PER has the same set of quantitative research issues that other fields have.

To better understand the issues within peer-reviewed quantitative PER, we sought to find out what these issues are. This study has the following research goals:

1. Identify issues in quantitative physics education research, as well as ways to improve, via advice from experts
2. Determine how pervasive identified issues are in PER

Our first goal was used to identify which issues we should focus on in order to create a cohesive project that has relevant research questions and results. Our second goal was to understand the extent of identified issues.

We addressed these goals in a two-phase study design. In the first phase, we asked quantitative PER experts for their perspectives on the biggest issues that quantitative PER has. In the second phase, we examined peer-reviewed articles on assessments from the American Journal of Physics (AJP), Physical Review Physics Education Research (PRPER), and The Physics Teacher (TPT) to determine the extent to which these expert-identified issues exist.
II. BACKGROUND AND MOTIVATION

A. Research design issues
Regardless of whether a study is quantitative, a research project's design determines the course of a research project, including which data are collected, how data are collected, and what analyses can be done. Research design decisions are often determined by the research questions that a researcher seeks to answer [2]. Some decisions include which data to collect and whether the data are considered categorical, continuous, etc. [4]. What is measured should have some understood theoretical backing. Scott (2011) emphasized, "Only if a researcher has a clear understanding of the logic of a particular measure can he or she make an informed sociological judgment about its relevance for a particular piece of research" [5]. If care is not taken, the results may lack validity (i.e., whether an instrument measures what it claims) and reliability (i.e., whether the instrument's questions are consistently interpreted) [4]. More information on the nuances of reliability and validity (e.g., the importance of multiple sources of evidence for claims) can be found in many standard textbooks (e.g., [9]).

Besides considering the research questions and how the design answers those questions, it is also important to consider limitations within the study's design and what is truly being asked. Docktor & Mestre pointed out that PER studies that use cognitive psychology often occur within a lab and thus, the results may be different if one tries to replicate the study in a classroom [6]. Another example is who is included in the sample. While some decisions regarding who is included in the sample are deliberate (e.g., purposive sampling), others may be accidental because a portion of the population would not be identified (e.g., only relying on who is listed in the phonebook) even if they are a part of the population of interest (e.g., convenience sampling) [7].
Some research questions may be answerable, but their answers are not meaningful (e.g., comparing two groups for no apparent reason) [8].

B. Analysis issues in quantitative work
Quantitative work can have issues beyond how the research project is conceptualized. Literature indicates that studies in many fields, such as education and medicine, may violate assumptions built into statistical models, or researchers may incorrectly interpret the statistics (e.g., making causal claims when one cannot). Examples include incorrect use and interpretation of: value added modeling [8], multivariate statistics [10], exploratory factor analysis [11], and p-values [12, 13]. Parametric statistics, such as t-tests, assume that the data have a normal distribution and are interval data [4]. The examples given are not just hypothetical but have been observed by researchers in various fields. There have been some studies regarding statistical issues in PER, such as issues with normalized gain (e.g., [14, 15]).

If statistics are not carefully used and reported, issues can arise. Misinterpretation of p-values may make a result seem much more important than it is [12, 13, 16]. To better interpret results, effect sizes and confidence intervals should be reported [12, 13, 16]. Some of these issues have led to false positives and false negatives (i.e., type 1 and type 2 errors) [17]. In other words, these errors can mislead researchers into believing their results have statistical significance when they do not (i.e., false positive) or that the results lack statistical significance when they do (i.e., false negative).

Other issues are more general. Researchers may not have information on all individuals in a sample, resulting in missing data. For example, surveys have response rates that indicate what percentage of individuals in a sample took a survey. If only a particular group of individuals took a survey, biased results that do not accurately portray the population are possible [7, 19]. Researchers may also make decisions on which cases to keep in their data.
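To make the point about reporting effect sizes alongside p-values concrete, the following sketch (hypothetical scores and group labels, not data from any study discussed here) shows how a negligible difference can still clear the conventional significance threshold when the sample is large. It uses only the Python standard library; the p-value relies on a large-sample normal approximation rather than an exact t distribution.

```python
import math
import random

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

def two_sided_p_normal_approx(a, b):
    """Two-sided p-value for a difference in means, using a normal
    approximation to the t distribution (reasonable for large samples)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical assessment scores: a tiny true difference, but many students.
random.seed(0)
section_a = [random.gauss(60.0, 15.0) for _ in range(20000)]
section_b = [random.gauss(61.0, 15.0) for _ in range(20000)]

p = two_sided_p_normal_approx(section_a, section_b)
d = cohens_d(section_b, section_a)
print(f"p = {p:.2g}, Cohen's d = {d:.2f}")
```

Here the p-value is far below 0.05, yet Cohen's d is well under the conventional "small" threshold of 0.2, which is why the literature cited above recommends reporting both quantities (with a confidence interval) rather than the p-value alone.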
"Cleaning" data or selecting cases that do not seem valid can introduce errors, especially if removed cases were actually valid data [3]. How data are aggregated matters, because important information can be lost. For example, the percentage of postsecondary degree holders has large variance among different Asian American ethnicities (e.g., Chinese, Vietnamese) [18]. When data are reported in aggregate, these details are lost and mislead audiences to believe Asian Americans are universally succeeding [18].

Graphics can be useful, but they also can portray data inaccurately [3, 4]. One such example is that a histogram may indicate a normal distribution, though calculations indicate the distribution is not normal [3]. Another such example is that a graph's scale may distort the data to de-emphasize differences between two groups [4].

Clearly defining terms is important, as words can mean different things [5]. In a study on equity, how "equity" is defined and what the study intends to examine can lead to different interpretations of data [20]. The authors advocate that researchers be explicit in their definition of equity. While this paper focused on one term, such explicitness is likely useful for other concepts and for generally interpreting results. Ding and Liu (2012) noted the following:

"However powerful a statistical analysis may be, it is after all just a tool that can inform one of what a result is but cannot tell why the result is such. It is the researcher's job to make credible inferences for the reasons that underlie the results and connect them back to the original theoretical framework" [2].

Quantitative research has its limitations in what can be described and thus, claims should be carefully made. While statistics can describe trends within large groups, understanding what an individual is thinking may not be possible [1]. Correlation does not equal causation, as many factors can produce high correlations [2, 3].
Data depict some student attributes and are taken during a specific period [21, 22]. If these data are to be used to guide decisions for students, such as recommending which courses they should take, researchers suggest that those using the data keep in mind that students are individuals and that a student's past does not necessarily determine their future [21, 22].
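The histogram pitfall noted earlier in this section can be illustrated with a short, self-contained sketch (the "time on task" data and variable names are hypothetical): a right-skewed sample may look roughly bell-shaped in a coarse histogram, yet a simple Jarque-Bera-style moment check flags the non-normality that a parametric test would otherwise assume away.

```python
import math
import random

def moments(xs):
    """Sample skewness and kurtosis (population-style estimators)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    skew = sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5
    kurt = sum((x - m) ** 4 for x in xs) / n / s2 ** 2
    return skew, kurt

def jarque_bera(xs):
    """Jarque-Bera statistic; approximately chi-squared with 2 degrees of
    freedom under normality, so values above ~5.99 reject at the 5% level."""
    n = len(xs)
    skew, kurt = moments(xs)
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# Hypothetical right-skewed "time on task" data (exponential, so clearly
# non-normal): a coarse histogram of such data can look deceptively bell-like.
random.seed(1)
times = [random.expovariate(1.0) for _ in range(500)]

skew, _ = moments(times)
jb = jarque_bera(times)
print(f"skewness = {skew:.2f}, JB = {jb:.1f}")
```

A check like this (or a standard test such as Shapiro-Wilk from a statistics library) is cheap to run and report, and it guards against relying on a visual impression of normality before applying a parametric test.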
C. Motivation
In summary, the broad literature on research design and analysis indicates there are multiple potential issues, ranging from how data are collected, to whether statistics are used appropriately, to how they are then interpreted. Some of these studies are in PER. Yet, it is unclear whether these issues exist in PER or how pervasive they are. Our goals with this paper are to examine which issues exist in PER and to what extent they do.
III. PHASE 1: COMMUNITY PERCEPTIONS ON QUANTITATIVE PHYSICS EDUCATION RESEARCH

A. Research questions
The research questions for Phase 1 are as follows:

1. What issues exist in quantitative PER, and how are the issues ranked by experts?
2. What suggestions do experts have to improve the robustness of quantitative PER?

Experts for this study were identified by the editors of AJP and PRPER.
B. Methodology
1. Focus group sample description and protocol
We asked the editors of AJP and PRPER for some suggestions of whom they consider experts in quantitative PER and who would be interested in participating in a focus group regarding quantitative PER; we specified that these individuals do not necessarily need to work exclusively in PER or exclusively with members of PER but should have enough familiarity to be able to comment on quantitative PER.

The total list included 27 individuals. While we believe that the editors and advisory board did indeed identify quantitative experts in PER, we also anticipate that this list is not comprehensive; we do not want readers to believe there are only 27 quantitative experts in PER. For the focus group, we contacted 8 individuals. We wanted a diverse set of opinions and did not want the focus group to reflect a particular research tradition, group, or advisor. Individuals were selected based on current institutional affiliation; where they received their doctorates and did postdoctoral work (when applicable); and who recommended them. Institutional affiliations and degree information were gathered from group, departmental, or personal websites.

Five individuals agreed to a day and time to meet virtually through a video meeting. The other three declined due to time commitments. Three people attended the focus group conducted by Aiken and Knaub. The three who attended were all from different institutions and had different doctoral and postdoctoral backgrounds. They represent the recommendations of two editors or journal advisory board members.

As this was a focus group, we used a semi-structured protocol aimed at generating discussion among the focus group participants. We included these questions in Appendix A. Questions delved into participants' general opinions on quantitative PER, challenges and mistakes within quantitative PER, how pervasive these issues are, and recommendations for resources. Based on the literature and our experiences, we believe there are issues within quantitative PER.
However, we did not want to lead the focus group toward confirming our beliefs; hence, we asked open-ended questions.
2. Interview sample description and protocol
Based on the focus group's feedback, we developed a project plan for Phase 2. To refine the project plan, we conducted interviews for feedback and other suggestions. We contacted seven individuals, purposively selecting for diversity in current institution, doctoral institution, postdoctoral institution (when applicable), and recommender.

Six agreed to be interviewed, including two individuals who were supposed to attend the focus group. Only one individual declined, claiming not to be an expert in quantitative PER. The individuals we interviewed represent the recommendations of four editors or journal advisory board members and a variety of backgrounds.
3. Limitations and threats to validity
For this phase of the study, our limitations and threats to validity are the small size of the focus group and presenting a pre-made plan. For the former, perhaps a larger or different set of individuals would have identified other issues in quantitative PER. However, our follow-up interviews indicated these issues are present in quantitative PER.

Regarding a pre-made plan, perhaps, had our interviewees not seen this plan, other issues may have been identified. To make sure we received candid feedback, both interviewers took care to ensure that interviewees felt comfortable critiquing the plan and making other suggestions if they thought there were other issues we should focus on. We used broad, open-ended questions and encouraged interviewees to be honest and offer alternatives if they felt we were focusing on a non-issue. We also anticipate that there may be other issues in quantitative PER and hope that other issues in research are explored in the future.
C. Results: identified issues in PER from the focus group
We present findings from our focus group. Individual focus group participants are referred to as F(number) (e.g., F1).

Our focus group identified assessments, such as concept inventories and attitude surveys, as one of the primary areas where PER has focused its quantitative work. Literature has indicated that assessment, particularly multiple-choice concept inventories (e.g., the Force Concept Inventory), historically has been and continues to be an important part of PER [6]. In terms of specific issues, the focus group identified several issues within quantitative PER. While they did discuss issues such as p-values and effect sizes, they primarily focused on the following:

• Not reporting on limitations well.
If care is not taken to report on a study's limitations, our focus group thought readers may come to incorrect conclusions. One focus group participant, F1, explained this:

If you're not careful about your presentation, the unwary reader can take messages away from quantitative papers that are not valid because of the type of data that you have, the type of techniques that you used, or that kind of thing. So, I think reporting on the limitations of statistical techniques is something that we don't do well enough, sort of putting boundaries on the realm of applicability of our findings, both from a demographic point of view, as F2 mentioned, but also from a statistical point of view.

F3 gave an example of a limitation within quantitative work:

I think a lot of questions, if you crudely characterize questions that are about how or about mechanisms, are a lot harder to answer quantitatively. Like, how do students approach this new set of inquiry materials, you're never going to get a good answer to that quantitatively unless you find some extremely innovative ways of quantifying the student experience, and I have trouble with imagining how that would work.

F3 explained that in this example, qualitative work could complement the quantitative. However, F3 was clear that the quantitative by itself would likely have limitations on what was found.

• Institutional/course contexts and samples are not described well.
This was succinctly stated by F2: "PER does not do a good job of telling its audience who it is that they're studying." Describing one's population and sample is important because errors can result if a study's implications are applied to a completely different population or if the sample is not representative of the population [7]. F1 pointed out that these descriptions need to be done thoughtfully or the audience might get lost:

So, it's a balance, I think when you're publishing, to try and provide enough information that people can really understand what you're doing, but not more information... [that] ultimately make the paper unreadable or so they can't understand it if they're not already invested in the literature and the statistical techniques being used.

• Overgeneralizing and overstating results.
F2 remarked that "the assumption is that if I present data about my students, it's going to be equally valid for your students." The idea is that context varies in critical ways (e.g., student backgrounds, resources available) that may impact whether one can expect similar results in a completely different context. F1 concurred, pointing out that if authors are not careful, readers may draw inaccurate conclusions:

I think we tend to make conclusions based on these findings that are sometimes not entirely valid. So, it's not necessarily I think that the statistical techniques themselves are flawed, but if you're not careful about your presentation, the unwary reader can take messages away from quantitative papers that are not valid because of the type of data that you have, the type of techniques that you used, or that kind of thing.

There were a few areas where the focus group identified potential issues but also saw value in current practices, even if they are imperfect. Some examples include:

• Implicit theories. Some papers do not describe the theory (e.g., a theoretical framework that uses social, cognitive, and/or learning theories) that guides the study. The focus group believed that rather than lacking any theory, there are implicit theories that authors may not articulate. F1 explained why this could be an issue:

If somebody comes into that data with a different perspective than was intended by the authors or a different idea of what learning looks like than what was intended by the authors, they can very easily misinterpret what those data are saying.

However, when the idea of requiring theories to be explained came up during the discussion, the focus group was divided.
While F1 saw "...no harm can come from articulating it, and being forced to articulate it", F2 was hesitant to take such a stance, stating:

I worry about making that a requirement of publication, and this has been on my mind because there are people that are really pushing to have that be a requirement of publication... If I look at something from the University of Washington, I sort of know what their framework is because the papers that they've published before all sort of come from the same place. It doesn't strike me as necessary to reiterate that at the beginning of every paper.

F2 was not opposed to peer reviewers suggesting that authors explicitly articulate theories but was against authors being forced to articulate them. F3 was in the middle:

I wish there would be a nice middle way because I feel like for those researchers, if they don't use some of their writing time to help think through the theoretical basis for their work, then they're also missing out on a chance to grow.

• Not thoroughly considering all aspects of research design and articulating the research design.
Our focus group pointed out the level of detail other fields devote to research design in terms of data collection, hypotheses, etc. F2 explained this:

F1 talked about earlier, but the care that's gone into the design of the experiments, the planning of how many students we will need to get a statistically significant result. Very carefully stating what the goal of the experimenter is, what a null result would look like, etc.

However, the focus group was cautious to suggest that PER adopt these practices as a universal standard for all manuscripts. The potential dangers of a more stringent standard were articulated by F2:

One thing I worry about is a lot of times, people aren't looking at things as this can be a big tent. It's like they didn't do this and so their research is not valid or useful, and I think that's a danger for PER. Just because it's not the way you want to do research, doesn't mean that it's not useful ... That [the research] isn't necessarily compelling or isn't teaching us something.

Depending on what standards were adopted, there is potential that some research simply could not meet the standards. F3 pointed out that such studies may not have made mistakes but have constraints that pose a limitation:

If you're a researcher at a smaller institution, you can't get some variation between the classes that you're studying, so you're limited by what you can do but you still want to contribute to the enterprise of PER so you're going to do what you can do.

Despite the acknowledged issues, the focus group believed quantitative work in PER is improving, becoming more nuanced and following the quantitative practices adopted by other fields that are believed to be good practices. They believed that because PER is a newer field, researchers did not initially use as many statistical techniques.
F1 pointed out that some statistical techniques are newer to PER but are likely to become part of quantitative practice:

I think as we start to ask more sophisticated questions, reporting of effect sizes, the importance of effect sizes, as far as the value of the research, and then conversations about statistical power from the beginning, I think, are going to become more important. They're not something that I've seen a lot in the literature as of yet, but I think they're going to be important in the future.

At the same time, while PER may be moving in this direction, the focus group, as seen in the previous findings, saw room for a variety of research papers. F2 emphasized a "big tent" where there is room for these more detailed papers as well as papers that do not go into these details. They suggested that other authors can publish manuscripts that critique work that lacks certain elements (e.g., an explicit theoretical framework).
D. Results: feedback on project plan from interviewees
Overall, the interviewees thought our project plan for Phase 2 would be useful for PER. The project plan reflected the interviewees' concerns and observations. Some of their comments expanded upon the themes of the focus group, pointing out more detailed issues and ramifications if quantitative work is not done well. They also pointed out other issues within PER that they observed. Below we summarize the additional insights that the interviews provided, linking some of the interviewees' comments to the overarching themes from the focus group as well as summarizing other issues they mentioned. Individual interviewees are referred to as I(number) (e.g., I1).

• Not reporting limitations well.
– Variability with individuals in studies.
Two interviewees described two limitations regarding studying people. One is that while samples can be representative of a given population, there is variability. I2 pointed out that "we're not certain that the results are generalizable... This is really a complex system that we're looking at, or complex systems." I3 gave a specific example:

Sometimes you see studies... where there are direct comparisons drawn from treatment and control groups where the treatment group is at one high school and a control group is at a different high school. You just can't do that. This is highly, highly problematic. On top of that, when you do those kinds of things, once again you fall back to the presumption that there is an inherent similarity across individuals that just fundamentally doesn't exist. It just isn't there.

Among individuals, even with similar demographic markers (e.g., same race, gender), there is variability. I3 emphasized that:

People are not electrons, nor are they protons, nor are they photons. There isn't a degree of similarity between them. There is a variety of difference from individual to individual, and as a result, making these kinds of broad-based claims based on statistical analyses is something that should be done with care, if done at all.

I3 also suggested another issue with variability with individuals, that individuals themselves are not necessarily consistent from one day to the next by chance:

When you look at FCI, when you look at these other instruments that you have listed here, CSEM, BEMA, CLASS, it's important to bear in mind that if you give the instrument to the same individual the next day, the responses might not be identical. That is something to accept, think about, and contend with as a researcher. You can't say, well, this is what it is. There is some degree of error associated with that.

• Institutional/course contexts and samples are not described well.
– Missing data.
Two interviewees discussed how PER may not handle missing data well. I4 simply stated, "There's almost never any discussion of [missing data]. Not even the information to be able to know if there is missing data." I1 believed that when researchers account for missing data, they tend to do paired analysis (e.g., only include students who took both the pre- and post-tests). While I1 felt this could be a step in the right direction, they also thought that paired analysis has some critical limitations: "But you haven't accounted for the people that were, that sort of, were invisible, right?" I1 advocated for using imputation methods to examine missing data:

If you did an imputation method you would see the pattern. You would see, "Oh, students who, over on this question, identified as female also systematically don't answer this question. Hmm maybe there's a problem there." And then you can basically fix it through some imputation method. Or at least you can account for the increased ignorance from not having those responses.

In sum, analyzing missing data can offer additional insights that may not be apparent otherwise.

• Overgeneralizing and overstating results.
– Making causal claims when one cannot.
I3 was concerned with how researchers in quantitative work make causal claims:

My primary concern has to do with this idea that somehow performing an empirical study or an experimental study of some kind where there is a direct comparison between a treatment and control group allows you to make a causal claim. Fundamentally, in educational research of all types that would be a huge mistake to do that. It has a lot to do with the idea that individuals from person to person vary a great deal, and that's generally accepted... I see these kinds of claims, and this is a much more consistent kind of problem in a lot of papers that I've reviewed over the years... Even if you have an experiment, to make a direct comparison, to find a statistical difference between two groups and then on top of that to say, 'Well, then as a result, this shows [the treatment worked],' that's very dangerous territory to go into.

I3 emphasized they were not trying to stop people from doing education research or making claims but that I3 was encouraging good research practices:

There's a difference between not being able to do it at all and not being able to draw these overarching causal conclusions and being careful about what we say and being thoughtful about how we make certain kinds of claims in what it is that we found. Making definitive claims in educational research is fraught.

– Implicit ungeneralizability.
I1 hypothesized:

[Authors may] really only care about their students, for example, at their institution. And so they may be very implicit about the fact that they really aren't speaking to anything beyond the edges of their campus. So you sort of have to, I think, read that carefully, in some sense. That's also a slippery slope because sometimes people forget that their work is very institutionalized and it doesn't apply to other schools.

According to this interviewee, explicitness regarding context and who was studied would help audiences understand that the work may not apply to all institutions.

• Implicit theories.
– Lack of theoretical basis for interpreting measures.
I1 stressed the importance of theory within quantitative PER: "Does the measure make any sense, right?" Without having any theory to support the study design or claims, measurements may lack meaning.
• Not thoroughly considering all aspects of research design and articulating the research design.
– Statistics not being used thoughtfully.
I4, when reflecting on PER's past and current use of statistics, said, "There are more statistics, but I don't know if that's made it a lot better off than when there were very few statistics." I3 warned that particular statistical methods that become popular may be treated as the only method:

There's this idea that certain kinds of statistical methods work better than others and that kind of thing. For example, right now in quantitative research you see HLM [hierarchical linear modeling] being thrown around as if it were some kind of magic amulet of you hold it up, ooh, HLM. It's not the statistical method, it's the analysis that's important. You do your data collection, you do the work, and sometimes hierarchical linear modeling works and it's what you're supposed to do, and other times it's better just to go with an ANOVA or a multiple regression.

I5 reiterated this point, emphasizing that "[statistics] are a tool and given your specific use, not all wrenches are the same."

One issue identified by two interviewees is that researchers in PER sometimes do not consider the meaning of their results in the particular context. I5 pointed out that blanket criteria that are applied to all studies do not work:

I feel lots of people are using methods from the shelf, and they're believing in just some criteria. So it's almost like people believe that if reliability is not point seven, it's a shitty instrument, and then once it's point seven, it's a super instrument. And that's just not the case... It depends, right?

I6 pointed out that when interpreting statistics, a result can be statistically significant but not necessarily meaningful:

So, is a difference of half a point on an assessment for 10,000 students, is that really significant or not? The statistics might tell you it is, and you look at it and you say, "naaahhhh, no. It really is not."
I think sometimes people just throw the statistics out there and don't think enough about what it's really telling them, or not telling them.

During the interviews, the idea of minimum criteria or standards for reporting data in quantitative manuscripts came up. The interviewees were not entirely opposed to minimum standards. I5 stated that "what we are lacking is standards in what we would expect from a paper on the reliability and validity of instruments, on certain procedures, it seems like there's different standards." I3 believed this was important for PER as a field:

Emerson once said many years ago, he says, "Look, it is not the direction that you're going in any given moment but the general trend that you go in over time that is the most important thing to consider about one's life." That is in a sense what we're trying to do about research. It is not what the individual finding of a particular paper might be but the general trend of the field over time that we happen to go in. I think that if we identify what's clear in terms of good methodology for these types of empirical/experimental studies, that we will have a better chance of finding these kinds of trends that lead us to the kinds of conclusions that help us improve physics education for the long term.

However, similar to the focus group, the interviewees saw this as a nuanced issue with potential challenges if standards were imposed. Although not opposed to minimum standards, I2 expressed concern regarding reporting one's sample:

I think there needs to be a balance of how much we need to say about the population. Because there could be some ethical problems going ahead. That being said, I think it's important and I personally strive to try and say, 'Okay, so what is this kind of population that we have here? And how would it be different from other places?'

One of the ethical issues noted by I2 is that too many demographic details might de-anonymize the sample. I5 also
I5 also pointed out that because PER is a developing field, whatever standards might be created will shift with time as knowledge bases grow.

One aspect that a few interviewees brought up was the culture (i.e., beliefs, practices, etc.) around quantitative research in PER. I2 suggested that many in PER have a narrow definition of what quantitative research is:

When you say quantitative methods the thing that a PER person thinks is, the Force Concept Inventory, or the CLASS. They don't think ... I'm using statistics and counting things. The very specific realm of what it means in PER, which is not actually what it means in other fields.

I6 observed that some individuals may not use quantitative methods, possibly due to a misunderstanding of what quantitative work is: "I've seen people that will not apply quantitative methods when maybe they should because they think their population has to be super duper huge in order to do anything meaningful, which again, is I think a lack of knowledge about things beyond mean and standard deviation, and Z scores." I4 observed that there are few replicability studies, which they feel can be problematic if a single study is used to support a claim. I1 noted that the way articles are written in education fields differs from PER, perhaps as a result of how traditional science papers are written:

"[Articles in journals such as Journal of Research in Science Teaching (JRST)] always, like, have the million-page introduction to every paper. And that style is pretty different than PRPER's, even today... Because traditional science papers are often written in that very staccato style, where if you read a Science or a Nature or a PRL article in a traditional science discipline, right? They'll just say, 'Here's a two-paragraph summary with 50 citations that, if you want to understand this topic, here's the 50 things.' Whereas JRST will say, 'Well let's talk about what this all means.' And it's partly, I think, intended to be a little more self-contained.
I come from the science background where I'm perfectly happy to give a short introduction, but I've learned that this is sort of a stylistic thing and a rhetorical flourish. But it does mean that it's hard to do atheoretical work and publish it in JRST, or Science Ed, or similar journals."

While some of these cultural differences may be actual problems (e.g., not using quantitative methods due to lack of understanding), other differences, such as the differences in introductions, are areas where it may be worthwhile to think through the costs and benefits. Considering the cultural aspects regarding quantitative PER may be useful for considering how any identified issues could be changed.
1. Concerns regarding Phase 2
A few interviewees brought up some potential issues with our plan for Phase 2. I1 pointed out that American Journal of Physics (AJP), The Physics Teacher (TPT), and Physical Review-PER (PRPER) all have different audiences, and thus authors would present their work differently. The rationale will be described in more detail in Phase 2, but AJP and TPT were traditionally the venues for PER prior to PRPER. I1 also brought up a subtle point: whether we would check to see if claims are consistent with the data that are presented, or whether we would check to see if they answered the research questions or goals. We opted to do the former. Although answering the broader research questions is important, we are more interested in more fundamental research practices, which our focus group and interviewees suggested were where the biggest issues are.

The other concern was that the project plan relied on assessments that may have issues. I4 mentioned that some of the concept inventories on our list may not be validated or reliable:

I think at the FCI, and I think of validation and reliability arguments and there's just none, right? Like, when they developed it, they just you know, I think they did a pretty amazing job for like what they were doing at the time, but if you compare that to the work even done on the BEMA, or at least the published work done on the BEMA, when it was developed, I thought that was like a much more thorough and well-validated instrument.

I4 argued that better designed instruments are much newer, so the initial time period we would be studying would inevitably consist of manuscripts that contain the issues in our plan. While validity and reliability are fair concerns if we were doing a meta-analysis to cull together findings, our interest is in how researchers reported on data and used data to make claims.
E. Advice from the focus group and interviews
Our focus group participants and interviewees provided a wide range of advice to improve quantitative research in PER. This includes:

• PER should support community-built software resources and tools. The statistical programming language R was mentioned by three individuals as the quantitative analysis tool PER should use. They were enthusiastic that R would be an ideal tool to use. They also suggested that the R code used for analysis be shared in some fashion, either in a repository or in an appendix to the manuscript. I5 thought this would serve not only as a means of checking for accuracy but also as a learning tool, suggesting that "maybe [submitting] commented snippets, so people would understand why did that person do it that particular way."

The authors of this manuscript do not necessarily advocate specifically for R, but they believe the spirit of the sentiment, community-supported tools and sharing code or resources, is worthwhile for PER to consider. We anticipate that a broader conversation regarding the suitability of tools for different datasets may be useful for PER.

• Provide enough information that others can check the work.
As readers and reviewers, two individuals were interested in having enough information to determine whether the statistical test was used correctly. I6 has observed instances where they cannot tell whether the researcher was able to use parametric statistics without violating assumptions. F3 said that they sometimes check a manuscript's statistics and advocated for reporting enough information that others can do so:

As a reviewer, occasionally I'll do by hand someone else's t-test to make sure that it comes out. I at least eyeball it... So occasionally, I'll calculate the F to see like "oh, they're saying this is F, but am I nuts, or is it for those mean squares, not seem like it works." ... I think at a minimum, that sufficient statistics to do that has to be reported. That's been APA that have been demanding that for years in order for anything to be published.

F3 thought this information does not necessarily need to be in the manuscript's body but could be in an appendix.

• Consider sample size with study design.
I3 noted that early publications in education used small samples, such as a single classroom in one school, and later found out that their findings were not applicable. They noted that this could happen to PER. I3's advice regarding sample size and study design is to do one of the following:

[My father] goes, "You know, what's interesting is that within crop science they have a very sort of simple rule about how to approach research studies, and the rule is this: multiple sites, single treatment; single site, multiple treatments." That's how that works. If you want to produce a research study that you want to do a single treatment with, then you need to do that research study in multiple sites. The other part of that would be if you want to do a single site, that means you want to go to a single school and do your study, then you need to do multiple treatments. That means you need to do various classrooms, and if there's only one or two classrooms in that school, then you need to do the same treatment in successive academic years.

However, undertaking any of these suggestions might be challenging, particularly if it involves sharing code and providing enough data that one's work can be checked. Researchers may feel vulnerable, as I1 suggested:

Yeah, I mean I think it's cultural inertia primarily, right? You know, learning something new is hard and scary. It's also, I honestly think that people are a little scared sometimes to be wrong, right? And so if you put all of your dirty laundry out there, then people will find the dirt. And so that's a little scary for people - it's scary in a world where you think that you will be judged and then not respected for having made a mistake, as if none of us have ever made a mistake before.

There might be a need for a cultural change regarding research and our relationship to the tentativeness of science and being wrong, if PER is to adopt any of the advised practices.
IV. PHASE 2: ANALYZING PEER-REVIEWED MANUSCRIPTS

A. Research questions
Using the data from Phase 1, we designed Phase 2 to study peer-reviewed manuscripts that use the following assessments: the Force Concept Inventory (FCI), Conceptual Survey of Electricity and Magnetism (CSEM), Brief Electricity and Magnetism Assessment (BEMA), and Colorado Learning Attitudes about Science Survey (CLASS). Although quantitative PER encompasses more than these assessments, or assessments period, we opted to focus on these because they are commonly used by many different researchers. Thus, we can look at an important part of PER that reflects much of the research community. Our research questions are as follows:

1. What information on samples have authors reported in peer-reviewed manuscripts that use the FCI, CSEM, BEMA, and CLASS?

2. Have authors reported on limitations in these manuscripts?

3. Are claims in these manuscripts supported by data?

4. Do questions 1-3 differ by time period in PER?

Based on the focus group's comments, our underlying hypothesis is that more recent research tends to report on the sample in more detail, include limitations, and make claims that are supported by data.

We note these issues have also been noted in other work (e.g., [6]). The work most similar to ours is a recent paper by Kanim and Cid (2017) that asks which students are studied in PER and whether this is a representative sample in relation to population demographics. While they also examined manuscripts within PER journals, we note that their manuscript and ours differ on several key points. Namely, we looked for the presence of information such as a sample description, as the focus group and interviewees suggested quantitative work in PER may lack these descriptions. We also focused on a narrower set of papers; they drew upon papers from the 1970s through 2015 and included a wide variety of papers [24], while we focused on specific assessments from the 1990s through 2017. Thus, our papers may have similarities but are distinct in their goals and methods.

B. Methodology
1. Journal selection
For this study, we used manuscripts from AJP, TPT, and PRPER. For ease of discussion, we are using PRPER to refer to both Physical Review-PER and Physical Review Special Topics-PER (PRSTPER), the original name for PRPER. We considered JRST but found few articles that used the assessments of interest.

We selected these journals because they are intended for researchers in PER and they are peer-reviewed. Our interest in peer-reviewed journals is because this study was to examine PER's quantitative work, and peer reviewers are part of the research system. Although authors bear the responsibility of creating high quality manuscripts, editors and peer reviewers share this responsibility by ensuring the manuscripts are suitable for publication.

Although PRPER was established because of an identified need for a journal devoted to researchers in PER [25-27], AJP and TPT historically published important manuscripts. For example, the FCI, the oldest assessment we considered, was introduced in 1992 in TPT [30]. The FMCE was introduced in 1998 in AJP [31]. Both instruments are still used today by researchers. AJP and TPT are not opposed to research but emphasize that presented research must have practical applications due to their readership being practitioners.
2. Manuscript and assessment selection
We initially envisioned this project drawing upon manuscripts that use the following assessments: FCI, Force and Motion Conceptual Evaluation (FMCE), CLASS, Colorado Learning Attitudes about Science Survey for Experimental Physics (ECLASS), Maryland Physics Expectations Survey (MPEX), CSEM, and BEMA. These seven were selected to cover a range of assessments. We also believed these were frequently used. Frequency of use was important to ensure our sample contained enough manuscripts for our study. Below, we briefly describe each instrument, in chronological order by the first publication on the instrument:

• FCI, 1992. A multiple choice assessment designed to measure how well students conceptually understand the Newtonian mechanics covered in introductory physics courses [30].

• FMCE, 1998. A multiple choice assessment designed to measure student understanding of Newtonian mechanics in introductory physics [31]. The FCI and FMCE cover topics to slightly different extents, and the FCI uses pictorial representations for some questions while the FMCE uses graphs for some questions [32].

• MPEX, 1998. A multiple choice assessment designed to measure student beliefs and attitudes towards physics, including how they approach studying in the course and whether they think physics is relevant to them [33]. Respondents select how much they agree with a statement on a five-point Likert scale (strongly disagree to strongly agree) [33].

• CSEM, 2001. A multiple choice assessment designed to measure student understanding of introductory electricity and magnetism [34]. The authors use both graphical and pictorial representations in this concept inventory [34].

• CLASS, 2006. Designed to study students' beliefs about physics and learning physics, such as whether learning physics has any usefulness to them in their lives [35]. Respondents rate statements on a five-point Likert scale (strongly disagree to strongly agree) [35].

• BEMA, 2006. A multiple choice assessment designed to measure student understanding in traditional calculus-based electricity and magnetism (E&M) as well as in E&M courses that use the Matter and Interactions II: Electric and Magnetic Interactions curriculum [36]. Similar to the other concept inventories on this list, the emphasis is on conceptual understanding rather than mathematical calculations [36].

• ECLASS, 2012. Designed to study changes in student attitudes towards laboratory practices before and after a lab course [37]. ECLASS uses a five-point Likert scale (strongly disagree to strongly agree). Items on this assessment ask respondents to consider how they feel about different statements and ask respondents how a hypothetical physicist might answer [38].

We looked at each assessment on PhysPort, a website that, among other services, has a comprehensive collection of assessments [28]. Only ECLASS has built-in demographic questions, both for the individual student respondent and for course-level information (e.g., institution name, course name). Because ECLASS had built-in demographic questions and was fairly recent, we decided to exclude manuscripts that used it for uniformity. On PhysPort, each assessment has an implementation guide that describes the purpose of each assessment. The assessments link to a best practices guide that mentions studies that have examined demographic differences (e.g., gender, race/ethnicity) but does not suggest that those data be collected [29].

To ensure the number of manuscripts studied was manageable, we decided to include manuscripts from the first five years of the instrument's introduction and the most recent five years to reflect contemporary practice. For example, we included manuscripts that used the FCI that were published from 1992 to 1997 as well as manuscripts from 2012 through 2017. We excluded any meta-analyses, because their reporting capabilities are reliant on the original studies, as well as studies that only mentioned the assessment in their literature review.

Our searches in the journals were based on the history of publishing in PER.
We searched for manuscripts in AJP and TPT if the instrument was introduced prior to 2010. Although PRPER began in 2005, fewer than 10 articles were published in 2005 and approximately 25 were published in 2009 [27]; we suspected that researchers may not have initially been aware of PRPER or still saw AJP or TPT as the best options to publish PER work. If the instrument's first five years coincided with 2005, the debut of PRPER, we searched in PRPER as well. For example, we looked for manuscripts that used the CSEM (introduced in 2001 [34]) in AJP, TPT, and PRPER because the first five years ranged from 2001-2006. We only looked in PRPER for recent papers, the 2012-2017 timespan, because 388 papers had been published there [39]. This suggested that PRPER would provide an adequate number of articles to study in the more recent timespan.

Manuscripts were further examined by Aiken and Knaub to determine to what extent the assessment was used. Our list was further refined to include only manuscripts that focused on the assessment data; some manuscripts used assessments in the periphery, only discussing them in a few sentences and focusing on other research areas. We found 84 manuscripts that met our criteria. Thirteen used more than one of the assessments of interest.

Table I displays the range of the first five years of each instrument, as well as the number of articles we found for the first five years and the most recent five years for the FCI, FMCE, MPEX, CSEM, BEMA, and CLASS. We note that the total is greater than 84 because some manuscripts used more than one assessment. We only counted the use of an assessment if it coincided with the first five years of the assessment's existence. For example, a manuscript published in the early 2000s might use both the FCI and CSEM.
We would only count this as a CSEM paper, because the FCI would have existed for a decade.

We included only assessments that appeared in 10 or more manuscripts, leaving the FCI, CSEM, BEMA, and CLASS. Four included manuscripts used the FMCE or MPEX in addition to one of the four assessments included in this study. This left us with 72 manuscripts. Seven are from AJP, sixty-two are from PRPER, and three are from TPT. These manuscripts have 176 researchers as lead or co-authors. Twenty-four (33.3%) of these manuscripts studied non-US populations (e.g., students from Canada, students from China). The list of manuscripts we coded can be found online [40].
3. Manuscript coding scheme and methods
A priori codes were developed based on the focus group, the interviews, and the researchers' experiences with reporting sample descriptions, limitations, and conclusions. Codes delved into population context descriptions, sample descriptions, discussion of limitations, and how findings were used to support conclusions. Coding was done to look for the presence of these features, not to make judgments about how well these features were described.

Aiken and Knaub initially coded 15 manuscripts independently, reading each manuscript to find the particular information. They used the a priori codes and noted any emergent codes. They met to discuss emergent codes, difficulties in interpreting the codes, and discrepancies between their coding. The coding scheme was modified. Using the new coding scheme, they independently coded all manuscripts, including the 15 they previously coded. Their coding was then combined. When discrepancies in coding occurred, the manuscripts were rechecked.

Table II displays the final coding scheme. Coding erred on the side of counting information as present, even if the information was difficult to find. For example, a manuscript by authors at one institution might say "our students" when discussing the data and make references that they were the instructors, but not name the institution. This would be coded as having named the institution, even though it required more careful reading than manuscripts that named the institution in the body. We read similarly for the codes under limitations and conclusions.

The codes reflect the potential issues in quantitative PER work that our focus group and interviewees noted. We note that some codes are more specific (e.g., Institution name) than others (e.g., Population demographics). We were more specific for some codes because the information would be more readily available and standard, such as Institution name, Course grade, and N. The less specific codes were a compromise of noting that authors provided some description without applying standards that may be harmful to their sample (e.g., gender information may inadvertently reveal respondents), an issue noted in Phase 1 of this study.
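The double-coding and discrepancy-checking process described above can be quantified. As an illustrative sketch only (the study reports rechecking discrepancies, not an agreement statistic), Cohen's kappa is one common way to summarize agreement between two independent coders on a presence/absence code; the judgments below are hypothetical, not the study's data.

```python
# Illustrative sketch: Cohen's kappa for two coders' presence/absence
# judgments of one feature across 15 manuscripts. All values are
# hypothetical; the study itself reconciled discrepancies by rechecking.
coder_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
coder_b = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

n = len(coder_a)
# Proportion of manuscripts where the coders agree outright.
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Agreement expected by chance, from each coder's marginal rates.
p_a = sum(coder_a) / n
p_b = sum(coder_b) / n
expected = p_a * p_b + (1 - p_a) * (1 - p_b)

# Kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

Values of kappa near 1 indicate agreement well beyond chance; values near 0 indicate agreement no better than chance.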
4. Limitations
Phase 2 has several limitations regarding the scope of this overall project. The primary limitation is that we focused on a handful of assessments. They do not encompass the entire body of quantitative research in PER. Perhaps non-assessment quantitative work or different assessments would yield different results. Despite this limitation, focusing on these assessments had some advantages. Pragmatically, we were able to set a boundary around a manageable set of manuscripts and could thoroughly examine them. Although these assessments focus on different aspects of physics education, they are similar in that they do not have built-in demographics questions. This eliminated some variability.

A second major limitation is that we drew upon three distinct journals that have different purposes. The target audiences are different, so the inclusion or exclusion of particular information (e.g., limitations) may not feel as relevant if a manuscript is for practitioners. However, PRPER only
came into existence in recent times. Drawing upon AJP and TPT is the best option to determine whether PER has improved within our scope.

Lastly, we coded manuscripts for the existence of these particular features. We cannot make any claims about the quality of sample reporting or whether manuscripts did an adequate job of addressing limitations and discussing conclusions. Given the wide range of contextual situations regarding samples and manuscript research questions and goals, we felt that looking for the existence of these features was adequate for the goals of this manuscript. We suggest that the quality of these manuscript features be future work for interested researchers.

TABLE I. Description of assessments considered for Phase 2

Assessment  Range of     No. of articles     No. of articles           Total no. of articles
            first 5 yrs  from first 5 years  from most recent 5 years  using assessment
FCI         1992-1997     3                  29                        32
FMCE        1998-2003     2                   4                         6
MPEX        1998-2003     3                   5                         8
CSEM        2001-2006     5                   9                        14
BEMA        2006-2011     8                   4                        12
CLASS       2006-2011    11                  15                        26
C. Results: information reported on institutional/course context and sample
Figure 1 displays the percentage of manuscripts that were found to contain different features. The only features that all manuscripts in this sample contained were N and the course description. Many (47, or 65.3%) reported the institution name, and half of the manuscripts reported a response rate. It is unclear why response rate so often went unreported. The manuscripts in this study had well-defined samples (e.g., a particular course or several courses), suggesting that the authors knew the total number of possible respondents.
1. Time period in PER
The data were further divided as "early" and "recent." "Early" manuscripts were published before 2012, while "recent" manuscripts were published from 2012-2017 (the most recent completed five years). Twenty-five manuscripts are in the "early" category; forty-seven are in the "recent" category. Figure 2 displays these data. All data are presented as percentages of "early" or "recent." The features show an assortment of differences. Some features, such as Response rate, show an increase between "early" and "recent," while other codes, such as Population demographics and Instructor/section, show a decrease. Institution name, Year in schooling, and Sample demographics had similar percentages of reporting in both "early" and "recent."

We ran Fisher's Exact test on each feature to determine whether there is any statistically significant relationship between each feature and the early/recent category. Fisher's Exact test is similar to the chi-square (χ²) test and is generally preferred for small sample sizes. Table III displays the χ² results, the p-value for Fisher's exact test, and the effect size (φ). Because Course description and N were reported on by all manuscripts, we did not include them in the table.

Almost all features have small effect sizes (< |0.2|, per traditional convention) and were not statistically significant. In other words, manuscripts in the "early" period are not very different from manuscripts in the "recent" period, on the whole, for the majority of features in the institutional/course context and sample categories.

The only feature with a statistically significant (p < 0.05) result is Response rate, which suggests a relationship between this feature and the early/recent variable (i.e., the result is less likely due to chance). The percentage of manuscripts reporting a Response rate almost doubled in "recent" times, and the effect size is moderate (> 0.2). This means there is a modest difference between early and recent periods.
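The per-feature test described above can be sketched in code. This is a minimal illustration, not the study's analysis: the 2x2 counts below are hypothetical placeholders for how many "early" (of 25) and "recent" (of 47) manuscripts reported a feature, and φ is derived from the uncorrected chi-square statistic as is conventional for 2x2 tables.

```python
# Sketch of a per-feature test: Fisher's exact p-value plus a phi
# effect size for a 2x2 (early/recent x reported/not reported) table.
# Counts are hypothetical, chosen only to illustrate the computation.
import math
from scipy.stats import fisher_exact, chi2_contingency

# Rows: early, recent; columns: feature reported, feature not reported.
table = [[9, 16],   # hypothetical: 9 of 25 "early" manuscripts report it
         [33, 14]]  # hypothetical: 33 of 47 "recent" manuscripts report it

# Exact p-value for the 2x2 table.
odds_ratio, p_value = fisher_exact(table)

# Phi effect size from the (uncorrected) chi-square statistic.
chi2, _, _, _ = chi2_contingency(table, correction=False)
n = sum(sum(row) for row in table)
phi = math.sqrt(chi2 / n)

print(f"p = {p_value:.3f}, phi = {phi:.2f}")
```

With these illustrative counts, the p-value falls below 0.05 and φ exceeds 0.2, the pattern described for Response rate; different counts would of course give different results.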
D. Results: reported limitations
Figure 3 displays the percentage of manuscripts that reported any limitations, reported sample limitations, reported statistical or study design limitations, and attempted to address or overcome the limitations. We note that there may be no good way to overcome limitations, but we were interested because one of our interviewees suggested that authors do not often attempt to overcome study limitations.

While most manuscripts (approximately 90%) in this study reported limitations, a few did not. Slightly over half (37, or 51.4%) reported on both sample and statistical or study design limitations. Fewer than half of the manuscripts noted some kind of attempt to overcome the study's limitations.
1. Time period in PER
The data were analyzed to determine whether manuscripts written in earlier times were different from those written more recently. Similar to section IV C 1, we ran Fisher's exact test to see if the differences were statistically significant. These results are displayed in Figure 4 and in Table IV. All manuscripts in recent times note at least one limitation.

TABLE II. Coding scheme for Phase 2

Institution/course context
  Location: Where the institution is located. Includes general (e.g., Midwest) and specific (e.g., Boston, MA).
  Institution name: The name of the institution (e.g., Michigan State).
  Population demographics: Any description of the population studied (e.g., gender of students in the course, percentage of international students in the institution, etc.).
  Course description: Any description of the course (e.g., taught using active learning, introductory physics).

Sample
  N: The authors reported the total number of responses.
  Response rate: The authors reported how many responses they received relative to the number of potential respondents. We counted both manuscripts that explicitly had a response rate and those that provided enough data to find one (e.g., including both the number of responses and the total number of students in a class).
  Sample demographics: Any description of the sample (e.g., race, gender, majors).
  Background in physics: Any description of the sample's prior experiences in physics (e.g., participants had taken at least one physics course).
  Course grade: Any description of the course grades of the students in the sample (e.g., letter grades, numeric grades).
  Year in schooling: Any description of the year in schooling of the sample (e.g., sophomores, recent graduates).
  Instructor/section: Any description regarding who taught the course (e.g., teaching background of the instructor, number of instructors) or how many sections are in the sample.

Limitations
  Sample and population limitations: Any acknowledgement that the sample and/or population may impact the results such that generalizability or applicability may be hindered (e.g., collecting from one institution, the sample only has students who passed the course).
  Statistical and study design limitations: Any acknowledgement that the statistics and/or the study design impact the results such that generalizability or applicability may be hindered (e.g., causal claims cannot be made, noting lack of a comparison group).
  Attempt to overcome limitations: The use of any technique done to mitigate a limitation (e.g., using paired data or imputation methods for incomplete data sets).

Conclusions
  Data are used to support conclusions: Conclusions are made referring to the data.
  Conclusions do not overgeneralize or overstate: Conclusions acknowledge that the results may not be universal (e.g., conclusions acknowledge limitations, concluding statements refer to the study and the sample) and present claims tentatively (e.g., authors do not make absolute statements).
FIG. 1. Percentage of manuscripts that had specific population and sample codes

FIG. 2. Percentage of manuscripts that had specific population and sample codes, disaggregated by recent (2012-2017) and early (1992-2011)
Reporting limitations on samples, limitations in statistics or study design, and attempts to overcome limitations have all increased from earlier to recent times. These are all found to be statistically significant and have moderate effect sizes. This suggests these results may not be due to chance and that the differences are somewhat substantial.
E. Results: reporting on conclusions
Lastly, we looked at whether the conclusions referred to the data in the study and whether the conclusions overgeneralized or overstated findings. These findings are in Figure 5. All manuscripts in this study made use of the study's data to support any conclusions. Most manuscripts did not overgeneralize or overstate their conclusions.
FIG. 3. Percentage of manuscripts that had reported on limitations
FIG. 4. Percentage of manuscripts that had reported on limitations, disaggregated by recent (2012-2017) and early (1992-2011)
FIG. 5. Percentage of manuscripts that used data in the conclusions and that did not overgeneralize or overstate
1. Time period in PER
We examined the issue of overgeneralization to see if recent manuscripts in this study differed from earlier manuscripts. These results are in Figure 6. Manuscripts in more recent times tended not to overgeneralize or exaggerate their conclusions, though some (12.8%) still do. This result
is statistically significant, though the effect is somewhat small (χ² = 5.34, p = 0.032, φ = 0.272). This suggests that these results may not have been due to chance and that the difference is small.

TABLE III. Fisher's Exact test results on institutional/course and sample features

Feature                                   χ²    P-value    φ
Institution name
Location
Population demographics
Response rate *
Sample demographics
Background in physics
Year in schooling
Grade in course
Instructor/section

* = p < 0.05

TABLE IV. Fisher's Exact tests on limitation features

Feature                                   χ²    P-value    φ
Any limitations reported *
Statistical or study design limitations *
Sample limitations *
Attempts to overcome limitations *

* = p < 0.05

V. DISCUSSION AND CONCLUSIONS

A. Phase 1 discussion
This two-phase study's goals were to find out what the pressing issues in quantitative PER are and to determine how pervasive these issues are. During Phase 1, we ran a focus group of experts in quantitative PER who were identified by editorial members of PER publications. The focus group's main concerns were manuscripts having poor sample descriptions, not reporting limitations well, and overgeneralizing/overstating conclusions. They pointed out that much of the quantitative work in PER is focused on assessments. The focus group also believed that these issues were resolving themselves as PER matured.

The results from this phase were particularly interesting because the issues mentioned are fundamental aspects of research, regardless of whether it is quantitative. Although these issues are fundamental aspects of research, resolving them may be complicated and non-prescriptive. As our focus group and interviewees noted, there are legitimate reasons not to report overly detailed descriptions of samples. Limitations and generalizability will vary from study to study. Some of this may depend on the author's intended audience; for example, authors who write for practitioners may fear that too much space devoted to discussing limitations will deter practitioners from reading their manuscript. Still, there is some general sense that manuscripts should include these aspects.
B. Phase 2 discussion
After the focus group, we created a project plan and interviewed additional experts to provide feedback. The interviewees mostly agreed with the plan, though they expressed some caution about applying any universal standard to reporting on some of these issues (e.g., demographic data on the sample). This sentiment was also expressed by the focus group. During Phase 2, we examined manuscripts that use the FCI, CSEM, BEMA, and CLASS from AJP, PRPER, and TPT, all peer-reviewed PER journals. We were interested in what is reported on samples and limitations, as well as how conclusions are articulated. We were also interested in whether there were any differences between the early (1992-2011) and recent (2012-2017) publication periods for these assessments. Our general hypothesis, based on the focus group, was that these manuscript features are more common in recent times.

The manuscripts in our study used a variety of ways to describe the institutional/course context and the sample. Although there are differences between early and recent times, most were not found to be statistically significant via Fisher’s Exact test. This suggests that reporting on these features has not changed much between the early days of these assessments and the present. The one feature found to be statistically significant, Response rate, had a small effect size. We note that the four assessments in this study do not have built-in questions regarding demographic information or institutional/course context. Future work is needed to see whether built-in questions have any effect on sample or institutional/course context descriptions.

We urge caution in judging manuscripts that report these features, and in turn the comparisons between “early” and “recent,” as “good” or “bad.” We note that a third of the manuscripts studied individuals from non-US institutions; there may be different legal and ethical standards for reporting. Additionally, some features, such as Location or Institution, may not have a readily apparent purpose, making a judgment of “good” difficult. For example, we are unsure what is gained by knowing that the students were from the East Coast. Even knowing an institution name sometimes does not help if one is not familiar with the institution. Perhaps there is, as described in Phase 1, an implicit theory that compels authors to include this information. We do not bring this issue up to shame authors who have done so; we, the authors of this manuscript, have also reported some of these features without much, if any, explanation.

FIG. 6. Percentage of manuscripts whose conclusions use data and whose conclusions do not overgeneralize or overstate, disaggregated by early (1992-2011) and recent (2012-2017) periods.

Still, while we respect that not all of these features can be included due to ethical obligations to research participants, we are unsure why some features are not reported. Some features should not violate ethical obligations and would be useful. For example, while reporting response rates has increased in more recent times, a little under 60% of the recent manuscripts in our study report a response rate. These studies had bounded systems (e.g., a course), meaning the total number of possible respondents is known. We also noticed that few studies indicated whether their sample was representative of the population of interest (e.g., were students with A’s overrepresented in the sample?). These features can help readers understand the limitations of the study and perhaps help researchers gain additional insights on their work.

The recent manuscripts in this study report more on limitations and more often mention ways in which the authors attempted to overcome limitations. Recent manuscripts are also less likely to overgeneralize or overstate in the conclusions. While these results are encouraging, we ponder whether most of these features should be present in all manuscripts. At the very least, no manuscript should overgeneralize or overstate its conclusions.
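The early-versus-recent comparisons above rest on Fisher’s Exact test with a φ coefficient as the effect size (e.g., χ² = 5.34, p = 0.032, φ = 0.272 for Response rate). As a minimal sketch of how both quantities can be computed for a 2×2 table, using only the Python standard library and illustrative counts rather than this study’s actual data, one might write:

```python
import math

def table_prob(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    return (math.comb(a + b, a) * math.comb(c + d, c)
            / math.comb(a + b + c + d, a + c))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact p-value: sum the probabilities of all
    tables sharing the observed margins that are no more likely than
    the observed table."""
    p_obs = table_prob(a, b, c, d)
    row1, row2, col1 = a + b, c + d, a + c
    p_total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_x = table_prob(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p_x <= p_obs * (1 + 1e-9):  # tolerance for float round-off
            p_total += p_x
    return p_total

def phi_coefficient(a, b, c, d):
    """Effect size for a 2x2 table; |phi| near 0.1 is usually read
    as a small association."""
    return (a * d - b * c) / math.sqrt(
        (a + b) * (c + d) * (a + c) * (b + d))

# Illustrative (hypothetical) counts, NOT the paper's data: rows are
# early/recent manuscripts, columns are feature reported / not reported.
early_yes, early_no = 12, 23
recent_yes, recent_no = 24, 13

p = fisher_exact_two_sided(early_yes, early_no, recent_yes, recent_no)
phi = phi_coefficient(early_yes, early_no, recent_yes, recent_no)
print(f"p = {p:.3f}, phi = {phi:.3f}")
```

In practice one would more likely call `scipy.stats.fisher_exact`, but the hand-rolled version makes explicit that the p-value conditions on the table margins, which is why the test suits the small per-feature counts that arise in a 72-manuscript sample.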
C. Conclusions
Within the context of this study, manuscripts that use the FCI, CSEM, BEMA, and CLASS have improved since the early days in that limitations are reported and conclusions do not overgeneralize. Sample reporting has not changed much, though we are cautious to suggest whether that is, overall, negative.

We emphasize that these results are limited to the context of this study and that readers should not interpret them to mean that reporting on samples, limitations, and conclusions is fine. As noted earlier, we erred on the side of counting these features as present even if they were only subtly indicated (e.g., writing that discusses “our students” counted as not overgeneralizing, as did referring to another paper that describes the study in better detail). Aiken and Knaub compared coding, and the manuscripts were re-examined when discrepancies occurred. These were quickly resolved, but we note that we read these papers for research purposes and were deliberately looking for these features. A typical reader, who is not conducting such a study, may not read a manuscript in this manner. In short, these features are present but may not be easily found or may be accidentally overlooked.

Similarly, we did not code for how well manuscripts reported any of these features; we coded only for whether these features were present. There are likely manuscripts in this study that do not cover the most important or relevant sample descriptions or limitations.

Despite these caveats, these results shed some light on quantitative PER. We see them as a step toward critically examining quantitative work in PER. To aid this work, we offer the following questions for PER to consider when writing or reviewing manuscripts. These are just questions, not necessarily with a “correct” answer.

• What information regarding the sample is useful for the audience?
  – What implicit messages could the included information send to the audience?
  – Does the included information need explanation so that the audience understands why it is important?
  – If the author does not include particular information regarding the sample, is the research weakened as a result?
• Are sample descriptions and limitations implicit? Should they be explicit?
• Which limitations are important for authors to acknowledge?
  – Is it adequate for the author to just acknowledge the limitations?
• How explicit should authors be that their conclusions do not overgeneralize?
• How strongly should authors make their claims?
• How much effort should the audience have to make in finding any of this information?
  – If sample descriptions and limitations are not carefully woven throughout the manuscript, what message might a reader receive?

The authors of this manuscript acknowledge that there is a great deal of nuance to how these questions could be answered, and there likely is no “one-size-fits-all” response to any of them. However, we do believe these questions are important for researchers, reviewers, and readers of PER to consider. This responsibility is not just for the researchers who create the studies and write the manuscripts but for all involved in the research enterprise: manuscripts are peer-reviewed, and research is used, cited, or applied.

As these studies are used to inform practice and policy that can affect many, it is important that studies do not misinform, even if the misinformation is inadvertent. Future work is needed to examine the quality of reporting on these issues, as well as to explore how readers engage with and understand research. We anticipate such work would help researchers understand how best to communicate their results. PER as a research community may benefit from an open conversation regarding what should be presented and why. Again, we do not have answers, but we believe that further research and exploring questions like the ones we propose, even in conversation, could lead to more robust research that ultimately leads to better physics education.
VI. ACKNOWLEDGEMENTS
The authors thank the editors and advisory boards of Physical Review Physics Education Research and the American Journal of Physics for their suggestions of potential interviewees/focus group members. The authors thank the focus group participants and interviewees. Chastity Aiken provided additional feedback and proofreading that was much appreciated. We also thank Michigan State University, The Ohio State University, and the University of Oslo. This study was partially funded by the Norwegian Agency for Quality Assurance in Education (NOKUT), which supports the Centre for Computing in Science Education at the University of Oslo.

Appendix A: Focus group protocol
Facilitator introduction: Thank you for joining us for this focus group. As we described in the email invitation, we are interested in your observations and opinions of statistical use in PER. You were all invited because you have substantial experience in this area.

The focus group will be recorded and transcribed. While we will de-identify your names and institutions in the data, we may use some quotes in the manuscript we will produce.

If you are okay with being recorded, please say yes. [Pause for the focus group to say yes.] If you did not say yes, we ask that you leave the call. [Pause for anyone who wishes to leave.] I will now turn on the recorder.
1. Please state your name, the institution where you work, and roughly how long you have been participating in PER.

2. When you think about statistical and quantitative work in PER in published work, what comes to mind?
   (a) What kind of mistakes do you often see being made?
   (b) What kind of issues do you often see that make it challenging to determine whether the methods were sound?
   (c) Which paper(s) do you feel are good examples of quantitative work? What particular aspects make them good examples?

3. We have a list of potential and/or common quantitative and statistical mistakes and misuses from the literature. [Shares screen with focus group participants.] Have you often seen these kinds of errors before in PER work?

4. Based on your comments, the most prevalent errors described in the group are [reads off list of errors]. Did we accurately summarize the discussion? If not, what changes should we make?
   (a) [If there are a lot of errors] Which mistakes and issues do you feel are the most important for us to focus on? Why?

5. What resources would be useful for the PER community so that they make better use of quantitative methods and statistical analysis?
   (a) [If the resource already exists] Would you please provide the URL/title/etc. of this resource?
   (b) [If the resource does not exist] Is there an example in another field that is similar to what you have in mind?

6. Is there anything else you feel is important for us to know?
[Facilitator closes focus group, thanks participants, and shuts off the recorder.]
Appendix B: Interview protocol
1. Would you please tell me roughly how long you have been participating in PER and a bit about your research background? [Keep this part brief.]

2. Recently, we ran a focus group to find out perspectives on quantitative methods in PER. Participants included individuals who were recommended by editorial staff for PR-PER and AJP. We were interested in what they felt were the important issues in quantitative and statistical work in PER in terms of mistakes, ambiguities, and misuses. Based on their feedback, we created the project plan we sent to you. We would like you to comment on it. Specifically, we are interested in:
   (a) Your feedback on this topic and whether we should be focused on different topics.
   (b) Whether the work plan answers important and relevant research questions for quantitative researchers.

3. [If interviewee does not think the work plan covers a good topic but has no suggestions of their own] We provided the focus group with an a priori list of technical issues in quantitative research, based on issues identified in other social sciences. Which of these issues do you think are most important for us to cover in this project? Can you give me some specific examples or reasons why you think these are the most important?

4. [If interviewee does not think the work plan covers a good topic AND has their own suggestions] Can you give me some specific examples or reasons why you think these are the most important?

5. Additionally, we are interested in collecting resources for quantitative researchers in PER. What resources (such as publications, programs, repositories, or other resources) would be useful for the PER community so that they make better use of quantitative methods and statistical analysis?
   (a) [If the resource already exists] Would you please provide the URL/title/etc. of this resource?
   (b) [If the resource does not exist] Is there an example in another field that is similar to what you have in mind?
   (c) Which paper(s) do you feel are good examples of quantitative work? What particular aspects make them good examples?

6. Is there anything else you feel is important for us to know?
Appendix C: Project plan for interviewees

(This was a document given to each interviewee to comment on)
1. Phase 1
We held a focus group of recommended quantitative researchers who are either quite familiar with PER or researchers in PER. We were interested in finding out what they felt were the primary issues in quantitative PER work. They noted that much of the quantitative work in PER is focused on assessment. Members of the focus group were mostly concerned with the following:

• The research contains little to no description of the sample (e.g., demographic information and institutional context).
• Researchers may try to generalize to all students/institutions when their sample only included selective, predominantly white institutions.
• Claims are not supported by the data.

The focus group also emphasized that these issues, among others, are improving. They pointed out that PER is a young field and believe that some of these issues are resolving themselves.
2. Phase 2
As we described in our Phys. Rev. PER manuscript proposal, we plan on looking through peer-reviewed articles to determine how pervasive the issues noted by the focus group are. Based on their comments, we are considering focusing on the following student assessments:

1. FCI
2. FMCE
3. CLASS
4. MPEX
5. ECLASS
6. CSEM
7. BEMA

Because the focus group emphasized that quantitative work is improving, we will test this hypothesis in the context of these three areas (sample descriptions, limitations, and data-supported claims).

• H1: Recent articles in Phys. Rev. PER tend to include sample descriptions, include limitations on the study, and make data-supported claims.
  – H1a: Articles written during the beginning of PER did not tend to include sample descriptions, limitations on the study, and data-supported claims.

We will examine articles from the most recent 5 years of Phys. Rev. PER (2012-2017). For older articles, we decided to use the creation of the Force Concept Inventory (FCI) as our starting point. The FCI was created in 1992. We will use a five-year span (1992-1997). Because Phys. Rev. PER did not exist then, we will use the American Journal of Physics (AJP) and The Physics Teacher (TPT). Early volumes of AJP and TPT had some research papers, including the FCI paper series.
3. Analysis