Reflection on my MRes Studies

There have been a lot of challenges since beginning my MRes course at the University of Portsmouth, even bearing in mind the advice given to me that I should make as many contingency plans as possible. However, what has been most difficult has been overcoming my own limitations in the research process. In this blog post I shall outline the nature of the challenges faced and overcome. It is not the case that this is some kind of quest; rather, given the circumstances, I vastly overestimated my own ability to carry out the kind of study that I wished to undertake. What has finally coalesced is, I believe, worthwhile research, but not quite the project that I had planned. Below, I outline my learning during the MRes course so far with reference to the Vitae Researcher Development Framework (RDF) (Careers Research and Advisory Centre Ltd., 2011) in bold parentheses.

The pond at Shinjuku Imperial Gardens, Tokyo in Spring 2019. Cherry blossoms are reflected in the pond.

My original proposal was for a quantitative study that relied upon an overly optimistic sample size of volunteer participants. This sample was drawn from a population at my new place of work. Because I was a new instructor in an intensive English programme, I had few free teaching periods available when my students did. Furthermore, I had not …

Reflections by a hermit on collaborative writing

I am fairly asocial. Not antisocial, but I tended, even before the current pandemic, not to go out very much. I geek out about SLA and teaching alone for the most part, but I do have chats with colleagues at work from time to time, and on Twitter.

I decided to work on a duoethnography with my colleague and former housemate Jon Steven last year as a way to try something a bit different from all of the quantitative work that I was reading about and basically immersing myself in as part of my MRes. I also noticed that collaborative work is part of the Vitae criteria that researchers are supposed to work toward (and as somebody looking toward getting a doctorate in the future, this includes me), and Jon wanted to work on more publications. This seemed like a really good opportunity.

It really was, but it was also tough. Jon and I both worked full-time hours as part-time/freelance teachers, me in universities and at an orthodontic clinic teaching ESP, him at high schools, companies and an international supplementary school. I changed jobs in the middle of the big bulk of the writing, so finding time to write was difficult. We are both parents, so after work and being present with our families, finding time to check citations and page numbers depended upon how tired we were and how long we could put those tasks off.

A key quote from Jon: “Are we still doing that? I thought it had been abandoned!”

Oh yes, it took me the best part of nine months to get down to checking two citations and making changes to a paragraph I had highlighted “Drastically rewrite or cut”. When I was ‘in the zone’ it felt frustrating that Jon wasn’t, though I am also sure the same was true of how Jon felt when I should have been redrafting, filling out or pruning text.

In the end, though, we ended up with a duoethnography that is, according to our reviewer, “rooted in the literature” and that we are both proud of, despite it being quite tiring to write at times. Despite being a bit of a challenge, it was quite fun to share our opinions and beliefs and explore them further in writing, boiling them down and distilling them, then egging each other on to explain ourselves further.

As with autoethnography being a bit ‘mesearch’, duoethnography can be a bit ‘wesearch’; that is pretty much the whole point. It could have got fairly navel-gazing if it were only about us, but I think we did a decent job within our word count of situating ourselves within a context and talking about how others might have similar feelings, anxieties and experiences to ours. That, to me, is the value of autoethnographic research methods to the literature. While I am quite keen on quantitative studies, having qualitative studies to explain the human, emotional side of what happens to us in language teaching is also important, and this also appears to be becoming a bigger part of my projects outside my MRes.

Given the right opportunity and circumstances, I would definitely write a duoethnography again. It was immensely rewarding to write it, and once I got myself sat down and prepared, even fun to revise it. I just wouldn’t ever consider writing one when in the midst of learning the ropes in a new job.


Notes on Construct Validity and Measurement in Applied Linguistics

This is intended primarily as a note for myself, and is very much a work in progress, but I thought that others might benefit. Also, if anyone commented, I would benefit. With the disclaimer out of the way, I will get to the point.

Basically, we have problems

In a pre-print, Flake and Fried (2019) make the point that measurement in psychology is very difficult to do in a valid way and, even worse, that validity is difficult to check because of underreporting of decision-making processes among the researchers involved. The reason this matters is that psychology and its sub-disciplines heavily influence applied linguistics/SLA.

While psychology attempts to work through its replication crisis, the main ways for it to do so seem to be pre-registered studies and greater transparency in reporting them. Flake and Fried (2019) choose to look at “Questionable Measurement Practices (QMPs)” as opposed to “Questionable Research Practices (QRPs)” (Banks et al., 2016; John, Loewenstein, & Prelec, 2012, in Flake & Fried, ibid.), such as HARKing (hypothesising after the results are known) (Kerr, 1998, in Flake & Fried, ibid.) and p-hacking (manipulating data or analyses until the p-value, the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true, falls below the significance threshold) (Head et al., 2015).

They go on to differentiate as follows:

“In the presence of QMPs, all four types of validity become difficult to evaluate… Statistical conclusion validity, which QRPs have largely focused on, captures whether conclusions from a statistical analysis are correct. It is difficult to evaluate when undisclosed measurement flexibility generates multiple comparisons in a statistical test, which could be exploited to obtain a desired result (i.e., QRPs). ”

(Flake & Fried, 2019, pp. 6-7)
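To make concrete why undisclosed measurement flexibility matters, here is a minimal simulation sketch (my own illustration, not from Flake and Fried). All the data are pure noise, so the null hypothesis is true by construction; trying several alternative measures of the same construct and reporting whichever one ‘works’ inflates the false-positive rate well beyond the nominal 5%.

```python
import math
import random

def p_value_two_sample(xs, ys):
    """Two-sided two-sample z-test p-value, assuming unit-variance normal data."""
    n = len(xs)
    z = (sum(xs) / n - sum(ys) / n) / math.sqrt(2.0 / n)
    # convert |z| to a two-sided p via the standard normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def false_positive_rate(n_measures, n_studies=2000, n=30, alpha=0.05, seed=1):
    """Fraction of null studies that find 'significance' on at least one of
    n_measures undisclosed alternative measures (simplified as independent)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # every measure is noise: no true group difference exists
        ps = []
        for _ in range(n_measures):
            xs = [rng.gauss(0, 1) for _ in range(n)]
            ys = [rng.gauss(0, 1) for _ in range(n)]
            ps.append(p_value_two_sample(xs, ys))
        if min(ps) < alpha:  # report whichever measure 'worked'
            hits += 1
    return hits / n_studies

print(round(false_positive_rate(1), 3))  # near the nominal 0.05
print(round(false_positive_rate(5), 3))  # near 1 - 0.95**5, roughly 0.23
```

Real alternative measures of one construct would be correlated, which dampens the inflation somewhat, but the direction of the problem is the same: without disclosure, readers cannot tell how many such forks existed.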

Flake and Fried (2019) state that many QMPs are not carried out deliberately; rather, a major problem is the lack of transparency about decisions made in the measurement process, which hampers not only replication but also the checking of validity.

They advocate answering the questions in a checklist (Flake & Fried, 2019, p. 9) to reduce the possibility of QMPs arising.

I am quite certain that a lot of applied linguistics students at master’s level and above have seen articles where statistics are reported but it is not clear why those particular statistics were chosen. Often these are blindly followed processes of running ANOVA or ANCOVA in SPSS. I will go out on a limb and say that these problems are ignored as being simply how things are usually done.

However, how many of us have considered our controlled variables? For example, when running studies on phonological perception, are we explicit about the ranges of volume, fundamental frequency and formant frequency, or about any processing for noise reduction? I have seen studies that make claims of generalisability, not just exploratory or preliminary studies, that do not control these. If you are going to make such claims, I think there should be greater controls than in a study that is primarily for oneself and shared only because it could be informative for others. Of course, declaring the decision-making process and rationale ought to be necessary in both.
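One low-effort way to be explicit is to declare the stimulus controls up front in a machine-readable form and check every stimulus against them. This sketch is hypothetical: the variable names and all the numeric ranges are placeholders, not recommendations.

```python
# Declared acoustic controls for perception stimuli.
# All ranges below are placeholder values for illustration only.
CONTROLS = {
    "peak_level_db": (60.0, 70.0),   # presentation level
    "f0_hz": (100.0, 250.0),         # fundamental frequency
    "f1_hz": (250.0, 900.0),         # first formant
}

def check_stimulus(measurements, controls=CONTROLS):
    """Return the names of any declared controls a stimulus violates
    (including controls with no measurement recorded at all)."""
    violations = []
    for name, (lo, hi) in controls.items():
        value = measurements.get(name)
        if value is None or not (lo <= value <= hi):
            violations.append(name)
    return violations

stim = {"peak_level_db": 65.0, "f0_hz": 310.0, "f1_hz": 500.0}
print(check_stimulus(stim))  # the out-of-range f0 is flagged
```

Even if the chosen ranges are debatable, publishing a table like `CONTROLS` alongside the stimuli lets readers evaluate, and replicators reuse, the measurement decisions.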

There is an awful lot of talk about how language acquisition studies in classrooms are problematic because individual differences are confounding. One way to increase validity and generalisability is to be explicit about the choices made regarding measurement and variables.


I took part in a Google Hangout hosted by Julia Strand. Some of the ideas discussed over the hour are bound to have wormed their way in and mingled with my own.


Flake, J. K., & Fried, E. I. (2019, January 17). Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them [Pre-print]. Retrieved January 20, 2019.

Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e1002106. doi:10.1371/journal.pbio.1002106. Retrieved February 1, 2019.

Rating learners’ pronunciation: how should it be done?

This goes into a bit more detail about phonetics than some people familiar with me might be comfortable with.

On Friday I went to Tokyo JALT’s monthly meeting (no link because I can’t find a permalink) to see three presentations on pronunciation (or, more accurately, phonology, seeing as Alastair Graham-Marr covered both productive and receptive (listening) skills). All three presenters, Kenichi Ohyama, Yukie Saito and Alastair Graham-Marr, were interesting, but one particular point stuck with me from Yukie Saito’s presentation.

She was talking about rating pronunciation and how it had often been carried out by ‘native speaker’ raters. She also said that it was often carried out according to rater intuition on Likert scales of either ‘fluency’ (usually operationalised as speed of speech), ‘intelligibility’ (usually meaning phonemic conformity to a target-community norm) or ‘comprehensibility’ (how easily raters understand speakers).

What else could work is a question that needs answering, not only to make work in applied linguistics more rigorous but also to make assessment of pronunciation less arbitrary. I have an idea: audio corpora could be gathered from speakers in target communities, the phonemes run through Praat, and typical acceptable ranges for formant frequencies taken. Learners would then be rated for comprehensibility by proficient speakers, ideally from the target community, and their speech also run through Praat to check that their phonemes fall within the acceptable formant ranges. All of this data would then be triangulated and a value assigned based on both.

Now, I fully acknowledge that there are some major drawbacks to this. Gathering an audio corpus is a massive pain. Running it all through Praat and gathering the data is even more so. To then do the same with learners for assessment makes things yet more taxing. However, is it really better to rely on rater hunches and hope that every rater generally agrees? I don’t think so, and the reason is that there is no construct that makes any of this less arbitrary, especially if assessment is done quickly. With the Praat data, there is at least some quantifiable evidence to show whether, for example, a learner-produced /l/ conforms to what is typically produced in the target community, and it would be triangulated with the rater data. It would also go some way to making sometimes baffling assessment methodologies a bit more transparent, at least to other researchers.
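The triangulation step could be as simple as a weighted blend of the two sources of evidence. Here is a sketch of what I have in mind; the function names, the equal weighting and all the numbers are my own illustrative assumptions, not real corpus values or a validated scoring model.

```python
def formant_conformity(measured, target_ranges):
    """Proportion of a learner's measured formants (e.g. extracted in Praat)
    that fall inside the target-community ranges from the audio corpus."""
    inside = sum(
        1 for name, value in measured.items()
        if name in target_ranges
        and target_ranges[name][0] <= value <= target_ranges[name][1]
    )
    return inside / len(measured)

def triangulated_score(rater_scores, measured, target_ranges, weight=0.5):
    """Blend mean rater comprehensibility (a 1-5 Likert scale rescaled to 0-1)
    with acoustic conformity; `weight` sets the balance between the two."""
    rater = (sum(rater_scores) / len(rater_scores) - 1) / 4  # 1-5 -> 0-1
    return weight * rater + (1 - weight) * formant_conformity(measured, target_ranges)

# Illustrative numbers only: hypothetical F1/F2 ranges for one vowel.
targets = {"F1": (300, 700), "F2": (900, 1400)}
learner = {"F1": 520, "F2": 1650}  # F2 falls outside the declared range
print(triangulated_score([4, 5, 4], learner, targets))
```

Whatever weighting is chosen, declaring it (together with the corpus-derived ranges) is itself the kind of transparency about measurement decisions argued for above.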