Working with me

Hello. Thank you for visiting my website.

I am a teacher in the Tokyo and Kanagawa area. I am interested in how we teach English, especially how we teach listening and pronunciation. If you are interested in working with me, contact me.

Notes on Construct Validity and Measurement in Applied Linguistics

This is intended primarily as a note for myself, and is very much a work in progress, but I thought that others might benefit. Also, if anyone commented, I would benefit. With the disclaimer out of the way, I will get to the point.

Basically, we have problems

In a pre-print, Flake & Fried (2019) make the point that measurement in psychology is very difficult to do in a valid way and, even worse, that its validity is difficult to check because researchers underreport their decision-making processes. This matters because psychology and its sub-disciplines heavily influence applied linguistics/SLA.

While psychology attempts to get through its replication crisis, the main ways for it to do so seem to be pre-registered studies and greater transparency in reporting them. Flake and Fried (2019) choose to look at “Questionable Measurement Practices (QMPs)” as opposed to “Questionable Research Practices” (QRPs) (Banks et al., 2016; John, Loewenstein, & Prelec, 2012 in Flake & Fried, ibid) such as HARKing (hypothesising after the results are known) (Kerr, 1998 in Flake & Fried, ibid) and p-hacking (trying different analyses or data selections until the p-value, the probability of obtaining results at least as extreme as those observed if the null hypothesis were true, drops below the significance threshold) (Head et al., 2015).

They go on to differentiate as follows:

“In the presence of QMPs, all four types of validity become difficult to evaluate… Statistical conclusion validity, which QRPs have largely focused on, captures whether conclusions from a statistical analysis are correct. It is difficult to evaluate when undisclosed measurement flexibility generates multiple comparisons in a statistical test, which could be exploited to obtain a desired result (i.e., QRPs).”

(Flake & Fried, 2019, pp. 6-7)

Flake and Fried (2019) state that many QMPs are not carried out deliberately, but that a major problem is the lack of transparency in decisions made during the measurement process, which hinders not only replication but also the checking of validity.

They advocate answering the questions in a checklist (Flake & Fried, 2019, p. 9) to reduce the possibility of QMPs arising.

I am quite certain that a lot of applied linguistics students at master's level and above have seen articles where statistics are reported but it is not clear why those particular statistics were chosen. Often these are the result of blindly following a process of running ANOVA or ANCOVA in SPSS. I will go out on a limb and say that these problems are ignored as being simply how things are usually done.

However, how many of us have considered our controlled variables? For example, when running studies on phonological perception, are we explicit about the ranges of volume, fundamental frequency and formant frequency? About processing for noise reduction? I know I’ve seen studies that do not control these yet make claims for generalizability, and these were not just exploratory or preliminary studies. If you are going to make these claims, I think there should be greater controls than in a study that is primarily for oneself and is shared because it could be informative for others. Of course, declaring the decision-making process and rationale ought to be necessary in both.

There’s an awful lot of talk about how language acquisition studies in classrooms are problematic due to individual differences being confounding. One way to increase validity and generalisability is to be explicit about the choices made regarding measurement and variables.


I took part in a Google Hangout hosted by Julia Strand. Some of the ideas discussed over the hour are bound to have wormed their way in and mingled with my own.


Flake, J. K., & Fried, E. I. (2019, January 17). Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them. Retrieved Jan 20th 2019 from .

Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e1002106. doi:10.1371/journal.pbio.1002106. Retrieved Feb 1st 2019.

Plans for 2019

Thanks for dropping by. I ended 2018 with a last gasp submission to New Sounds 2019, which links to my MRes research project on phoneme acquisition. Hopefully I get accepted but it looks a lot more scientific than anything I have been a part of so far. It will be good to get out of my comfort zone somewhat, though.

I also have a full-time job to start in April, which I am looking forward to very much. I will be teaching first-year university students, so I am looking at articles on the transition between high school and university. I also want my students to make the most of the new self-access centre that will open at the university, so I am looking at self-access and autonomous learning as well. I am particularly interested in learners’ autonomous L2 listening, so hopefully I shall gain some more useful insights into this.

Further on into the year, I should be collecting data over a 13-week period. This should conclude the bulk of the pre-writing of my MRes dissertation.

Other than that, I do not yet know which other classes I will teach as a part-time instructor, but I foresee making at least one corpus and doing some more work on essay writing, managing learner expectations and enabling learners to assess their own abilities more accurately.

New pre-print about corpus-informed teaching

I put up a new pre-print on SocArxiv:

Creating a small corpus to inform materials design in an ongoing English for Specialist Purposes (ESP) course for Orthodontists and Orthodontic Assistants

In my work as a language teacher to a group of orthodontists and orthodontic treatment assistants, I wanted an analysis of orthodontic practitioner-to-patient discourse. Because access to authentic spoken discourse was too difficult to attain due to ethical considerations, a small corpus was constructed in order to facilitate better informed form-focused instruction. Details of the typical forms found in the corpus are given, as is an overview of the corpus construction.

Rating learners’ pronunciation: how should it be done?

This goes into a bit more detail about phonetics than some people familiar with me might be comfortable with.

On Friday I went to Tokyo JALT’s monthly meeting (no link because I can’t find a permalink) to see three presentations on pronunciation (or, more accurately, phonology, seeing as Alastair Graham-Marr covered both productive skills and receptive, listening skills). All three presenters, Kenichi Ohyama, Yukie Saito and Alastair Graham-Marr, were interesting, but there was one particular point that stuck with me from Yukie Saito’s presentation.

She was talking about rating pronunciation and how it had often been carried out by ‘native speaker’ raters. She also said that it was often carried out according to rater intuition on Likert scales of either ‘fluency’ (usually operationalised as speed of speech), ‘intelligibility’ (usually meaning phonemic conformity to a target community norm) or ‘comprehensibility’ (how easily raters understand speakers).

What could work instead is a question that needs to be answered, not only to make work done in applied linguistics more rigorous but also to make assessment of pronunciation less arbitrary. I have an idea. Audio corpora could be gathered from speakers in target communities, the phonemes run through Praat, and typical acceptable ranges for formant frequencies taken. Learners’ speech would then be rated for comprehensibility by proficient speakers, ideally from the target community, and also run through Praat to check that the phonemes fall within the acceptable ranges for formants. This data would all then be triangulated and a value assigned based on both.

Now, I fully acknowledge that there are some major drawbacks to this. Gathering an audio corpus is a massive pain. Running it all through Praat and gathering the data is even more so. To then do the same with learners for assessment makes things yet more taxing. However, is it really better to rely on rater hunches and hope that every rater generally agrees? I don’t think so, because there is no construct that makes any of this less arbitrary, especially if assessment is done quickly. With the Praat data, there is at least something quantifiable to show whether, for example, a learner-produced /l/ conforms to that typically produced in the target community, and it would be triangulated with the rater data. It would also go some way to making the sometimes baffling assessment methodologies a bit more transparent, at least to other researchers.
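To make the triangulation idea a bit more concrete, here is a minimal sketch in Python. It rests on several assumptions that are not settled anywhere above: that the "acceptable range" for a formant is the mean plus or minus two standard deviations of the target-community values, that raters use a 9-point comprehensibility scale, and that the two sources are simply averaged with equal weight. The function names are hypothetical and the numbers are invented placeholders, not real measurements. In practice the formant values would come from Praat.

```python
from statistics import mean, stdev


def acceptable_range(reference_hz, k=2.0):
    """Return (low, high) as mean +/- k sample standard deviations.

    This definition of "typical" is an assumption made for the sketch.
    """
    centre = mean(reference_hz)
    spread = stdev(reference_hz)
    return (centre - k * spread, centre + k * spread)


def in_range(value_hz, bounds):
    """True if a formant value in Hz lies within the (low, high) bounds."""
    low, high = bounds
    return low <= value_hz <= high


def triangulate(learner_hz, reference_hz, rater_score, max_score=9):
    """Combine formant conformity with a rater's comprehensibility score.

    learner_hz:   dict of formant name -> learner value in Hz
    reference_hz: dict of formant name -> list of target-community values in Hz
    rater_score:  comprehensibility rating on an assumed scale up to max_score
    """
    checks = {
        name: in_range(value, acceptable_range(reference_hz[name]))
        for name, value in learner_hz.items()
    }
    conformity = sum(checks.values()) / len(checks)
    comprehensibility = rater_score / max_score
    return {
        "checks": checks,
        "conformity": conformity,
        "comprehensibility": comprehensibility,
        # Equal weighting is an assumption; other weightings are possible.
        "combined": (conformity + comprehensibility) / 2,
    }


# Placeholder numbers, invented for illustration only (not real measurements).
reference = {
    "F1": [300, 320, 340, 360, 380],
    "F2": [1100, 1150, 1200, 1250, 1300],
}
learner = {"F1": 350, "F2": 1500}

result = triangulate(learner, reference, rater_score=6)
print(result)
```

With these placeholder values the learner's F1 falls inside the reference range and the F2 falls outside it, so the acoustic check and the rater score can be reported side by side as well as combined. How the two should really be weighted is exactly the kind of decision that would need to be declared.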

New working paper

I put up a new working paper on SocArxiv today.

Jones, M. (2018, October 10). Exploring Difficulties Faced in Teaching Elective English Listening Courses at Japanese Universities.

This paper explores the problems encountered in teaching two elective English listening courses at Japanese universities in 2017 and 2018. Intended as a working paper for teaching professionals and those who support them, it details problems in working memory, motivation and general listening pedagogy.

Corpus Linguistics: Searching for affixes with R

One of my interests is corpus linguistics and creating corpora. However, I want to get better at analyzing my corpora more deeply.

As a project to help me learn the software/language R, I made a corpus analysis tool that gets the first 5 and last 5 characters of each word in a corpus, counts their occurrences and outputs the results in CSV files.

You’ll need to download R if you don’t have it.

The code I wrote is here.
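For anyone who does not use R, the general idea can be sketched in Python. This is not the R code itself, just a rough illustration under some assumptions: words are runs of the letters a to z after lowercasing, words shorter than five characters are counted whole, and the function names and output layout are made up for the example.

```python
import csv
import re
from collections import Counter


def affix_counts(text, n=5):
    """Count the first n and last n characters of each word in text.

    Tokenisation is an assumption: words are runs of a-z after lowercasing,
    so apostrophes, digits and accented letters split or drop tokens.
    Words shorter than n characters are counted whole.
    """
    words = re.findall(r"[a-z]+", text.lower())
    starts = Counter(word[:n] for word in words)
    ends = Counter(word[-n:] for word in words)
    return starts, ends


def write_counts(counter, path):
    """Write counts to a CSV file, most frequent first."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["characters", "count"])
        writer.writerows(counter.most_common())


# A tiny made-up sample standing in for a corpus.
sample = "Unhappy learners unhappily reported unhappiness"
starts, ends = affix_counts(sample)
print(starts.most_common(3))
print(ends.most_common(3))
```

Calling write_counts(starts, "starts.csv") and write_counts(ends, "ends.csv") would then produce the two CSV files (the file names here are arbitrary).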

Current professional development goals

With the start of my MRes at the University of Portsmouth, one of my main goals is to improve my data handling and data analysis skills. I have very rusty and rather limited skills in using Python, which I used to build and clean a corpus for English for Specific Purposes with the open source tools from Masaryk University NLP Centre & Lexical Computing (n.d.).
