3 – Big Data Theory
We statistically acquire language by exposure to big linguistic data.
With Communication Theory, we assume that when learners communicate meaningful messages, they acquire language. With Input Theory, we assume that when learners receive understandable linguistic messages, they acquire language. But if learners want to reach higher levels of ability, input and communication are not enough. They need big data.
Language learners face a big data problem, and this brings us to our Big Data Theory. It says that we statistically acquire language by exposure to big linguistic data. To see why, compare a native speaker with a second language learner. Taro is a typical Japanese student. By the end of high school, he has had 761 hours of English input/output in school. That’s about 100 days of English. Terry is a native speaker of English. By the age of 10, she has had 3,650 days of English. That is about 30,000 hours of English input/output. Taro received a lot of data, and that’s good. But Terry received a massive amount of data, and that’s necessary for advanced language acquisition.
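The hours-to-days conversion above can be checked with a little arithmetic. The figure of roughly eight waking hours of exposure per day is my assumption, back-derived from the chapter’s own numbers (it makes both the “about 100 days” and “about 30,000 hours” figures come out right):

```python
# Exposure comparison between Taro (EFL learner) and Terry (native speaker).
# HOURS_PER_DAY = 8 is an assumed figure for waking hours of exposure.
HOURS_PER_DAY = 8

taro_hours = 761                        # English in school, end of high school
taro_days = taro_hours / HOURS_PER_DAY
print(f"Taro: about {taro_days:.0f} full days of English")   # about 95

terry_days = 10 * 365                   # a native speaker by age 10
terry_hours = terry_days * HOURS_PER_DAY
print(f"Terry: about {terry_hours:,} hours of English")      # 29,200

print(f"Terry's exposure is roughly {terry_hours / taro_hours:.0f}x Taro's")
```

On these assumptions, Terry has received nearly forty times Taro’s total exposure, which is the gap the Big Data Theory points at.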
We can see the big data problem in research about extensive reading (ER). Carney (2016) studied the relationship between ER and TOEIC reading scores. And he concluded that “there is no clear relationship between the amount of extensive reading done and TOEIC score growth” (Carney 2016, p. 83). But there is a big data problem with Carney’s study. Only three learners in his study read over 300,000 words.
The word count of 300,000 is important. It may be the beginning of big data reading (see Nishizawa 2012). What happened to the three students in Carney’s study who actually read over 300,000 words? All three increased their scores far more than the average. This reminds us that ER research needs to look at students who read more than 300,000 words. But most of all, it reminds us of an important fact: big data is the key to big improvements in language acquisition. For example, in another Nishizawa study (2010), students who read about 3,000,000 words improved their TOEIC scores as much as students who studied abroad for one year. And those who read 6,000,000 words improved more than students who studied abroad for one year. That is, 3,000,000 words can equal one year of study abroad, and 6,000,000 words can beat it.
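For a rough sense of scale, these reading volumes can be converted into reading time. The 125-words-per-minute rate is borrowed from the million-word challenge discussed later in this chapter; the conversion itself is mine, not Nishizawa’s:

```python
# Convert reading volume to reading hours, assuming a steady pace of
# 125 words per minute (the rate cited for the million-word challenge).
WPM = 125

for words in (300_000, 3_000_000, 6_000_000):
    hours = words / WPM / 60
    print(f"{words:>9,} words = about {hours:,.0f} hours of reading")
```

At that pace, the 3,000,000-word volume that matched a year abroad represents about 400 hours of reading, and 6,000,000 words about 800 hours.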
According to Seidenberg, “Language acquisition is driven by exposure to a massive amount of data, utterances that exhibit statistical regularities at many levels” (2017, Chapter 5, Section 1, para. 8). In simple terms: big data drives language acquisition. Seidenberg is a reading expert, and he says that extensive reading is one way learners access big data. “Readers gather data about the statistical properties of texts at multiple levels: letters, letter combinations, syllables, morphemes, words, sequences of words, phrases, and on to galaxies beyond” (2017, Chapter 5, Section 4, para. 2).
So the big question is this: how do we give our learners big data? For one, schools can give students opportunities to study abroad, live with host families, and immerse themselves in the target culture. But what about those who cannot study abroad? For example, what about English learners in schools in Japan, Korea, or China? Clearly one answer is extensive reading. It’s a great way to receive big linguistic data without traveling abroad.
Regarding ER, many experts suggest the one-million-word challenge (Sakai, 2002; Sakai & Kanda, 2005; Furukawa & Ito, 2005; Nishizawa & Yoshioka, 2016). For example, if readers read for about 20 minutes per day at a rate of 125 words per minute, they can read one million words in one year. But 20 minutes per day might be difficult for most learners. Thus, Nishizawa (2012) suggests the “long and easy” approach. Students start with really easy books in junior high. Then, by the time they finish high school, they can read more than one million words. But in the end, learners cannot escape the Big Data Theory. If we want our learners to reach advanced levels, we have to find ways for them to get big linguistic data.
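The arithmetic behind the one-million-word figure is easy to verify; this sketch just multiplies out the rates given above:

```python
# One-million-word challenge: 20 minutes per day at 125 words per minute.
wpm = 125
minutes_per_day = 20
days_per_year = 365

words_per_day = wpm * minutes_per_day           # 2,500 words per day
words_per_year = words_per_day * days_per_year  # 912,500 -- about one million
print(f"{words_per_year:,} words in one year")
```

To hit exactly one million words in a year at 125 words per minute, a reader needs 1,000,000 / 125 / 365, or about 22 minutes per day, which is why the challenge is usually framed as “about 20 minutes.”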