A Corpus-Based Study on Japanese BBS Messages: Micro-Macro Connections from Morphemes to Discourse

Yukiko Nishimura

Faculty of Humanities, Toyo Gakuen University, Japan


Linguistic aspects of computer-mediated communication (CMC) in English have been compared with speech and writing (Yates 1996). Earlier studies on Japanese CMC found that users employ informal conversational styles with creative orthography (Nishimura 2003, 2007). This paper reports a quantitative study of variation within CMC, and between CMC and both speech and writing, taking Halliday’s (1978) theory of language functions as its theoretical background.

This comparison adopts the methodology of Yates (1996), with two major modifications for Japanese. First, whereas Yates’s study used publicly available large-scale spoken and written corpora such as the London-Lund corpus, no equivalent corpora are available for Japanese at present, so this study builds smaller corpora of written and spoken Japanese in addition to a CMC corpus. Second, whereas the word is the basic unit of quantitative analysis in Yates’s study, the morpheme takes that role here because of the agglutinative nature of Japanese; morpheme-level analysis is made possible by ChaSen, a morphological parser for Japanese.

The CMC corpus consists of messages from two major bulletin board system (BBS) websites discussing popular films and language studies; the written corpus was created by scanning magazine articles on similar topics; the spoken corpus consists of transcriptions of casual conversation among friends on everyday topics. The comparison is two-fold: first, differences among CMC, speech and writing are examined; second, within CMC, messages from the two BBS websites are contrasted with each other against features of speech and writing. After identifying the part-of-speech distribution, this study specifically examines particles and auxiliary verbs, to clarify how they relate to the interpersonal and ideational functions (Halliday 1978).
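The part-of-speech counting step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used in the study: it assumes the parser output is tab-separated with the part of speech in the fourth field and an "EOS" line closing each sentence, and the sample text, the function names, and the per-1,000-morpheme normalization (a common way to compare corpora of different sizes) are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical sample in an assumed ChaSen-like layout:
# surface form, reading, base form, part of speech (tab-separated).
# "EOS" marks the end of a sentence.
SAMPLE = (
    "映画\tエイガ\t映画\t名詞-一般\n"
    "は\tハ\tは\t助詞-係助詞\n"
    "面白い\tオモシロイ\t面白い\t形容詞-自立\n"
    "よ\tヨ\tよ\t助詞-終助詞\n"
    "ね\tネ\tね\t助詞-終助詞\n"
    "EOS\n"
)


def pos_counts(parsed_text, depth=1):
    """Count morphemes per part of speech.

    depth=1 keeps only the top-level tag (e.g. particle);
    depth=2 keeps the first sub-category too (e.g. sentence-final particle).
    """
    counts = Counter()
    for line in parsed_text.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and sentence boundaries
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip lines that do not match the assumed layout
        tag = "-".join(fields[3].split("-")[:depth])
        counts[tag] += 1
    return counts


def per_thousand(counts):
    """Normalize raw counts to a rate per 1,000 morphemes."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 1000 * n / total for tag, n in counts.items()}


if __name__ == "__main__":
    coarse = pos_counts(SAMPLE)          # top-level categories
    fine = pos_counts(SAMPLE, depth=2)   # with first sub-category
    print(per_thousand(coarse))
    print(fine["助詞-終助詞"])
```

Running the same functions over the CMC, written, and spoken corpora separately would give normalized rates that can be set side by side, for instance for sentence-final particles versus case particles.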

Preliminary analysis of sentence-final and case particles reveals that the language of CMC occupies an intermediate position on a continuum from speech to writing, and can be described as “spoken-oriented, edited written”. Of the two BBS websites, the study finds that one shows more features of speech and the other more features of writing. While corpus linguistics in Japanese lags behind the accumulated research on English, this study shows how corpus work can be conducted at the microscopic level of the morpheme in Japanese, even though the data sets are limited in quantity. It is expected that this work can contribute to (1) the still limited body of CMC studies in Japanese, (2) corpus-based methodology for sociolinguistic research in Japanese, and (3) studies on variation in CMC contexts.


