About the Edisyn search engine

The Edisyn project is an ESF-funded project on dialect syntax (see also www.dialectsyntax.org). It runs at the Meertens Institute in Amsterdam from September 2005 until September 2010, with a partial extension until March 2012. Edisyn has two goals. The first is to establish a European network of (dialect) syntacticians who use similar standards for the methodology of data collection, data storage and annotation, data retrieval and cartography. The second is to use this network to compile an extensive list of so-called doubling phenomena from European languages and dialects and to study them as a coherent object. Cross-linguistic comparison of doubling phenomena will enable us to test existing hypotheses about natural language and language variation and to formulate new ones.

One of the deliverables of the Edisyn project is the Edisyn search engine (cf. Kunst & Wesseling (2011), The Edisyn Search Engine). This online search interface enables users to query and compare various dialect corpora. By making these databases interoperable, we give linguists the opportunity to compare dialect data from different languages. It should be stressed that connecting the corpora does not alter their content in any way. Each database remains individually searchable at all times (via a link to the website of the relevant corpus).

To make the different databases interoperable, a tag set has been created that serves as an intermediary between the tag sets of the individual databases. The Edisyn tag set can be related to the tags of any database; consequently, no data or tags in a particular database need to be changed. The Edisyn tags consist of two parts: categories and features. These can be combined or searched for separately. A category represents the word class of a word; features specify properties of that word class. For example, the category can be 'Verb', with the features '1' (first person), 'sg' (singular) and 'past' (past tense) specifying the verb form. There is no restriction on the number of features that can be combined. This does not hold for categories: categories cannot be combined with one another. Because of the division between categories and features, not every possible linguistic tag needs to be included in the tag set individually, which would make the set too large and opaque.
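The division into one category plus any number of features can be illustrated with a small sketch. This is not the Edisyn implementation: the names EdisynTag, CORPUS_TAG_MAP and native_tags_for, the corpus names and the corpus-native tag strings are all hypothetical, chosen only to show how an intermediate tag could be matched against each corpus's own tags without changing the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EdisynTag:
    """At most one category (word class) plus any number of features."""
    category: Optional[str] = None      # None means "any category"
    features: frozenset = frozenset()   # e.g. {"1", "sg", "past"}

    def covers(self, other: "EdisynTag") -> bool:
        """True if `other` has this category (if given) and all these features."""
        same_category = self.category is None or self.category == other.category
        return same_category and self.features <= other.features


# Hypothetical mapping from each corpus's native tags to intermediate tags.
# The corpora keep their own tags; only this lookup table is maintained.
CORPUS_TAG_MAP = {
    "corpus_a": {
        "V(1,sg,past)": EdisynTag("Verb", frozenset({"1", "sg", "past"})),
        "V(3,pl,pres)": EdisynTag("Verb", frozenset({"3", "pl", "pres"})),
        "N(sg)": EdisynTag("Noun", frozenset({"sg"})),
    },
    "corpus_b": {
        "VB-PST-1SG": EdisynTag("Verb", frozenset({"1", "sg", "past"})),
        "NN-SG": EdisynTag("Noun", frozenset({"sg"})),
    },
}


def native_tags_for(corpus: str, query: EdisynTag) -> list:
    """Return the corpus-native tags that satisfy an intermediate-tag query."""
    return sorted(
        native
        for native, tag in CORPUS_TAG_MAP[corpus].items()
        if query.covers(tag)
    )


past_verbs = EdisynTag("Verb", frozenset({"past"}))
print(native_tags_for("corpus_a", past_verbs))
print(native_tags_for("corpus_b", past_verbs))
```

In this sketch a single query for past-tense verbs is translated into a different native tag for each corpus. Leaving the category empty corresponds to searching on features alone.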

All major word classes are included in the category set. The set of features is comprehensive and dynamic: if a feature is not present in the tag set and turns out to be needed, it can easily be added. This flexibility is required because new corpora will be added to the search engine, and each corpus may have (language-)specific tags that need to be expressed by the Edisyn tag set.

The Edisyn tag set is compatible with the ISOcat standard (for more information on ISOcat see www.isocat.org).

The Edisyn search engine can be searched in several ways. First, it is possible to search on the basis of strings, i.e. lexical words. Second, it is possible to search on the basis of tags. In the near future it will also be possible to search on the basis of English glosses (provided that the database in question has been enriched with such glosses).
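The two search modes currently available, by string and by tag, can be sketched as follows. This is an illustration under stated assumptions, not the search engine's actual code: the Token record, the functions search_by_string and search_by_tag, and the sample sentence are invented for the example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str            # the word as transcribed
    category: str        # word class, e.g. "Verb"
    features: frozenset  # e.g. {"1", "sg", "past"}


# Invented sample sentence; not taken from any of the corpora.
SENTENCE = [
    Token("I", "Pronoun", frozenset({"1", "sg"})),
    Token("walked", "Verb", frozenset({"1", "sg", "past"})),
    Token("home", "Adverb", frozenset()),
]


def search_by_string(tokens, text):
    """Return tokens whose form equals `text`, ignoring case."""
    return [t for t in tokens if t.form.lower() == text.lower()]


def search_by_tag(tokens, category=None, features=()):
    """Return tokens with the given category (if any) and all given features."""
    wanted = frozenset(features)
    return [
        t for t in tokens
        if (category is None or t.category == category)
        and wanted <= t.features
    ]


print([t.form for t in search_by_string(SENTENCE, "walked")])
print([t.form for t in search_by_tag(SENTENCE, "Verb", {"past"})])
print([t.form for t in search_by_tag(SENTENCE, features={"sg"})])
```

The last call searches on a feature alone, without a category, so it returns both the pronoun and the verb.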

The databases that can be consulted are the following:

ASIt (Italian dialects)

The ASIt database contains data from ca. 200 Northern Italian dialects, gathered during interviews with dialect speakers between 1985 and 2005. The interviews and questionnaires focus on specific syntactic phenomena, such as subject clitics, object clitics, auxiliary selection, modals and modality. The institution responsible for the ASIt database is the Information Management Systems Group at the Department of Information Engineering, University of Padua, Italy.

CORDIAL-SIN (Portuguese dialects)

CORDIAL-SIN is a dialect corpus of European Portuguese. The materials for this corpus were drawn from recordings of dialect speech collected by the Research Group on Linguistic Variation at the Linguistics Center of Lisbon University (CLUL) as fieldwork interviews for linguistic atlases between 1974 and 2004, in more than 200 locations in the Portuguese territory. The corpus amounts to 600,000 words, collected from 42 locations within the continental territory of Portugal and the archipelagos of Madeira and the Azores.

EMK (Estonian dialects)

The EMK database consists of interviews held with various dialect speakers of Estonian. Most of the recordings were made between 1960 and 1980; the oldest dates back to 1938 and the latest was made in 1990. In total, 229 recordings are included in the database. The corpus consists of ca. 1 million words (as of 2008). The dialect corpus is compiled by the University of Tartu in cooperation with the Institute of the Estonian Language.

FRED (English dialects)

The Freiburg Corpus of English Dialects (FRED) was compiled by the research group 'English Dialect Syntax from a Typological Perspective', based at the English Department of the University of Freiburg. FRED is a monolingual spoken-language dialect corpus that contains full-length interviews with native speakers from England and Scotland, as well as (in its full version) Wales, the Hebrides, and the Isle of Man. The texts reflect the 'traditional' varieties of British English spoken in these areas during the second half of the 20th century. The corpus consists of sound recordings and orthographic transcripts.

NDC (Scandinavian dialects)

The NDC database contains spontaneous speech data from dialects of the North Germanic languages across all of the Nordic countries: Norway, Sweden, Denmark, the Faroe Islands and Älvdalen (recordings from Iceland and of Swedish in Finland are in progress). Most of the data was gathered between 2000 and 2010 and comes from a variety of sources. The corpus is composed of spontaneous speech from interviews and conversations. It was initiated under the umbrella of the ScanDiaSyn research network and the Nordic Centre of Excellence NORMS. The Text Laboratory has provided the technical solutions.

SAND (Dutch dialects)

The SAND database (DynaSAND) contains data from Dutch dialects. The data was gathered between 2000 and 2005 in 267 locations spread across the Netherlands, Belgium and French Flanders. The SAND focused mainly on the left periphery of the clause, pronominal reference, negation and quantification, and the right periphery of the clause. The DynaSAND is hosted by the Meertens Institute. For other applications of this database see Kunst & Wesseling (2010), Dialect Corpora Taken Further: The DynaSAND corpus and its application in newer tools. In: Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (pp. 759-767).

SDDD (Slovene dialects)

The dialect data in this database is based on an online questionnaire distributed in March and April 2009. The questionnaire sentences mainly focus on syntactic doubling and related phenomena. The Slovene Dialect Database contains the responses of 126 informants. The collected data covers as much of the Slovene language area as possible.

In the future, more databases will be added to the search engine. There are no strict requirements for inclusion: as long as a dialect database contains useful and reliable data, preferably provided with tags, the Edisyn team will try to incorporate it into the search engine.