Why a nose is a nose

In most of the world’s languages, a number of basic words have a similar sound structure. A word for ‘nose’ typically has a nasal sound, an /n/ or /m/; words for ‘mother’ have an /m/ or /n/; and various bone words, such as ‘knee’, have a /k/. It is a mystery how these connections emerge and why they are maintained as languages evolve over generations of speakers. Are we born to pronounce words in a specific way? Or does every new generation of speakers reinvent similar-sounding words for ‘mother’, ‘father’, ‘knee’, ‘blow’, and so forth? A new study from Lund (in collaboration with Tübingen) finds that these sound-meaning mappings are more stable than average as words evolve over time. This tendency is strongest for those sounds which are acquired earlier when a child learns a language. Our results indicate that across languages, new generations uphold these sound-symbolic associations and therefore keep pronouncing basic concepts similarly.

The study is published in Philosophical Transactions of the Royal Society B and can be accessed at https://doi.org/10.1098/rstb.2020.0190.

A previous study by the Lund group, which identified basic concepts that have a similar sound structure across the world's languages, was published in Linguistic Typology in 2020 and can be accessed at https://doi.org/10.1515/lingty-2020-2034.

Photo: istockphoto.

The Takla Makan desert in Western China is in the middle of nowhere. Being there feels more like having landed on a deserted Tatooine than on earth; most villages are very sparsely populated, and sand rocks, red desert sand, and dried salt rivers dominate the surroundings. The climate is horrible: winters are freezing, summers extremely hot and dry; springs and autumns are endurable, but temperatures between day and night often differ by 30 °C. In a village called Subashi I met a villager who had spent 20 years digging a well (by hand, I assume, considering the many years he had spent on the project). The well was obviously very deep, but it contained no water.
Nevertheless, French and German expeditions 100 years ago found the remnants of an Indo-European language in the sand-filled grottoes of this desert. The language, which was wrongly labelled ‘Tocharian’, after an Iranian tribe mentioned by the ancient Greeks, turned out to represent a branch of its own on the large Indo-European tree. In recent years, research has revealed new and interesting knowledge of this mysterious people, how they lived, where they came from, and what their language looked like.
During the first millennium CE, the Tocharian civilization flourished along the Silk Road. By that time, Tocharian had split into two languages, which for the sake of simplicity are labelled Tocharian A and Tocharian B. The Tocharian culture was in important respects not very different from other early Eastern medieval civilizations: they had a warrior class, a nobility, royals, farmers, and a religious class of monks, who lived on alms given by the working population. The Tocharians were Buddhists and learned to write from Buddhist missionaries from India, and the system they used to write their language was an adaptation of the Indic Brahmi script. Accordingly, most texts, which date from between 300 and 1100 CE, are of Buddhist content. A large part of the literary sources represent Tocharian adaptations of the Indian Buddhist canon, although parallels in Sanskrit cannot always be found. After the Islamic conquest of Central Asia and the closing of the Silk Road, the Tocharian kingdoms collapsed, the Tocharian language died out, the area was depopulated, and the desert sand quickly buried all traces of the Tocharian people and their language.
Even though our texts in Tocharian are of a relatively late date, at least compared to the ancient civilizations of the Mediterranean or the Fertile Crescent, archaeology, archaeogenetics and, most of all, language give us rich information about the prehistory of the Tocharians.
It is evident that the Tocharians left the Indo-European homeland very early and migrated towards the East. Even though Tocharian is a centum language and actually has more similarities with western than eastern Indo-European languages, it clearly forms its own branch on the Indo-European tree. The long absence from the Indo-European proto-language, together with a long period in isolation from other Indo-European languages, has resulted in two languages with very weird and complex structures. The languages have many case forms, like Uralic and Caucasian languages, and they have double causatives, like Turkic languages. But even though the Tocharian categories clearly show non-Indo-European impact in the typological structure, the inflectional forms themselves are all of Indo-European descent: the setup of verbs easily matches Greek or Sanskrit in its complexity and variety of forms. Most forms and categories reconstructed to Indo-European are there, but often in a reorganized structure and with changed use and meaning.
Even though most preserved texts are of Buddhist content, the language and the specific Tocharian version of Buddhism show many traces of a pre-Buddhist, pagan faith, not very different from what we assume was present in early Indo-European. We have a sun-god and a moon-god, as well as remnants of the so-called heroic myths and the concept of ‘eternal glory’, which is well represented in epic tales such as the Iliad, the Odyssey, or the Mahabharata.
Tocharians borrowed words from the Turkic Uighur language, from Chinese, and from Sanskrit; the latter in large amounts: almost half of the Tocharian lexicon has its source in Sanskrit. Uighur also borrowed from Tocharian. However, if we move back in time, Tocharian also borrowed a substantial amount of vocabulary, often administrative terms, from Iranian. In the period from 500 BCE onwards, Tocharian seems to have been basically a recipient language, which indicates that Tocharian during this period was a less important regional language than, for instance, Chinese (in the East) or various Iranian languages (in the West). If we look earlier than that, we find interesting and striking language contacts of Tocharian. Loans from early forms of Tocharian are found in Uralic languages, and very likely, a pre-form of Tocharian is responsible for the Indo-European borrowings into Early Chinese. Therefore we may assume that Tocharians had a more important cultural role in the archaic period than in the antique period, when they were basically the target of language borrowing.

The archaeological record in the Tocharian-speaking area is astonishingly rich: most famous are the well-preserved mummies, which look like Celts with their pointy hats, tattoos and red braids. Studies of their DNA indicate several origins, in the earlier layers mainly Western European haplogroups, in later layers predominantly Central Asian or Eastern haplogroups. The patrilineal DNA is mainly R1a1, a haplogroup associated with the Proto-Indo-European migration out of Eastern Europe.
However, there are many enigmas that still await a solution. One of the most complex issues is the large number of obscure lexemes in Tocharian. Even though the core vocabulary of Tocharian is completely Indo-European, most words of the lexicon (except for the many Sanskrit borrowings, of course) have either no etymology or a very uncertain etymology. It is possible that the Tocharians borrowed words from a long-lost substrate language, but what would that be? There are few traces of significantly different cultures in the area preceding the Tocharians. Alternatively, Tocharian picked up words from several extinct, unrelated languages of Eurasia on its speakers' way from Eastern Europe to the Takla Makan desert. Very few reliable etymologies in Tocharian can be traced to any of the living language families of Asia.
Coming up next: Heroic, lethal, or filthy animal? The history of pig words
References: (Adams 2013; Carling 2005; Carling et al. 2009; Mallory and Mair 2000; Malzahn 2011-2018; Pinault 2008)
Adams, Douglas Q. (2013), Dictionary of Tocharian B: Revised and Greatly Enlarged (Amsterdam: Rodopi).
Carling, Gerd (2005), 'Proto-Tocharian, Common Tocharian, and Tocharian – on the value of linguistic connections in a reconstructed language', in Karlene Jones-Bley, et al. (eds.), Proceedings of the Sixteenth Annual UCLA Indo-European Conference (Journal of Indo-European Studies - Monograph Series; Washington: Institute of Man), 47-70.
Carling, Gerd, Pinault, Georges-Jean, and Winter, Werner (2009), Dictionary and thesaurus of Tocharian A (Wiesbaden: Otto Harrassowitz).
Mallory, J. P. and Mair, Victor H. (2000), The Tarim mummies : ancient China and the mystery of the earliest peoples from the West (London: Thames & Hudson).
Malzahn, Melanie (2011-2018), CEToM - A Comprehensive Edition of Tocharian Manuscripts.
Pinault, Georges-Jean (2008), Chrestomathie tokharienne : textes et grammaire (Leuven: Peeters).

In the previous blogpost, I started a compilation of safe loans from and into Tocharian. I will continue this work in the next post. In this post, I will talk about loan directionality, since I am currently completing a paper (with several co-authors) on lexical borrowability in Eurasian languages. I want to say a few words about this project.

We have compiled and extracted all loan events in the lexical database, and tested various statistical measures on this data. Worth noticing are the directionality of loans in relation to language power and the differing source languages of the families. As I have described in recent posts, our lexical data set compiles culture concepts, i.e., words for farming, technology, hunting, and war, which have a presumed age that goes back at least to the Chalcolithic. This means that this vocabulary is not representative of the entire lexicon, only of these specific domains. Loans also extend over long periods, at least back to antiquity. If we look at the source languages, we notice that they differ between families. In Indo-European, Latin is most frequent, followed by Middle Low German, French, Old French, Slavic, and Classical Greek. In Caucasian, Turkic languages dominate, followed by Persian, Georgian, and Arabic. In Uralic, Scandinavian languages dominate, which is mainly due to the fact that Finno-Ugric languages dominate in our data (see pictures below).

The correlation between loan directionality and language power and population size is also noteworthy. We define the power of languages by a quantitative rank based on several features, including literary power, economic power and population size. This we plot against the occurrence as source and target language in loan events (see graph above). All languages are equally likely to be target languages, but the most powerful languages are more likely to be source languages. This is a significant correlation. The most frequent loan event is from a very powerful language to a very weak one. The second most frequent loan event is from a medium powerful language to a weak one. The third most frequent loan event is from a medium powerful to a medium powerful language. In scrutinizing the data, we observe that this last type of loan event is almost entirely restricted to the middle ages, which is also an interesting result. Inequality between languages seems to be specific to the antique and modern periods, whereas language contact in the middle ages was more evenly distributed between languages of equal power.
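The ranking of loan-event types described above can be illustrated with a small tally. This is only a sketch: the events listed here are invented toy data, and the three power classes ("high", "medium", "low") are a simplification that I assume for illustration, not the quantitative rank used in the study.

```python
from collections import Counter

# Toy loan events as (source power class, target power class).
# These tuples are invented for illustration; they are not from the database.
loan_events = (
    [("high", "low")] * 5
    + [("medium", "low")] * 3
    + [("medium", "medium")] * 2
    + [("high", "medium")]
    + [("low", "medium")]
)

# Count how often each (source, target) combination occurs.
pair_counts = Counter(loan_events)

# The three most frequent types of loan event, most frequent first.
top_three = [pair for pair, _ in pair_counts.most_common(3)]

# How often each power class acts as a source and as a target.
source_counts = Counter(source for source, _ in loan_events)
target_counts = Counter(target for _, target in loan_events)

for pair in top_three:
    print(pair, pair_counts[pair])
print("sources:", dict(source_counts))
print("targets:", dict(target_counts))
```

With real data, the same tally could be split by period to check in which centuries loans between languages of equal power cluster.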

Graph illustrating the most frequent source languages in Indo-European (top), Caucasian (middle), and Uralic (bottom) families.

This week, also known as the Holy Week, is part of the holiday that in English goes by the name of Easter. Easter, which is celebrated throughout the Christian world, is one of the most important festivities of the year, marking the beginning of spring or summer and the resurrection of Christ. Like most Christian holidays, the roots of Easter go back into pagan times. In Northern Europe in particular, many of the mysterious customs of an ancient spring festival have survived until today. Children chase an invisible easter hare, which puts candy-filled eggs in the grass. Birch twigs are gathered, taken indoors, and ornamented with painted eggs and feathers. Children also dress as witches or 'easterhubbies' (the difference is whether you wear a scarf or a hat), painting their faces with red dots, and go from door to door asking for candy. Afterwards, they are supposed to fly on their brooms to Brocken. Fires and fireworks are lit, and, most importantly, enormous quantities of egg, fish, meat, and candy are consumed.

So, which are the terms we use for this festival? Most languages have a form of the Greek (via Latin) word paskha, itself borrowed from Aramaic (Hebrew Pesach), meaning 'passover'. The West Germanic terms, such as English Easter and German Ostern, go back to a Common Germanic goddess of spring, Old English Eastre, which is identical to the Indo-European goddess of dawn *h2éus-ōs (Sanskrit uṣās, Latin aurōra). Other languages have words that in various ways relate to the basically biblical rituals of Easter, including 'sacrificial animal', 'taking of the meat', 'resurrection', 'great day' or 'great night', or 'liberation'.

Just as with the Christmas words (see http://www.gerdcarling.se/i/a32842142/2018/12/), the map of meanings of Easter unveils important information about various cultural spheres, as well as exceptions in the form of islands of different usage.

With this little etymological overview I would like to wish you all a Happy Easter!

Lubotsky, Alexander. Brill Online Dictionaries: Indo-European Etymological Dictionaries Online (https://dictionaries-brillonline-com.ludwig.lub.lu.se/iedo). Accessed 2019-04-17.
Troels-Lund 1932. Dagligt liv i Norden på 1500-talet. VII Årets fester. Stockholm: Bonniers.
Andersson et al 1968. Kulturhistoriskt Lexikon för Nordisk Medeltid XIII. Malmö: Allhems förlag.

I thank Ante Petrović for assistance with compiling/checking data for the Easter map.

Wikipedia has an excellent overview of names of Easter: https://en.wikipedia.org/wiki/Names_of_Easter

During the period of the first centuries BCE, the impact of Iranian becomes important in Tocharian. This is something we know from the relatively large number of loanwords in Tocharian from various Iranian languages, beginning with one or several unknown Old Iranian dialects (which are not Avestan or Old Persian) and continuing with loans from various known Middle Iranian languages, such as Khotanese, Sogdian, and Bactrian. As usual with loans, the exact match of the source word is seldom found, meaning that the exact source language cannot be identified.
Iranian loans in Tocharian are interesting from the viewpoint of their semantic domains, which are indicative of the cultural impact of the Iranians on the Tocharians in Central Asia.

A majority of the words refer to administrative concepts, e.g., titles or specific concepts of merchandise or administration, indicating that the Iranians influenced the Tocharians by imposing an administrative infrastructure. Examples are: Tocharian B waipecce 'possession', from Old Iranian, Avestan xʷaēpaiθya- 'own'. Tocharian B waipte 'separately, apart' < Common Tocharian *wai-pätæ, borrowed probably from an adjective, Old Iranian *hwai-pati in the sense of 'independent, oneself'. Tocharian A pärko, B pärkau 'advantage, profit, interest' < Common Toch. *pärkāwV, borrowed from Old Bactrian, Bactrian φρογαοο 'profit', Old Iranian *fragāwa-, Sogdian prγ'w, βry'w, Parthian frg'w 'treasure'. Tocharian A pare, B peri 'debt' < Common Tocharian *pæräī is borrowed from Old Bactrian *pāra > Bactrian paro 'debt, obligation, loan, amount, due'. Tocharian A āpṣātrik* ‘citizen of a borough or market-town’, borrowed from Old Iranian *αβþαρο < *api-xšaθra- ‘borough, sub-district (of a city)’.
Other words clearly refer to military concepts, such as values or terms for weapons: Tocharian B tsain 'arrow' from an Old Iranian *dzaina-, Avestan zaēna- 'weapon'. Tocharian A āmāṃ B amāṃ ‘pride, arrogance’, loan from Middle Iranian, cf. Buddhist Sogdian ’’m’n ‘power’. Tocharian A āṣāṃ B aṣāṃ ‘worthy’, borrowed from Middle Iranian, cf. Khotanese āṣaṇa- ‘worthy’. Tocharian A āṣānik B āṣānike ‘venerable, worthy of respect’, loan from Middle Iranian, with the same source as A āṣāṃ B aṣāṃ. Tocharian A senik ‘care, pledge, guarantee’, from Middle Iranian *zēnik (Khot. ysīnīta, Sogd. zynyh, Kroraina Prakrit jheniya-).
A number of words refer to farming and the household. Examples are: Tocharian AB ās ‘she-goat’, borrowed from Middle Iranian. Tocharian A kātak* B kattāke ‘master of the house, householder’, from Common Tocharian *kāttākǝ borrowed from Middle Iranian, cf. Khotanese ggāṭhaa, itself borrowed from Middle Indic, cf. Gāndhārī Prakrit *ghahaṭha, from Sanskrit gṛhastha-. Tocharian A miṣi B miṣṣe, miṣṣi ‘field’, borrowed from Khotanese mäṣṣa, miṣṣa ‘field for seed’.
A small number of words are Buddhist terms (normally, the impact of Sanskrit is enormous on both Tocharian languages here). An example is: Tocharian A pissaṅk ‘community of monks’, from Middle Iranian from Skt. bhikṣusaṃgha- ‘community of monks, monastic order’ (SWTF III:298b), cf. Khotanese bisaṃga-.
Finally, we have a group of words referring to plants and ingredients which are unfamiliar to the Tocharian flora (also here, Sanskrit loans are much more common). Examples are: Tocharian A kārāś B karāśe*, via Tocharian B from Khotanese karāśśa ‘climbing plant’. Tocharian A kuñcit B kwäñcit, kuñcit, from Khotanese kuṃjsata- ‘sesame’.
In conclusion, the Iranian impact on Tocharian is mainly pre-Buddhist, referring to concepts of administration, warfare, and farming. With the change to Buddhism, the impact of Old and Middle Indo-Aryan becomes completely dominant in both Tocharian languages.
The words have been extracted from these sources:
Carling Gerd (to appear). A Dictionary and Thesaurus of Tocharian A. Complete Edition. In collaboration with Georges-Jean Pinault. Wiesbaden: Harrassowitz (610p.).
Carling, Gerd (2005). Proto-Tocharian, Common Tocharian, and Tocharian – on the value of linguistic connections in a reconstructed language. In: Jones-Bley, Karlene, Huld, Martin E., Volpe, Angela Vella, Dexter, Miriam Robbins (eds.), Proceedings of the Sixteenth Annual UCLA Indo-European Conference. Journal of Indo-European Studies. Monograph Series 50, 47-70.
These sources have many references to works by, e.g., Georges-Jean Pinault, K T Schmidt, Werner Winter, Nicholas Sims-Williams, Harold Bailey, L Isebaert, Jörundur Hilmarsson.

Of the 3672 entries of the Tocharian A dictionary (Carling and Pinault to appear), 772 lemmas have been marked as “from Sanskrit”, which represents 21% of the entire vocabulary (of 1508 nouns, 338 are from Sanskrit, representing 22%). Of these 772 lemmas, 39 are marked as “via Middle Indic”, which represents 5% of the words borrowed from Sanskrit. Compared to Sanskrit loans, other source languages are marginal: there are 22 words marked as “from Middle Iranian”, 5 “from Chinese”, 10 “from Uighur”, 10 “from Prakrit”, and 4 “from Pali”.
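For readers who want to check the proportions, the shares can be recomputed from the counts given above. The sketch below uses only the figures reported in this post; the variable and function names are my own and purely illustrative.

```python
# Counts as reported in this post for the Tocharian A dictionary.
TOTAL_ENTRIES = 3672
FROM_SANSKRIT = 772
TOTAL_NOUNS = 1508
NOUNS_FROM_SANSKRIT = 338
VIA_MIDDLE_INDIC = 39

# Entries marked with other source languages, as listed in the post.
other_sources = {
    "Middle Iranian": 22,
    "Chinese": 5,
    "Uighur": 10,
    "Prakrit": 10,
    "Pali": 4,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


sanskrit_share = share(FROM_SANSKRIT, TOTAL_ENTRIES)
noun_share = share(NOUNS_FROM_SANSKRIT, TOTAL_NOUNS)
middle_indic_share = share(VIA_MIDDLE_INDIC, FROM_SANSKRIT)
other_total = sum(other_sources.values())

print(f"From Sanskrit, all entries: {sanskrit_share:.1f}%")
print(f"From Sanskrit, nouns only:  {noun_share:.1f}%")
print(f"Via Middle Indic:           {middle_indic_share:.1f}%")
print(f"Other sources combined:     {other_total} entries")
```

Rounded to whole numbers, the results agree with the 21%, 22% and 5% quoted above.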
What does this imply? First and foremost, of course, that Sanskrit, or rather Buddhist Hybrid Sanskrit, plays a fundamental role in Tocharian literature. “From Sanskrit” means that a word has been borrowed from Classical Sanskrit (Monier Williams 1899) or Buddhist Hybrid Sanskrit (Edgerton 1953, Bechert, Waldschmidt, and Bongard-Levin 1996) with no other change than an adaptation to the morphological system according to the languages’ rules for adapting loans (Krause and Thomas 1960). “From Prakrit” or “from Pali” means that the word can be traced back to a source attested in Pali or Prakrit texts, which is apparently much rarer than borrowing from Sanskrit.
So, what type of changes are we talking about when we define words as “via Middle Indic” instead of just “from Sanskrit”? (Note that the examples below are from Tocharian A; there are similar patterns in Tocharian B (Carling 2005).) Let us look at a couple of examples.
Some of the words are almost identical to the Sanskrit word, with little change:  A pāruṣak (n.) ‘name of a mythical garden’, via Middle Indic from Sanskrit pāruṣyaka- ‘n. of one of the groves of trāyastriṃśa gods’ (BHSD:343b), as in Pali phārusaka- ‘name of one of Indra's groves’ (PED:478b).  A kās* ‘Kāśa, a species of grass’, via Middle Indic from Sanskrit kāśa- ‘a species of grass’ (MW:280b).  
In other lexemes, there are more far-reaching phonological changes, which were either taken over from the Middle Indic source word or took place within Tocharian. Which of the two applies remains unclear. Examples are: A kurkal (n.) ‘bdellium, a medical ingredient’, via Middle Indic from Sanskrit gulgulu- ‘bdellium’ (MW:360b). A klawe (n.m.) ‘die, throw of the die’, via Middle Indic from Sanskrit glaha-, originally ‘throw of the dice’, and individually ‘die’ (MW:374b). A jar (n.m.) ‘topknot’, via Middle Indic from Sanskrit jaṭā- ‘the hair twisted together (as worn by ascetics)’ (MW:408a). A tāpātriś (n.m.) ‘name of a class of gods’, via Middle Indic from Sanskrit trāya(s)-triṃśa- ‘name of a class of gods’, cf. Pali tāvatiṃsa (BHSD:257b). A patatam (adv.) ‘fortunate, gifted’, via Middle Indic from Sanskrit pradattam, neuter adv. from pradatta- ‘granted, bestowed, gifted’ (MW:679c). A nātäk (n.m.) ‘lord’, via Middle Indic from Sanskrit nāthaka-, derived from Sanskrit nātha- ‘protector, patron, owner, lord’ (MW:534c). This vocabulary, both in Tocharian A and B (which has a larger vocabulary), is very interesting. The lexemes were apparently not borrowed from the literary standard of Prakrit and Pali or from Buddhist Hybrid Sanskrit directly. Rather, they were borrowed from one or several local Indo-Aryan dialects, which became extinct, but which may be part of a general change in Middle Indo-Aryan leading to the dialectal diversity of Modern Indo-Aryan languages.
In addition, the boundaries between Indo-Aryan and Iranian in some of these lexemes are not sharp: the words may have been borrowed from Iranian, but since Indo-Aryan is much better attested (via Classical Sanskrit), an Indo-Aryan source becomes more likely.
A systematization of sound changes in these words would likely add knowledge to the evolution of sound changes in Middle Indo-Aryan leading to Modern Indo-Aryan. This will also help us to tease apart Iranian from Indo-Aryan borrowings in Tocharian.

Bechert, Heinz, Ernst Waldschmidt, and Grigorij Maksimovic Bongard-Levin. 1996. Sanskrit-Wörterbuch der buddhistischen Texte aus den Turfan-Funden. Beih. 6, Sanskrit-Texte aus dem buddhistischen Kanon: Neuentdeckungen und Neueditionen, 3. Folge. Göttingen: Vandenhoeck und Ruprecht.
Carling, Gerd. 2005. "Proto-Tocharian, Common Tocharian, and Tocharian – on the value of linguistic connections in a reconstructed language." In Proceedings of the Sixteenth Annual UCLA Indo-European Conference, edited by Karlene Jones-Bley, Martin E. Huld, Angela Vella Volpe and Miriam Robbins Dexter, 47-70. Washington: Institute of Man.
Carling, Gerd, and Georges-Jean Pinault. to appear. A Dictionary and Thesaurus of Tocharian A. Wiesbaden: Harrassowitz.
Edgerton, Franklin. 1953. Buddhist hybrid Sanskrit grammar and dictionary, William Dwight Whitney linguistic series: Yale U.P.; Oxford U.P.
Krause, Wolfgang, and Werner Thomas. 1960. Tocharisches Elementarbuch. B. 1, Grammatik. Heidelberg.
Monier Williams, Monier. 1899. A Sanskrit-English Dictionary : Etymologically and philologically arranged with special Reference to Cognate Indo-European Languages. Oxford: At the Clarendon Press.

This blogpost will briefly introduce a highly interesting phenomenon in the history of Eurasian languages, namely the emergence of definiteness. Most ancient attested Indo-European languages do not have definiteness marking, but the phenomenon appears relatively early on in several languages, in various forms. The emergence of the various types of definiteness marking does not seem to be areally caused; rather, most of the variants emerge through internal pressure and grammaticalization. In addition, definiteness is not restricted to the Indo-European languages but also occurs in various forms in Caucasian families, in Turkic, as well as in some Uralic languages.
There are several types of definiteness marking, which typically co-occur in languages. One type is a non-bound definite article (as a special word class), as in German or English:
das Haus
DEF house ‘the house’

Another type is a bound definite marker, as in Scandinavian:
hus-et
house-DEF ‘the house’

A further type is definiteness marked on the adjective, as in Swedish:
det stor-a hus-et
DEF large-DEF house-DEF
‘the large house’

Definite marking can be obligatory, either at the end or at the beginning of a Noun Phrase, as in Bulgarian:
xubava-ta kniga
nice-DEF book
‘the nice book’

The ancient Indo-European languages lack definiteness marking, and this state has been preserved in a huge area of predominantly Slavic and Indo-Aryan languages. The emergence of the various forms of definiteness began already in ancient times, apparently independently and with large variation even within branches of the families, and escalated during the medieval period. A large part of the existing variation seems to be caused by parallel evolution. Still, the exact causes of the variation remain obscure.

Bauer, Brigitte. 2007. "The definite article in Indo-European. Emergence of a new grammatical category?" In Nominal Determination. Typology, context constraints, and historical emergence, edited by Elisabeth Stark, Elisabeth Leiss and Werner Abraham, 103-139. Amsterdam-Philadelphia: John Benjamins.


Variation in definiteness marking in historical Eurasian languages. For the legend, see the map of modern languages above.

Probability levels of different types of definiteness marking in protolanguages, based on an evolutionary test using the data of the DiACL database.

The Swedish summer vacation is approaching, and I will go to Australia, among other things to attend the International Conference on Historical Linguistics in Canberra, 1-5 July. I will give two talks, one about the evolution and tendencies of gender assignment in Indo-European, and one about the evolution and change of alignment in Indo-European. After the summer intermission I will return and write more about these two topics in different posts.
However, I will try (if I have time and possibility) to make an overview of some of the interesting talks from the ICHL conference. Therefore, stay tuned! Thanks to all readers and have a nice summer!

Wordcloud of texts from the blogposts of autumn 2018.

I am taking up this blog again after a summer intermission. During the summer, I have been at the 24th International Conference on Historical Linguistics in Canberra and at the 52nd Annual Meeting of the Societas Linguistica Europaea in Leipzig. In both places I talked about one specific topic, which has attracted my interest recently: gender evolution and gender assignment, specifically in Indo-European.
In a couple of coming blogposts, I will talk specifically about this issue. The first post will deal with the morphosyntactic reconstruction of the Indo-European gender system.

First of all, how do we define gender? The typical way to do this is to use agreement, which is visible on an agreeing article, adjective or verb. Normally, the gender system of a language is described in grammars and reflected in the dictionary of this language. However, this definition does not work for pronominal gender, which is trickier. For defining pronominal gender, it is necessary to look at the occurrence of gendered forms in pronominal systems.

Gender is prototypically a property of nouns, and once the gender has been identified for all nouns in a language, an important issue is to try to define the underlying causes for gender assignment. There is plenty of research on this issue, both from a general typological perspective and with respect to individual languages. According to the canonical gender literature (Corbett 1991, 2013, Corbett and Fraser 2000), there are three basic principles according to which gender is assigned in languages. These are phonological, morphological and semantic. A fundamental problem is that these rules typically compete in languages.

What is the situation in Indo-European?

Most languages have gender (masculine, feminine, neuter). No language has “purely” phonological, morphological, or semantic assignment. Diachrony apparently plays a role: many languages inherit larger or smaller parts of their gender system and gender assignment on nouns. Most languages have competing rules for assignment.

The next issue is the reconstruction of Indo-European gender. For the reconstruction of the Indo-European gender system, based on a morphological reconstruction of systems in the various branches, there are three proposals in the literature. The option suggested by Hermann Hirt in the 1930s (Hirt 1934, 1937) was that Indo-European had no gender, and that a three-gender system developed later by means of grammaticalization. The reconstruction of Delbrück and Brugmann (Brugmann & Delbrück 1893, 1897, 1900) contained three genders, like Sanskrit, Classical Greek and Latin, which later were either preserved or collapsed into a masculine-feminine or a common-neuter system. However, Brugmann and Delbrück were uncertain about the feminine gender, basically due to the formal correspondence in the reconstructed state of the feminine and the neuter (the -h2- suffix). Based on this formal similarity between the collective/neuter and the feminine, as well as the shape of the system of Anatolian with a commune and a collective/neuter, later Indo-European scholars agree that Indo-European had a two-gender animate-inanimate system (which is reflected in the Anatolian system), which later developed into a sex-based gender system with an additional collective gender, the neuter (see Table 1) (Luraghi 2011, Matasović 2004).
Basically, the model of Hirt implies that gender evolved by grammaticalization, and the Brugmann-Delbrück model that the three-gender system of Indo-European either remained or collapsed. However, we must remember that both these models were constructed before the discovery of Anatolian.
The mainstream model is based on the idea of a typological evolution of the gender systems, which move from an animate-inanimate to a sexus-based system. The sexus-based system retains the animacy distinction in the masculine/feminine and the distinction between abstract and concrete in the feminine/neuter (Table 1).

In brief, the mainstream model supposes that there is:

Trace of the old system in languages
Emergence of a human~non-human distinction after the proto-language
Emergence of an abstract~concrete distinction of non-human gender after the proto-language
Later mapping into a sexus-based system with retention of the concrete inanimate (neuter)
Continuation of the ancient assignment principles in various languages

Table 1. The developmental phases of the Indo-European gender system according to the mainstream model (after Luraghi 2009).
Stage 1: ANIMATE | INANIMATE
Stage 2: HUMAN | ABSTRACT | CONCRETE
Stage 3: MASCULINE/FEMININE | FEMININE | NEUTER

The next issue in this process is to find out what happens if an evolutionary model is used for the reconstruction (Cathcart, Carling et al. 2018, Carling 2019). Gender reconstruction is an important question for evolutionary models, since the system reconstructed for Proto-Indo-European has been changed in most living languages (see Table 1).

I will discuss this issue in the next blogpost.

Brugmann, Karl and Delbrück, Berthold (1893), Grundriss der vergleichenden Grammatik der indogermanischen Sprachen : kurzgefasste Darstellung der Geschichte des Altindischen, Altiranischen (Avestischen u. Altpersischen), Altarmenischen, Altgriechischen, Albanesischen, Lateinischen, Oskisch-Umbrischen, Altirischen, Gotischen, Althochdeutschen, Litauischen und Altkirchenslavischen. Bd 3, Vergleichende Syntax der indogermanischen Sprachen, T. 1 (Strassburg: Trübner).
--- (1897), Grundriss der vergleichenden Grammatik der indogermanischen Sprachen : kurzgefasste Darstellung der Geschichte des Altindischen, Altiranischen (Avestischen u. Altpersischen), Altarmenischen, Altgriechischen, Albanesischen, Lateinischen, Oskisch-Umbrischen, Altirischen, Gotischen, Althochdeutschen, Litauischen und Altkirchenslavischen. Bd 4, Vergleichende Syntax der indogermanischen Sprachen, T. 2 (Strassburg: Trübner).
--- (1900), Grundriss der vergleichenden Grammatik der indogermanischen Sprachen : kurzgefasste Darstellung der Geschichte des Altindischen, Altiranischen (Avestischen u. Altpersischen), Altarmenischen, Altgriechischen, Albanesischen, Lateinischen, Oskisch-Umbrischen, Altirischen, Gotischen, Althochdeutschen, Litauischen und Altkirchenslavischen. Bd 5, Vergleichende Syntax der indogermanischen Sprachen, T. 3 (Strassburg: Trübner).
Carling, Gerd (2019), Mouton Atlas of Languages and Cultures. Vol. 1: Europe, Caucasus, Western and Southern Asia (Berlin - New York: Mouton de Gruyter).
Cathcart, Chundra, et al. (2018), 'Areal pressure in grammatical evolution', Diachronica, 35 (1), 1-34.
Corbett, Greville G. (1991), Gender (Cambridge textbooks in linguistics, 99-0104661-0; Cambridge: Cambridge Univ. Press).
Corbett, Greville G. (2013), 'Gender typology', The Expression of Gender.
Corbett, Greville G. and Fraser, Norman M. (2000), 'Gender assignment: a typology and a model', in Gunter Senft (ed.), Systems of Nominal Classification (Cambridge: Cambridge University Press), 293-325.
Hirt, Hermann Alfred (1934), Indogermanische Grammatik. T. 6, Syntax, 1 : syntaktische Verwendung der Kasus und der Verbalformen (Heidelberg: Carl Winter).
Luraghi, Silvia (2011), 'The origin of the Proto-Indo-European gender system: Typological considerations', Folia Linguistica, 45 (2), 435-64.
Matasović, Ranko (2004), Gender in Indo-European (Heidelberg: Winter).


Pronominal gender systems in Indo-European languages.

Evolutionary reconstruction of gender in Indo-European is a highly interesting field. The subject is a perfect testbed for how well evolutionary methods generally work. The core issue is that the system that we reconstruct for Proto-Indo-European, a system with a commune/neuter distinction, which has developed into a sexus-based system (masculine/feminine/neuter) in most daughter branches, is preserved only in Anatolian (Hittite, Luwian), the oldest attested Indo-European branch. However, in Scandinavian and Dutch/Frisian, a commune/neuter system has re-emerged through the merger of an earlier three-gender system. Therefore, on the surface, Anatolian and Scandinavian are similar, as we see from the MCA plot above, which indicates the synchronic similarities of Indo-European gender systems based on attested languages. However, the similarity between Scandinavian, Frisian/Dutch and Hittite/Luwian is an illusion, or, to use evolutionary terminology, an example of homoplasy. The background and the functionality of the different systems are completely different. How can we make evolutionary methods account for this difference in the reconstruction?
This is where we can test how well different models perform. Experiments (performed by our colleagues Chundra Cathcart, Harald Hammarström, and Marc Tang) indicate that the results of an evolutionary reconstruction are similar to those of a comparative reconstruction (even if the method, of course, is completely different). What we want the evolutionary reconstruction to produce is a high probability of masculine/neuter at the root (i.e., Proto-Indo-European) and a lower probability of a feminine.
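To make "probability at the root" more concrete, here is a minimal Python sketch of how a root probability can be computed for one binary feature (feminine gender absent = 0, present = 1) on a toy tree. It is not the method used in the experiments mentioned above. The two-state symmetric model, the rate, the tree shape and the branch lengths are all invented for illustration, and the function names are hypothetical. Only the tip states reflect well-known facts about the languages.

```python
import math

def transition(state_from, state_to, rate, t):
    """Two-state symmetric model: probability of state_to after time t."""
    same = 0.5 + 0.5 * math.exp(-2.0 * rate * t)
    return same if state_from == state_to else 1.0 - same

def partial_likelihood(node, rate):
    """Return [L(state 0), L(state 1)] for the subtree below node."""
    if isinstance(node, tuple):            # leaf: (name, observed_state)
        _, observed = node
        return [1.0 if s == observed else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node:             # internal node: list of (child, branch_length)
        child_l = partial_likelihood(child, rate)
        for s in (0, 1):
            result[s] *= sum(
                transition(s, c, rate, length) * child_l[c] for c in (0, 1)
            )
    return result

def root_posterior(tree, rate, prior=(0.5, 0.5)):
    """Posterior probability of each state at the root, given the tip states."""
    lik = partial_likelihood(tree, rate)
    joint = [prior[s] * lik[s] for s in (0, 1)]
    total = sum(joint)
    return [j / total for j in joint]

# Toy tree. Branch lengths and topology are ASSUMPTIONS, not estimates.
# 0 = no feminine gender, 1 = feminine gender present.
core = [(("Latin", 1), 1.0), (("Sanskrit", 1), 1.0), (("Swedish", 0), 1.0)]
tree = [(("Hittite", 0), 0.5), (core, 0.5)]

posterior = root_posterior(tree, rate=0.3)
print("P(no feminine at root) =", round(posterior[0], 3))
print("P(feminine at root)    =", round(posterior[1], 3))
```

In this toy setting the short branch to Hittite pulls the root towards the Anatolian state. If you lengthen that branch or move Hittite inside the core group, the root probabilities change, which is a small-scale version of how much the result depends on the tree.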
In experimenting with the data and different models, we find that the most important thing is the shape of the tree. For Indo-European, we get different results if we use a branched vs non-branched tree, if we use Indo-Anatolian vs non-Indo-Anatolian, and if we use ancestry constraints vs. non-ancestry constraints (with ancestry constraints, ancestral languages are situated on the branches of the tree and are not 'cousins' of the living languages). As for the model, we get different results depending on whether we use a Markov Chain Monte Carlo model or a Dollo model. A Markov Chain Monte Carlo model basically constructs a chain that has a desired distribution as its equilibrium distribution, so that one can obtain a sample of the desired distribution by recording states from the chain. A Dollo model has as its precondition that a system never returns exactly to its previous state, but keeps traces of the intermediate stages through which it passes. A Dollo model with an Indo-Anatolian tree produces a reconstruction which looks very similar to Anatolian. However, more experimenting needs to be performed: obviously, it is necessary to have a correct tree of a family before an evolutionary reconstruction can be performed. But some models of reconstruction may be better than others, depending on how they deal with the problem of homoplasy and parallel drift.
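The idea of a chain whose equilibrium distribution is the desired distribution can be shown with a small Metropolis sampler in Python. This is only a sketch of the general technique, not the software used in the experiments. The three states "A", "B", "C" and their weights are hypothetical placeholders, and the function name is our own.

```python
import random
from collections import Counter

def metropolis(weights, n_steps, seed=0):
    """Sample states with probability proportional to weights[state]."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    states = list(weights)
    current = states[0]
    samples = []
    for _ in range(n_steps):
        proposal = rng.choice(states)      # symmetric proposal: any state, uniformly
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal             # move; otherwise stay where we are
        samples.append(current)            # record the state of the chain
    return samples

# Hypothetical target: unnormalized weights 1 : 2 : 7
weights = {"A": 1.0, "B": 2.0, "C": 7.0}
samples = metropolis(weights, n_steps=50_000)

counts = Counter(samples)
freq = {s: counts[s] / len(samples) for s in weights}
target = {s: w / sum(weights.values()) for s, w in weights.items()}
for s in weights:
    print(s, "sampled:", round(freq[s], 3), "target:", round(target[s], 3))
```

The chain only ever compares the weight of the proposed state with the weight of the current one, so the weights do not need to sum to one. After enough steps, the share of time the chain spends in each state approaches the target distribution.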