How Scandinavian languages got two genders

Most Scandinavian languages, including Swedish and Danish, distinguish two genders. Nouns are classified as either common or neuter, which is marked by the indefinite article: en cykel ‘a bike’ but ett träd ‘a tree’. Speakers have to learn the gender of each noun in the language. Scandinavian languages once had more genders: feminine, masculine, and neuter. This was the case in Old Norse, the ancestral language of the Scandinavian languages, spoken about a thousand years ago in large parts of Scandinavia. In fact, some dialectal varieties of Scandinavian have kept the system of three genders, and in a new paper we study how these varieties are gradually drifting towards a system of two genders. Over time, the feminine becomes weaker: feminine endings are lost and fewer and fewer words are feminine. In addition, infrequent words are more unstable in their gender, and new loans become masculine. The process can be traced already in Old Norse, indicating that the decay and eventual loss of the feminine can be predicted before it takes place.

The study is published in the latest issue of the Journal of Germanic Linguistics.

Why a nose is a nose

In most of the world’s languages, a number of basic words have similar sound structure. A word for ‘nose’ typically has a nasal sound, an /n/ or /m/; words for ‘mother’ an /m/ or /n/; and various bone words, such as ‘knee’, a /k/. It is a mystery how these connections emerge and why they are maintained as languages evolve over generations of speakers. Are we born to pronounce words in a specific way? Or does every new generation of speakers reinvent similar-sounding words for ‘mother’, ‘father’, ‘knee’, ‘blow’, and so forth? A new study from Lund (in collaboration with Tübingen) finds that these sound-meaning mappings are more stable than average as words evolve over time. This tendency is strongest for those sounds which are acquired earlier when a child learns a language. Our results indicate that across languages, new generations uphold these sound-symbolic associations and therefore keep pronouncing basic concepts similarly.

The study is published in Philosophical Transactions of the Royal Society B and can be accessed at https://doi.org/10.1098/rstb.2020.0190.

A previous study by the Lund group, identifying basic concepts that have similar sound structure across the world's languages, was published in Linguistic Typology (2020) and can be accessed at https://doi.org/10.1515/lingty-2020-2034.

Photo: istockphoto.

Deixis refers to pointing by using language. Deixis seems to be universal – all languages have a system for denoting at least two dimensions of deixis: ‘here’ and ‘there’. Deixis is marked either by deictic markers without person reference (‘here’, ‘there’) or by deictic markers with person reference (‘s/he/that here’, ‘s/he/that there’). Almost without exception, deictic words are accompanied by gestures.
Deictic systems are very interesting – their purpose is clearly communicative and they are deeply rooted in our cognitive system. Think of a hunting situation: a speaker wants to tell a companion that a game animal is hiding among the bushes, or that a dangerous snake has been seen among the rocks.

-Where? asks the second speaker.
-Over there! answers the first speaker, pointing in the direction of the presumed hiding animal.
-Where over there? Did you really see it yourself?
-No, I am not sure … I thought I saw something...

In situations such as these, languages have developed different effective ways to standardize communication, often by means of intricate and complex systems of deixis. But even if the preconditions for deixis are imprinted in our brains, the ways in which the systems turn out are highly diverse and markedly cultural.

Deictic systems – at least the ones we are used to – typically distinguish two or three dimensions of deixis: ‘here’, ‘there’ and ‘over there’. In language, these dimensions are also mirrored in the sound structure of the words – a phenomenon that seems to be almost universal. Forms for ‘here’ are expressed by sounds that have higher frequency, e.g., the vowels i, e or consonants such as s, t. In contrast, forms for ‘there’ are expressed by sounds with lower frequency, such as the vowels a, o, u and the consonants m, b. This has to do with our perception of the surrounding environment: we associate closeness with familiarity, safety, smallness and a higher voice or pitch, whereas we associate distance with unfamiliarity, threat, large size, and a lower voice or pitch. To fully understand this phenomenon, think about the sound of a cat versus the growl of a tiger. Which one do we want to have closer? This opposition between high and low frequency in here- and there-forms is stable in languages. If it becomes distorted by change in the sound structure, the opposition is restored within generations.
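
The frequency opposition described above can be illustrated with a rough sketch. It scores a written form by counting the higher-frequency sounds named in the text (i, e, s, t) against the lower-frequency ones (a, o, u, m, b). The function name, the scoring scheme and the example pairs (French ici/là, German hier/dort) are assumptions chosen for illustration, not part of the cited studies, and spelling is only a crude stand-in for actual pronunciation.

```python
# Sound classes as listed in the text above.
HIGH_FREQUENCY = set("iest")   # vowels i, e; consonants s, t
LOW_FREQUENCY = set("aoumb")   # vowels a, o, u; consonants m, b


def frequency_score(form: str) -> int:
    """Return (# high-frequency letters) - (# low-frequency letters).

    Hypothetical scoring scheme: letters outside both classes count as 0.
    """
    score = 0
    for letter in form.lower():
        if letter in HIGH_FREQUENCY:
            score += 1
        elif letter in LOW_FREQUENCY:
            score -= 1
    return score


# Illustrative (here-form, there-form) pairs; accents are dropped for simplicity.
PAIRS = {"French": ("ici", "la"), "German": ("hier", "dort")}

for language, (here, there) in PAIRS.items():
    # A here-form that scores higher than its there-form fits the pattern.
    print(language, frequency_score(here), frequency_score(there))
```

In both toy pairs the here-form scores higher than the there-form, which is the direction of the opposition the post describes; a real test would of course need phonemic transcriptions and a large language sample.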

Besides forms for the basic deictic distinctions, some languages have expanded their deictic systems in various directions, introducing a large amount of additional information.
A system such as this is found in Kamaiurá, a Tupí-Guaraní language spoken in Upper Xingu, Brazil. Kamaiurá is a prototypical Amazonian language: it has a mother-in-law language, evidentiality (linguistic ‘truth-marking’), male versus female speech, and nominal tense. In the system of deictic terms, there are four basic dimensions of deixis: ‘s/he/that here’ (close to speaker), ‘s/he/that there’ (close to listener), ‘s/he/that over there’ (away from both speaker and listener), and ‘s/he/that over there’ (far away from both speaker and listener). Besides these four basic dimensions, there is a large set of forms, in total around 20. In normal speech, such as when someone tells a story or reports an event, these deictic forms are highly important. They communicate a number of dimensions of an event: the time, the place, the role of the speaker, what may come next, or what the speaker or the participants know or don’t know, as well as modalities, feelings, and so forth.
One deictic form denotes ‘s/he/that, close to speaker but invisible’ – a form used for instance about someone talking inside a house, who is heard through the door. Another form is used to mark that the referent is more or less close, heard but not seen, and yet another form marks that the referent is over there, neither heard nor seen, and that the speaker is uncertain about its status – the source is secondary, ‘hearsay’. There is also a form that refers to ‘that guy I don’t know the name of’ or ‘the guy I don’t remember’, and one form denotes that someone is close but not visible: this is used for instance when talking about an absent son. Further, forms may denote that someone is moving away or is located close to something else of importance. The system is impossible to learn for an outsider: since the use of the forms is consolidated in each and every situation in which the language is used, only native speakers can learn to master the system in full.
 
References: (Carling et al. 2017; Diessel 2011; Johansson and Carling 2015)
 
Coming up next: The Tocharians, the mysterious people who travelled more than 4000 km and ended up in a desert
 
Carling, Gerd, et al. (2017), 'Deixis in narrative: a study of Kamaiurá, a Tupí-Guaraní language of Upper Xingu, Brazil', Revista Brasileira de Linguística Antropológica, 9 (1), 13-48.
Diessel, Holger (2011), 'Deixis and Demonstratives', in Claudia Maienborn, Klaus von Heusinger, and Paul Portner (eds.), Semantics: An International Handbook of Natural Language Meaning (Handbooks of Linguistics and Communication Science: 33 (1-3); Berlin, Germany: de Gruyter Mouton), 2407-32.
Johansson, Niklas and Carling, Gerd (2015), 'The de-iconization and rebuilding of iconicity in spatial deixis', Acta Linguistica Hafniensia: International Journal of Linguistics, 47 (1), 4.
 


The Takla Makan desert in Western China is in the middle of nowhere. Being there feels more like having landed on a deserted Tatooine than on earth; most villages are very sparsely populated, and sand rocks, red desert sand, and dried salt rivers dominate the surroundings. The climate is horrible: winters are freezing, summers extremely hot and dry; springs and autumns are endurable, but temperatures between day and night often differ by 30 °C. In a village called Subashi I met a villager who had spent 20 years digging a well (by hand, I assume, considering the many years he had spent on the project). The well was obviously very deep, but it contained no water.
Nevertheless, French and German expeditions 100 years ago found the remnants of an Indo-European language in the sand-filled grottoes of this desert. The language, which was wrongly labelled ‘Tocharian’, after an Iranian tribe mentioned by the ancient Greeks, turned out to represent a branch of its own on the large Indo-European tree. In recent years, research has revealed new and interesting knowledge of this mysterious people, how they lived, where they came from, and what their language looked like.
During the first millennium CE, the Tocharian civilization flourished along the Silk Road. By that time, Tocharian had split into two languages, which for the sake of simplicity are labelled Tocharian A and Tocharian B. The Tocharian culture was in important aspects not very different from other early Eastern medieval civilizations: they had a warrior class, a nobility, royals, farmers, and a religious class of monks, who lived on alms from the working population. The Tocharians were Buddhists and learned to write from Buddhist missionaries from India, and the system they used to write their language was an adaptation of the Indic Brahmi script. Accordingly, most texts, which date from between 300 and 1100 CE, are of Buddhist content. A large part of the literary sources represent Tocharian adaptations of the Indian Buddhist canon – parallels in Sanskrit cannot always be found. After the Islamic conquest of Central Asia and the closing of the Silk Road, the Tocharian kingdoms collapsed, the Tocharian language died out, the area was depopulated, and the desert sand quickly buried all traces of the Tocharian people and their language.
Even though our texts in Tocharian are of a relatively late date, at least compared to the ancient civilizations of the Mediterranean or the Fertile Crescent, archaeology, archaeogenetics and – most of all – language give us rich information about the prehistory of the Tocharians.
It is evident that the Tocharians left the Indo-European homeland very early and migrated towards the East. Even though Tocharian is a centum language and actually has more similarities with western than eastern Indo-European languages, it clearly forms its own branch on the Indo-European tree. The long separation from the Indo-European proto-language, together with a long period in isolation from other Indo-European languages, has resulted in two languages with very unusual and complex structures. The languages have many case forms, like Uralic and Caucasian languages, and they have double causatives, like Turkic languages. But even though the Tocharian categories clearly show non-Indo-European impact in the typological structure, the inflectional forms themselves are all of Indo-European descent: the setup of verbs easily matches Greek or Sanskrit in its complexity and variety of forms. Most forms and categories reconstructed for Indo-European are there, but often in a reorganized structure and with changed use and meaning.
Even though most preserved texts are of Buddhist content, the language and the specific Tocharian version of Buddhism show many traces of a pre-Buddhist, pagan faith, not very different from what we assume was present in early Indo-European. We have a sun-god and a moon-god, as well as remnants of the so-called heroic myths and the concept of ‘eternal glory’, which is well represented in epic tales such as the Iliad, the Odyssey, or the Mahabharata.
Tocharians borrowed words from the Turkic Uighur language, from Chinese, and from Sanskrit; the latter in large amounts – almost half of the Tocharian lexicon has its source in Sanskrit. Uighur also borrowed from Tocharian. However, if we move back in time, Tocharian also borrowed a substantial amount of vocabulary, often administrative terms, from Iranian. From 500 BCE onwards, Tocharian seems basically to have been a recipient language, something that indicates that Tocharian during this period was a less important regional language than, for instance, Chinese (in the East) or various Iranian languages (in the West). If we look earlier than that, we find interesting and striking language contacts of Tocharian. Early forms of Tocharian are found in Uralic languages, and very likely, a pre-form of Tocharian is responsible for the Indo-European borrowings into Early Chinese. Therefore we may assume that Tocharians had a more important cultural role in the archaic period than in the antique period, when they were basically the target of language borrowing.

The archaeological record in the Tocharian-speaking area is astonishingly rich: most famous are the well-preserved mummies, which look like Celts with their pointy hats, tattoos and red braids. Studies of their DNA indicate several origins: in the earlier layers mainly Western European haplogroups, in later layers predominantly Central Asian or Eastern haplogroups. The patrilineal DNA is mainly R1a1, a haplogroup associated with the Proto-Indo-European migration out of Eastern Europe.
However, there are many enigmas that still await a solution. One of the most complex issues is the large amount of obscure lexemes in Tocharian. Even though the core vocabulary of Tocharian is completely Indo-European, most words of the lexicon (except for the many Sanskrit borrowings, of course) have either no etymology or a very uncertain etymology. It is possible that the Tocharians borrowed words from a long-lost substrate language – but what would that be? There are few traces of significantly different cultures in the area preceding the Tocharians. Alternatively, Tocharian picked up words from several extinct, unrelated languages of Eurasia on the way from Eastern Europe to the Takla Makan desert. Very few reliable etymologies in Tocharian can be traced to any of the living language families of Asia.
 
Coming up next: Heroic, lethal, or filthy animal? The history of pig words
 
References: (Adams 2013; Carling 2005; Carling et al. 2009; Mallory and Mair 2000; Malzahn 2011-2018; Pinault 2008)
Adams, Douglas Q. (2013), Dictionary of Tocharian B: Revised and Greatly Enlarged (Amsterdam: Rodopi).
Carling, Gerd (2005), 'Proto-Tocharian, Common Tocharian, and Tocharian – on the value of linguistic connections in a reconstructed language', in Karlene Jones-Bley, et al. (eds.), Proceedings of the Sixteenth Annual UCLA Indo-European Conference (Journal of Indo-European Studies - Monograph Series; Washington: Institute of Man), 47-70.
Carling, Gerd, Pinault, Georges-Jean, and Winter, Werner (2009), Dictionary and thesaurus of Tocharian A (Wiesbaden: Otto Harrassowitz).
Mallory, J. P. and Mair, Victor H. (2000), The Tarim mummies: ancient China and the mystery of the earliest peoples from the West (London: Thames & Hudson).
Malzahn, Melanie (2011-2018), CEToM - A Comprehensive Edition of Tocharian Manuscripts.
Pinault, Georges-Jean (2008), Chrestomathie tokharienne: textes et grammaire (Leuven: Peeters).

In this blog, I will try – as far as possible – to switch between lexicon and grammar. Most topics are related to ongoing research either by me or by people in our research group. I will also try to have PhDs and other researchers writing guest posts, sharing their research. Contact me if you want to contribute!

Since I began by posting a picture of the Eurasian diversity of the words for WHEEL, my first post is lexical: I will talk about terms for vehicles. Within Indo-European studies, vehicle-related vocabulary is an important issue. Generally, it is believed that the invention of the wheel as a means of transport during the early Chalcolithic was, together with the domestication of the horse for traction, the innovation that spread the Indo-European family all over Eurasia. However, there are several enigmas surrounding the origin of vehicles and wheeled transport. First, archaeology does not help us very much. The early wheels, hubs, and naves were made of wood, a non-durable material. Further, the spread of the wheel was so swift that we cannot know where it appeared first. Before wheeled transport, there were other uses of the wheel: millstones for grinding, the pottery wheel, and spindles for spinning, so the word for wheel in the Indo-European proto-language had several potential functions. More important is the entire complexity of wheel and transport-related lexemes in Indo-European and its neighbors.
For Indo-European, a set of forms for wheel and transport can be reconstructed for the proto-language. Beginning with WHEEL, we have at least 3 common terms (PIE *h₂wērg-wn̥t-ōn 'wheel, circle', PIE *h₂urg-i- 'wheel, circle', PIE *kʷekʷlo-, *kʷel-o- 'wheel, circle' < PIE *kʷel(H)- 'to turn'; PIE *Hróth₂o- 'wheel, circle' < PIE *(H)reth₂- 'to run'). Besides, we have terms for HUB or NAVE, which also mean ‘navel’ (PIE *h₃enbh-, *h₃nebh- 'navel, nave, hub', PIE *h₃nobh-li- 'navel, nave'), and a reconstructed lexeme for WAGON (PIE *weǵhno- 'wagon' < PIE *uoǵh- 'to carry, drive'). The process of creating a word for ‘wheel’ from a verb meaning ‘to roll’ is also found outside of Indo-European, such as in Caucasian languages: Proto-Kartvelian *gor- 'wheel; to roll', Proto-Nakh *gur- 'wheel', Proto-Dagestanian *gur- 'to whirl, to roll; wheel' (Georgian gor-gor-a 'wheel', Chechen gur-ma 'wheel for plough'); Proto-Kartvelian *bor- 'rotation', Proto-Nakh *bor-a 'mill's wheel', Proto-Dagestanian *bor-a 'wheel' (Georgian borbali 'wheel', Laz bor-bol-ia 'wheel', Laz bur-in-i 'rotation; spinning', Beshta örræ 'wheel', Avar ber 'wheel').
It is evident that the Indo-Europeans knew the wheel and also used wheeled transport. Whether this transport took them over large areas is questionable: the wagons were heavy, the wheels were of solid wood, and roads were absent. Wagons were more likely used for loading and traction, such as for pulling hay from the field to the barn. Caucasians also had a word for WAGON (PKv *sa-kʰum- 'carriage', PNWC *kwə 'carriage, cart', PD *hankʰwə- 'carriage, vehicle' (Megr o-kʰim-o 'carriage', Adyghe kʷə, kʰwə 'wagon', Ubykh kʰwə 'cart', Dargwa urkʰura 'carriage', Lezg akʰur 'carriage')). Apparently, these wagons were not fit to cross the high Caucasus Mountains and spread the languages over the open plains.
Proto-Indo-European also had several words for YOKE (e.g., PIE *yug-o- 'yoke'). YOKE is a highly stable word in Indo-European, which practically did not change its form and was rarely replaced in languages. If the root was replaced, new forms were derived from roots meaning ‘to bind’ (Proto-Slavic *arь̀mъ, *arьmò 'yoke, ox-yoke' < PIE *h₂er- 'join', Proto-Celtic *wedo- 'yoke, harness' < PIE *wedh- 'bind'). Interestingly enough, the Caucasian languages use the same root for YOKE (PKv *uɣ-el- 'yoke', PNWC *ɣəw 'yoke', PD *ur- 'yoke' (Georgian uɣeli 'yoke', Megrelian uɣeli 'yoke', Ubykh ɣawə 'yoke', Tabasaran uɣ-in 'cart (drawn by a single ox)', Udi ọq' 'yoke')). The yoke, regardless of whether it was put on a bull, horse, donkey or human, had a very simple and straightforward function, which did not change over the millennia: to put a device over the neck to facilitate traction and carrying.
Vehicle words in languages are highly interesting. Words for the parts of vehicles, such as the wheel or the hub, are seldom borrowed and remain stable in most languages. The words for WAGON and AXIS change more frequently: they are more often borrowed, and they often switch, expand or colexify their meanings, in particular with meanings referring to the sky and the firmament, e.g., ‘Polar star’, ‘axis’ or ‘firmament’. This tells us something about the cultural importance of the wheel and of transport: words are frequently projected onto the firmament, something that has a natural cause.
References: (Anthony 2007; Carling To appear (2019); Greenfield 2010; Mallory and Adams 2006; Piggott 1983)

Coming up next: the language of deixis

References
Anthony, David W. (2007), The horse, the wheel, and language: how Bronze-Age riders from the Eurasian steppes shaped the modern world (Princeton, N.J.: Princeton University Press).
Carling, Gerd (To appear (2019)), Mouton Atlas of Languages and Cultures. Vol. 1: Europe, Caucasus, Western and Southern Asia (Berlin - New York: Mouton de Gruyter).
Greenfield, Haskel J. (2010), 'The Secondary Products Revolution: the past, the present and the future', World Archaeology, 42 (1), 29-54.
Mallory, James P. and Adams, Douglas Q. (2006), The Oxford introduction to Proto-Indo-European and The Proto-Indo-European world (Oxford linguistics; Oxford: Oxford University Press).
Piggott, Stuart (1983), The earliest wheeled transport: from the Atlantic coast to the Caspian Sea ([London]: Thames and Hudson).


Density heatmaps indicating the frequency of languages as source (y) and target (x) language in loan events, by their rank in a Language Power Index.

A study in PLOS ONE shows that borrowing is hierarchical: borrowings are most likely to take place from a more prestigious language to a less prestigious one. In addition, the borrowability of cultural concepts increases with their labour intensity.

Abstract
All languages borrow words from other languages. Some languages are more prone to borrowing, while others borrow less, and different domains of the vocabulary are unequally susceptible to borrowing. Languages typically borrow words when a new concept is introduced, but languages may also borrow a new word for an already existing concept. Linguists describe two causalities for borrowing: need, i.e., the internal pressure of borrowing a new term for a concept in the language, and prestige, i.e., the external pressure of borrowing a term from a more prestigious language. We investigate lexical loans in a dataset of 104 concepts in 115 Eurasian languages from 7 families occupying a coherent contact area of the Eurasian landmass, of which Indo-European languages from various periods constitute a majority. We use a cognacy-coded dataset, which identifies loan events including a source and a target language. To avoid loans for newly introduced concepts in languages, we use a list of lexical concepts that have been in use at least since the Chalcolithic (4000–3000 BCE). We observe that the rates of borrowing are highly variable among concepts, lexical domains, languages, language families, and time periods. We compare our results to those of a global sample and observe that our rates are generally lower, but that the rates between the samples are significantly correlated. To test the causality of borrowing, we use two different ranks. Firstly, to test need, we use a cultural ranking of concepts by their mobility (of nature items) or their labour intensity and “distance-from-hearth” (of culture items). Secondly, to test prestige, we use a power ranking of languages by their socio-cultural status. We conclude that the borrowability of concepts increases with increasing mobility (nature), and with increased labour intensity and “distance-from-hearth” (culture). 
We also conclude that language prestige is not correlated with borrowability in general (all languages borrow, independently of prestige), but prestige predicts the directionality of borrowing, from a more prestigious language to a less prestigious one. The process is not constant over time, with a larger inequality during the ancient and modern periods, but this result may depend on the status of the data (non-prestigious languages often remain unattested). In conclusion, we observe that need and prestige compete as causes of lexical borrowing.
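
The directionality claim in the abstract can be made concrete with a minimal sketch. It takes a list of loan events as (source, target) pairs plus a power ranking of languages, and computes the share of events that run from a higher-ranked to a lower-ranked language. The language labels, the ranks and the loan events below are invented toy values, and the function name is hypothetical; this is not the statistical procedure used in the paper.

```python
from typing import Dict, List, Tuple


def downward_share(loans: List[Tuple[str, str]], power_rank: Dict[str, int]) -> float:
    """Share of loan events whose source outranks the target.

    Assumption: rank 1 is the most prestigious language. Events between
    equally ranked languages are left out of the comparison.
    """
    downward = 0
    comparable = 0
    for source, target in loans:
        if power_rank[source] == power_rank[target]:
            continue
        comparable += 1
        if power_rank[source] < power_rank[target]:
            downward += 1
    return downward / comparable if comparable else 0.0


# Toy data: three hypothetical languages A, B, C and four loan events.
toy_rank = {"A": 1, "B": 2, "C": 3}
toy_loans = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

print(downward_share(toy_loans, toy_rank))  # 0.75
```

A share well above 0.5 would point in the direction the abstract reports, with prestige predicting who borrows from whom rather than how much a language borrows overall.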

The large Scandinavian languages, such as Swedish and Danish, have reduced their three-gender system to a two-gender system of commune and neuter. However, several smaller dialects or languages, such as Jamtlandic and Elfdalian, have preserved the system of three genders. In a new study from our research group, Briana Van Epps and I investigate the assignment principles of gender in Jamtlandic. The dialect shows an instability of the feminine gender, which is visible, among other things, in the gender assignment of loanwords.

DOI to the paper (Nordic Journal of Linguistics (2019), 1-33):
https://doi.org/10.1017/S0332586519000209

Abstract:
In this study, we present an analysis of gender assignment tendencies in Jamtlandic, a language variety of Sweden, using a word list of 1029 items obtained from fieldwork. Most research on gender assignment in the Scandinavian languages focuses on the standard languages (Steinmetz 1985; Källström 1996; Trosterud 2001, 2006) and Norwegian dialects (Enger 2011, Kvinlaug 2011, Enger & Corbett 2012). However, gender assignment principles for Swedish dialects have not previously been researched. We find generalizations based on semantic, morphological, and phonological principles. Some of the principles apply more consistently than others, some ‘win’ in competition with other principles; a multinomial logistic regression analysis provides a statistical foundation for evaluating the principles. The strongest tendencies are those based on biological sex, plural inflection, derivational suffixes, and some phonological sequences. Weaker tendencies include non-core semantic tendencies and other phonological sequences. Gender assignment in modern loanwords differs from the overall material, with a larger proportion of nouns assigned masculine gender.
 

Continuing my blog posts about gender, I will say a few words about gender stability. Over time, words often change their gender. A well-known example: in Germanic languages, the words for 'sun' and 'moon' are feminine and masculine respectively (as in German die Sonne and der Mond), whereas in other branches of Indo-European the situation is the reverse (Italian sole 'sun', masculine, and luna 'moon', feminine).
The important and interesting thing here is to investigate the reasons for gender stability or instability. Are they connected to a specific gender? Or are they connected to specific words? Or is gender stability a matter of frequency? There are still very few, if any, studies that look at gender stability using large-scale data sets.
If we first consider the issue of gender instability in our culture data set for Indo-European, we notice that there is little difference between the genders when it comes to stability in cognates. We distinguish three classes: cognates with more than 90% same gender (stable class), cognates with between 50% and 90% same gender (dominant class), and cognates with under 50% same gender (change class). We notice that all three genders, masculine, feminine, and neuter, have approximately the same distribution within the classes stable, dominant and change (see picture below). However, the masculine is slightly overrepresented in the stable group, the feminine in the dominant group and the neuter in the change group, meaning that the masculine is the most stable, the feminine a bit less stable, and the neuter the most unstable. However, the differences are small.
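
The three stability classes can be sketched in a few lines of code. The sketch below reads "same gender" as the share of the most frequent gender within a cognate set; that reading, the function name and the treatment of the exact 50% and 90% boundaries are assumptions made for illustration, not a description of how the data set was actually coded.

```python
from collections import Counter
from typing import Sequence


def stability_class(genders: Sequence[str]) -> str:
    """Classify one cognate set by the share of its most frequent gender.

    Assumed thresholds: > 90% stable, 50-90% dominant, < 50% change.
    """
    if not genders:
        raise ValueError("a cognate set needs at least one gender value")
    majority_count = Counter(genders).most_common(1)[0][1]
    share = majority_count / len(genders)
    if share > 0.9:
        return "stable"
    if share >= 0.5:
        return "dominant"
    return "change"


# Toy cognate sets (invented values): m = masculine, f = feminine, n = neuter.
print(stability_class(["m"] * 10))                      # stable
print(stability_class(["m"] * 7 + ["f"] * 3))           # dominant
print(stability_class(["m", "m", "f", "f", "n", "n"]))  # change
```

Counting how often each gender is the majority gender within each class would then give the kind of distribution the paragraph above compares.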
What is more interesting, though, and probably also promising for future research on gender stability, is that there is large variation in the stability of different semantic classes. Crops, metals, trees, vegetables, and products are all highly stable, whereas drink & drugs, small cattle, tillage, and others are highly unstable. Whether there is a connection to general frequency remains to be tested for the entire Indo-European family, but a study on gender in Scandinavian languages only (Van Epps, Carling & Sapir 2019) found a correlation between frequency and gender instability.

Van Epps, Briana, Gerd Carling & Yair Sapir (to appear), “Gender assignment in six North Scandinavian languages: Patterns of variation and change”.
 


Heatmap of frequency of occurrence of various semantic classes in the different categories stable (





This week's blog post will deal with a complex topic: gender assignment.
As I have described in a previous post, gender involves a classification of nominal entities in language. Gender can generally be defined as classes of nouns which are reflected in the behaviour of associated words (Corbett 1991: 1). That is, gender is indicated by agreement of various elements. Gendered languages vary in the number of genders they have and in assignment, that is, in how individual lexical items receive a gender (Audring 2014, 2017). Some languages assign gender based on semantic principles (semantic assignment systems), in which gender reflects categories such as biological sex or animacy. Other languages have formal assignment systems, which can be divided into morphological and phonological assignment (Corbett 1991: 7-8). Thus, gender assignment may be guided by semantic qualities (e.g., male/female, level of abstractness, shape), by morphological criteria (e.g., stem formation, inflection class, derivational suffixes), or by phonological criteria (e.g., word-final vowels or consonants). Languages may use semantic factors only, or a combination of semantic and formal factors, but all gender languages have a semantic core (Corbett 1991: 8).
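
How semantic and formal principles can interact may be illustrated with a small sketch in which assignment rules are tried in a fixed order and the first rule that applies decides the gender. The German examples are my own illustration and are not taken from the database discussed in this post: nouns in -chen are neuter and nouns in -ung are feminine, so Mädchen 'girl' is neuter although it denotes a female. The rule functions, their ordering and the tiny suffix table are hypothetical simplifications.

```python
from typing import Callable, Optional, Sequence

# A rule takes a noun and an optional biological sex and may return a gender.
Rule = Callable[[str, Optional[str]], Optional[str]]


def morphological_rule(noun: str, sex: Optional[str]) -> Optional[str]:
    """Formal assignment by derivational suffix (two German suffixes only)."""
    suffixes = {"chen": "neuter", "ung": "feminine"}
    for suffix, gender in suffixes.items():
        if noun.lower().endswith(suffix):
            return gender
    return None


def semantic_rule(noun: str, sex: Optional[str]) -> Optional[str]:
    """Semantic assignment by biological sex of the referent."""
    return {"female": "feminine", "male": "masculine"}.get(sex)


def assign_gender(
    noun: str,
    sex: Optional[str] = None,
    rules: Sequence[Rule] = (morphological_rule, semantic_rule),
) -> Optional[str]:
    """Return the gender given by the first applicable rule, else None."""
    for rule in rules:
        gender = rule(noun, sex)
        if gender is not None:
            return gender
    return None


print(assign_gender("Mädchen", sex="female"))  # neuter: the suffix rule wins
print(assign_gender("Frau", sex="female"))     # feminine: semantic rule
print(assign_gender("Zeitung"))                # feminine: suffix rule
```

Changing the order of the rules changes the outcome for Mädchen, which is one simple way to picture principles that compete with each other.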

When looking at gender assignment in Indo-European culture vocabulary (the 100-culture list of our database, consisting of 8,500 gender- and cognacy-coded lexical items), some interesting tendencies emerge. We cannot investigate the phonological and morphological assignment principles on the data in its current shape (words in languages have not been coded for morphology or phonology), but many other interesting tendencies can be extracted from the data.
First, the total distribution of genders of lexical items in the data is straightforward: masculine > feminine > neuter > alternans (see below). This is also reflected in the timeline of evolution of genders (see below), where we see that the masculine dominates in the early period, but weakens during the antique period and then regains strength during the first and in particular the second millennium CE, at the expense of the feminine and in particular the neuter.
We code all concepts for various semantic properties listed in the literature as important for gender assignment, such as animacy, collectiveness, countability, sexus, concreteness, and form/shape. In addition, we divide gender by different concept classes, which we derive from patterns of colexification and semantic change in the data.
We find that animate concepts (animals in our data) are significantly associated with the masculine gender (we compile both male and female forms of animals, but the overrepresentation of masculine for the general terms is important in the data). Further, we find that collectives as well as concepts coded as materials are significantly associated with the neuter gender. Our data does not contain abstract nouns, but surprisingly, we find that sharp and sticking implements are significantly associated with the feminine gender.
These tendencies for semantic properties underlie the overrepresentation of particular genders in certain semantic classes, which can be seen in the heatmap of gender distribution in relation to different classes above. In this heatmap, which divides concepts into classes, we can observe that neuter is overrepresented for metals and materials and drink and drugs, masculine is overrepresented for all animals, and feminine is overrepresented for weapons, trees and insects (honeybee). This indicates that assignment is not just caused by semantic property; it is very likely also caused by semantic class, but more research and data are required to test this assumption.

Audring, Jenny (2014), 'Gender as a complex feature', Language Sciences, 43, 5-17.
--- (2017), 'Calibrating complexity: How complex is a gender system?', Language Sciences, 60, 53-68.
Carling, Gerd (2019), Mouton Atlas of Languages and Cultures. Vol. 1: Europe, Caucasus, Western and Southern Asia (Berlin - New York: Mouton de Gruyter).
Corbett, Greville G. (1991), Gender (Cambridge Textbooks in Linguistics; Cambridge: Cambridge University Press).
--- (2014), The expression of gender (Berlin: De Gruyter Mouton).
Corbett, Greville G. and Fraser, Norman M. (2000), 'Gender assignment: a typology and a model', in Gunter Senft (ed.), Systems of Nominal Classification (Cambridge: Cambridge University Press), 293-325.
Corbett, Greville G. and Fedden, Sebastian (2016), 'Canonical Gender', Journal of Linguistics, 52 (3), 495-531.
Van Epps, Briana (2019), Sociolinguistic, comparative and historical perspectives on Scandinavian gender: With focus on Jamtlandic (PhD dissertation, Lund).
 


Distribution of the genders alternans, commune, neuter, feminine, and masculine in the dataset (lexemes of 104 concepts in 105 Indo-European languages)


Timeline of gender distribution in the lexical dataset (by Briana Van Epps).

Evolutionary reconstruction of gender in Indo-European is a highly interesting field, and the subject is a perfect testbed for how well evolutionary methods work in general. The core issue is that the system we reconstruct for Proto-Indo-European, a commune/neuter distinction that developed into a sexus-based system (masculine/feminine/neuter) in most daughter branches, is preserved only in Anatolian (Hittite, Luwian), the oldest attested Indo-European branch. However, in Scandinavian and Dutch/Frisian, a commune/neuter system has re-emerged through the merger of genders in an earlier three-gender system. On the surface, therefore, Anatolian and Scandinavian look similar, as we see from the MCA plot above, which indicates the synchronic similarities of Indo-European gender systems based on attested languages. However, the similarity between Scandinavian, Frisian/Dutch, and Hittite/Luwian is an illusion or, to use evolutionary terminology, an example of homoplasy. The background and the functionality of the systems are completely different. How can we make evolutionary methods account for this difference in the reconstruction?
This is where we can test how well different models perform. Experiments (performed by our colleagues Chundra Cathcart, Harald Hammarström, and Marc Tang) indicate that the results of an evolutionary reconstruction are similar to those of a comparative reconstruction (even if the method, of course, is completely different). What we want the evolutionary reconstruction to produce is a high probability of masculine/neuter at the root (i.e., Proto-Indo-European) and a lower probability of a feminine.
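To make the idea of a "probability at the root" concrete, here is a minimal sketch of how such a value can be computed on a toy tree with Felsenstein's pruning algorithm. Everything in it is an assumption made for illustration: the two-state coding ("2" for a commune/neuter system, "3" for a masculine/feminine/neuter system), the symmetric change model, the rate, the branch lengths, and both toy trees. It is not the model or the data used in the experiments mentioned above.

```python
import math

# Hypothetical two-state coding of a gender system (illustration only):
# "2" = commune/neuter system, "3" = masculine/feminine/neuter system.
STATES = ("2", "3")

def transition_prob(start, end, branch_length, rate=1.0):
    """Symmetric two-state Markov model: P(end state | start state, branch length)."""
    stay = 0.5 + 0.5 * math.exp(-2.0 * rate * branch_length)
    return stay if start == end else 1.0 - stay

def partial_likelihoods(node):
    """Pruning step: likelihood of the tips below `node` for each state at `node`.

    A leaf is a state string; an internal node is a list of (child, branch_length).
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, branch_length in node:
        child_like = partial_likelihoods(child)
        for s in STATES:
            result[s] *= sum(
                transition_prob(s, t, branch_length) * child_like[t] for t in STATES
            )
    return result

def root_posterior(tree):
    """Posterior probability of each state at the root under a uniform prior."""
    like = partial_likelihoods(tree)
    total = sum(0.5 * like[s] for s in STATES)
    return {s: 0.5 * like[s] / total for s in STATES}

# Toy "Indo-Anatolian" shape: one early two-gender branch splits off first, and a
# later clade contains two three-gender tips plus one secondary two-gender tip.
core = [("3", 0.5), ("3", 0.5), ("2", 0.5)]
indo_anatolian = [("2", 0.2), (core, 0.2)]

# The same four tips attached directly to the root (no internal structure).
flat = [("2", 0.5), ("3", 0.5), ("3", 0.5), ("2", 0.5)]

print("nested tree:", root_posterior(indo_anatolian))
print("flat tree:  ", root_posterior(flat))
```

With identical tip states, the nested tree puts more weight on a two-gender root than the flat tree does, which is a small-scale illustration of why the shape of the tree matters for the reconstruction.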
In experimenting with the data and different models, we find that the most important factor is the shape of the tree. For Indo-European, we get different results if we use a branched vs. a non-branched tree, an Indo-Anatolian vs. a non-Indo-Anatolian tree, and ancestry constraints vs. no ancestry constraints (with ancestry constraints, ancestral languages are placed on the branches of the tree rather than treated as 'cousins' of the living languages). As for the model, we get different results depending on whether we use a Markov Chain Monte Carlo model or a Dollo model. A Markov Chain Monte Carlo model basically constructs a chain that has a desired distribution as its equilibrium distribution, so that one can obtain a sample of the desired distribution by recording states from the chain. A Dollo model has as its precondition that a system never returns exactly to its previous state, but keeps a trace of the intermediate stages through which it passes. A Dollo model with an Indo-Anatolian tree produces a reconstruction that closely resembles Anatolian. However, more experimenting needs to be done: obviously, it is necessary to have a correct tree of a family before an evolutionary reconstruction can be performed. Beyond that, some models of reconstruction may be better than others, depending on how they deal with the problem of homoplasy and parallel drift.
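The Markov Chain Monte Carlo idea described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the target distribution over three gender states uses made-up numbers, and the sampler is a plain Metropolis scheme with a uniform proposal. In phylogenetic practice the chain would move over trees and model parameters rather than over a fixed three-way distribution.

```python
import random
from collections import Counter

# Hypothetical target distribution over gender states (made-up numbers).
TARGET = {"masculine": 0.5, "feminine": 0.3, "neuter": 0.2}
STATES = list(TARGET)

def metropolis_chain(n_steps, seed=0):
    """Run a Metropolis chain whose equilibrium distribution is TARGET."""
    rng = random.Random(seed)
    current = STATES[0]
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: pick any state uniformly at random.
        proposal = rng.choice(STATES)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = min(1.0, TARGET[proposal] / TARGET[current])
        if rng.random() < accept_prob:
            current = proposal
        # Record the state of the chain at every step.
        samples.append(current)
    return samples

def empirical_frequencies(samples):
    """Share of recorded steps spent in each state."""
    counts = Counter(samples)
    return {s: counts[s] / len(samples) for s in STATES}

samples = metropolis_chain(50_000)
freqs = empirical_frequencies(samples)
for state in STATES:
    print(state, "sampled:", round(freqs[state], 3), "target:", TARGET[state])
```

Recording the states visited by the chain yields frequencies that approach the target distribution as the chain gets longer, which is the property the paragraph above refers to.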