Tuesday, May 08, 2018

Songhay viewed through PCA

Playing around a bit more with PCA, I decided to apply the method* to a dataset I've worked with more extensively: Songhay, a compact language family spoken mainly in Niger and Mali. On a hundred-word list (Swadesh with a few changes), randomly choosing one form in cases of synonymy and including borrowings, I get the following table of lexical cognate percentages:

            Tabelbala  Tadaksahak  Tagdal  In-Gall  Timbuktu  Djenne  Kikara  Hombori  Zarma  Djougou
Tabelbala   1.000      0.678       0.670   0.687    0.636     0.667   0.625   0.622    0.616  0.602
Tadaksahak  0.678      1.000       0.857   0.800    0.630     0.635   0.567   0.576    0.580  0.586
Tagdal      0.670      0.857       1.000   0.857    0.632     0.649   0.579   0.588    0.582  0.588
In-Gall     0.687      0.800       0.857   1.000    0.650     0.667   0.598   0.606    0.600  0.606
Timbuktu    0.636      0.630       0.632   0.650    1.000     0.979   0.773   0.808    0.790  0.778
Djenne      0.667      0.635       0.649   0.667    0.979     1.000   0.753   0.789    0.771  0.768
Kikara      0.625      0.567       0.579   0.598    0.773     0.753   1.000   0.835    0.814  0.823
Hombori     0.622      0.576       0.588   0.606    0.808     0.789   0.835   1.000    0.838  0.867
Zarma       0.616      0.580       0.582   0.600    0.790     0.771   0.814   0.838    1.000  0.808
Djougou     0.602      0.586       0.588   0.606    0.778     0.768   0.823   0.867    0.808  1.000

Running this through R again to get its eigenvectors, the first two principal components are easily interpretable:
  • PC1 (eigenvalue=7.3) separates Songhay into three low-level subgroups - Western, Eastern, and Northern, in that order - with an obvious longitude effect: it traces a line eastward all the way down the Niger river, jumps further east to In-Gall, and then proceeds back westward through the Sahara.
  • PC2 (eigenvalue=1.1) measures the level of Berber/Tuareg influence.
All the other eigenvectors have eigenvalues lower than 0.4, and are thus much less significant.
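
For readers without R to hand, here is a minimal sketch in Python of how the two leading eigenvalues can be recovered from the table above. It is not the R procedure used for this post: it swaps in plain power iteration with deflation, which works here because the matrix is symmetric and its top two eigenvalues are well separated from the rest. The function names (mat_vec, power_iteration, deflate) are my own illustrative choices.

```python
import math

LANGS = ["Tabelbala", "Tadaksahak", "Tagdal", "In-Gall", "Timbuktu",
         "Djenne", "Kikara", "Hombori", "Zarma", "Djougou"]

# Lexical cognate percentages, copied from the table above.
SIM = [
    [1.000, 0.678, 0.670, 0.687, 0.636, 0.667, 0.625, 0.622, 0.616, 0.602],
    [0.678, 1.000, 0.857, 0.800, 0.630, 0.635, 0.567, 0.576, 0.580, 0.586],
    [0.670, 0.857, 1.000, 0.857, 0.632, 0.649, 0.579, 0.588, 0.582, 0.588],
    [0.687, 0.800, 0.857, 1.000, 0.650, 0.667, 0.598, 0.606, 0.600, 0.606],
    [0.636, 0.630, 0.632, 0.650, 1.000, 0.979, 0.773, 0.808, 0.790, 0.778],
    [0.667, 0.635, 0.649, 0.667, 0.979, 1.000, 0.753, 0.789, 0.771, 0.768],
    [0.625, 0.567, 0.579, 0.598, 0.773, 0.753, 1.000, 0.835, 0.814, 0.823],
    [0.622, 0.576, 0.588, 0.606, 0.808, 0.789, 0.835, 1.000, 0.838, 0.867],
    [0.616, 0.580, 0.582, 0.600, 0.790, 0.771, 0.814, 0.838, 1.000, 0.808],
    [0.602, 0.586, 0.588, 0.606, 0.778, 0.768, 0.823, 0.867, 0.808, 1.000],
]


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def power_iteration(m, iters=500):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair."""
    v = [float(i + 1) for i in range(len(m))]  # arbitrary non-zero start
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return lam, v


def deflate(m, lam, v):
    """Remove an eigenpair from a symmetric matrix so the next one dominates."""
    n = len(m)
    return [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


lam1, pc1 = power_iteration(SIM)
lam2, pc2 = power_iteration(deflate(SIM, lam1, pc1))

for name, x, y in zip(LANGS, pc1, pc2):
    print(f"{name:<11} PC1={x:+.3f} PC2={y:+.3f}")
print(f"eigenvalues: {lam1:.2f}, {lam2:.2f}")
```

The two printed eigenvalues should come out close to the values quoted in the list above. The sign of each eigenvector is arbitrary, so only the relative ordering of the languages along each component is meaningful.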

The resulting cluster patterns have a strikingly shallow time depth; as in the Arabic example in my last post, this method's results correspond well to criteria of synchronic mutual intelligibility (Western Songhay is much easier for Eastern Songhay speakers to understand than Northern is), but it completely fails to pick up on the deeper historical tie between Northern Songhay and Western Songhay (they demonstrably form a subgroup as against Eastern). It's nice how the strongest contact influence shows up as a PC, though; it would be worth exploring how good this method is at identifying contact more generally.

* Strictly speaking, this may not quite count as PCA - I'm starting from a similarity matrix generated non-numerically, rather than turning the lexical data into binary numeric data and letting that produce a similarity matrix.
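
To make the footnote's distinction concrete, here is a small Python sketch under stated assumptions: the cognate-class codes are invented for illustration (they are not the Songhay data), and each language is assumed to have exactly one form per meaning. Under those assumptions, counting shared cognates directly and going through a binary matrix (one 0/1 cell per meaning and cognate class) give the same similarity table.

```python
# Hypothetical cognate-class codes: one code per meaning per language.
COGNATES = {
    "LangA": ["a", "a", "b", "a", "c"],
    "LangB": ["a", "b", "b", "a", "a"],
    "LangC": ["b", "b", "b", "a", "a"],
}


def cognate_share(x, y):
    """Proportion of meanings for which two word lists share a cognate class."""
    return sum(p == q for p, q in zip(x, y)) / len(x)


def one_hot(codes, classes_per_meaning):
    """Binary row with one 0/1 cell per (meaning, cognate class) pair."""
    return [int(code == cls)
            for code, classes in zip(codes, classes_per_meaning)
            for cls in classes]


names = sorted(COGNATES)
n_meanings = len(COGNATES[names[0]])
# For each meaning, the cognate classes attested across all languages.
classes_per_meaning = [sorted({COGNATES[n][i] for n in names})
                       for i in range(n_meanings)]
binary = {n: one_hot(COGNATES[n], classes_per_meaning) for n in names}

# Route 1: compare the cognate codes directly.
direct = {(p, q): cognate_share(COGNATES[p], COGNATES[q])
          for p in names for q in names}
# Route 2: dot product of the binary rows, divided by the number of meanings.
via_binary = {(p, q): sum(a * b for a, b in zip(binary[p], binary[q])) / n_meanings
              for p in names for q in names}

for pair in sorted(direct):
    print(pair, direct[pair], via_binary[pair])
```

The two routes stop agreeing once synonyms are kept, since a language can then have more than one 1 per meaning in the binary matrix.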

Update, following Whygh's comment below: here's what SplitsTree gives based on the same table:

Monday, May 07, 2018

Some notes on PCA

(Exploratory notes, written to be readable to linguists but posted in the hope of feedback from geneticists and/or statisticians - in my previous incarnation as a mathmo, I was much more interested in pure than applied....)

Given the popularity of Principal Component Analysis (PCA) in population genetics, it's worth a historical linguist's while to have some idea of how it works and how it's applied there. This popularity might also suggest at first glance that the method has potential for historical linguistics; that possibility may be worth exploring, but it seems more promising as a tool for investigating synchronic language similarity.

Before we can do PCA, of course, we need a data set. Usually, though not always, population geneticists use SNPs - single nucleotide polymorphisms. The genome can be understood as a long "text" in a four-letter "alphabet"; a SNP is a position in that text where the letter used varies between copies of the text (ie between individuals). For each of m individuals, then, you check the value of each of a large number n of selected SNPs. That gives you an m by n data matrix of "letters". You then need to turn this from letters into numbers you can work with. As far as I understand, the way they do that (rather wasteful, but geneticists have such huge datasets they hardly care) is to pick a standard value for each SNP, and replace each letter with 1 if it's identical to that value, and 0 if it isn't. For technical convenience, they sometimes then "normalize" this: for each cell, subtract the mean value of its (SNP) column (so that the column mean ends up as 0), then rescale so that each column has the same variance.
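
As a rough illustration of the encoding step just described, here is a Python sketch on invented toy data (four individuals, three SNPs). It treats each SNP as a column of the m by n matrix; the genotype strings, the reference letters, and the function names are all hypothetical, and real pipelines may normalize differently.

```python
import math

# Hypothetical toy data: each string is one individual, each position one SNP.
GENOTYPES = ["ACG", "ATG", "GCG", "ACT"]
REFERENCE = "ACG"  # assumed "standard" letter for each SNP


def encode(genotypes, reference):
    """Replace each letter with 1 if it matches the standard, else 0."""
    return [[1 if letter == ref else 0 for letter, ref in zip(row, reference)]
            for row in genotypes]


def normalize(matrix):
    """Centre each SNP column on 0 and rescale it to unit variance."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        mean = sum(col) / n_rows
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n_rows)
        for i in range(n_rows):
            # A SNP with no variation carries no information; leave it at 0.
            out[i][j] = (col[i] - mean) / sd if sd > 0 else 0.0
    return out


encoded = encode(GENOTYPES, REFERENCE)
normalized = normalize(encoded)

for row in encoded:
    print(row)
```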

Using this data matrix, you then create a covariance matrix by multiplying the data matrix by its own transpose, divided by the number of markers: in the resulting table, each cell gives a measure of the relationship between a pair of individuals. Assuming simple 0/1 values as described above, each cell will in fact give the proportion of SNPs for which the two individuals both have the same value as the chosen standard. Within linguistics, lexicostatistics offers fairly comparable tables; there, the equivalent of SNPs is lexical items on the Swadesh list, but rather than "same value as the standard", the criterion is "cognate to each other" (or, in less reputable cases, "vaguely similar-looking").
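
A quick way to check the claim about 0/1 values is to compute the table both ways on a toy matrix. The sketch below uses made-up data (three individuals as rows, four SNPs as columns) and hypothetical function names; it does no centring or rescaling.

```python
# Hypothetical 0/1 matrix: rows are individuals, columns are SNPs.
X = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]


def relatedness(x):
    """Data matrix times its own transpose, divided by the number of markers."""
    n_markers = len(x[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / n_markers for r2 in x]
            for r1 in x]


def shared_standard_share(r1, r2):
    """Proportion of SNPs where both individuals have the standard value."""
    return sum(1 for a, b in zip(r1, r2) if a == 1 and b == 1) / len(r1)


K = relatedness(X)
for row in K:
    print(row)
```

Positions where both individuals differ from the standard contribute nothing to a cell, even when the two individuals match each other there, which is part of what makes this coding wasteful.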

Now, there is typically a lot of redundancy in the data and hence in the relatedness matrix too: in either case, the value of a given cell is fairly predictable from the value of other cells. (If individuals X and Y are very similar, and X is very similar to Z, then Y will also be very similar to Z.) PCA is a tool for identifying these redundancies by finding the covariance matrix's eigenvectors: effectively, rotating the axes in such a way as to get the data points as close to the axes as possible. Each individual is a data point in a space with as many dimensions as there are SNP measurements; for us 3D creatures, that's very hard to visualise graphically! But by picking just the two or three eigenvectors with the highest eigenvalues - ie, the axes accounting for most of the variance in the data - you can graphically represent the most important parts of what's going on in just a 2D or 3D plot. If two individuals cluster together in such a plot, then they share a lot of their genome - which, in human genetics, is in itself a reliable indicator of common ancestry, since mammals don't really do horizontal gene transfer. (In linguistics, the situation is rather different: sharing a lot of vocabulary is no guarantee of common ancestry unless that vocabulary is particularly basic.) You then try to interpret that fact in terms of concepts such as geographical isolation, founder events, migration, and admixture - the latter two corresponding very roughly to language contact.
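
The "rotating the axes" idea is easiest to see in two dimensions, where the rotation angle has a closed form. The Python sketch below uses five invented points lying close to a diagonal line; it is only an illustration of the geometry, not of how PCA is computed on real genetic data.

```python
import math

# Hypothetical 2D data lying close to the line y = x.
POINTS = [(2.0, 1.8), (1.0, 1.1), (0.0, 0.2), (-1.0, -0.9), (-2.0, -2.2)]


def center(points):
    """Shift the points so that both coordinates have mean 0."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    return [(x - mx, y - my) for x, y in points]


def covariance(points):
    """Return (var_x, cov_xy, var_y) for already-centred points."""
    n = len(points)
    a = sum(x * x for x, _ in points) / n
    b = sum(x * y for x, y in points) / n
    c = sum(y * y for _, y in points) / n
    return a, b, c


centered = center(POINTS)
a, b, c = covariance(centered)

# Angle of the first eigenvector of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * b, a - c)
cos_t, sin_t = math.cos(theta), math.sin(theta)

# Coordinates of each point on the rotated axes (PC1, PC2).
rotated = [(x * cos_t + y * sin_t, -x * sin_t + y * cos_t) for x, y in centered]
var1, cross, var2 = covariance(rotated)

print(f"variance along PC1: {var1:.4f}")
print(f"variance along PC2: {var2:.4f}")
print(f"share of variance on PC1: {var1 / (var1 + var2):.4f}")
```

After the rotation the two new coordinates are uncorrelated, and almost all of the spread sits on the first axis, so dropping the second axis loses very little.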

The most striking thing about all this, for me as a linguist, is how much data is getting thrown away at every stage of the process. That makes sense for geneticists, given that the dataset is so much bigger and simpler than what human language offers comparativists: one massive multi-gigabyte cognate per individual, made up of a four-letter universal alphabet! Historical linguists are stuck with a basic lexicon rarely exceeding a few thousand words, none of which need be cognate across a given language pair, and an "alphabet" (read: phonology) differing drastically from language to language - alongside other clues, such as morphology, that don't have any immediately obvious genetic counterpart but again have a comparatively small information content.

Nevertheless, there is one obvious readily available class of linguistic datasets to which one could be tempted to apply PCA, or just eigenvector extraction: lexicostatistical tables. For Semitic, someone with more free time than I have could readily construct one from Militarev 2015, or extract one from the supplemental PDFs (why PDFs?) in Kitchen et al. 2009. Failing that, however, a ready-made lexicostatistical similarity matrix is available for nine Arabic dialects, in Schulte & Seckinger 1985, p. 23/62. Its eigenvectors can easily be found using R; basically, the overwhelmingly dominant PC1 (eigenvalue 8.11) measures longitude, while PC2 (eigenvalue 0.19) sharply separates the sedentary Maghreb from the rest. This tells us two interesting things: within this dataset, Arabic looks overwhelmingly like a classic dialect continuum, with no sharp boundaries; and insofar as it divides up discontinuously at all, it's the sedentary Maghreb varieties that stand out as having taken their own course. The latter point shows up clearly on the graphs: plotting PC2 against PC1, or even PC3, we see a highly divergent Maghreb (and to a lesser extent Yemen) vs. a relatively homogeneous Mashriq. (One might imagine that this reflects a Berber substratum, but that is unlikely here; few if any Berber loans make it onto the 100-word Swadesh list.) All of this corresponds rather well to synchronic criteria of mutual comprehensibility, although a Swadesh list is only a very indirect measure of that. But it doesn't tell us much about historical events, beyond the null hypothesis of continuous contact in rough proportion to distance; about all you need to explain this particular dataset is a map.

(NEW: and with PC3:)

Wednesday, April 04, 2018

Songhay crows and Korandje ravens

In Niamey, where I went last week for a workshop on Songhay as a cross-border language, the crows do something I've never seen them do in any other country: they come to the window and start tapping on the glass, like something out of Edgar Allan Poe. The reaction of my fellow attendees taught me a new Songhay word - gaaru-gaaru "pied crow" (Heath 1998) - which in turn revealed a new Korandje etymology. In Korandje, "raven" is gạḍi. The shift of intervocalic *d to r in mainstream Songhay is well-established (Nicolaï 1981). But the vowels are more interesting.

Korandje ạ usually derives from *ar or *or. In several inherited Songhay words, however, ạ seems to derive from *a not followed by *r: thus kạṣ-əw "rough" < kas-ow, bạzu "skin bucket, waterbag" < baasu, hạmu "meat" < *hamu, kə̣kkạbu "key" < *karkabu. Yet *a otherwise usually yields a in similar contexts: contrast gani "louse" < *gani, akama "wheat" < *alkama, dzam-a "do it" < *dam-a. It looks as though the vowel in the following syllable is what makes the difference: if it's rounded, you get ạ, otherwise you get a (though one or two exceptions suggest that the story may be more complicated: notably, "difficult" is gab-ə̣w < *gab-ow.) Assuming this rule, *gaadu should regularly have yielded gaaru in mainstream Songhay and gạḍu in Korandje.

What we actually get, however, is gạḍi. Why? Well, Korandje has a rule of final high vowel deletion phrase-internally: if a word ends in i or u, its final vowel will be deleted unless it comes before a pause, ie most of the time. (Basically the opposite of Classical Arabic.) In a number of words, this seems to have led to confusion between original -i, -u, and consonant-final words. For instance, ạṣạnkri "skink" comes from Berber asrmkal, which should regularly have yielded ạṣạmkər; the i is unetymological (Souag 2015). In effect, speakers must have been hypercorrecting final high vowels - a fact which suggests that, if Korandje survives, it may be on its way towards phonologically losing them altogether, much as Classical Arabic did with final short vowels.

Monday, March 19, 2018

English spelling traces in Algerian placenames

Going east of Algiers along the coast, the names of two little port towns stand out. Their inhabitants know them as جنّات /d͡ʒənnat/ (sometimes جنّاد /d͡ʒənnad/) and دلّس /dalləs/ (or الدّلّس /ddalləs/). Those names would normally be transcribed in French as *Djennat (if not *Djennette) and *Delless. Yet in French - and hence, given the region's colonial history, in most Western languages - they are in fact written as Djinet and Dellys; the latter at least is very often even (mis)pronounced accordingly as /dɛlis/. French i and y are both normally pronounced /i/; why on earth would Frenchmen write the schwa /ə/ of these names in this way, when French has a schwa and normally writes it as e?

The most likely answer is that they didn't. Rather, they adopted or adapted these placenames' spelling from English - specifically, from the widely translated work of Thomas Shaw, an English reverend and Oxford fellow who spent several years in Algeria in the early 1700s, a century before France occupied Algiers. He spelt the two towns' names as Jinnett and Dellys respectively - a spelling which, in English, yields the almost exactly correct pronunciations /d͡ʒɪnɛt/ and /dɛlɪs/.

Shaw's book was translated into French by 1743, and the translator retained the English spellings of both names. In a later edition no doubt prompted by the French invasion (1830), Jinnett got amended to Djinnett - someone had finally got around to noticing that English j is pronounced like French dj, not like French j. The doubled letters, useful for indicating vowel quality in English but serving no purpose in French, were lost within a decade, as seen in Eyriès (1839). But the i of Djinet, and the y of Dellys, remained to testify to a period when French geographers relied on an English traveller to tell them about Algeria - and to confirm most colonists' lack of interest in how the locals pronounced these names.

Saturday, March 17, 2018

Good speaking is not good writing

There's an article by Nathan Robinson that's been going around recently titled "Jordan Peterson: The Intellectual We Deserve". After pages of apparently reasonable criticisms of his subject, the author delivers what he seems to think is his coup de grâce:
Even now, however, I am being too generous to Jordan Peterson’s intellect. I have been presenting him at his most comprehensible and polished. I have not been giving you the full experience of actually listening to him talk. Sitting through a Jordan Peterson lecture is very different to watching a rapid-fire television interview. Below, please find a fully-transcribed portion of 17 minutes of Peterson’s speech.[...] (NOTE: UNDER NO CIRCUMSTANCES ATTEMPT TO READ THE ENTIRETY OF THE FOLLOWING PASSAGE. READ AS MUCH AS YOU CAN BEFORE YOU BEGIN TO FEEL WEARY, THEN SCROLL QUICKLY TO THE END.)
Just to stack the scales a bit further, the transcription features no paragraphing. Nevertheless, I did read it - much quicker than watching some random video for 17 minutes! - and, rather anticlimactically, found a perfectly coherent and reasonably entertaining (if very likely unfair) parenting anecdote, obviously intended to illustrate the importance of setting boundaries. I rubbed my eyes and thought "How is it that an intelligent, well-educated native speaker of English can apparently not only see this transcript as an incoherent mess but also assume all his readers will? Am I crazy, or is he?"

The answer is simple: good speaking is not the same thing as good writing. Take a great talk, one that keeps a non-academic audience riveted, and transcribe it verbatim; it will almost always look rambling and repetitive on the page, unless you're already accustomed to reading such transcripts (part of the job for a descriptive linguist, but a rare experience for most people). That's simply the nature of the medium, and adequately explains the expected audience reaction. Maybe it even explains the author's reaction, if the only context he ever encounters long talks in is academia.

One of the author's main points - a valid one, I think - is that academics need to communicate better with the public for everyone's sake:

[...] he is popular partly because academia and the left have failed spectacularly at helping make the world intelligible to ordinary people, and giving them a clear and compelling political vision.
If so, the first step is to learn appropriate discourse strategies. You don't talk to confused young people on YouTube as if you were addressing a learned seminar, much less writing an article. Nathan Robinson surely realises this himself - but, by going for cheap laughs at the expense of a perfectly ordinary example of spoken language, he's not only weakening his main point but encouraging the very blindness to orality that makes it difficult for many academics to communicate with the public. Academics can surely do better - let a thousand learned YouTube channels bloom! - but not without (re)learning how to talk to the people they want to talk to.

Monday, March 12, 2018

Qaswarah revisited: a Qur'anic hapax in Modern South Arabian

A long time ago, I posted some rather speculative musings on the minor mystery of the allegedly Ethiopic word qaswarah قسورة in the Qur'ān, usually considered to mean "lion". An anonymous commenter years later came up with a much better but still rather speculative idea:
Research substantiates that both “lion” and “hunter” are plausible according to analyses of Proto-Highland Eastern Cushitic wherein “kas” is to stab, pierce or cut and the suffix of “wara” creates “agent nouns”. In modern “Ethiopic” languages such as Tigrinya and Ge’ez (as well as in some other African languages) the word “Wagatwara” means “hunter” and in earlier etymons of this word the “g” is rendered a “q” and the “t” is rendered an “s”.

But just now, looking through a Hobyot vocabulary (Nakano 2013:215), I came across an entry that makes all this discussion unnecessary. In Hobyot, "panther" is ḳáyṣ̂ər, with a plural ḳaṣ̂áwrət - clearly related to the term used in the Qur'ān, and clearly (given the ṣ̂) not borrowed from Arabic. The meaning corresponds closely enough to most commentators' consensus on qaswarah, while the location - in the extreme south of Arabia - helps explain why the term might have been associated in their minds with Ethiopia. In fact, the irregular correspondence of Hobyot ṣ̂ to Arabic s would suggest a loan into Arabic, rather than common inheritance, even if we didn't know how much this word puzzled the commentators.

Incidentally, the minority interpretation "archers" is presumably based on Persian, where -var added to a noun means "possessor of" - presumably, Arabic qaus "bow" + Persian -var would yield "bowman", and the feminine suffix -ah would form the plural as so often with nouns of profession. In light of the Hobyot form, it also should be clear that the majority of commentators were right to reject this interpretation.

Thursday, February 15, 2018

"Don't impose on me a language that isn't a vehicle of science": the Salhi scandal

Two years ago, the Algerian state finally decided to make Tamazight (Berber) an official language. In practice, this has not by any means implied giving it the same status as Arabic (much less as French). It has encouraged an expansion of Tamazight teaching, which is being extended to all wilayas (provinces) rather than just the ones with large numbers of Berber speakers. But Tamazight lessons - unlike Arabic, French, or English - remain completely optional. Most parents have no desire for their children to learn Tamazight, and were regularly complaining even before the question arose that the curriculum was too packed. Nevertheless, the very idea that Tamazight might someday be a required school subject seems to have been enough to drive at least one MP - the now-notorious Naima Salhi - into a ranting fury.

I've been reluctant to post about the Naima Salhi scandal, since it's obviously being used by this nonentity as a way to inflate her public profile. But when I heard the actual words of her paranoid rant against Berber, I realized I had to. Her words, thankfully, have been overwhelmingly repudiated by her peers. But her "reasoning" is a perfect specimen of a linguistic ideology that many people all over the world subscribe to, with a few instructive twists coming from the diglossic context of Algeria. As such, it's worth a closer look. Here's what she said, translated from - dialectal - Algerian Arabic into English:

"So don't impose on me a language - it's not a language anyway - don't impose on me a language that isn't a vehicle of science; don't impose on me a language that isn't recognized, isn't understood by people outside; what good is it to me? Study science with it? It doesn't have - it isn't a vehicle of science. Study technology with it? It isn't a vehicle of technology. Go abroad with it, to speak to people abroad? They don't know it and don't understand it. For God's sake, what good is it to us?
When it comes to the Arabic language - and oh, what a language! - which is the world language, which more than a billion people speak, they say we won't study it; a language which has billions of books, and billions of manuscripts, and billions of - everything - you say you won't study it and don't need it. Then you bring me a dead language, which doesn't have letters, and doesn't have meanings, and doesn't have words - you want to hold me back with it so you can make progress - and you go off, and eventually you get to the point, and you tell me: Me, I'm studying English, and I'm studying German, and Spanish, and Turkish, and you all don't know them. You're going to hold me back with this?
My little daughter was studying in a private school where most of them were Kabyles. She naturally learned the language with them, because her classmates' parents taught them to speak Kabyle, so it would continue and spread. So my daughter, with the best of intentions, learned with them. She'd come and speak it, and I never asked her "Why?" I didn't shut her up; I left her free to do as she likes. But now that we've gotten to the point where it's obligatory, I told her: Say another word in Kabyle (Berber) and I'll kill you, I'll discipline you if you say another word.
And I'm saying it plainly and challenging everyone: When we were going by intentions / naive, we didn't say a thing; now that it's become "push me and I'll step on you", don't push me and I won't step on you. Now we're going to make it about who's stronger? And the most for the stronger one? The majority is stronger. You'd have been better off leaving it down to intentions. Now that you think you're so smart and coming out with insults against us, now I'll insult you.
People like me, and people who are real men, and those who don't accept humiliation and aren't used to it, and whose family aren't used to it, won't accept from you something like this. And I now forbid my children from pronouncing a single word in Tamazight. I mean the Frenchified Kabyle made by the MAK and the treasonous terrorist MAK movement. And we need to demand that the MAK is a terrorist movement."
Let's pass over the bizarre misconceptions and factual errors for now (it doesn't have words???), and go to the heart of the matter. It's not an unusual phenomenon anywhere to find speakers of a majority language objecting to having to learn a supposedly useless minority language - look at Swedish in Finland, or Welsh in Wales, or even Irish in Ireland. In this case, however, diglossia introduces a further twist, making her very examples undermine her ideas.

She presents Kabyle as useless for what seem like bluntly utilitarian reasons: it's only spoken by other Algerians and it won't help you study science and technology. Yet most Algerians spend most of their lives in Algeria, and most people anywhere don't study science and technology past high school. By her own testimony, Kabyle is widely enough spoken that her daughter could pick it up in a private school even in a non-Kabyle area. Had her daughter failed to do so, she would presumably have had fewer friends, and found herself excluded from routine social interactions. Yet somehow, for Salhi, that fact doesn't even register as relevant to the question of the language's usefulness. The dialectal Arabic she's speaking is not taught in any school, and the idea of teaching it would no doubt drive her to even greater fury. Dialectal Arabic is by far the most widely used language in Algeria, without which she would find herself deaf and dumb in her own country - just ask any Kabyle outside Kabylie whether it's worth learning - yet that doesn't enter into her definition of "useful" either. A language is "useful", in fact, only if its presence in daily life is so limited as to make it useless in most contexts. Only then can speaking it be a valuable accomplishment that gives you access to coveted jobs, rather than a routine ability that remains invisible until you run into someone who lacks it. Only then is it an appropriate subject for study.

But Tamazight activism threatens to upset that basic rule. If Tamazight ever does become part of compulsory education, that would lead to children studying and getting graded on a language that some of them already speak. How hideously unfair! The Kabyle-speaking children won't need it, and the Arabic-speaking children won't want it. Clearly the only possible explanation for such a move is that Kabyle speakers want to give themselves an unfair advantage at school, and handicap the Arabic speakers. (/sarcasm) The idea that there might be another side to this - that Kabyle speakers would still have to learn dialectal Arabic on their own as they always have, getting no extra credit for that effort, whereas Arabic speakers would be getting government help in learning Kabyle - doesn't even seem to cross her mind.