Croatian Toponyms


Toponyms are often affected by the linguistic phenomenon called tautology: a toponym is often composed of several words from different languages that all mean exactly the same thing. A famous example of that is Torpenhow Hill.

One of the most confusing things about the language of the Voynich Manuscript is that it has a huge entropy of phonotactics. Its first-order entropy appears to be remarkably similar to that of Hebrew, but its second-order entropy is even lower than that of the Hawaiian language. This has led some people to speculate that the letters within words in the Voynich Manuscript were rearranged according to some rules.
(That is slightly relevant here, because of this.)


My Interpretation of the Croatian Toponyms

UPDATE on 10/05/2022: If you speak Croatian, you may be interested in watching this YouTube video in which I summarize my ideas about toponyms (or, in case your browser cannot stream it, download the MP4 and open it in VLC or something like that).

UPDATE on 27/04/2023: I have opened a forum thread about this on Linguistics StackExchange and on r/etymology.

Sunset on the island called Mljet.
The Salt Lake on the Mljet Island. It's sometimes suggested that the islands were once the places richest in toponyms, because people had to use every single source of fresh water and every single piece of fertile land.

I was asked to create a web page in which I summarize my alternative interpretation of the Croatian toponyms, which I have supported on many Internet forums, at some conferences (the full text is available in this PDF on page 70) and in some journals (it is a version of this text, just slightly modified; there is an English-language summary of that text on my blog), so that we have everything about it on one page. Here we go!

ATTENTION: Some of the opinions stated in the following text are contrary to mainstream science. I would not advise you to read it if you don't have a substantial background in linguistics. I am not a conspiracy theorist who wants to bombard people with controversial statements they don't know how to evaluate, and I am not denying it is possible my work is to historical linguistics what Anatoly Fomenko's work is to history. If you are ready to read it, click here.

The remains of the Roman thermae in Issa.
The Roman Thermae in Issa (Vis) were getting the water from a mineral spring that doesn't exist any more. However, it's possible that Issa was named after it, from the Indo-European root *yos (spring).

That would be it! If you want to discuss my theory, go to the "Croatian Toponyms" forum thread I've linked to on the left. I'd like to have some sane opposition there, because I think my interpretation may be right. Ideas are correct or incorrect independent of their creators. The fact that I am not a linguist specializing in those things doesn't mean my ideas are wrong. I've used the methods that are well-accepted in linguistics (apart from applying statistics to the toponyms, which is for some reason very rarely done), I've just come to the conclusions that are different from the mainstream ones. (UPDATE: I think this paragraph I have written on Discord is a good explanation for how my methods differ from the mainstream ones, and why I think mine are better.)
If you have a proposal parallel to mine, I think it should address at least the following three things:
  1. Why do almost no toponyms in Croatia make sense to people speaking Croatian? An obvious answer seems to be that most toponyms in Croatia don't come from Croatian, but from some unattested substrate language that died relatively recently. It's fine if, in your proposal, you explain some toponyms using Slavic roots that don't exist (or have changed meaning) in modern Croatian. But if you do that again and again, then your proposal is unlikely.
  2. Why does this k-r pattern repeat in the Croatian river names? Either provide some explanation for that pattern (like, in my proposal, that *karr~kurr meant "to flow" in the substrate language) or provide some mathematical model of the language that says that pattern is not statistically significant. A crude birthday calculation (assuming the language has 20*20=400 equally likely consonant pairs) suggests the probability of that pattern occurring by chance is around 1/10'000. A birthday calculation that takes into account the collision entropy of the consonant pairs in the Croatian language suggests that the probability is somewhere between 1/300 and 1/17. Maybe an even better model would suggest it is not actually statistically significant. Right now, I do not see such a model. While you can perhaps dismiss most of my alternative interpretation of the names of places in Croatia as baseless speculation, you cannot dismiss in the same way my measurements and calculations showing that this k-r pattern is statistically significant. Crude birthday calculations suggest that we should expect 3 rivers in Croatia to accidentally start with the same two-consonant prefix. While you can push that number to 5 by adjusting those birthday calculations for the collision entropy of the phonology (including phonotactics) of the Croatian language, you cannot push it to 7 (Krka, Korana, Krapina, Kravarščica, Krbavica, 2*Karašica) that way. You need to explain it some other way. What I am not open to are phonosemantic explanations, that is, explanations which contradict the basic principles of linguistics. I realize language is not the same as math, but, in order to study it scientifically, we need to have some sort of rigour. If p-values turn out to be so useful in natural sciences, why not try to apply them here?
    I am also not open to blatant ad-hoc hypotheses: inventing reasons why an experiment wouldn't work without providing evidence supporting those reasons. You respond to an experiment with an experiment, not with speculation. If you are going to claim, for example, that my measurements and calculations are misleading because the nouns in the Croatian language have a significantly lower collision entropy than the rest of the words in the Aspell word-list, be prepared to provide evidence for that claim.
  3. If you are going to propose Latin etymologies, make sure they do not contradict the sound laws that applied when borrowing from Late Latin into Old Croatian. Realize that, for example, long i (not short, as somebody who knows the basics of Croatian historical phonology might expect) gets borrowed as front yer (like in the toponym Cavtat, from "civitatem", long 'i' got borrowed as a front yer and changed to 'a'), and that short 'i' got regularly borrowed as a yat (like in the toponym Srijem, from the ancient name Sirmium). So don't, for example, try to explain an 'i' in some toponym where a non-Ikavian dialect is spoken as being a borrowing from Latin 'i'. If you are going to propose that Croatian toponyms come from some other language, propose the sound laws and stick to them. Once again, I realize that language is not the same as math, but, in order to scientifically study it, we need to have some sort of rigour.
    I would suggest you also read the end of this StackExchange answer by Janus Bahs Jacquet, where he speculates about why Latin 'o' in the toponyms doesn't get borrowed as 'o' in Croatian.
    And understand that knowledge of historical phonology means very little if you are constantly dodging around the historical phonology by asserting recent borrowings or, worse yet, inventing unattested languages with sound changes that you would like. Mainstream linguistics seems to do that a lot.
I hope I do not sound too harsh, but all the proposals I am familiar with except mine do not address even one of those issues (especially not the first two).
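The crude birthday calculation mentioned in point 2 can be sketched roughly as follows. This is only a sketch under stated assumptions: 400 equally likely consonant pairs (as in point 2), 100 river names (the figure used further down this page), and a threshold of six names sharing a prefix (counting Karašica once). The function names are made up for this illustration, and the union bound used here gives an upper limit on the probability, not its exact value.

```javascript
"use strict";

// Binomial coefficient C(n, k), computed multiplicatively to avoid factorials.
function binomialCoefficient(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// Probability that at least `threshold` of `n` names fall into one given bin,
// when every name independently picks one of `bins` equally likely bins.
function tailOfOneBin(n, bins, threshold) {
  const p = 1 / bins;
  let tail = 0;
  for (let k = threshold; k <= n; k++)
    tail += binomialCoefficient(n, k) * p ** k * (1 - p) ** (n - k);
  return tail;
}

// Union bound: the chance that ANY bin reaches the threshold is at most
// bins * (chance that one given bin reaches it).
function upperBoundOnCollisionProbability(n, bins, threshold) {
  return Math.min(1, bins * tailOfOneBin(n, bins, threshold));
}

// Assumed inputs: 100 river names, 400 consonant pairs, 6 names per prefix.
const crudeBound = upperBoundOnCollisionProbability(100, 400, 6);
console.log(
  `Upper bound: ${crudeBound} (about 1 in ${Math.round(1 / crudeBound)})`
);
```

With these inputs the bound comes out on the order of 1/10'000. Raising the threshold to seven (counting both rivers called Karašica) makes it considerably smaller.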
UPDATE on 09/07/2018: You can download my Illyrian-Croatian dictionary here (it's a .DOCX file!).
UPDATE on 11/04/2021: I managed to install MatLab on my computer. So, here is that Octave program related to entropies modified so that it can be run in MatLab:
% This is a MatLab program which compares the results given by my algorithm
% for estimating the entropy with the results given by Shannon's algorithm.
suglasnici = 'bcdfghjklmnpqrstvwxyz';
testni_stringovi=cell(100 - length(suglasnici) + 1, 1);
for koliko_cemo_staviti_b_ova = 100 - length(suglasnici) + 1 : -1 : 1
  for i = 1 : koliko_cemo_staviti_b_ova
    testni_stringovi{koliko_cemo_staviti_b_ova} = [
      testni_stringovi{koliko_cemo_staviti_b_ova} 'b'
    ];
  end
  for i = 1 : 100 - koliko_cemo_staviti_b_ova
    testni_stringovi{koliko_cemo_staviti_b_ova} = [
        testni_stringovi{koliko_cemo_staviti_b_ova} suglasnici(int32(floor((i - 1) / (100 - koliko_cemo_staviti_b_ova) * (length(suglasnici) - 1))) + strfind(suglasnici, 'c'))
        ];
  end
end
samarzijine_entropije = [];
shannonove_entropije = [];
for i = 1 : length(testni_stringovi)
  str = testni_stringovi{i};
  samarzijine_entropije = [samarzijine_entropije samarzijina_entropija(str)];
  shannonove_entropije = [shannonove_entropije shannonova_entropija(str, suglasnici)];
end
sgtitle('Usporedba Shannonove i Samarzijine entropije generiranih stringova');
subplot(1,2,1);
plot(shannonove_entropije, samarzijine_entropije);
xlabel('Shannonova entropija');
ylabel('Samarzijina entropija');
subplot(1,2,2);
plot(shannonove_entropije);
hold on;
plot(samarzijine_entropije);
xlabel('Broj b-ova u stringu');
ylabel('Entropija (bit/simbol)');
legend('Shannonova entropija', 'Samarzijina entropija');
function ret = shannonova_entropija(str, suglasnici)
  apsolutne_frekvencije = [];
  for i = 1 : length(suglasnici)
    apsolutne_frekvencije = [apsolutne_frekvencije 0];
  end
  for i = 1 : length(str)
    znak = str(i);
    apsolutne_frekvencije(strfind(suglasnici, znak)) = apsolutne_frekvencije(strfind(suglasnici, znak)) + 1;
  end
  relativne_frekvencije = apsolutne_frekvencije / length(str);
  ret = 0;
  for relativna_frekvencija = relativne_frekvencije
    if relativna_frekvencija > 0
      ret = ret - log2(relativna_frekvencija) * relativna_frekvencija;
    end
  end
end
function ret = samarzijina_entropija(str)
  broj_pokusaja = 10000;
  broj_pogodaka = 0;
  for i = 1 : broj_pokusaja
    prvi = int32(floor(rand() * length(str) + 1));
    drugi = int32(floor(rand() * length(str) + 1));
    if str(prvi) == str(drugi)
      broj_pogodaka = broj_pogodaka + 1;
    end
  end
  omjer_pogodaka = broj_pogodaka / broj_pokusaja;
  ret = -log2(omjer_pogodaka);
end
Here is what it outputs:
The output of the MatLab program above.
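For readers who do not have MatLab, the comparison made by the program above can be sketched in NodeJS. This is a minimal sketch, not a port of the MatLab program: the function names and the test string are assumptions made for illustration, the closed-form collision entropy (the negative binary logarithm of the sum of the squares of the relative frequencies) is the one mentioned further down this page, and a small mulberry32-style seeded generator replaces Math.random() only to make the run reproducible.

```javascript
"use strict";

// Relative frequencies of the symbols in a string.
function relativeFrequencies(str) {
  const counts = new Map();
  const symbols = [...str];
  for (const symbol of symbols)
    counts.set(symbol, (counts.get(symbol) || 0) + 1);
  return [...counts.values()].map((count) => count / symbols.length);
}

// Shannon entropy in bits per symbol.
function shannonEntropy(str) {
  let entropy = 0;
  for (const p of relativeFrequencies(str)) entropy -= p * Math.log2(p);
  return entropy;
}

// Collision entropy in bits per symbol, from the closed-form expression.
function collisionEntropy(str) {
  let sumOfSquares = 0;
  for (const p of relativeFrequencies(str)) sumOfSquares += p ** 2;
  return -Math.log2(sumOfSquares);
}

// Small seeded pseudo-random generator, returning numbers in [0, 1).
function seededRandom(seed) {
  let state = seed >>> 0;
  return function () {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Collision entropy estimated straight from the definition: draw two symbols
// at random many times and count how often they are equal.
function monteCarloCollisionEntropy(str, trials, random) {
  const symbols = [...str];
  let hits = 0;
  for (let i = 0; i < trials; i++) {
    const first = symbols[Math.floor(random() * symbols.length)];
    const second = symbols[Math.floor(random() * symbols.length)];
    if (first === second) hits++;
  }
  return -Math.log2(hits / trials);
}

// Hypothetical test string: 60 b's followed by 40 other consonants.
const testString = "b".repeat(60) + "cdfghjklmn".repeat(4);
console.log("Shannon entropy:", shannonEntropy(testString));
console.log("Collision entropy (closed form):", collisionEntropy(testString));
console.log(
  "Collision entropy (Monte Carlo):",
  monteCarloCollisionEntropy(testString, 100000, seededRandom(1))
);
```

The Monte Carlo estimate should approach the closed-form value as the number of trials grows, and the collision entropy should never exceed the Shannon entropy of the same string.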
UPDATE on 14/04/2021: I have found out why the Octave program and the MatLab program give wildly different results. Namely, there was a syntax error in my program. MatLab refused to parse it, but Octave was apparently doing automatic semicolon insertion (like JavaScript engines are doing). That resulted in incorrect test strings in testni_stringovi. This is what the test strings look like in MatLab when exported to a CSV file, and this is what they look like when exported from Octave.
Anyway, since we know the number of possible consonant pairs in Croatian is 26*26=676, the maximal possible entropy a consonant pair in the Croatian language could have is log2(676)=9.4 bits/symbol. And we have measured the Shannon's entropy to be log2(229)=7.839. So, assuming the curve representing the relationship between the Samaržija's entropy and the Shannon's entropy does not change its shape between individual consonants and consonants pairs, but only scales uniformly (which I have no idea how to test), we can estimate the Samaržija's entropy of the consonant pairs in the Croatian language the following way. We can assume the entropy of the pairs of consonants is log2(676)/log2(21)=2.14 times bigger than the corresponding entropy of individual consonants. The ratio between the measured Shannon's entropy and the maximal possible entropy in this case is 7.839/9.4=0.834. Thus, the corresponding point on the curve on the above diagram is when the Shannon's entropy is equal to 0.834*log2(21)=3.663 bits/symbol. The Samaržija's entropy at that point, as can be read from the diagram, is around 2.8 bits/symbol. Thus, we can expect the Samaržija's entropy of the pairs of consonants in Croatian to be around 2.14*2.8=5.992 bits/symbol. Thus, the probability of two random words beginning with the same pair of consonants should be around 1/(2^5.992)=1/63.65=1.57%. If that is true, then the p-value of that pattern of the Croatian river names starting with *karr~kurr is only 5.9% (the highest estimate I got by running the birthday-paradox-calculation written in C a few times), rather than around 1/500. Well, I guess it is always like that in social sciences: If you think you have a good p-value, you are probably calculating something incorrectly.
Of course, whether that is a correct estimate for the p-value depends on where the entropy of the language goes. If it is mostly syntax and morphology that decrease the entropy of the language, then those decreases in entropy do not matter in toponyms borrowed from an ancient language. They matter only if they come from the phonology. See the paper I linked below for a lengthy discussion about that, including my attempts to estimate which parts of the grammar are responsible for how much of the decrease in entropy.
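The arithmetic behind the estimate of 5.992 bits per consonant pair can be retraced with a few lines of NodeJS. This is only a sketch of the calculation described above: the value 2.8 is the one read off the diagram, the variable names are made up for illustration, and the assumption that the curve merely scales uniformly remains as untested as it is in the text.

```javascript
"use strict";

// Maximal possible entropies: 26*26 consonant pairs in Croatian, 21 symbols
// in the test alphabet used by the MatLab program.
const maximalPairEntropy = Math.log2(26 * 26);
const maximalSingleEntropy = Math.log2(21);

// How many times bigger the entropy of pairs is assumed to be.
const scale = maximalPairEntropy / maximalSingleEntropy;

// Measured Shannon entropy of the consonant pairs, and its share of the maximum.
const measuredShannonOfPairs = Math.log2(229);
const shareOfMaximum = measuredShannonOfPairs / maximalPairEntropy;

// The corresponding point on the horizontal axis of the diagram.
const pointOnCurve = shareOfMaximum * maximalSingleEntropy;

// Value read off the diagram at that point (an input, not computed here).
const readFromDiagram = 2.8;

const estimatedPairCollisionEntropy = scale * readFromDiagram;
const probabilityOfSharedPair = 2 ** -estimatedPairCollisionEntropy;

console.log("Scale factor:", scale.toFixed(2));
console.log("Point on the curve:", pointOnCurve.toFixed(3));
console.log("Estimated entropy:", estimatedPairCollisionEntropy.toFixed(3));
console.log("Probability:", (probabilityOfSharedPair * 100).toFixed(2) + "%");
```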

UPDATE on 29/04/2021: You can see the draft of the next paper about linguistics I am planning to publish.

UPDATE on 14/09/2021: I have written a paper explaining what I think about the name Karašica, summarizing many of the things explained in the paper linked above. If you cannot open it, try opening this HTML file.

UPDATE on 06/10/2021: I asked a professional historical linguist, Dubravka Ivšić, via e-mail what she thinks about my text about the river name Karašica, and I am posting her answer here (translated from Croatian) because, like I have said, I am not a conspiracy theorist who wants people not to hear both sides of the story: Dear Teo,
thank you for your e-mail and for your interest in pre-Slavic toponymy.
Synchronically speaking, the name Karašica is Slavic, given that it is derived with the Slavic suffix -ica. The question is where the stem (karas- or karaš-) comes from, but that does not change the first fact (just as, for example, Jurica is a name derived with a Croatian suffix from a stem of Greek origin, which makes it a Croatian name). As far as I know, the hydronym Karašica was first recorded only in the 17th century; in Hungarian it is called Karassó. If you really want to respect scientific methodology, you would need to collect the historical attestations of the hydronym Karašica (from written sources and from old maps) and determine which form is the oldest. Given that the Danubian Karašica also flows through Hungary, for it one must also consider both that the name was borrowed into Hungarian from Croatian and the reverse, from Croatian into Hungarian. Also, the stem karaš- is productive in other toponyms as well (including outside Croatia), so one would also need to determine whether they are all related, i.e. whether it is the same onomastic stem.
Formally speaking, there is no obstacle to the hydronym Karašica being derived from the fish name karas or karaš (that name does not refer only to the goldfish), and the deeper origin of the fish name is in this case not relevant for the hydronym (similarly to how Krapina is most probably derived from the fish name krap).
As for the other rivers you mention whose names contain k-r: Krka could indeed be a pre-Slavic name, Korana is of uncertain origin, Krbavica is derived from Krbava, and Kravarščica is derived from Kravarsko (which is derived from kravar).
The Indo-European root you mention is reconstructed as *k(')ers- with the meaning 'to run', and there are opinions that the word for horse in the Germanic languages developed from it. The Indo-European word for horse is reconstructed as *h1ek'u-. The argument that you begin with "many Illyrian inscriptions begin with" completely misses the mark, given that there are no inscriptions written in the "Illyrian language".
Mathematical methods in linguistics can be useful in some cases, but they cannot replace classical linguistic methods. There are no shortcuts in historical toponymy.
Kind regards,
Dubravka Ivšić Majić
Anyway, what do you think, who is really being more scientific here? Is it me, who has attempted to measure the collision entropy of different parts of the Croatian grammar and has done numerical calculations showing the probability of that k-r pattern occurring by chance is somewhere between 1/300 and 1/17? Or is it her, who makes arguments from silence (that the name Karašica is unlikely to date back to antiquity because of its late first known attestation in the 17th century; that is also historically inaccurate, as the name Karašica is first mentioned in a document from the year 1228 together with a dubious piece of information that it used to be called Mogioros in antiquity; and even if it were true, it would be much like saying Marco Polo has not really been to China because he did not mention the Great Wall or tea), does some intricate theoretical reasoning overshadowing my experimental results (like the contemporary response to the Ignaz Semmelweis experiment showing that puerperal fever was caused almost exclusively by uncleanliness), and asserts that traditional methods are superior to mathematical methods?
By the way, the etymology she suggests is obviously problematic not only because of informatics, but also because of the earliest attestations: the earliest attestation of the river name Karašica is Karassou, without the Slavic suffix -ica. Clearly, it had a non-Slavic suffix back then, so no Slavic etymology is plausible.


UPDATE on 18/12/2021: I have made a LibreOffice presentation about my alternative interpretation of the names of places in Croatia.

UPDATE on 26/12/2021: I have written a short summary of the ideas presented in the presentation: To summarize, I think that I have thought of a way to measure the collision entropy of different parts of the grammar, and that it is possible to calculate the p-values of certain patterns in the names of places using them. The entropy of the syntax can obviously be measured by measuring the entropy of spell-checker word list such as that of Aspell and subtracting from that an entropy of a long text in the same language (I was measuring only for the consonants, I was ignoring the vowels, because vowels were not important for what I was trying to calculate). I got that, for example, the entropy of the syntax of the Croatian language is log2(14)-log2(13)=0.107 bits per symbol, that the entropy of the syntax of the English language is log2(13)-log2(11)=0.241 bits per symbol, and that the entropy of the syntax of the German language is log2(15)-log2(12)=0.3219 bits per symbol. It was rather surprising to me that the entropy of the syntax of the German language is larger than the entropy of the syntax of the English language, given that German syntax seems simpler (it uses morphology more than the English language does, somewhat simplifying the syntax), but you cannot argue with the hard data. It looks as though the collision entropy of the syntax and the complexity of the syntax of the same language are not strongly correlated. The entropy of the phonotactics of a language can, I guess, be measured by measuring the entropy of consonant pairs (with or without a vowel inside them) in a spell-checker wordlist, then measuring the entropy of single consonants in that same wordlist, and then subtracting the former from the latter multiplied by two. I measured that the entropy of phonotactics of the Croatian language is 2*log2(14)-5.992=1.623 bits per consonant pair. 
That 5.992 bits per consonant pair has been calculated using a mathematically dubious method involving the Shannon entropy. (Back then, I didn't know that there is a simple way to calculate the collision entropy as the negative binary logarithm of the sum of the squares of the relative frequencies of symbols, so I was measuring the collision entropy using the Monte Carlo method. The Shannon entropy is 7.839 bits per consonant pair, and the maximal possible entropy is log2(26*26) bits per consonant pair, so I suppose the collision entropy is around 5.992 bits per consonant pair.) Now, I have taken the entropy of the phonotactics to be the lower bound of the entropy of the phonology, which is the only entropy that matters in ancient toponyms (the entropies of the syntax and morphology do not matter then, because the toponym is created in a foreign language). Given that the Croatian language has 26 consonants, the upper bound of the entropy of morphology, which does not matter when dealing with ancient toponyms, can be estimated as log2(26*26)-1.623-2*0.107-5.992=1.572 bits per pair of consonants. So, to estimate the p-value of the pattern that many names of rivers in Croatia begin with the consonants 'k' and 'r' (Karašica, Krka, Korana, Krbavica, Krapina and Kravarščica), I have done some birthday calculations, the first by setting the simulated entropy of phonology to be 1.623 bits per consonant pair, and the second by setting the simulated entropy of phonology to be 1.623+1.572=3.195 bits per consonant pair (in other words, in the second birthday calculation, I assumed the entropy of morphology was 0). In both of those birthday calculations, I assumed that there are 100 different river names in Croatia. The former birthday calculation gave me the probability of that k-r pattern occurring by chance to be 1/300 and the latter gave me the probability 1/17. So the p-value of that k-r pattern is somewhere between 1/300 and 1/17.
Mainstream linguistics considers that k-r pattern in Croatian river names to be a coincidence, but nobody before me (as far as I know) has even attempted to calculate how much of a coincidence it would have to be (the p-value). So I concluded that the simplest explanation is that the river names Karašica, Krka, Korana, Krbavica, Krapina and Kravarščica are related and all come from the Indo-European root *kjers meaning horse (in Germanic languages) or to run (in Celtic and Italic languages). I think the Illyrian word for "flow" came from that root, and that the Illyrian word for "flow" was *karr or *kurr, the vowel difference 'a' to 'u' perhaps being dialectal variation (compare the attested Illyrian toponyms Mursa and Marsonia: those names almost certainly come from the same root, but there is a vowel difference 'a' to 'u' in them). Furthermore, based on the historical phonology of the Croatian language and what's known about the Illyrian language (for example, that there was a suffix -issia, as in Certissia, the ancient name for Đakovo, but not the suffix -ussia), I reconstructed the Illyrian name for Karašica as either *Kurrurrissia (borrowed into Proto-Slavic as *Kъrъrьsьja, which would give *Karrasja after Havlík's Law, and then *Karaša after the yotation and the loss of geminates, to which the Croatian suffix -ica was added) or *Kurrirrissia (borrowed into Proto-Slavic as *Kъrьrьsьja, which would also give *Karaša by regular sound changes), and the Illyrian name for Krapina as either *Karpona (borrowed into Proto-Slavic as *Korpyna, which would give "Krapina" after the merger of *y and *i and the metathesis of the liquids) or *Kurrippuppona (borrowed into Proto-Slavic as *Kъrьpъpyna, which would also give "Krapina" by regular sound changes), with a preference for *Karpona. Do those arguments sound compelling to you? Overall, I believe I've discovered three hard facts which will not be controversial:
  1. The collision entropy of the syntax of some language and the complexity of the syntax of that same language are not strongly correlated. For example, German has around 30% higher collision entropy of the syntax than English does, in spite of arguably having a simpler syntax. It is hard to imagine there are other factors hiding such a correlation, as German and English are closely related languages.
  2. Birthday Paradox does not remotely explain away that k-r pattern in the Croatian river names. A simple birthday calculation suggests that the probability of that k-r pattern occurring by chance is around 1/10'000.
  3. Birthday Paradox plus the collision entropy of phonology (some pairs of consonants being way more common than others) also does not appear to explain away that k-r pattern. Birthday calculations adjusted for the measured collision entropy of phonology of the Croatian language suggest that the probability of that k-r pattern occurring by chance is somewhere between 1/300 and 1/17.
I don't think those claims can reasonably be considered pure speculation. Of course, my paper contains plenty of speculation, including, but not limited to, this:
The reconstruction of the Illyrian name for Karašica.
It's customary to include such speculation (although I wouldn't call it pure speculation) in papers about the names of places.
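The bookkeeping of entropies in the summary above can be retraced in NodeJS. This is a sketch of the arithmetic only, using the measured values quoted in the summary as inputs; the variable names are made up for illustration and nothing here measures anything.

```javascript
"use strict";

// Inputs quoted in the summary (collision entropies, in bits).
const consonantsInLongText = Math.log2(13); // Croatian, long text
const consonantsInWordList = Math.log2(14); // Croatian, Aspell word-list
const consonantPairsInWordList = 5.992; // estimated, see above
const numberOfCroatianConsonants = 26;

// Entropy of the syntax: word-list entropy minus long-text entropy.
const entropyOfSyntax = consonantsInWordList - consonantsInLongText;

// Entropy of the phonotactics: twice the single-consonant entropy minus
// the entropy of consonant pairs.
const entropyOfPhonotactics =
  2 * consonantsInWordList - consonantPairsInWordList;

// Upper bound of the entropy of morphology: whatever is left of the maximal
// possible entropy of a consonant pair.
const upperBoundOfMorphology =
  Math.log2(numberOfCroatianConsonants ** 2) -
  entropyOfPhonotactics -
  2 * entropyOfSyntax -
  consonantPairsInWordList;

// The two simulated entropies of phonology used in the birthday calculations.
const lowerSimulatedEntropy = entropyOfPhonotactics;
const upperSimulatedEntropy = entropyOfPhonotactics + upperBoundOfMorphology;

console.log("Syntax:", entropyOfSyntax.toFixed(3));
console.log("Phonotactics:", entropyOfPhonotactics.toFixed(3));
console.log("Morphology (upper bound):", upperBoundOfMorphology.toFixed(3));
console.log("Simulated entropies:", lowerSimulatedEntropy.toFixed(3),
  "and", upperSimulatedEntropy.toFixed(3));
```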

UPDATE on 16/09/2021: The Etruscan letters are apparently flipped left-to-right on Android, I have started a Reddit thread about that.

UPDATE on 06/01/2022: A lot of the responses I get on the Internet forums when I share my ideas boil down to "You should not use mathematics in this part of linguistics.". Well, here is how I will respond to them (translated from Croatian): Go on blindly believing that statistics and computer science have nothing to say about Croatian toponyms. You know so little about computer science and statistics that you deny they are useful at all. That is the first stage of the Dunning-Kruger effect, when you deny that the skill is useful. Actually, there is a better description of what is happening with you: you are in the position of Darwin when he commented on Mendel's work: "Mathematics in biology is what a scalpel is in a carpenter's workshop, it has no business being there.". Today that sounds ridiculous. Actually, forget it, you are not even at that level, you are at the level of those who denied Indo-European linguistics because of their blind faith in the story of the Tower of Babel and their belief that linguistics has nothing to say about it. And I am sorry that in the 21st century there are people who think that way, as if the last few centuries of the development of science had taught them nothing. There is no real science without statistics. Whether or not my theories are correct, "You should not use mathematics in this part of linguistics." is a ridiculous argument and deserves such a response.
I think that partly what is happening is that the users of Internet forums about linguistics are reading a lot of Wikipedia and other tertiary sources of information, while reading almost no primary and secondary scientific sources. Wikipedia and other tertiary sources of information (etymological dictionaries...) almost never discuss p-values. So, no wonder that discussions about p-values seem alien to the forum users, even though they are the foundation of the modern scientific method. It is unfortunate.

UPDATE on 13/01/2022: Here is the table with the data about collision entropy of various languages, which I have measured for purposes of my experiment:
Language name | Collision entropy of consonants in a long text | The most common consonant in a long text | Collision entropy of consonants in the Aspell word-list | The most common consonant in the Aspell word-list | Collision entropy of the syntax
English | log2(11) | t | log2(13) | r | 0.241
German | log2(12) | n | log2(15) | n | 0.322
Croatian | log2(13) | n | log2(14) | n | 0.107
Italian | log2(12.5) | n | log2(15) | r | 0.263
French | log2(10) | s | log2(11) | s | 0.138
One interesting question I get by examining that data is "Does the deep orthography (such as English or French) decrease the collision entropy of a written language?". I have asked that question on a forum about linguistics.
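The last column of the table follows from the two entropy columns, which can be checked with a short NodeJS sketch. The numbers below are copied from the table; the array and property names are made up for illustration.

```javascript
"use strict";

// Rows of the table: the arguments of log2 in the two entropy columns and the
// collision entropy of the syntax as listed in the last column.
const measuredLanguages = [
  { name: "English", longText: 11, wordList: 13, listedSyntax: 0.241 },
  { name: "German", longText: 12, wordList: 15, listedSyntax: 0.322 },
  { name: "Croatian", longText: 13, wordList: 14, listedSyntax: 0.107 },
  { name: "Italian", longText: 12.5, wordList: 15, listedSyntax: 0.263 },
  { name: "French", longText: 10, wordList: 11, listedSyntax: 0.138 },
];

// Collision entropy of the syntax = entropy in the word-list minus entropy
// in a long text, both given as binary logarithms.
function syntaxEntropy(row) {
  return Math.log2(row.wordList) - Math.log2(row.longText);
}

for (const row of measuredLanguages)
  console.log(
    `${row.name}: computed ${syntaxEntropy(row).toFixed(3)}, ` +
      `listed ${row.listedSyntax}`
  );
```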

UPDATE on 21/03/2022: I have written a NodeJS program that does all the calculations described here automatically, with no need to copy results from one program into another:
"use strict";
let suglasnici = "bcčćdđfghjklmnpqrsštvwxyzž"; // NodeJS supports non-ASCII
// (Croatian...) characters in strings.
suglasnici += suglasnici.toUpperCase();
const datotecniSustav = require("fs");
const dugacakTekst = datotecniSustav.readFileSync("tekst.txt", {
  encoding: "utf-8",
  flag: "r"
});
// Collision entropy of the consonants in a long text.
let mapaSaSuglasnicima = new Map();
for (const znak of dugacakTekst)
  if (suglasnici.indexOf(znak.toLowerCase()) !== -1)
    mapaSaSuglasnicima.set(
      znak.toLowerCase(),
      (mapaSaSuglasnicima.get(znak.toLowerCase()) | 0) + 1
    );
let zbroj = 0;
for (const apsolutna_frekvencija of mapaSaSuglasnicima.values())
  zbroj += apsolutna_frekvencija;
let kolizijskaEntropijaSuglasnikaUDugackomTekstu = 0;
for (const apsolutna_frekvencija of mapaSaSuglasnicima.values())
  kolizijskaEntropijaSuglasnikaUDugackomTekstu +=
    (apsolutna_frekvencija / zbroj) ** 2;
kolizijskaEntropijaSuglasnikaUDugackomTekstu = -Math.log2(
  kolizijskaEntropijaSuglasnikaUDugackomTekstu
);
// Collision entropy of the consonants in the Aspell word-list.
const rjecnik = datotecniSustav.readFileSync("croatian.wl", {
  encoding: "utf-8",
  flag: "r"
});
mapaSaSuglasnicima = new Map();
for (const znak of rjecnik)
  if (suglasnici.indexOf(znak.toLowerCase()) !== -1)
    mapaSaSuglasnicima.set(
      znak.toLowerCase(),
      (mapaSaSuglasnicima.get(znak.toLowerCase()) | 0) + 1
    );
zbroj = 0;
for (const apsolutna_frekvencija of mapaSaSuglasnicima.values())
  zbroj += apsolutna_frekvencija;
let kolizijskaEntropijaSuglasnikaURjecniku = 0;
for (const apsolutna_frekvencija of mapaSaSuglasnicima.values())
  kolizijskaEntropijaSuglasnikaURjecniku +=
    (apsolutna_frekvencija / zbroj) ** 2;
kolizijskaEntropijaSuglasnikaURjecniku = -Math.log2(
  kolizijskaEntropijaSuglasnikaURjecniku
);
// Shannon entropy and collision entropy of the consonant pairs in the word-list.
let mapaSParovimaSuglasnika = new Map();
for (const prvi of suglasnici)
  for (const drugi of suglasnici)
    mapaSParovimaSuglasnika.set((prvi + drugi).toLowerCase(), 0);
let prethodni,
  sadasnji,
  brojac = 0;
for (const znak of rjecnik) {
  if (suglasnici.indexOf(znak) !== -1) {
    prethodni = sadasnji;
    sadasnji = znak.toLowerCase();
    if (prethodni !== undefined) {
      brojac++;
      mapaSParovimaSuglasnika.set(
        prethodni + sadasnji,
        mapaSParovimaSuglasnika.get(prethodni + sadasnji) + 1
      );
    }
  }
}
let shannonovaEntropijaParovaSuglasnika = 0,
  kolizijskaEntropijaParovaSuglasnika = 0;
for (const apsolutnaFrekvencija of mapaSParovimaSuglasnika.values())
  if (apsolutnaFrekvencija) {
    shannonovaEntropijaParovaSuglasnika -=
      (apsolutnaFrekvencija / brojac) *
      Math.log2(apsolutnaFrekvencija / brojac);
    kolizijskaEntropijaParovaSuglasnika += (apsolutnaFrekvencija / brojac) ** 2;
  }
kolizijskaEntropijaParovaSuglasnika = -Math.log2(
  kolizijskaEntropijaParovaSuglasnika
);
// Prints "Collision entropy of the consonants in a long text".
console.log(
  "Kolizijska entropija suglasnika u dugačkom tekstu: " +
    kolizijskaEntropijaSuglasnikaUDugackomTekstu +
    "=log2(" +
    2 ** kolizijskaEntropijaSuglasnikaUDugackomTekstu +
    ")"
);
// Prints "Collision entropy of the consonants in the dictionary".
console.log(
  "Kolizijska entropija suglasnika u rječniku: " +
    kolizijskaEntropijaSuglasnikaURjecniku +
    "=log2(" +
    2 ** kolizijskaEntropijaSuglasnikaURjecniku +
    ")"
);
// Prints "Collision entropy of the syntax".
console.log(
  "Kolizijska entropija sintakse: " +
    (kolizijskaEntropijaSuglasnikaURjecniku -
      kolizijskaEntropijaSuglasnikaUDugackomTekstu)
);
// Prints "Shannon entropy of the consonant pairs in the dictionary".
console.log(
  "Shannonova entropija parova suglasnika u rječniku: " +
    shannonovaEntropijaParovaSuglasnika
);
// Prints "Collision entropy of the consonant pairs in the dictionary".
console.log(
  "Kolizijska entropija parova suglasnika u rječniku: " +
    kolizijskaEntropijaParovaSuglasnika
);
// Prints "Collision entropy of the phonotactics".
console.log(
  "Kolizijska entropija fonotaktike: " +
    (2 * kolizijskaEntropijaSuglasnikaURjecniku -
      kolizijskaEntropijaParovaSuglasnika)
);
let iznad_koliko_kolizija_brojimo = 7, // As far as I know, that many rivers in
  // Croatia begin with k-r: Karašica (twice, one flows into the Drava and the
  // other into the Danube), Krka, Korana, Krbavica, Krapina and Kravarščica.
  koliko_ima_rijeka_u_Hrvatskoj = 100, // If somebody has an idea how to
  // estimate this more accurately, feel free to contact me.
  koliko_smo_puta_dobili_toliko_kolizija = 0,
  koliko_smo_puta_izvrtili_simulaciju = 1_000_000;
for (let brojac = 0; brojac < koliko_smo_puta_izvrtili_simulaciju; brojac++) {
  let koliko_rijeka_pocinje_na_taj_par_suglasnika = [];
  for (
    let brojac = 0;
    brojac <
    2 **
      (kolizijskaEntropijaParovaSuglasnika +
        2 *
          (kolizijskaEntropijaSuglasnikaURjecniku -
            kolizijskaEntropijaSuglasnikaUDugackomTekstu));
    brojac++
  )
    koliko_rijeka_pocinje_na_taj_par_suglasnika.push(0);
  for (let brojac = 0; brojac < koliko_ima_rijeka_u_Hrvatskoj; brojac++)
    koliko_rijeka_pocinje_na_taj_par_suglasnika[
      Math.floor(
        Math.random() *
          2 **
            (kolizijskaEntropijaParovaSuglasnika +
              2 *
                (kolizijskaEntropijaSuglasnikaURjecniku -
                  kolizijskaEntropijaSuglasnikaUDugackomTekstu))
      )
    ] += 1;
  let jesmo_li_nasli_potreban_broj_kolizija = false;
  for (
    let brojac = 0;
    brojac <
    2 **
      (kolizijskaEntropijaParovaSuglasnika +
        2 *
          (kolizijskaEntropijaSuglasnikaURjecniku -
            kolizijskaEntropijaSuglasnikaUDugackomTekstu));
    brojac++
  )
    if (
      koliko_rijeka_pocinje_na_taj_par_suglasnika[brojac] >=
      iznad_koliko_kolizija_brojimo
    ) {
      jesmo_li_nasli_potreban_broj_kolizija = true;
      break;
    }
  if (jesmo_li_nasli_potreban_broj_kolizija)
    koliko_smo_puta_dobili_toliko_kolizija += 1;
}
// Prints "The probability that X of Y hydronyms accidentally begin with the
// same pair of consonants is Z%."
console.log(
  `Vjerojatnost da ${iznad_koliko_kolizija_brojimo} od ${koliko_ima_rijeka_u_Hrvatskoj} hidronima slučajno počinje na isti par suglasnika iznosi ${
    (koliko_smo_puta_dobili_toliko_kolizija /
      koliko_smo_puta_izvrtili_simulaciju) *
    100
  }%.`
);
This time, to calculate the collision entropy, instead of using the complicated algorithm that follows right from the definition (choose two symbols from the string randomly, check whether they are equal, and repeat that many times), I used a much simpler algorithm described on Wikipedia. I must admit my understanding of the issue has improved drastically.

UPDATE on 23/03/2022: Here is how I responded to somebody comparing me to theologians who try to use mathematics to prove the existence of God:
I think that, if the ontological argument were good, mathematical logic would be an excellent tool for proving the existence of God. Unfortunately, the ontological argument rests on two premises that are, to say the least, very questionable:
  1. God exists in some possible worlds. In other words, the omnipotence paradox and other a-priori arguments against the existence of God are not valid.
  2. That which is perfect and exists in some possible worlds exists in all possible worlds. That is questionable because, for instance, a perfect circle seems to exist in some possible worlds, but not in ours.
With mathematical logic one can at most prove that there is a form of the ontological argument that is logically valid, but, because of those questionable premises, that does not tell us that God exists. That is, one can at most use mathematical logic to prove that Kant was wrong in claiming that the hidden premise of every form of the ontological argument is that existence is a logical predicate, but that does not remove the problem that those two premises are questionable.
All in all, the problem with no form of the ontological argument is that it uses mathematical logic.

UPDATE on 19/04/2022: I have written a script for my new YouTube video about toponyms.

UPDATE on 19/04/2022: I have published a YouTube video about my alternative interpretation of Croatian toponyms. If you cannot open it, try opening this MP4 video in VLC or a similar program.

UPDATE on 18/06/2022: My informatics professor Anđelko Lišnjić suggested that I make a table with the frequencies of consonant pairs in the Croatian language. So I did that!

UPDATE on 06/06/2023: What do people on the Internet think about my idea that the Croatian dialectism "regav" (full of cracks) and the Ancient Greek word "ῥαγή" (crack) are both loanwords from the Illyrian language? I have asked that question on Latin Language StackExchange and on Reddit.

UPDATE on 28/07/2023: The Reddit user called neuralbeans thinks that the central argument I presented in my latest paper about toponyms (you can read an English-language summary here) is flawed, because the nouns in the Croatian language may have a significantly lower collision entropy than all the words in the Aspell spell-checking dictionary taken together (and toponyms are nouns). If you ask me, that's an obvious ad-hoc hypothesis. Why would different word classes (nouns, verbs, adjectives...) in the Croatian language have different collision entropies? I can see why they would in Swahili, where, due to the noun classes, verbs can start with consonant pairs that nouns cannot, but I fail to see how that would be possible in Croatian (or English). And why would nouns have a lower collision entropy, rather than a higher one? It seems like a baseless ad-hoc hypothesis, doesn't it? And I don't think the burden of proof is on me to do some complicated experiment because of somebody's ad-hoc hypothesis. However, some people on other Internet forums think that it is a serious problem with my paper. So, I've started a question about that on forum.hr and on Linguistics StackExchange, to see whether somebody has researched that before me, or whether there is a way to test neuralbeans's hypothesis without spending days compiling a long list of nouns in the Croatian language.
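If a list of nouns were available, the comparison itself would be short. The following is only a sketch under stated assumptions: the set of consonant letters is a guess at what the `suglasnici` variable in the script earlier on this page contains (the digraphs lj, nj and dž are treated as two letters each), the function name is made up, and the two word lists are hypothetical toy data, far too small to say anything about the actual hypothesis.

```javascript
// Assumed set of Croatian consonant letters (not taken from the original script).
const CONSONANTS = "bcčćdđfghjklmnprsštvzž";

// Collision entropy of pairs of consecutive consonants in a text,
// skipping every character that is not a consonant (as the script above does).
function collisionEntropyOfConsonantPairs(text) {
  const counts = new Map();
  let previous, total = 0;
  for (const character of text.toLowerCase()) {
    if (!CONSONANTS.includes(character)) continue;
    if (previous !== undefined) {
      const pair = previous + character;
      counts.set(pair, (counts.get(pair) || 0) + 1);
      total++;
    }
    previous = character;
  }
  if (total === 0) return NaN; // no consonant pairs at all
  let sumOfSquares = 0;
  for (const count of counts.values()) sumOfSquares += (count / total) ** 2;
  return -Math.log2(sumOfSquares);
}

// Hypothetical toy lists; a real test would need a full dictionary and a
// part-of-speech-tagged noun list.
const allWords = ["rijeka", "teći", "krava", "brz", "grad", "pisati"];
const nounsOnly = ["rijeka", "krava", "grad"];
console.log(collisionEntropyOfConsonantPairs(allWords.join("\n")));
console.log(collisionEntropyOfConsonantPairs(nounsOnly.join("\n")));
```

With real data, a markedly lower value for the noun list than for the whole dictionary would support the objection, and similar values would count against it.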

UPDATE on 16/08/2023: I have tried to explain on Discord how my methodology is different from the methodology of mainstream onomastics (part of linguistics that deals with names), and why I think my methodology is better:
I think that most linguists, when they study place names, use a methodology that contradicts computer science. The fundamental principle of the methodology that most linguists use when they study place names is that etymologies from languages we know are more probable than etymologies from languages we do not know (such as Illyrian). I do not think that principle is true. I think a weightier principle is that the repetition of the same or similar elements in names for the same kind of thing (river, mountain, spring...) means that the place names come from the same language. Take, for example, the regularity that the first two consonants in the names of rivers in Croatia are often 'k' and 'r'. Most linguists who study place names in Croatia think that regularity is coincidental, but computer science says that the probability of that regularity arising by coincidence is between 1/300 and 1/17. That is a small probability.
The basic principle of the methodology of mainstream onomastics is that etymologies involving languages that we know a lot about are supposedly more probable than etymologies involving languages we know little about. Dubravka Ivšić's response to an earlier version of my paper illustrates that wonderfully. I don't think that principle is true, as I don't see why etymologies involving languages we know little about would be less probable. And assuming it yields explanations of the toponyms that are improbable according to information theory (such as that the k-r pattern in the river names is a coincidence). By "improbable" I mean that the probability can be calculated and that it turns out to be small.
That's not to say I fully understand the methodology of mainstream linguistics. What methodology one has to follow to conclude that the name "Issa" is certainly pre-Indo-European, without it even occurring to one that it may be the Illyrian word for spring, is truly beyond me. However, had they followed a good methodology, they would not be getting answers that appear to contradict information theory.
And proponents of mainstream linguistics say they strive to make their theories coherent with historical phonology, but to me it seems they are quite often inventing reasons why historical phonology doesn't apply. You want to support an etymology that lacks sound changes that are expected to have occurred? Claim that it is a recent borrowing! You know, like Dubravka Ivšić thought the river name Karašica is related to the Latin fish name carassius. So, why didn't the 'a' change to 'o', that is, why isn't Karašica today called something like *Koroša? Well, Dubravka Ivšić invented the reason that the fish name carassius was borrowed into Croatian recently and that the river name Karašica dates only to the 17th century. You want to support an etymology that contains completely unexpected sound changes? Invent some unattested language that supposedly had those sound changes! Like when Melich Janos proposed that Karašica comes from Turkic "kara sub" (black water). Why did the 'b' disappear and why did the 's' turn into 'š'? Melich Janos invented an unattested Turkic language with those sound changes, supposedly spoken in eastern Croatia. Doing those things is not being coherent with historical phonology; it is dodging around historical phonology. Historical phonology, if anything, suggests that the river name Karašica comes from something like *Kurrurrissia.

UPDATE on 13/05/2024: I've made another video about Croatian toponyms, about how they prove that Illyrian was a centum language. You can see it on YouTube (MP4).

UPDATE on 15/05/2024: I've opened a discussion about whether Albanian is descended from Illyrian on TextKit and r/latin:
Was Illyrian a "centum" or a "satem" language? Are Albanians native to the Balkans?

What do people on this forum think: was Illyrian a "centum" or a "satem" language? All Indo-European languages are divided into two groups: "centum" and "satem". In "centum" languages, the Indo-European phoneme 'kj' turns into 'k'. Latin is a "centum" language, and so are Greek and English. In English, 'kj' actually turns into 'h', but at some point, before Grimm's law, 'kj' turned into 'k' in English, and that is why English is a "centum" language. In "satem" languages, 'kj' turns into 's'. Examples of "satem" languages are Croatian, Albanian and Sanskrit. James Patrick Mallory wrote in the Encyclopedia of Indo-European Culture that he thinks that whether Illyrian was "centum" or "satem" cannot be known from the data we have. Most linguists in Croatia, and elsewhere in the Balkans, think that Illyrian was a "satem" language and also the ancestor of Albanian. But I think that Illyrian was a "centum" language. The day before yesterday, I published a YouTube video in Croatian about it.
https://youtu.be/4QQ2iJZnyUk
In that video, I give five arguments for the idea that Illyrian was a "centum" language. Those arguments are:
  1. The 'k'-'r' regularity in the names of rivers in Croatia. In many river names in Croatia, the first consonant is 'k' and the second consonant is 'r': Krka, Korana, Krapina, Krbavica, Kravarščica, and two rivers named Karašica. Most linguists think that regularity is coincidental, but I think that information theory (the Birthday Paradox and collision entropy) teaches us that the probability of that regularity appearing by coincidence is between 1/300 and 1/17. You have the calculation in my text "Etimologija Karašica", which I published in the almanac Valpovački Godišnjak in 2022. I think that the name "Karašica" comes from the Illyrian name Kurr-urr-issia, and that "kurr" meant "to flow" (probably from Indo-European *kjers, which meant "to run"), "urr" meant "water" (from Indo-European *weh1r), and "-issia" was a suffix in Illyrian, which is also found in the ancient name for Đakovo, "Certissia". In my view, the name "Kurrurrissia" passed from Illyrian into Proto-Slavic as *Kъrъrьsьja, which gave "Karrasj-">"Karaš-ica" (-ica is a Croatian suffix) in present-day Croatian. I also think that Krapina came from the Illyrian name Kar-p-ona, with "kar" from *kjers, "p" from *h2ep (water), and "ona" being a suffix in many Illyrian place names, among others "Salona" and "Albona". In my view, the name "Karpona" passed from Illyrian into Proto-Slavic as *Korpyna, which gave "Krapina" in present-day Croatian. And so on...
  2. If Illyrian was a "centum" language, "Curicum", the ancient name for Krk, can be read as "caurus, the north wind", from Indo-European *(s)kjeh1weros (whence the Latin word "caurus" comes), and Krk is the northernmost island in our sea.
  3. If Illyrian was a "centum" language, "Incerum", the ancient name for Požega, can be read as "heart of the valley", from the Indo-European roots *h1eyn (valley) and *kjer(d) (heart).
  4. If Illyrian was a "centum" language, "Cibelae", the ancient name for Vinkovci, can be read as "strong house" or "fortress", from the Indo-European roots *kjey (house) and *bel (strong).
  5. Many inscriptions in Illyrian begin with "klauhi zis", and that probably meant "May God hear...". "Klauhi" therefore probably comes from *kjlew (to hear), so *kj turns into *k in Illyrian.
Do those arguments sound compelling to you?

UPDATE on 28/05/2024: I've written another piece of rhetoric for defending my alternative interpretation of the Croatian toponyms:
Look, when you notice a contradiction between two fields of science (say, between information theory and onomastics), you should, of course, first assume that you have misunderstood something. You should consult an expert (and not assume that you can draw conclusions on your own about something you studied for one semester, as I studied information theory). But if the contradiction is not resolved (experts in information theory claim that my calculations are correct), you should assume that the harder science is right (in this case, of course, information theory), and that the softer science is wrong. Because that is how it usually turns out.
Fine, there are exceptions, such as the Young Sun Paradox. At the beginning of the 20th century, for example, geology (which is a relatively soft science) claimed that the Earth is at least millions of years old, while physics (which is a hard science) claimed that the Sun cannot be older than a hundred thousand years. Later it turned out, of course, that the physics of the time had completely misunderstood how the Sun works, because the Sun runs on nuclear reactions, about which the physics of the time knew nothing. Still, I don't think that anything similar to the Young Sun Paradox is happening here. Onomastics is nowhere near as hard as geology (and information theory is, if anything, harder than physics), and the arguments that onomastics gives seem incredibly weak.
The basic argument of Melich Janos's text claiming that Karašica comes from Turkic *kara-sub goes like this: "In the early Middle Ages, the recorded names for the Baranja Karašica were Mogyoros, Feketeviz and Karassou. Given that the name Feketeviz means black water, the name Karassou probably means the same. The name Karassou sounds somewhat similar to Proto-Turkic *kara-sub, so it must come from there. Here are the ad-hoc sound changes that I assume took place in the Turkic language that the name Karassou comes from." It is not clear to me how anyone can consider that argument convincing.

UPDATE on 31/10/2024: Anyway, I've decided to open a Reddit thread about what the advocates of mainstream onomastics mean when they say "The etymologies from the languages we know a lot about (Croatian, Latin, Celtic...) are more probable than the etymologies from languages we know little about (Illyrian...).". What is the mathematical basis for that principle? I don't see it. What I do see is that following that principle gives results which are incompatible with information theory. Following that principle gave the result that the k-r pattern in the Croatian river names is coincidental, but basic information theory strongly suggests that the p-value of that pattern is somewhere between 1/300 and 1/17. But maybe somebody can explain that principle mathematically.
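For readers who want to check a figure of this kind without random sampling, the generalized birthday problem can also be evaluated exactly. This is a minimal sketch, assuming (as the simulation earlier on this page does) that every river name falls into one of a fixed number of equally likely consonant-pair "bins". The function name and its parameters are made up for this illustration, the number of bins must be rounded to an integer here, and plain floating-point numbers limit it to at most about 170 items.

```javascript
// Probability that, when `items` names are spread uniformly at random over
// `bins` equally likely bins, at least one bin receives `threshold` or more.
function probabilityOfCollision(items, bins, threshold) {
  // perBin[j] = (1 / bins) ** j / j!, the weight of putting j items in one bin.
  const perBin = [];
  let term = 1;
  for (let j = 0; j < threshold; j++) {
    perBin.push(term);
    term /= bins * (j + 1);
  }
  // poly[s] = coefficient of x^s in (sum over j < threshold of perBin[j] x^j)
  // raised to the number of bins processed so far.
  let poly = new Array(items + 1).fill(0);
  poly[0] = 1;
  for (let b = 0; b < bins; b++) {
    const next = new Array(items + 1).fill(0);
    for (let s = 0; s <= items; s++) {
      if (poly[s] === 0) continue;
      for (let j = 0; j < threshold && s + j <= items; j++)
        next[s + j] += poly[s] * perBin[j];
    }
    poly = next;
  }
  // items! * poly[items] is the probability that every bin stays below threshold.
  let factorial = 1;
  for (let i = 2; i <= items; i++) factorial *= i;
  return 1 - factorial * poly[items];
}

// Sanity check with the classic birthday problem: 23 people, 365 days.
console.log(probabilityOfCollision(23, 365, 2)); // about 0.507
```

Calling it with 100 items, a threshold of 7 and the effective number of consonant pairs (rounded down and rounded up) should bracket what the simulation estimates, provided the uniformity assumption is accepted.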

UPDATE on 03/11/2024: A possible unexpected confirmation of my hypothesis that *karr~kurr was the Illyrian word for "to flow": A forum.hr user called petielement claims that "kurit" means "to flow" in some Dalmatian dialects.