About NevGen haplogroup predictor
The first version of the NevGen predictor was made in May 2014. At that time it was available only as a desktop application (its existence was known only to a small number of people in Serbia) and contained only part of the subclades available today. Gradually, over the next two years, many new subclades were added, many existing subclades were divided into deeper subclades, and several changes were made to the underlying mathematics. Finally, at the end of 2015, the internet edition, called “www.nevgen.org”, was made; its functionality is a somewhat reduced subset of that of the desktop edition.
The inspiration for the NevGen Y DNA Predictor came at that time from the excellent and legendary Athey’s Haplogroup Predictor, well known in the genetic genealogy community and probably the most famous predictor of its kind. The calculation of fitness in the first version of NevGen was done in the same way Athey described the underlying mathematics of his predictor here and here. Like Athey’s work, NevGen uses the Bayesian-Allele-Frequency Approach to predict which haplogroup a Y STR haplotype belongs to.
One of the differences between how NevGen and Athey’s predictor work (according to Athey’s own description; I have never seen his code) is the use of correlation (interdependence) between the values of different STRs when calculating the probabilities of subclades. Athey did mention the possibility of using correlations of STR values to improve results, but I believe he never implemented it.
In NevGen it is implemented. NevGen uses correlations between the values of pairs of STRs. In the future this might be extended to triples or larger groups of STRs, but for now only correlations of pairs are used. NevGen uses both negative and positive correlations, but negative ones have priority over positive ones, because they are more resilient to deviations resulting from unbalanced haplotype sets representing subclades (which is mostly the case with the available data).
I shall not describe here the Bayesian-Allele-Frequency Approach or the calculation of fitness in the earlier version, because both are well covered in Athey’s description of his predictor. I shall, however, describe correlations between pairs of STR markers, and their use in calculating probabilities, through one example.
In haplogroup J1a2a1a2 P58>FGC11, let us examine two markers and their alleles, DYS 460 = 11.0 and DYS 487 = 13.0. DYS 460 is present in 599 of my FGC11 samples, and in 221 of them it has the value 11, so its frequency is 221/599 = 0.369. DYS 487 is also present in 599 samples, and in 338 of them it has the value 13, so its frequency is 338/599 = 0.564. If we assume that the markers are independent in HG J1>FGC11, we should expect the two values DYS 460 = 11.0 and DYS 487 = 13.0 to occur together in (221/599) * (338/599) * 599 = 124.705 haplotypes. That is not the case: both values occur together in only 16 haplotypes. So here we have a negative correlation between the two values, which decreases the probability that a haplotype with DYS 460 = 11.0 and DYS 487 = 13.0 belongs to HG J1>FGC11. When the NevGen predictor gets a haplotype with those two values, then in assessing whether it belongs to HG J1>FGC11 it will use the observed frequency of both values occurring together in J1>FGC11 samples (16/599), rather than the simple product of the two allele frequencies (221/599 * 338/599).
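The arithmetic of this example can be checked with a few lines of Python. This is only an illustration built from the counts quoted above, not NevGen's code; the variable names are made up for the sketch.

```python
# Counts quoted in the text for J1>FGC11.
n_samples = 599      # haplotypes in which both markers are present
n_dys460_11 = 221    # haplotypes with DYS 460 = 11
n_dys487_13 = 338    # haplotypes with DYS 487 = 13
n_both = 16          # haplotypes with both values together

freq_460 = n_dys460_11 / n_samples
freq_487 = n_dys487_13 / n_samples

# Joint frequency and count expected if the two markers were independent.
expected_joint = freq_460 * freq_487
expected_count = expected_joint * n_samples

# Joint frequency actually observed in the sample set.
observed_joint = n_both / n_samples

# A ratio below 1 indicates negative correlation, above 1 positive.
ratio = observed_joint / expected_joint

print(f"expected count under independence: {expected_count:.3f}")
print(f"observed count: {n_both}")
print(f"observed / expected: {ratio:.3f}")
```

The observed joint frequency is roughly an eighth of what independence would predict, which is why the pair is treated as negatively correlated.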
The calculation under assumed STR marker independence is straightforward and can be done in only one way: it is a simple multiplication of the frequencies of every STR value of the given haplotype in the given subclade (or, in practice, a summation of their natural logarithms). With correlations between the values of STR markers it is not that simple. Since the same value of any STR (for example DYS 460 = 11.0) can be correlated differently with many values of many other STR markers, the predictor must decide which correlations to use and which not. The NevGen predictor, for now, uses at most two of them for every STR marker (this might change in the future), so it must somehow sort them in order of “goodness”. As I have written before, it favours negative correlations over positive ones. Within each group it has rather complicated rules for assessing which correlation is better, figured out after a lot of testing; these rules might change in the future if we find logic that scores better at guessing the haplogroup of samples whose subclade is already known.
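The difference between the two calculations can be sketched in Python. This is a minimal illustration, not NevGen's implementation: the function names are hypothetical, only one correlated pair is substituted, the DYS393 frequency is invented, and the rules NevGen uses to choose which correlations to apply are not modelled at all. All frequencies are assumed to be non-zero.

```python
import math

# Allele frequencies of one subclade. The DYS460 and DYS487 values are the
# ones quoted in the text; the DYS393 entry is made up for illustration.
allele_freq = {
    "DYS460": {11: 221 / 599},
    "DYS487": {13: 338 / 599},
    "DYS393": {12: 0.80},  # hypothetical value
}
haplotype = {"DYS460": 11, "DYS487": 13, "DYS393": 12}


def log_fitness_independent(haplotype, allele_freq):
    """Sum of natural logs of the per-marker allele frequencies."""
    return sum(math.log(allele_freq[m][v]) for m, v in haplotype.items())


def log_fitness_with_pair(haplotype, allele_freq, pair, joint_freq):
    """Like the independent score, but the two markers in `pair` are scored
    once, by the observed frequency of their values occurring together."""
    rest = sum(
        math.log(allele_freq[m][v])
        for m, v in haplotype.items()
        if m not in pair
    )
    return rest + math.log(joint_freq)


independent = log_fitness_independent(haplotype, allele_freq)
correlated = log_fitness_with_pair(
    haplotype, allele_freq, ("DYS460", "DYS487"), 16 / 599
)
print(independent, correlated)
```

Because the observed joint frequency (16/599) is much lower than the product of the two single frequencies, the score that uses the pair is lower, which is the penalty the text describes.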
In many cases correlation of values is not needed. We do not need it to distinguish an R1b haplotype from an E1b1b-V13 or a J2b2-M241 haplotype, even on the first six markers. In fact, an experienced genetic genealogist does not even need a predictor for that; it is obvious to him on sight. But in some cases the use of correlations can greatly improve the predictor's chances of making a good prediction. In my experience, on 17 markers it can sometimes be hard to distinguish some I1 haplotypes from some G2a haplotypes, or some C2 haplotypes from some I2a1b Isles haplotypes. In such cases, correlations may help.
During the preparation of the R1b level in NevGen, which was released on April 29th, 2016, in-depth tests of its accuracy were conducted on different numbers of markers (12, 25, 37, 67 and 111), both with and without the use of correlations. Here I shall explain how the testing was done and what its results were.
First, the R1b level was divided into 168 subclades (unfortunately, the list is not complete, because many subclades have too few haplotypes to be present in the NevGen predictor, and many haplotypes are marked with *, meaning they do not fit into any known subclade of their immediate parent clade). The sample haplotypes used for testing the accuracy of NevGen are derived from all available sample haplotypes used to create the R1b level; their number ranges from 5766 with at least 12 markers to 3263 with 111 markers. From the original known haplotypes, the samples used for testing were made by artificially and randomly mutating them through 40 generations (as happens in nature), using Marko Heinila’s STR mutation rates. For every original haplotype with a known subclade (confirmed by SNP), 10 such random, independent, 40-generation-deep descendants were made, and these were used to test NevGen’s accuracy. One reason for using artificial descendants is to make the predictor's job harder by moving the test haplotypes away from the original haplotypes from which its statistics were built. In this way the test results are more realistic for haplotypes that were not part of the original set of sample haplotypes. To count as a correct hit, the predictor has to give each test haplotype at least 80% probability for its already known subclade. Here are the results for every number of markers used in testing, both with and without the use of correlations:
111 markers – 97.94% (92.87% without use of correlations), on 3263*10 artificial 40-gen-deep samples.
67 markers – 89.19% (77.22% without use of correlations), on 5109*10 artificial 40-gen-deep samples.
37 markers – 72.46% (53.50% without use of correlations), on 5669*10 artificial 40-gen-deep samples.
25 markers – 48.85% (31.04% without use of correlations), on 5690*10 artificial 40-gen-deep samples.
12 markers – 18.76% (11.83% without use of correlations), on 5766*10 artificial 40-gen-deep samples.
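The way such artificial descendants could be generated, and how a correct hit is counted, can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure NevGen actually ran: it assumes a simple single-step model in which each marker changes by one repeat with a fixed per-generation probability, the mutation rates below are placeholders rather than Marko Heinila's published values, and all function names are hypothetical.

```python
import random


def mutate_descendant(haplotype, rates, generations=40, rng=random):
    """Return one artificial descendant after the given number of generations.

    Assumed model: in each generation every marker changes by +1 or -1
    repeat with probability equal to its mutation rate.
    """
    descendant = dict(haplotype)
    for _ in range(generations):
        for marker, rate in rates.items():
            if rng.random() < rate:
                descendant[marker] += rng.choice((-1, 1))
    return descendant


def make_test_set(haplotype, rates, copies=10, generations=40, seed=0):
    """Make `copies` independent descendants of one original haplotype."""
    rng = random.Random(seed)
    return [
        mutate_descendant(haplotype, rates, generations, rng)
        for _ in range(copies)
    ]


def is_correct_hit(probabilities, known_subclade, threshold=0.80):
    """A prediction counts only if the known subclade gets >= threshold."""
    return probabilities.get(known_subclade, 0.0) >= threshold


# Placeholder haplotype and rates, for illustration only.
original = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
rates = {"DYS393": 0.001, "DYS390": 0.003, "DYS19": 0.002}

for descendant in make_test_set(original, rates):
    print(descendant)
```

Accuracy for a given number of markers would then be the share of all test descendants for which `is_correct_hit` is true.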
These results show that the use of correlations considerably increases the accuracy of the NevGen predictor. Another reason for using 40-generation-deep descendants in testing is to assess how the predictor will behave for haplotypes whose MRCA with some of the samples used for the predictor’s statistics lies up to 40/2 = 20 generations back, which is up to 20 * 30 years = 600 years to the TMRCA. Such descendants are an approximate substitute for haplotypes that are 20 generations from the MRCA they share with existing original samples, except that they are more distant from the ancestral haplotype of the whole subclade.
When the original haplotypes used for NevGen’s statistics are themselves used for testing, accuracy is greater (for example, on 67 markers it is 97.46%, compared to 89.19%), but such results are neither relevant nor realistic for unknown haplotypes, since the predictor is biased towards each sample's original subclade.
NOTE on the previous statistics: the testing was done in April 2016, before the introduction of the probability of unsupported subclades into the calculation. Since that change, in many cases some probability goes to “unsupported subclades”, which decreases the probabilities that go to the original subclades. Statistics that include unsupported subclades are not displayed here, because that part of the calculation has to be tuned more finely in the coming period. For now, introducing the probability of unsupported subclades has only a very marginal impact on the accuracy of the NevGen predictor for haplotypes whose subclade is already known; for example, on 111 markers it now scores 97.71% (compared to 97.94% earlier). After the fine-tuning is done, the impact is expected to be even smaller.
To create the distribution statistics of alleles for different STR markers and subclades, NevGen uses Marko Heinila’s mutation rates (you can see the smoothed statistics in the picture NevGen generates for an entered haplotype). They are used to ‘smooth’ the frequency distributions of alleles on STR markers, making them more realistic, which improves the accuracy of the predictor. This is especially necessary for subclades with a small number of available sample haplotypes. Without smoothing, the predictor would be considerably biased towards subclades with more samples and against subclades with fewer samples. With distribution smoothing this bias is considerably decreased. Smoothing is the reason why the generated charts can show some percentage for alleles that do not occur at all in the available sample haplotype data (an experienced genetic genealogist might notice that).
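The text does not give the smoothing formula, so the following Python sketch only illustrates the general idea under an assumed scheme: each observed allele keeps most of its weight and passes a small share to its two neighbouring alleles. The `spread` parameter stands in for whatever NevGen derives from the marker's mutation rate; it, the function name and the counts are all hypothetical.

```python
def smooth_distribution(counts, spread):
    """Spread observed allele counts to neighbouring alleles and normalise.

    counts: {allele value: number of sample haplotypes}
    spread: share of each allele's weight passed on to its two neighbours
            (half to allele - 1, half to allele + 1); an assumed parameter.
    """
    smoothed = {}
    for allele, n in counts.items():
        smoothed[allele] = smoothed.get(allele, 0.0) + n * (1 - spread)
        for neighbour in (allele - 1, allele + 1):
            smoothed[neighbour] = smoothed.get(neighbour, 0.0) + n * spread / 2
    total = sum(smoothed.values())
    return {a: w / total for a, w in sorted(smoothed.items())}


# A small, made-up subclade with only four sample haplotypes on one marker.
observed = {11: 3, 12: 1}
print(smooth_distribution(observed, spread=0.2))
```

In this sketch alleles 10 and 13 receive a small non-zero frequency although no sample carries them, which matches the effect described above for the generated charts.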
The haplotype data used for the NevGen predictor comes from different public FTDNA projects, as they are the best and most complete source that could be found on the internet. Only ‘green’ haplotypes (those confirmed with SNPs) are used for the predictor’s statistics. ‘Reds’ (unconfirmed) are ignored. Not even all of the ‘green’ haplotypes are used: those that are not SNP-tested deeply enough are also left out, for example haplotypes for the R1b level with only M269 or L21 confirmed.
For some typically Asian haplogroups like H and D, the available data is scarce and/or not classified deeply enough into subclades. The same goes for A and B and for the subbranches of E typical only of Sub-Saharan Africa. Because of that, the NevGen predictor is not very usable for them, or at best not as good at predicting them as it is for typically European or Near Eastern haplogroups. NevGen has no means of predicting haplogroups M and S, because no usable haplotype data for them is available. For haplogroups C and O the available data is a bit better than for H and D, but its quality is still not comparable to that of haplogroups from Western Eurasia. Our advice is not to use NevGen for haplotypes (especially short ones) originating far from Europe, for example from Sub-Saharan Africa, East Asia, Oceania or the Indian subcontinent.