y-Haplogroup I1 STR "Cluster" Analysis

With data based on a large sample of I1 y-Haplogroup people tested at FTDNA (see http://www.familytreedna.com/public/yDNA_I1/), and also using some other data, I have done a mathematical "cluster analysis" to determine any clusters within the I1 y-Haplogroup based solely on STR marker values. The clusters that are found are then linked to the geographic origin of the most distant male-line ancestor reported by each cluster member.

Here are the results, which are presented in the form of a "Decision Tree":

The clusters that are identified do, in most cases, have a well defined geographic origin.

The relatively high mutation rate of some STR markers (compared with the slower mutation rate of SNPs) would in theory make clusters harder to identify, but luckily y-Haplogroup I1 is a young haplogroup, so it is possible to find clusters based solely on STR marker values. And, as shown, some of those clusters correlate nicely with a geographic region.

The word "cluster" is used here in the mathematical sense, relating to a clustering of points in the 67-dimensional space of the STR marker values. It is not necessarily the same as finding a geographic cluster, but that sometimes happens, which makes the approach useful for people wanting to know where their male-line ancestors might have come from.
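To make the idea of clustering points in STR-marker space concrete, here is a minimal sketch. It is not the method used for the results on this page: it assumes a simple distance (the sum of absolute repeat-count differences) and a simple single-linkage grouping rule, and the kit names, the five-marker haplotypes, and the threshold are all made up for illustration.

```python
from itertools import combinations


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences across all markers."""
    return sum(abs(x - y) for x, y in zip(a, b))


def single_linkage_clusters(haplotypes, threshold):
    """Group haplotypes that are linked by chains of pairs within `threshold`."""
    names = list(haplotypes)
    parent = {n: n for n in names}  # union-find forest, one root per cluster

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if genetic_distance(haplotypes[a], haplotypes[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical haplotypes over 5 markers (a real analysis would use 67).
toy = {
    "kit_A": (13, 22, 14, 10, 13),
    "kit_B": (13, 22, 14, 10, 14),
    "kit_C": (13, 23, 14, 10, 14),
    "kit_D": (14, 23, 15, 11, 11),
    "kit_E": (14, 23, 15, 11, 12),
}

clusters = single_linkage_clusters(toy, threshold=1)
print(clusters)
```

With these toy values the first three kits chain together and the last two form a second group. A real analysis would also need to decide how to treat multi-copy markers and markers with very different mutation rates.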

To emphasise the inherited nature of the clusters, one could perhaps call them "clans" or some similar word rather than clusters. So, for example, "I1 STR Clan-BBA" might be a better name, with members of that "clan" tending to come from Norway/Sweden, as suggested in the computed plots.

Anyway, I hope you find the approach useful. The main idea here is to use a simple decision tree to quickly get an idea of one's geographic origin, based solely on STR marker values for I1 people. The histograms provide better information for anyone who might have obtained an independent STR mutation that would otherwise confuse the decision tree.
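As a rough illustration of how such a decision tree can be walked, here is a minimal sketch. The marker names, thresholds, and tree shape are hypothetical, and the convention of appending "A" or "B" at each split to build a code such as "BBA" is an assumption about how the cluster names might be formed, not something confirmed by the results above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    marker: Optional[str] = None   # marker tested here; None marks a leaf
    threshold: int = 0             # split point for the repeat count
    low: Optional["Node"] = None   # branch "A": value <= threshold
    high: Optional["Node"] = None  # branch "B": value > threshold


def classify(node, haplotype):
    """Walk the tree and return the branch letters taken, e.g. 'BBA'."""
    code = ""
    while node.marker is not None:
        if haplotype[node.marker] <= node.threshold:
            code += "A"
            node = node.low
        else:
            code += "B"
            node = node.high
    return code


# Hypothetical tree: every marker name and threshold is invented.
tree = Node(
    marker="marker_1", threshold=13,
    low=Node(),  # leaf reached with code "A"
    high=Node(
        marker="marker_2", threshold=22,
        low=Node(),  # leaf "BA"
        high=Node(marker="marker_3", threshold=10, low=Node(), high=Node()),
    ),
)

sample = {"marker_1": 14, "marker_2": 23, "marker_3": 10}
print(classify(tree, sample))
```

A single independent mutation on a marker near its threshold would send a haplotype down a different branch, which is why the histograms are the more robust guide.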

Terry, December 2010


UPDATE1: y-Haplogroups I1 and R1b in European Countries, plus Ancient Migrations

Here are some new results:

Terry, February 2011


UPDATE4: y-Haplogroup I1 Dispersal/Expansion

The map below superimposes the present day range of the various I1 STR Cluster/Clans. It is a guide to the possible I1 Dispersal/Expansion.
One should keep in mind that the range and distribution of all haplogroups in Europe have been complicated by the comparatively recent Migration of "Barbarians" (before about 500 AD) and the Migration of "Vikings" (around 800 AD to 1100 AD). The "Barbarians" were mainly Germanic tribes from east of the Rhine and north of the Danube, comprising the Goths (Visigoths and Ostrogoths), Vandals, Lombards, Burgundians, Franks, Suebi, and others, as well as the Angles, Saxons, and Jutes, plus the non-Germanic Huns from Central Asia.



Terry, August 2011


UPDATE6: y-Haplogroup I1 and Ancient European Migrations

To understand how y-Haplogroup I1 and its various clusters/clans were dispersed around Europe, one needs to understand the movements, dispersals/expansions, and migrations of people back many thousands of years. The link below summarises, in map form, the movements that influenced European people from 13,000 BC onwards to 1000 AD.
Terry, September 2011


UPDATE7: y-Haplogroups I1 and I2 Tree (Preliminary Results)

Here is a tree, based on real STR haplotypes from over four thousand people in y-Haplogroups I1 and I2:


[See the next update, which shows an improved tree layout after some corrections to the code. More data is also included in the next version.]

Terry, November 2011


UPDATE8: y-Haplogroups I1 and I2 Tree Branch Codes

Enter an FTDNA Kit Number, or a Ysearch ID:

See y-Haplogroups I1 and I2 STR Branches for what you can do with that "Branch Code".

Terry, February 2012


UPDATE10: TMRCA of Y-Haplogroups - based on 1000 Genomes Project data

By simply counting the nucleotide differences between Y-chromosomes, the following tree can be constructed from the 1000 Genomes Project data:


References:
  1. The 1000 Genomes Project Consortium, "A map of human genome variation from population-scale sequencing", Nature 467, 1061-1073 (2010).
  2. Xue et al., "Human Y Chromosome Base-Substitution Mutation Rate Measured by Direct Sequencing in a Deep-Rooting Pedigree", Current Biology 19, Issue 17, 1453-1457 (2009).
  3. Cruciani et al., "A Revised Root for the Human Y Chromosomal Phylogenetic Tree: The Origin of Patrilineal Diversity in Africa", American Journal of Human Genetics 88, 814-818 (2011).
  4. Karafet et al., "New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree", Genome Research (2008).

With more data, people may be found whose Y-chromosome branches off at an earlier time in the various haplogroups. Also be aware that only the count of nucleotide differences is used to construct the initial version of the tree (see SNP y-Haplogroup Tree Version 1). There are error bars (not shown) associated with all the branch split dates. Subject to those error bars, and applying some mild prior information (such as that branch K should enclose branches LT, N, O, and P), a more accurate tree can be constructed (see SNP y-Haplogroup Tree Version 2). It is remarkable that, without using any prior information, the simple count of nucleotide differences is essentially enough to create the tree, with only minor branch ordering errors that are within the corresponding branch split error bars.
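The counting idea can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes a constant mutation rate, so that two lineages separated for T years differ at about 2 × rate × sites × T positions, and the sequences, the number of sites compared, and the rate used below are placeholder values, not the ones used for the tree above.

```python
def count_differences(seq_a, seq_b):
    """Count positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def tmrca_years(differences, sites_compared, rate_per_site_per_year):
    """Estimate time to the most recent common ancestor.

    Both lineages accumulate mutations independently after the split,
    hence the factor of 2 in the denominator.
    """
    return differences / (2 * sites_compared * rate_per_site_per_year)


# Toy aligned fragments (made up); they differ at two positions.
d = count_differences("ACGTACGTAC", "ACGTTCGTAA")
print(d)

# Placeholder numbers only: 200 differences over one million compared
# sites, with an assumed rate of 1e-9 mutations per site per year.
estimate = tmrca_years(200, 1_000_000, 1e-9)
print(round(estimate))
```

Because the count of differences is a random quantity, any date obtained this way carries an error bar, which is why closely spaced branch splits can come out in the wrong order.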


Terry, April 2012


UPDATE14: Haplogroup I1 SNP Geographic Maps and SNP Phylo Tree

The geographic distribution of the most distant known male-line ancestors for some of the I1 SNPs is shown in these maps:



And here is a simplified tree of some of the SNPs that have been found within Haplogroup I1.

Terry, January 2013


UPDATE15: Latest Haplogroup I1 SNP Phylo Tree including Geno 2.0 Data

Here is the latest tree with some of the SNPs that have been found within Haplogroup I1. The SNPs shown in red are the new tentative ones that have recently been found in some Geno 2.0 samples:

At this stage, the CTS2208 branch could include CTS2208, CTS5476. Similarly, Z133 could include Z133, Z134. Z74 could include Z74, Z75. CTS1679 could include CTS1679, CTS743. CTS9352 could include CTS9352, CTS9477. CTS9875 could include CTS9875, F2711. Z138 includes Z138, Z139, and Z140 includes Z140, Z141, etc. Other novel, but apparently phyloequivalent, Geno 2.0 SNPs are not shown.

Terry, January 2013 through to June 2013


UPDATE17: Latest Haplogroup I2 SNP Phylo Tree including Geno 2.0 Data

Here is the latest tree with some of the SNPs that have been found within Haplogroup I2. The SNPs shown in red are the new tentative ones that have recently been found in some Geno 2.0 samples:

At this stage, the F1295 branch could include F1295, PF6950. Similarly, PF3983 could include PF3983, PF4000. CTS616 could include CTS616, CTS9183. CTS7767 could include CTS7767, F2044, PF6328. PF9894 could include PF9894, PF9897. CTS10057 could include CTS10057, CTS10100. L801 could include L801, Z183, etc. Other novel, but apparently phyloequivalent, Geno 2.0 SNPs are not shown.

Terry, March 2013 through to May 2013


UPDATE19: Haplogroup I1 Subgroup Frequency

Based on Geno 2.0 data, and data from the FTDNA Projects, the following are estimates for the percentage of people that are within various SNP subgroups of I1.



The estimates are based mostly on those people who have done the appropriate DNA test. So the estimates are biased towards the people who test such things.

Terry, June 2013


UPDATE20: TMRCA of Y-Haplogroups - based on Complete Genomics data

By simply counting the nucleotide differences between Y-chromosomes (restricted to the non-repetitive regions of the y-chromosome, and ignoring any insertion, deletion, or multi-nucleotide mutations), the following tree can be constructed from the high-coverage full sequence Complete Genomics data:


See UPDATE10 above for the equivalent computed tree using the 1000 Genomes Project data. The computed dates are remarkably similar. The Complete Genomics data has higher sequencing coverage.
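The filtering step described in this update (keep only single-nucleotide substitutions outside repetitive regions) might look something like the sketch below. The record layout, the positions, and the excluded intervals are all hypothetical, and this is not the code used to produce the tree.

```python
BASES = {"A", "C", "G", "T"}


def is_single_nucleotide_substitution(ref, alt):
    """True only for one-base-to-one-base changes (no indels, no MNPs)."""
    return len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES and ref != alt


def in_excluded_region(pos, excluded):
    """True if `pos` falls inside any (start, end) interval, ends inclusive."""
    return any(start <= pos <= end for start, end in excluded)


def countable_variants(variants, excluded):
    """Keep variants that are simple substitutions outside excluded regions."""
    return [
        v for v in variants
        if is_single_nucleotide_substitution(v["ref"], v["alt"])
        and not in_excluded_region(v["pos"], excluded)
    ]


# Hypothetical records and a hypothetical repetitive interval.
variants = [
    {"pos": 100, "ref": "A", "alt": "G"},    # kept
    {"pos": 250, "ref": "C", "alt": "T"},    # dropped: inside excluded region
    {"pos": 400, "ref": "AT", "alt": "A"},   # dropped: deletion
    {"pos": 500, "ref": "G", "alt": "GA"},   # dropped: insertion
    {"pos": 600, "ref": "AC", "alt": "GT"},  # dropped: multi-nucleotide change
    {"pos": 700, "ref": "T", "alt": "C"},    # kept
]
excluded_regions = [(200, 300)]

kept = countable_variants(variants, excluded_regions)
print([v["pos"] for v in kept])
```

Only the surviving substitutions would then feed into the difference counts between pairs of Y-chromosomes.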

Terry, July 2013


