y-Haplogroup I1 STR "Cluster" Analysis
Using data from a large sample of y-Haplogroup I1 people tested at FTDNA (see http://www.familytreedna.com/public/yDNA_I1/), together with some other data, I have done a mathematical "cluster analysis" to identify any clusters within the I1 y-Haplogroup based solely on STR marker values. The clusters that are found are then linked to the geographic origin of the most distant male-line ancestor reported by each cluster member.
Here are the results, which are presented in the form of a "Decision Tree":
The clusters that are identified do, in most cases, have a well defined geographic origin.
The relatively high mutation rate of some STR markers (compared to the slower mutation rate of SNPs) would in theory make clusters hard to identify, but luckily y-Haplogroup I1 is a young haplogroup, so it is possible to find clusters based solely on STR marker values. And, as shown, some of those clusters correlate nicely with a geographic region.
The word "cluster" is used here in the mathematical sense, relating to a clustering of points in the 67-dimensional space of the STR marker values. It is not necessarily the same as finding a geographic cluster, but the two sometimes coincide, which makes the approach useful for people wanting to know where their male-line ancestors might have come from.
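The clustering method actually used for these results is not described on this page. As a minimal sketch of what "clustering of points" in STR space can mean, the Python below treats each haplotype as a vector of repeat counts, measures distance as the sum of absolute differences between marker values, and groups samples by single linkage under a distance threshold. The function names, the threshold, and the 5-marker haplotypes are all invented for illustration; real haplotypes here would have 67 values each.

```python
from itertools import combinations


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences across markers."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))


def threshold_clusters(haplotypes, max_distance):
    """Single-linkage grouping: two samples share a cluster if a chain of
    pairs, each within max_distance of one another, connects them."""
    parent = {name: name for name in haplotypes}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for p, q in combinations(haplotypes, 2):
        if genetic_distance(haplotypes[p], haplotypes[q]) <= max_distance:
            parent[find(p)] = find(q)

    clusters = {}
    for name in haplotypes:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented 5-marker haplotypes, for illustration only.
samples = {
    "s1": [13, 22, 14, 10, 13],
    "s2": [13, 22, 14, 10, 14],
    "s3": [13, 23, 14, 10, 14],
    "s4": [14, 24, 15, 11, 11],
    "s5": [14, 24, 15, 11, 12],
}
clusters = threshold_clusters(samples, max_distance=1)
print(clusters)
```

With these toy values, s1, s2, and s3 end up in one group and s4 and s5 in another, because each group is linked by single-step differences while the two groups are several steps apart.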
To emphasise the inherited nature of the clusters, one could perhaps call them "clans" or some similar word rather than clusters. So, for example, "I1 STR Clan-BBA" might be a better name, with members of that "clan" tending to come from Norway/Sweden, as suggested in the computed plots.
Anyway, I hope you find the approach useful. The main idea here is to use a simple decision tree to quickly get an idea of one's geographic origin based solely on STR marker values for I1 people. The histograms provide better information for anyone who might have obtained an independent STR mutation that would otherwise confuse the decision tree.
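To illustrate how a decision tree of this kind produces codes such as "BBA", here is a minimal Python sketch. It assumes each node tests one marker against a threshold and appends "A" or "B" to the code. The marker names (MARKER_1 and so on) and the thresholds are placeholders invented for this example; they are not the rules of the actual decision tree.

```python
# Hypothetical tree: keys are the code built so far, values are
# (marker, threshold). Names and thresholds are placeholders only.
TREE = {
    "": ("MARKER_1", 12),
    "A": ("MARKER_2", 14),
    "B": ("MARKER_3", 10),
    "BB": ("MARKER_4", 22),
}


def classify(haplotype, tree=TREE):
    """Walk the tree, appending 'A' when the marker value is at or below
    the node's threshold and 'B' otherwise, until no further node exists."""
    code = ""
    while code in tree:
        marker, threshold = tree[code]
        code += "A" if haplotype[marker] <= threshold else "B"
    return code


# Invented marker values for one person.
person = {"MARKER_1": 13, "MARKER_2": 14, "MARKER_3": 11, "MARKER_4": 22}
branch = classify(person)
print(branch)  # BBA
```

A single independent mutation on a tested marker can push a sample across a threshold and send it down the wrong branch, which is why a second check against the full distribution of marker values is useful.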
Terry, December 2010
UPDATE1: y-Haplogroups I1 and R1b in European Countries, plus Ancient Migrations
Here are some new results:
Terry, February 2011
- y-Haplogroups I1 and R1b in European Countries, plus Ancient Migrations
- I1-"BBA" is the dominant I1 STR cluster/clan in Norway and Sweden. Nearly 50% of Norway and Sweden's I1 population is I1-"BBA". It is also prevalent in other countries such as the Netherlands/Belgium and Denmark. Many I1-"BBA" and virtually all I1-"BBB" people have the L22 SNP mutation for I1d. This is consistent with the L22 mutation originally happening in a "BB" individual before the split into "BBA" and "BBB".
- I1-"BAA" and I1-"BAB" (collectively called I1-"BA*") are the dominant I1 STR clusters/clans in Finland and account for about 75% of Finland's I1 population. Those two clusters/clans are associated with the L258 SNP mutation for I1d3a. Due to a back-mutation in DYS511 they could otherwise have been classified I1-"BBA". Within Finland, I1-"BAA" is more likely in western areas (and even across to Sweden/Norway), and I1-"BAB" is more likely in eastern areas. (Note that a DYS459a=7 cluster should be excluded from I1-"BA*".)
- I1-"AABB" is relatively high in Ireland, Scotland, England, and the Netherlands/Belgium. That cluster/clan is associated with the L338 SNP mutation for I1f. So far, 16 out of 16 known L338+ people would be classified as I1-"AABB". Up to 10% of I1 people may have that mutation.
- I1-"AABA" is particularly high in Wales - but the sample size is low. So far, 8 out of 19 Welsh-ancestry I1 people, or about 40%, would be classified as I1-"AABA", whereas the European-wide rate is less than 10%.
- I1-"AAA" is the dominant I1 STR cluster/clan in Europe as a whole. About 25% of European I1 people would be classified as I1-"AAA". There is a slight decrease in frequency from southern to northern Europe, due to the increasing frequency of I1-"BBA" in the north.
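The Welsh figure above (8 out of 19) rests on a small sample, so it helps to see how wide the uncertainty is. The Python sketch below computes a Wilson score interval for a proportion. This is a standard statistical formula chosen here for illustration; it is not necessarily how the percentages on this page were assessed.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.
    z=1.96 corresponds to roughly 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# 8 of 19 Welsh-ancestry I1 people classified as I1-"AABA".
low, high = wilson_interval(8, 19)
print(f"observed {8 / 19:.0%}, interval {low:.0%} to {high:.0%}")
```

For 8 out of 19 the interval runs from roughly 23% to 64%, which is wide, as expected for such a small sample.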
UPDATE4: y-Haplogroup I1 Dispersal/Expansion
The map below superimposes the present-day ranges of the various I1 STR clusters/clans. It is a guide to the possible I1 Dispersal/Expansion.
One should keep in mind that the range and distribution of all haplogroups in Europe have been complicated by the comparatively recent Migration of "Barbarians" (before about 500 AD) and the Migration of "Vikings" (around 800 AD to 1100 AD). The "Barbarians" were mainly Germanic tribes from east of the Rhine and north of the Danube, comprising the Goths (Visigoths and Ostrogoths), Vandals, Lombards, Burgundians, Franks, Suebi, and others, as well as the Angles, Saxons, and Jutes, plus the non-Germanic Huns from Central Asia.
Terry, August 2011
UPDATE6: y-Haplogroup I1 and Ancient European Migrations
To understand how y-Haplogroup I1 and its various clusters/clans were dispersed around Europe, one needs to understand the movements, dispersals/expansions, and migrations of people back many thousands of years. The link below summarises, in map form, the movements that influenced European people from 13,000 BC onwards to 1000 AD.
- y-Haplogroup I1 and Ancient European Migrations
Events covered include:
- Migration into Europe, 45,000 BC - 39,000 BC
- Expansion out of Refugia, 13,000 BC - 7,000 BC
- Expansion of Farming, 7,000 BC - 4,000 BC
- Expansion of Indo-Europeans, 4,000 BC - 1,000 BC
- Expansion of "Celts" and "Germanics", 1,000 BC - 250 BC
- Expansion of Roman Empire, 250 BC - 100 AD
- Migration of "Barbarians", 100 AD - 500 AD
- Migration of Slavs, 500 AD - 800 AD
- Migration of "Vikings", 800 AD - 1100 AD
Terry, September 2011
UPDATE7: y-Haplogroups I1 and I2 Tree (Preliminary Results)
Here is a tree, based on real STR haplotypes from over four thousand people in y-Haplogroups I1 and I2:
[See next update, which shows an improved tree layout after some correction to the code. Also more data included in the next version.]
Terry, November 2011
UPDATE8: y-Haplogroups I1 and I2 Tree Branch Codes
Enter an FTDNA Kit Number, or a Ysearch ID:
See y-Haplogroups I1 and I2 STR Branches for what you can do with that "Branch Code".
Terry, February 2012
UPDATE10: TMRCA of Y-Haplogroups - based on 1000 Genomes Project data
By simply counting the nucleotide differences between Y-chromosomes, the following tree can be constructed from the 1000 Genomes Project data:
With more data, there may exist people whose Y-chromosome branches off at an earlier time in the various haplogroups. Also be aware that only the count of nucleotide differences is used to construct the initial version of the tree (see SNP y-Haplogroup Tree Version 1). There are error bars (not shown) associated with all the branch split dates. Subject to those error bars, and applying some mild prior information (such as that branch K should enclose branches LT, N, O, and P), a more accurate tree can be constructed (see SNP y-Haplogroup Tree Version 2). It is remarkable that, without using any prior information, the simple count of nucleotide differences is essentially enough to create the tree, with only minor branch ordering errors that are within the corresponding branch split error bars.
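The counting idea can be sketched in a few lines of Python. The sketch assumes two aligned sequences of equal length, skips positions where either base is uncalled ("N"), and converts the count to a time estimate using the relation that two lineages accumulate mutations independently since their common ancestor, so the expected number of differences is 2 x rate x sites x time. The function names, the toy sequences, and the numbers in the example (including the mutation rate) are placeholders for illustration, not values taken from this analysis.

```python
def count_differences(seq_a, seq_b):
    """Count aligned positions where both bases are called and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != b and "N" not in (a, b)
    )


def tmrca_years(differences, sites_compared, rate_per_site_per_year):
    """Time to most recent common ancestor, in years.
    Differences build up along both lineages, hence the factor of 2."""
    return differences / (2 * rate_per_site_per_year * sites_compared)


# Toy aligned fragments; "N" marks an uncalled base.
toy_diff = count_differences("ACGTACGTNA", "ACGAACGTTA")
print(toy_diff)  # 1

# Illustrative numbers only: 200 differences over 10 million compared
# sites, with a placeholder rate of 1e-9 per site per year.
estimate = tmrca_years(200, 10_000_000, 1e-9)
print(round(estimate))  # 10000
```

Repeating the count for every pair of samples gives a distance matrix from which a tree can be built; the branch split dates then inherit the uncertainty of both the counts and the assumed mutation rate.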
- The 1000 Genomes Project Consortium, "A map of human genome variation from population-scale sequencing", Nature 467, 1061-1073 (2010).
- Xue et al., "Human Y Chromosome Base-Substitution Mutation Rate Measured by Direct Sequencing in a Deep-Rooting Pedigree", Current Biology 19, Issue 17, 1453-1457 (2009).
- Cruciani et al., "A Revised Root for the Human Y Chromosomal Phylogenetic Tree: The Origin of Patrilineal Diversity in Africa", American Journal of Human Genetics 88, 814-818 (2011).
- Karafet et al., "New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree", Genome Research (2008).
Terry, April 2012
UPDATE14: Haplogroup I1 SNP Geographic Maps and SNP Phylo Tree
The geographic distribution of the most distant known male-line ancestors for some of the I1 SNPs is shown in these maps:
And here is a simplified tree of some of the SNPs that have been found within Haplogroup I1.
Terry, January 2013
UPDATE15: Latest Haplogroup I1 SNP Phylo Tree including Geno 2.0 Data
Here is the latest tree with some of the SNPs that have been found within Haplogroup I1. The SNPs shown in red are the new tentative ones that have recently been found in some Geno 2.0 samples:
At this stage, the branch labelled CTS2208 could also include CTS5476 (that is, CTS2208, CTS5476). Similarly, Z133 could also include Z134; Z74 could also include Z75; CTS1679 could include CTS743; CTS9352 could include CTS9477; and CTS9875 could include F2711. Z138 includes Z139, and Z140 includes Z141, etc. Other novel, but apparently phyloequivalent, Geno 2.0 SNPs are not shown.
Terry, January 2013 through to June 2013
UPDATE17: Latest Haplogroup I2 SNP Phylo Tree including Geno 2.0 Data
Here is the latest tree with some of the SNPs that have been found within Haplogroup I2. The SNPs shown in red are the new tentative ones that have recently been found in some Geno 2.0 samples:
At this stage, the branch labelled F1295 could also include PF6950 (that is, F1295, PF6950). Similarly, PF3983 could also include PF4000; CTS616 could also include CTS9183; CTS7767 could include F2044 and PF6328; PF9894 could also include PF9897; CTS10057 could also include CTS10100; and L801 could also include Z183, etc. Other novel, but apparently phyloequivalent, Geno 2.0 SNPs are not shown.
Terry, March 2013 through to May 2013
UPDATE19: Haplogroup I1 Subgroup Frequency
Based on Geno 2.0 data, and data from the FTDNA Projects, the following are estimates for the percentage of people that are within various SNP subgroups of I1.
The estimates are based mostly on those people who have done the appropriate DNA test. So the estimates are biased towards the people who test such things.
Terry, June 2013
UPDATE20: TMRCA of Y-Haplogroups - based on Complete Genomics data
By simply counting the nucleotide differences between Y-chromosomes (restricted to the non-repetitive regions of the Y-chromosome, and ignoring any insertion, deletion, or multi-nucleotide mutations), the following tree can be constructed from the high-coverage, full-sequence Complete Genomics data:
See UPDATE10 above for the equivalent computed tree using the 1000 Genomes Project data. The computed dates are remarkably similar. The Complete Genomics data has higher coverage.
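The two restrictions described above (non-repetitive regions only, and single-nucleotide substitutions only) amount to a filter applied before counting. Here is a minimal Python sketch of such a filter, assuming each variant is a (position, reference, alternate) tuple and repetitive regions are given as inclusive (start, end) ranges. The function names, positions, and regions are invented for illustration and do not come from the Complete Genomics data.

```python
def is_single_nucleotide(ref, alt):
    """True for a one-base substitution; False for insertions,
    deletions, and multi-nucleotide changes."""
    return (
        len(ref) == 1 and len(alt) == 1
        and ref in "ACGT" and alt in "ACGT"
        and ref != alt
    )


def in_any_region(pos, regions):
    """True if pos falls inside any inclusive (start, end) range."""
    return any(start <= pos <= end for start, end in regions)


def countable_variants(variants, repetitive_regions):
    """Keep single-nucleotide substitutions outside repetitive regions."""
    return [
        (pos, ref, alt) for pos, ref, alt in variants
        if is_single_nucleotide(ref, alt)
        and not in_any_region(pos, repetitive_regions)
    ]


# Invented example data.
variants = [
    (1000, "A", "G"),    # kept
    (1500, "C", "CT"),   # insertion, dropped
    (2050, "G", "T"),    # inside a repetitive region, dropped
    (3000, "AT", "GC"),  # multi-nucleotide change, dropped
    (4000, "T", "C"),    # kept
]
repetitive_regions = [(2000, 2100)]

kept = countable_variants(variants, repetitive_regions)
print(len(kept))  # 2
```

Only the variants that survive the filter would then be counted as nucleotide differences between two Y-chromosomes.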
Terry, July 2013