Monday 22 January 2018

Small segments and pile-ups - a visualisation

We've recently been discussing the problem of pile-ups in the All Genetic Genealogy group on Facebook. A pile-up is a term used in genetic genealogy to describe multiple shared autosomal DNA segments that are stacked up on top of each other on the same part of the genome. The presence of a pile-up should be considered as a warning sign. For any shared segment to have genealogical significance we would expect it to be shared only with descendants of the common ancestral couple. If we share a segment with hundreds or thousands of people it is extremely unlikely that we will share that section of DNA by virtue of a recent genealogical relationship within the last ten generations or so, and it is much more likely to be indicative of a false match or a more distant relationship.

Pile-ups can occur for a number of different reasons:
  • Lack of phasing. Phasing is the process of sorting the DNA letters (the As, Cs, Ts and Gs) onto the paternal and maternal chromosomes. AncestryDNA and MyHeritage now use phased matching which means that they phase our genotypes before trying to identify shared sections of DNA. 23andMe and Family Tree DNA use a process of half-identical matching. Our DNA is not phased but instead the algorithms zigzag backwards and forwards across two columns of unsorted DNA letters looking for consecutive runs of matching SNPs. Half-identical matching works well at identifying large shared segments of DNA but is less successful on smaller segments, and particularly segments under about 10 centiMorgans (cMs) in size. If a match does not survive phasing it is a false match.
  • SNP-poor regions. The autosomal DNA tests used for genetic genealogy provide information on between 630,000 and 700,000 genetic markers known as SNPs (single nucleotide polymorphisms) which are scattered across the genome. These SNPs are only a tiny fraction of the three billion letters which make up the human genome, but the SNPs are specially selected for being the most informative about variations within and between populations. When trying to identify shared regions of the genome the companies are looking for long runs of consecutive SNPs that are the same (identical by state or IBS) in two individuals. Segments which pass the companies' matching thresholds are declared to be identical by descent (IBD) and are possibly indicative of shared ancestry in a genealogical timeframe. Some companies will also apply additional algorithms to filter out known problematic regions which are unlikely to be IBD. However, because not all of our SNPs are being tested, the length of a segment can be falsely inflated. One hypothesis is that lots of small segments can become conflated into longer segments. (1) This problem is particularly likely to occur in sections of the genome which have poor coverage on the chips. (2) 
  • Excess IBD. This is a term used to describe sections of the genome which are known to be widely shared in humans or in certain populations. Such regions often offer some type of evolutionary advantage. For an overview of known excess IBD regions see the section on excess IBD sharing in the ISOGG Wiki article on IBD. In addition to looking at the size of a shared segment, some IBD detection algorithms will, therefore, also take into account the frequency of the segment. (3) The more people who share a segment, the older it is likely to be. AncestryDNA apply their proprietary Timber algorithm to phased segments and they downweight the cM count for segments that are widely shared in their database. (4)
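
To make the half-identical matching described in the first bullet point more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not any company's actual algorithm: the genotypes are invented toy data, the function names are hypothetical, and the run is measured in SNPs rather than centiMorgans, with no allowance for genotyping errors.

```python
def half_identical(genotype_a, genotype_b):
    """Return True if two unphased genotypes share at least one allele."""
    return bool(set(genotype_a) & set(genotype_b))


def longest_half_identical_run(person_a, person_b):
    """Return (start_index, length) of the longest run of consecutive
    SNPs at which the two people are half-identical."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, (ga, gb) in enumerate(zip(person_a, person_b)):
        if half_identical(ga, gb):
            if run_len == 0:
                run_start = i  # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # no allele in common, so the run is broken
    return best_start, best_len


# Toy data: each string is one SNP's unphased genotype (two letters).
person_a = ["AG", "CC", "AT", "GG", "CT", "AA"]
person_b = ["AA", "CT", "TT", "CC", "TT", "AG"]

start, length = longest_half_identical_run(person_a, person_b)
print(start, length)
```

Because the check only asks whether the two people have a letter in common at each SNP, the matching letters can come from the maternal chromosome at one position and the paternal chromosome at the next. That is the zigzag effect, and it is why short runs found this way may not survive phasing.
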
Each individual has their own personal pile-ups. It can be instructive to map out your pile-ups so that you are aware of your own danger zones. I've previously used Don Worth's ADSA (autosomal DNA segment analyser) tool which is available from DNAGedcom to look at my pile-ups. I've also used the matching segment search at GEDmatch (this tool is available to Tier 1 subscribers). (5)  These tools are very useful for identifying problems in specific regions but it's difficult to get a good idea of the bigger picture.

Following on from our discussion in the All Genetic Genealogy Facebook group, Dan Edwards has been working on an exciting tool to provide a new way of visualising pile-ups. It's possible that the tool will eventually be made available on the web but for the moment it is a bespoke service. Dan has been experimenting on some of my data. He has produced for me some charts showing the distribution of shared segments across my 22 autosomes and on the X-chromosome. Dan has kindly given me permission to share my charts which are reproduced below.

The charts are based on my Family Finder chromosome browser data from Family Tree DNA. FTDNA updated their match thresholds in May 2016, but they are still the only company that continue to include small segments under 6 cMs when inferring a relationship. It is generally accepted by genetic genealogists that the use of such small segments is problematical. (6)

The problem with small segments can be clearly seen in the charts below. Rather than being distributed evenly across my genome, the smaller shared segments form huge spires and skyscrapers. As the segment size increases the pile-ups are greatly reduced, but there are still some parts of my genome which have some quite sizeable pile-ups on segments over 10 cMs in size. Chromosomes 9, 14, 18 and 19, in particular, seem to have a few problem areas which it is probably best for me to avoid. As more matches come in, these spires and skyscrapers can be expected to grow even more. Remember too that FTDNA only reports "matches" on small segments if the match thresholds have already been met. If small segments were reported for everyone in the database down to 1 cM it's likely that the spires would be even more pronounced.
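
For anyone who wants to try counting their own pile-ups, the idea can be sketched in a few lines of Python. This is not Dan's tool, and the details are assumptions: the segment rows are invented, the tuple layout (chromosome, start position, end position, cMs) is hypothetical rather than the exact chromosome browser file format, and the one-megabase bin size is arbitrary.

```python
from collections import Counter


def pileup_depth(segments, min_cm=0.0, bin_size=1_000_000):
    """Count how many shared segments cover each bin of each chromosome.

    segments: iterable of (chromosome, start_bp, end_bp, centimorgans)
    Returns a Counter keyed by (chromosome, bin_index).
    """
    depth = Counter()
    for chrom, start, end, cm in segments:
        if cm < min_cm:
            continue  # ignore segments below the chosen size threshold
        for b in range(start // bin_size, end // bin_size + 1):
            depth[(chrom, b)] += 1
    return depth


# Invented example rows, not real match data.
segments = [
    ("9", 1_000_000, 3_500_000, 2.1),
    ("9", 2_000_000, 4_000_000, 3.4),
    ("9", 2_500_000, 9_000_000, 11.2),
    ("14", 500_000, 1_200_000, 1.5),
]

all_depth = pileup_depth(segments)
large_depth = pileup_depth(segments, min_cm=10)

print(max(all_depth.values()), max(large_depth.values()))
```

Raising `min_cm` removes the small segments from the count, which mirrors the way the spires shrink as the segment size increases.
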

If Dan is able to develop his tool further and make it more widely available it will be interesting to see how other people's pile-ups compare with mine. I hope that we might also be able to identify a reason for some of the pile-ups. In the meantime I hope you enjoy looking at my pictures.

Footnotes

(1) See: Chiang CWK, Ralph P, Novembre J (2016). Conflation of short identity-by-descent segments bias their inferred length distribution. G3 Genes Genomes Genetics 6: 1287.

(2) For a useful overview of SNP coverage on the chips used by AncestryDNA and 23andMe see Rebekah Canada's series of articles on the subject of exploring microarray chips.

(3) For a good overview of the methodology of IBD detection see Browning and Browning (2012):  Identity by descent between distant relatives: detection and applications (Annual Review of Genetics 2012; 46: 617-33). The authors state: "The key idea behind IBD segment detection is haplotype frequency. If the frequency of a shared haplotype is very small, the haplotype is unlikely to be observed twice in independently sampled individuals, so one can infer the presence of an IBD segment. This criterion can be applied in several ways. The first is length of sharing, which is a proxy for frequency. If two densely genotyped haplotypes are identical at all or most (allowing for some genotyping error) assayed alleles over a very large segment of a chromosome, then the haplotypes are likely to be identical by descent across the whole segment. The second is direct use of haplotype frequency: Shared haplotypes with estimated frequency below some threshold are determined to be identical by descent. The third makes use of a population genetics model to infer probability of IBD. Given the frequency of the shared haplotype and a probability model for the IBD process along the chromosome, one can estimate the probability that the individuals are identical by descent at any position on the segment."

(4) For a good explanation of how the AncestryDNA algorithm works read the blog post by Julie Granka on Filtering DNA matches at AncestryDNA with Timber. Take a look in particular at the figure in that blog post. Although the majority of phased segments filtered out by Timber are smaller segments under 15 cMs, note that it also downweights some larger segments up to 50 cMs in size.

(5) Peter Alefounder has developed a tool known as the Geneal Segment Stacker but I've not yet had time to play around with it. There are further details in this thread in the ISOGG Facebook group.

(6) For an excellent summary on the current state of our knowledge on the subject of small segments see the blog post A small segment round up by Blaine Bettinger.

Sunday 14 January 2018

A chromosome browser and a new matching algorithm at MyHeritage

There was a big update at MyHeritage on Thursday this week. They rolled out their updated matching algorithms and also introduced a new chromosome browser feature. MyHeritage have written an excellent blog post which explains the changes in more detail and also provides a good overview of the technicalities of DNA matching written in easy-to-understand language.
All MyHeritage customers are currently automatically opted in to DNA matching. If, for any reason, you do not want to be notified of matches you can opt out in the My Privacy DNA settings.
I previously had 49 matches at MyHeritage. The new algorithms have allowed them to drop the threshold and report more distant matches. I now have a grand total of 1474 matches. Before the changeover I found that 72% of my matches did not match either of my parents. Previously I had to go through all my matches one by one and check whether or not they matched my parents. Now, if I click on my matches with my mum and dad, I can see the tally of the matches along with a list of all the matches I share with them. I now share 530 matches with my dad and 473 with my mum. This means that 1003 of my 1474 matches (68%) match my parents. The mismatch rate has been reduced to 32% which is a huge improvement. MyHeritage announced at the end of December that they had tested 1.08 million people so the number of matches is much more in line with what we might expect from such a large database. MyHeritage advised in November that the majority of their customers were in the US but that "sales in Europe are strong".
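
The percentages above can be checked with a few lines of Python. The figures are the ones quoted in this post; the variable names are just for illustration, and the sum assumes that no match appears on both the dad list and the mum list.

```python
total_matches = 1474
shared_with_dad = 530
shared_with_mum = 473

# Assumes the two parental lists do not overlap.
matched_a_parent = shared_with_dad + shared_with_mum
matched_pct = round(100 * matched_a_parent / total_matches)
unmatched_pct = 100 - matched_pct

print(matched_a_parent, matched_pct, unmatched_pct)
```

If any match were shared with both parents, the true number matching at least one parent would be a little lower than the simple sum.
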

There are some useful filters which can be used to sort your matches. Currently you can view matches that have family trees, shared surnames and Smart Matches.

I found that 1,255 of my 1474 matches (85%) have uploaded trees. However, no indication is given of the completeness of the trees, and I've noticed that some of the trees only contain a single person.

Two hundred and thirty-one of my matches have shared ancestral surnames. On a brief perusal, many of these are common surnames like Johnson and Williams, and the people I match with these surnames seem to be mostly in America and will likely have no connection with Berkshire or Devon where my ancestors with these surnames are to be found. I would suggest it's best to focus on shared matches with rarer surnames.

I like the way that MyHeritage displays country flags as this makes it much easier to identify people in the countries where you are most likely to find recent genetic cousins. Even better, it is possible to filter matches by country, as well as searching for matches by surname and full name. The menu can be found on your DNA Matches page.


Note that the country search box will only accept a single word so if you are searching for matches from Great Britain simply enter the word "Great". Similarly if you're trying to locate matches from New Zealand search for the word "New". I currently have 123 matches from Great Britain, 12 matches from Ireland, 62 matches from Australia, 16 matches from New Zealand, 41 matches from Canada and 867 matches from the USA. Many thanks to Louise Coakley for alerting me to this filter and for the tip about searching for matches from Great Britain and New Zealand.

MyHeritage have also added a chromosome browser so that you can see a visual display of your matches. You need to scroll right down to the bottom of the match page to locate the tool. Here's the chromosome browser view of my closest match from the UK.
If I click on the Advanced Options on the top right of the chromosome browser I can download the matching segment data. In this case my match shares three segments of DNA with me which are 13.07 cMs, 6.04 cMs and 6.14 cMs respectively in size.

I recognise the names of some people who match me at other companies. I've not done a proper check but my sense is that the people who match me as 3rd to 5th cousins at MyHeritage are assigned more distant relationships at Ancestry (4th to 6th cousin or 5th to 8th cousins). Given that I'm not able to make the genealogical connections with these people I suspect the AncestryDNA estimates are more appropriate.

There's also a facility to sort matches by shared DNA, largest segment, full name and most recent. Apart from my mum and dad, I currently have no matches closer than third to fifth cousin. My highest match is somebody in America who shares 31.9 cMs (0.4%) with me spread across four segments. However, the longest segment is only 12.8 cMs. This match only shares a total of 12.8 cMs (0.2%) with my dad. I can see that the remaining three segments this match shares with me that are not shared with my dad are all very small (6.49, 6.03 and 6.62 cMs respectively) so I would guess that these are false positive segments.
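
The same check can be written as a short Python sketch using the segment sizes quoted above. It is a simplification: a real comparison would line up chromosome positions from the downloaded segment data, whereas this toy version identifies the segment shared with my dad by its size alone.

```python
# Segment sizes in cMs taken from the figures quoted in this post.
segments_cm = [12.8, 6.49, 6.03, 6.62]
shared_with_dad_cm = [12.8]

total_cm = round(sum(segments_cm), 2)

# Simplification: segments are paired up by size, not by position.
not_shared_with_dad = [cm for cm in segments_cm if cm not in shared_with_dad_cm]

print(total_cm, not_shared_with_dad)
```

The total comes to about 31.9 cMs, which agrees with the figure reported for this match, and the three segments left over are all under 7 cMs.
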

Partnership with FTDNA
MyHeritage use the Family Tree DNA labs in Houston, Texas, for their testing. If you've tested at MyHeritage you have the option of taking advantage of the free transfer to Family Tree DNA. The link can be found at the bottom of your DNA results page.
Further details of the transfer programme can be found here.

Similarly, if you've tested at FTDNA you can transfer your results free of charge to MyHeritage using the MyHeritage Upload link. Both companies have different databases and you will find people in both databases who have not tested elsewhere. You never know where you are going to get those all-important breakthrough matches so it's best to "fish in all the ponds".

Conclusion
MyHeritage have done an excellent job overhauling their matching algorithms. It is surprisingly difficult with current technology to identify distant matches, especially when results are being combined across different platforms. I think that MyHeritage are going about the matching in the right way and they are being very responsive to the feedback provided by genetic genealogists. I am sure we will see further improvements in the months and years to come. I look forward to receiving many more matches and to confirming my first relationship at MyHeritage DNA.

Other reviews