FAANG List Archived Post

From streeter@ebi.ac.uk  Mon May 18 08:16:29 2015
Date: Mon, 18 May 2015 14:14:03 +0100
From: Ian Streeter <streeter@ebi.ac.uk>
Subject: Re: Bioinformatics pipelines to analyze FAANG data
To: Multiple Recipients of <faang@animalgenome.org>

Hi Mick,

In the NextGen project (sheep, cattle and goat), we used GATK's VQSR for
variant recalibration and filtering.  The details of how we trained the
models are in this readme document:
ftp://ftp.ebi.ac.uk/pub/databases/nextgen/ovis/variants/population_sites/README_ovis_population_sites
Briefly, our training set came from merging together our highest
confidence calls from the intersection of samtools, freebayes, and GATK
UnifiedGenotyper.  We used this merged set to train the Gaussian model,
and then applied the model to recalibrate all other variants.  We did
this separately for SNPs and indels.
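
As a rough illustration of the intersection step only (not the NextGen
project's actual code), here is a minimal Python sketch. It assumes each
caller's high-confidence calls have already been parsed into
(chrom, pos, ref, alt) tuples; the function names are hypothetical, and
the VQSR model training itself is not shown.

```python
from typing import Iterable, Set, Tuple

# Hypothetical in-memory representation of one variant site.
Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def is_snp(variant: Variant) -> bool:
    """A site counts as a SNP here if ref and alt are both one base long."""
    _, _, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def consensus_training_sets(
    *callsets: Iterable[Variant],
) -> Tuple[Set[Variant], Set[Variant]]:
    """Return (snps, indels) found by every caller.

    Only sites present in all call sets are kept; the result is split so
    that SNPs and indels can be used to train separate models.
    """
    sets = [set(calls) for calls in callsets]
    shared = set.intersection(*sets) if sets else set()
    snps = {v for v in shared if is_snp(v)}
    indels = shared - snps
    return snps, indels
```

Splitting the shared sites into two sets mirrors the point above that
SNPs and indels were handled separately.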

This avoided using SNP chip variants as the training set. We were
concerned at the time that the 50k chips for sheep and goat were not a
large enough training set, and possibly had an ascertainment bias towards
commercial breeds.

Regards,
Ian



On 18/05/2015 08:42, WATSON Mick wrote:

> Hi Martien
>
> Excellent, I am glad to see we are not alone!
>
> For the validated SNP sets we use to train models, the SNP chip variants are
> a great place to start - but from there it seems to make sense to me to double
> check and make sure: i) variants are not in poor quality areas of the genome;
> ii) variants are in, or close to, Hardy-Weinberg equilibrium in at least one
> study.  Perhaps much of this has already been done in the array design, but
> it would be unusual for there to be no false positive SNPs on each array, and
> we should attempt to filter these out :-)
>
> Indels are an interesting one - very high false positive rate in all studies
> I know about.
>
> Laura, do you know anything about how the GATK human indel reference sets were
> created?  1000G_phase1.indels.b37.vcf and
> Mills_and_1000G_gold_standard.indels.b37.sites.vcf?
>
> Cheers
> Mick
>
> -----Original Message-----
> From: Groenen, Martien [martien.groenenwur.nl]
> Sent: 18 May 2015 08:30
> To: WATSON Mick; Multiple Recipients of
> Subject: RE: Bioinformatics pipelines to analyze FAANG data
>
> Hi Mick,
>
> Our results in pig confirm yours and we also filter out quite a few SNPs
> using a number of additional criteria after calling.
>
> In our previous pipeline (from a couple of years ago) we used a Mosaik/Samtools
> pipeline for the initial calling followed by some additional filtering. For pigs
> we submitted 28 million SNPs to dbSNP; we only submitted SNPs seen in at least
> two individuals and excluded all indels.
>
> We now use a different pipeline based on BWA/GATK followed by several filtering
> steps. The pipeline also includes VEP. We have noticed that this pipeline
> generates more false positives so requires more stringent filtering afterwards.
> E.g. we exclude SNPs within 5 bp of an indel and also filter out SNPs that
> show an excess of heterozygotes.
>
> We used this to analyse pigs, chickens and turkeys (300 pigs, 240 layers, 180
> turkeys). For pigs we identified ~75 million SNPs, which after further filtering
> were reduced to 44 million (including indels). We also used these to develop
> the porcine 700K Affy HD chip.
>
> Best regards,
> Martien
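
Two of the post-calling filters mentioned in the quoted messages
(excluding SNPs within 5 bp of an indel, and flagging SNPs with an excess
of heterozygotes) can be sketched in Python as below. This is only an
illustration under stated assumptions, not the pipeline used by any of
the groups: each indel is reduced to a single start position on one
chromosome, the heterozygote excess is summarised with the inbreeding
coefficient F = 1 - Hobs/Hexp (negative values mean more heterozygotes
than Hardy-Weinberg proportions predict), and all function names are
hypothetical. Any cut-off on F would be a further choice the thread does
not specify.

```python
from bisect import bisect_left
from typing import List, Sequence


def snps_far_from_indels(
    snp_positions: Sequence[int],
    indel_positions: Sequence[int],
    min_distance: int = 5,
) -> List[int]:
    """Keep SNP positions with no indel within min_distance bp.

    All positions are assumed to lie on the same chromosome, and each
    indel is represented by one position (an assumption of this sketch).
    """
    indels = sorted(indel_positions)
    kept = []
    for pos in snp_positions:
        # First indel at or after the left edge of the exclusion window.
        i = bisect_left(indels, pos - min_distance)
        if i < len(indels) and indels[i] <= pos + min_distance:
            continue  # an indel falls inside the window: drop this SNP
        kept.append(pos)
    return kept


def het_excess_f(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Return F = 1 - Hobs/Hexp from genotype counts at one biallelic site.

    F near 0 matches Hardy-Weinberg proportions; F well below 0 indicates
    an excess of heterozygotes. Returns 0.0 when F is undefined (no
    samples, or a monomorphic site).
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 0.0
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    expected_het = 2 * p * (1 - p) * n
    if expected_het == 0:
        return 0.0
    return 1.0 - n_het / expected_het
```

For example, a site where every sample is called heterozygous gives
F = -1, the kind of pattern often produced by collapsed paralogous
regions rather than by real variation.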

Contact: faang@iastate.edu