FAANG List Archived Post

From laura@ebi.ac.uk  Mon May 18 08:53:20 2015
From: Laura Clarke <laura@ebi.ac.uk>
Date: Mon, 18 May 2015 14:52:04 +0100
Subject: Re: Bioinformatics pipelines to analyze FAANG data
To: Multiple Recipients of <faang@animalgenome.org>

Hi Mick

Ian Streeter from my group has already responded about how it is
possible to do this for livestock species. There are certainly
improvements which can be made.

As far as the human data sets go, the two indel sets come from the
Phase 1 1000 Genomes indels (we have Phase 3 indels mapped to GRCh38
for the next remap), and the Mills/Devine dataset was based on a
high-quality, well-validated indel set that members of the consortium
provided.

thanks

Laura

On 18 May 2015 at 08:42, WATSON Mick roslin.ed.ac.uk> wrote:
> Hi Martien
>
> Excellent, I am glad to see we are not alone!
>
> For the validated SNP sets we use to train models, the SNP chip variants are
> a great place to start - but from there it seems to make sense to me to
> double-check and make sure: i) variants are not in poor-quality areas of the
> genome; ii) variants are in, or close to, Hardy-Weinberg equilibrium in at
> least one study.  Perhaps much of this has already been done in the array
> design, but it would be unusual for there to be no false positive SNPs on
> each array, and we should attempt to filter these out :-)
>
> Indels are an interesting one - very high false positive rate in all studies
> I know about.
>
> Laura, do you know anything about how the GATK human indel reference sets were
> created?  1000G_phase1.indels.b37.vcf and
> Mills_and_1000G_gold_standard.indels.b37.sites.vcf?
>
> Cheers
> Mick
>
> -----Original Message-----
> From: Groenen, Martien [martien.groenenwur.nl]
> Sent: 18 May 2015 08:30
> To: WATSON Mick; Multiple Recipients of
> Subject: RE: Bioinformatics pipelines to analyze FAANG data
>
> Hi Mick,
>
> Our results in pig confirm yours and we also filter out quite a few SNPs
> using a number of additional criteria after calling.
>
> In our previous pipeline (from a couple of years ago) we used a Mosaik/Samtools
> pipeline for the initial calling followed by some additional filtering. For pigs
> we submitted 28 million SNPs to dbSNP; we only submitted SNPs seen in at
> least two individuals and excluded all indels.
>
> We now use a different pipeline based on BWA/GATK followed by several filtering
> steps. The pipeline also includes VEP. We have noticed that this pipeline
> generates more false positives and so requires more stringent filtering afterwards.
> E.g. we exclude SNPs within 5 bp of an indel and also filter out SNPs that
> show an excess of heterozygotes.
>
> We used this to analyse pigs, chickens and turkeys (300 pigs, 240 layers, 180
> turkeys). For pigs we identified ~75 million SNPs, which after further filtering
> were reduced to 44 million (including indels). We also used these to develop
> the porcine 700K Affy HD chip.
>
> Best regards,
> Martien
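
Two of the post-calling filters mentioned in the thread (dropping SNPs within
5 bp of an indel, and dropping SNPs that show an excess of heterozygotes) can
be sketched roughly as below. This is only an illustration, not the pipeline
used by any of the groups above. The function names, the tuple-based inputs,
measuring the 5 bp from the ends of the indel, and the use of a chi-square
test against Hardy-Weinberg expectations with a 0.05 cut-off are all
assumptions; the thread does not say how either filter was implemented.

```python
from collections import defaultdict

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
# Using this cut-off is an assumption, not something stated in the thread.
CHI2_CRITICAL_1DF = 3.841


def drop_snps_near_indels(snps, indels, window=5):
    """Return the SNPs that lie more than `window` bp from every indel.

    snps   -- iterable of (chrom, pos) tuples, 1-based positions
    indels -- iterable of (chrom, start, end) tuples, 1-based, inclusive
    (hypothetical input layout chosen for this sketch)
    """
    # Pad each indel interval by `window` on both sides, grouped by chromosome.
    padded = defaultdict(list)
    for chrom, start, end in indels:
        padded[chrom].append((start - window, end + window))

    kept = []
    for chrom, pos in snps:
        near = any(lo <= pos <= hi for lo, hi in padded.get(chrom, ()))
        if not near:
            kept.append((chrom, pos))
    return kept


def has_excess_heterozygotes(n_hom_ref, n_het, n_hom_alt,
                             critical=CHI2_CRITICAL_1DF):
    """Flag a site whose heterozygote count is significantly above the
    Hardy-Weinberg expectation (simple chi-square goodness-of-fit test)."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return False
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    if min(expected) == 0:  # monomorphic site, nothing to test
        return False
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Only flag departures in the direction of too many heterozygotes.
    return n_het > expected[1] and chi2 > critical


if __name__ == "__main__":
    snps = [("1", 100), ("1", 106), ("1", 120), ("2", 100)]
    indels = [("1", 101, 103)]
    print(drop_snps_near_indels(snps, indels))
    print(has_excess_heterozygotes(0, 100, 0))
```

In practice the genotype counts would come from the called VCF, and an exact
test is often preferred over chi-square when sample sizes or allele counts
are small.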

Contact: faang@iastate.edu