I've just noticed something very odd while inspecting the resulting VCF file (the first one, containing both linked and unlinked SNPs). For a given locus containing multiple SNPs (i.e. more than one polymorphism in the same "bubble"), every diploid sample is either heterozygous for ALL SNPs or homozygous for ALL SNPs, which isn't realistic at all.
I have a ddRAD dataset with 25 samples and over a million loci (about 60% of which contain two or more polymorphisms), and not a single sample has a combination of heterozygous and homozygous sites within the same locus. I observed this using both discoSnp and discoSnpRad (the parameters I used in this particular case were k_31_c_3_D_0_P_5_m_5). Looking at the .fa file (the first generated output), I can see why the VCF has no further information about the exact bases at each SNP: each locus carries only one genotype per sample (either 0/0, 0/1, 1/1 or ./.), which forces any conversion tool to assign that same genotype to every SNP in the locus. I'm using the latest available version (2.6.2).
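A check of this kind can be sketched in Python roughly as follows. This is only a minimal sketch, not the script actually used here: the function names are hypothetical, and it assumes that all SNPs of one locus share the same value in a single VCF column (the first column by default, which is a guess and may need changing for discoSnp output). It reports, per locus, the samples that carry both heterozygous and homozygous calls.

```python
from collections import defaultdict


def genotype_class(sample_field):
    """Classify a VCF sample field as 'hom', 'het', or None if missing."""
    # The genotype is the first colon-separated subfield; treat phased as unphased.
    gt = sample_field.split(":")[0].replace("|", "/")
    alleles = gt.split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return "hom" if alleles[0] == alleles[1] else "het"


def mixed_loci(vcf_lines, locus_col=0):
    """Return {locus: [samples having both het and hom calls in that locus]}.

    locus_col is the index of the column assumed to identify the locus
    (ASSUMPTION: 0, i.e. the CHROM column).
    """
    samples = []
    seen = defaultdict(lambda: defaultdict(set))  # locus -> sample -> classes
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        locus = fields[locus_col]
        for name, field in zip(samples, fields[9:]):
            cls = genotype_class(field)
            if cls is not None:
                seen[locus][name].add(cls)
    result = {}
    for locus, per_sample in seen.items():
        mixed = [s for s, classes in per_sample.items() if len(classes) == 2]
        if mixed:
            result[locus] = mixed
    return result
```

Calling `mixed_loci(open("variants.vcf"))` on an uncompressed VCF (the file name is a placeholder) returning an empty dictionary would correspond to the situation described above, where no sample mixes heterozygous and homozygous sites within a locus.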
What's happening? Maybe that information is being lost between file conversions? Or maybe it's irretrievable somehow? Or am I missing something, like a parameter in the pipeline?
Thank you very much in advance!
Érico.