Working with the reference genotype data

Introduction

The reference genotype database has been widely used for GWA studieds. It provides more information on the genetic variants when the individual-level genotype data is not available or the genotypes data has low sequencing depth. The reference also has been used for imputing the missing genotypes to boost the power of GWAS.

Common reference panels include:

  • HapMap3 reference panels: Early reference panels
  • 1000 Genomes Phase3 (1KGP, widely used), 2506 individuals from 26 populations, 49,143,605 Sites. 504 individuals for European ancestry.
  • The Haplotype Reference Consortium (HRC), the first release consists of 64,976 haplotypes at 39,235,157 SNPs. Mainly european ancestry but also includes 1KGP
  • Topmed reference panel. Version r2 of the panel includes 97,256 reference samples and 308,107,085 genetic variants distributed across the 22 autosomes and the X chromosome.

Not all reference panels are public. Large reference needs some applications . Here we focus on working with 1KGP reference panel.

Work with 1KGP

access to the 1KGP project data

To convert them to the plink file, we need to run following commands:

for j in {1..22}; do
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
./plink --make-bed --out chr$j --maf 0.01 --vcf ALL.chr$j.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done
plink2 --pfile all_phase3 --pgen-info # check the information 
# or use the following command if all_phase3 is zst compressed
plink2 --pfile all_phase3 vzs --pgen-info # check the information 

The output is the information of the genotypes

--pgen-info on all_phase3.pgen:
  Variants: 84805772
  Samples: 2504
  REF alleles are all known
  Maximum allele count for a single variant: >2, not explicitly stored
  Explicitly phased hardcalls present
  No dosages present

Then we extract the European indviduals with maf > 0.01 amd remove snps with multi allels, duplicate posisions and names


plink2 --pfile all_phase3 'vzs' --keep-if SuperPop == EUR --rm-dup exclude-all --snps-only  'just-acgt' --max-alleles 2 --chr 1-22 --maf 0.01 --make-bed --out all_phase3_eur_maf_0.01

2504 samples (1271 females, 1233 males; 2497 founders) loaded from
all_phase3.psam.
77818345 out of 84805772 variants loaded from all_phase3.pvar.zst.
2 categorical phenotypes loaded.
--rm-dup: 191 duplicated IDs, 382 variants removed.
--keep-if: 2001 samples removed.
503 samples (263 females, 240 males; 503 founders) remaining after main
filters.
Calculating allele frequencies... done.
69267844 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
8550119 variants remaining after main filters.

Download and insert genetic distances


plink --bfile all_phase3_eur_maf_0.01 --cm-map ./genetic_map/genetic_map_chr@_combined_b37.txt --make-bed --out all_phase3_eur_maf_0.01_ctm

Replace predictor names with generic names of the form Chr:BP (often not required, but I find this format more convenient). It is useful to create a file containing the original names, generic names and alleles.

genofile='all_phase3_eur_maf_0.01'
awk < ${genofile}_ctm.bim '{$2=$1":"$4;print $0}' > ${genofile}_ref.bim
awk < ${genofile}_ctm.bim '{print $2, $1":"$4, $5, $6}' > ${genofile}_ref.names
cp ${genofile}_ctm.bed ${genofile}_ref.bed
cp ${genofile}_ctm.fam ${genofile}_ref.fam

Then we remove the intermediate results and keep the ${genofile}_ref files as our reference panel.

Reference

Jianqiao Wang
Jianqiao Wang
Assistant Professor

My research interests include statistical genetics and genomics.