admin管理员组

文章数量:1032881

如何计算群体中的单倍型频率

大家好,我是邓飞。

昨天写了一篇(单倍型的显著性分析)的博文,里面介绍了为什么GWAS分析后,要进行单倍型的显著性分析,简而言之,如果显著性位点在block中,以block为代表进行利用,可以进行PRS(多基因评分)或者MAS(分子标记辅助选择。

1,数据

数据是plink格式的map和ped格式的基因型数据。

测试数据下载:

有54个位点,310个个体。

2,计算单倍型

使用plink1.9,通过参数--blocks计算单倍型

代码语言:javascript代码运行次数:0运行复制
$ plink --file a19 --blocks no-pheno-req
PLINK v1.90b4.6 64-bit (15 Aug 2017)           www.cog-genomics/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --blocks no-pheno-req
  --file a19

15488 MB RAM detected; reserving 7744 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (54 variants, 310 people).
--file: plink-temporary.bed + plink-temporary.bim + plink-temporary.fam
written.
54 variants loaded from .bim file.
310 people (0 males, 0 females, 310 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 310 founders and 0 nonfounders present.
Calculating allele frequencies... done.
54 variants and 310 people pass filters and QC.
Note: No phenotypes present.
--blocks: 8 haploblocks written to plink.blocks .
Extra block details written to plink.blocks.det .
Longest span: 102.68kb.

从日志可以看出,共有8个单倍型,打印结果:

代码语言:javascript代码运行次数:0运行复制
]$ cat plink.blocks.det 
 CHR          BP1          BP2           KB  NSNPS SNPS
  19     26609728     26619327          9.6      4 QTLSaanen19.135_O|QTLSaanen19.137_O|QTLSaanen19.26_O|QTLSaanen19.28_O
  19     26647834     26659230       11.397      2 QTLSaanen19.32|GoatD01.005522
  19     26818809     26824298         5.49      2 QTLSaanen19.45_O|QTLSaanen19.46
  19     27170279     27214437       44.159      5 snp24006-scaffold244048787|QTLSaanen19.49|QTLSaanen19.50_O|QTLSaanen19.51|QTLSaanen19.53_O
  19     27220123     27228780        8.658      2 QTLSaanen19.54_O|QTLSaanen19.55
  19     27480793     27482976        2.184      2 snp24012-scaffold244-1259949|19_27482976_AF-PAKI
  19     27502643     27605322       102.68     15 QTLSaanen19.63_O|QTLSaanen19.64|QTLSaanen19.65_O|snp24013-scaffold244-1309587|QTLSaanen19.67|QTLSaanen19.68|QTLSaanen19.69_O|snp24014-scaffold244-1338180|QTLSaanen19.70_O|QTLSaanen19.71|QTLSaanen19.74|QTLSaanen19.76_O|QTLSaanen19.77_O|QTLSaanen19.78_O|snp24015-scaffold244-1384770
  19     27618016     27623534        5.519      3 QTLSaanen19.79|QTLSaanen19.79_O|QTLSaanen19.80

3,计算每个block的频率

使用plink1.07,参数:--hap-freq

代码语言:javascript代码运行次数:0运行复制
$ plink1.07 --file a19 --hap plink.blocks --noweb  --hap-freq

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        /        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ plink.log ]
Analysis started: Wed Apr  2 07:56:49 2025

Options in effect:
	--file a19
	--hap plink.blocks
	--noweb
	--hap-freq

54 (of 54) markers to be included from [ a19.map ]
Warning, found 310 individuals with ambiguous sex codes
These individuals will be set to missing ( or use --allow-no-sex )
Writing list of these individuals to [ plink.nosex ]
310 individuals read from [ a19.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 310 missing
0 males, 0 females, and 310 of unspecified sex
Before frequency and genotyping pruning, there are 54 SNPs
310 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 54 SNPs
After filtering, 0 cases, 0 controls and 310 missing
After filtering, 0 males, 0 females, and 310 of unspecified sex

Read 8 haplotypes from [ plink.blocks ]
Estimating haplotype frequencies/phases ( MHF >= 0.01 )
Considering phases P(H|G) >= 0.01
Requiring per individual per haplotype missingness < 0.5 
Writing haplotype frequencies to [ plink.frq.hap ]
8 out of 8 haplotypes phased       
Analysis finished: Wed Apr  2 07:56:49 2025

查看结果文件:

代码语言:javascript代码运行次数:0运行复制
$ cat plink.frq.hap 
     LOCUS    HAPLOTYPE          F
        H1         ACCC    0.05323
        H1         GATC       0.35
        H1         GATT     0.5968
        H2           GC     0.4821
        H2           AT     0.4885
        H2           GT    0.01956
        H3           CA     0.1339
        H3           AG     0.8661
        H4        CGGTT     0.1758
        H4        TAGCC    0.01867
        H4        CAACC     0.0477
        H4        TAACC     0.7442
        H5           GT     0.1774
        H5           AC     0.4613
        H5           GC     0.3613
        H6           AT     0.1806
        H6           GT     0.1435
        H6           GC     0.6758
        H7 TGCCAATCTTACAAT     0.1871
        H7 ATTTGGTTCCCAGGC       0.25
        H7 ATTTGGCTCCCAGGC     0.1581
        H7 ATTTGGTCTCACAAT    0.04355
        H7 ATTTGGTCTTACAAT     0.3323
        H7 ATTTGGTTCCACAAT    0.01774
        H8          AAG     0.1371
        H8          GGT     0.1242
        H8          AGT     0.3323
        H8          AAT     0.4065

可以看到,H1~H8是八个block,每个block里面包括不同的单倍型以及对应的频率,比如第一个block,包括3个类型:

- ACCC

- GATC

- GATT

对应的频率分别是:0.05323, 0.35和0.5968

把对应的基因型提取出来,确定每个样本的单倍型类型,结合表型数据,就可以进行单倍型间的显著性分析了,进而绘制箱线图、小提琴图等图了。

这种图:

这种图:

或者这种图:

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。原始发表:2025-04-02,如有侵权请联系 cloudcommunity@tencent 删除数据blockfilefrequency日志

如何计算群体中的单倍型频率

大家好,我是邓飞。

昨天写了一篇(单倍型的显著性分析)的博文,里面介绍了为什么GWAS分析后,要进行单倍型的显著性分析,简而言之,如果显著性位点在block中,以block为代表进行利用,可以进行PRS(多基因评分)或者MAS(分子标记辅助选择。

1,数据

数据是plink格式的map和ped格式的基因型数据。

测试数据下载:

有54个位点,310个个体。

2,计算单倍型

使用plink1.9,通过参数--blocks计算单倍型

代码语言:javascript代码运行次数:0运行复制
$ plink --file a19 --blocks no-pheno-req
PLINK v1.90b4.6 64-bit (15 Aug 2017)           www.cog-genomics/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --blocks no-pheno-req
  --file a19

15488 MB RAM detected; reserving 7744 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (54 variants, 310 people).
--file: plink-temporary.bed + plink-temporary.bim + plink-temporary.fam
written.
54 variants loaded from .bim file.
310 people (0 males, 0 females, 310 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 310 founders and 0 nonfounders present.
Calculating allele frequencies... done.
54 variants and 310 people pass filters and QC.
Note: No phenotypes present.
--blocks: 8 haploblocks written to plink.blocks .
Extra block details written to plink.blocks.det .
Longest span: 102.68kb.

从日志可以看出,共有8个单倍型,打印结果:

代码语言:javascript代码运行次数:0运行复制
]$ cat plink.blocks.det 
 CHR          BP1          BP2           KB  NSNPS SNPS
  19     26609728     26619327          9.6      4 QTLSaanen19.135_O|QTLSaanen19.137_O|QTLSaanen19.26_O|QTLSaanen19.28_O
  19     26647834     26659230       11.397      2 QTLSaanen19.32|GoatD01.005522
  19     26818809     26824298         5.49      2 QTLSaanen19.45_O|QTLSaanen19.46
  19     27170279     27214437       44.159      5 snp24006-scaffold244048787|QTLSaanen19.49|QTLSaanen19.50_O|QTLSaanen19.51|QTLSaanen19.53_O
  19     27220123     27228780        8.658      2 QTLSaanen19.54_O|QTLSaanen19.55
  19     27480793     27482976        2.184      2 snp24012-scaffold244-1259949|19_27482976_AF-PAKI
  19     27502643     27605322       102.68     15 QTLSaanen19.63_O|QTLSaanen19.64|QTLSaanen19.65_O|snp24013-scaffold244-1309587|QTLSaanen19.67|QTLSaanen19.68|QTLSaanen19.69_O|snp24014-scaffold244-1338180|QTLSaanen19.70_O|QTLSaanen19.71|QTLSaanen19.74|QTLSaanen19.76_O|QTLSaanen19.77_O|QTLSaanen19.78_O|snp24015-scaffold244-1384770
  19     27618016     27623534        5.519      3 QTLSaanen19.79|QTLSaanen19.79_O|QTLSaanen19.80

3,计算每个block的频率

使用plink1.07,参数:--hap-freq

代码语言:javascript代码运行次数:0运行复制
$ plink1.07 --file a19 --hap plink.blocks --noweb  --hap-freq

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        /        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ plink.log ]
Analysis started: Wed Apr  2 07:56:49 2025

Options in effect:
	--file a19
	--hap plink.blocks
	--noweb
	--hap-freq

54 (of 54) markers to be included from [ a19.map ]
Warning, found 310 individuals with ambiguous sex codes
These individuals will be set to missing ( or use --allow-no-sex )
Writing list of these individuals to [ plink.nosex ]
310 individuals read from [ a19.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 310 missing
0 males, 0 females, and 310 of unspecified sex
Before frequency and genotyping pruning, there are 54 SNPs
310 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 54 SNPs
After filtering, 0 cases, 0 controls and 310 missing
After filtering, 0 males, 0 females, and 310 of unspecified sex

Read 8 haplotypes from [ plink.blocks ]
Estimating haplotype frequencies/phases ( MHF >= 0.01 )
Considering phases P(H|G) >= 0.01
Requiring per individual per haplotype missingness < 0.5 
Writing haplotype frequencies to [ plink.frq.hap ]
8 out of 8 haplotypes phased       
Analysis finished: Wed Apr  2 07:56:49 2025

查看结果文件:

代码语言:javascript代码运行次数:0运行复制
$ cat plink.frq.hap 
     LOCUS    HAPLOTYPE          F
        H1         ACCC    0.05323
        H1         GATC       0.35
        H1         GATT     0.5968
        H2           GC     0.4821
        H2           AT     0.4885
        H2           GT    0.01956
        H3           CA     0.1339
        H3           AG     0.8661
        H4        CGGTT     0.1758
        H4        TAGCC    0.01867
        H4        CAACC     0.0477
        H4        TAACC     0.7442
        H5           GT     0.1774
        H5           AC     0.4613
        H5           GC     0.3613
        H6           AT     0.1806
        H6           GT     0.1435
        H6           GC     0.6758
        H7 TGCCAATCTTACAAT     0.1871
        H7 ATTTGGTTCCCAGGC       0.25
        H7 ATTTGGCTCCCAGGC     0.1581
        H7 ATTTGGTCTCACAAT    0.04355
        H7 ATTTGGTCTTACAAT     0.3323
        H7 ATTTGGTTCCACAAT    0.01774
        H8          AAG     0.1371
        H8          GGT     0.1242
        H8          AGT     0.3323
        H8          AAT     0.4065

可以看到,H1~H8是八个block,每个block里面包括不同的单倍型以及对应的频率,比如第一个block,包括3个类型:

- ACCC

- GATC

- GATT

对应的频率分别是:0.05323, 0.35和0.5968

把对应的基因型提取出来,确定每个样本的单倍型类型,结合表型数据,就可以进行单倍型间的显著性分析了,进而绘制箱线图、小提琴图等图了。

这种图:

这种图:

或者这种图:

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。原始发表:2025-04-02,如有侵权请联系 cloudcommunity@tencent 删除数据blockfilefrequency日志

本文标签: 如何计算群体中的单倍型频率