小梁的个人空间

A Review of Plant Genome Assembly

默默家有 — Tue, 21 Nov 2023 02:13:00 +0000

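Genome survey: assessing genome characteristics

Genome size, heterozygosity, and repeat content are the main features that determine sequencing cost, assembly difficulty, and the quality of the final assembly.

A K-mer is a subsequence of length K, taken from each sequencing read by sliding a window 1 bp at a time. Plotting the number of distinct K-mers against how many times each one occurs (the K-mer depth) gives a distribution curve that reveals the basic features of the genome.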
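As a rough illustration of the K-mer counting described above, here is a minimal Python sketch. The function names and the toy reads are invented for this example. It counts K-mers on the forward strand only, whereas real survey tools typically count canonical K-mers (both strands) and use far more memory-efficient data structures.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every K-mer obtained by sliding a window of length k by 1 bp."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct K-mers observed at that depth."""
    return Counter(kmer_counts.values())


# Toy reads, made up for illustration only.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=3)
hist = depth_histogram(counts)
print(dict(counts))
print(dict(hist))
```

Plotting `hist` (depth on the x axis, number of distinct K-mers on the y axis) gives the distribution curve discussed in this section.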

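For a genome sequenced with uniform coverage, no sequencing errors, and no repeats, the K-mer distribution follows a Poisson distribution. Complex genomic features pull the curve away from the Poisson shape and produce peaks that correspond to those features.

In real sequencing data, the very high value at the start of the curve comes from K-mers created by sequencing errors, which have a depth of only 1-2.
A haploid or homozygous genome shows a single main peak. A heterozygous diploid genome shows two peaks, a heterozygous peak and a homozygous peak, with the former at half the depth of the latter.
A heterozygous polyploid genome shows several heterozygous peaks. The larger the share of K-mers under the heterozygous peaks, the higher the heterozygosity.
High repeat content produces a small peak after the main peak or a long tail at very high depth.

Genome size can be estimated as (total number of K-mers) / (expected K-mer depth), where the depth of the main peak is usually taken as the expected depth. This estimate carries an error of roughly 10%, so it is worth combining it with a flow-cytometry measurement of DNA content when judging genome size.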
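The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the histogram values are invented, the function name is hypothetical, and the `min_depth` cutoff used to discard error K-mers is an arbitrary choice rather than something the text prescribes. For a heterozygous genome the homozygous peak, not the heterozygous one, would have to be used as the expected depth.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size as (total K-mers) / (main peak depth).

    histogram: dict mapping depth -> number of distinct K-mers at that depth.
    min_depth: depths below this are treated as sequencing errors (assumption).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no K-mers at or above min_depth")
    # Main peak: the depth shared by the largest number of distinct K-mers.
    peak_depth = max(solid, key=solid.get)
    # Total K-mer occurrences, excluding the error peak.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Invented toy histogram: error peak at depth 1-2, main peak at depth 20,
# and a small repeat peak at depth 40.
hist = {1: 5000, 2: 800, 18: 200, 19: 600, 20: 1000, 21: 600, 22: 200, 40: 50}
size, peak = estimate_genome_size(hist)
print(peak, size)
```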

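K-mer distribution curves of Illumina sequencing data from several plant genomes:

a: peak caused by sequencing errors, at depth 1-2; b: main peak of a haploid or homozygous diploid genome; c: peak formed by low-copy repeats, often at twice the depth of the main peak; d: peak formed by high-copy repeats. In a heterozygous diploid genome, peak b1 contains K-mers from heterozygous regions and peak b2 contains K-mers from homozygous regions; b1 sits at only half the depth of b2. In a heterozygous autotetraploid plant, peaks b1 and b2 both represent K-mers from heterozygous regions, and peak b3 represents K-mers from homozygous regions.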

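Assembly of simple plant genomes

A genome of no more than 1 Gb that is homozygous or has heterozygosity below 0.5%, with repeat content below 50%, can be classed as a simple genome. It can be assembled from second-generation (short-read) data, from a mix of second- and third-generation data, or from third-generation (long-read) data alone.

In projects built mainly on short reads, contigs are usually assembled from small-insert libraries and scaffolds are built with large-insert (mate-pair) libraries; a small amount of long-read data is then added to fill the gaps in the scaffolds.

Compared with these two approaches, assembling from long reads alone markedly improves contiguity and completeness and shortens assembly time. Producing high-quality contigs from long reads and then building chromosomes with a genetic map, Hi-C map, or optical map is currently the most efficient way to resolve a simple genome, and it is what journals generally expect of a simple-genome assembly.

Because the per-base error rate of raw third-generation reads is as high as 10%-15%, the assembled genome usually needs sequence error correction ("polishing") before gene annotation and other analyses. Polishing can use short reads or long reads, and when necessary both are combined over several rounds.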

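Assembly of highly heterozygous genomes

Self-incompatibility and asexual propagation are widespread among plants in nature and give rise to heterozygous genomes.

A highly heterozygous genome has heterozygosity of about 1%-2%, meaning that homologous segments differ at 1%-2% of positions. During assembly, reads from homologous regions cannot be fully merged, which produces many branching structures and severely harms contiguity and downstream analysis.

Splitting genomic DNA into small portions that are sequenced and assembled separately is an effective way to avoid interference from heterozygous segments. Each portion contains very few heterozygous segments and can essentially be assembled as a homozygous genome, which lowers the difficulty.

Early projects handled heterozygous genomes with a BAC-by-BAC strategy: tens of thousands of BAC clones were constructed, each was sequenced and assembled individually, and the results were merged into one genome. Another approach relies on meiosis to isolate a single genome copy, for example by obtaining haploid individuals through pollen culture. For species where haploids cannot be obtained, researchers try to extract haploid data from the sequencing data of the diploid.

For example, in the heterozygous pineapple (Ananas comosus (L.) Merr.) genome project, heterozygous pineapple F153 was crossed with CB5, and the reads of an F1 individual were compared with those of the parent F153 to separate out the reads of one of F153's two genome copies for assembly.

The more recent 10x Genomics technology encapsulates large DNA molecules in oil droplets and barcodes them before sequencing. The resulting linked reads retain long-range information and help build longer scaffolds. The approach yields a usable reference genome at minimal sequencing and computing cost and has been widely applied to plant genomes.

In early genome projects the goal was a complete haploid reference genome, so either a single genome copy was assembled or heterozygous regions were merged as far as possible. As genome research has deepened, haplotype information has received more and more attention, and heterozygous species now call for phased (haplotype-resolved) assembly.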

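Falcon-unzip was the first tool to use third-generation data for assembling and phasing heterozygous genomes. Its output consists of a haploid reference genome plus local haplotype information for heterozygous regions, which is currently the most common way of presenting a phased heterozygous genome.

Thanks to the read-length advantage of third-generation data, reference genomes of heterozygous species assembled by Falcon-unzip show markedly better contig contiguity. The output reference, however, mixes sequence from the two haplotypes, which still causes problems in gene annotation and other downstream analyses.

Because of limits in assembly algorithms or the uneven distribution of variant sites, haplotypes assembled from whole-genome sequencing alone are local and fragmented. Using genetic information to separate the data of homologous regions and then assembling each region into a haplotype is currently the most successful way to assemble highly heterozygous species.

Parent-offspring trio sequencing (trio binning) is the most direct way to separate the two haplotypes within a heterozygous individual. Trio binning combines trio sequencing with third-generation sequencing: the parental sequencing data are used to split the reads of the heterozygous F1 individual into two classes, and each class is assembled into one parental haplotype. Results on an Arabidopsis F1 individual (heterozygosity 1.36%) show that both haplotypes reach a high level of completeness and quality. Trio binning classifies reads poorly when they come from regions that are heterozygous in a parent, so it works best with homozygous parents. In addition, many studies cannot meet the requirements of trio sequencing, which limits where trio binning can be applied.

Trio binning phasing scheme. Parent-specific K-mers from the parental sequencing data are used to split the offspring reads into two sets, which are assembled separately into the two parental haplotypes.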
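The read-binning idea in the caption above can be sketched as follows. This is a toy illustration, not the implementation of any published tool: the function names, the simple majority rule, and the sequences are all assumptions made for the example, and real pipelines also filter parental K-mers by depth to remove sequencing errors.

```python
def kmers(seq, k):
    """Return the set of K-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """K-mers found in only one parent's reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign an offspring read to the parent whose specific K-mers it shares more."""
    ks = kmers(read, k)
    m, p = len(ks & mat_only), len(ks & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"


# Invented toy data: the parents differ by a single base (C vs T).
mat_only, pat_only = parent_specific_kmers(["AAACCCGGG"], ["AAACCTGGG"], k=4)
for read in ["ACCCGG", "ACCTGG", "AAACC"]:
    print(read, bin_read(read, mat_only, pat_only, k=4))
```

Reads that carry no parent-specific K-mer stay unassigned here; how such reads are handled in practice depends on the tool.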

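Genetic populations are also a powerful tool for genome phasing. The phased assembly of heterozygous potato, for example, has 3 stages: (1) assemble all contigs of the diploid genome from HiFi data; (2) build a genetic map and assign the contigs to 12 linkage groups, corresponding to the 12 chromosomes of the haploid genome; (3) split the contigs of each linkage group into two sets by genotype, representing the two haplotypes of that chromosome.

Phasing scheme based on sequencing a haploid population. Pre-assembled BAC sequences serve as the input for phasing. The researchers sequenced 12 pear pollen cells and developed a barcoding method that converts the genotype of each BAC sequence into a 12-bit binary code. The BAC sequences in this method can be replaced with HiFi reads, assembled contigs, or other accurate long sequences.

Like other phasing methods, this workflow first separates the chromosomes and then separates the two haplotypes of each chromosome. In stage (2) the researchers developed a method for building linkage groups from contigs. Using genetic linkage groups to separate chromosomes removes the dependence on a known reference genome and broadens the range of application.

Phasing scheme based on a selfed segregating population. The scheme assembles diploid contigs de novo and sequences a segregating population to genotype the contigs. A genetic map is built to separate the chromosomes, and genotype similarity is then used to separate contigs that belong to the same chromosome but to different haplotypes.

Assembling and phasing highly heterozygous genomes has long been a hard problem in genome methods, and there is still no reasonably simple method or tool for it.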

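Assembly of highly repetitive genomes

Repeats play an indispensable role in species evolution and functional regulation and are an important part of the genome. Because repeats are highly similar in sequence, vary in length, and range widely in copy number, they have always been difficult to assemble.

Compared with second-generation sequencing, third-generation long reads can span repeat regions and tell repeats apart more reliably, which markedly improves contiguity as well as the completeness and accuracy of assembled repeats. This advantage is fully demonstrated in the maize genome, where 85% of the sequence derives from transposon amplification. The maize B73 genome assembled from PacBio data improved contig contiguity 52-fold relative to the previous reference version and corrected assembly errors in centromeric regions, which greatly improved the annotation of functional gene regions and the evolutionary analysis of transposons.

Another class of highly repetitive genomes belongs to plants with giant genomes, such as loblolly pine (Pinus taeda L., 22 Gb, 82%), Norway spruce (Picea abies, 20 Gb, >71%), and ginkgo (Ginkgo biloba, 10 Gb, 80%). More than 70% of each genome is repeats, far more than in model plants such as Arabidopsis (20%) and rice (40%). These gymnosperms are all heterozygous, so the haploid gametophyte tissue (the endosperm-like megagametophyte) can be chosen for sequencing.

Large genomes are costly to sequence and technically hard to assemble. The ginkgo genome reassembled by Jue Ruan's team from PacBio data and Hi-C is the highest-quality gymnosperm genome published so far. The garlic (Allium sativum) genome published in 2020 went through 3 whole-genome duplications and repeat expansion to reach 16.9 Gb, of which 91.3% is repeats, the highest repeat proportion of any genome assembled to date. Its assembly used PacBio data to build contigs, 10x Genomics libraries to join them into scaffolds, and Hi-C data to anchor them to chromosomes. The assembly of the heterozygous coast redwood (Sequoia sempervirens) genome (hexaploid, haploid size 27 Gb) used PacBio HiFi data and the Hifiasm software to obtain 47.47 Gb of contig sequence with an N50 of 1.92 Mb, showing the promise of highly accurate third-generation data for assembling large plant genomes.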

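Assembly of high-ploidy genomes

Hybridization and genome doubling have produced polyploid plants. Several important crops, such as wheat, cotton, and potato, are polyploid, and resolving their genomes is an important factor in the progress of crop breeding.

Polyploid species are divided by origin into allopolyploids and autopolyploids. In an allopolyploid the chromosomes come from different ancestors and the subgenomes can be told apart, so they interfere little with assembly. In an autopolyploid the multiple chromosome sets are highly similar, which makes the genome equivalent to a highly heterozygous one and extremely hard to assemble. An allopolyploid genome can usually be assembled as if it were homozygous, and the main task is to separate the subgenomes after assembly.

When the International Wheat Genome Sequencing Consortium resolved the hexaploid bread wheat (Triticum aestivum, AABBDD) genome, it used flow-cytometric sorting to separate the 21 chromosomes and built a BAC library for each to sequence and assemble. Chromosome sorting is technically demanding and costly and is not common in ordinary plant research. The assemblies of the tetraploid rapeseed genome (Brassica napus, AACC) and the tetraploid peanut genome (Arachis hypogaea, AABB) used sequencing data from the diploid ancestors to separate the two subgenomes.

Compared with second-generation data, third-generation data separate similar sequences better and yield more contiguous contigs, which are then combined with a whole-genome genetic map or Hi-C map to separate homoeologous chromosomes. The tetraploid cotton TM-1 (Gossypium hirsutum, AADD) genome published in 2015 was assembled from 100,000 BAC clones and a genetic map. The new TM-1 versions published in 2019 and 2020 were both built from PacBio data plus Hi-C and optical maps, which improved reference quality and provided a more efficient, cheaper way to assemble polyploids.

Whereas allopolyploids arise from natural hybridization, autopolyploids form through chromosome doubling. Genetically, all chromosome sets can pair during meiosis, and at the sequence level the homologous regions are highly similar and interfere with each other during assembly. In the era dominated by second-generation data, the only ways to build a reference genome for such a species were to sequence haploid material to reduce the difficulty, or to tolerate and merge heterozygous regions.

The hexaploid sweet potato genome (Ipomoea batatas, B1B1B2B2B2B2) published in 2017 was the first report of a haploid reference genome for an autopolyploid plant, together with phasing results for 30% of the genome.

In 2018 the autotetraploid sugarcane genome (Saccharum spontaneum, 1n=4x) was the first to solve haplotype assembly for an autopolyploid. The key step was to use a BAC library and third-generation data to overcome sequence similarity and assemble all contigs of the tetraploid, and then to split them into 4 chromosome sets with a Hi-C map. The Hi-C phasing software ALLHiC used the genome of the related species sorghum to assign sugarcane contigs to chromosomes, and then used Hi-C interaction signals to separate and anchor homologous contigs.

The autotetraploid alfalfa (Medicago sativa L., 2n=4x) genome was resolved with the same scheme. With help from the genome of the diploid relative M. truncatula, 4 phased chromosome sets were obtained. The tetraploid alfalfa project was the first to use highly accurate PacBio HiFi data for polyploid assembly and achieved better contig contiguity than the sugarcane genome.

Although assembly and phasing of autopolyploids have succeeded in several species, Hi-C-based phasing software still depends on a haploid reference genome and separates homologous chromosomes poorly when their differences are small. Resolving complex autopolyploid genomes still requires further work on integrating multiple data types and technologies.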

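Plant pan-genome assembly

A pan-genome is built by sequencing and assembling different individuals of a species so as to capture as much of the species' genetic information as possible, providing a new reference for later functional studies.

There are 3 ways to build a pan-genome. Early studies had little sequencing data, so reads from each individual were aligned to the reference genome, the unaligned reads were extracted and assembled, and the resulting new sequences were added to the reference iteratively. This is called iterative assembly (the "map-to-pan" strategy), as used in 3K Rice. A pan-genome built this way has poor contiguity and cannot detect large structural variants, and the newly assembled sequences also introduce redundancy.

Iterative pan-genome assembly. Sequences are aligned back to the reference genome, the unaligned sequences are extracted and assembled, and the reference is extended iteratively to build the pan-genome.

The second way is to assemble each individual genome de novo and then build the pan-genome. High-quality individual genomes are a prerequisite for pan-genome analysis, so assembly cost is high. De novo assembly makes it possible to identify presence/absence variation (PAV) systematically across groups, and chromosome-level comparison reveals large-scale rearrangements and structural variants genome-wide, giving more precise information for dissecting the genetic basis of complex phenotypes.

Pan-genome construction by de novo assembly. All individuals are assembled and annotated de novo, a gene clustering algorithm builds the pan-gene set, genes are classified by how often they occur across accessions into a core gene set and a dispensable (variable) gene set, and a pan-genome accumulation curve is drawn from a fitted linear model.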
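The core/dispensable classification mentioned in the caption above can be sketched like this. The presence table, the accession names, and the `core_fraction` threshold are all hypothetical; studies differ in how strictly they define "core" (for example, present in all accessions versus present in most of them).

```python
def classify_gene_families(presence, accessions, core_fraction=1.0):
    """Split gene families into core and dispensable sets.

    presence: dict mapping gene family -> set of accessions that carry it.
    accessions: list of all accessions in the pan-genome.
    core_fraction: minimum share of accessions for a family to count as core
                   (1.0 means present in every accession; an assumption here).
    """
    all_accessions = set(accessions)
    core, dispensable = [], []
    for family, present_in in presence.items():
        frequency = len(present_in & all_accessions) / len(all_accessions)
        (core if frequency >= core_fraction else dispensable).append(family)
    return sorted(core), sorted(dispensable)


# Invented presence/absence data for three accessions.
accessions = ["acc1", "acc2", "acc3"]
presence = {
    "famA": {"acc1", "acc2", "acc3"},
    "famB": {"acc1", "acc3"},
    "famC": {"acc2"},
}
core, dispensable = classify_gene_families(presence, accessions)
print(core, dispensable)
```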

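The third way is the graph-based genome, which has developed rapidly in recent years and uses paths on a graph to represent the sequences that are shared or different among individuals. A graph reference is generally built from de novo assemblies: the genomes of different individuals are aligned to a linear reference genome to extract variants, the variants from all individuals are de-duplicated and merged with the linear genome, and the variants are then displayed as multiple paths. A graph genome accounts for both the similarities and the differences between individuals and shows complex structural variation in a population more intuitively. Compared with a linear genome, it reconciles the coordinates of multiple genomes better and keeps the sequence information of all individuals in a minimal data structure, so it is expected to be widely adopted in pan-genome analysis.

Graph genome. Variants are extracted against the reference genome and the merged variant set is used to build the graph; grey boxes show paths that differ from the reference genome, and the right panel shows the actual graph structure of two regions.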

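Sequencing technology and assembly quality

Early projects used Sanger sequencing of BACs and other large-insert clones and then stitched the large fragments into a genome. The standard reference genomes of model species such as human, E. coli, yeast, nematode, and fruit fly were built this way. Their quality is good, but the cost was too high.

In second-generation assemblies, library preparation requires PCR and has a GC bias, so some regions are never covered and completeness suffers. Reads are short, so mate-pair libraries of 2 kb-40 kb are usually built to span repeats and other hard regions, which leaves many gaps in the genome and contigs of only a few tens of kb. Complex regions such as centromeres and telomeres are hard to resolve, and the result is essentially a draft.

Third-generation technologies, represented by PacBio and Nanopore, need no PCR during library preparation and therefore cover the genome more evenly. They sequence single molecules, with read lengths from tens of kb to over a hundred kb, and Nanopore ultra-long sequencing can even produce Mb-scale reads. Such reads span long complex regions and provide enough markers to separate similar and homologous segments, raising contig N50 to the Mb or even tens-of-Mb level. Many species first assembled with second-generation data have been reassembled with third-generation data, improving contig contiguity and filling regions that short reads never covered.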
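Contig N50, used throughout this review as the contiguity metric, is the length of the contig at which the cumulative length of contigs sorted from longest to shortest first reaches half of the total assembly length. A minimal sketch, with invented contig lengths:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= half.
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty list")


contigs = [80, 70, 50, 40, 30, 20]  # toy lengths, total 290
print(n50(contigs))
```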

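PacBio CLR (continuous long reads) data have a raw base accuracy of 85%-92%. The errors are random, so increasing sequencing depth and correcting against the consensus can raise consensus accuracy to 99.99%. Nanopore accuracy is similar to CLR, but the errors are not fully random, and accuracy after correction reaches about 99%.

In heterozygous or highly repetitive plant genomes, however, homologous or multi-copy sequences differ by only 1%-2%, far below the error rate of third-generation reads (10%-15%). Correcting the raw data inevitably merges similar sequences, and the information in those sequences is lost for later assembly and phasing. When such genomes are assembled with software like CANU, the correction stage sometimes reduces the raw data to one third, so the final assembly is far smaller than the estimated genome size. Raw-read correction is also slow and becomes the bottleneck in assembling large genomes (>10 Gb).

In the last two years PacBio has introduced high-fidelity HiFi reads with base accuracy above 99%. The high accuracy markedly improves reference assembly quality and removes steps such as raw-read correction and assembly polishing, making HiFi the most trusted type of sequencing data at present. HiFi sequencing reads each DNA insert through multiple passes, trading length for accuracy. Average read length is only half that of CLR (10-20 kb vs 20-40 kb) and throughput is only one fifth: one SMRT cell currently yields more than 100 Gb of CLR reads but only 20-25 Gb of HiFi reads. HiFi reads cannot span long complex regions, data yield is low, and cost is high. These are the limits of HiFi data for large, complex genomes.

Assembly quality depends largely on the length and accuracy of the fragments a sequencing technology produces. HiFi provides highly accurate single-molecule reads and Nanopore ultra-long provides very long fragments; used together, the two are moving plant genomes into the era of telomere-to-telomere (T2T) "finished" assemblies.

In practice, every genome to be assembled raises its own technical problems and downstream analysis needs. It is wise to assess the genome's characteristics and set expectations for assembly quality at the start of a project, and only then choose the sequencing and assembly strategy.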

Introduction to the SAM/BAM format

默默家有 — Fri, 17 Nov 2023 04:58:00 +0000

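1. Introduction to the SAM/BAM format

Purpose of SAM: to give data from different sequencing platforms, processed by different aligners, one common storage format.
A SAM (Sequence Alignment/Map) file stores the alignments of sequencing data to a reference genome. Each line is tab-delimited, and the file consists of a header section and an alignment section.
A BAM (Binary Alignment/Map) file is the binary form of SAM, compressed with the BGZF library.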

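2. Terms and concepts

These terms appear repeatedly below and help in understanding the SAM format.

Template: a DNA/RNA sequence, part of which is sequenced on a sequencing machine or assembled from raw sequences. (In other words, what we sequence on the machine, or a longer sequence assembled from raw sequences, is part of a template. For Illumina paired-end sequencing the template is the insert fragment.)
Segment: a contiguous sequence or subsequence. (Judging from the context, a segment can be a whole read or part of a read.)
Read: a raw sequence that comes off a sequencing machine. A read may consist of multiple segments (during alignment a read may be split into several pieces that map to different places on the reference; each piece is a segment). For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: an alignment of a read to the reference that may contain insertions, deletions, skips, and clipping, but no change of direction (for example, one part of the read aligning to the forward strand and another part to the reverse strand). A linear alignment can be represented in a single SAM record. (That is, one SAM record holds exactly one linear alignment.)
Chimeric alignment: an alignment that cannot be represented as a linear alignment. A chimeric alignment is a set of linear alignments without large overlaps (each piece is itself a linear alignment; the "no large overlaps" condition separates this case from multiple mapping). Typically one of the linear alignments is considered the "representative" alignment and the others are called "supplementary" and are marked with the supplementary alignment flag (representative and supplementary are the pair of terms that belongs to chimeric alignments). All SAM records of a chimeric alignment have the same QNAME and the same values for the 0x40 and 0x80 flag bits (see Section 1.4 of the SAM specification). These two bits say whether the read is the first or the last segment of its template, i.e. read 1 or read 2; because all records of one chimeric alignment come from the same read, the bits are identical across them. Which linear alignment is treated as representative is arbitrary. (The pieces of a chimeric alignment are therefore quite independent: they need not even lie on the same strand. A read whose parts align to different chromosomes is also necessarily chimeric, because direction cannot be compared across chromosomes.)
Read alignment: a linear alignment or a chimeric alignment that is the complete representation of the alignment of a read.
Multiple mapping: because of repeats and similar causes, the correct position of a read on the reference may be ambiguous. In that case a read can have several alignments. One of them is considered primary, and the SAM records of all the others carry the "secondary alignment" flag. All these records have the same QNAME and the same values for the 0x40 and 0x80 flag bits. The alignment designated primary is usually the best one; if several are equally good, one is chosen arbitrarily (primary and secondary are the pair of terms that belongs to multiple mapping). (Note from the specification: chimeric alignments are mainly caused by structural variation, gene fusions, misassemblies, RNA-seq, or experimental protocols, and are more frequent with longer reads, which is why third-generation sequencing is a stronger tool for detecting structural variants. The linear alignments in a chimeric alignment have no large overlaps, and each can have high mapping quality and be informative for SNP/INDEL calling. Multiple mappings are mainly caused by repeats and are less frequent with longer reads. If a read has multiple mappings, all of them overlap each other almost entirely; apart from the single best mapping, all others get mapping quality <=3 and are ignored by most SNP/INDEL callers.)
1-based coordinate system: a coordinate system in which the first base of a sequence is 1. A region is written as a closed interval; for example, the region from the 3rd to the 7th base inclusive is [3,7]. SAM, VCF, GFF, and Wiggle use 1-based coordinates.
0-based coordinate system: a coordinate system in which the first base of a sequence is 0. A region is written as a half-closed, half-open interval; for example, the region from the 3rd to the 7th base inclusive is [2,7). BAM, BCFv2, BED, and PSL use 0-based coordinates.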
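Converting between the two coordinate systems only changes the start position. A small sketch (the function names are made up for this example):

```python
def to_zero_based_half_open(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start, end)."""
    return start - 1, end


def to_one_based_closed(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start, end]."""
    return start + 1, end


# The example from the text: bases 3 through 7.
print(to_zero_based_half_open(3, 7))
print(to_one_based_closed(2, 7))
```

In the half-open form the region length is simply `end - start`, while in the closed form it is `end - start + 1`.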
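Phred scale: given a probability 0 < p <= 1, the Phred scale of p equals -10 log10 p, rounded to the closest integer.

3. The header section in detail

This section holds the annotation for a SAM/BAM file. It is optional and may be omitted. Every line starts with @ followed by a two-letter record type code, fields are separated by \t, and each field has the form TAG:VALUE (except on @CO lines). Each line matches one of these regular expressions: /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. There are five record types, HD, SQ, RG, PG, and CO, of which the first four are the common ones. Tags marked with * in the specification are mandatory whenever their record type is present: for example, when @HD is present VN must be present, @SQ requires SN and LN, and @RG and @PG each require ID.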
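The two regular expressions for header lines can be applied directly. The sketch below validates a header line and splits it into its record type and TAG:VALUE fields; the function name and the example lines are invented for illustration, and the patterns follow the header-line regular expressions of the SAM specification.

```python
import re

# Patterns for header lines, following the SAM specification.
HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*")


def parse_header_line(line):
    """Return (record_type, fields) for one header line.

    fields is a dict of TAG -> VALUE, or the free text for @CO lines.
    """
    line = line.rstrip("\n")
    if COMMENT_RE.match(line):
        return "CO", line[4:]          # everything after "@CO\t"
    if not HEADER_RE.match(line):
        raise ValueError(f"not a valid SAM header line: {line!r}")
    record_type, *fields = line.split("\t")
    return record_type[1:], dict(f.split(":", 1) for f in fields)


print(parse_header_line("@HD\tVN:1.6\tSO:coordinate"))
print(parse_header_line("@SQ\tSN:chr1\tLN:248956422"))
print(parse_header_line("@CO\tfree text comment"))
```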
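4. The alignment section in detail
Overview of the alignment section
This is the core of a SAM file. Each line represents the linear alignment of a segment. Each line has 11 mandatory fields followed by a variable number of optional fields, all TAB-separated. When the information for a mandatory field is unavailable, a string field is set to '*' and an integer field to 0. The 11 mandatory fields are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, and QUAL.
The alignment fields in detail
Column 1: QNAME
The query template name. Reads with the same QNAME are regarded as coming from the same template. '*' means the information is unavailable. In general this field is taken from the first line of the FASTQ record. A chimeric alignment or multiple mapping causes one QNAME to appear several times in a SAM file.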
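Column 2: FLAG
The FLAG shown in a SAM record is either one of the values below or the sum of several of them. A single value has the meaning listed; a sum combines the meanings of every value it contains. The common values are:
1 (0x1): the read is paired (the template has multiple segments); unset for single-end data;
2 (0x2): each segment is properly aligned according to the aligner (a "proper pair");
4 (0x4): the read is unmapped;
8 (0x8): the mate (the other read of the pair, read 1 or read 2) is unmapped;
16 (0x10): the read maps to the reverse strand of the reference (it aligns as its reverse complement);
32 (0x20): the mate maps to the reverse strand of the reference;
64 (0x40): read 1 of a pair;
128 (0x80): read 2 of a pair;
256 (0x100): secondary alignment. A read can map to several positions on the reference; only one is the best, and the others are secondary;
512 (0x200): the read failed filtering (base quality, platform checks, and so on);
1024 (0x400): duplicate introduced by PCR (during library preparation) or by the instrument (during sequencing);
2048 (0x800): supplementary alignment. The record is one part of a chimeric alignment and covers only part of the read;
If a FLAG is not in the list above, it can be looked up on either of these two websites:
Website 1: Explain SAM Flags
For example, FLAG 88 = 8 (0x8) + 16 (0x10) + 64 (0x40), so its meaning is the combination of those three. (A real paired-end record would also carry 1, giving 89.)
Website 2: SAM Format Flag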
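Decoding a FLAG by hand is just a matter of testing each bit, so the websites above are easy to reproduce locally. A minimal sketch (the short bit names are labels chosen for this example, not official identifiers):

```python
# (bit value, label) pairs for the 12 FLAG bits described above.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for flag in (88, 99, 147):
    print(flag, decode_flag(flag))
```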

Other common FLAG values
One of the reads is unmapped (only one read of the pair aligned):
73, 133, 89, 121, 165, 181, 101, 117, 153, 185, 69, 137
Both reads are unmapped:
77, 141
Mapped within the insert size and in correct orientation:
99, 147, 83, 163
Mapped within the insert size but in wrong orientation:
67, 131, 115, 179
Mapped uniquely, but with wrong insert size:
81, 161, 97, 145, 65, 129, 113, 177
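Column 3: RNAME
Reference sequence NAME of the alignment, usually a chromosome name (for human: chr1-chr22, chrX, chrY, chrM). If RNAME is not '*', it must appear as the SN tag of an @SQ line in the header section. A read that did not align to the reference has '*' here, and in that case POS and CIGAR carry no value either.
Column 4: POS
The position on the reference where the read aligns, with a minimum of 1 (1-based leftmost). If the read did not align, POS is 0 and RNAME and CIGAR carry no value.
Column 5: MAPQ
MAPping Quality, the quality score of the alignment. The higher the score, the more reliable the position the read was mapped to. It equals -10 log10 P, where P is the probability that the mapping position is wrong, rounded to the nearest integer. A value of 255 means the mapping quality is not available. For example, MAPQ 20 (Q20) corresponds to an error probability of 0.01: 20 = -10 log10(0.01) = -10 × (-2).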
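The relation between MAPQ and the error probability can be checked with a couple of one-line functions. The function names are invented for this sketch; the same conversion applies to any Phred-scaled value.

```python
import math


def phred(p):
    """Phred-scale a probability 0 < p <= 1: -10 * log10(p), rounded."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    return round(-10 * math.log10(p))


def error_probability(q):
    """Inverse conversion: the probability that corresponds to Phred score q."""
    return 10 ** (-q / 10)


print(phred(0.01))            # the Q20 example from the text
print(error_probability(30))
```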
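Column 6: CIGAR
Short for Concise Idiosyncratic Gapped Alignment Report, it describes how the read aligns to the reference. The numbers in a CIGAR string are base counts, and the letters are operations: M alignment match (a sequence match or a mismatch), I insertion relative to the reference, D deletion from the reference, N skipped region on the reference, S soft clipping, H hard clipping, P padding, = sequence match, X sequence mismatch.
An example, 3M1D2M1I1M: 3 aligned bases (3M), then 1 base deleted (1D), then 2 aligned bases (2M), then 1 inserted base (1I), then 1 aligned base (1M).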
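Q1: How do N, S, and H differ?
N (skip)
Description: N means a region of the reference is skipped. It is typically used for mRNA-to-genome alignment, where it represents an intron. The read has no bases for that stretch of the reference, but the gap is expected because it is caused by an intron.
Example: when mRNA is aligned to the genome and a read spans an intron, the CIGAR string contains an N operation.
S (soft clipping)
Description: S means soft clipping. The clipped bases are not aligned to the reference but are still present in the read sequence (SEQ). Soft clipping is usually applied when one end of a read aligns poorly but the bases should still be kept.
Example: if the first 5 bases of a read do not match the reference but the rest does, the CIGAR may be 5S95M: the first 5 bases are soft-clipped and the following 95 bases align to the reference.
H (hard clipping)
Description: H means hard clipping. Unlike soft-clipped bases, hard-clipped bases are removed from the record and are not present in SEQ. Hard clipping is commonly seen in supplementary alignments, where the part of the read covered by another record is left out.
Example: a CIGAR of 5H95M means the first 5 bases of the read were hard-clipped and are absent from SEQ, and the 95 bases that remain align to the reference.
Differences:
N is an operation on the reference: a stretch of the reference is skipped and the read has no corresponding bases.
S and H are operations on the read: part of the read is not aligned to the reference.
The main difference between S and H is that soft-clipped (S) bases remain in the read sequence, while hard-clipped (H) bases are removed from it entirely.
Q2: How do D and N differ?
N can be read as a gap, such as the long gap between exons, while D is an actual deletion.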
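One way to make the differences between these operations concrete is to compute how many read bases and how many reference bases a CIGAR string covers. In the SAM specification M, I, S, = and X consume the read, while M, D, N, = and X consume the reference; H and P consume neither. The helper names below are invented for this sketch.

```python
import re

CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP_RE.findall(cigar)]


def query_length(cigar):
    """Number of read bases stored in SEQ for this CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


for cigar in ("3M1D2M1I1M", "5S95M", "5H95M", "50M1000N50M"):
    print(cigar, query_length(cigar), reference_length(cigar))
```

Soft clipping counts toward the read length but not the reference span, hard clipping counts toward neither, and both D and N extend the reference span without using any read bases.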
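Column 7: RNEXT
The name of the reference sequence to which the other read of the pair (the mate) aligns. It is '*' when the information is unavailable, as for single-end data, and '=' when the mate aligns to the same reference sequence as this read (RNEXT equals RNAME). If RNEXT is neither '*' nor '=', it must appear as the SN tag of an @SQ line in the header section. Columns 3 and 7 together show whether a read aligned to the reference and whether read 1 and read 2 aligned to the same chromosome.
Column 8: PNEXT
For paired-end data, the position on the reference where the mate aligns, with a minimum of 1 (1-based leftmost); 0 when the information is unavailable.
Column 9: TLEN
The observed template length, which for paired-end data approximates the insert size of the library. The value is signed: positive for the leftmost read of a pair and negative for the rightmost. It is 0 for single-end data or when the information is unavailable.
Column 10: SEQ
The read bases, taken from the second line of the FASTQ record. For a read aligned to the reverse strand (FLAG 16), SEQ is stored as the reverse complement of the original read.
Column 11: QUAL
The base qualities, taken from the fourth line of the FASTQ record (ASCII of Phred quality + 33).
Column 12 onward: optional fields
Optional fields may span several columns separated by \t, and not every line has them. Example optional fields from five records:
XT:A:R NM:i:0 X0:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:chr1,+102573964,50M,0
XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
(this record has no optional fields)
XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
Each optional field has the form TAG:TYPE:VALUE, where
TAG is a two-character string matching [A-Za-z][A-Za-z0-9];
TYPE is one of A (character), B (general array), f (real number), H (hexadecimal array), i (integer), or Z (string);
VALUE depends on TYPE: when TYPE is i the VALUE is an integer, and so on;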
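The TAG:TYPE:VALUE layout is simple enough to parse by hand. The sketch below converts the optional fields of one record into a dictionary of typed values. It is an illustration only: the function names are invented, and H values are left as hexadecimal strings rather than decoded into bytes.

```python
def parse_optional_field(field):
    """Parse one TAG:TYPE:VALUE field into (tag, typed value)."""
    # Split at most twice so that a ':' inside VALUE is preserved.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code == "B":
        # First element is the array subtype (c, C, s, S, i, I or f).
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(x) for x in items]
    return tag, value  # A, Z and H are kept as strings


def parse_optional_fields(columns):
    """Parse the list of columns from column 12 onward."""
    return dict(parse_optional_field(c) for c in columns)


line = "XT:A:R\tNM:i:0\tX0:i:4\tMD:Z:50\tXA:Z:chr1,+102573964,50M,0"
tags = parse_optional_fields(line.split("\t"))
print(tags)
```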
TAG details
The predefined tags fall into 6 groups, described below:
1.1 Additional Template and Mapping data
AM:i:score The smallest template-independent mapping quality of any segment in the same template as
this read. (See also SM.)
AS:i:score Alignment score generated by aligner.
BQ:Z:qualities Offset to base alignment quality (BAQ), of the same length as the read sequence. At the i-th read base, BAQ_i = Q_i - (BQ_i - 64) where Q_i is the i-th base quality.
CC:Z:rname Reference name of the next hit; ‘=’ for the same chromosome.
CG:B:I,encodedCigar Real CIGAR in its binary form if (and only if) it contains >65535 operations. This is a BAM file only tag as a workaround of BAM’s incapability to store long CIGARs in the standard way. SAM and CRAM files created with updated tools aware of the workaround are not expected to contain this tag. See also the footnote in Section 4.2 of the SAM spec for details.
CP:i:pos Leftmost coordinate of the next hit.
E2:Z:bases The 2nd most likely base calls. Same encoding and same length as SEQ. See also U2 for
associated quality values.
FI:i:int The index of segment in the template.
FS:Z:str Segment suffix.
H0:i:count Number of perfect hits.
H1:i:count Number of 1-difference hits (see also NM).
H2:i:count Number of 2-difference hits.
HI:i:i Query hit index, indicating the alignment record is the i-th one stored in SAM.
IH:i:count Number of alignments stored in the file that contain the query in the current record.
MC:Z:cigar CIGAR string for mate/next segment.
MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* String for mismatching positions.
The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
MQ:i:score Mapping quality of the mate/next segment.
NH:i:count Number of reported alignments that contain the query in the current record.
NM:i:count Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch (combined with the CIGAR field, this can be used to work out the number of mismatched bases). Note this means that ambiguity codes in both sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and ‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and hard clipping (‘N’, ‘P’, ‘S’ and ‘H’ CIGAR operations) do not count as mismatches, but insertions and deletions count as one mismatch per base. Note that historically this has been ill-defined and both data and tools exist that disagree with this definition.
PQ:i:score Phred likelihood of the template, conditional on the mapping locations of both/all segments
being correct.
Q2:Z:qualities Phred quality of the mate/next segment sequence in the R2 tag. Same encoding as QUAL.
R2:Z:bases Sequence of the mate/next segment in the template. See also Q2 for any associated quality
values.
SA:Z:(rname,pos,strand,CIGAR,mapQ,NM;)+ Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list. Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line. Strand is either ‘+’ or ‘-’, indicating forward/reverse strand, corresponding to FLAG bit 0x10. Pos is a 1-based coordinate.
SM:i:score Template-independent mapping quality, i.e., the mapping quality if the read were mapped as
a single read rather than as part of a read pair or template.
TC:i: The number of segments in the template.
TS:A:strand Strand (‘+’ or ‘-’) of the transcript to which the read has been mapped.
U2:Z: Phred probability of the 2nd call being wrong conditional on the best being wrong. The same
encoding and length as QUAL. See also E2 for associated base calls.
UQ:i: Phred likelihood of the segment, conditional on the mapping being correct.
1.2 Metadata (related to the header section of the SAM file; describes how the read was sequenced)
RG:Z:readgroup The read group to which the read belongs. If @RG headers are present, then readgroup must match the RG-ID field of one of the headers.
LB:Z:library The library from which the read has been sequenced. If @RG headers are present, then library must match the RG-LB field of one of the headers.
PG:Z:program_id Program. Value matches the header PG-ID tag if @PG is present.
PU:Z:platformunit The platform unit in which the read was sequenced. If @RG headers are present, then platformunit must match the RG-PU field of one of the headers.
CO:Z:text Free-text comments.
1.3 Barcodes (UMIs and single-cell cell barcodes)
DNA barcodes can be used to identify the provenance of the underlying reads. There are currently three varieties of barcodes that may co-exist: Sample Barcode, Cell Barcode, and Unique Molecular Identifier (UMI).
• Despite its name, the Sample Barcode identifies the Library and allows multiple libraries to be combined and sequenced together. After sequencing, the reads can be separated according to this barcode and placed in different “read groups” each corresponding to a library. Since the library was generated from a sample, knowing the library should inform of the sample. The barcode itself can be included in the PU field in the RG header line. Since the PU field should be globally unique, it is advisable to include specific information such as flowcell barcode and lane. It is not recommended to use the barcode as the ID field of the RG header line, as some tools modify this field (e.g., when merging files).
• The Cell Barcode is similar to the sample barcode but there is (normally) no control over the assignment of cells to barcodes (whose sequence could be random or predetermined). The Cell Barcode can help identify when reads come from different cells in a “single-cell” sequencing experiment. (In single-cell sequencing, this is the label used to trace which cell a read came from.)
• The UMI is intended to identify the (single- or double-stranded) molecule at the time that the barcode was introduced. This can be used to inform duplicate marking and make consensus calling in ultra-deep sequencing. Additionally, the UMI can be used to (informatically) link reads that were generated from the same long molecule, enabling long-range phasing and better informed mapping. In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes. These templates can also be considered duplicates even though technically they may have different UMIs. Multiple UMIs can be added by a protocol, possibly at different time-points, which means that specific knowledge of the protocol may be needed in order to analyze the resulting data correctly. (In RNA-seq, UMIs allow "absolute quantification" of the original RNA molecules.)
BC:Z:sequence Barcode sequence (Identifying the sample/library), with any quality scores (optionally) stored in the QT tag. The BC tag should match the QT tag in length. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes and places a hyphen (‘-’) between the barcodes from the same template.
QT:Z:qualities Phred quality of the sample barcode sequence in the BC tag. Same encoding as QUAL, i.e., Phred score + 33. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with spaces (‘ ’) between the different strings from the same template.
CB:Z:str Cell identifier, consisting of the optionally-corrected cellular barcode sequence and an optional suffix. The sequence part is similar to the CR tag, but may have had sequencing errors etc corrected. This may be followed by a suffix consisting of a hyphen (‘-’) and one or more alphanumeric characters to form an identifier. In the case of the cellular barcode (CR) being based on multiple barcode sequences the recommended implementation concatenates all the (corrected or uncorrected) barcodes with a hyphen (‘-’) between the different barcodes. Sequencing errors etc aside, all reads from a single cell are expected to have the same CB tag.
CR:Z:sequence+ Cellular barcode. The uncorrected sequence bases of the cellular barcode as reported by the sequencing machine, with the corresponding base quality scores (optionally) stored in CY. Sequencing errors etc aside, all reads with the same CR tag likely derive from the same cell. In the case of the cellular barcode being based on multiple barcode sequences the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.
CY:Z:qualities+ Phred quality of the cellular barcode sequence in the CR tag. Same encoding as QUAL, i.e., Phred score + 33. The lengths of the CY and CR tags must match. In the case of the cellular barcode being based on multiple barcode sequences the recommended implementation concatenates all the quality strings with spaces (‘ ’) between the different strings.
MI:Z:str Molecular Identifier. A unique ID within the SAM file for the source molecule from which this read is derived. All reads with the same MI tag represent the group of reads derived from the same source molecule.
OX:Z:sequence+ Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the BZ tag. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.
BZ:Z:qualities+ Phred quality of the (uncorrected) unique molecular identifier sequence in the OX tag. Same encoding as QUAL, i.e., Phred score + 33. The OX tags should match the BZ tag in length. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (‘ ’) between the different strings.
RX:Z:sequence+ Sequence bases from the unique molecular identifier. These could be either corrected or uncorrected. Unlike MI, the value may be non-unique in the file. Should be comprised of a sequence of bases. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes. If the bases represent corrected bases, the original sequence can be stored in OX (similar to OQ storing the original qualities of bases.)
QX:Z:qualities+ Phred quality of the unique molecular identifier sequence in the RX tag. Same encoding as QUAL, i.e., Phred score + 33. The qualities here may have been corrected (Raw bases and qualities can be stored in OX and BZ respectively.) The lengths of the QX and the RX tags must match. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (‘ ’) between the different strings.
1.4 Original data
OA:Z:(RNAME,POS,strand,CIGAR,MAPQ,NM;)+ The original alignment information of the record prior to realignment or unalignment by a subsequent tool. Each original alignment entry contains the following six field values from the original record, generally in their textual SAM representations, separated by commas (‘,’) and terminated by a semicolon (‘;’): RNAME, which must be explicit (unlike RNEXT, ‘=’ may not be used here); 1-based POS; ‘+’ or ‘-’, indicating forward/reverse strand respectively (as per bit 0x10 of FLAG); CIGAR; MAPQ; NM tag value, which may be omitted (though the preceding comma must be retained).
In the presence of an existing OA tag, a subsequent tool may append another original alignment entry after the semicolon, adding to, rather than replacing, the existing OA information.
The OA field is designed to provide record-level information that can be useful for understanding the provenance of the information in a record. It is not designed to provide a complete history of the template alignment information. In particular, realignments resulting in the removal of Secondary or Supplementary records will cause the loss of all tags associated with those records, and may also leave the SA tag in an invalid state.
OC:Z:cigar Original CIGAR, usually before realignment. Deprecated in favour of the more general OA.
OP:i:pos Original 1-based POS, usually before realignment. Deprecated in favour of the more general OA.
OQ:Z:qualities Original base quality, usually before recalibration. Same encoding as QUAL.
1.5 Annotation and Padding
The SAM format can be used to represent de novo assemblies, generally by using padded reference sequences and the annotation tags described here. See the Guide for Describing Assembly Sequences in the SAM Format Specification for full details of this representation.
CT:Z:strand;type(;key(=value)?)*
Complete read annotation tag, used for consensus annotation dummy features.
The CT tag is intended primarily for annotation dummy reads, and consists of a strand, type and zero or more key=value pairs, each separated with semicolons. The strand field has four values as in GFF3, and supplements FLAG bit 0x10 to allow unstranded (‘.’), and stranded but unknown strand (‘?’) annotation. For these and annotation on the forward strand (strand set to ‘+’), do not set FLAG bit 0x10. For annotation on the reverse strand, set the strand to ‘-’ and set FLAG bit 0x10.
The type and any keys and their optional values are all percent encoded according to RFC3986 to escape meta-characters ‘=’, ‘%’, ‘;’, ‘|’ or non-printable characters not matched by the isprint() macro (with the C locale). For example a percent sign becomes ‘%25’.
PT:Z:annotag(|annotag)*
where each annotag matches start;end;strand;type(;key(=value)?)* Read annotations for parts of the padded read sequence. The PT tag value has the format of a series of annotation tags separated by ‘|’, each annotating a sub-region of the read. Each tag consists of start, end, strand, type and zero or more key=value pairs, each separated with semicolons. Start and end are 1-based positions between one and the sum of the M/I/D/P/S/=/X CIGAR operators, i.e., SEQ length plus any pads. Note any editing of the CIGAR string may require updating the PT tag coordinates, or even invalidate them. As in GFF3, strand is one of ‘+’ for forward strand tags, ‘-’ for reverse strand, ‘.’ for unstranded or ‘?’ for stranded but unknown strand. The type and any keys and their optional values are all percent encoded as in the CT tag.
1.6 Technology-specific data
FZ:B:S,intensities Flow signal intensities (the light-intensity data captured during sequencing) on the original strand of the read, stored as (uint16_t) round(value * 100.0).
1.6.1 Color space
CM:i:distance Edit distance between the color sequence and the color reference (see also NM).
CS:Z:sequence Color read sequence on the original strand of the read. The primer base must be included.
CQ:Z:qualities Color read quality on the original strand of the read. Same encoding as QUAL; same
length as CS.
2 Locally-defined tags
You can freely add new tags. Note that tags starting with ‘X’, ‘Y’, or ‘Z’ and tags containing lowercase letters in either position are reserved for local use and will not be formally defined in any future version of this specification. If a new tag may be of general interest, it may be useful to have it added to this specification. Additions can be proposed by opening a new issue at https://github.com/samtools/hts-specs/issues and/or by sending email to samtools-devel@lists.sourceforge.net.
References
[1] Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools[J]. Bioinformatics, 2009, 25(16): 2078-2079.
[2] https://www.samformat.info/sam-format-flag
[3] http://note.youdao.com/share/?id=312fa04209cb87f7674de9a9544f329a&type=note#/
[4] https://samtools.github.io/hts-specs/SAMv1.pdf
[5] https://yulijia.net/slides/bioinfomatcis_for_medical_students/2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II.html#1
[6] http://samtools.github.io/hts-specs/SAMtags.pdf
