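All animal work was reviewed and approved by the Lawrence Berkeley National Laboratory Animal Welfare and Research Committee. Mus musculus of the C57BL/6N strain were used for tissue collection at all developmental stages. For E14.5 and P0, breeding animals were purchased from Charles River Laboratories (C57BL/6NCrl strain) and Taconic Biosciences (C57BL/6NTac strain). For all remaining developmental stages, breeding animals were purchased from Charles River Laboratories (C57BL/6NCrl strain). Wild-type male and female mice were mated using a standard timed breeding strategy. Embryos and P0 pups were collected for dissection using approved institutional protocols. Embryos were excluded if they were not at the expected developmental stage. To avoid sample degradation, only one litter of embryos or P0 pups was processed at a time, and tissue was kept ice-cold during dissection. Collection tubes for each tissue type were kept in a dry ice–ethanol bath so that tissue samples could be flash-frozen immediately upon dissection. Tissue from multiple embryos was pooled in the same collection tube, and at least two separate collection tubes were collected for each tissue-stage for biological replication. Because there were no separate treatment and control groups to evaluate, no blinding of experimenters was performed. Randomization was not feasible given the scale of production. Tissue was stored at −80 °C in a freezer or on dry ice until further processing. A step-by-step protocol for tissue collection, including details on how embryonic stage was determined, is available on the ENCODE Project website at https://www.encodeproject.org/documents/631AA21C-8E48-467E-8CAC-D40C875B3913/@@download/attachment/StandardTissueExcisionProtocol_02132017.pdf.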
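The full ChIP–seq data series comprises more than 66 billion sequencing reads from 564 ChIP–seq experiments, each consisting of two biological replicates from different embryo pools (n = 1,128 replicates in total). All ChIP–seq experiments from E11.5 to P0 were performed as previously described5. Because of low input amounts, all E10.5 experiments used a slightly modified ChIP–seq protocol (micro-ChIP–seq). Detailed protocols for standard and micro-ChIP–seq, including the antibodies used and the antibody validation performed, are available at https://www.encodeproject.org/, linked to each of the experiments described here. They can also be found at the links below.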
Tissue fixation and sonication: https://www.encodeproject.org/documents/3125496b-c833-4414-bf5f-84dd633eb30d/@@download/attachment/Ren_Tissue_Fixation_and_Sonication_v060614.pdf. Immunoprecipitation: https://www.encodeproject.org/documents/89795b31-e65a-42ca-9d7b-d75196f6f4b3/@@download/attachment/Ren%20Lab%20ENCODE%20Chromatin%20Immunoprecipitation%20Protocol_V2.pdf. Library preparation: https://www.encodeproject.org/documents/4f73fbc3-956e-47ae-aa2d-41a7df552c81/@@download/attachment/Ren_ChIP_Library_Preparation_v060614.pdf.
Tissue fixation and sonication: https://www.encodeproject.org/documents/1FCAAB50-6CA0-4778-88CB-5F6B85170D21/@@download/attachment/attachment/ren%20lab%20lab%20CencecteNcectecode%20Tissue%20FIXEND; immunoprecipitation and library preparation: https://www.encodeproject.org/documents/18580E80-0907-4258-A412-A412-46BCC37BD040/@@@@download/attachment/ren%%20lab;
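The full ATAC–seq data series comprises more than 7 billion sequencing reads from 66 experiments (n = 132 replicates in total). Our ATAC–seq procedure is based on a previously published method8, with modifications to optimize it for frozen tissue. Briefly, tissue was pulverized in liquid nitrogen with a mortar and pestle, and nuclei were then isolated by resuspension in nuclear permeabilization buffer (PBS, 1 mM DTT, 0.2% IGEPAL-CA630, 15% BSA, 1× complete protease inhibitor cocktail) followed by incubation. Our full ATAC–seq protocol is available via the ENCODE data portal here: https://www.encodeproject.org/documents/4a2fc974-f021-4f85-ba7a-bd401fe682d1/@@download/attachment/RenLab_ATACseq_protocol_20170130.pdf. We required at least 20 million usable ATAC–seq read pairs per data set and a minimum fraction of reads overlapping TSSs (FROT) of 0.1 (Extended Data Fig. 3). We used FROT as a measure of signal-to-noise in ATAC–seq data sets because TSSs are widely marked by open chromatin, even in tissues in which the gene is not expressed. We calculated FROT for each library as the number of reads mapping within 1 kb of a GENCODE v4 TSS, divided by the total number of usable reads. ATAC–seq data were highly reproducible between biological replicates of the same tissue-stage, as measured by both Pearson and Spearman correlation (Extended Data Fig. 3). In addition, multidimensional scaling analysis of ATAC–seq enrichment across the identified peaks confirmed that samples tend to cluster primarily by tissue type and developmental stage (Extended Data Fig. 3).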
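As an illustration of the FROT quality-control criterion described above, the following minimal Python sketch computes the fraction of reads that fall within 1 kb of a TSS and applies the two thresholds (20 million usable read pairs, FROT ≥ 0.1). It is not the code used in this study. It assumes that each read is summarized by a single coordinate on one chromosome and that 'within 1 kb' means ±1 kb of the TSS; the function names frot and passes_qc are hypothetical.

```python
from bisect import bisect_left


def frot(read_positions, tss_positions, window=1000):
    """Fraction of reads lying within `window` bp of any TSS (one chromosome).

    Assumption: each read is represented by a single coordinate.
    """
    tss = sorted(tss_positions)
    if not read_positions:
        return 0.0
    hits = 0
    for pos in read_positions:
        i = bisect_left(tss, pos)
        # The nearest TSS is either just before or just after the insertion point.
        neighbours = [tss[j] for j in (i - 1, i) if 0 <= j < len(tss)]
        if any(abs(pos - t) <= window for t in neighbours):
            hits += 1
    return hits / len(read_positions)


def passes_qc(usable_read_pairs, frot_value, min_pairs=20_000_000, min_frot=0.1):
    """Apply the two thresholds stated in the text."""
    return usable_read_pairs >= min_pairs and frot_value >= min_frot


# Toy example: three of four reads lie within 1 kb of a TSS.
example = frot([100, 1500, 5000, 9000], [1000, 8500])
```

In practice the counting would be done per chromosome on the aligned, de-duplicated reads; the sorted-list lookup is used here only to keep the sketch free of external dependencies.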
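Histone ChIP–seq data were analysed using a software pipeline implemented by the ENCODE Data Coordinating Center (DCC) for the ENCODE Consortium. Each step of the pipeline corresponds to a script written in the Python programming language that assembles the input files, runs external programs (for example, the MACS2 peak caller) and calculates quality-control metrics. The method is similar to that previously described for ENCODE62, with the following modifications: BWA version 0.7.10 and samtools version 1.0 were used for the mapping step, and MACS2 version 2.1.0 was used for signal track generation and peak calling. To ensure adequate sampling of noise for the subsequent replicate comparison, peaks were initially called at a relaxed P value threshold of 1 × 10−2. These relaxed peak sets were generated for each biological replicate, for the pooled replicates, for pseudoreplicates of each true replicate and for pseudoreplicates of the pooled replicates (each pseudoreplicate contains half of the reads, sampled without replacement). A peak from the pooled-replicate set was retained in the replicated peak set if it overlapped, by at least half of its length (in bases), peaks from both biological replicates. In addition, peaks that overlapped in the two pooled pseudoreplicates were added to the replicated peak set. In this way, one very strong biological replicate can 'rescue' peaks that are only marginal in the second replicate. The pipeline runs on the DNAnexus (https://www.dnanexus.com/) web platform, backed by cloud computing from Amazon Web Services (AWS), and is the pipeline used to analyse all ENCODE histone ChIP–seq experiments. The platform provides both an API for programmatic execution and a web-based interface for interactive execution of the same workflows. The ENCODE DCC uses this approach to ensure that primary data from different laboratories within the consortium are processed uniformly, thereby minimizing factors that could confound subsequent comparisons63. The ENCODE DCC analysed the experiments in parallel and deposited the results at the ENCODE portal62 (https://www.encodeproject.org/).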
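The replicate-consistency rule described above can be sketched in Python as follows. This is a simplified illustration on half-open (start, end) intervals from a single chromosome, not the ENCODE pipeline code. It assumes that the same half-length overlap criterion is applied to the pooled pseudoreplicates as to the true replicates, which the text does not state explicitly; all function names are hypothetical.

```python
def overlap_bp(a, b):
    """Number of overlapping bases between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def supported(peak, others, min_frac=0.5):
    """True if some interval in `others` covers at least `min_frac` of `peak`."""
    length = peak[1] - peak[0]
    return any(overlap_bp(peak, o) >= min_frac * length for o in others)


def replicated_peaks(pooled, rep1, rep2, pseudo1, pseudo2):
    """Keep pooled peaks supported by both true replicates, or 'rescued'
    by both pooled pseudoreplicates (assumed to use the same criterion)."""
    kept = []
    for peak in pooled:
        in_true_reps = supported(peak, rep1) and supported(peak, rep2)
        in_pseudoreps = supported(peak, pseudo1) and supported(peak, pseudo2)
        if in_true_reps or in_pseudoreps:
            kept.append(peak)
    return kept


# Toy example: the first peak is replicated, the second is not,
# and the third is rescued by the pooled pseudoreplicates.
result = replicated_peaks(
    pooled=[(0, 100), (200, 300), (400, 500)],
    rep1=[(10, 90), (210, 240)],
    rep2=[(0, 60), (200, 300)],
    pseudo1=[(400, 480)],
    pseudo2=[(420, 500)],
)
```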
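To facilitate comparisons across stages, one peak list per tissue per mark was produced by merging the replicated peaks from every stage within each tissue. These peaks were then scored by ChIP–seq fold enrichment over input for each stage in the corresponding tissue using bigWigAverageOverBed (https://github.com/ENCODE-DCC/kentUtils/blob/master/bin/linux.x86_64/bigWigAverageOverBed). The values were normalized across stages to eliminate the potential confounding effect of differences in signal distribution between stages. These normalized score tables (one per mark, tissue and replicate) were used in the analyses below.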
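For correlation between the replicates of each experiment, we used Pearson's correlation, as plotted in Extended Data Fig. 2d. Narrow marks (H3K4me3, H3K4me2, H3K27ac and H3K9ac) have more strongly enriched peaks and tend to be more strongly correlated than broad marks (H3K27me3, H3K4me1, H3K9me3 and H3K36me3). For correlation between stages, we used Pearson's correlation to compare replicates for each mark as above, but compared all stages with one another within one tissue and one mark. We then separated these correlations according to the number of stages separating the data sets being compared: for example, zero for true biological replicates from the same stage, or 7 for a comparison of an E11.5 data set with a P0 data set. These correlations are plotted in Fig. 1e. To further facilitate comparison across tissues, an approach similar to that described above in 'Analysis of data from individual histone marks' was used, but in this case one master peak list per mark was produced by merging the replicated peaks from all tissue-stages. As above, these peaks were then scored by ChIP–seq fold enrichment over input for each tissue-stage, but in this case using data pooled from both replicates. These values were then normalized across tissue-stages, and the resulting master score table (one per mark) was used for hierarchical clustering, performed in R with default parameters. The resulting dendrograms are plotted in Extended Data Fig. 4a. For the H3K27ac k-means clustering in Fig. 1, an additional data processing step was performed before clustering: across each row, values were converted to a unit vector in R (x/sqrt(sum(x^2))) to prevent the overall level of enrichment from dominating the clustering. These unit vector values were used only for clustering; the values plotted are the normalized H3K27ac enrichment values from the score table described above. k-means clustering was performed in R with k = 8 and default parameters. Rows were ordered within each cluster by mean normalized enrichment.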
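The row-wise unit-vector transformation applied before k-means clustering can be illustrated with a short Python equivalent of the R expression x/sqrt(sum(x^2)). This is a sketch only; the function name is hypothetical, and the handling of all-zero rows (returned unchanged) is an assumption that the text does not address.

```python
import math


def to_unit_vector(row):
    """Scale a row so that the sum of its squared values equals 1."""
    norm = math.sqrt(sum(x * x for x in row))
    # Assumption: an all-zero row is returned unchanged to avoid division by zero.
    return [x / norm for x in row] if norm > 0 else list(row)


# Two regions with the same temporal pattern but a tenfold difference in
# overall enrichment map to the same unit vector.
weak = to_unit_vector([1.0, 2.0, 2.0])
strong = to_unit_vector([10.0, 20.0, 20.0])
```

After this scaling, clustering groups rows by the shape of their profile across tissue-stages and not by their absolute enrichment, which is the stated purpose of the step.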
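The whole genome was divided into 1-kb tiled bins. The average fold-enrichment signal for each bin was calculated using bigWigAverageOverBed. Bins that overlapped merged peaks by at least 20% were designated as peak bins. The average fold-enrichment signal from each peak bin was normalized within a given tissue. The signal strength of each peak was calculated as the sum of the signals of all bins overlapping that peak. Principal component analysis was performed on the peak signals for each histone mark with the R function 'prcomp'. PC1, PC2 and PC3 values were plotted for each sample.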
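To illustrate the characteristic enrichment patterns at active and silent genes in Extended Data Fig. 1b, we used a conservative definition of 'active' genes as those with reads per kilobase per million mapped reads (RPKM) > 10 in every tissue-stage assessed here; 'silent' genes were defined as those with RPKM < 2 in all tissue-stages. Metagene profiles were plotted with deeptools plotProfile64, using data from E15.5 heart.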
ATAC–seq data were analysed using a standardized software pipeline developed by the ENCODE DCC for the ENCODE Consortium to perform quality-control analysis and read alignment. ATAC–seq reads were trimmed with a custom adaptor script and mapped to mm10 using bowtie version 2.2.6; samtools version 1.2 was used to eliminate PCR duplicates. MACS2 version 2.1.1.20160309 was used for generating signal tracks and peak calling with the following parameters: --nomodel --shift 37 --ext 73 --pval 1e-2 -B --SPMR --call-summits. To produce a set of ‘replicated’ ATAC–seq peaks for analysis, the peak calling steps above were performed for each pair of replicates independently as well as for a pooled set of data from both replicates. The intersectBed tool from the bedtools v2.27.1 suite was used to identify a set of replicated peaks, which we define as the subset of peaks called in the pooled set that were also present independently in both replicate peak call sets.
To obtain a uniform d-TAC catalogue that can enable multi-dimensional analysis across all 66 tissue-stages, the aforementioned replicated peak sets for each sample were concatenated, merged, sorted, and then labelled using the mergeBed and sortBed tools from the bedtools v2.27.1 suite. The intersectBed tool was used to associate each d-TAC with the original tissue-stages where its constituent peaks were accessible. The catalogue was further categorized as being TSS distal or proximal based on a ±1-kb window around GENCODE v4 TSSs.
To evaluate the sensitivity of our peak calls in detecting potential cis-regulatory elements, we calculated the true positive rate, or fraction of peaks recovered, for every applicable tissue-stage with respect to two reference sets: actively transcribed promoters; and enhancers from the VISTA enhancer database (accessed 22 July 2017) with activity at E11.5. Using matched RNA-seq downloaded from https://www.encodeproject.org/, transcripts with counts of ≥10 TPM were classified as actively transcribed for each tissue-stage.
Catalogue specificity was assessed by calculating the true negative rate of each tissue-stage’s d-TACs against GENCODE v4 TSSs that were not accessible in matched DNase-seq data from https://www.encodeproject.org/. To further probe the tissue-specificity of the d-TAC catalogue, the overlap between d-TACs for each tissue at E11.5 and enhancers that showed activity in the matching tissue pattern was calculated and compared to a background hit rate of enhancers with activity in any pattern. Enrichment significance was computed using a binomial test.
To calculate enrichment in ChromHMM states, the d-TAC catalogue was overlapped with autosomal ChromHMM state calls for each tissue-stage (pooled or replicate call set, as indicated). Enrichment for a given state s in a particular tissue-stage was calculated as the observed number of base pairs of the d-TAC catalogue that overlapped state s, divided by the total number of base pairs expected to overlap state s on the basis of its genome coverage (total bp coverage of d-TAC catalogue × fraction of genome covered by state s).
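The observed-over-expected calculation defined above can be written out as a small Python function. This sketch takes base-pair totals as inputs and does not compute the overlaps themselves; the function and argument names are hypothetical, and the numbers in the example are invented for illustration.

```python
def state_enrichment(observed_bp, catalogue_bp, state_bp, genome_bp):
    """Enrichment of the d-TAC catalogue in one ChromHMM state.

    observed_bp  : bp of the catalogue overlapping state s
    catalogue_bp : total bp covered by the catalogue
    state_bp     : bp of the genome assigned to state s
    genome_bp    : total bp of annotated genome
    """
    # Expected overlap = catalogue coverage x fraction of genome in state s.
    expected_bp = catalogue_bp * (state_bp / genome_bp)
    return observed_bp / expected_bp


# Invented numbers: a state covering 0.8% of the genome, with 2 Mb of a
# 50-Mb catalogue falling in that state.
example = state_enrichment(
    observed_bp=2_000_000,
    catalogue_bp=50_000_000,
    state_bp=20_000_000,
    genome_bp=2_500_000_000,
)
```

A value above 1 indicates that the catalogue overlaps the state more than expected from the genome coverage of that state alone.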
To identify differentially accessible d-TACs, for each d-TAC in the uniform catalogue, we counted the number of ATAC–seq reads that overlapped the d-TAC for each tissue-stage and replicate using the coverage function in bedtools v2.27.1. For each tissue, d-TACs at any stage were classified as temporally dynamic if they showed a significant change in accessibility (fold change ≥2, P ≤ 0.05) between any sequential stages of development, using DESeq2.
To investigate the relationship between changes in accessibility and changes in chromatin state, the dynamic d-TACs were classified as either gaining (positive log[fold change]) or losing (negative log[fold change]) accessibility. For each tissue-stage-transition (n to n + 1), these sets of gain- or loss-of-accessibility d-TACs were overlapped with ChromHMM state calls for stages n and n + 1. Enrichment was calculated by taking the observed fraction of dynamic base pairs that overlapped each combination of states (state at n, state at n + 1) and dividing by the expected fraction of base pairs that overlapped each state combination based on the dynamic and non-dynamic d-TACs.
To investigate the temporal relationship between H3K27ac and chromatin accessibility, dynamic strong-enhancers (replicated, ChromHMM state U5) at each stage-transition were overlapped against d-TACs for the respective tissue to identify matching enhancers and d-TACs. In cases where more than one d-TAC overlapped an enhancer, the d-TAC with the largest number of overlapping base pairs was selected. The sequential log[fold-change] in ATAC–seq signal was evaluated at every possible stage-transition for these matching d-TACs and a mean was taken. These stage-transitions were converted to ‘offsets’ relative to the strong enhancers and the fold-changes averaged for the purpose of deriving a global trend (that is, for dynamic enhancers at E11.5–E12.5; E11.5–E12.5 is an offset of 0, E12.5–E13.5 is an offset of 1, and so on until E16.5–P0 is an offset of 5). The inverse analysis was also performed to assess the log[fold-change] in H3K27ac at dynamic d-TACs.
A correlative map between d-TACs was generated for each chromosome by calculating the Pearson correlation coefficient (PCC) for each pair of d-TACs, using the ATAC–seq read counts normalized to RPKM and log2-transformed with a small pseudocount. We define ‘correlated d-TACs’ as those in the same TAD (as defined by mouse embryonic stem (ES) cells) with a pairwise PCC ≥ 0.7.
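The pairwise criterion for 'correlated d-TACs' can be sketched in Python as below. The inputs are assumed to be RPKM values for two d-TACs across the same ordered set of samples, together with a TAD identifier for each d-TAC. The pseudocount of 0.01 is an arbitrary illustrative value (the text says only 'a small pseudocount'), and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # Assumption: constant profiles are treated as uncorrelated.
    return cov / math.sqrt(vx * vy)


def are_correlated(rpkm_a, rpkm_b, tad_a, tad_b, min_pcc=0.7, pseudocount=0.01):
    """True if two d-TACs share a TAD and have PCC >= min_pcc on log2 RPKM."""
    if tad_a != tad_b:
        return False
    log_a = [math.log2(v + pseudocount) for v in rpkm_a]
    log_b = [math.log2(v + pseudocount) for v in rpkm_b]
    return pearson(log_a, log_b) >= min_pcc


# Toy profiles across four samples.
same_tad = are_correlated([1, 2, 4, 8], [2, 4, 8, 16], "TAD_1", "TAD_1")
other_tad = are_correlated([1, 2, 4, 8], [2, 4, 8, 16], "TAD_1", "TAD_2")
```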
To assess d-TAC correlations as a function of genomic distance, we assigned each d-TAC to a 10-kb bin. For each bin A, the correlation was measured between its d-TACs and those of bin B, at various distances away ranging from 10 kb to 2 Mb. The average of these correlations across all chromosomes was plotted as a function of distance. Additionally, to investigate the validity of using mouse ES cell TAD boundaries as a constraint for the correlative map, the mean correlations between d-TACs at various genomic distances were compared for pairs located within the same TAD and those not sharing a TAD. The significance of the difference in correlation between intra-TAD and inter-TAD d-TAC pairs was calculated using the Wilcoxon signed-rank test.
To enable comparison to GWAS of human phenotypes, we used liftOver with default settings to convert d-TACs from mm10 to hg19 genomic coordinates. We then defined novel d-TACs by removing those that overlapped DNaseI hypersensitivity sites from any cell line or tissue in two published data sets9,36, one of which included embryonic tissue. We obtained index variants for all traits in the GWAS catalogue (https://www.ebi.ac.uk/gwas/api/search/downloads/full) and retained a unique set of variants that were identified as genome-wide significant (P < 5 × 10−8) in GWAS of individuals with European ancestry. To obtain a background set of variants for enrichment testing, we used the filtered index variants as the input for SNPsnap60, which matches based on (1) minor allele frequency, (2) distance to the nearest annotated gene, (3) gene density in the surrounding region, and (4) number of SNPs in linkage disequilibrium (LD), with the following parameters: European population, ten matched SNPs, exclude HLA SNPs and input SNPs, and report clumping. As GWAS index variants are not necessarily causal and can be in LD with the true causal variant, we next defined loci for all index and matched background variants as all SNPs in high LD (r2> 0.8) with the variant in European 1000 Genomes65 samples using PLINK v1.90p66. We then calculated the number of GWAS and background loci with at least one variant that overlapped either all d-TACs or novel d-TACs and used a hypergeometric test to assess the enrichment significance of GWAS loci compared to matched background loci.
To test for enrichment of complex phenotypes and diseases with publicly available summary statistics, we first defined sets of human orthologues of enhancer d-TACs. For each tissue, we collapsed all strong and weak enhancer chromatin states (En-Sd, En-Sp, En-W) across time points and used liftOver to convert genomic coordinates from mm10 to hg19. We then intersected orthologous enhancers with orthologous d-TACs to obtain a set of orthologous enhancer d-TACs for each tissue. We collected summary statistics for 41 human traits and diseases (Supplementary Table 11), converting odds ratios and confidence intervals to log odds ratios and standard errors for binary traits and estimating allele frequencies from the European subset of 1000 Genomes where unavailable from the summary data. We used polyTest61 to test for enrichment of variant effects on each phenotype within orthologous enhancer d-TAC annotations with the parameters ‘--univariate --maf 0.05 --high-mem’. We used hierarchical clustering on signed −log10(P) for enrichments that were z-score normalized within studies to group similar phenotypes.
We obtained the aggregate accessible chromatin peaks for each cell cluster in the P56 mouse forebrain and removed peaks that overlapped promoters (2 kb upstream of mm10 RefSeq TSSs), retaining sets of promoter-distal peaks38. For this analysis, we did not restrict peaks to enhancer chromatin states, as doing so would potentially bias results for cell types that were over-represented in the bulk tissue. We converted genomic coordinates for promoter-distal peaks from mm10 to hg19 using liftOver. We then used polyTest to assess cell-type-specific enrichment of phenotypes and diseases that showed at least nominally significant enrichment (P < 0.05) in mouse forebrain d-TACs from the previous analysis. We used hierarchical clustering on z-score-normalized signed −log10(P) for enrichment as described in the previous analysis and plotted results for traits that showed at least nominal significance in at least one cell cluster.
We note that the chromatin state annotations reported here are specific to this study and are distinct from larger efforts by the ENCODE Data Analysis Center to integrate data from across the entire consortium into a comprehensive ‘encyclopaedia.’ We also note that we excluded E10.5 from the ChromHMM analysis because this stage did not have the full complement of eight histone modification profiles, and testing showed that models with only six marks failed to capture the full set of states derived from eight marks (Extended Data Fig. 7). However, we provide a set of ChromHMM annotations using the six-mark model on E10.5 on our website here: http://renlab.sdsc.edu/renlab_website//download/encode3-mouse-histone-atac/.
Chromatin data sets (.bam files) were downloaded from the ENCODE DCC on 15 October 2016. De-duplicated .bam files for each sample, along with their respective input controls, were binarized using the binarizeBam function of ChromHMM, with default parameters. Models considering 2–24 states were learned separately on the two replicates using the LearnModel function, with default parameters. For the rest of the analyses, we leveraged the availability of two distinct replicate time series; namely, we applied the same strategy separately and compared the results a posteriori. The conclusions obtained were invariably consistent, suggesting that the inferences on a single time series (at least in terms of global genomic patterns) are highly reproducible.
We devised two strategies to identify the minimal number of states that captures the combinations of histone modifications present in the data, both of which converged on a 15-state model. First, the ChromHMM CompareModels function was run separately on the two series. This function compares the emission parameters of a selected model to a set of models (in terms of Pearson’s correlation), and outputs the maximum correlation of each state in the selected model with its best matching state in each other model. We used this function to compare the ‘full’ model (the one that considers 24 states) to the states in the simpler models. We then calculated the median correlation of all the 24 states against the simpler models, plotted these numbers against the number of states in the model and looked at the number of states at which both series reached a plateau. As a complementary strategy, the emission probabilities from all the 23 models (considering 2–24 states) from both replicates were clustered together. The rationale behind this strategy is that very similar states across models will tend to cluster together, so there must be an optimal number of clusters corresponding to the optimal number of states in the model. To this end, we applied k-means clustering with k between 2 and 24, and evaluated the goodness of the separation for each k as the ratio between the ‘between sum of squares’ (referred to as Between SS) and the ‘total sum of squares’ (Total SS). Very cohesive, well-separated clusters tend to approach a ratio of 1. Given a value of k, the ratio was averaged over one hundred realizations of the clustering. The ratio observed for k = 24 was used as a maximum, and the optimal number of states was then defined by the smallest value of k that showed a ratio equal to or higher than 95% of the maximum.
To compare the eight-mark model to a six-mark model (Extended Data Fig. 7) we used the 66 tissue-stages for which we had the full complement of eight marks, then downsampled to six marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3 and H3K36me3), and repeated the analyses described above to arrive at an optimal number of states. We then compared the 8-mark 15-state model to 6-mark models with two different numbers of states (11 states and 16 states) representing the minimum and maximum of the optimal range, respectively.
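The second model-selection strategy (Between SS/Total SS ratio and the 95% rule) can be illustrated with the Python sketch below. It assumes that the clustering itself has already been run, so that cluster labels and the mean ratio for each k are available; the function names and the ratios in the example are hypothetical and are not results from this study.

```python
def between_total_ratio(points, labels):
    """Between SS / Total SS for a given clustering of equal-length tuples."""
    dims = range(len(points[0]))
    grand = [sum(p[d] for p in points) / len(points) for d in dims]
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in dims)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centre = [sum(p[d] for p in members) / len(members) for d in dims]
        within_ss += sum((p[d] - centre[d]) ** 2 for p in members for d in dims)
    # Total SS = Between SS + Within SS, so Between SS = Total SS - Within SS.
    return (total_ss - within_ss) / total_ss


def optimal_k(mean_ratio_by_k, fraction=0.95):
    """Smallest k whose mean ratio reaches `fraction` of the ratio at the largest k."""
    k_max = max(mean_ratio_by_k)
    threshold = fraction * mean_ratio_by_k[k_max]
    return min(k for k, r in mean_ratio_by_k.items() if r >= threshold)


# Invented mean ratios (each would be an average over 100 clustering runs).
ratios = {2: 0.40, 5: 0.70, 10: 0.85, 15: 0.93, 20: 0.95, 24: 0.97}
chosen = optimal_k(ratios)
```

With these invented ratios the threshold is 0.95 × 0.97 = 0.9215, so the smallest qualifying k is 15.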
The segmentation was run separately for each sample, using the MakeSegmentation function of ChromHMM (default parameters) and the model derived from the first replicate. For the final set of replicated state calls we required that a region was assigned to the same state in both biological replicates (within a given tissue and stage). Regions that were not assigned to the same state in both replicates were reclassified as ‘no reproducible signal’ (distinct from state 15: no signal in both replicates). The unionbedg functionality of BEDTools67 v2.17.0 was used to keep track of the chromatin state of genomic intervals across a defined set of samples. The total coverage of annotated sequence is 2,725,535,600 bp. This number was used as the denominator to calculate per cent genome coverage from ChromHMM states in the text and figures.
Given a replicate for a defined tissue, all those genomic intervals classified in a specified state (for example, number 5, strong enhancers) at one or more time points were tracked using the approach described above. Considering each pair of adjacent developmental time points, the genomic coverage of each transition between each pair of states was then calculated. The resulting numbers were then normalized on the coverage of the largest transition in the time series under investigation (for example, liver, replicate number 2) and shown as a directed graph.
After tracking the changes in chromatin state of each genomic base pair in the genome across multiple stages and tissues, the resulting matrix was binarized according to each segment being classified in a specified state (1) or any other state (0). The binary distances between all the pairs of samples considered in each specific analysis were then calculated. These were used either for comparisons or hierarchical clustering (Ward’s method).
Functional enrichments through GREAT68 were obtained using the greatBatchQuery.py script. The resulting lists were first filtered for the relevant ontologies. After that, only the terms showing a binomial FDR ≤ 0.05 and a regional enrichment equal to or higher than twofold were considered. VISTA validated elements were downloaded from https://enhancer.lbl.gov on 17 June 2016. Mm9 and hg19 coordinates were converted to mm10 using liftOver (setting -minMatch to 0.95 and 0.1, respectively). VISTA positive elements with any of the following annotations were considered for the following analysis: forebrain, midbrain, hindbrain, neural tube, limb, facial mesenchyme or heart. Liver was not considered in this enrichment analysis since there are currently fewer than 10 validated elements in VISTA that show reproducible staining in the liver. coverageBed from BEDTools v2.17.0 was used to calculate the coverage of the regions in each state in the E11.5 predictions with each tissue-specific group of VISTA elements. The fraction of bases covered was then normalized to the expected overlap, based on the overall genome-wide coverage of each state. The enrichment for repetitive elements was calculated using the OverlapEnrichment function of ChromHMM.
A 2-kb window was defined around the TSS coordinates of all protein coding transcripts in GENCODE69 vM9. These 2-kb windows were overlapped with ChromHMM calls (‘pooled’ set) to determine their chromatin state in each tissue and stage. A TSS was classified as active in a given tissue-stage if this 2-kb window overlapped the active promoter state (state no. 1, Fig. 1a), and did not overlap any repressive states (states 3, 13, 14). A TSS was classified as repressed in a given tissue-stage if this 2-kb window overlapped the state characteristic of polycomb-mediated repression (state 13), and did not overlap any active states (states 1, 2, 4, 5, 6, 7, 10, 12). TSSs that did not meet the criteria for either active or repressed in a given tissue-stage were left unclassified. A gene was classified as a putative PcG target if it had at least one repressed TSS in at least one tissue-stage. To determine whether the genes we identified as putative PcG targets had been identified previously, we compared our data to five published studies examining the genome-wide distribution of H3K27me3 and/or PcG proteins13,29,30,31,32, including the NIH Epigenome Roadmap data set, which examined more than 100 human sample types. For refs. 29,30, any gene with a TSS annotated as H3K27me3-positive in any sample (irrespective of other marks) was considered a previously identified PcG target. For refs. 31,32, any gene classified in one of the ‘PRC’ states in any sample was considered a previously identified PcG target. For ref. 13, ChromHMM state calls for 127 human samples were downloaded on 31 March 2018 (‘15_coreMarks_mnemonics’ call set). Putative PcG target genes in this data set were identified as described above for our mouse ChromHMM calls, with the following modifications: the GENCODE v27 annotation set was used for human (gencode.v27lift37.annotation.gtf.gz), and the Polycomb-associated heterochromatin states considered were ‘13_ReprPC’, ‘14_ReprPCWk’, and ‘10_TssBiv’. 
Any gene with at least one TSS overlapping one of these states in at least one sample was considered a previously identified PcG target. Ensembl v84 was used to match mouse gene IDs with human orthologous gene IDs (attribute = hsapiens_homolog_ensembl_gene). For CpG analyses, a list of CpG Islands with corresponding values (length GC#, CpG#, GC%, CpG%) was downloaded from UCSC Table Browser on 8 January 2018, and overlapped with the list of TSSs using bedtools v2.20.1. If a TSS overlapped more than one CGI, the corresponding values of all overlapping CGIs were combined and associated with the overlapping TSS.
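The TSS classification rules for the mouse ChromHMM calls described above can be expressed compactly in Python. In this sketch the input is assumed to be the set of state numbers overlapping the 2-kb window around one TSS in one tissue-stage; the state numbers are taken from the text, whereas the function and constant names are hypothetical.

```python
ACTIVE_PROMOTER = 1
PCG_REPRESSED = 13
REPRESSIVE_STATES = {3, 13, 14}
ACTIVE_STATES = {1, 2, 4, 5, 6, 7, 10, 12}


def classify_tss(states_in_window):
    """Classify one TSS in one tissue-stage from the states in its 2-kb window."""
    states = set(states_in_window)
    if ACTIVE_PROMOTER in states and not states & REPRESSIVE_STATES:
        return "active"
    if PCG_REPRESSED in states and not states & ACTIVE_STATES:
        return "repressed"
    return "unclassified"


def is_putative_pcg_target(tss_calls):
    """A gene is a putative PcG target if any of its TSSs is repressed
    in at least one tissue-stage."""
    return any(call == "repressed" for call in tss_calls)


# A window containing both the active promoter state and state 13 meets
# neither definition and is left unclassified.
calls = [classify_tss(s) for s in ([1, 2], [13, 15], [1, 13], [15])]
```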
We obtained a list of Mendelian disease genes from https://www.omim.org/70 (genemap2.txt, accessed on 14 January 2018). To filter out genes associated with complex diseases or non-disease phenotypes, we performed the following filter steps. 1) We required that the genes be classified as type 3 (the molecular basis of the disorder is known). 2) We required that the gene have at least one associated phenotype that is not in brackets (nondiseases) or braces (multifactorial disorders), or containing a question mark (relationship between the phenotype and gene is provisional). 3) We further required that the human gene Ensembl ID mapped uniquely to one mouse Ensembl ID. 4) Finally, we considered only autosomes, because of the mixed-gender litter pools used for ChIP–seq. These filtering steps led to a set of 3,281 genes that we classified as MDGs. To identify TF genes, we downloaded a list of mouse TFs from the TFClass database71 (accessed 18 February 2017). As alternative sources of TF genes to support the TFClass results, we used the DBD: transcription factor prediction database72, and genes associated with one or more GO terms containing the phrase ‘TF’ as determined by AmiGO73 (accessed on 14 January 2018, taxon_subset_closure_label: Mus musculus, document_category: bioentity). AmiGO was also used in this way to identify genes associated with one or more GO terms containing the word ‘development’ (Extended Data Fig. 12). Genes with at least one transcript tagged as a consensus coding sequence (CCDS) in GENCODE were classified as CCDSs in Extended Data Fig. 12.
The temporal dynamic analysis was performed for each tissue separately. First, 1-kb genomic bins that overlapped with regions defined as ChromHMM strong enhancer states in at least one stage were identified. Then we selected dynamic elements (bins) from these strong enhancer bins using the bioconductor LIMMA package74 v3.28.21. LIMMA is a package developed for calling differentially expressed genes for microarray but was also adapted for sequencing data with the voom functionality. LIMMA was used to call differential enrichment between each adjacent stage comparison (for example, E11.5 versus E12.5, E12.5 versus E13.5, and so on). P values were calculated with the eBayes function within LIMMA with trend parameter disabled, and were adjusted using the Benjamini–Hochberg method. A bin was called overall dynamic if its adjusted P value was less than 0.05 in any adjacent stage comparison; otherwise it was called a non-dynamic bin. Non-dynamic bins were not included in the following analysis to reduce noise. We performed k-means clustering on dynamic bins across stages. The rows (bins) were normalized by dividing by a common value so that the squares of the values sum to 1. The optimal k was determined using the elbow method to cut off at the k value where percentage of ‘withinness’ values transition from increasing quickly to increasing steadily with larger k. The resulting heatmaps of the k-means clusters are shown in Fig. 4a and Extended Data Fig. 16. For each of the identified clusters, we performed enrichment testing of GO Biological Processes using GREAT. Over-represented motifs for each dynamic cluster were identified as follows: first, all vertebrate motif position weight matrices (PWMs) were downloaded from the JASPAR TF database and used to scan the peak-bins for motif occurrences with FIMO, MEME suite v4.11.275. 
For each motif, we computed the odds ratio and the significance of enrichment in each cluster, comparing to a non-dynamic bin pool using Fisher’s exact test. The non-dynamic bin pool was sampled with replacement to match the distribution of average signal strength from the dynamic bins. Following that, significant TF PWMs were grouped in subfamilies using the structural information from TFClass71 because they share similar if not identical binding motifs. The top significantly over-represented TFs and their associated subfamilies were reported.
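The per-motif test described above can be illustrated with a dependency-free Python sketch of Fisher's exact test on a 2 × 2 table (cluster bins with or without the motif versus background bins with or without the motif). A one-sided test for over-representation is assumed here; the text does not state whether the test was one- or two-sided. The function name is hypothetical.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """Odds ratio and one-sided (enrichment) Fisher's exact P value.

    a: cluster bins with motif      b: cluster bins without motif
    c: background bins with motif   d: background bins without motif
    """
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    row1, col1, n = a + b, a + c, a + b + c + d
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return odds_ratio, numerator / comb(n, row1)


# Toy table: 3 of 4 cluster bins and 1 of 4 background bins contain the motif.
odds, p_value = fisher_enrichment(3, 1, 1, 3)
```

For this toy table the P value is 17/70, the probability of observing three or four motif-containing bins in the cluster under the hypergeometric null.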
Super-enhancers were identified using rose v0.176,77 with default parameters for each tissue-stage with H3K27ac signals. Super-enhancers were then combined within the same tissue and across all tissues to generate a non-redundant set of super-enhancers (Extended Data Fig. 14c, Supplementary Table 6).
The reproducible strong enhancer calls (state no. 5) were merged using the mergeBed utility from BEDTools v2.17.0. After that, those regions or sub-regions that overlapped the intervals ±2.5 kb from the TSSs of genes in Gencode were excluded from the merged regions using subtractBed from BEDTools v2.17.0. Regions smaller than 2 kb were enlarged to 2 kb from their central coordinate (to allow more robust signal estimation). This resulted in 66,556 putative enhancers. H3K27ac signals at these regions were then quantified using uniquely aligned, de-duplicated reads. These measurements were carried out using the coverageBed utility from BEDTools v2.17.0, then normalized to RPKM according to the sequencing depth of each sample, and log2-transformed (zeros were replaced by the smallest detectable value larger than zero). The mRNA expression of protein-coding genes was tracked across the 66 samples. Small and non-coding RNAs were excluded from any subsequent step by considering only those genes with a GENCODE biotype supporting protein-coding functionality. FPKM were log2-transformed (zeros were replaced by the smallest detectable value larger than zero). For each TAD defined in the genome of mouse ES cells45, the putative enhancers and genes were retrieved. All the enhancer–gene pairs within the TAD were then evaluated in terms of SCC between the H3K27ac pattern of enrichment and the mRNA expression across the samples. Each gene was assigned to the putative enhancer showing the highest value of SCC. To attach P values to these correlations, a null distribution was estimated empirically, by calculating the SCC of the enhancer with all the genes on the chromosome. 
Two strategies were used to estimate a P value: 1) a z-score was calculated by subtracting the mean and dividing by the standard deviation of the null, and the corresponding P value was then calculated using the pnorm function in R; 2) an empirical P value was defined as the number of times an SCC equal to or better than the observed SCC was found in the null. Only those putative enhancers showing a P value ≤ 0.05 (for both strategies) and an SCC ≥ 0.25 were retained. Two maps were independently derived from the two biological replicates. Only the associations shared by both maps were used for further evaluation and analyses.
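The correlation and the two P value strategies can be sketched in Python using only the standard library. Several details are assumptions made for illustration: the empirical P value is expressed as a fraction of the null, the z-score is converted to an upper-tail probability, and ties receive average ranks. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation coefficient (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def z_score_p(observed, null):
    """Upper-tail normal P value of the observed SCC against the null."""
    z = (observed - mean(null)) / stdev(null)
    return 1.0 - NormalDist().cdf(z)


def empirical_p(observed, null):
    """Fraction of null SCCs equal to or better than the observed SCC."""
    return sum(s >= observed for s in null) / len(null)


def keep_pair(observed, null, max_p=0.05, min_scc=0.25):
    """Retain a pair only if both P values and the SCC pass the thresholds."""
    return (
        observed >= min_scc
        and z_score_p(observed, null) <= max_p
        and empirical_p(observed, null) <= max_p
    )
```

Here null would hold the SCCs between one enhancer and all genes on its chromosome, as described in the text.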
Capture-C interaction data from the developing limb and brain48 were retrieved from the GEO (GSE84792). Chromatin interaction analysis by paired-end tag sequencing (ChIA–PET) interactions at sites bound by the cohesin subunit SMC1A in the developing limb49 were retrieved from Supplementary Table 2 of the original publication. Enhancer–gene contacts in fetal liver cells as inferred from Capture HiC50 were downloaded from ArrayExpress (E-MTAB-2414). In all cases, mm9 coordinates were mapped to mm10 using liftOver. For each published data set, only those regions in the enhancer–gene map that overlapped any experimentally validated interaction were retained. The fraction of interactions showing experimental support was then calculated for both the gene assigned by correlation and the nearest RefSeq gene.
The putative enhancer regions were mapped to the human genome (hg19) using liftOver, with a strategy similar to previous reports78. Each region was required to both uniquely map to hg19, and to uniquely map back to the original region in mm10, with the requirement that ≥50% of the bases in each region were mapped back to mouse after being mapped to human. For each enhancer–gene pair, the orthologous human gene was inferred using BioMart79 (Ensembl version 87; from http://www.ensembl.org/biomart/martview, Filters -> Multiple Species Comparisons -> Attributes -> Homologues -> Mouse Orthologues). Orthologous pairs were also required to share the same TAD in human (derived from human ES cells45). Three thousand, five hundred and seventy genes in our mouse map had a human orthologue and at least one linked enhancer with an alignable region in the human genome (located in the same human TAD). Of the 17,689 putative enhancers that were successfully mapped to hg19, 12,564 were assigned to a gene with an unambiguous orthologue in human.
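Single-tissue eQTL–gene associations generated by the GTEx Consortium80 were downloaded from the GTEx Portal (http://gtexportal.org, version v6p). Only those tissues with more than 750,000 annotated eQTLs were considered. A set of control enhancer–gene associations was generated to match the size and distance-to-TSS distribution of the actual enhancer–gene map. Briefly, for each enhancer–gene pair, the distance between the TSS of the gene and the central coordinate of the enhancer was calculated; a region of the same size and at the same distance from the TSS of the gene, but on the opposite side from the enhancer, was then selected as a control. For the eQTL analysis, the proportion of associations supported by eQTLs was then calculated for ten equal-size bins based on the distance between the enhancer and the TSS of the gene. The same procedure was applied to the nearest gene. The counts were then computed separately for each of the two sets and each of the ten bins. These numbers were used to derive a P value for each bin using Fisher's exact test. For this analysis, we considered only those eQTLs derived from human tissues for which an equivalent tissue was profiled in this study (brain, heart, liver, lung, stomach and small intestine).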
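The control construction and the per-bin test can be sketched as below. This is a hedged illustration only: the function names are hypothetical, the control is taken as the mirror image of the enhancer about the TSS, and the 2×2 table layout (eQTL-supported versus unsupported associations, for the real and control sets within one distance bin) is an assumption, since the text states only that Fisher's exact test was applied per bin.

```python
from math import comb

def mirrored_control(tss, enh_start, enh_end):
    """Region of the same size and TSS distance as the enhancer, on the other side of the TSS."""
    return 2 * tss - enh_end, 2 * tss - enh_start

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are real/control associations,
    columns are eQTL-supported/unsupported counts in one bin.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

Mirroring about the TSS keeps both the region size and the distance to the TSS identical to those of the real enhancer, so any difference in eQTL support cannot be attributed to those two properties.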
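Reference data sets from ref. 6, GeneHancer81, JEME82 and RIPPLE83 were downloaded and consistently remapped to the hg19 genome using liftOver. The mapping of enhancer–gene associations between the different maps was performed using BedTools v2.17.0.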
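The names (mm numbers) of the functionally validated enhancers used in this work are the unique identifiers from the VISTA Enhancer Browser (https://enhancer.lbl.gov/)34. Enhancers were selected for testing as follows: H3K27ac peak calls for three tissues (E12.5 heart, forebrain and limb) were obtained from the ENCODE DCC (narrow replicated peaks), taking TSS-distal H3K27ac peaks from the uniform processing pipeline (mm10-minimal). The peaks for each tissue were ranked by enrichment score (from most to least significant). We then selected predicted enhancers for testing from three different bins of the ranked list for each tissue (the bins were approximately ranks 1–85, 1,500–1,550 and 3,000–3,050). Loci that were already included in the VISTA Enhancer Browser or that appeared to overlap unannotated promoters were excluded from testing. In total, 150 predicted enhancers were tested, including 60 top-ranked candidates (20 per tissue), 45 middle-ranked candidates (15 per tissue) and 45 low-ranked candidates (15 per tissue). Transgenic mouse assays were performed in FVB/NCrl strain mice (Charles River) as previously described52,84. Briefly, predicted enhancers were amplified and cloned into a plasmid upstream of a minimal Hsp68 promoter and a lacZ reporter gene. Transgenic embryos were generated by pronuclear injection of the resulting plasmid into fertilized mouse eggs. Embryos were implanted into surrogate mothers, collected at E12.5 and stained for β-galactosidase activity. An element was scored as a positive enhancer if at least three embryos showed identical β-galactosidase staining in the same tissue. An element was scored as negative if no reproducible staining was observed and at least five embryos carrying a transgene insertion were obtained. The genomic coordinates and results for each element are provided in Supplementary Table 10 and in the VISTA Enhancer Browser (https://enhancer.lbl.gov/).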
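The scoring rule for transgenic results can be written down compactly. The sketch below is illustrative only: the function name and input format are hypothetical, and the "inconclusive" label for elements with too few transgenic embryos is an assumption that does not appear in the text.

```python
from collections import Counter

def score_element(embryo_stainings, min_positive=3, min_transgenic=5):
    """Score one tested element.

    embryo_stainings holds one collection of stained tissues per transgenic
    embryo (an empty collection means no staining was observed).
    """
    # Count, for each tissue, how many embryos show staining there.
    tissue_counts = Counter(t for tissues in embryo_stainings for t in set(tissues))
    reproducible = sorted(t for t, n in tissue_counts.items() if n >= min_positive)
    if reproducible:
        return "positive", reproducible
    if len(embryo_stainings) >= min_transgenic:
        return "negative", []
    # Too few transgenic embryos to call the element negative (assumed label).
    return "inconclusive", []
```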
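Overall, 422, 299 and 414 elements showing activity in forebrain, limb or heart, respectively, were considered. For each ranked list of H3K27ac regions, the overlap with positive elements (those showing activity in the same tissue from which the H3K27ac profile was derived) and negative elements (those negative in all tissues or positive only in other tissues) was computed. For each of the three tissues, a spline was used to fit the overlap (0–1 value) against the rank (smooth.spline R function, degrees of freedom (df) = 2). To derive an estimate of the background validation rate for each tissue, the VISTA elements missed by the H3K27ac profile were leveraged. Specifically, the number of VISTA elements in this set that validated in the tissue was divided by the total number of VISTA elements in the set. The validation rates for ranked forebrain VISTA elements were derived using the spline approach described above. For each of the following categories, each element was annotated with the best overlapping feature (in terms of signal, or LOD score for conserved elements): H3K27ac enrichment, p300 binding, DNase I-hypersensitive sites (DHSs), ATAC and phastCons conservation. Where available, biological replicates were used to derive separate ranks, and the sum of their ranks was then used to re-rank the elements. DHSs were downloaded from the ENCODE DCC website (accession: ENGSR014SFF) or GEO (accessions: GSM348064, GSM348066, GSM5559652). phastCons conserved elements were downloaded from the UCSC Genome Browser on 24 January 2018 (phastConsElements60way and phastConsElements60wayPlacental)85.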
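The replicate-based re-ranking can be illustrated with a short sketch. It assumes that each replicate provides a signal value for the same set of elements, that a higher signal means a better rank, and that ties in the rank sum are broken by element identifier; none of these details is specified in the text, and the function name is hypothetical.

```python
def rerank_by_rank_sum(replicate_scores):
    """Re-rank elements by the sum of their per-replicate ranks.

    replicate_scores is a list of dicts, one per biological replicate,
    each mapping an element identifier to its signal.
    """
    rank_sums = {}
    for scores in replicate_scores:
        # Rank 1 goes to the element with the strongest signal in this replicate.
        ordered = sorted(scores, key=lambda e: scores[e], reverse=True)
        for rank, element in enumerate(ordered, start=1):
            rank_sums[element] = rank_sums.get(element, 0) + rank
    # Smallest rank sum first; the identifier breaks ties deterministically.
    return sorted(rank_sums, key=lambda e: (rank_sums[e], e))
```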
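Because the ENCODE analysis pipeline focuses mainly on uniquely mapped reads, we used a separate approach to study repetitive regions. More specifically, we used a pipeline with two rounds of mapping to reprocess all FASTQ files. In the first round of mapping, sample reads were aligned to the reference genome mm10 using Bowtie with: bowtie hg19 -p 16 -t -m 1 -S --chunkmbs 512 --max multimap.fastq output.sam86. The --max option was used to separate reads that mapped to multiple locations in the genome from uniquely mapped reads. In the second round of mapping, pseudogenomes were constructed for each repetitive element subfamily by concatenating the genomic instances of each repetitive element, together with their 15-bp flanking genomic sequences and 200-bp spacer sequences, in FASTA format87. The annotation file of repetitive elements used in this step was downloaded from repeatmasker.org. The Python script was used with parameters as follows: python RepEnrich.py /data/mm10_repeatmasker.txt /data/sample_A sample_A /data/mm10_setup_folder sampleA_multimap.fastq sampleA_unique.bam --cpus 1688. Enrichment of repetitive element subfamilies or classes was determined using information from both uniquely mapped and non-uniquely mapped reads. Because some repetitive element subfamilies are very similar to each other, reads that mapped to multiple repetitive element subfamilies were assigned using a fractional counting approach. This approach counts a read that maps uniquely to one repetitive element subfamily once, and counts a read that maps to multiple subfamilies with a fraction of 1/Ns for each, where Ns is the number of repetitive element subfamilies to which the read aligns. The count table of estimated enrichment signal across repeat classes in the different tissues was the final output used to plot the figures.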
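The fractional counting rule can be illustrated with a few lines of code. This is a minimal sketch of the 1/Ns rule as described, not the RepEnrich implementation; the function name and the input format are hypothetical.

```python
from collections import defaultdict

def fractional_counts(read_to_subfamilies):
    """Accumulate per-subfamily counts with the 1/Ns rule.

    read_to_subfamilies maps a read identifier to the set of repetitive
    element subfamilies to which that read aligns.
    """
    counts = defaultdict(float)
    for subfamilies in read_to_subfamilies.values():
        n_s = len(subfamilies)
        for subfamily in subfamilies:
            # A uniquely assigned read (n_s == 1) contributes a full count.
            counts[subfamily] += 1.0 / n_s
    return dict(counts)
```

Each read contributes a total of one count regardless of how many subfamilies it matches, so the sum of all subfamily counts equals the number of assigned reads.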
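Most of the data-processing steps described (plotting, statistical tests, computing correlations and hierarchical clustering) were performed in the statistical computing environment R v.3.3.1 (https://www.r-project.org/).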
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.