RNA-seq Data Analysis in Saccharomyces cerevisiae with SPRC Treatment
1 Introduction
This project investigates the effect of hydrogen sulfide (H₂S), a byproduct of the cystathionine-γ-lyase (CSE)-catalyzed reaction of S-propargyl-cysteine (SPRC), on Saccharomyces cerevisiae using an RNA-seq analysis pipeline. The pipeline includes data preprocessing, read counting, differential expression analysis, and KEGG pathway enrichment.
Initially, the raw sequencing data undergoes quality control, alignment to the S. cerevisiae reference genome, and quantification of gene expression. Differential expression analysis is then performed to identify genes with significant changes in expression between experimental conditions. Finally, KEGG pathway enrichment analysis highlights biological pathways associated with the differentially expressed genes (DEGs).
This analysis provides insights into how H₂S influences gene expression and related pathways in Saccharomyces cerevisiae.
RNA-seq data are from the following article: Mechanism of Growth Regulation of Yeast Involving Hydrogen Sulfide From S-Propargyl-Cysteine Catalyzed by Cystathionine-γ-Lyase
2 Snakemake Workflow
Snakemake is a workflow management tool used to automate and organize complex data analyses. It works by defining rules that describe the steps in the pipeline, with their input files, output files and commands to be executed.
Snakemake enhances the execution of workflows through key technical features. It automatically resolves rule dependencies, ensuring the correct execution order. The system also supports parallelization, enabling multiple tasks to be processed simultaneously for improved efficiency. Furthermore, Snakemake promotes reproducibility by supporting isolated environments via tools like Conda or Docker, ensuring that workflows can be reliably recreated in different settings.
2.1 Define resources and metadata
We need to set up the required workflow resources, including SRA identifiers for the samples and reference files for the genome, GTF annotation, and SNPs, to enable dynamic parameterization of the workflow with resources linked to specific conditions and replicates.
# Define resource dictionary mapping sample conditions to their SRA identifiers.
resources = {
"samples": {
"SPRC": {
"rep1": "SRR13978640",
"rep2": "SRR13978641",
"rep3": "SRR13978642",
},
"CTRL": {
"rep1": "SRR13978643",
"rep2": "SRR13978644",
"rep3": "SRR13978645",
},
},
"references":{
"fasta": "genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa.gz",
"gtf": "genome/Saccharomyces_cerevisiae.R64-1-1.108.gtf.gz",
"snps": "genome/saccharomyces_cerevisiae.vcf.gz"
}
}
2.2 Extract SRA identifiers
To ensure the workflow is flexible and reusable for any sample set, we have defined a function, which dynamically extracts SRA identifiers based on the condition (SPRC or CTRL) and replicate (rep1, rep2, etc.).
# Function to extract SRA ID based on the sample's condition and replicate.
def getSRR(wildcards):
all_keys = wildcards.sample.split("_")
return resources["samples"][all_keys[0]][all_keys[1]]
# List of all sample names in the format 'condition_replicate'.
samples = []
for condition, reps in resources["samples"].items():
for rep in reps:
sample_name = f"{condition}_{rep}"
samples.append(sample_name)
2.3 Download FASTQ files from SRA
The dump_fastq
rule downloads raw FASTQ files from SRA
for each sample using fasterq-dump
. The downloaded files
are compressed in .fastq.gz
format and stored in the
rawReads/
directory.
# 'dump_fastq' rule: download raw fastq files from SRA.
rule dump_fastq:
output:
"rawReads/{sample}_1.fastq.gz",
"rawReads/{sample}_2.fastq.gz",
conda:
"rnaseq"
params:
sra_id=lambda wildcards: getSRR(wildcards),
shell:
"""
fasterq-dump {params.sra_id} -O rawReads --split-files -p
for read in 1 2; do
mv rawReads/{params.sra_id}_$read.fastq rawReads/{wildcards.sample}_$read.fastq
gzip rawReads/{wildcards.sample}_$read.fastq rawReads/{wildcards.sample}_$read.fastq.gz
done
"""
2.4 Trim Adapters and Perform Quality Control with fastp
The fastp
rule trims adapter sequences and performs
quality control on paired-end FASTQ files using the tool
fastp
. It generates trimmed FASTQ files in
.fastq.gz
format and provides a detailed quality control
report in HTML format. This step ensures that reads with poor quality or
adapter contamination are filtered out before downstream analysis.
# 'fastp' rule: run fastp to trim and generate quality reports for the reads.
rule fastp:
input:
"rawReads/{sample}_1.fastq.gz",
"rawReads/{sample}_2.fastq.gz",
output:
"trimmedReads/{sample}_1_trimmed.fastq.gz",
"trimmedReads/{sample}_2_trimmed.fastq.gz",
"fastpQC/{sample}_fastp_report.html",
conda:
"rnaseq"
shell:
"""
fastp -i {input[0]} -I {input[1]} -o {output[0]} -O {output[1]} --detect_adapter_for_pe -q 20 -h {output[2]} -R "{wildcards.sample} fastp report"
"""
2.5 Index Genome for Alignment (Samtools, STAR, Picard)
The index_genome
rule generates the necessary index
files for the genome to be used in the alignment step. It employs
several tools: samtools
to index the compressed FASTA file,
STAR
to create a genome index for RNA-Seq alignment,
tabix
to index SNP data, and Picard
to create
a sequence dictionary file.
# 'index_genome' rule: generate genome index files using Samtools, STAR, Tabix, and Picard.
rule index_genome:
input:
fasta_gz=resources["references"]["fasta"], # Compressed FASTA file (.gz)
gtf_gz=resources["references"]["gtf"],
gtf="genome/Saccharomyces_cerevisiae.R64-1-1.108.gtf",
fasta="genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa",
snps=resources["references"]["snps"]
output:
genome_dir=directory("starIndex"),
samtools_index="genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa.gz.fai",
dict_file="genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.dict",
tbi="genome/saccharomyces_cerevisiae.vcf.gz.tbi"
conda:
"rnaseq"
shell:
"""
# Generate the index for the compressed genome FASTA (.gz)
samtools faidx {input.fasta_gz}
# Generate the genome index with STAR
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir {output.genome_dir} \
--genomeSAindexNbases 10 --sjdbGTFfile {input.gtf} --genomeFastaFiles {input.fasta}
# Create the tabix index for SNPs
tabix -f {input.snps}
# Create the sequence dictionary file for Picard
picard CreateSequenceDictionary -R {input.fasta}
"""
2.6 Align reads with STAR
The star_align
rule aligns reads to the indexed genome
using the STAR
aligner. The output of this step is a sorted
BAM file containing the aligned reads.
# 'star_align' rule: align reads with STAR.
rule star_align:
input:
fq1="trimmedReads/{sample}_1_trimmed.fastq.gz",
fq2="trimmedReads/{sample}_2_trimmed.fastq.gz",
indexDir="starIndex"
output:
bam="starAligned/{sample}/{sample}_Aligned.sortedByCoord.out.bam"
conda:
"rnaseq"
shell:
"""
gunzip -c {input.fq1} > {wildcards.sample}_1_trimmed.fastq
gunzip -c {input.fq2} > {wildcards.sample}_2_trimmed.fastq
mkdir -p starAligned/{wildcards.sample}
STAR --runMode alignReads \
--genomeDir {input.indexDir} \
--outSAMtype BAM SortedByCoordinate \
--readFilesIn {wildcards.sample}_1_trimmed.fastq {wildcards.sample}_2_trimmed.fastq \
--outFileNamePrefix starAligned/{wildcards.sample}/{wildcards.sample}_
rm {wildcards.sample}_1_trimmed.fastq {wildcards.sample}_2_trimmed.fastq
"""
2.7 Index BAM File with Samtools
The index_bam
rule generates a BAM index file for the
aligned BAM file produced by star_align
.
2.8 Remove Duplicate Reads from BAM File
The rmDuplicates
rule uses samtools rmdup
to remove duplicates and creates a new BAM file without duplicates.
Duplicate reads can arise during PCR amplification, and their removal
helps improve the accuracy of downstream analyses.
rule rmDuplicates:
input:
bam="starAligned/{sample}/{sample}_Aligned.sortedByCoord.out.bam"
output:
bam="starAligned/{sample}/{sample}_Aligned.sortedByCoord.RD.out.bam",
params:
dupStats="starAligned/{sample}/{sample}_dupStats.txt"
conda:
"rnaseq"
threads:
3
shell:
"""
samtools sort --threads {threads} {input.bam} | samtools rmdup -s - {output.bam} 2> {params.dupStats}
"""
2.9 Index Non-Duplicated BAM Files
The index_nonDupBam
rule creates BAM index files for the
deduplicated BAM files generated by the rmDuplicates rule. Indexing the
deduplicated BAM files is crucial for efficient access to specific
regions in subsequent steps, such as variant calling or visualization
(IGV).
2.10 Count the Number of Mapped Reads in BAM Files
The countsFeature
rule uses the
featureCounts
tool to count the number of reads that map to
each feature (e.g., genes or exons) in the genome. Here the input is the
aligned BAM files from the rmDuplicates step and the gene annotation in
GTF format.
rule countsFeature:
input:
gtf="genome/Saccharomyces_cerevisiae.R64-1-1.108.gtf",
bam=expand("starAligned/{sample}/{sample}_Aligned.sortedByCoord.RD.out.bam", sample=samples)
output:
countTable="featureCounts/all_counts.txt"
conda:
"rnaseq"
threads:
3
shell:
"""
featureCounts -a {input.gtf} -o {output.countTable} -T {threads} {input.bam}
"""
2.11 Generate DAG (Directed Acyclic Graph) for Workflow
A Directed Acyclic Graph (DAG) can be generated with Snakemake to visualize the workflow structure and task dependencies. Each node represents a task, and edges show the execution order.
Here’s an example DAG for the RNA-seq workflow:
3 Differential Gene Expression Analysis
counts <- read.delim("featureCounts/all_counts.txt", header = TRUE, comment.char = "#")
head(counts)
Number of GeneIDs inside the data
## [1] 7127
Number of the duplicated GeneIDs
## [1] 0
#Subset the data to keep only the counts columns
geneIDs <- counts$Geneid
counts <- counts[, 7:ncol(counts)]
rownames(counts) <- geneIDs
colnames(counts) <- gsub("starAligned\\.|_Aligned.sortedByCoord.RD.out.bam", "", colnames(counts))
colnames(counts) <- sub("\\..*", "", colnames(counts))
head(counts)
To perform DESeq2 analysis, it is necessary to map the columns of the count data to their corresponding groups or conditions.
DESeq2 requires a metadata as a data-frame where the samples are represented as row names and the conditions or groups are specified as column names.
sampleNames <- colnames(counts)
conditions <- c("SPRC", "SPRC", "SPRC", "CTRL", "CTRL", "CTRL")
metadata <- data.frame(sample = sampleNames,
condition = conditions)
rownames(metadata) <- metadata$sample
metadata
Although pre-filtering low-count genes before running DESeq2 is not mandatory, it offers two key benefits: first, by removing rows with very few reads, it reduces the memory size of the DESeqDataSet (dds) object and speeds up the transformation and testing functions in DESeq2. Second, since features without data for differential expression are not plotted, this step can enhance the quality of visualizations.
Genes with a total count of 10 or more are retained for the differential expression analysis.
## [1] 6210
3.1 Perfom DESeq2 Analysis
library(DESeq2)
# create DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = metadata,
design = ~ condition)
dds
## class: DESeqDataSet
## dim: 6210 6
## metadata(1): version
## assays(1): counts
## rownames(6210): YDL243C YDR387C ... Q0045 Q0158
## rowData names(0):
## colnames(6): SPRC_rep1 SPRC_rep2 ... CTRL_rep2 CTRL_rep3
## colData names(2): sample condition
The DESeq function estimates size factors and dispersions from the provided count data, then applies a negative binomial generalized linear model to identify differentially expressed genes. The updated DESeqDataSet (dds) object includes the results of the analysis.
## class: DESeqDataSet
## dim: 6210 6
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(6210): YDL243C YDR387C ... Q0045 Q0158
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(6): SPRC_rep1 SPRC_rep2 ... CTRL_rep2 CTRL_rep3
## colData names(3): sample condition sizeFactor
The results of the differential expression analysis are extracted from the dds object using the results function. This function calculates the log2 fold changes, p-values, and adjusted p-values (using the Benjamini-Hochberg method) for each gene. The contrast argument specifies the comparison of interest—in this case, comparing the SPRC condition to the CTRL condition based on the condition variable. The alpha argument sets the significance threshold to 0.05, controlling the false discovery rate (FDR). The resulting res object contains the differentially expressed genes (DEGs) along with their corresponding statistical metrics, which can be further explored and analyzed.
Out of 6,210 genes analyzed, 1.4% (88) are upregulated, and 1.1% (66) are downregulated with an adjusted p-value < 0.05.
##
## out of 6210 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 88, 1.4%
## LFC < 0 (down) : 66, 1.1%
## outliers [1] : 1, 0.016%
## low counts [2] : 362, 5.8%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Order the results table by the smallest padj_value
## log2 fold change (MLE): condition SPRC vs CTRL
## Wald test p-value: condition SPRC vs CTRL
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## YJR150C 190.185 4.16703 0.276159 15.0893 1.90566e-51 1.11424e-47
## YER011W 570.506 2.55423 0.206814 12.3504 4.84864e-35 1.41750e-31
## YLR092W 907.651 1.70277 0.145128 11.7329 8.64529e-32 1.68497e-28
## YFR030W 1258.373 1.84181 0.170150 10.8247 2.63044e-27 3.84504e-24
## YKL001C 439.106 1.95575 0.181218 10.7923 3.74450e-27 4.37882e-24
## YJR137C 1028.568 1.87670 0.176751 10.6178 2.46361e-26 2.40079e-23
3.2 PlotMA
The MA plot visualizes gene expression changes between two conditions (e.g., Control vs. Treated). It displays the log2 fold change (M) on the Y-axis and the log of the mean normalized expression counts on the X-axis. Genes with lower mean expression often show higher variability in log2 fold change.
- Blue dots: Represent significant DEGs with adjusted p-values < 0.05.
- Triangles at the edges: Indicate genes with high fold changes, with the triangle’s direction corresponding to the fold change direction.
- Right-side genes: Represent genes with high mean normalized counts and large fold changes, making them particularly interesting for further investigation.
Shrinkage of effect size (log2 fold change estimates) is valuable for both visualization and gene ranking. Using shrunken log2 fold change values in the MA plot enhances clarity by reducing noise caused by low-count genes. This approach eliminates the need for arbitrary filtering thresholds, providing a more accurate and interpretable representation of the data.
## [1] "Intercept" "condition_SPRC_vs_CTRL"
3.3 Dispersion Plot
Dispersion measures the variability or spread in the data. Common
measures of dispersion include variance, standard deviation, and
interquartile range (IQR). However, DESeq2 uses a specific dispersion
model that relates the mean (\(\mu\))
and variance (\(\text{var}\)) of the
data as follows:
\[
\text{var} = \mu + \alpha \cdot \mu^2
\]
Where \(\alpha\) is the dispersion
parameter.
- A dispersion value of 0.01 indicates an expected 10% variation
around the mean across biological replicates.
- Dispersion in DESeq2 has the following relationships:
- Inversely related to the mean: Higher dispersion is
observed with lower mean counts.
- Directly related to variance: Dispersion increases
with higher variance.
- Inversely related to the mean: Higher dispersion is
observed with lower mean counts.
- For genes with the same mean, dispersion values differ based on their variance, reflecting the variability in gene expression for a given mean.
Step 1: Dispersion estimates are calculated for each gene using maximum likelihood estimation.
Step 2: Fit a curve to the gene-wise dispersion
estimates.
This curve represents the expected dispersion values based on the mean
expression levels of the genes. It models the relationship between the
mean and dispersion to provide a smooth trend.
Step 3: Shrink the gene-wise dispersion estimates
toward the curve.
Gene-wise dispersions are adjusted (shrunken) toward the curve to obtain
the final dispersion estimates. This shrinkage reduces the influence of
extreme dispersion values and stabilizes the results.
- Exception: Some genes with extremely high variability (dispersion) will not be shrunk toward the curve. DESeq2 assumes these genes do not follow the modeling assumptions, avoiding potential false positives that could arise from forcing these genes to fit the model.
3.4 Principal Component Analysis (PCA)
PCA captures the variance within the dataset and reveals whether the samples are related or distinct from one another.
## SPRC_rep1 SPRC_rep2 SPRC_rep3 CTRL_rep1 CTRL_rep2 CTRL_rep3
## YDL243C 5.615505 5.734092 5.886628 5.868497 5.726914 5.793761
## YDR387C 7.992983 7.913163 7.853176 8.107994 7.944845 8.209682
## YDL094C 3.799819 3.778870 3.787080 3.800397 3.770846 3.744740
## YDR438W 7.073898 7.180850 7.014078 7.159573 7.294449 6.896694
## YDR523C 3.924193 3.904075 3.932480 3.754058 3.853457 4.023484
## YDR492W 9.830358 9.825192 9.481520 9.485170 9.605194 9.551596
The first principal component (PC1) explains 51% of the variation in the data and shows a clear clustering, with SPRC samples on the left and CTRL samples on the right. The second principal component (PC2) explains 27% of the variation but does not reveal a biologically meaningful pattern.
library(ggplot2)
PCAA <- plotPCA(rld, intgroup = 'condition')
PCAA + geom_text(aes(label = name), size = 2.5) + ggtitle("PCA Plot")
3.5 VolcanoPlot
Volcano Plot is a quick way to visualize DEG by plotting log2 fold change (x-axis) against -log10(p-value) (y-axis). Genes with large fold changes and low p-values appear in the upper corners, helping identify significant upregulated or downregulated genes.
## Loading required package: ggrepel
EnhancedVolcano(res,
lab = rownames(res),
x = "log2FoldChange",
y = "padj",
title = 'SPRC vs. CTRL',
pCutoff = 5e-2,
FCcutoff = 1,
pointSize = 2,
labSize = 3)
3.6 Subset Down & Up Regulated Genes
down <- res[which(res$padj < 5e-2 & res$log2FoldChange < -1),]
up <- res[which(res$padj < 5e-2 & res$log2FoldChange > 1),]
dclust <- assay(rld)[rownames(down),]
uclust <- assay(rld)[rownames(up),]
dclust_order <- hclust(dist(dclust))$order
dclust_sorted <- dclust[dclust_order, ]
uclust_order <- hclust(dist(uclust))$order
uclust_sorted <- uclust[uclust_order, ]
combined_data <- rbind(dclust_sorted, uclust_sorted)
de.geneIds <- c(rownames(down)[dclust_order], rownames(up)[uclust_order])
cat('Total number of DE genes: ', length(de.geneIds),
'\nNumber of down-regulated genes: ', length(rownames(dclust)),
'\nNumber of up-regulated genes: ', length(rownames(uclust)), '\n')
## Total number of DE genes: 55
## Number of down-regulated genes: 14
## Number of up-regulated genes: 41
3.7 Heatmap
We use a heatmap to visualize the expression patterns of both up-regulated and down-regulated genes, with genes sorted by hierarchical clustering. This allows us to identify distinct expression patterns across conditions and highlight any clusters of genes with similar regulation.
library(RColorBrewer)
# Heatmap avec with combiend and sorted data
par(mar = c(10, 6, 1, 1))
image(x = 1:ncol(combined_data),
y = 1:nrow(combined_data),
z = t(combined_data),
xaxt = "n", yaxt = "n", xlab = "", ylab = "",
col = brewer.pal(11, "Spectral"))
axis(2, at = 1:nrow(combined_data), labels = de.geneIds, las = 2, cex.axis = 1)
axis(1, at = 1:3, labels = colnames(assay(rld))[1:3], las = 2, col.axis = "orange")
axis(1, at = 4:6, labels = colnames(assay(rld))[4:6], las = 2, col.axis = "blue")
The heatmap reveals several clusters in gene expression. Notably, there appear to be three clusters of down-regulated genes and three clusters of up-regulated genes.
4 KEGG Pathway Enrichment Analysis
Next, we perform a KEGG Pathway Enrichment Analysis to identify biological pathways that are significantly enriched in the differentially expressed genes (DEGs). KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways are widely used to gain insights into the molecular functions and biological processes of genes.
library(clusterProfiler)
library(org.Sc.sgd.db)
kegg_enrichment <- enrichKEGG(gene = de.geneIds, organism = "sce")
Below are the top enriched pathways related to metabolism in Saccharomyces cerevisiae.
The sulfur metabolism pathway (sce00920) exhibits the strongest enrichment with an extremely low p-value (1.38e-09), emphasizing its significant association with the DEGs. This pathway encompasses genes involved in the biosynthesis of sulfur-containing amino acids, which play a critical role in metabolic processes.
4.1 KEGG Enrichment Visualization
4.1.1 Dotplot
The dotplot highlights the top 5 enriched KEGG pathways, with dot size indicating the number of genes involved and color intensity reflecting the statistical significance (darker colors denote higher significance).
This plot will display the top 5 enriched KEGG pathways, allowing us to quickly assess which pathways are most significantly affected.
4.2 Biological conclusion
The Pathway Enrichment Analysis reveals significant enrichment of
pathways related to sulfur metabolism
,
sulfur amino acid biosynthesis
and
secondary metabolite biosynthesis
, indicating activation of
metabolic processes promoting cell growth. These results are in line
with the study on SPRC and H₂S production, where CYS3 plays a key role
in regulating these pathways.