Meta RNAseq analysis in S. mansoni

The following results were obtained from:

Lu Z, Berriman M (2018) Meta-analysis of RNA-seq studies reveals genes responsible for life stage-dominant functions in Schistosoma mansoni. bioRxiv 308189; doi: https://doi.org/10.1101/308189

Sample information

Type	Label	Sequence Source	Sample Accession	Biological Replicates	Reference
Gonad (testis)	bTe	PRJEB14695	ERS420096, ERS420097, ERS420098	3	Lu et al. (2016)
Gonad (ovary)	bOv	PRJEB14695	ERS420090, ERS420091, ERS420092	3	Lu et al. (2016)
Egg	Egg	PRJNA294789	SRR2245469	1	Anderson et al. (2015)
Miracidium	Mir	PRJNA209511	SRR922067	1	Wang et al. (2013)
Sporocyst (48h)	Spo	PRJNA209511	SRR922068	1	Wang et al. (2013)
Cercaria	Cer	PRJEB2350	ERR022872, ERR022877, ERR022878	3	Protasio et al. (2012)
Schistosomulum (3h)	Som	PRJEB2350	ERR022874, ERR022876	2	Protasio et al. (2012)
Adult male before pairing	sMa	PRJEB14695	ERS420103, ERS420104, ERS420105	3	Lu et al. (2016)
Adult female before pairing	sFe	PRJEB14695	ERS420108, ERS420109, ERS420110	3	Lu et al. (2016)
Adult male after pairing	bMa	PRJEB14695	ERS420093, ERS420106, ERS420107	3	Lu et al. (2016)
Adult female after pairing	bFe	PRJEB14695	ERS420099, ERS420100, ERS420101	3	Lu et al. (2016)

Note that there are variations in S. mansoni strain and definitive host in these studies.

Analysis pipeline

Reads were mapped to S. mansoni genome annotation V5.2 using HISAT2 (v2.1.0) for egg and STAR (v2.4.2a) for the rest samples. Counts per gene were summarised with featureCounts (v1.4.5-p1) on the latest annotation (GeneDB data on 10/07/2017). Read counts were normalised in edgeR (v3.16.5) using the TMM method¹ and differential expression was analysed using the GLM approach (glmFit() and glmLRT())². RPKM values were calculated based on normalised library sizes³ and mean values were used for biological replicates (indicated as “Normalised expression” in the charts).

Data Exploration

Differential gene expression

Click on each figure to see the interactive logFC-logFDR volcano chart. I did but it’s generally not recommended for samples without biological replicates.

Housekeeping and most abundant genes across all life stages

Housekeeping genes were discovered with two methods:

using the GLM approach (edgeR) and set fold-difference < 1.5 between bFe and any other life stage.
using the R package Normfinder and select those with lowest stability values.

Most abundant gene transcripts were calculated by ranking the mean expression values (before normalisation) of all samples.

You can find the interactive charts here.

Genes preferentially-expressed in certain life stage(s)

These analyses were performed for all stages excluding bTe and bOv (Details).

Preferential expression was calculated using GLM approach to compare the expression of one stage to the average of the rest (FDR < 0.01 & higher expression than any other stage).

Validation

To validate the meta-analysis, I selected differentially expressed genes (FDR cutoff 0.01) and calculated Pearson correlation for logFC between the original study and the presented meta-analysis (Details).

Some notes

TMM (Trimmed Mean of M-values) is a batch normalisation method (on a group of samples) ^[return]
Similar to pairwise comparison. From the edgeR Users’ Guide: “The glm approach to multiple groups is similar to the classic approach, but permits more general comparisons to be made.” ^[return]
I prefer RPKM over normalised counts because you can compare between genes (although it’s different from the original RPKM) ^[return]
PCA: the separation of different clusters seems to make biological sense ^[return]