Meta RNAseq analysis in S. mansoni

The following results were obtained from:

Lu Z, Berriman M (2018) Meta-analysis of RNA-seq studies reveals genes responsible for life stage-dominant functions in Schistosoma mansoni. bioRxiv 308189; doi:

Sample information

| Type | Label | Sequence Source | Sample Accession | Biological Replicates | Reference |
|---|---|---|---|---|---|
| Gonad (testis) | bTe | PRJEB14695 | ERS420096, ERS420097, ERS420098 | 3 | Lu et al. (2016) |
| Gonad (ovary) | bOv | PRJEB14695 | ERS420090, ERS420091, ERS420092 | 3 | Lu et al. (2016) |
| Egg | Egg | PRJNA294789 | SRR2245469 | 1 | Anderson et al. (2015) |
| Miracidium | Mir | PRJNA209511 | SRR922067 | 1 | Wang et al. (2013) |
| Sporocyst (48h) | Spo | PRJNA209511 | SRR922068 | 1 | Wang et al. (2013) |
| Cercaria | Cer | PRJEB2350 | ERR022872, ERR022877, ERR022878 | 3 | Protasio et al. (2012) |
| Schistosomulum (3h) | Som | PRJEB2350 | ERR022876, ERR022879 | 2 | Protasio et al. (2012) |
| Adult male before pairing | sMa | PRJEB14695 | ERS420103, ERS420104, ERS420105 | 3 | Lu et al. (2016) |
| Adult female before pairing | sFe | PRJEB14695 | ERS420108, ERS420109, ERS420110 | 3 | Lu et al. (2016) |
| Adult male after pairing | bMa | PRJEB14695 | ERS420093, ERS420106, ERS420107 | 3 | Lu et al. (2016) |
| Adult female after pairing | bFe | PRJEB14695 | ERS420099, ERS420100, ERS420101 | 3 | Lu et al. (2016) |

Note that there are variations in S. mansoni strain and definitive host in these studies.

Analysis pipeline

Reads were mapped to the S. mansoni genome annotation V5.2 using HISAT2 (v2.1.0) for the egg sample and STAR (v2.4.2a) for the remaining samples. Counts per gene were summarised with featureCounts (v1.4.5-p1) on the latest annotation (GeneDB data of 10/07/2017). Read counts were normalised in edgeR (v3.16.5) using the TMM method [1], and differential expression was analysed using the GLM approach (glmFit() and glmLRT()) [2]. RPKM values were calculated based on normalised library sizes [3], and mean values were used for biological replicates (indicated as “Normalised expression” in the charts).
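To make the RPKM step concrete, here is a minimal Python sketch of the arithmetic, assuming the "normalised library size" is the raw library size multiplied by the TMM normalisation factor (the effective library size, as edgeR defines it). The actual analysis was run in edgeR; the counts, library sizes, factors and gene length below are made-up numbers for illustration only.

```python
from statistics import mean

def rpkm(count, gene_length_bp, lib_size, norm_factor):
    """Reads per kilobase per million, using the effective library size.

    Assumption: effective library size = raw library size * TMM factor.
    """
    effective_lib = lib_size * norm_factor
    return count * 1e9 / (effective_lib * gene_length_bp)

# Hypothetical 2,000 bp gene measured in three biological replicates of one stage.
replicates = [
    {"count": 500, "lib_size": 10_000_000, "norm_factor": 1.00},
    {"count": 450, "lib_size": 9_000_000, "norm_factor": 1.25},
    {"count": 600, "lib_size": 12_000_000, "norm_factor": 0.80},
]

values = [rpkm(r["count"], 2000, r["lib_size"], r["norm_factor"]) for r in replicates]

# Replicates are averaged to give one "Normalised expression" value per stage.
normalised_expression = mean(values)
print(values, normalised_expression)
```

Because the library size is rescaled before dividing, these values are not identical to RPKM computed from raw library sizes.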

Data Exploration

Principal Component Analysis (DESeq2) [4]

Differential gene expression

Click on each figure to see the interactive logFC-logFDR volcano chart. Differential expression was also analysed for the stages with a single sample, although this is generally not recommended for samples without biological replicates.

Housekeeping and most abundant genes across all life stages

Housekeeping genes were discovered with two methods:

The most abundant gene transcripts were identified by ranking the mean expression values (before normalisation) across all samples.
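The ranking described above can be sketched in a few lines of Python. The gene IDs and expression values are placeholders, not data from the study, and the sketch assumes one expression value per sample for each gene.

```python
from statistics import mean

# Hypothetical expression values, one per sample, keyed by placeholder gene IDs.
expression = {
    "gene_A": [120.0, 80.0, 100.0],
    "gene_B": [900.0, 1100.0, 1000.0],
    "gene_C": [5.0, 15.0, 10.0],
}

def rank_by_mean(expr):
    """Return (gene, mean expression) pairs sorted from most to least abundant."""
    means = {gene: mean(vals) for gene, vals in expr.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_by_mean(expression)
top_genes = [gene for gene, _ in ranking[:2]]  # keep the two most abundant genes
print(top_genes)
```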

You can find the interactive charts here.

Genes preferentially expressed in certain life stage(s)

These analyses were performed for all stages excluding bTe and bOv (Details).

Preferential expression was calculated using the GLM approach to compare the expression of one stage to the average of the remaining stages. A gene was called preferentially expressed in a stage if FDR < 0.01 and its expression in that stage was higher than in any other stage.
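As an illustration of the "one stage versus the average of the rest" comparison, the Python sketch below builds the corresponding contrast coefficients and applies the two filtering criteria. It is a sketch under stated assumptions, not the edgeR code used for the analysis: the function names are hypothetical, the FDR is assumed to come from the one-vs-rest test, and the expression values are invented.

```python
# The nine stages analysed here (bTe and bOv are excluded).
STAGES = ["Egg", "Mir", "Spo", "Cer", "Som", "sMa", "sFe", "bMa", "bFe"]

def one_vs_rest_contrast(stage, stages=STAGES):
    """Contrast coefficients: +1 for `stage`, -1/(k-1) for each other stage."""
    k = len(stages)
    return [1.0 if s == stage else -1.0 / (k - 1) for s in stages]

def is_preferential(stage, mean_expr, fdr, cutoff=0.01):
    """True if FDR < cutoff and `stage` has the highest mean expression.

    mean_expr maps each stage label to the gene's mean expression in that stage.
    """
    others = [v for s, v in mean_expr.items() if s != stage]
    return fdr < cutoff and mean_expr[stage] > max(others)

contrast = one_vs_rest_contrast("Cer")

# Hypothetical gene that is highest in cercariae.
gene_expr = dict(zip(STAGES, [2.0, 3.0, 1.0, 50.0, 20.0, 4.0, 5.0, 4.5, 3.5]))
print(is_preferential("Cer", gene_expr, fdr=0.001))  # passes both criteria
print(is_preferential("Som", gene_expr, fdr=0.001))  # significant but not the highest stage
```

The second criterion matters because a significant one-vs-rest test alone does not guarantee that the stage has the highest expression; another single stage could still exceed it.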


To validate the meta-analysis, I selected differentially expressed genes (FDR cutoff 0.01) and calculated the Pearson correlation of logFC values between the original study and the present meta-analysis (Details).
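The validation step can be sketched as follows in plain Python. The per-gene values are invented, and the sketch assumes the FDR cutoff is applied to the meta-analysis results; the text does not say which set of FDR values was used for selection.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-gene results: logFC in the original study, logFC in the
# meta-analysis, and FDR (assumed here to be the meta-analysis FDR).
results = {
    "gene_1": {"logfc_orig": 2.0, "logfc_meta": 2.1, "fdr": 0.001},
    "gene_2": {"logfc_orig": -1.0, "logfc_meta": -1.2, "fdr": 0.005},
    "gene_3": {"logfc_orig": 0.5, "logfc_meta": 0.4, "fdr": 0.009},
    "gene_4": {"logfc_orig": 3.0, "logfc_meta": -3.0, "fdr": 0.5},  # not significant
}

# Keep only differentially expressed genes, then correlate the two logFC sets.
selected = [r for r in results.values() if r["fdr"] < 0.01]
r = pearson([g["logfc_orig"] for g in selected],
            [g["logfc_meta"] for g in selected])
print(len(selected), r)
```

A correlation close to 1 would indicate that the re-analysis reproduces the direction and magnitude of the changes reported in the original study.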

Some notes

  1. TMM (Trimmed Mean of M-values) is a between-sample normalisation method: scaling factors are computed across a group of samples.
  2. Similar to pairwise comparison. From the edgeR Users’ Guide: “The glm approach to multiple groups is similar to the classic approach, but permits more general comparisons to be made.”
  3. I prefer RPKM over normalised counts because it allows comparison between genes (although it differs from the original RPKM, since it is based on normalised library sizes).
  4. PCA: the separation of the different clusters seems to make biological sense.