In the context of the FeMAI project (Federated Microbiome AI for human health), this dataset was created to assess various machine learning classification methods for colorectal cancer risk stratification.
Cohort overview
This dataset gathers 2340 human stool samples characterized by shotgun metagenomic sequencing from 15 public cohorts spanning 10 countries. It is intended for studying the composition of the gut microbiota in healthy controls and in patients with adenoma or colorectal cancer (CRC).
The BioProjects associated with the cohorts are:
PRJDB4176 (JPN, 645 individuals, 286 CRC patients)
PRJEB10878 (CHN, 128 individuals, 74 CRC patients)
PRJEB12449 (USA, 104 individuals, 52 CRC patients)
PRJEB27928 (GER, 82 individuals, 22 CRC patients)
PRJEB6070 (FRA, 156 individuals, 53 CRC patients - GER, 43 individuals, 38 CRC patients)
PRJEB7774 (AUT, 156 individuals, 46 CRC patients)
PRJNA389927 (USA, 56 individuals, 26 CRC patients - CAN, 28 individuals, 2 CRC patients)
PRJNA397112 (IND, 110 individuals, no patients)
PRJNA447983 (ITA, 140 individuals, 61 CRC patients)
PRJNA531273 (IND, 30 individuals, 30 CRC patients)
PRJNA608088 (CHN, 18 individuals, 6 CRC patients)
PRJNA429097 (CHN, 193 individuals, 98 CRC patients)
PRJNA763023 (CHN, 200 individuals, 100 CRC patients)
PRJNA731589 (CHN, 161 individuals, 76 CRC patients)
PRJNA961076 (BRA, 90 individuals, 30 CRC patients)
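The per-cohort figures above can be cross-checked against the totals given in the cohort overview. The following is a minimal sketch that transcribes the list into a Python structure and tallies it; the variable and function names are illustrative and not part of the dataset.

```python
# Figures transcribed from the BioProject list above:
# (BioProject, country, individuals, CRC patients).
# PRJEB6070 and PRJNA389927 each contribute two country-level entries.
COHORTS = [
    ("PRJDB4176", "JPN", 645, 286),
    ("PRJEB10878", "CHN", 128, 74),
    ("PRJEB12449", "USA", 104, 52),
    ("PRJEB27928", "GER", 82, 22),
    ("PRJEB6070", "FRA", 156, 53),
    ("PRJEB6070", "GER", 43, 38),
    ("PRJEB7774", "AUT", 156, 46),
    ("PRJNA389927", "USA", 56, 26),
    ("PRJNA389927", "CAN", 28, 2),
    ("PRJNA397112", "IND", 110, 0),
    ("PRJNA447983", "ITA", 140, 61),
    ("PRJNA531273", "IND", 30, 30),
    ("PRJNA608088", "CHN", 18, 6),
    ("PRJNA429097", "CHN", 193, 98),
    ("PRJNA763023", "CHN", 200, 100),
    ("PRJNA731589", "CHN", 161, 76),
    ("PRJNA961076", "BRA", 90, 30),
]


def summarize(cohorts):
    """Count distinct BioProjects and countries, and sum the per-cohort figures."""
    return {
        "bioprojects": len({c[0] for c in cohorts}),
        "countries": len({c[1] for c in cohorts}),
        "individuals": sum(c[2] for c in cohorts),
        "crc_patients": sum(c[3] for c in cohorts),
    }


print(summarize(COHORTS))
# {'bioprojects': 15, 'countries': 10, 'individuals': 2340, 'crc_patients': 1000}
```

The tally agrees with the 2340 samples, 15 cohorts and 10 countries stated in the overview.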
Data processing
Sequencing data was downloaded from the European Nucleotide Archive.
Reads were quality-trimmed and sequencing adapters were removed using fastp. Remaining host contamination was filtered out by mapping reads against the human reference genome (T2T-CHM13v2.0) with bowtie2.
Microbial species were identified and quantified against both the human gut reference gene catalogue (IGC2, 10.4M genes) and the human oral gene catalogue (8.4M genes), whose genes are clustered into Metagenomic Species Pangenomes with taxonomic and functional annotation.
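The two read-cleaning steps can be sketched as command lines. The description does not give the parameters actually used, so the sketch below is only an illustration: it assumes paired-end reads, default fastp settings, and a bowtie2 index already built from T2T-CHM13v2.0. All file names, the index prefix and the thread count are hypothetical.

```python
import shlex


def fastp_cmd(r1, r2, out1, out2):
    """fastp call for paired-end reads: quality trimming and adapter removal
    with default settings (assumed, the settings used are not documented)."""
    return ["fastp", "-i", r1, "-I", r2, "-o", out1, "-O", out2]


def bowtie2_host_filter_cmd(index_prefix, r1, r2, unaligned_pattern, threads=4):
    """bowtie2 call that keeps read pairs which do not align concordantly
    to the human reference; the SAM output itself is discarded."""
    return [
        "bowtie2",
        "-p", str(threads),
        "-x", index_prefix,                # hypothetical index of T2T-CHM13v2.0
        "-1", r1,
        "-2", r2,
        "--un-conc-gz", unaligned_pattern, # '%' is replaced by 1 and 2
        "-S", "/dev/null",
    ]


# Hypothetical file names for one sample.
trim = fastp_cmd("s_1.fastq.gz", "s_2.fastq.gz", "s_trim_1.fastq.gz", "s_trim_2.fastq.gz")
host = bowtie2_host_filter_cmd("chm13v2.0", "s_trim_1.fastq.gz", "s_trim_2.fastq.gz",
                               "s_clean_%.fastq.gz")

# Print the commands; they could be run with subprocess.run(cmd, check=True).
print(shlex.join(trim))
print(shlex.join(host))
```

Building the commands as argument lists rather than shell strings avoids quoting problems when sample names contain unusual characters.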
Data provided
The data associated with the cohorts are:
Metagenomic Species abundance/count tables across samples and associated taxonomy (GTDB version RS214)
Functional module abundance tables across samples and associated annotation (KEGG version 92)
Manually curated metadata: all gut metagenomic samples from the 14 public projects are listed except 6. The discarded samples are 96 virome samples from PRJNA389927 and 6 samples from PRJEB12449 that are not described in the associated paper and have no health status. A quality check identified 104 samples as contaminated; they remain listed in the metadata file, but their removal is recommended.
Comparison table between Meteor, MetaPhlAn2 and MetaPhlAn4.
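Since the contaminated samples remain in the metadata file, users will typically want to drop them before analysis. The following is a minimal sketch of that step using only the standard library. The column names (sample_id, bioproject, health_status, contaminated), the flag values and the three example rows are hypothetical, as the actual layout of the metadata file is not described here.

```python
import csv
import io

# Values treated as "contaminated" in the hypothetical flag column.
TRUTHY = {"yes", "true", "1"}


def drop_contaminated(rows, flag_column="contaminated"):
    """Return only the rows whose flag column does not mark contamination."""
    return [r for r in rows if r[flag_column].strip().lower() not in TRUTHY]


# Toy metadata standing in for the real file (made-up sample identifiers).
EXAMPLE = """sample_id,bioproject,health_status,contaminated
S1,PRJEB7774,CRC,no
S2,PRJEB7774,control,yes
S3,PRJDB4176,adenoma,no
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE)))
kept = drop_contaminated(rows)
print([r["sample_id"] for r in kept])
# ['S1', 'S3']
```

The same sample identifiers can then be used to subset the abundance and functional module tables so that all tables stay aligned.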