A Dataset for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy

Data sources

The raw data used in the study can be found in the corresponding references.

VACV: Yakimovich A, Andriasyan V, Witte R, Wang IH, Prasad V, Suomalainen M, Greber UF. Plaque2.0-A High-Throughput Analysis Framework to Score Virus-Cell Transmission and Clonal Cell Expansion. PLoS One. 2015 Sep 28;10(9):e0138760. doi: 10.1371/journal.pone.0138760. PMID: 26413745; PMCID: PMC4587671.
HADV: Andriasyan V, Yakimovich A, Petkidis A, Georgi F, Witte R, Puntener D, Greber UF. Microscopy deep learning predicts virus infections and reveals the mechanics of lytic-infected cells. iScience. 2021 May 15;24(6):102543. doi: 10.1016/j.isci.2021.102543. PMID: 34151222; PMCID: PMC8192562.
HSV, IAV, RV: Olszewski, D., Georgi, F., Murer, L. et al. High-content, arrayed compound screens with rhinovirus, influenza A virus and herpes simplex virus infections. Sci Data 9, 610 (2022). https://doi.org/10.1038/s41597-022-01733-4

Data organisation

For each virus (HADV, VACV, IAV, RV and HSV) we provide the processed data in a separate directory, divided into three subdirectories: train, val and test, containing the proposed data split. Each of the subfolders contains two npy files: x.npy and y.npy, where x.npy contains the fluorescence or brightfield signal (both for HADV, as separate channels) of the cells or nuclei and y.npy contains the viral signal. The data is already processed as described in the Data preparation section.
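The layout described above can be read with a few lines of NumPy. This is a minimal sketch, not part of the published scripts: the function name `load_split`, the exact directory names (for example `HADV/train`), and the assumption that the first array axis indexes samples are illustrative and should be checked against the downloaded files.

```python
from pathlib import Path

import numpy as np


def load_split(root, virus, split):
    """Return the (x, y) arrays for one virus and one split.

    Assumes the layout <root>/<virus>/<split>/{x.npy, y.npy}; the exact
    directory names are an assumption and may differ in the archive.
    """
    split_dir = Path(root) / virus / split
    # mmap_mode="r" maps the file instead of reading it fully into memory.
    x = np.load(split_dir / "x.npy", mmap_mode="r")
    y = np.load(split_dir / "y.npy", mmap_mode="r")
    # Assumption: the first axis indexes samples in both arrays.
    if x.shape[0] != y.shape[0]:
        raise ValueError(f"x has {x.shape[0]} samples but y has {y.shape[0]}")
    return x, y
```

A call such as `load_split("data", "HADV", "train")` would then return the input and viral-signal arrays for the HADV training split, provided the directories are named that way.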

Additionally, Cellpose masks are made available for the test data in a separate masks directory. For each virus except VACV, there is a subdirectory test containing nuclei masks (nuc.npy). For HADV, cell masks are also available (cell.npy).
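Cellpose writes masks as integer label images in which 0 is background and every segmented object has its own positive label. Under that convention, the number of nuclei or cells in one mask can be counted as sketched below. The helper name `count_objects` is hypothetical, and the sketch assumes that a single 2D mask has already been selected from nuc.npy or cell.npy.

```python
import numpy as np


def count_objects(mask):
    """Count segmented objects in one integer label mask.

    Assumes the Cellpose convention: 0 is background and each object
    carries a distinct positive integer label.
    """
    labels = np.unique(mask)
    # Every non-zero label corresponds to one object.
    return int(np.count_nonzero(labels))
```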

Data preparation

Each VACV plaque was imaged as 9 files per channel, which need to be stitched to recreate the whole plaque. To achieve this, the multiview-stitcher toolbox was used. The stitching was first performed on the third channel, representing the brightfield microscopy image of the samples. The parameters found for this channel were then used to stitch the remaining channels. The VACV dataset is a time-lapse, from which timesteps 100, 108 and 115 were selected to produce the data used in the experiments. Images were center-cropped to 5948x6048 to match the size of the smallest image in the dataset (rounded down to the closest multiple of 2). The data was additionally filtered manually to remove the samples that contained only uninfected cells (C02, C07, D02, D07, E02, E07, F02, F07). The HAdV dataset is also a time-lapse, from which only the last timestep (49th) was selected.
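The center-cropping step can be illustrated as follows. This is a sketch under stated assumptions rather than the code in `preprocess_vacv.py`: the helper names are hypothetical, the spatial axes are assumed to be the first two axes of the array, and which of 5948 and 6048 is the height is not stated here.

```python
import numpy as np


def round_down_to_even(n):
    """Round an image size down to the closest multiple of 2."""
    return n - (n % 2)


def center_crop(image, out_h, out_w):
    """Crop the central out_h x out_w window from an image.

    Assumes the first two axes of `image` are height and width; any
    trailing axes (e.g. channels) are kept unchanged.
    """
    h, w = image.shape[:2]
    if out_h > h or out_w > w:
        raise ValueError("crop size exceeds image size")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]
```

In this sketch, the target size would be obtained by applying `round_down_to_even` to the height and width of the smallest stitched image and then cropping every image to that size.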

For the remaining datasets (HSV, IAV, RV), only the negative control data was used. It was selected as follows: from the data collected at the University of Zürich, only the first 2 columns of the Screen samples and only the first 12 columns of the ZPlates and prePlates samples were used.

All of the datasets were divided into training, validation and test holdouts in 0.7:0.2:0.1 ratios, using random seed 42 to ensure reproducibility. For the time-lapse data, the same sample from different timesteps was kept in a single holdout, to prevent information leakage and ensure fair evaluation.

All of the samples were normalised to the [-1, 1] range by subtracting the 3rd percentile, dividing by the difference between the 99.8th and 3rd percentiles, clipping to [0, 1] and scaling to [-1, 1]. For the brightfield channel of HAdV, percentiles 0.1 and 99.9 were used. These cutoff points were selected based on the histograms of the data values, to make the best use of the available data range. Specific values used for the normalization are summarized in Figure 3 of the manuscript in Related/alternate identifiers.
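The normalisation and the leakage-free split described above can be sketched as follows. The function names, the use of NumPy's `default_rng`, and the rounding of subset sizes are assumptions made for illustration; the published scripts may shuffle and round differently, so this sketch is not expected to reproduce the exact published split. The cutoff values `low` and `high` are passed in explicitly, since the values actually used are the ones reported in the manuscript.

```python
import numpy as np


def percentile_bounds(data, p_low=3.0, p_high=99.8):
    """Return the lower and upper intensity cutoffs for the given percentiles."""
    low, high = np.percentile(data, [p_low, p_high])
    return float(low), float(high)


def normalise(image, low, high):
    """Map intensities to [-1, 1]: shift, scale, clip to [0, 1], rescale."""
    scaled = (image.astype(np.float32) - low) / (high - low)
    return np.clip(scaled, 0.0, 1.0) * 2.0 - 1.0


def grouped_split(sample_ids, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split indices into train/val/test so that each sample ID (e.g. a
    well imaged at several timesteps) lands in exactly one subset.

    Returns a dict mapping subset name to a list of indices into sample_ids.
    """
    unique_ids = sorted(set(sample_ids))
    rng = np.random.default_rng(seed)  # assumption: the original RNG may differ
    shuffled = [unique_ids[i] for i in rng.permutation(len(unique_ids))]
    n_train = int(round(ratios[0] * len(shuffled)))
    n_val = int(round(ratios[1] * len(shuffled)))
    groups = {
        "train": set(shuffled[:n_train]),
        "val": set(shuffled[n_train:n_train + n_val]),
        "test": set(shuffled[n_train + n_val:]),
    }
    return {
        name: [i for i, s in enumerate(sample_ids) if s in ids]
        for name, ids in groups.items()
    }
```

Splitting on sample IDs rather than on individual images is what keeps all timesteps of one sample inside a single subset.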

To prepare the cell nuclei masks, a Cellpose model with the pre-trained cyto3 weights was used on the fluorescence channel. The diameter was set to 7 for all the datasets except HAdV, for which the automatic diameter estimation was used. Cell masks were prepared using Cellpose with the pre-trained cyto3 weights and a diameter of 70, on brightfield images stacked with the fluorescence nuclei signal. The data preparation can be reproduced by downloading the datasets and then running the scripts located in the scripts/data_processing directory of the VIRVS repository, after modifying the paths in them:

for HAdV data: `preprocess_hadv.py`
for VACV data: `stitch_vacv.py` + `preprocess_vacv.py`
for the rest of the viruses: `preprocess_other.py`
to prepare Cellpose predictions: `prepare_cellpose_preds.py` (for cells) and `prepare_cellpose_preds_nuc.py` (for nuclei)

Identifier
DOI https://doi.org/10.14278/rodare.3130
Related Identifier IsIdenticalTo https://www.hzdr.de/publications/Publ-39523
Related Identifier IsPartOf https://doi.org/10.14278/rodare.3129
Related Identifier IsPartOf https://rodare.hzdr.de/communities/health
Related Identifier IsPartOf https://rodare.hzdr.de/communities/rodare
Metadata Access https://rodare.hzdr.de/oai2d?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:rodare.hzdr.de:3130
Provenance
Creator Wyrzykowska, Maria; della Maggiora, Gabriel; Deshpande, Nikita; Mokarian, Ashkan; Yakimovich, Artur
Publisher Rodare
Publication Year 2024
Rights Creative Commons Attribution 4.0 International; Open Access; https://creativecommons.org/licenses/by/4.0/legalcode; info:eu-repo/semantics/openAccess
OpenAccess true
Contact https://rodare.hzdr.de/support
Representation
Language English
Resource Type Dataset
Version Version 1
Discipline Life Sciences; Natural Sciences; Engineering Sciences