Usage

Pipeline overview

STIMP is a deep learning-based method that accurately imputes and predicts Chl_a in the coastal oceans. The inputs for STIMP include observations of Chl_a from coastal oceans, denoted as \(\mathbf{X}^{ob}\), and a spatial graph, \(\mathbf{G}\), that contains the geographic coordinates of the observations. STIMP simultaneously outputs a complete Chl_a dataset, \(\mathbf{X}\), and accurately predicts Chl_a, \(\tilde{\mathbf{Y}}\), based on \(\mathbf{X}\). STIMP and is formulated as:

\[p(\tilde{\mathbf{Y}}|\mathbf{X}^{ob})=\int_{\mathbf{X}}p_\Phi(\tilde{\mathbf{Y}}|\mathbf{X})p_\theta(\mathbf{X}|\mathbf{X}^{ob})d\mathbf{X}.\]

Prepare data

All data used in this work are publicly available through online sources. The chlorophyll-a observation datasets were 8-day averaged Level 3 mapped products from Moderate Resolution Imaging Spectroradiometer (MODIS) Aqua projects with a spatial resolution of 4 km MODIS. You can select the data with *.8D.*.4km.nc as filter.

We also uploaded the datasets on Zenodo at https://doi.org/10.5281/zenodo.14724760. Then,

mv data.zip /path/to/STIMP/
mkdir /path/to/STIMP/data
unzip data.zip -d /path/to/STIMP/data

We generate the 4 datasets, including Pearl River Estuary, the Northern of Mexico, Chesapeake Bay and Yangtze River Estuary, following this tutorials. The generated datasets are also included in the data.zip

Train STIMP

Step1. train the imputation function

The imputation step intends to reconstruct multiple potential complete spatiotemporal Chl_a distributions from partial observations. Due to the collected remote sensing Chl_a data does not contain ground truth for unobserved data, we can ramdomly select part of observation as imputation target to train the imputation function \(p_\theta(\mathbf{X}|\mathbf{X}^{ob})\). To assess the performance of STIMP, we randomly selected nine different rates of missing data, ranging from 10% to 90%, for choosing observed data as imputation targets. Four coastal ocean regions, including Pearl River Estuary, Northern Gulf of Mexico, Chesapeake Bay and Yangtze River Estuary, are selected to conduct the experiments.

for area in {"PRE","MEXICO","Chesapeake","Yangtze"}
do
   for i in {0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9}
   do
      python imputation/train_stimp.py --missing_ratio $i --area $area
   done
done

Step.2 Generate multiple imputations for Chl_a:

According to Rubin’s rules [Rub04], the final Chl_a prediction is obtained by averaging the outcomes of multiple imputation and prediction processes. Hence, we use the following script to generate multiple imputations for Chl_a

# We generate 10 imputations here
for area in {"PRE","MEXICO","Chesapeake","Yangtze"}
do
   python dataset/generate_data_with_stimp.py --area $area
done

Step.3 Train prediction function based on each imputation

for area in {"PRE","MEXICO","Chesapeake","Yangtze"}
do
   for i in {0..9}
   do
      python prediction/train.py --index $i --area $area
   done
done