mia-rapport-2024/Rcodes/real_data/application_dore.Rmd

# Application to \cite{doreRelativeEffectsAnthropogenic2021} data
\label{sec:application-to-dorerelativeeffectsanthropogenic2021-data}

```{r, setup, include=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = FALSE, dpi = 300)
```

```{r}
# import fix
if (getwd() == "/home/polarolouis/Nextcloud/Documents/APT/CEI/Stage Recherche Mathématiques/Depuis PC Portable/Stage MIA 2023/rapport-MIA-2023") {
    path_to_add <- "Rcodes/real_data/"
} else {
    path_to_add <- ""
}
```

```{r require_lib, echo = FALSE, include=FALSE, warning=FALSE}
require("tidyverse")
require("knitr")
require("colSBM")
require("ggplot2")
require("patchwork")
source(paste0(path_to_add, "temporary_plot.R"))
```

```{r better_collection_extraction, echo = FALSE, warning=FALSE}
extract_unlist_reorder <- function(clustering_data_path) {
    clustering <- readRDS(clustering_data_path)
    if (!is.list(clustering)) {
        clustering <- list(clustering)
    }
    best_partition <- extract_best_bipartite_partition(clustering)
    best_partition_unlist <- unlist(best_partition)
    if (!is.list(best_partition_unlist)) {
        best_partition_unlist <- list(best_partition_unlist)
    }
    size <- length(best_partition_unlist)
    return(setNames(lapply(best_partition_unlist, function(collection) {
        reorder_parameters(collection)
    }), paste(rep("Collection", size), seq_len(size))))
}
```

```{r load data, echo = FALSE, include = FALSE, warning=FALSE}
# All results
iid_unlist <- extract_unlist_reorder(paste0(path_to_add, "data/dore_collection_clustering_nb_run1_iid_123networks_24-05-23-21:40:42.Rds"))
```

Here we apply the network clustering procedure (we refer to it as *netclustering*)
to the data from \cite{doreRelativeEffectsAnthropogenic2021}. These data are
plant-pollinator bipartite networks from differents areas and times.

In a second part we will use additional information for the networks to
try to identify the impact and correlations with the observed structures.

## Netclustering with the $iid\text{-}colBiSBM$ model
We obtained the more interpretable results with $iid\text{-}colBiSBM$ model.
This resulted in `r length(iid_unlist)` collections to partition the $M = 123$
networks.

```{r meso-plots, echo = FALSE, results='asis'}
#| fig.cap=sapply(seq_along(iid_unlist), function(idx) paste0("Collection N°", idx)),
##| fig.cap = "Structure of the collections in the partition and respective proportions of blocks",
##| fig.subcap = sapply(seq_along(iid_unlist), function(idx) paste0("Collection N°", idx)),
#| fig.asp = 0.5


meso_print <- function(unlisted_partition) {
    for (idx in seq_along(unlisted_partition)) {
        print(plot(unlisted_partition[[idx]], type = "meso", mixture = TRUE) + ggtitle(paste("Collection ", idx)))
        cat("\\newline")
    }
}
meso_print(iid_unlist)

```

In all the obtained collections the structure is the classical nested structure.
As this is a well-known structure for plant-pollinator data this tends to
indicate that we are not going in a wrong direction.

The \nth{3} collection consists of only one network, indicating that for this
model, the small76 network was really different of all the others.
One reason might be that it's the oldest network and maybe the data collection
protocol is different.

## Comparison with additional information

```{r supinfo, echo = FALSE}
supinfo <- readxl::read_xlsx(paste0(path_to_add, "data/supinfo.xlsx"), sheet = 2)
interaction_data <- read.table(file = paste0(path_to_add, "data/interaction-data.txt"), sep = "\t", header = TRUE)

seq_ids_network_aggreg <- unique(interaction_data$id_network_aggreg)
incidence_matrices <- readRDS(file = paste0(path_to_add, "data/dore-matrices.Rds"))
names_aggreg_networks <- names(incidence_matrices)
vectorClusteringNet <- numeric(nrow(supinfo))
for (k in 1:length(iid_unlist)) {
    idclust <- match(iid_unlist[[k]]$net_id, names_aggreg_networks)
    supinfoclust <- match(seq_ids_network_aggreg[idclust], supinfo$Idweb)
    vectorClusteringNet[supinfoclust] <- k
}
```

Using supplementary information we obtain the following boxplots.


```{r boxplot-function, echo = FALSE}
supinfo_boxplot <- function(parameter, pretty_name) {
    return(ggplot(supinfo) +
        aes(
            x = vectorClusteringNet, y = parameter,
            fill = as.factor(vectorClusteringNet), group = as.factor(vectorClusteringNet)
        ) +
        geom_boxplot() +
        labs(
            x = "Collection number", y = pretty_name,
            fill = "Collection number"
        ))
}
```

```{r boxplots_annual_timespan, echo = FALSE}
#| fig.cap = "\\label{fig:boxplot-annual-time-span}Boxplot of annual time span in function of the collection number"
ggplot(supinfo) +
    aes(x = vectorClusteringNet, y = Annual_time_span,
    fill = as.factor(vectorClusteringNet), group = as.factor(vectorClusteringNet)) +
    geom_boxplot() +
        labs(
            x = "Collection number", y = "Annual time span",
            fill = "Collection number"
        )
```

The annual time span is the number of days the sampling period lasted. So we can
thus see in figure \ref{fig:boxplot-annual-time-span} that collections 1 and 4 were
sampled for a larger period of time than collections 2 and 5.
This could explain observed differences in the structure detected : the
"checkerboard" appearance of the alpha matrices representations may represent
interactions that only occurs at a given period of time.
Thus the shorter time period doesn't capture such interactions.

```{r boxplot_rainfall, echo = FALSE}
#| fig.cap = "\\label{fig:boxplot-total-rainfall}Boxplot of total rainfall in function of the collection number"
supinfo_boxplot(supinfo$Tot_Rainfall_IPCC, "Total rainfall")
```

There seems to be the same trend for the total rainfall.

```{r boxplot_sampling_effort, echo = FALSE}
#| fig.cap = "\\label{fig:boxplot-sampling-effort}Boxplot of the sampling effort in function of the collection number"
supinfo_boxplot(supinfo$Sampling_effort, "Sampling effort")
```

The sampling effort seems to be quite higher for collection 5 and a little
higher for collection 2. And collection 1 and 4 have the inverse trend. The
separation between collections 1,4 and 2,5 seems to still hold. And the sampling
effort is related to the sampling time that is why it's higher for the
collections that were sampled for a shorter time period.