1 Objective

The primary aim of this hands-on exercise is to produce wall-to-wall maps of aboveground biomass (AGB; Mg ha⁻¹) for a selected area of interest (AOI) using AI-based modelling approaches in R. The workflow follows the same strategy introduced in Exercise 04. Participants will compute LiDAR-derived metrics, train a Random Forest model using field plot data, and generate a continuous AGB prediction map. Unlike the previous exercise, the analysis is performed using a LAScatalog composed of multiple LiDAR tiles rather than a single point cloud. The data are already normalised (Z = height above ground, m), so terrain modelling steps such as ground classification, DTM generation, and height normalisation are not repeated. The same methodology is therefore applied operationally over a larger area.

2 Methodology

The modelling logic is consistent with Exercise 04:

  1. Read normalised LiDAR tiles as a LAScatalog to enable memory-safe, chunk-based processing.
  2. Read field plots with measured AGB and harmonise CRS with the LiDAR catalog.
  3. Clip plot-level point clouds from the catalog (write one LAZ per plot) for consistent plot metrics.
  4. Compute plot-level LiDAR metrics using the same f_metrics() function used for wall-to-wall mapping (training–prediction consistency).
  5. Train a Random Forest (ranger) model and evaluate performance with an independent test set.
  6. Compute wall-to-wall metric tiles with catalog_apply() and predict AGB per tile.
  7. Mosaic predicted AGB tiles into a single GeoTIFF and export deliverables.

3 Input data

  1. Normalised ALS tiles (.laz; Z already normalised to height above ground)
  2. Plots (.gpkg / .geojson) containing plot ID and AGB

4 Expected output

  1. Random Forest model (.rds)
  2. Training table (.csv)
  3. Accuracy table (.csv)
  4. Wall-to-wall metric tiles (GeoTIFF)
  5. AGB prediction tiles (GeoTIFF)
  6. Final AGB mosaic (GeoTIFF) + map figure (.png)

5 R packages

  • lidR: ALS/LiDAR processing, LAScatalog, chunk processing, and metric extraction
  • terra: raster processing, prediction, mosaicking, and export
  • sf: vector data I/O and CRS transformations
  • dplyr: data wrangling for training tables
  • ranger: fast Random Forest implementation
  • future: parallel execution (multisession) for scalable chunk processing
  • tcltk: interactive folder selection (tile folder)
  • ggplot2: map figure export

6 Hands-on Steps

6.1 Loading R packages

packages <- c("lidR","terra","sf","dplyr","ranger","future","tcltk","ggplot2")
missing_pkgs <- packages[!packages %in% installed.packages()[, "Package"]]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs, dependencies = TRUE)
suppressMessages(lapply(packages, library, character.only = TRUE))

options(lidR.progress = FALSE)

terra_temp <- file.path(tempdir(), "terra_tmp")
dir.create(terra_temp, recursive = TRUE, showWarnings = FALSE)
terra::terraOptions(tempdir = terra_temp)

6.2 User inputs (interactive)

Select (1) the folder containing normalised LAZ tiles, and (2) the plots file containing AGB.

cat("Select the folder containing NORMALIZED LAZ tiles (Z already normalized)...\n")
## Select the folder containing NORMALIZED LAZ tiles (Z already normalized)...
norm_dir <- tcltk::tk_choose.dir(caption = "Select NORMALIZED LAZ tile folder")
if (is.na(norm_dir) || norm_dir == "") stop("No normalized folder selected.")

cat("Select the plot file (GeoPackage / Shapefile / GeoJSON) containing AGB...\n")
## Select the plot file (GeoPackage / Shapefile / GeoJSON) containing AGB...
plots_path <- file.choose()
if (is.na(plots_path) || plots_path == "") stop("No plots file selected.")

6.3 User parameters

These parameters control plot geometry handling, CRS metadata assignment, mapping resolution, chunk buffering, height filtering, parallelisation, and Random Forest settings.

id_col  <- "plotID_best"
agb_col <- "AGB"

plots_are_points <- FALSE
plot_radius_m <- 20

# If CRS missing in tiles, assign it here (metadata only)
epsg_target <- 32617

# Metrics resolution for mapping
metrics_res   <- 40
metrics_start <- c(0, 0)

# Catalog chunking
chunk_size   <- 0
chunk_buffer <- 30

# Height filtering (normalized heights)
max_norm_height <- 80

# Parallel
use_parallel <- TRUE
n_workers   <- 4
omp_threads <- 1   # IMPORTANT with future

# RF
rf_num_trees <- 600
train_frac   <- 0.70

# Metrics to EXCLUDE (problematic bands)
metrics_to_exclude <- c("TotalReturns")

6.4 Output folder structure (single root folder)

All products are saved under a single root folder:

  • ForestSAT2026_05_Exercise/01_plot_clips_norm/
  • ForestSAT2026_05_Exercise/02_metrics_tiles/
  • ForestSAT2026_05_Exercise/03_agb_tiles_from_metrics/

out_dir <- file.path(dirname(norm_dir), "ForestSAT2026_05_Exercise")

dir.create(out_dir,  showWarnings = FALSE, recursive = TRUE)

dir_clips   <- file.path(out_dir, "01_plot_clips_norm")
dir_metrics <- file.path(out_dir, "02_metrics_tiles")
dir_pred    <- file.path(out_dir, "03_agb_tiles_from_metrics")

dir.create(dir_clips,   showWarnings = FALSE, recursive = TRUE)
dir.create(dir_metrics, showWarnings = FALSE, recursive = TRUE)
dir.create(dir_pred,    showWarnings = FALSE, recursive = TRUE)

model_rds   <- file.path(out_dir, "final_model_rf.rds")
train_csv   <- file.path(out_dir, "training_table.csv")
metrics_csv <- file.path(out_dir, "rf_accuracy_metrics.csv")

6.5 Parallel plan (chunk-based)

Parallelism is handled by future at the chunk/tile level.
To avoid nested parallelism, keep lidR's OpenMP threads at 1 (omp_threads <- 1).

if (use_parallel) {
  future::plan(future::multisession, workers = as.integer(n_workers))
} else {
  future::plan(future::sequential)
}
lidR::set_lidr_threads(omp_threads)

6.6 Read LAScatalog from normalised tiles

A LAScatalog enables scalable reading and processing of many tiles. Filters are applied to remove withheld/overlap points and enforce a valid height range.

ctg_norm <- lidR::readLAScatalog(norm_dir)

if (is.na(sf::st_crs(ctg_norm))) {
  message("Catalog CRS is missing. Assigning EPSG:", epsg_target)
  sf::st_crs(ctg_norm) <- sf::st_crs(epsg_target)
}

if (!lidR::is.indexed(ctg_norm)) {
  message("No .lax index found. Creating .lax files (recommended)...")
  tryCatch(lidR:::catalog_laxindex(ctg_norm),
           error = function(e) message("Could not create .lax indexes. Continue anyway."))
}

lidR::opt_select(ctg_norm) <- "xyzicrn"
lidR::opt_filter(ctg_norm) <- paste(
  "-drop_withheld -drop_overlap",
  sprintf("-drop_z_below %g", 0),
  sprintf("-drop_z_above %g", max_norm_height)
)
lidR::opt_chunk_size(ctg_norm)   <- chunk_size
lidR::opt_chunk_buffer(ctg_norm) <- chunk_buffer
lidR::opt_wall_to_wall(ctg_norm) <- TRUE

6.7 Read plots and harmonise CRS

Plots are read and transformed to match the catalog CRS. If plots are points, they can be buffered into circular plots.

plots <- sf::st_read(plots_path, quiet = TRUE)
if (!all(c(id_col, agb_col) %in% names(plots))) {
  stop("Plots file must contain columns: ", id_col, " and ", agb_col)
}

plots <- sf::st_transform(plots, sf::st_crs(ctg_norm))

if (plots_are_points) {
  plots <- sf::st_buffer(plots, dist = plot_radius_m)
}

plots <- plots %>% dplyr::select(dplyr::all_of(c(id_col, agb_col)), geometry)

6.8 Clip normalised point clouds by plots

Each plot polygon is used to clip the catalog and write one plot-level LAZ file per plot. These plot clips are later used to compute plot-level metrics consistently.

plots_clip <- plots %>% dplyr::rename(ID = !!id_col)

message("Clipping normalized point clouds by plot geometry (writing files)...")

ctg_clip <- ctg_norm
lidR::opt_output_files(ctg_clip) <- file.path(dir_clips, "plot_{ID}")
lidR::opt_laz_compression(ctg_clip) <- TRUE

lidR::clip_roi(ctg_clip, plots_clip)

## Chunk 1 of 10 (10%): state ✓
## Chunk 2 of 10 (20%): state ✓
## Chunk 3 of 10 (30%): state ✓
## Chunk 4 of 10 (40%): state ✓
## Chunk 5 of 10 (50%): state ✓
## Chunk 7 of 10 (60%): state ✓
## Chunk 6 of 10 (70%): state ✓
## Chunk 8 of 10 (80%): state ✓
## Chunk 9 of 10 (90%): state ✓
## Chunk 10 of 10 (100%): state ✓
##
## class       : LAScatalog (v1.3 format 3)
## extent      : 541864, 542652.3, 4134595, 4136540 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 17N 
## area        : 15997.72 m²
## points      : 654.5 thousand points
## type        : airborne
## density     : 40.9 points/m²
## density     : 14.5 pulses/m²
## num. files  : 10
clip_files <- list.files(dir_clips, pattern = "\\.(las|laz)$", full.names = TRUE, ignore.case = TRUE)
if (length(clip_files) == 0) stop("No plot clips were written. Check CRS and plot overlap with tiles.")
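Beyond the hard stop above, it can be useful to know how many plots failed to produce a clip (e.g., plots falling outside the tile extent). A small optional helper, sketched here as an assumption rather than part of the original workflow, compares counts:

```r
# Optional QA helper (illustrative, not from Exercise 04): compare the
# number of written plot clips with the number of input plot geometries.
check_clip_count <- function(clip_files, plots_clip) {
  n_clips <- length(clip_files)
  n_plots <- nrow(plots_clip)
  if (n_clips < n_plots) {
    warning(n_plots - n_clips, " plot(s) produced no clip (empty or outside the tiles).")
  }
  invisible(c(clips = n_clips, plots = n_plots))
}
```

It can be called after the clipping step as check_clip_count(clip_files, plots_clip).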

6.9 Data preparation and harmonization

This step ensures consistency between model training and spatial prediction. For each field plot, LiDAR point clouds are summarized into structural metrics describing canopy height and vertical structure, using the same definitions later applied to the wall-to-wall data. Plots with insufficient or unreliable information are excluded, and the remaining plots are linked to their field-measured aboveground biomass (AGB). Records with missing or invalid values are removed, and only informative predictors are retained. The resulting harmonized dataset contains observed AGB and corresponding LiDAR metrics and is used to calibrate the biomass model.
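Both this step and the wall-to-wall mapping assume that f_metrics() from Exercise 04 is already defined in the session; it is not redefined in this document. If it is not available, a minimal stand-in such as the sketch below can be used. The metric names here are illustrative assumptions (TotalReturns is included only because it is referenced by metrics_to_exclude), not Exercise 04's exact definition:

```r
# Hypothetical stand-in for Exercise 04's f_metrics(); the real metric set
# may differ. Returns a named list, as expected by pixel_metrics().
f_metrics <- function(x, y, z, intensity) {
  list(
    zmax         = max(z),
    zmean        = mean(z),
    zsd          = stats::sd(z),
    zq95         = as.numeric(stats::quantile(z, 0.95)),
    CanopyCover  = 100 * sum(z > 2) / length(z),  # % of returns above 2 m
    imean        = mean(intensity, na.rm = TRUE),
    TotalReturns = length(z)  # later dropped via metrics_to_exclude
  )
}
```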

plot_metrics_one <- function(las, min_pts = 30) {
  if (is.null(las) || lidR::is.empty(las)) return(NULL)
  d <- las@data
  d <- d[is.finite(d$Z) & d$Z >= 0 & d$Z <= max_norm_height, , drop = FALSE]
  if (nrow(d) < min_pts) return(NULL)
  
  intensity <- if ("Intensity" %in% names(d)) d$Intensity else rep(NA_real_, nrow(d))
  m <- f_metrics(d$X, d$Y, d$Z, intensity)
  as.data.frame(as.list(m))
}

message("Computing plot-level metrics from plot clips...")

metrics_list <- lapply(clip_files, function(f) {
  las <- lidR::readLAS(f)
  mdf <- plot_metrics_one(las, min_pts = 30)
  if (is.null(mdf)) return(NULL)
  pid <- sub("^plot_", "", tools::file_path_sans_ext(basename(f)))
  mdf$plot_id <- as.character(pid)
  mdf
})
metrics_df <- dplyr::bind_rows(metrics_list)
if (nrow(metrics_df) == 0) stop("All plot clips were empty or produced no metrics.")

plots_tbl <- sf::st_drop_geometry(plots) %>%
  dplyr::transmute(
    plot_id = as.character(.data[[id_col]]),
    y       = as.numeric(.data[[agb_col]])
  )

train_df <- plots_tbl %>%
  dplyr::inner_join(metrics_df, by = "plot_id") %>%
  dplyr::filter(is.finite(y)) %>%
  dplyr::select(where(~ !all(is.na(.))))

write.csv(train_df, train_csv, row.names = FALSE)
message("Training table saved: ", train_csv)

6.10 Modeling: data splitting, training, and evaluation

In this step, LiDAR-derived plot metrics are used as predictors to model plot-level aboveground biomass (AGB). The dataset is first restricted to numeric predictors, cleaned of missing or non-finite values, and then split into 70% training and 30% test sets so that performance can be evaluated on independent data. The Random Forest (RF) model is implemented with the ranger package. RF was chosen because it is robust to noise, captures non-linear relationships, handles correlated predictors effectively, and generally provides stable performance with minimal parameter tuning. Predictions are generated for both the training and test sets, and model performance is summarized to assess generalization. The fitted model, evaluated primarily on its test performance, is then carried forward to the subsequent steps of the workflow.
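The accuracy table in this step is built with extract_metrics(), which is assumed to be available from Exercise 04 and is not redefined here. If it is missing from your session, a stand-in consistent with Eq1–Eq6 of the accuracy assessment might look like this (the column names are assumptions chosen to match the table printed in this document):

```r
# Hypothetical stand-in for Exercise 04's extract_metrics(): one-row
# data.frame of absolute and relative accuracy metrics (Eq1-Eq6).
extract_metrics <- function(obs, pred, model_name, part_name) {
  e    <- pred - obs
  ybar <- mean(obs)
  data.frame(
    model = model_name,
    part  = part_name,
    r2    = 1 - sum(e^2) / sum((obs - ybar)^2),  # Eq1
    rmse  = sqrt(mean(e^2)),                     # Eq2
    rmseR = sqrt(mean(e^2)) / ybar * 100,        # Eq3
    bias  = mean(e),                             # Eq4 (MD)
    biasR = mean(e) / ybar * 100,                # relative MD
    mae   = mean(abs(e)),                        # Eq5
    maeR  = mean(abs(e)) / ybar * 100            # Eq6
  )
}
```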

pred_cols <- setdiff(names(train_df), c("plot_id", "y"))
is_num <- sapply(train_df[, pred_cols, drop = FALSE], is.numeric)
pred_cols <- pred_cols[is_num]

# Exclude problematic metrics from modeling
pred_cols <- setdiff(pred_cols, metrics_to_exclude)

dat_use <- train_df[, c("plot_id", "y", pred_cols)]
dat_use <- dat_use[stats::complete.cases(dat_use[, c("y", pred_cols)]), , drop = FALSE]

n <- nrow(dat_use)
if (n < 4) stop("Not enough plots for training. Need at least 4 complete cases.")

set.seed(42)  # reproducible train/test split
idx <- sample(seq_len(n))
n_train  <- max(2, floor(train_frac * n))
train_id <- idx[1:n_train]
test_id  <- setdiff(idx, train_id)

train <- dat_use[train_id, , drop = FALSE]
test  <- dat_use[test_id,  , drop = FALSE]

target   <- "y"
features <- pred_cols

rf_fit <- ranger(
  dependent.variable.name = target,
  data       = train[, c(target, features)],
  num.trees  = rf_num_trees,
  mtry       = max(1, floor(sqrt(length(features)))),
  importance = "impurity",
  seed       = 42
)

pred_train <- predict(rf_fit, data = train[, features, drop = FALSE])$predictions
pred_test  <- predict(rf_fit, data = test[,  features, drop = FALSE])$predictions

metrics_out <- dplyr::bind_rows(
  extract_metrics(train$y, pred_train, model_name = "RF_ranger", part_name = "train") |>
    dplyr::mutate(n = nrow(train), .before = 1),
  extract_metrics(test$y,  pred_test,  model_name = "RF_ranger", part_name = "test")  |>
    dplyr::mutate(n = nrow(test),  .before = 1)
)

write.csv(metrics_out, metrics_csv, row.names = FALSE)
saveRDS(rf_fit, model_rds)

message("RF model saved: ", model_rds)
message("Accuracy metrics saved: ", metrics_csv)

# Safety: ensure excluded metrics are not in 'features'
features <- setdiff(rf_fit$independent.variable.names, metrics_to_exclude)
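Because the model was fitted with importance = "impurity", the fitted object carries per-predictor importance scores. A small helper (an optional addition, not part of the original workflow) can be used to rank them, e.g. top_importance(rf_fit):

```r
# Optional helper (illustrative): return the n most important predictors
# from a fitted model carrying a named variable.importance vector.
top_importance <- function(fit, n = 10) {
  imp <- sort(fit$variable.importance, decreasing = TRUE)
  head(imp, n)
}
```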

6.11 Accuracy assessment

Model performance is evaluated by comparing predicted and observed plot-level aboveground biomass (AGB) for both the training and the independent test datasets. Accuracy is quantified using complementary metrics that describe goodness of fit, prediction error magnitude, and systematic bias. Specifically, we report the coefficient of determination (R²; Eq1), the Root Mean Square Error (RMSE; Eq2) and its relative form (RMSE%; Eq3), the bias or mean difference (MD; Eq4), and the Mean Absolute Error (MAE; Eq5) with its relative form (MAE%; Eq6), where the relative metrics are normalized by the mean observed AGB. Using both absolute and relative metrics provides a robust assessment of model accuracy and generalization performance.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad \text{(Eq1)} \] \[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad \text{(Eq2)} \] \[ RMSE\% = \frac{RMSE}{\bar{y}} \times 100 \qquad \text{(Eq3)} \]

\[ MD = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) \qquad \text{(Eq4)} \]

\[ MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \qquad \text{(Eq5)} \] \[ MAE\% = \frac{MAE}{\bar{y}} \times 100 \qquad \text{(Eq6)} \]

print(metrics_out)
##   n     model  part        r2     rmse    rmseR       bias     biasR      mae
## 1 7 RF_ranger train 0.9034344 15.60472 14.36903  0.1845695  0.169954 14.35551
## 2 3 RF_ranger  test 0.3100952 27.67790 23.22882 -7.2091154 -6.050287 23.34014
##       maeR
## 1 13.21873
## 2 19.58833

6.12 Wall-to-wall LiDAR metrics

Wall-to-wall predictor tiles are generated by computing pixel_metrics() per catalog chunk. The chunk buffer reduces edge effects; outputs are cropped back to the core chunk extent to remove overlap.

metrics_chunk <- function(chunk, res, start, max_h) {
  
  las <- lidR::readLAS(chunk)
  if (is.null(las) || lidR::is.empty(las)) return(NULL)
  
  las <- lidR::filter_poi(las, Z >= 0 & Z <= max_h)
  if (lidR::is.empty(las)) return(NULL)
  
  m <- lidR::pixel_metrics(
    las   = las,
    func  = ~f_metrics(X, Y, Z, Intensity),
    res   = res,
    start = start,
    pkg   = "terra",
    geom  = "bbox"
  )
  if (is.null(m)) return(NULL)
  
  # Crop to chunk extent (remove buffer overlap)
  m <- terra::crop(m, terra::ext(chunk))
  m
}

message("Computing METRICS tiles with catalog_apply()...")

lidR::opt_output_files(ctg_norm) <- file.path(dir_metrics, "METRICS_{XLEFT}_{YBOTTOM}")
lidR::opt_progress(ctg_norm) <- FALSE

options_engine_metrics <- list(
  need_output_file = TRUE,
  need_buffer      = TRUE,
  automerge        = FALSE,
  overwrite        = TRUE
)

metrics_files <- lidR::catalog_apply(
  ctg_norm,
  metrics_chunk,
  res      = metrics_res,
  start    = metrics_start,
  max_h    = max_norm_height,
  .options = options_engine_metrics
)

metrics_files <- unlist(metrics_files, use.names = FALSE)
metrics_files <- metrics_files[file.exists(metrics_files)]
if (length(metrics_files) == 0) stop("No METRICS tiles were written.")

message("METRICS tiles written to: ", dir_metrics)

6.13 Generation of aboveground biomass (AGB) maps by tiles

Each metrics tile is read, aligned to the model predictors (drop excluded metrics; pad missing predictors with NA), and predicted using the trained RF model. Outputs are saved as AGB GeoTIFF tiles.

message("Predicting AGB tiles from METRICS tiles using terra::predict()...")

agb_files <- character(0)

for (f in metrics_files) {
  
  metrics_r <- terra::rast(f)
  
  # Drop excluded metric bands if present (e.g., TotalReturns)
  drop_now <- intersect(metrics_to_exclude, names(metrics_r))
  if (length(drop_now) > 0) {
    metrics_r <- metrics_r[[setdiff(names(metrics_r), drop_now)]]
  }
  
  # Pad missing predictors (if needed)
  missing <- setdiff(features, names(metrics_r))
  if (length(missing) > 0) {
    na_layer <- metrics_r[[1]]
    na_layer[] <- NA_real_
    for (m in missing) {
      metrics_r <- c(metrics_r, na_layer)
      names(metrics_r)[terra::nlyr(metrics_r)] <- m
    }
  }
  
  x_r <- metrics_r[[features]]
  names(x_r) <- features
  
  out_path <- file.path(dir_pred, paste0("AGB_", tools::file_path_sans_ext(basename(f)), ".tif"))
  
  # ranger's predict() returns a prediction object, so terra::predict()
  # needs a wrapper that extracts the numeric predictions
  pred_fun <- function(model, data, ...) {
    predict(model, data = data, num.threads = 1, ...)$predictions
  }
  
  terra::predict(
    x_r,
    model     = rf_fit,
    fun       = pred_fun,
    na.rm     = TRUE,
    cores     = 1,
    filename  = out_path,
    overwrite = TRUE
  )
  
  agb_files <- c(agb_files, out_path)
}

agb_files <- agb_files[file.exists(agb_files)]
if (length(agb_files) == 0) stop("No AGB tiles were written from metrics.")

message("AGB tiles written to: ", dir_pred)

6.14 Mosaicking, visualization, and export of the aboveground biomass (AGB) map

All predicted tiles are mosaicked into a single AGB map. The mosaic is exported as GeoTIFF and visualised as a PNG figure.

agb_coll   <- terra::sprc(agb_files)
agb_mosaic <- terra::mosaic(agb_coll, fun = "mean")

# Save final mosaic GeoTIFF (single root folder)
out_mosaic_tif <- file.path(out_dir, "AGB_40m_mosaic.tif")
terra::writeRaster(agb_mosaic, out_mosaic_tif, overwrite = TRUE)
message("Final AGB mosaic saved: ", out_mosaic_tif)

# ggplot map
df_pred <- as.data.frame(agb_mosaic, xy = TRUE, na.rm = TRUE)
names(df_pred) <- c("x", "y", "AGB")

p_agb <- ggplot(df_pred) +
  geom_raster(aes(x = x, y = y, fill = AGB)) +
  scale_fill_viridis_c(option = "C", name = "AGB (Mg/ha)") +
  coord_equal() +
  labs(title = "Predicted Aboveground Biomass", x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(face = "bold"),
    legend.position = "right"
  )

print(p_agb)

# Save ggplot figure (single root folder)
out_png <- file.path(out_dir, "AGB_40m_mosaic.png")
ggsave(out_png, plot = p_agb, width = 8, height = 6, dpi = 400)
message("AGB mosaic figure saved: ", out_png)

# Save tile list (helpful for QA)
writeLines(agb_files, con = file.path(out_dir, "AGB_tile_list.txt"))

message("ALL outputs are under: ", out_dir)

7 Reference

  • Roussel, J.-R., Auty, D., Coops, N. C., Tompalski, P., Goodbody, T. R. H., Meador, A. S., & Achim, A. (2020). lidR: An R package for analysis of Airborne Laser Scanning (ALS) data. Remote Sensing of Environment, 251, 112061.
  • Hijmans, R. J., et al. (2022). terra: Spatial data analysis. CRAN.
  • Wright, M. N., Wager, S., & Probst, P. (2019). ranger: A fast implementation of Random Forests. CRAN.

8 Key takeaways

  1. LAScatalog enables scalable processing for large AOIs.
    Chunk-based reading, buffering, and tile outputs allow wall-to-wall mapping across hundreds of files.

  2. Training–mapping consistency is essential.
    Using the same f_metrics() function for plot metrics and wall-to-wall metrics improves robustness and reduces predictor mismatch.

  3. Most mapping artefacts come from predictor rasters and edge handling.
    Buffers, cropping to chunk extent, and quality checks of metric/prediction tiles are critical before interpretation.