1 Objective

The primary aim of this hands-on exercise is to produce wall-to-wall maps of aboveground biomass (AGB; m³ ha⁻¹) for a selected area of interest (AOI) using AI-based modelling approaches. To achieve this, the entire processing workflow is implemented in R, providing a fully reproducible and open-source environment. Participants will work with a suite of R packages designed to process, analyze, and model airborne LiDAR data for forest and ecological applications.

2 Methodology

The exercise follows a four-step workflow. First, ALS grid-based metrics are extracted at plot locations. Second, these metrics are filtered based on spatial autocorrelation and collinearity. Third, predictive models are built and tuned using linear regression, machine learning, and deep learning algorithms in R, with hyperparameter optimization (if possible) to identify the best-performing models. The results are then compared across methods. Finally, the best model from each learning approach is applied to generate spatial prediction maps.

3 Input data

ALS data (.laz/las)
Plots (.geojson/gpkg)

4 Expected output

AI model (.rds)
Accuracy table (.xlsx)
Observed–predicted scatter figure (.png),
Raster products (i.e., DTM, DSM, CHM; .GeoTIFF)
AGB map (GeoTIFF)

5 R packages

lidR: Offers tools for reading, processing, visualising, and extracting metrics from airborne LiDAR (ALS) point cloud data, especially for forestry and ecological uses (Roussel et al., 2015)
RCSF – Cloth Simulation Filter (CSF) is an airborne LiDAR (Light Detection and Ranging) ground points filtering algorithm which is based on cloth simulation (Zhang et al., 2016).
raster – Supports reading, writing, and manipulating raster spatial data used in legacy geospatial workflows (Hijmans et al., 2015)
terra – A modern framework for handling large raster and vector datasets, successor to raster (Hijmans et al., 2022)
sf – Handles vector spatial data using simple features, fully integrated with tidy R.
dplyr – Facilitates fast, readable data manipulation like filtering, grouping, and summarising.
ggplot2 – Creates layered, high-quality, customizable data visualisations.
plot3D – Provides functions for static 3D visualisations, helpful for exploring LiDAR point clouds.
ranger – A fast Random Forest implementation for high-dimensional, large datasets (Wright et al. 2019).
xgboost – Implements gradient boosting decision trees for high-performance predictive modelling (Chen et al., 2019).

6 Hands-on Steps

6.1 Loading R packages

packages <- c(
  "lidR",
  "RCSF",
  "raster",
  "terra",
  "sf",
  "dplyr",
  "ggplot2",
  "plot3D",
  "ranger",
  "xgboost",
  "patchwork"
)

# install only missing packages
missing_pkgs <- packages[!packages %in% installed.packages()[, "Package"]]

if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, dependencies = TRUE)
}

# load packages quietly
suppressMessages(
  lapply(packages, library, character.only = TRUE)
)

6.2 Loading raw LiDAR data

LiDAR point clouds are commonly distributed in several formats, including LAS, LAZ, PLY, and others. Here, users can upload LiDAR datasets by selecting the desired file path. Once loaded, key LiDAR features such as coordinate system, point density, number of points, memory usage, and acquisition type are printed to support informed processing decisions.

# ============================================================
# 1. LiDAR File Selection
# ============================================================
las_path <- file.choose()  # Select LAS/LAZ file interactively

# ============================================================
# 2. LiDAR Data Import with Fallback Reader
# ============================================================
las <- tryCatch(
  readALSLAS(las_path),
  error = function(e) {
    message("readALSLAS() failed. Using lidR::readLAS() instead.")
    lidR::readLAS(las_path)
  }
)  # Read LiDAR file with robust error handling

# ============================================================
# 3. Input Data Validation
# ============================================================
if (is.empty(las)) stop("LAS file is empty")  # Stop if no points are present

# ============================================================
# 4. Coordinate Reference System Assignment
# ============================================================
lidR::projection(las) <- sf::st_crs(32633)  # Assign projected CRS (UTM zone 33N)

# ============================================================
# 5. LiDAR Object Inspection
# ============================================================
# ============================================================
# 6. Point Density and Point Spacing (Global Metrics)
# ============================================================
# Point density (pts/m²) represents the average number of LiDAR points
# per unit area over the entire dataset. By default, it considers ALL
# returns stored in the LAS file (first, intermediate, and last returns).

# Point spacing (m or cm) is the theoretical average distance between
# neighboring LiDAR points, derived from total area and total point count.
# It is a global summary metric and does NOT describe local point spacing
# or pulse spacing.

#las
#cat("------------\n")
#A <- area(las)
#cat("Point spacing (cm):", sqrt(A/npoints(las))*100, "\n")
#cat("Point density (pts/m^2):", npoints(las)/A, "\n")

6.3 Small 3D visualization (didactic example)

This step extracts a small circular subset of the LiDAR point cloud and displays it in 3D to visually illustrate canopy structure, height variation, and point distribution before further processing.

6.4 Ground classification using Cloth Simulation Filter (CSF)

Ground classification separates terrain points from vegetation and objects, essential for terrain models and normalising LiDAR heights. Ground points are identified using the Cloth Simulation Filter (CSF: Zhang et al., 2016), which simulates a virtual cloth over the inverted point cloud. Parameters control cloth resolution, rigidity, and convergence, adapting to terrain roughness. The classified point cloud offers a reliable ground surface for terrain and canopy analysis.

las_g <- classify_ground(las,
                         csf(sloop_smooth=FALSE, class_threshold=0.5,
                             cloth_resolution=0.5, rigidness=1L,
                             iterations=500L, time_step=0.65))

6.5 Raster product generation

Raster surfaces from classified LiDAR points depict terrain, surface height, and canopy at 0.5 m resolution: • DTM interpolates ground points for terrain elevation without vegetation. • DSM captures maximum elevation, including vegetation and structures. • CHM subtracts DTM from DSM to show vegetation height and canopy structure.

# ============================================================
# 1. Ground Point Extraction
# ============================================================
ground <- filter_ground(las_g)  # Select ground-classified LiDAR points

# ============================================================
# 2. Digital Terrain Model (DTM) Generation
# ============================================================
DTM <- grid_terrain(ground, res=0.5, algorithm=tin())  # Interpolate ground surface

# ============================================================
# 3. Digital Surface Model (DSM) Generation
# ============================================================
DSM <- grid_metrics(las_g, res=0.5, ~max(Z))  # Compute canopy top surface

# ============================================================
# 4. Canopy Height Model (CHM) Calculation
# ============================================================
CHM <- DSM - DTM  # Derive canopy height above ground

6.6 Visualization of raster products

This visualization displays the Digital Terrain Model (DTM), Digital Surface Model (DSM), and Canopy Height Model (CHM) side by side, allowing visual comparison of terrain elevation, surface structure, and vegetation height patterns across the study area.

6.7 Height normalization of LiDAR point clouds

Height normalisation is the process of transforming raw LiDAR elevation values into heights relative to the ground surface. By removing the influence of terrain variability, this step enables vegetation and object heights to be compared consistently across areas with different elevations and slopes.

Several methods can be used for height normalisation, depending on data characteristics and terrain complexity. Common approaches include:

Raster-based normalisation, where point elevations are adjusted using a Digital Terrain Model (DTM)
k-nearest-neighbour interpolation, which estimates ground height locally from nearby ground points
TIN(Triangulated Irregular Network)-based normalisation, which models the terrain as a triangulated irregular network

In this workflow, a TIN-based method is adopted because it provides a balanced and flexible representation of the terrain surface, particularly in areas with variable topography. This approach effectively preserves local ground variations while ensuring robust normalisation for canopy structure analysis.

# ============================================================
# 1. Height Normalization of LiDAR Point Cloud
# ============================================================
nlas <- normalize_height(las, tin())  # Normalize heights using TIN-based ground model

# ============================================================
# 2. Filtering of Non-Physical Points
# ============================================================
nlas <- filter_poi(nlas, Z>=0)  # Remove negative height values

6.8 Loading field plots

Field plots containing aboveground biomass (AGB) data are loaded from vector datasets (e.g., GeoJSON or GPKG) and reprojected to the same coordinate reference system as the LiDAR data (EPSG: 32633). This ensures accurate spatial alignment between field measurements and LiDAR-derived products.

# ============================================================
# 1. Plot File Selection
# ============================================================
plots_path <- file.choose()  # Select plot vector file interactively

# ============================================================
# 2. Plot Data Import
# ============================================================
plots <- st_read(plots_path, quiet=TRUE)  # Read plot spatial data

# ============================================================
# 3. Plot Identifier Assignment
# ============================================================
plots$plot_id <- seq_len(nrow(plots))  # Create unique plot IDs

# ============================================================
# 4. Target Variable Type Conversion
# ============================================================
plots$AGB_m3_ha <- as.numeric(plots$AGB_m3_ha)  # Ensure numeric AGB values

6.9 Visual inspection of CHM and field plots

Before proceeding with further analyses, the spatial overlap between the Canopy Height Model (CHM) and the field plots is visually inspected. This step verifies correct plot positioning, scale, and coverage, helping to identify potential misalignments or projection issues early in the workflow.

6.10 Plot-level LiDAR metrics

LiDAR point clouds provide detailed 3D data via X, Y, Z coordinates and return intensity, capturing the structural complexity of forests, shrubs, and vegetation (Alvites et al.,2021, 2022,2025). From these point clouds, vegetation structure is summarised using statistical metrics over spatial units such as field plots. Each plot is used to clip the normalised LiDAR point cloud, and about 69 plot/crown structural metrics are computed using rLiDAR (Silva et al., 2016), including height percentiles, intensity statistics, vertical measures, and range metrics. These variables describe canopy structure and vertical complexity. All metrics are linked to their plots for comparison with biomass measurements. The hypothesis is that denser forests have higher upper-height metrics than open forests, making these features crucial for modelling.

# ============================================================
# 1. Crown Metrics Extraction Function (Per Plot)
# ============================================================
get_plot_crownmetrics <- function(nlas, plot_sf_row, min_pts = 30){
  pid <- plot_sf_row$plot_id  # Store plot identifier
  
  las_k <- clip_roi(nlas, plot_sf_row)
  if (lidR::is.empty(las_k)) return(NULL)  # Skip empty LiDAR subsets
  
  d <- as.data.frame(las_k@data[, c("X","Y","Z","Intensity")])
  d <- d[is.finite(d$Z) & d$Z > 0, , drop = FALSE]
  if (nrow(d) < min_pts) return(NULL)  # Enforce minimum point density
  
  xyziId <- cbind(d$X, d$Y, d$Z, d$Intensity, rep(pid, nrow(d)))
  m <- CrownMetrics(xyziId)  # Compute crown-level structural metrics
  names(m)[1] <- "plot_id"
  m
}

# ============================================================
# 2. Crown Metrics Computation for All Plots
# ============================================================
metrics_list <- suppressWarnings(lapply(seq_len(nrow(plots)), function(k){
  get_plot_crownmetrics(nlas, plots[k, ], min_pts = 30)
}))  # Apply crown metric extraction to each plot

## .........

# ============================================================
# 3. Assembly of Plot-Level Crown Metrics Table
# ============================================================
drid_metric_plots <- dplyr::bind_rows(metrics_list)  # Combine results into a single data frame

6.11 Data preparation and harmonization

LiDAR-derived plot metrics are harmonised with field-based biomass measurements to create a single dataset. Plot geometries are removed, identifiers standardised, and variables converted to numeric format for consistency. The metrics are joined with biomass values using plot identifiers, excluding records with missing or invalid targets. This prepares a clean, aligned dataset for machine learning analyses.

# ============================================================
# 1. Target Variable Extraction from Plot Data
# ============================================================
targets <- st_drop_geometry(plots) %>% 
  dplyr::select(plot_id, AGB_m3_ha)  # Extract plot-level AGB targets

# ============================================================
# 2. LiDAR Metric Table Cleaning
# ============================================================
drid_metric_plots <- drid_metric_plots %>%
  mutate(plot_id = as.integer(unlist(plot_id))) %>%
  mutate(across(-plot_id, ~as.numeric(.)))  # Ensure numeric LiDAR metrics

# ============================================================
# 3. Target Data Type Harmonization
# ============================================================
targets <- targets %>%
  mutate(plot_id = as.integer(plot_id),
         AGB_m3_ha = as.numeric(AGB_m3_ha))  # Align target data types

# ============================================================
# 4. Merge LiDAR Metrics with Field Targets
# ============================================================
dat_ml <- drid_metric_plots %>%
  left_join(targets, by = "plot_id") %>%
  filter(is.finite(AGB_m3_ha))  # Join metrics and retain valid targets

# ============================================================
# 5. Data Sufficiency Check for Modelling
# ============================================================
if (nrow(dat_ml) < 5) stop("Too few plots with metrics/target to train a model.")  # Minimum sample check

6.12 Modeling: data splitting, training, and model selection

In this step, LiDAR-derived plot metrics are used as predictors to model plot-level aboveground biomass (AGB). The dataset is first restricted to numeric predictors (required by XGBoost), then cleaned of missing or non-finite values, and finally split into 70% training and 30% test sets to evaluate model performance on independent data. Two widely adopted tree-based regression methods are trained: Random Forest (RF) using the ranger package and Gradient Boosting (XGBoost) using the xgboost package.e. 1. RF is selected because it is robust to noise, captures non-linear relationships, handles many correlated predictors well, and provides stable performance with limited tuning. 2. XGBoost is included because boosting can improve predictive accuracy by sequentially correcting errors, often achieving strong results in remote sensing and ecological regression, while early stopping helps reduce overfitting. Predictions are generated for both the training and test sets, and performance is visually summarised to compare generalization behavior. The best-performing model (primarily based on test performance) is selected and carried forward for the next steps of the workflow.

# ============================================================
# 1. Reproducibility
# ============================================================
set.seed(42)  # Ensure reproducible results

# ============================================================
# 2. Target and Predictor Definition
# ============================================================
target <- "AGB_m3_ha"
features <- setdiff(names(dat_ml), c("plot_id", target))  # Exclude ID and target

# ============================================================
# 3. Numeric Predictor Filtering
# ============================================================
is_num <- sapply(dat_ml[, features, drop = FALSE], is.numeric)
features <- features[is_num]  # Keep only numeric predictors

# ============================================================
# 4. Dataset Assembly and Cleaning
# ============================================================
dat_use <- dat_ml[, c("plot_id", target, features)]
dat_use <- dat_use %>% filter(is.finite(.data[[target]]))
dat_use <- na.omit(dat_use)  # Remove missing and invalid values

# ============================================================
# 5. Train–Test Split Check
# ============================================================
n <- nrow(dat_use)
if (n < 4) stop("Not enough plots for 70/30 split. Need at least 4 rows.")  # Minimum sample size

# ============================================================
# 6. 70/30 Train–Test Split
# ============================================================
idx <- sample(seq_len(n))
n_train <- max(2, floor(0.70 * n))
train_id <- idx[1:n_train]
test_id  <- setdiff(idx, train_id)  # Random partitioning

# ============================================================
# 7. Training and Testing Sets
# ============================================================
train <- dat_use[train_id, ]
test  <- dat_use[test_id, ]  # Create train and test datasets

# ============================================================
# 8. Random Forest Model (ranger)
# ============================================================
rf_fit <- ranger(
  dependent.variable.name = target,
  data = train[, c(target, features)],
  num.trees = 600,
  mtry = max(1, floor(sqrt(length(features)))),
  importance = "impurity",
  seed = 42
)  # Train Random Forest regression model

# ============================================================
# 9. Random Forest Predictions
# ============================================================
rf_pred_train <- predict(rf_fit, data = train[, features])$predictions
rf_pred_test  <- predict(rf_fit, data = test[, features])$predictions  # Predict AGB

# ============================================================
# 10. Random Forest Prediction Table
# ============================================================
df_rf <- bind_rows(
  data.frame(part = "train", obs = train[[target]], pred = rf_pred_train),
  data.frame(part = "test",  obs = test[[target]],  pred = rf_pred_test)
)  # Store RF observations and predictions

# ============================================================
# 11. XGBoost Data Preparation
# ============================================================
x_train <- as.matrix(train[, features])
y_train <- train[[target]]
x_test  <- as.matrix(test[, features])
y_test  <- test[[target]]  # Convert data for XGBoost

# ============================================================
# 12. XGBoost DMatrix Creation
# ============================================================
dtrain <- xgb.DMatrix(x_train, label = y_train)
dtest  <- xgb.DMatrix(x_test,  label = y_test)  # Optimized data containers

# ============================================================
# 13. XGBoost Hyperparameters
# ============================================================
params <- list(
  objective = "reg:squarederror",
  eval_metric = "rmse",
  eta = 0.05,
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.8
)  # Define model settings

# ============================================================
# 14. XGBoost Training with Early Stopping
# ============================================================
xgb_fit <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 1500,
  watchlist = list(train = dtrain, test = dtest),
  verbose = 0,
  early_stopping_rounds = 40
)  # Train XGBoost model with validation control

# ============================================================
# 15. XGBoost Predictions
# ============================================================
xgb_pred_train <- predict(xgb_fit, dtrain)
xgb_pred_test  <- predict(xgb_fit, dtest)  # Predict AGB

# ============================================================
# 16. XGBoost Prediction Table
# ============================================================
df_xgb <- bind_rows(
  data.frame(part = "train", obs = y_train, pred = xgb_pred_train),
  data.frame(part = "test",  obs = y_test,  pred = xgb_pred_test)
)  # Store XGB observations and predictions

# ============================================================
# 17. Model Comparison and Visualization
# ============================================================
preds_list <- list(
  RF = df_rf,
  XGB = df_xgb
)
out <- plot_all_models(preds_list, ncol_models = 1)  # Compare RF and XGB performance

6.13 Accuracy assessment

Model performance is evaluated by comparing predicted and observed plot-level aboveground biomass (AGB) for both training and independent test datasets. Accuracy is quantified using complementary metrics that describe goodness of fit, prediction error magnitude, and systematic bias. Specifically, we report the coefficient of determination (R²; Eq1), Root Mean Square Error (RMSE;Eq2-3), bias (Eq3) and Mean Absolute Error (MAE;Eq5-6), along with their relative forms normalized by the mean observed AGB. Using both absolute and relative metrics provides a robust assessment of model accuracy and generalization performance, supporting objective comparison between the Random Forest and XGBoost models.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad \text{(Eq1)} \]

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad \text{(Eq2)} \]

\[ RMSE\% = \frac{RMSE}{\bar{y}} \times 100 \qquad \text{(Eq3)} \]

\[ MD = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) \qquad \text{(Eq4)} \]

\[ MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \qquad \text{(Eq5)} \]

\[ MAE\% = \frac{MAE}{\bar{y}} \times 100 \qquad \text{(Eq6)} \]

print(out$metrics)

##   model  part        r2      rmse    rmseR        bias       biasR      mae
## 1    RF train 0.9287721  34.28933 22.42212   5.1396661  3.36087706 28.34703
## 2    RF  test 0.8688067  84.65584 49.92148 -14.2515692 -8.40413806 74.17449
## 3   XGB train 0.9279838  60.06200 39.27512   0.1183597  0.07739652 46.15307
## 4   XGB  test 0.8405684 102.67372 60.54661 -16.1782197 -9.54028219 93.23584
##       maeR
## 1 18.53640
## 2 43.74063
## 3 30.17994
## 4 54.98110

6.14 Wall-to-wall LiDAR metrics

In this step, LiDAR-derived metrics are computed across the entire study area using a regular grid. The same 69 structural and intensity metrics used in the plot-level modelling stage are calculated at the pixel level to ensure consistency between training and prediction. Metric extraction uses the pixel_metrics() function from the lidR package, which summarises the normalised point cloud within each grid cell. The resulting raster layers represent spatially continuous maps of canopy structure and serve as predictor variables for wall-to-wall biomass estimation.

metrics_r <- pixel_metrics(nlas, ~f_metrics(X,Y,Z,Intensity), res=23,
  progress = FALSE)

6.15 Final model selection

Model selection is based on an objective comparison of the evaluated algorithms, using relative RMSE (RMSE%) computed on the independent test dataset. The model with the lowest RMSE% and the most stable generalisation performance is selected as the best-performing model. This selected model is then used to predict aboveground biomass across the full study area.

# ============================================================
# 1. Extract Model Performance Metrics
# ============================================================
m <- out$metrics  # Retrieve evaluation metrics from model comparison

# ============================================================
# 2. Test RMSE (%) for Each Model
# ============================================================
rf_rmseR  <- m$rmseR[m$model == "RF"  & m$part == "test"][1]
xgb_rmseR <- m$rmseR[m$model == "XGB" & m$part == "test"][1]  # Extract test RMSE for RF and XGB

# ============================================================
# 3. Best Model Selection Based on Test Performance
# ============================================================
best <- if (is.finite(xgb_rmseR) && xgb_rmseR < rf_rmseR) "xgb" else "rf"
# ============================================================
# 4. Final Model Training Using All Available Plots
# ============================================================
if (best == "rf") {
  final_model <- ranger(
    dependent.variable.name = target,
    data = dat_ml[, c(target, features)],
    num.trees = 600,
    mtry = max(1, floor(sqrt(length(features)))),
    seed = 42
  )  # Train final Random Forest model on full dataset
} else {
  d_all <- xgb.DMatrix(as.matrix(dat_ml[, features]), label = dat_ml[[target]])
  final_model <- xgb.train(
    params = params,
    data = d_all,
    nrounds = xgb_fit$best_iteration
  )  # Train final XGBoost model on full dataset
}

6.16 Generation of aboveground biomass (AGB) map

The aboveground biomass (AGB) map is generated by applying the selected model to grid-level LiDAR metrics. Biomass estimates are produced at the same spatial resolution as the field plots (23 m × 23 m) to maintain consistency between observations and predictions. Final AGB values are expressed in m³/ha or Mg/ha, resulting in a spatially explicit map of forest biomass across the study area.

6.17 Export outputs

This step exports the main workshop deliverables to a single output folder: the final trained model (.rds), the accuracy table (.csv), the observed–predicted scatter figure (.png), and the raster products (DTM, DSM, CHM, and the predicted AGB map as GeoTIFFs) for further analysis and sharing.

out_dir <- file.path(dirname(las_path), "ForestSAT2026_04_Exercise")
dir.create(out_dir, showWarnings=FALSE)
saveRDS(final_model, file.path(out_dir, "final_model.rds"))
write.csv(out$metrics, file.path(out_dir, "model_accuracy_metrics.csv"), row.names = FALSE)
ggsave(file.path(out_dir, "scatter_models_obs_pred.png"), plot = out$plot, width = 150, height = 150, unit="mm",dpi = 300)
writeRaster(DTM, file.path(out_dir,"DTM_0p5m.tif"), overwrite=TRUE)
writeRaster(DSM, file.path(out_dir,"DSM_0p5m.tif"), overwrite=TRUE)
writeRaster(CHM, file.path(out_dir,"CHM_0p5m.tif"), overwrite=TRUE)
writeRaster(pred_agb_rf, file.path(out_dir,"AGB_23m.tif"), overwrite=TRUE)

6.18 Reference

Alvites, C., Marchetti, M., Lasserre, B., & Santopuoli, G. (2022). LiDAR as a tool for assessing timber assortments: A systematic literature review. Remote Sensing, 14(18), 4466.
Alvites, C., Santopuoli, G., Maesano, M., Chirici, G., Moresi, F. V., Tognetti, R., & Lasserre, B. (2021). Unsupervised algorithms to detect single trees in a mixed-species and multilayered Mediterranean forest using LiDAR data. Canadian Journal of Forest Research, 51(12), 1766–1780.
Alvites, C., O’Sullivan, H., Francini, S., Marchetti, M., Santopuoli, G., Chirici, G., & Bazzato, E. (2025). Canopy height Mapper: A Google Earth Engine application for predicting global canopy heights combining GEDI with multi-source data. Environmental Modelling & Software, 183, 106268.
Silva, C. A., Crookston, N. L., Hudak, A. T., Vierling, L. A., Klauberg, C., & Silva, M. C. A. (2017). rLiDAR: An R package for LiDAR data processing. CRAN.
Roussel, J.-R., Auty, D., Coops, N. C., Tompalski, P., Goodbody, T. R. H., Meador, A. S., & Achim, A. (2020). lidR: An R package for analysis of Airborne Laser Scanning (ALS) data. Remote Sensing of Environment, 251, 112061.
Hijmans, R. J., Van Etten, J., Cheng, J., Mattiuzzi, M., Sumner, M., & Greenberg, J. A. (2015). raster: Geographic data analysis and modeling. CRAN.
Hijmans, R. J., Bivand, R., Forner, K., Ooms, J., Pebesma, E., & Sumner, M. D. (2022). terra: Spatial data analysis. CRAN.
Wright, M. N., Wager, S., Probst, P., & Wright, M. M. N. (2019). ranger: A fast implementation of Random Forests. CRAN.
Chen, T., He, T., Benesty, M., & Khotilovich, V. (2019). xgboost: Extreme Gradient Boosting. CRAN.
Zhang, W., Qi, J., Wan, P., Wang, H., Xie, D., Wang, X., & Yan, G. (2016). An easy-to-use airborne LiDAR data filtering method based on cloth simulation. Remote sensing, 8(6), 501.

7 Key takeaways

Combining vertical and horizontal LiDAR metrics improves AGB modelling.
Metrics on canopy height, structure, and variability offer a more complete forest biomass picture than any single LiDAR metric.
Most AGB mapping artefacts originate from the predictor layers, not the model itself.
Edge effects and hotspots often stem from errors in LiDAR raster metrics, emphasizing the need for visual inspection and quality control before interpreting AGB maps.
Train–test splits and cross-validation provide complementary insights into model performance.
A train–test split assesses model generalisation to independent plots, while cross-validation offers a more reliable performance estimate across the dataset, especially with small samples.

Exercise 04 – AGB Modelling and Mapping Using Random Forest and XGBoost

Cesar Alvites - Forest Biometrics, Remote Sensing and Artificial Intelligence Lab (SilvaLab). School of Forest, Fisheries and Geomatics Science, University of Florida, PO Box 110410 Gainesville, FL, USA 32611

2026-02-09