Please use this identifier to cite or link to this resource: http://dx.doi.org/10.25673/121194
Title: Overfitting due to data leakage in soil sensor calibration: examples from lab-based and in-situ soil NIR spectroscopy
Author(s): Correa, José
Tavakoli, Hamed
Vogel, Sebastian
Gebbers, Robin
Publication date: 2025
Type: Article
Language: English
Abstract: Sensor-based soil analysis methods, particularly optical spectroscopy, have gained ground as efficient alternatives to labor-intensive laboratory analyses for assessing soil properties. In-situ measurements with mobile sensors in particular streamline data collection, reducing both time and costs. This approach hinges on correlating sensor signals with laboratory-derived soil physico-chemical properties using mathematical calibration models. The models are trained and their parameters fine-tuned using a training dataset. It is best practice to evaluate the performance of calibration models with a test dataset that is independent of the training dataset. However, certain commonly applied data preprocessing procedures can unintentionally introduce unwanted dependencies between the training and the test dataset. This is called data leakage. A model trained on such datasets will perform very well on both the training and test sets, but it will perform much worse when tested on a truly independent dataset. Thus, the calibration model overfits via data leakage. In this study, we illustrate the consequences of data leakage caused by two common preprocessing procedures in soil sensing, namely principal component analysis (PCA) and spatial interpolation through ordinary kriging, on the prediction of soil properties by near-infrared (NIR) spectroscopy. The NIR spectra were obtained in the laboratory and in the field. Laboratory measurements by standard wet-chemistry methods of soil pH value, total organic carbon (TOC) and total nitrogen (TN) content of 159 soil samples were used as target variables. Based on the results of this study, PCA and spatial interpolation led to data leakage when executed before data splitting. To avoid data leakage, we encourage researchers to carefully design leak-free data processing pipelines. These pipelines should encapsulate preprocessing methods, model fitting, and (if needed) spatial interpolation, ensuring that training and test sets are completely independent.
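The ordering the abstract recommends (split first, then fit every data-dependent step on the training set only) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic random numbers, the 120/39 random split, the 50 bands, the 5 components, the ordinary least-squares model, and the helper names `fit_pca` and `apply_pca` are all assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Estimate the PCA mean and loadings from the training spectra only."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal axes of the centered training data.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project spectra onto a mean and loadings that were fitted elsewhere."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(159, 50))  # synthetic stand-in for 159 spectra, 50 bands
y = rng.normal(size=159)        # synthetic stand-in for a target such as TOC

# 1. Split first, before any data-dependent preprocessing.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:120], idx[120:]

# 2. Fit PCA on the training rows only (the leaky variant would use all of X).
mean, components = fit_pca(X[train_idx], n_components=5)

# 3. Apply the frozen transform to both sets.
scores_train = apply_pca(X[train_idx], mean, components)
scores_test = apply_pca(X[test_idx], mean, components)

# 4. Fit the calibration model (here plain least squares) on training scores.
A_train = np.column_stack([np.ones(len(train_idx)), scores_train])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# 5. Evaluate on test scores that never influenced steps 2 and 4.
A_test = np.column_stack([np.ones(len(test_idx)), scores_test])
y_pred = A_test @ coef
rmse_test = float(np.sqrt(np.mean((y_pred - y[test_idx]) ** 2)))
```

The point of the sketch is only the order of operations: the test rows enter the computation at steps 3 and 5, after the PCA loadings and regression coefficients are already fixed. A random split is used here for brevity; for spatially interpolated in-situ data a different splitting scheme may be needed.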
URI: https://opendata.uni-halle.de//handle/1981185920/123147
http://dx.doi.org/10.25673/121194
Open access: Open-access publication
License: (CC BY 4.0) Creative Commons Attribution 4.0 International
Journal title: Computers and electronics in agriculture
Publisher: Elsevier Science
Place of publication: Amsterdam [et al.]
Volume: 239
Issue: Part A
Original publication: 10.1016/j.compag.2025.110920
Start page: 1
End page: 20
Appears in collections: Open Access Publikationen der MLU

Files in this resource:
File: 1-s2.0-S0168169925010269-main.pdf
Size: 12.99 MB
Format: Adobe PDF