SYnthetic Samples GENerator (SYSGEN): an approach to increase the size of incidence samples in coffee leaf rust modelling

publication date

  • August 2021

start page

  • 1

end page

  • 12

International Standard Serial Number (ISSN)

  • 1868-6478

Electronic International Standard Serial Number (EISSN)

  • 1868-6486


  • Rust is regarded as a major problem for coffee farmers. Several rust outbreaks have occurred in Latin American countries such as Colombia, Mexico, Peru, Ecuador and El Salvador. Because of the damage caused by coffee rust, several regression models have been proposed to estimate rust incidence from weather variables. However, these models lack real rust samples because collecting samples is costly in both money and time. To address this issue, we propose in this paper a mechanism called SYnthetic Samples GENerator (SYSGEN). The proposal uses cubic spline interpolation to increase the size of rust incidence samples (RIS) and expert knowledge to adjust the rust progress curve in Colombian coffee crops. To demonstrate the reliability of SYSGEN, we built 132 regression models from synthetic incidence samples (dependent variable) and weather observations (independent variables). To do this, we considered three Colombian coffee regions, five experiments and four regression models. In addition, we used Recursive Feature Elimination (RFE) to select the relevant weather variables. The analysis of these models and of the RFE results is promising, since it reveals several aspects and effects related to rust development. One of these is that the regression models frequently used the temperature (maximum, minimum and average) and relative humidity variables. These meteorological variables are considered by experts to be key drivers of the germination, penetration, colonization and sporulation phases. In terms of performance, our experiments allow us to conclude that random forest (RF) and bagging trees (BT) reached the lowest Root Mean Square Error (RMSE). Finally, different datasets produce different performance. For example, in the experiments that involve flowering-period datasets, RF reached the lowest RMSE, whereas in the coffee harvest-period datasets, BT reached the lowest RMSE.
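The interpolation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the spline boundary conditions, so a natural cubic spline is assumed here, and the function names `natural_cubic_spline` and `densify`, the sampling days and the incidence values are all hypothetical. Clipping to the 0 to 100 % range is only a plausibility guard and does not reproduce the expert-knowledge adjustment of the rust progress curve.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    xs, ys = list(xs), list(ys)
    n = len(xs) - 1  # number of segments
    if n < 2 or len(ys) != len(xs):
        raise ValueError("need at least three points and equal-length inputs")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")

    # Tridiagonal system for the interior second derivatives m[1..n-1].
    # Assumption: natural boundary conditions, i.e. m[0] = m[n] = 0.
    diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
    rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1] - (ys[k + 1] - ys[k]) / h[k])
           for k in range(n - 1)]
    for k in range(1, n - 1):  # forward elimination (Thomas algorithm)
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]
    m = [0.0] * (n + 1)
    for k in range(n - 2, -1, -1):  # back substitution
        m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(t):
        # Locate the segment containing t, clamped to the first/last segment.
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        a, b = xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def densify(days, incidence, step=1):
    """Generate synthetic (day, incidence %) samples between sparse field samples."""
    spline = natural_cubic_spline(days, incidence)
    grid = range(days[0], days[-1] + 1, step)
    # Hypothetical guard: keep incidence inside the valid percentage range.
    return [(d, min(100.0, max(0.0, spline(d)))) for d in grid]


# Hypothetical monthly field samples (day of sampling, rust incidence in %).
field_days = [0, 30, 60, 90, 120]
field_incidence = [2.0, 5.0, 12.0, 20.0, 24.0]
daily_samples = densify(field_days, field_incidence)
print(len(daily_samples), "synthetic samples from", len(field_days), "field samples")
```

In practice a library routine such as `scipy.interpolate.CubicSpline` with `bc_type="natural"` would normally replace the hand-written solver; the pure-Python version is used here only to keep the sketch free of dependencies.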


  • Computer Science


  • expert knowledge; interpolation; regression models; feature selection