APPLYING MACHINE LEARNING FOR ANALYSIS AND FORECASTING OF AGRICULTURAL CROP YIELDS

Analysis and improvement of crop productivity is one of the most important areas in precision agriculture in the world, including Kazakhstan. In the context of Kazakhstan, agriculture plays a


Introduction
In the face of a growing global population and the escalating demand for food, traditional agricultural practices are facing unprecedented challenges. To meet this demand while ensuring sustainable resource utilization, the adoption of innovative technologies is imperative. Machine learning (ML), a subset of artificial intelligence, has emerged as a powerful tool capable of revolutionizing agriculture [1]. ML algorithms can analyze vast amounts of data to identify patterns and make predictions, offering valuable insights for optimizing crop production and resource management. One of the most critical aspects of agricultural decision-making is crop yield prediction, which influences resource allocation, market strategies, and food security. Accurate yield predictions enable farmers to make informed decisions regarding planting schedules, irrigation requirements, and fertilizer application, leading to increased productivity and reduced environmental impact [2][3][4]. ML-powered crop prediction models can harness the power of data to process vast amounts of agricultural information, including historical yield records, weather patterns, soil characteristics, and satellite imagery, to identify complex relationships between these factors and crop yield. By analyzing this data, ML models can accurately predict crop yields, providing farmers with valuable decision-making tools. While ML holds immense promise for agriculture, its adoption faces challenges primarily related to data quality and availability. Acquiring and maintaining high-quality, consistent agricultural data is crucial for training and validating ML models [5].
ML algorithms are adept at processing and analyzing vast amounts of agricultural data, including historical yield records, weather patterns, soil characteristics, and satellite imagery. By extracting valuable insights from this data, ML models can identify complex relationships and patterns that influence crop yield. ML-powered crop prediction models form the foundation of precision agriculture initiatives. By providing precise predictions at the field or even sub-field level, these models enable farmers to optimize resource allocation, such as water, fertilizers, and pesticides, based on specific crop requirements, soil conditions, and environmental factors [6][7].

Literature review
The field of agricultural analysis, particularly in relation to weather conditions, has been extensively studied, with numerous researchers exploring the intersection of climatology, agriculture, and data science. The following literature review highlights key scholarly contributions that provide a foundation and context for this research.
In [8], the authors' comprehensive review delves into the effects of climate change on crop yields. Their study synthesizes findings from various global research efforts, emphasizing the complex relationship between changing weather patterns and agricultural productivity. The review highlights that temperature fluctuations and altered precipitation regimes significantly affect crop growth cycles and yields.
Reference [9] explores the application of machine learning techniques in predicting agricultural yields. The authors demonstrated the efficacy of models such as Random Forest and Support Vector Machines in forecasting crop productivity based on weather data. Their findings suggest a high correlation between weather variables and yield outcomes, underscoring the potential of machine learning in agricultural planning.
Reference [10] focuses on the role of data analytics in developing climate-resilient farming practices. The authors discuss how big data and predictive analytics can empower farmers to make informed decisions, particularly in the context of adapting to climate variability. The paper also addresses the challenges in integrating diverse data sources for effective analysis.
Reference [11] underscores the versatility of ML applications in smart farming, extending beyond crop yield prediction to encompass various facets of agriculture. Notably, the authors discuss the integration of ML in livestock management, water conservation strategies, soil health assessment, and crop management. This holistic approach emphasizes the potential of ML to address multiple challenges in agriculture, leading to more efficient and sustainable practices. The significance of accurate crop yield prediction is highlighted as a cornerstone for informed decision-making in the agricultural sector. By leveraging ML algorithms, farmers gain insights into optimal planting times, irrigation needs, and fertilizer usage, ultimately maximizing productivity while minimizing environmental impact. The authors argue that the adoption of ML crop prediction models represents a paradigm shift in agriculture, offering a data-driven approach to precision farming.
Reference [12] presents case studies on technology adoption in agriculture under the challenges posed by climate change. The authors highlight how advancements in remote sensing and information technologies are revolutionizing farming practices. The paper argues for a more integrated approach that combines traditional knowledge with modern technology to enhance agricultural sustainability.
These studies collectively provide a comprehensive overview of the current state of research at the intersection of climatology, agriculture, and data science. They underscore the importance and potential of using advanced data analytics and machine learning techniques in understanding and predicting agricultural outcomes, which is central to our project.
The intersection of agro-technology and machine learning has gained significance due to advancements in data methodologies and high-performance computing. Machine learning classifiers have emerged as vital tools in crop prediction, playing a crucial role in agriculture. The authors of reference [13] propose an advanced stacking ensemble learning approach for crop prediction, addressing the need for accurate predictions within a short processing time. Ensemble learning involves combining multiple machine learning models to improve prediction accuracy. Stacking, in particular, is a method where the predictions of several base models are used as input for a meta-model, which then produces the final prediction. By utilizing this technique, Sethy et al. aim to achieve higher accuracy in crop prediction than individual models while keeping processing time low. Reference [14] discusses the rise of agro-technology and machine learning, facilitated by significant advancements in data methodologies and computing capabilities. In their discussion, Zhai and colleagues likely explore how advancements in data methodologies, such as data collection, preprocessing, and analysis techniques, have paved the way for leveraging machine learning in agriculture. Moreover, they may address the role of improved computing capabilities, including faster processors, parallel computing architectures, and cloud computing infrastructure, in handling large agricultural datasets and executing complex machine learning algorithms efficiently. The authors of studies [15], [16], [17] underscore the increasing importance of machine learning classifiers in crop prediction, highlighting their role as a crucial component in modern agriculture. By leveraging machine learning techniques within the context of agriculture, the proposed approach aims to enhance crop prediction accuracy and efficiency, contributing to advancements in the field of agro-technology.
References [18] and [19] utilize decision tree analysis to identify the key factors influencing changes in crop yield and to predict crop yield outcomes. Their approaches contribute valuable insights to agricultural research and help stakeholders optimize agricultural practices and maximize crop productivity. By combining random forest with wheat yield data, meteorological variables, and satellite images, [20] aimed to develop a robust and accurate predictive model for wheat yield in southeastern Australia. This approach allows for comprehensive analysis and prediction of crop yields, enabling better agricultural planning and decision-making in the region. In their studies, the authors emphasized the effectiveness of the random forest method in comparison with SVM and linear regression methods for China's main rapeseed-producing area. The authors of reference [21] also used the decision tree algorithm to evaluate crop yield by combining soil and climate data. By leveraging the ensemble nature of random forest, the authors likely achieve accurate and robust predictions of crop yield by integrating soil and climate data, thereby contributing valuable insights to agricultural research and decision support systems.
The studies reviewed above indicate that the forecast of agricultural yields largely depends on territorial conditions. In this regard, digitalization of precision agriculture in the North Kazakhstan region is a relevant area. While wheat is predominant, North Kazakhstan also cultivates other crops such as barley, oats, and potatoes. In North Kazakhstan, where crop productivity may be influenced by complex interactions between weather patterns, soil properties, and agricultural practices, the ability of tree-based methods to handle nonlinear relationships is crucial for developing accurate prediction models. The interpretability of these methods is particularly valuable for agricultural stakeholders and policymakers in North Kazakhstan, as it allows them to gain insights into the factors driving yield variations and make informed decisions based on the model's outputs. Many studies focus on specific crops grown in Central Asia, such as wheat, cotton, and barley. These crop-specific analyses allow for tailored approaches to yield prediction based on the unique characteristics and requirements of each crop.
To summarize, this study selected the decision tree, random forest, and linear regression methods as the most effective for analyzing potato productivity in the North Kazakhstan region.

Purpose and Objectives of Research
The purpose of this study is to analyze and forecast the yield of agricultural crops in the North Kazakhstan region for 1990-2023.
The object of the study is the time series of agroclimatic data for the North Kazakhstan region over the past 33 years.
To achieve this goal, the following tasks were set:
- prepare an appropriate array of agroclimatic data for the North Kazakhstan region for 1990-2023;
- plot linear trends of the main factors influencing crop yields;
- use machine learning algorithms to forecast yield for the most recent years;
- perform a comparative analysis for different test and training data splits;
- draw conclusions based on the applied methods.

Materials and Research Methods
Dataset Description: The data are stored in multiple CSV files, which we process primarily with the Pandas library. The largest file, our training data, exceeds 100 MB, but our computing power handles it without requiring special techniques. The training file has 12346 rows and 45 columns (datetime, feelslikemin, feelslikemax, tempmax, tempmin, humidity, and more), with various data types such as numerical, categorical, and datetime values. It includes foreign keys pointing to other CSV files, like one containing metadata about the stores.
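As a minimal sketch of the loading step, the snippet below reads a weather CSV with Pandas and parses the datetime column. The three-row sample and the helper name `load_weather` are hypothetical and stand in for the real training file, which is not reproduced here; only the column names are taken from the description above.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the column names listed above.
SAMPLE_CSV = """datetime,tempmax,tempmin,feelslikemax,feelslikemin,humidity
1990-01-01,-10.2,-18.5,-15.0,-25.1,82.0
1990-01-02,-8.4,-16.0,-13.2,-22.7,79.5
1990-01-03,-12.1,-20.3,-18.9,-28.4,85.3
"""


def load_weather(source):
    """Read a weather CSV and parse the datetime column into timestamps."""
    return pd.read_csv(source, parse_dates=["datetime"])


# A file path can be passed instead of the in-memory buffer.
weather = load_weather(io.StringIO(SAMPLE_CSV))
n_rows, n_cols = weather.shape
print(weather.dtypes)
```

Inspecting `weather.dtypes` right after loading is a quick way to confirm that numerical and datetime columns were parsed as intended.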
Data Preprocessing: Rigorous data cleaning procedures were executed to handle missing values, ensuring the dataset's integrity for meaningful analysis. Feature scaling techniques were applied to normalize the data, a crucial step for the effectiveness of machine learning models in time series forecasting.
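The text does not state which imputation or scaling method was used, so the following is only an illustrative sketch. It assumes linear interpolation for missing values and scikit-learn's `StandardScaler` for feature scaling; the small data frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with gaps in two weather columns.
raw = pd.DataFrame({
    "tempmax": [21.0, np.nan, 25.0, 23.0],
    "humidity": [60.0, 64.0, np.nan, 70.0],
})

# Assumption: fill each gap by linear interpolation between its neighbors.
cleaned = raw.interpolate(method="linear", limit_direction="both")

# Assumption: standardize each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(cleaned), columns=cleaned.columns)
```

In a forecasting setting, the scaler would normally be fitted on the training portion only and then applied to the test portion, so that no information from the test period leaks into training.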
Modeling and evaluation: Multivariate time series models were chosen to exploit the internal dependencies between different factors. The correlation matrix was used to identify the strength of the relationships between factors, thereby increasing the accuracy of prediction. In addition, we used a linear regression method on the pooled data set for univariate time series forecasting. The coefficient of determination (R²) was chosen as the evaluation metric for the Linear Regression model.
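To make the evaluation metric concrete, the sketch below fits a linear regression on synthetic data and computes R² both from its definition, 1 - SS_res / SS_tot, and with scikit-learn's `r2_score`. The data and coefficients are invented for illustration and are not the study's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: two predictors and a noisy linear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=40)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 from its definition: one minus residual over total sum of squares.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same quantity computed by scikit-learn.
r2_sklearn = r2_score(y, y_pred)
```

An R² close to 1 means the model explains most of the variance in the target, while a negative value means it performs worse than always predicting the mean.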
Time Series Analysis: Time series analysis was used to analyze the long-term source data. It employs various statistical and mathematical models to make predictions or forecasts based on historical data. Common techniques include moving averages, autoregressive integrated moving average (ARIMA) models, exponential smoothing methods, and more advanced machine learning methods, including recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks [22].
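Two of the simpler techniques named above, the moving average and exponential smoothing, can be sketched with Pandas as follows. The annual values, the window of 3, and the smoothing factor of 0.5 are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical annual values indexed by year.
annual = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
    index=range(1990, 1996),
)

# Three-year moving average; the first two years have no full window.
moving_avg = annual.rolling(window=3).mean()

# Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
exp_smooth = annual.ewm(alpha=0.5, adjust=False).mean()
```

A larger window or a smaller `alpha` gives a smoother curve that reacts more slowly to year-to-year changes.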
Decision Tree Regressor: A Decision Tree Regressor is a machine learning algorithm used for regression tasks. It belongs to the family of decision tree algorithms, which are commonly used for both classification and regression problems. While decision trees for classification partition the data into discrete classes, decision trees for regression predict continuous values. The algorithm starts with the entire dataset and selects the best feature to split the data based on some criterion (commonly mean squared error or variance reduction). It repeats this process recursively for each resulting subset until a stopping criterion is met, such as reaching a maximum tree depth or having too few samples in a node. Once the tree is constructed, a prediction for a new instance is made by traversing the tree from the root node to a leaf node. At each node, the decision rule is applied to the feature value of the instance until a leaf node is reached, which contains the predicted value. In decision tree regression:
- Features of the input data are used to make decisions at each node of the tree.
- The tree is recursively partitioned based on these features until some stopping criteria are met.
- At the leaf nodes of the tree, the model predicts the continuous output based on the features of the input data that lead to that leaf node.
Figure 1 illustrates the implementation of the Decision Tree Regressor algorithm in Python. The following hyperparameters are commonly used with the Random Forest and Decision Tree algorithms [23]:
n_estimators: This parameter, which applies to the random forest only, determines the number of decision trees in the model. Increasing the number of trees generally improves the performance of the model, but it also increases computational complexity.
max_depth: This parameter sets the maximum depth of each decision tree. A deeper tree can capture more complex patterns in the data, but it can also lead to overfitting. Setting an appropriate max_depth is crucial for balancing model complexity and generalization.
max_features: This parameter determines the maximum number of features considered for splitting a node in each decision tree. Random forests typically consider a random subset of features at each split, which helps to reduce overfitting and increase model diversity.
random_state: This parameter in scikit-learn's RandomForestRegressor or RandomForestClassifier (and in many other scikit-learn models) is used to ensure reproducibility of the results. In machine learning models, randomness may be introduced during certain operations such as bootstrapping, feature sampling, or initialization of weights; fixing random_state makes these operations repeatable from run to run.
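The hyperparameters listed above can be illustrated with a short scikit-learn sketch. The synthetic data, the chosen parameter values, and the helper `make_forest` are assumptions made for this example and are not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with a nonlinear dependence on the first feature.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# A single tree limited to depth 4.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)


def make_forest(seed):
    """Build and fit a forest with illustrative (hypothetical) settings."""
    forest = RandomForestRegressor(
        n_estimators=50,   # number of trees in the forest
        max_depth=4,       # depth limit for every tree
        max_features=2,    # features considered at each split
        random_state=seed, # fixes bootstrapping and feature sampling
    )
    return forest.fit(X, y)


forest_a = make_forest(0)
forest_b = make_forest(0)

# Two forests built with the same seed give the same predictions.
same_seed_identical = np.allclose(forest_a.predict(X), forest_b.predict(X))
```

Changing the seed passed to `make_forest` generally yields slightly different trees, which is why a fixed `random_state` is useful when results need to be compared across runs.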
Random Forest Regression: Random Forests train a collection of decision trees, where each tree is trained on a random subset of the dataset. Random Forest is considered a meta-estimator because it aggregates the predictions of multiple base estimators (individual decision trees) to make a final prediction. It doesn't directly learn relationships from the data like traditional estimators, but rather combines the outputs of simpler models. The trees are typically trained independently of each other, each on a different subset of the original dataset. This process, known as bootstrapping, involves sampling the dataset with replacement to create multiple subsets. Instead of relying on the prediction of a single decision tree, Random Forest combines the predictions of all the trees in the forest. For classification tasks, it uses majority voting, while for regression tasks, it typically takes the average of the predictions. By training each decision tree on a random subset of the data and considering only a random subset of features at each split, Random Forest introduces randomness into the learning process, which helps prevent overfitting. Random Forests are widely used and highly effective for a variety of machine learning tasks, including classification and regression, due to their ability to handle high-dimensional data, mitigate overfitting, and provide robust predictions [24].
Support Vector Regression (SVR): SVR is a type of Support Vector Machine (SVM) algorithm used for regression analysis. While traditional SVMs are primarily used for classification problems, SVR extends the SVM framework to handle regression tasks. Instead of finding a hyperplane that best separates classes, SVR aims to find a hyperplane that best fits the data points within a specified margin of error. The hyperplane in SVMs represents the decision boundary that separates different classes in the feature space (Figure 3). It is defined by a set of parameters (weights and bias) learned during the training process. The margin is the distance between the hyperplane and the closest data points from each class. SVM aims to find the hyperplane that maximizes this margin, as it generalizes better to unseen data and improves the model's robustness [25].
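The two ideas described above, averaging over trees in a random forest and fitting within a margin of error in SVR, can be checked on a toy problem. This is a sketch on synthetic data; the kernel, `C`, and `epsilon` values are assumptions chosen for illustration, not the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Synthetic one-dimensional regression problem.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=80)

# Random forest regression: the forest output is the mean of its trees.
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)
per_tree = np.stack([t.predict(X) for t in forest.estimators_])
forest_is_mean_of_trees = np.allclose(per_tree.mean(axis=0), forest.predict(X))

# SVR: errors smaller than epsilon are not penalized during training.
epsilon = 0.1
svr = SVR(kernel="rbf", C=10.0, epsilon=epsilon).fit(X, y)
residuals = np.abs(y - svr.predict(X))
share_inside_tube = float(np.mean(residuals <= epsilon + 1e-6))
```

Here `share_inside_tube` is the fraction of training points whose error lies within the epsilon margin; a wider margin typically leaves fewer support vectors and gives a smoother fit.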
Figure 4 illustrates the main steps in solving the objectives of this study.

Results and Discussion
For the research, a database of agroclimatic data and data on the yield of the North Kazakhstan region for the period 1990-2023 was prepared (Fig. 1). This dataset consists of 12346 rows and 45 columns that describe agroclimatic conditions over the last 33 years in the North Kazakhstan region. Time series analysis of the main factors influencing crop yields was carried out. In addition, line plots were generated for mean, minimum, and maximum humidity over the years (Fig. 2). The top line (green) represents the maximum humidity for each year, showing less fluctuation and maintaining high values throughout the period. The middle line (blue) indicates the mean annual humidity, which shows a slight downward trend with some year-to-year variation. The bottom line (orange) shows the minimum annual humidity, which displays more pronounced fluctuations compared to the mean and maximum lines. The x-axis represents the years, while the y-axis represents the humidity level, which, although not explicitly labeled, is likely in percent given typical humidity measurements. From the plot, we can observe that while the maximum humidity remains relatively stable, the mean and minimum humidity levels exhibit more variability. Notably, the minimum humidity shows several dips, which could correspond to particularly dry periods or years. This visualization helps in understanding the overall humidity trends and can be particularly useful when analyzing the impact of varying humidity levels on agricultural productivity, climate patterns, or other environmental factors.
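The annual mean, minimum, and maximum series behind such a plot can be derived from daily records with a group-by on the year. The sketch below uses randomly generated daily humidity for three years as a hypothetical stand-in for the real data; the column names follow those mentioned in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical daily humidity records for three years.
dates = pd.date_range("1990-01-01", "1992-12-31", freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "datetime": dates,
    "humidity": rng.uniform(30.0, 100.0, size=len(dates)),
})

# Aggregate to one row per year with mean, minimum, and maximum humidity.
annual_humidity = (
    daily.groupby(daily["datetime"].dt.year)["humidity"]
    .agg(["mean", "min", "max"])
    .add_prefix("humidity_")
)
```

Each column of `annual_humidity` then corresponds to one line of the plot, with the year on the x-axis.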
Furthermore, we created line plots for precipitation and for mean, minimum, and maximum temperature across years (Fig. 3). The first plot shows significant spikes in maximum annual precipitation, suggesting years with extreme precipitation events. There is considerable variability from year to year, with some years experiencing very high maximum precipitation, while others remain closer to what may be the region's average. The most noticeable peaks occur around the early 1990s and the mid-2000s, which may be indicative of particularly heavy rainfall or snowfall events during those times.
The second plot displays three lines representing different statistical measures of temperature over the same period:
- The top line (green) represents the maximum mean temperature for each year, showing a slight fluctuation but generally maintaining higher values, which suggests warmer average conditions in the region.
- The middle line (blue) depicts the mean of the mean annual temperatures, which remains relatively stable over the period, indicating consistent average temperatures from year to year.
- The bottom line (orange) shows the minimum mean annual temperature, with noticeable variability and a slight overall increase in the latter years, which could suggest a warming trend.
Next, the yield of potatoes over the years was visualized using a line plot (Fig. 4). Overall, there is an upward trend in potato yields over the years, which may indicate improvements in agricultural practices, technological advancements, or favorable climatic conditions for potato farming in Kazakhstan. There are noticeable fluctuations within the trend, with some years showing sharp declines in yield. These could correspond to adverse weather events, pest outbreaks, or other agricultural challenges faced in those years. The most significant dips occur around the mid-1990s and early 2000s. These years could be of particular interest for further investigation to determine the causes of these declines. After the early 2000s, the trend is predominantly upward, suggesting a period of growth and possibly increased efficiency or resilience in potato production.
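An upward trend of this kind can be quantified by fitting a straight line to the yearly yields. The sketch below does this by least squares with NumPy; the yield values are hypothetical and serve only to show the calculation.

```python
import numpy as np

# Hypothetical annual potato yields (illustrative values only).
years = np.arange(1990, 2000)
yields = np.array([9.8, 10.4, 9.1, 11.0, 10.2, 11.9, 12.5, 11.7, 13.0, 13.6])

# Least-squares linear trend; years are shifted to start at 0 for stability.
t = years - years[0]
slope, intercept = np.polyfit(t, yields, deg=1)
trend = slope * t + intercept

# Deviations from the trend highlight unusually good or bad years.
detrended = yields - trend
worst_year = int(years[np.argmin(detrended)])
```

The sign of `slope` gives the direction of the long-term trend, and the years with the most negative `detrended` values are candidates for the sharp declines discussed above.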
Furthermore, we generated a heat map to visualize correlations between all variables in the merged dataset (Fig. 5). Values close to 1 or -1 indicate a strong positive or negative correlation, respectively, while values near 0 suggest no correlation. The crop yields (grains, potato, open_veg, melons, sugar) show strong positive correlations with each other, indicated by the deep red color. This suggests that when the yield of one crop type is high, others are likely to be high as well, which could be due to favorable weather conditions or effective agricultural practices across the board.
Crop yields generally have low to moderate positive correlations with temperature (temp_mean, temp_min, temp_max) and humidity (humidity_mean, humidity_min, humidity_max) factors. This can imply that certain levels of temperature and humidity are beneficial for the crops but do not necessarily increase yield linearly. Interestingly, there is a noticeable negative correlation between maximum temperature (temp_max) and the crop yields, suggesting that extremely high temperatures might negatively impact the yields. Maximum precipitation (precip_max) shows a moderate negative correlation with crop yields, hinting that excessive rainfall might be detrimental to the crops or could indicate flooding events.
The Decision Tree Regressor, Random Forest Regressor and Support Vector Machine Regressor machine learning algorithms were applied during the research (Figure 7).
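The correlation analysis described above rests on a Pearson correlation matrix of the merged yearly table. The sketch below builds such a matrix with Pandas on synthetic data in which the yield is constructed to fall as maximum temperature rises. The numbers are invented, and only the column names follow the text.

```python
import numpy as np
import pandas as pd

# Synthetic yearly table; the relationships are constructed, not measured.
rng = np.random.default_rng(7)
n_years = 34
temp_max = rng.normal(35.0, 2.0, n_years)
humidity_mean = rng.normal(70.0, 5.0, n_years)
potato = (
    150.0
    - 2.0 * temp_max
    + 0.5 * humidity_mean
    + rng.normal(0.0, 1.0, n_years)
)
merged = pd.DataFrame({
    "temp_max": temp_max,
    "humidity_mean": humidity_mean,
    "potato": potato,
})

# Pearson correlation between every pair of columns.
corr = merged.corr(method="pearson")

# Correlation of each weather factor with the yield, weakest first.
yield_corr = corr["potato"].drop("potato").sort_values()
```

The `corr` table is what a heat map colors cell by cell, so it can be passed directly to a plotting routine.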

Comparative analysis
The results of several iterations of the above algorithms for forecasting potato yields for 2024 in the North Kazakhstan region are presented below (Table 1). The Random Forest, Support Vector Machine, and Decision Tree algorithms were applied to predict potato yield in the study area for a given growing season. The Random Forest Regressor algorithm showed the best performance, with the highest R² = 0.97865.
The performance of the Decision Tree algorithm exhibits sensitivity to changes in the train/test size, particularly affecting classification metrics. Random Forest exhibits less sensitivity to changes in the test size compared to the Decision Tree. Regression metrics for Random Forest, including RMSE and R², demonstrate an improvement over the Decision Tree. The RMSE ranges from 0.25 to 0.46, and R² values are generally positive, indicating better predictive performance.
Overall Observations:
1. Best Overall Performance: Random Forest generally outperforms both Decision Tree and SVM in the regression task.
2. Train/Test Size Trade-off: The sensitivity to test size suggests a trade-off that impacts model performance, and the optimal split may vary between algorithms.
3. Regression Challenges: Challenges in predicting the target variable are evident, as indicated by consistently negative R² values. Further investigation into model complexity may be needed.
4. Fine-tuning Opportunities: Hyperparameter tuning and experimentation with different features could contribute to enhanced model performance. This analysis provides valuable insights for refining the models based on the specific characteristics of the dataset.
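A comparison of this kind can be organized as a loop over test sizes and models. The sketch below is an illustration on synthetic data with default or hypothetical hyperparameters. It does not reproduce the values in Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (
    X[:, 0]
    - 2.0 * X[:, 1]
    + 0.5 * X[:, 2] ** 2
    + rng.normal(scale=0.2, size=300)
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR": SVR(kernel="rbf"),
}

# Evaluate every model on several train/test splits.
results = []
for test_size in (0.2, 0.3, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
        r2 = float(r2_score(y_te, pred))
        results.append((name, test_size, rmse, r2))

for name, test_size, rmse, r2 in results:
    print(f"{name:14s} test_size={test_size:.1f} RMSE={rmse:.3f} R2={r2:.3f}")
```

`train_test_split` shuffles the rows by default. For chronologically ordered data, passing `shuffle=False` so that the test set contains only the most recent observations is one alternative worth considering.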

Conclusion
Throughout our paper, we conducted an in-depth analysis using sophisticated data processing and machine learning techniques to unravel the complex interactions between weather conditions and agricultural outputs in Kazakhstan.
We distilled large datasets into actionable insights, revealing strong correlations between various weather factors and crop yields. Our exploratory data analysis, visualized through line plots and heat maps, provided a clear depiction of trends and helped identify factors that could significantly influence agricultural productivity.
One of the critical challenges we faced was the integration of diverse datasets, which required careful preprocessing to ensure data integrity. We also had to navigate the inherent complexities of agricultural data, which presented both a methodological challenge and an opportunity to refine our analytical approaches. The research's conclusion highlights several key takeaways:
• The importance of rigorous data cleaning and preprocessing to enable accurate modeling.
• The potential of using weather data to predict agricultural yields, offering valuable insights for farmers and policymakers.
• The realization that agricultural data analysis is complex and multifaceted, necessitating a nuanced approach to model building and evaluation.
In future iterations of this work, we could explore the integration of additional data sources, such as satellite imagery or soil quality data, to further refine our predictions. Moreover, delving into more advanced machine learning models and expanding our hyperparameter tuning could potentially yield even more accurate forecasts.
In conclusion, this research stands as a testament to the power of data science in agriculture. By blending traditional statistical methods with modern machine learning techniques, we have made strides in predicting agricultural yields, contributing valuable knowledge to the field and setting the stage for further research and innovation.

Figure 1. Implementation of the Decision Tree Regressor algorithm code in Python

Figure 2. Implementation of the Random Forest Regressor algorithm code in Python

Figure 3. Graphical representation of Support Vector Regression

Figure 4. The main steps to solve the research

Figure 1. Agroclimatic and crop yield data for the North Kazakhstan region for 1990-2023

Figure 2. Time Series of humidity of North Kazakhstan region for 1990-2023

Figure 3. Precipitation and temperature trends North Kazakhstan region for 1990-2023

Figure 4. Potato yields of North Kazakhstan region

Figure 5. Correlation matrix between agroclimatic parameters and data on agricultural crop yields in the North Kazakhstan region

Figure 6. Decision Tree visualization for potato yield prediction task

Figure 7. Code fragment for potato yield forecasting

Table 1. Results of Random Forest, Decision Tree, Support Vector Machine algorithms with different test and training data