COMPARATIVE EFFECTIVENESS OF RULE-BASED AND MACHINE LEARNING METHODS IN SENTIMENT ANALYSIS OF KAZAKH LANGUAGE TEXTS

Abstract: Sentiment analysis is increasingly pivotal in natural language processing (NLP), crucial for deciphering public opinion across diverse sectors. This research conducts a comparative examination of rule-based and machine learning (ML) methods in sentiment analysis, specifically targeting the Kazakh language. Given the Kazakh language's limited exposure in computational linguistics, the study meticulously evaluates datasets from news articles, literature, and Amazon product reviews, aiming to compare the efficiency, adaptability, and overall performance of these distinct approaches. Employing evaluation metrics such as the F1 score, the study assesses the relative performance of each approach across the datasets.


Introduction
Sentiment analysis is the technique of determining a text's emotional tone and classifying it as positive, negative, or neutral [1][2]. Numerous studies on sentiment analysis have been carried out over the years across a variety of languages, including agglutinative languages [3][4][5]. On the other hand, academic studies concentrating on sentiment analysis in the Kazakh language are limited. In the field of natural language processing, sentiment analysis has become central to extracting insights from vast and dynamic social media datasets, thereby shaping the evolving landscape of the discipline in the era of big data.
Effective preprocessing is essential for accurate sentiment analysis of user comments, which often feature diverse languages and spelling errors. Niyazmetova K. et al. [9] focus on sentiment analysis of Tashkent restaurant reviews from Google Maps, emphasizing dataset preprocessing as pivotal for algorithm optimization. They employ logistic regression models for their robust statistical foundation in binary classification, well suited to sentiment categorization tasks. Specifically, they integrate preprocessing techniques such as stemming, tailored for agglutinative languages characterized by complex word formation. Evaluation results demonstrate the system's effectiveness, underscoring the benefits of preprocessing, especially stemming, for agglutinative languages, as it standardizes text and enhances emotion discernment. The study highlights the critical role of preprocessing in sentiment analysis, particularly in diverse linguistic contexts such as Tashkent restaurant reviews, and advances sentiment analysis techniques by leveraging logistic regression models and language-specific preprocessing.
Tussupov J. et al. [10] investigate word normalization algorithms and morphological models tailored to the unique linguistic features of Kazakh. They explore synthesizing normalized forms and identifying word bases in Kazakh, offering guidelines for handling non-dictionary concepts and nonexistent terms. This inclusive approach accommodates the dynamic nature of language and suits languages with evolving vocabularies. Their development of a Kazakh thesaurus of scientific and technical terms in information technology demonstrates the algorithm's flexibility and reliability, particularly in specialized fields. The study enhances textual data processing, especially in morphologically complex languages like Kazakh, through the formulation of normalization rules. By addressing Kazakh's linguistic complexities, the research advances linguistic tools and underscores the importance of specialized methods for unique morphological structures. The creation of a customized thesaurus exemplifies the algorithm's utility in specific language contexts.

Zhumabekova A.K. et al. [11] explore the intricacies of indirect translation between English and Kazakh. They highlight the significant role of Russian as a mediator between English and Kazakh and discuss how linguistic and cultural differences between the languages influence the translation outcome. Through a comparative analysis of translated texts and examination of semantic shifts and stylistic adaptations, the authors elucidate the challenges translators face in preserving meaning and cultural nuances. They also propose strategies for mitigating these challenges, emphasizing the importance of linguistic competence and cross-cultural sensitivity. This study contributes to the understanding of indirect translation practices and provides insights for translators and researchers working in multilingual contexts.
Sentiment analysis, an essential part of natural language processing, is concerned with polarity-based text classification. Opinion mining is essential in this field, especially for determining what people think about movies or goods. The importance of user opinions to purchase decisions is hard to overstate; movie star ratings, for example, strongly influence prospective audiences and shape their tastes. Similarly, product reviews greatly influence consumers' perceptions and decisions. The Naive Bayes classifier, a probabilistic method often employed in sentiment research, is used for classification in the study by Surya P. P. et al. [12]. The dataset, taken from the UCI repository, is an Amazon product review collection of about 600 entries. Every record is analyzed using the Naive Bayes technique, which ultimately yields a probabilistic matrix; an accuracy matrix is then used to evaluate the suggested strategy. The study offers important insights into how well the Naive Bayes classifier distinguishes sentiments across different textual datasets.
The importance of customer reviews is examined with reference to Shopee, an online store. Shopee's ongoing buying and selling activity generates a growing number of user evaluations, which serve as essential product references. Customers can express their opinions, both favorable and unfavorable, in comments on the Shopee website. To deal with this situation, Hariguna T. et al. [13] propose a sentiment analysis approach that combines a naive Bayes classifier with the K-means clustering method. K-means is used to group comments into clusters, and the naive Bayes classifier is used to evaluate how well these groups are classified. According to the study, K-means clustering achieves an accuracy of 77.12%, identifying 116 negative and 37 positive comments in product evaluations. This is noteworthy because it exceeds the 56.86% accuracy attained using K-means, the naive Bayes classifier, and manually labeled data. The results highlight the frequency of unfavorable remarks, especially noticeable for the product «High Heels Women Knot Ribbon Ikat FX18» by Spatuafa. The study thus emphasizes how crucial sentiment analysis is to understanding customer feedback and the potential influence of unfavorable remarks on how products are perceived and assessed in e-commerce.
The literature review raises a pertinent question regarding the choice of methods for sentiment analysis, language processing, and translation tasks: which methods are better to use? This question emerges from the diverse approaches adopted in the studies reviewed, ranging from rule-based methods to machine learning models. While some studies demonstrate the effectiveness of rule-based approaches, others showcase the potential of machine learning techniques. The comparison between these methodologies prompts further investigation into their respective strengths and weaknesses, considering factors such as accuracy, scalability, computational efficiency, and adaptability to different linguistic contexts. Ultimately, the choice of method may depend on the specific requirements of the task at hand, the availability of labeled data, computational resources, and the desired level of interpretability. Exploring the comparative advantages of rule-based and machine learning approaches is therefore essential for determining the most suitable method for sentiment analysis, language processing, and translation tasks.
The aim of the study is to determine the comparative effectiveness of rule-based methods against machine learning algorithms for sentiment analysis.
Many studies have been conducted on sentiment analysis in natural languages, and notable progress has been made in well-resourced languages such as English, Turkish, and Russian. However, due to a lack of necessary tools and resources, the Kazakh language is still largely underserved in this field. The proposed methodology incorporates three datasets: D. Chapaev's Sentiment Analysis Dataset, Serek's Agglutinative Language Sentiment Dataset, and the Kaggle Amazon Sentiment Labeled Dataset. Two approaches were compared: rule-based sentiment analysis focusing on Kazakh language adjectives, and machine learning models including Logistic Regression, Multinomial Naive Bayes, Decision Trees, Random Forest, and XGBoost.
This study underscores the importance of dataset-specific considerations in sentiment analysis tasks in Kazakh, highlighting the complementary strengths of both rule-based and ML approaches.

A. Dataset
In this research, three distinct datasets were utilized to conduct a comprehensive analysis. The details of each dataset are outlined below:

1. D. Chapaev's Sentiment Analysis Dataset:
Source: Dauren Chapaev's sentiment analysis dataset on GitHub [14].
Composition: The dataset comprises 20,014 sentences extracted from various news websites. Within this dataset, 5,993 sentences are classified as negative, 4,422 as positive, and the remaining 9,599 are labeled as neutral.

2. Serek's Agglutinative Language Sentiment Dataset:
This dataset consists of 732 sentences, categorized into 231 positive, 228 negative, and 273 neutral sentiments. The selection of sentences is based on the emotionally charged content of the book, focusing on emotions such as anger, fear, resentment, sadness, and hopelessness (considered negative), and inspiration, anticipation, joy, euphoria, delight, interest, admiration, and satisfaction (considered positive), with all other cases designated as neutral.

3. Kaggle Amazon Sentiment Labeled Dataset:
Acquired from Kaggle, this dataset was originally in English [16] and was subsequently translated into Kazakh. It was created for the research 'From Group to Individual Labels using Deep Features' [17].
The dataset contains sentences labeled with positive (score 1) or negative (score -1) sentiment. It was initially sourced from three different websites/fields (imdb.com, amazon.com, and yelp.com), each providing 500 positive and 500 negative sentences. For this research, sentences from the amazon.com domain were used.
For the rule-based sentiment analysis component of this study, a set of Kazakh language adjectives was compiled from Sozdik Qor [18]. Sozdik Qor is a comprehensive platform that provides access to words and stable phrases from diverse industry dictionaries and encyclopedias. It encompasses ancient words in the Kazakh language, loanwords, and the meanings of newly emerging technological terms in the realm of regional and information technologies. The portal's search engine allows users to explore word definitions, synonyms, antonyms, homonyms, and their occurrences in phraseological phrases or within sentences, all conveniently presented on a single page.
Specifically, the focus was on gathering adjectives in the Kazakh language from this platform. However, these adjectives were initially unlabeled, so sentiment labels in the range of -1 to 1 were manually assigned. The final dataset contains 5,539 adjectives: 1,902 tagged as close to positive sentiment, 1,657 as close to negative sentiment, and 1,980 as close to neutral sentiment.

B. Rule-Based Sentiment analysis
The following principles have been developed to formalize the rules guiding the evaluation of sentiment in Kazakh language phrases:

1. Adjective Tonality: Let T_a denote an adjective's tonality. The tonality of the adjective determines the tonality of the phrase (T_p) directly:
T_p = T_a
Example: jaqsy adam -> Positive (tonality is 1)
Explanation:
→ jaqsy - a positive adjective with a sentiment label of 1
→ adam - a noun

2. Adverb Influence: Let T_a represent the adjective's initial tonality and T_p the tonality of the phrase. If an adverb (A) precedes the adjective, it amplifies the tonality:
T_p = 2 × T_a
Example: ote jaqsy adam -> Positive (tonality is 2)
Explanation:
→ ote - an adverb (intensifier)
→ jaqsy - a positive adjective with a label of 1
→ adam - a noun

3. Negation Impact: Let T_a be the original tonality of the adjective and T_p the tonality of the phrase. If a negation (N) appears after the adjective:
T_p = -T_a
Example: jaqsy adam emes -> Negative (tonality is -1)
Explanation:
→ jaqsy - a positive adjective with a label of 1
→ adam - a noun
→ emes - a negation

4. Cumulative Adjective Tonality: When a sentence has more than one adjective, the overall tonality of the phrase is the sum of the tonalities of all the adjectives. For instance, the combined tonality of 'ademi' (+0.7) and 'ote jaman' (-2) in a phrase is 0.7 - 2 = -1.3, indicating a negative attitude.
These principles provide the fundamental components for a deliberate approach to analyzing sentiments in Kazakh.A richer, more complex understanding of the emotions conveyed in the text can be gained by exploring the complex relationship between adjectives, adverbs, and negations.
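As an illustration, the four rules above can be sketched in Python. The adjective labels, intensifier set, and negation set below are toy placeholders, not the study's actual Sozdik Qor lexicon:

```python
# Toy sketch of the rule-based tonality rules; the lexicon entries,
# intensifiers, and negations are illustrative placeholders.
ADJ_TONALITY = {"jaqsy": 1.0, "jaman": -1.0, "ademi": 0.7}
INTENSIFIERS = {"ote"}   # Rule 2: a preceding adverb amplifies the adjective
NEGATIONS = {"emes"}     # Rule 3: a trailing negation flips the sign

def phrase_tonality(tokens):
    """Apply Rules 1-4: sum the (possibly amplified or negated) adjective tonalities."""
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in ADJ_TONALITY:
            continue
        t = ADJ_TONALITY[tok]                         # Rule 1: adjective sets the tonality
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            t *= 2                                    # Rule 2: 'ote jaqsy' -> 2
        if any(n in tokens[i + 1:] for n in NEGATIONS):
            t = -t                                    # Rule 3: 'jaqsy adam emes' -> -1
        total += t                                    # Rule 4: cumulative tonality
    return total
```

For example, `phrase_tonality("ademi zat ote jaman".split())` yields 0.7 - 2 = -1.3, matching the cumulative example above.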

C. Data Processing and Machine Learning Approaches

• Data Preprocessing:
To ensure that the data flowing into the models was in the best possible state, thorough cleaning of the text was conducted before the machine learning models were applied. This cleanup involved several processes:
→ Lowercasing: Convert all text to lowercase to maintain consistency and eliminate case-related differences.
→ HTML Tag Removal: Remove HTML tags to eliminate any unnecessary content.
→ Special Character, Number, and Punctuation Removal: Remove punctuation, special characters, and numeric values for clearer, more focused text.
→ Stopword Removal: To cut down on noise and highlight words that provide meaning, omit frequently used stop words.
→ Stemming: Utilize stemming to break down words into their most basic form, which will enable more efficient feature extraction.
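A minimal sketch of this cleaning pipeline in Python. The stopword set is a placeholder and the stemmer is left as a pluggable hook, since the text does not specify which Kazakh stopword list or stemmer was used:

```python
import re

STOPWORDS = {"jane", "men", "bul"}  # placeholder stopwords, not the study's actual list

def preprocess(text, stem=lambda w: w):
    """Clean raw text: lowercase, strip HTML, drop non-letters, remove stopwords, stem."""
    text = text.lower()                                  # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tag removal
    text = re.sub(r"[^a-zа-яәғқңөұүһі\s]", " ", text)    # punctuation/numbers/special chars
    tokens = [w for w in text.split() if w not in STOPWORDS]  # stopword removal
    return [stem(w) for w in tokens]                     # stemming hook
```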

• Vectorization Techniques:
The processed text data was vectorized using two widely used methods to provide numerical features for the machine learning models:

1. TF-IDF Vectorizer:
The Term Frequency-Inverse Document Frequency (TF-IDF) method is employed to measure a term's significance in relation to a collection of documents. It takes into account the inverse document frequency over the whole dataset in addition to the term's frequency in a document. TF-IDF was selected for its capacity to draw attention to important terms in a document while downweighting common terms, which strengthens the discriminative power of the sentiment analysis features [19].

Mathematical Formulation: For a term t in a document d, TF-IDF is calculated as in (1):

TF-IDF(t, d) = TF(t, d) × IDF(t)    (1)

where TF is the term frequency and IDF is the inverse document frequency. This weighting downplays common terms while highlighting the significance of distinctive terms in a text.

2. Count Vectorizer:
Count vectorization transforms the text into a sparse matrix that records the number of occurrences of each term in each document. It was selected for its ease of use and its efficacy in capturing word frequency within a document. It provides a straightforward representation of word occurrences, which is helpful for some types of sentiment analysis tasks [20].

Mathematical Formulation: The count of each term in a document is represented as a matrix element (2):

Count(t, d) = number of occurrences of term t in document d    (2)

which produces a sparse matrix highlighting word occurrences.
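Both vectorizers are available in scikit-learn (an assumed implementation, as the paper does not name its libraries); a small illustration on a toy corpus of placeholder sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["jaqsy adam", "jaman adam", "ote jaqsy kitap"]  # toy corpus

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)   # sparse term-count matrix, as in (2)

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)    # TF-IDF-weighted matrix, as in (1)
```

Both calls return sparse matrices of shape (number of documents, vocabulary size), which feed directly into the classifiers described next.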

• Machine Learning Models:
The vectorization techniques were used in combination with five commonly used classifiers:

1. Logistic Regression: A linear model that works well for binary classification. It predicts the likelihood that a sample belongs to a specific class, which makes it useful for sentiment analysis tasks. The simplicity, effectiveness, and interpretability of logistic regression make it a preferred option, and it frequently performs well in text categorization [21].

Mathematical Formulation: Logistic regression predicts the probability that a sample belongs to a specific class using the sigmoid function (3):

P(y = 1 | x) = 1 / (1 + e^-(b_0 + b_1·x_1 + … + b_n·x_n))    (3)

where b_0, b_1, …, b_n are coefficients and x_1, …, x_n are features.

2. Multinomial Naive Bayes:
Multinomial Naive Bayes is a probabilistic classifier based on Bayes' theorem. It works especially well for text classification because it assumes that the features are conditionally independent given the class. Naive Bayes was selected for its efficacy and efficiency in handling sparse, high-dimensional data, making it a good fit for tasks involving word frequencies in documents [22].

Mathematical Formulation: Naive Bayes calculates the probability of a document belonging to a class given its features (4):

P(c | d) ∝ P(c) × ∏_i P(t_i | c)    (4)

where c is a class, d is a document, and t_i are its terms. The assumption of conditional independence simplifies the computation.

3. Decision Trees:
A versatile model for classification and regression that segments the dataset into branches, making it highly interpretable. Decision Trees can effectively categorize textual data based on feature thresholds.
Mathematical Formulation: Decision trees use criteria such as Gini impurity or information gain to split data, aiming to create subsets that are as pure as possible at each node.

4. Random Forest:
An ensemble model that combines the predictions of many decision trees.

Mathematical Formulation: For classification tasks, the Random Forest model takes the majority vote from its decision trees; in regression tasks, it averages their outputs. The randomness injected into the model-building process helps reduce overfitting.
5. XGBoost: Short for eXtreme Gradient Boosting, a scalable and efficient implementation of gradient boosting known for its performance and speed in data competitions. XGBoost is particularly useful for handling structured data in both classification and regression tasks.
Mathematical Formulation: XGBoost optimizes a loss function by iteratively adding trees that predict the residuals or errors of prior trees, with an objective to minimize these errors across all predictions.
The inclusion of Decision Trees, Random Forest, and XGBoost alongside Logistic Regression and Multinomial Naive Bayes enriches the analysis by covering a broad spectrum of machine learning approaches, ranging from simple to complex models.This diverse set of models was selected for their complementary strengths in handling different aspects of sentiment analysis, with considerations for factors such as interpretability, efficiency, and effectiveness in managing high-dimensional data.
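Assuming a scikit-learn workflow (the paper does not name its libraries), the model comparison can be sketched as follows; XGBoost's XGBClassifier would slot in analogously but is omitted to keep the sketch within scikit-learn:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Illustrative hyperparameters; the study's actual settings are not stated.
MODELS = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

def evaluate(train_texts, train_y, test_texts, test_y):
    """Fit each classifier on TF-IDF features and report its weighted F1 score."""
    scores = {}
    for name, clf in MODELS.items():
        pipe = make_pipeline(TfidfVectorizer(), clf)
        pipe.fit(train_texts, train_y)
        scores[name] = f1_score(test_y, pipe.predict(test_texts), average="weighted")
    return scores
```

Swapping `TfidfVectorizer()` for `CountVectorizer()` in the pipeline reproduces the second vectorization condition of the comparison.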

• Data splitting:
The datasets were each separated into training and testing sets to facilitate the evaluation of the proposed models' performance across multiple data sources.Specifically, for each dataset, 80% of the data was allocated for training purposes, while the remaining 20% was reserved for testing.This consistent partitioning approach ensures a uniform evaluation framework for all three datasets.Prior to model training and testing, extensive data preprocessing procedures were implemented to ensure that the data entering the models was in the best possible condition.This preparation involved several cleaning processes tailored to each of the datasets.
• Computational Environment: The analysis was carried out in Google Colab, a cloud-based platform for machine learning and Python programming. Google Colab offers a convenient and scalable environment for running code, which is especially helpful for resource-intensive tasks such as machine learning. The platform also makes the described method easy to reproduce and simplifies collaboration.

Results and Discussion
The F1 score for the applied model on each individual dataset is used to illustrate the results, giving a thorough picture of how different vectorizers, classifiers, and the rule-based method performed (Tables 1, 2, and 3).
Their strong performance underscores the potential of these models in analyzing texts with nuanced language structures. The Rule-Based method's continued success emphasizes its capacity to grasp the subtleties of literary expressions effectively.

Amazon Sentiment Labeled Sentences Dataset Results:
The Amazon Sentiment Labeled Sentences Dataset shows the strengths of ensemble methods, with Random Forest leading in performance.This suggests their robustness in adapting to the varied linguistic patterns typical of online reviews.Decision Trees and XGBoost also offer strong alternatives, highlighting the diversity of effective approaches available for sentiment analysis.

Comparative Analysis:
• Vectorization Techniques: The consistent performance of TF-IDF and Count Vectorizer across all datasets underscores their reliability in capturing textual features, regardless of the linguistic context.
• Classifier Performance: Logistic Regression and Multinomial Naive Bayes demonstrate versatility across the datasets, suggesting their general applicability to sentiment analysis. The decision on which model to use may depend on specific project needs related to interpretability and computational demands.
• Dataset-Specific Observations: The models exhibit a capacity to adapt to the varied language patterns encountered, from the broad scope of Dauren Chapaev's dataset to the specific challenges posed by Azamat Serek's literary work. The Rule-Based method's performance in the latter case points to its effectiveness in contexts where capturing emotional depth is crucial.
• Model Selection: The study illustrates the efficacy of both machine learning models and Rule-Based approaches in their respective domains. Ensemble methods, in particular, show promise for their adaptability and robust performance across different text types.

Considerations for Sentiment Analysis in Kazakh:
The findings emphasize the critical role of model and preprocessing technique selection in sentiment analysis.While ensemble methods like Random Forest prove to be highly effective across various text types, the value of Rule-Based approaches in capturing nuanced emotional content should not be overlooked.The diversity of the Kazakh language, with its range of literary and journalistic expressions, necessitates flexible and nuanced analysis strategies.
The discussion above treats Decision Trees, Random Forest, and XGBoost as foundational elements of the study, providing a nuanced comparison of their performance against traditional models and a Rule-Based approach across diverse datasets.

Conclusion
The study's findings provide empirical support for ongoing debates in the field and mark a substantial advancement in our understanding of sentiment analysis techniques. Practitioners looking to apply sentiment analysis methodologies can also benefit from the practical insights gained from this comparative investigation.
The findings show that rule-based methods excel at identifying the nuanced emotional content typical of literary works, whereas machine learning (ML) models exhibit remarkable flexibility in handling the language variation common in news articles and reviews. This distinction underscores the effectiveness of rule-based strategies in specific emotional contexts within literature, and the adaptability of ML models to the wide range of linguistic patterns prevalent in news and review datasets.
Nevertheless, the analysis revealed notable limitations in the performance of the rule-based model, particularly evident on the Sentiment Labeled Sentences dataset. This highlights the challenges of accurately capturing a wide spectrum of sentiment expressions and language nuances, particularly in contexts beyond literary texts. Such findings underscore the need for further research to enhance current sentiment analysis techniques.
Promising directions for future research include integrating sentiment dictionaries or domain-specific lexicons designed to fit the variety of language styles common outside literary fields. Such efforts could improve sentiment analysis models' ability to operate in the complex linguistic environment of Kazakh, making them more adaptable to a wider range of sentiment expressions and linguistic subtleties in non-literary texts.