BUDDI Health - Class Imbalance Based Deep Learning Platform

A CLARITRICS COMPANYClass Imbalance Learning BUDDI - Deep Learning Platform Mr. S. Sudarsun is Chief Data Scientist for BUDDI Health and a proven Applied Research Machine Learning has now become an avenue for absorbing and modeling what one knows and Scientist in the field of Text Mining and Deep Learning and a PhD scholar at Indian Institute of doesn’t know, to enable more targeted information to help individuals learn. Platform companies— like Technology, Madras since 2013. His award winning prior work in artificial intelligence includes Google, LinkedIn, and Amazon are building algorithms that represent us and our needs for data driven analytics platforms for Israeli government, Google, Several European governments and 30+ decisions. Learning lifelong from the data generated by individuals will allow machines to know U.S State governments. Sudarsun has over 12+ years of hands-on deep learning expertise and practicing end-to-end Product Development using Statistical Natural Language Processing. enough about individuals, and make recommendations that adapt to our changing contexts. At the core of this opportunity is data, tonnes of it, and algorithms that convert data into a matrix of comparisons for decision making. Prof. B. Ravindran (Profile) an Ai Advisor for BUDDI Health, is currently an associate “Customers who bought this item also bought” is a popular phrase in Amazon’s online retail, which has professor in Computer Science at Indian Institute of Technology, Madras. He has over two been using data driven approaches to recommend products. With LinkedIn’s Economic Graph[2], decades of research experience in machine learning and specifically reinforcement learning. He collaborates on several Ai projects across several Ivy League universities and is considered the “Father of Reinforcement insights such as changing nature of work, skills that comprise specific jobs, work-related behaviors that Learning”. Currently his research interests are centered on learning from and through interactions pre-signal a change of job and recognizing skills-gap have become readily available for informed decision and span the areas of data mining, social network analysis, and reinforcement learning. Dr. Ravi making. On medical imaging front, machine learning models[3] are used to detect melano- Abstract commands a solid 120+ research publications in this field. Data classification task assigns labels to data points using a model that is learned from a collection of ma from the images of pigmented skin lesions. Android powered mobile phones allow Google to have pre-labeled data points. The Class Imbalance Learning (CIL) problem is concerned with the per- access to data generated by a very large population of individuals and provide assistance on formance of classification algorithms in the presence of under-represented data and severe class dis- recommendation and decision making in several day-to-day activities such as travel, food, shopping etc. tribution skews. Due to the inherent complex characteristics of imbalanced datasets, learning from Autonomous cars use digital imaging and image classification models to identify the type of obstacle such data requires new understandings, principles, algorithms, and tools to transform vast amounts of for being humans, animals or objects to negotiate appropriately. In agriculture, models for crops and soil raw data efficiently into information and knowledge representation. It is important to study CIL conditions are used to decide which crop to grow given a particular soil and climatic conditions. In because it is rare to find a classification problem in real world scenarios that follows balanced class speech recognition space, machine learning models are used to identify speaker based on the unique distributions. In this article, we have presented how machine learning has become the integral part of characteristics learned about every speaker during the model training phase. modern lifestyle and how some of the real-world problems are modeled as CIL problems. We have also provided a detailed survey on the fundamentals and solutions to class imbalance learning. We conclude Data Classification the survey by presenting some of the challenges and opportunities with class imbalance learning. A typical machine learning classification task comprises of a training set, an evaluation set, a testing set of data points and a performance evaluation metric[4]. The datasets can be either unlabeled or labeled. Introduction A labeled dataset is the one where for every input data point there is an associated output label Machine learning is a core sub-area of artificial intelligence that enables computers to learn without (categorical value) assignment. The labeled datasets are expensive to construct as it requires human being explicitly programmed. When exposed to new data, computer programs are enabled to learn, labor to verify the truth of the output labels. grow, change, and develop by themselves. Machine learning is a method of data analysis that auto- mates analytical model building to find hidden insights without being explicitly programmed where to Mathematically, the set of labels is represented as Y={y1, y2, …, ym}, the ith input data point is repre- look. The iterative aspect of machine learning allows models to independently adapt to new data sented as Xi and its associated output label is represented as yi, where yi → Y . A typical dataset is exposure. The algorithms learn from previous computations to produce reliable, repeatable deci- represented as D=⟨X,y⟩ , where X is typically a matrix of N input row vectors of P-dimensions and y is a Class Imbalance Learning sions and results. column vector of length N. The objective of a machine learning classifier is to learn a function f Class Imbalance Learning : X → y, that minimizes the misclassification error where ˆyi and yi are the predicted A computer program is said to learn from experience E with respect to some task T and some per- and true label for a data point Xi respectively. formance measure P, if its performance on T, as measured by P, improves with experience E.[1] The training set is the dataset population from which one or more statistical machine learning mod- Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) 2 3 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) els are learned. The evaluation dataset is used to select the best machine learning algorithm or strategy tistical Classification of Diseases and Related Health Problems (ICD)[5], a medical classification list by by measuring the performance of the learned models on the evaluation dataset using the evaluation the World Health Organization (WHO). It contains codes for diseases, signs and symptoms, abnormal metric(s). Some of the popular point based performance metrics are accuracy, precision-recall, F1-score findings, complaints, social circumstances, and external causes of injury or diseases. ICD-10 Clinical and curve-based performance metrics are ROC, PRC, AUC, cost curves. Once the statistical model is Modification has about 69,833 codes, and ICD-10 Procedure Coding System has 71,918 codes, which chosen and its parameters are tuned using the evaluation process, the performance on test dataset is make up for a perfect challenge for class imbalance learning, where the coding system is both multi-label measured using the same evaluation metric and reported. classification, with thousands of classes and severely skewed in terms of class distribution. Companies like BUDDI Health[6] provide deep learning and knowledge based hybrid solutions to the extreme Supervised classification algorithms learn from labeled examples, where the training data comprises of classification problem, while companies like 3M, Optum, Nuance, Dolbey and Opera provide statistical an input data representation along with the desired output label. For example, a quality test of a NLP and rule based solutions[7]. machinery could have data points labeled either “F” (fail) or “P” (pass). The learning algorithm receives a set of inputs represented predominantly as a vector of feature values, along with the corresponding Suicide Prevention Tools[8]: Facebook has introduced a suicide-prevention feature that uses AI to correct output labels, and the algorithm learns by comparing its predicted output with correct identify posts indicating suicidal or harmful thoughts. The AI scans the posts and their associated outputs to find the misclassification error. It then modifies the model accordingly so as to minimize the comments, compares them to others that merited intervention, and, in some cases, passes them along error. Supervised learning is commonly used in applications where historical data is used to learn a to its community team for review. The company plans to pro-actively reach out to users it believes model and predict the likely future events. For example, it can predict which insurance customer is are at risk, showing them a screen with suicide-prevention resources including options to contact a likely to file a claim for, or whether a user would buy a product or not, or if the cricket team would win helpline or reach out to a friend. Algorithms, trained on report data from the network’s close to two the next match and so forth. billion users, are constantly on the lookout for warning signs in content that users post, as well as replies that are received. When a red flag is raised, a team of human reviewers is alerted, and the user can be Semi-supervised classification algorithms are used for the same applications as supervised learning. But contacted and offered help. it uses a combination of both labeled and unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data as unlabeled data is easy to acquire and less Fraud Management: is an example of big data with class imbalance characteristics, where the suspi- expensive. Semi-supervised learning is useful when the cost associated with labeling is too high to cious behaviors are “fortunately” rare events. Big data analytics link heterogeneous information from allow for a fully labeled training process. transaction data, which enables the service provider to pick up these behaviors automatically. For example, a series of cash-in transactions to the same account, from different locations, might be an Class Imbalance Learning attempt to avoid paying for domestic transfers, or several cash-ins immediately followed by a cash- out The Class Imbalance Learning (CIL) problem is concerned with the performance of classification could indicate money laundering. Best practice states that no actions are purely automated; the fraud algorithms in the presence of under-represented data and severe class distribution skews. Due to the analyst always has the final say. As a population’s behavior evolves over time, the parameters of the inherent complex characteristics of imbalanced datasets, learning from such data requires new fraud detection models must adapt to remain optimal. In response to this, machine learning algorithms understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into predict a natural evolution of behavior based on historical data as well as previous actions taken by fraud information and knowledge representation. It is necessary to study CIL because it is apparently rare to managers in decision making, and also propose modifications to the detection model for future anomaly find classification problems in real-world scenarios that follow balanced class distributions. We present detection. some real-world use cases, where the data distribution is naturally imbalanced and the data distribution of interest is typically the minority class data points. Churn Prediction: One of the most important business metrics for executives is churn rate—the rate at Class Imbalance Learning Class Imbalance Learning which your customers stop doing business with you. Today, data driven companies use data science Deep Learning based Auto Medical Coder: A deep learning coding system (DLC) is a software to effectively predict which customers are likely to churn. Such programs allow companies to pro- system that analyzes healthcare documents (medical charts) with a deep parser technology and actively protect revenue by incentivizing the potential churners to continue staying with them. predicts appropriate medical codes (compliant to Medicare/Medicaid/SnoMed) for specific phrases Networking and communication companies typically have business level data, usage logs, and call and terms within the document. ICD10 is the 10th revision of the International Sta- Sudarsun Santhiappan and Balaraman Ravindran 4 5 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) center/support tickets, among other data assets. This data is generated from their consumer and it is possible that the acquired dataset can be imbalanced in which case the dataset would be an business customer interactions, and these datasets vary in terms of volume and user behavior. The extrinsic imbalanced dataset attained from a balanced data space. In addition to intrinsic and ex- machine learning task of interest is to predict if a customer would become a churner or not, which is trinsic imbalance [2, 3], it is important to understand the difference between relative imbalance and relatively a very small subset of the entire customer population. imbalance due to rare instances (or “absolute rarity”). Buyer Prediction: Millions of consumers visit e-commerce sites, but only few thousand visitors buy the Noise Samples Noise Samples products, which makes the imbalance ratio in the order of 1000:1 or more. A typical e-retailer wants f2 f2 to improve the customer experience and would like to improve the conversion rate. The objective is to A: General Majority Class Concept identify the potential buyers based on their demographics, historical transaction pattern, clicks pattern, browsing pattern in different pages, etc. Deep data analysis reveals the buyers’ buying behaviors which B: General Minority Class Concept are highly dependent on their activities like number of clicks, session duration, previous session, purchase session, clicks-rate per session etc. By applying machine learning and predictive analytics C: Minority Class Subconcept methods, the propensity score of each visitor can be estimated. This leads to multiple benefits for the retailer to offer right and targeted product for the customers at the right time, increase conversion rate, D: Majority Class Subconcept and improve customer satisfaction. (a) f1 (b) f1 Foundations of Imbalanced Learning Fig. 1. A data set with a between-class imbalance. (b) A high-complexity dats set with both between-class and within class imbalances, multiple concepts, Any dataset that exhibits an unequal distribution among its classes can be considered imbalanced. overlapping, noise, and lack of representative data. However, the common understanding in the community is that imbalanced data correspond to datasets exhibiting significant, and in some cases extreme, imbalances. Specifically, this form of imbalance Data complexity is a broad term that comprises of issues such as overlapping, lack of representative is referred to as a between class imbalance; not uncommon are between-class imbalances in the order data, small disjuncts, and others. In a simple example, consider the depicted distributions in Figure 1, of 100:1, 1000:1, and 10000:1, where in each case, one class severely out represents another. In order to where the stars and circles represent the minority and majority classes, respectively. By inspection, we highlight the implications of the imbalanced learning problem in the real world, let’s consider a real-life see that both the distributions in Figures 1a and 1b exhibit relative imbalances. However, notice how problem of classifying a visitor to be a buyer or a non-buyer on an online retail portal, which typically has Figure 1a has no overlapping examples between its classes and has only one concept pertaining to each a ratio of 1:1000 or more. In reality, we find that classifiers tend to provide a severely imbalanced degree class, whereas Figure 1b has both multiple concepts and severe overlapping. Also of interest is of accuracy [1], with the majority class having close to 100 percent accuracy and the minority class subconcept C in the distribution of Figure 1b. This concept might go unlearned by some inducers due having accuracies of 0-5 percent. We require a classifier that will provide high accuracy for the minority to its lack of representative data; this issue embodies imbalances due to rare instances, which we class without severely jeopardizing the accuracy of the majority class. proceed to explore. Imbalance due to rare instances is representative of domains where minority class examples are very limited, i.e., where the target concept is rare. In such a situation, the lack of Intrinsic imbalance is a direct result of the nature of the data space. However, imbalanced data are not representative data will make learning difficult regardless of the between-class imbalance. Further- solely restricted to the intrinsic variety. Variable factors such as time and storage also give rise to more, the minority concept may additionally contain a sub-concept with limited instances, amounting to datasets that are imbalanced. Imbalances of this type are considered extrinsic, i.e., the imbalance is not diverging degrees of classification difficulty. This, in fact, is the result of another form of imbalance called directly related to the nature of the data space. Extrinsic imbalances are equally as interesting as their within-class imbalance [4, 5, 6], which concerns itself with the distribution of representative data for Class Imbalance Learning Class Imbalance Learning intrinsic counterparts since it may very well occur that the data space from which an extrinsic sub-concepts within a class. These ideas are again highlighted in our simplified example in Figure 1. In imbalanced dataset is attained may not be imbalanced at all. For instance, suppose a dataset is Figure 1b, cluster B represents the dominant minority class concept and cluster C represents a procured from a continuous data stream of balanced data over a specific interval of time, and if during subconcept of the minority class. Cluster D represents two sub-concepts of the majority class and this interval, the transmission has sporadic interruptions where data are not transmitted, then cluster A (anything not enclosed) represents the dominant majority class concept. For Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 6 7 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) both classes, the number of examples in the dominant clusters significantly outnumber the examples in their respective sub-concept clusters, so that this data space exhibits both within-class and between-class imbalances. Moreover, if we completely remove the examples in cluster B, the data space would then have a homogeneous minority class concept that is easily identified (cluster C), but can go unlearned due to its severe underrepresentation. The last issue to consider is the combination of imbalanced data and the small sample size problem. In many of today’s data analysis and knowledge discovery applications, it is often unavoidable to have data with high dimensionality and small sample size; some specific examples include face recognition and gene expression data analysis, among others. Traditionally, the small sample size problem has been studied extensively in the pattern recognition community. Dimensionality reduction methods have been widely adopted to handle this issue, e.g., principal component analysis (PCA) and various extension methods [8]. However, when the representative datasets’ concepts exhibit imbalances of the forms described earlier, the combination of imbalanced data and small sample size presents a new challenge to the community. In this situation, there are two critical issues that arise The existence of within-class imbalances is closely inter-twined with the problem of small disjuncts, simultaneously. First, since the sample size is small, all of the issues related to absolute rarity and within which has been shown to greatly degrade the classification performance [4, 5, 6]. Briefly, the problem of class imbalances are applicable. Second and more importantly, learning algorithms often fail to small disjuncts can be understood as follows: A classifier will attempt to learn a concept by creating generalize inductive rules over the sample space when presented with this form of imbalance. In this multiple disjunct rules that describe the main concept [7] . In the case of homogeneous concepts, the case, the combination of small sample size and high dimensionality hinders learning because of classifier will generally create large disjuncts, i.e., rules that cover a large portion (cluster) of examples difficulty involved in forming conjunctions over the high degree of features with limited samples. If the pertaining to the main concept. However, in the case of heterogeneous concepts, small disjuncts, i.e., sample space is sufficiently large enough, a set of general (albeit complex) inductive rules can be rules that cover a small cluster of examples pertaining to the main concept, arise as a direct result of defined for the data space. However, when samples are limited, the rules formed can become too under-represented sub-concepts . Moreover, since classifiers attempt to learn both majority and [7] specific, leading to overfitting. minority concepts, the problem of small disjuncts is not only restricted to the minority concept. On the contrary, small disjuncts of the majority class can arise from noisy misclassified minority class examples Furthermore, this also suggests that the conventional evaluation practice of using singular assess- or under-represented sub-concepts. However, because of the vast representation of majority class data, ment criteria, such as the overall accuracy or error rate, does not provide adequate information in the this occurrence is infrequent. A more common scenario is that noise may influence disjuncts in the case of imbalanced learning. Therefore, more informative assessment metrics, such as the receiver minority class. In this case, the validity of the clusters corresponding to the small disjuncts becomes an operating characteristics curves, precision recall curves, and cost curves, are necessary for conclu- sive important issue, i.e., whether these examples represent an actual subconcept or are merely attributed evaluations of performance in the presence of imbalanced data [9]. Class Imbalance Learning to noise. For example, in Figure 1b, suppose a classifier generates disjuncts for each of the two noisy Class Imbalance Learning minority samples in cluster A, then these would be illegitimate disjuncts attributed to noise compared to cluster C, for example, which is a legitimate cluster formed from a severely under represented subconcept. Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) 8 9 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) Solutions to the problem of Class Imbalance Learning actually provide the same proportion of balance. However, this commonality is only superficial; each method introduces its own set of problematic consequences that can potentially hinder learning [12, 13]. When standard learning algorithms are applied to imbalanced data, the induction rules that describe the In the case of under-sampling, the problem is relatively obvious: removing examples from the majority minority concepts are often fewer and weaker than those of majority concepts, since the minority class class may cause the classifier to miss important concepts pertaining to the majority class. In regards to is often both outnumbered and underrepresented. To provide a concrete understanding of the direct oversampling, the problem is a little more opaque: since oversampling simply appends replicated data effects of the imbalanced learning problem on standard learning algorithms, let’s consider the popular to the original dataset, multiple instances of certain examples become “tied,” leading to overfitting [12]. decision tree learning algorithm, where imbalanced datasets exploit inadequacies in the splitting In particular, overfitting in oversampling occurs when classifiers produce multiple clauses in a rule for criterion at each node. multiple copies of the same example which causes the rule to become too specific; although the training accuracy will be high in this scenario, the classification performance on the unseen testing data Decision trees use a recursive, top-down greedy search algorithm that uses a feature selection is generally far worse [14]. scheme (e.g., information gain) to select the best feature as the split criterion at each node of the tree; a successor (leaf) is then created for each of the possible values corresponding to the split feature. As a Informed Undersampling based on EasyEnsemble and BalanceCascade [15] is proposed to over- result, the training set is successively partitioned into smaller subsets that are ultimately used to form come the deficiency of information loss introduced in the traditional random undersampling method. disjoint rules pertaining to class concepts. These rules are finally combined so that the final hypothesis Another example of informed undersampling uses the K-nearest neighbor (KNN) classifier to achieve minimizes the total error rate across each class. The problem with this procedure in the presence of undersampling. Based on the characteristics of the given data distribution, four KNN undersampling imbalanced data is two-fold. First, successive partitioning of the data space results in fewer and fewer methods were proposed in [16], namely, NearMiss-1, NearMiss2, NearMiss-3, and the “most distant” observations of minority class examples, resulting in fewer leaves describing minority concepts and method. Another method, the one-sided selection (OSS) method [17] selects a representative subset of successively weaker confidences estimates. Second, concepts that have dependencies on different the majority class and combines it with the set of all minority examples to form a preliminary set, which feature space conjunctions can go unlearned by the sparseness introduced through partitioning. Here, is further refined by using data cleaning techniques. the first issue correlates with the problems of relative and absolute imbalances, while the second issue best correlates with the between-class imbalance and the problem of high dimensionality. In both Synthetic sampling is a powerful method that has shown a great deal of success in various appli- cases, the effects of imbalanced data on decision tree classification performance are detrimental. In the cations [18]. The SMOTE algorithm creates artificial data based on the feature-space similarities following sections, we evaluate the solutions proposed to overcome the effects of imbalanced data. between existing minority examples. To create a synthetic sample around a minority point in vector space, randomly select one of the K-nearest neighbors, then multiply the corresponding feature vector A battery of methods to address the class imbalance condition is available in the literature [10, 11]. difference with a random number between [0, 1], and finally, add this vector to the minority point vector. Typically, the methods are categorized into sampling methods, cost-sensitive methods, kernel methods Though it has many promising benefits, the SMOTE algorithm also has its drawbacks, including over and active learning methods. generalization and variance [19], which is largely attributed to the way in which it creates synthetic samples. Specifically, SMOTE generates the same number of synthetic data samples for each original Sampling methods The mechanics of random oversampling follow naturally from its description by minority example and does so without consideration to neighboring examples, which increases the adding a set sampled from the minority class: for a set of randomly selected minority examples, occurrence of overlapping between classes. To overcome these issues, only selected sub-samples of augment the original set by replicating the selected examples and adding them to the original set. In the minority class are subjected to synthetic sample generation. Borderline-SMOTE this way, the number of total examples is increased and the class distribution balance is adjusted [20] uses only the minority samples near the decision boundary to generate new synthetic samples. accordingly. This provides a mechanism for varying the degree of class distribution balance to any Class Imbalance Learning MWMOTE [21] identifies the hard-tolearn informative minority class samples and assigns them weights desired level. While oversampling appends data to the original dataset, random undersampling re- Class Imbalance Learning according to their euclidean distance from the nearest majority class samples. It then generates the moves data from the original dataset. At first glance, the oversampling and undersampling methods synthetic samples from the weighted informative minority class samples using a clustering approach. appear to be functionally equivalent since they both alter the size of the original dataset and can SCUT [22] over samples minority class examples through the generation of synthetic examples and employs cluster analysis in order to under sample majority classes. In addition, it handles Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 10 11 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) both within-class and between class imbalance. To this end, various adaptive sampling methods have the underlying data distribution, and can adaptively shift the decision boundary toward difficult-to- been proposed to overcome this limitation; some representative work includes Adaptive Synthetic learn minority and majority class instances by using a hypothesis assessment procedure. Sampling (ADASYN) [23], ProWSyn [24] and R-SMOTE [25]. RUSBoost [33] is a modification to AdaBoost.M1 for solving between class imbalance problems by un- Data cleaning techniques, such as Tomek links [26], have been effectively applied to remove the dersampling from the majority class. It is documented to perform better[33] than SMOTEBoost[29] overlapping that is introduced from sampling methods. If two instances form a Tomek link then either that solves class imbalance by oversampling minority class. RUSBoost algorithm performs random one of these instances is noise or both are near a border. Therefore, one can use Tomek links to undersampling from the majority class at every AdaBoost iteration to match the population size of the “cleanup” unwanted overlapping between classes after synthetic sampling where all Tomek links are minority class, prescribed by the data sample distribution computed based on misclassification error removed until all minimally distanced nearest neighbor pairs are of the same class. By removing and exponential loss estimates. overlapping examples, one can establish well-defined class clusters in the training set, which can, in turn, lead to well-defined classification rules for improved classification performance. Some representative Cost-Free Learning (CFL) [34] is a type of learning that does not require the cost terms associated with work in this area includes the OSS method [17], the condensed nearest neighbor rule and Tomek Links the misclassifications and/or rejects as the inputs. The goal of this type of learning is to get optimal CNN+Tomek Links integration method [27], the neighborhood cleaning rule (NCL) [28] based on the classification results without using any cost information. Based on information theory, CFL seeks to edited nearest neighbor (ENN) rule—which removes examples that differ from two of its three nearest maximize normalized mutual information of the targets and the decision outputs of classifiers, which neighbors, and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links could be binary/multi-class classifications with/without abstaining. Another advantage of the method is (SMOTE+Tomek) [28]. its ability to derive optimal rejection thresholds for abstaining classifications and the “equivalent” costs in binary classifications, which can be used as a reference for cost adjustments in costsensitive Cluster-based sampling algorithms are particularly interesting because they provide an added ele- learning. ment of flexibility that is not available in most simple and synthetic sampling algorithms, and accordingly can be tailored to target very specific problems. Cluster-based oversampling (CBO) algorithm An adaptive sampling with optimal cost [35] for class imbalance learning is proposed to adaptively [5] is proposed to effectively deal with the within-class imbalance problem in tandem with the be- oversample the minority positive examples and under sample the majority negative examples, forming tween-class imbalance problem. The CBO algorithm makes use of the K-means clustering technique. different sub-classifiers by different subsets of training data with the best cost ratio adaptively chosen, Empirical results of CBO are very suggestive into the nature of the imbalanced learning problem; and combining these sub-classifiers according to their accuracy to create a strong classifier. The sample namely, that targeting within-class imbalance in tandem with the between-class imbalance is an weights are computed based on the prediction probability of every sample, by a pair of induced SVM effective strategy for imbalanced datasets. classifiers built on two equal sized partitions of the training instances. The integration of sampling strategies with ensemble learning techniques has also been studied in the Weighted Extreme Learning Machines (ELM) [36, 37] is proposed as a generalized cost-sensitive community. For instance, the SMOTEBoost [29] algorithm is based on the idea of integrating SMOTE learning method to deal with imbalanced data distributions, where weights are assigned to every with Adaboost.M2. Specifically, SMOTEBoost introduces synthetic sampling at each boosting iteration. training instance based on user’s needs. Although per sample weights are possible, the authors pro- In this way, each successive classifier ensemble focuses more on the minority class. Since each classifier posed to use class proportion as the common weight of every sample from a class. They also proposed an ensemble is built on a different sampling of data, the final voted classifier is expected to have a alternate weighting scheme that uses golden ratio[9] in computing the common weights for the broadened and well-defined decision region for the minority class. Another integrated approach, the majority classes. An adaptive semi-unsupervised weighted oversampling (A-SUWO) method [38] is DataBoost-IM [30] method, combines the data generation techniques introduced in proposed for imbalanced datasets, which clusters the minority instances using a semi-unsupervised Class Imbalance Learning Class Imbalance Learning [31] with AdaBoost.M1 to achieve high predictive accuracy for the minority class without sacrificing hierarchical clustering approach and adaptively determines the size to oversample each sub-cluster accuracy on the majority class. Briefly, DataBoost-IM generates synthetic samples according to the using its classification complexity and cross validation. The minority instances are weighted based on ratio of difficult-to-learn samples between classes. RAMOBoost [32] adaptively ranks minority class their Euclidean distance to the majority class based on which they are oversampled. instances at each learning iteration according to a sampling probability distribution that is based on An inverse random undersampling [39] method is proposed for class imbalance learning, where Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 12 13 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) several distinct training sets are constructed by severely undersampling the majority class to sizes use of more relevant data samples in each hypothesis, providing for a more robust form of classification. smaller than the minority class, to bias the learned decision boundaries towards the minority class. Another cost-sensitive boosting algorithm that follows a similar methodology is AdaCost [46]. AdaCost, like AdaC1, introduces cost sensitivity inside the exponent of the weights-update formula of Cost-Sensitive-methods While sampling methods attempt to balance distributions by considering the Adaboost. However, instead of applying the cost items directly, AdaCost uses a cost-adjustment representative proportions of class examples in the distribution, cost-sensitive learning methods function that aggressively increases the weights of costly misclassifications and conservatively de- consider the costs associated with misclassifying examples [40]. Instead of creating balanced data creases the weights of high-cost examples that are correctly classified. distributions through different sampling strategies, cost-sensitive learning targets the imbalanced learning problem by using different cost matrices that describe the costs for misclassifying any par- Though these cost-sensitive algorithms can significantly improve classification performance, they take ticular data example. Recent research indicates that there is a strong connection between costsen- for granted the availability of a cost matrix and its associated cost items. In many situations, an explicit sitive learning and learning from imbalanced data; therefore, the theoretical foundations and algo- description of misclassification costs is unknown, i.e., only an informal assertion is known, such as rithms of cost-sensitive methods can be naturally applied to imbalanced learning problems [41, 15]. misclassifications on the positive class are more expensive than the negative class [47]. With respect to cost-sensitive decision trees, cost-sensitive fitting can take three forms: first, cost-sensitive adjustments There are many different ways of implementing cost-sensitive learning, but, in general, the majority of can be applied to the decision threshold; second, cost-sensitive considerations can be given to the split techniques fall under three categories. The first class of techniques apply misclassification costs to the criteria at each node; and lastly, cost-sensitive pruning schemes can be applied to the tree. A decision dataset as a form of data space weighting; these techniques are essentially cost-sensitive bootstrap tree threshold moving scheme for imbalanced data with unknown mis-classification costs was sampling approaches where misclassification costs are used to select the best training distribution observed in [48]. When considering cost sensitivity in the split criterion, the task at hand is to fit an for induction. The second class applies cost-minimizing techniques to the combination schemes of impurity function that is insensitive to unequal costs. For instance, traditionally, accuracy is used as the ensemble methods; this class consists of various Meta techniques where standard learning algorithms impurity function for decision trees, which chooses the split with minimal error at each node. However, are integrated with ensemble methods to develop cost-sensitive classifiers. Both of these classes have this metric is sensitive to changes in sample distributions, and thus, inherently sensitive to unequal rich theoretical foundations that justify their approaches, with cost-sensitive data space weighting costs. In [49], three specific impurity functions, Gini, Entropy, and DKM, were shown to have improved methods building on the translation theorem [42], and cost-sensitive Meta techniques building on the cost insensitivity compared with the accuracy/error rate baseline. Metacost framework [43]. In fact, many of the existing research works often integrate the Metacost framework with data space weighting and adaptive boosting to achieve stronger classification results. Cost sensitivity can be introduced to neural networks in four ways [50]: first, cost-sensitive modifica- To this end, we consider both of these classes of algorithms as one in the following section. The last tions can be applied to the probabilistic estimate; second, the neural network outputs can be made class of techniques incorporates cost-sensitive functions or features directly into classification costsensitive; third, cost-sensitive modifications can be applied to the learning rate η and fourth, the paradigms to essentially “fit” the cost-sensitive framework into these classifiers. Because many of these error minimization function can be adapted to account for expected costs. A ROC based method for techniques are specific to a particular paradigm, there is no unifying framework for this class of cost- determining the cost is proposed in [51], which allows to select the most efficient cost factor for a given sensitive learning, but in many cases, solutions that work for one paradigm can often be abstracted to dataset. A neural network based cost-sensitive is proposed in [36] uses Weighted Extreme Learning work for others. Machines to solve class imbalance problems. The authors have also showed that assigning different weights for each training sample, WELM could be generalized to a cost-sensitive learning method. Motivated by the pioneering work of the AdaBoost algorithms [44], several cost-sensitive boosting methods for imbalanced learning have been proposed. Three cost-sensitive boosting methods, AdaC1, Kernel-methods SMOTE with Different Costs (SDCs) method [52] and the ensembles of over/under- Class Imbalance Learning AdaC2, and AdaC3 were proposed in [45], which introduces cost items into the weight up- dating sampled SVMs [53], [54], [55], [56] combine Kernel methods and Sampling together to solve the Class Imbalance Learning strategy of AdaBoost. In essence, these algorithms increase the probability of sampling a costly imbalance problem. SDC algorithm uses different error costs [52] for different classes to bias the SVM example at each iteration, giving the classifier more instances of costly examples for a more targeted in order to shift the decision boundary away from positive instances and make positive instances more approach of induction. In general, it was observed that the inclusion of cost factors into the weighting densely distributed in an attempt to guarantee a more well-defined boundary. Meanwhile, scheme of Adaboost imposes a bias towards the minority concepts and also increases the Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 14 15 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) the methods proposed in [54], [55] develop ensemble systems by modifying the data distributions of labeled samples to neighbors through their edges in order to get the predicted labels of unlabeled without modifying the underlying SVM classifier. samples. Most popular semi-supervised learning approaches are sensitive to initial label distribution happened in imbalanced labeled datasets. The class boundary will be severely skewed by the majority The Granular Support Vector Machines—Repetitive Undersampling algorithm (GSVM-RU) was pro- classes in an imbalanced classification. In [63], the authors propose a simple and effective approach to posed in [57] to integrate SVM learning with undersampling methods. The major characteristics of alleviate the unfavorable influence of imbalance problem by iteratively selecting a few unlabeled GSVMs are two-fold. First, GSVMs can effectively analyze the inherent data distribution by observing the samples and adding them into the minority classes to form a balanced labeled dataset for the learning trade-offs between the local significance of a subset of data and its global correlation. Second, GSVMs methods afterwards. The experiments on UCI datasets [64] and MNIST handwritten digits dataset [65] improve the computational efficiency of SVMs through use of parallel computation. In the context of showed that the proposed approach outperforms other existing state of the art methods. imbalanced learning, the GSVMRU method takes advantage of the GSVM by using an iterative learning procedure that uses the SVM itself for undersampling. A SSL method in [66] uses a tranductive learning approach to build upon a graph-based phase field model [67] that handles imbalanced class distributions. This method is able to encourage or penalize A kernel-boundary-alignment algorithm is proposed in [58], which considers training-data imbal- the memberships of data to different classes according to an explicit a priori model that avoids biased ance as prior information to augment SVMs to improve class-prediction accuracy. Using a simple ex- classifications. Experiments, conducted on real-world benchmarks, support the better performance of ample, we first show that SVMs can suffer from high incidences of false negatives when the training the model compared to several state of the art semi-supervised learning algorithms. instances of the target class are heavily outnumbered by the training instances of a non-target class. The remedy the authors propose is to adjust the class boundary by modifying the kernel matrix, according The problem of predicting splice sites in a genome using semi-supervised learning approach [68] is a to the imbalanced data distribution. challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of nonsplice sites. To address this challenge, the authors propose to use One example of kernel modification is the kernel classifier construction algorithm proposed in [59] ensembles of semi-supervised classifiers, specifically self-training and cotraining classifiers. The based on orthogonal forward selection (OFS) and the regularized orthogonal weighted least squares experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, (ROWLSs) estimator. This algorithm optimizes generalization in the kernel-based learning model by showed that the ensemble-based semi-supervised approaches represent a good choice, even when introducing two major components that deal with imbalanced data distributions for two-class the amount of labeled data consists of less than 1% of all training data. In particular, it was found that datasets. The first component integrates the concepts of leave-one-out (LOO) cross validation and the ensembles of co-training and self-training classifiers that dynamically balance the set of labeled area under curve (AUC) evaluation metric to develop an LOOAUC objective function as a selection instances during the semi-supervised iterations show improvements over the corresponding supervised mechanism of the most optimal kernel model. The second component takes advantage of the cost ensemble baselines. sensitivity of the parameter estimation cost function in the ROWLS algorithm to assign greater weights to erroneous data examples in the minority class than those in the majority class. A semi-supervised learning task from both labeled and unlabeled instances and in particular, self- training with decision tree learners as base learners is proposed in [69]. The authors show that SSL-methods A semi-supervised classification method for prognosis of ACLF was proposed in 2015 standard decision tree algorithm as the base learner cannot be effective in a self-training algorithm to [60], where the authors constructed an imbalanced prediction model based on small sphere and large semi-supervised learning. The main reason is that the basic decision tree learner does not produce margin approach (SSLM) [61], which classifies two classes (improved patients, death patients) of reliable probability estimation to its predictions. Therefore, it cannot be a proper selection criterion in samples by maximizing their margin. SSLM was shown to perform better than OneClass SVM and self-training. They considered the effect of several modifications to the basic decision tree learner that Class Imbalance Learning Support Vector Data Description (SVDD) methods. The authors also experimented with semisuper- produce better probability estimation than using the distributions at the leaves of the tree. They show Class Imbalance Learning vised Twin SVM [62] by adding unlabeled patients into the dataset. that these modifications do not produce better performance when used only the labeled data, but they do benefit more from the unlabeled data in self-training. The modifications that they considered are Transductive graph-based semi-supervised learning methods usually build an undirected graph utiliz- Naive Bayes Tree, a combination of No-pruning and Laplace correction, grafting, and ing both labeled and unlabeled samples as vertices. Those methods propagate the label information Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 16 17 Sudarsun Santhiappan and Dr. Ravindran Balaraman (I.I.T, Madras) using a distance-based measure. Then they extended this improvement to algorithms for ensembles of decision trees and the authors show that the ensemble learner gives an extra improvement over the Understanding the Fundamental Problems adapted decision tree learners. Majority of the imbalanced learning research works focus on improving the performance of specific algorithms paired with specific datasets, with only a very limited theoretical understanding on the In [70], the authors describe the stochastic semi-supervised learning approach that was used in their principles of the problem space and the consequences of various assumptions made. For instances, submission to all six tasks in 20092010 Active Learning Challenge. The method was designed to several algorithms for imbalance learning published in the literature claim to have taken the perfor- tackle the binary classification problem under the condition that the number of labeled data points is mance metric better by a margin as compared to the previous solutions on specific case by case extremely small and the two classes are highly imbalanced. It starts with only one positive seed given by basis in terms of datasets attempted. But there exists situations where learning from the original the contest organizer. They randomly picked additional unlabeled data points and treated them as datasets may provide better performance. This leads us to an important question: to what extent do “negative” seeds based on the fact that the positive label is rare across all datasets. A classifier was imbalanced learning methods help with learning? This question could be further refined as: trained using the “labeled” data points and then was used to predict the unlabeled dataset. They took the final result to be the average of “n” stochastic iterations. Supervised learning was used as a large number 1. When a method outperformed other methods, what are the underlying effects that led to the better of labels were purchased. Their approach was shown to work well in 5 out of 6 datasets, which ranked performance? them 3rd in the contest. 2. Does the solution provide clarity on the fundamental understanding of the problem at hand? 3. Can the solution scale to various other types of data? A framework to address the imbalanced data problem using semi-supervised learning is proposed in [71]. Specifically, from a supervised problem, they created a semi-supervised problem and then use a We believe that these fundamental questions should be studied with greater interest both theoreti- cally semi-supervised learning method to identify the most relevant instances to establish a well-defined and empirically in order to thoroughly understand the essence of imbalanced learning problems and training set. They presented extensive experimental results, which demonstrate that the proposed solutions. Furthermore, we should also find answers to the following specific questions that would framework significantly outperforms all other sampling algorithms in 67% of the cases across three allow us to gauge the solutions better: different classifiers and ranks second best for the remaining 33% of the cases. – What are the assumptions to make on imbalanced learning algorithms to work better compared to A combined co-training and random subspace generation technique is proposed in [72] for sen- learning from the original distributions? timent classification problems. There are two main advantages of this dynamic strategy over the – To what degree should one artificially balance[73, 74] the original dataset by adjusting sample static strategy in generating random subspaces. First, the dynamic strategy makes the involved distributions? subspace classifiers quite different from each other even when the training data becomes similar after – How do imbalanced data distributions affect the computational complexity of learning algorithms? some iterations. Second, considering that the most helpful features (e.g., sentimental words) for – What is the general error bound, given an imbalanced data distribution? sentiment classification usually account for a small portion, it is possible that one random subspace – Is there a general theoretical methodology that can alleviate the impediment of learning from im- might contain few useful features. When this happens in the static strategy, the corresponding sub- balanced datasets for specific algorithms and application domains? space classifier will perform badly in selecting correct samples from the unlabeled data. This makes semi-supervised learning fail. In comparison, the dynamic strategy can avoid this phenomenon. Estabrooks et al. [73] suggested that a combination of different expressions of resampling methods may be an effective solution to the tuning problem. Weiss and Provost [74] have analyzed, for a fixed Challenges in CIL training set size, the relationship between the class distribution of training data (expressed as the Class Imbalance Learning Class Imbalance Learning The availability of humongous supply of raw data in many of today’s real-world applications opens up percentage of minority class examples) and classifier performance in terms of accuracy and AUC. several opportunities of learning from imbalanced data to play an important role across different Based on a thorough analysis of 26 datasets, it was suggested that if accuracy is selected as the application domains. However, there are several new challenges [9] at the same time. Here, we briefly performance criterion, the best class distribution tends to be near the naturally occurring class present several aspects for the future research directions in this critical research domain. distribution. However, if the AUC is selected as the assessment metric, then the best class distribu- Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 18 19 Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras tion tends to be near the balanced class distribution. Based on these observations, a “budget-sen- each technique provides its own set of answers to different fundamental questions. But also, because an sitive ”progressive sampling strategy was proposed to efficiently sample the minority and majority analysis in the evaluation space of one technique can be correlated to the evaluation space of another. class examples such that the resulting training class distribution can provide the best performance. This standard would lead to increased transitivity and a broader understanding of the functional abilities of existing and future works. Uniform Benchmark Platform Class Imbalance learning researchers typically use standard multi-class datasets in one-versus-rest SSL from Imbalanced Data configuration to emulate binary class imbalance problems to report their results. Although there are The key idea of semi-supervised learning is to exploit the unlabeled examples by using the labeled currently many publicly available benchmarks for assessing the performance of classification algo- examples to modify, refine, or reprioritize the hypothesis obtained from the labeled data alone [75]. rithms, such as the UCI Repository [64] and the LIBSVM datasets[10], there are a very limited number Some pertinent questions include: of benchmarks that are solely dedicated to imbalanced learning problems. None of the existing classification data repositories identify or mention imbalance ratio as a dataset characteristics. This 1. How can we identify whether an unlabeled data example came from a balanced or imbalanced limitation can severely affect the long-term development of research in class imbalance learning in the underlying distribution? following pointers: 2. Given an imbalanced training data with labels, what are the effective and efficient methods for recovering the unlabeled data examples? 1. Lack of a uniform benchmark for standardized performance assessments. 3. What kind of biases may be introduced in the recovery process given imbalanced labeled data? 2. Lack of data sharing and data interoperability across different disciplinary domains. 3. Increased procurement costs. Standardized Evaluation Accuracy is a measure of trueness, given by (TP+TN)/N (computed using confusion matrix11). Con- sider a classifier that returns the label of the majority class for any input. When 1001 random test points (with 1:1000 imbalance ratio) are tried, the estimated accuracy is (1000+0)/1001=99%, which in turn provides a wrong interpretation that the classifier performance is high. Precision is the measure of exactness, given by TP/(TP+FP) and recall is a measure of completeness, given by TP/(TP+FN). It is apparent from the formulas that precision is sensitive to data distributions, while recall is not. An assertion based solely on recall is ambiguous, since recall provides no insight to how many examples are incorrectly labeled as positive (minority). Similarly, precision cannot assert how many positive examples are labeled incorrectly. F-score combines precision and recall effectively to evaluate classification performance in imbalanced learning scenarios. The traditional use of a point based evaluation metric such as ac-curacy, precision, and recall are not sufficient, while handling class imbalance learning problems as they show sensitivity to data Class Imbalance Learning distribution. It becomes very difficult to provide any concrete relative evaluations between different Class Imbalance Learning algorithms over varying data distributions without an accompanied curve based analysis. Therefore, it is necessary for the community to establish-as a standard-the practice of using the curve-based evaluation techniques such as ROC, Precision- Recall curve, and Cost curve. This is not only because Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 20 21 Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras Summary References [1] M. Kubat, R. C. Holte, and S. Matwin, “Machine Learning for the Detection of Oil Spills in Satellite Radar Im- We have motivated the need for Class Imbalance Learning, a special usecase of machine learning, for its ages.,” Machine Learning, vol. 30, no. 2-3, pp. 195–215, 1998. wide applicability in real world classification tasks. We did so by introducing the fundamentals of class imbalance learning and the battery of solutions available in the literature to combat it. We have also [2] G. M. Weiss,“Mining with Rare Cases.,”in The Data Mining and Knowledge Discovery Handbook (O. Maimon and L. Rokach, eds.), pp. 765–776, Springer, 2005. presented some real world problem scenarios with imbalance characteristics. We concluded by enlisting the available opportunities and challenges in the field of class imbalance learning research. [3] G. M. Weiss,“Mining with rarity: a unifying framework.,”SIGKDD Explorations, vol. 6, no. 1, pp. 7–19, 2004. [4] R. Prati, G. Batista, and M. Monard, “Class imbalances versus class overlapping: an analysis of a learning sys- Acknowledgements tem behavior,” MICAI 2004: Advances in Artificial Intelligence, pp. 312–321, 2004. This research work was partly supported by a funding grant from IIT Madras under project [5] T. Jo and N. Japkowicz, “Class imbalances versus small disjuncts.,” SIGKDD Explorations, vol. 6, no. 1, pp. 40– CSE/1415/831/RFTP/BRAV. 49, 2004. [6] R. C. Prati, G. E. A. P. A. Batista, and M. C. Monard, “Learning with Class Skews and Small Disjuncts.,” in SBIA (A. L. C. Bazzan and S. Labidi, eds.), vol. 3171 of Lecture Notes in Computer Science, pp. 296–306, Springer, 2004. [7] J. R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986. [8] W.-H. Yang, D.-Q. Dai, and H. Yan, “Feature Extraction and Uncorrelated Discriminant Analysis for HighDi- mensional Data.,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 5, pp. 601–614, 2008. [9] H. He and E. Garcia,“Learning from Imbalanced Data,”Knowledge and Data Engineering, IEEE Transactions on, vol. 21, pp. 1263–1284, Sept 2009. [10] H. He and Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, 1st ed., 2013. [11] P. Branco, L. Torgo, and R. P. Ribeiro, “A Survey of Predictive Modeling on Imbalanced Domains.,” ACM Com- put. Surv., vol. 49, no. 2, pp. 31:1–31:50, 2016. [12] D. Mease, A. Wyner, and a. Buja,“Boosted classification trees and class probability/quantile estimation,”The Journal of Machine Learning Research, vol. 8, pp. 409–439, 2007. [13] C. Drummond and R. Holte, “C4.5, class imbalance, and cost sensitivity: why under-sampling beats oversampling,” Workshop on Learning from Imbalanced Datasets II, pp. 1–8, 2003. [14] R. C. Holte, L. Acker, and B. W. Porter, “Concept Learning and the Problem of Small Disjuncts.,” in IJCAI (N. S. Sridharan, ed.), pp. 813–818, Morgan Kaufmann, 1989. [15] X.-Y. Liu and Z.-H. Zhou,“The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study.,” in ICDM, pp. 970–974, IEEE Computer Society, 2006. Class Imbalance Learning [16] J. Zhang and I. Mani,“KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction,” in Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Datasets, 2003. Class Imbalance Learning [17] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training Sets: One-Sided Selection,” in In Pro- ceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186, Morgan Kaufmann, 1997. Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 22 23 Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras [32] S. Chen, H. He, and E. A. Garcia, “RAMOBoost: Ranked Minority Oversampling in Boosting.,” IEEE Trans [18] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer,“SMOTE: Synthetic Minority Over-sampling Technique,” Neural Netw, vol. 21, pp. 1624–42, Oct. 2010. Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. [33] C. Seiffert, T. M. Khoshgoftaar, J. V. Hulse, and A. Napolitano,“RUSBoost: A Hybrid Approach to Alleviating [19] B. X. Wang and N. Japkowicz, “Imbalanced Data Set Learning with Synthetic Samples,” 2004. Class Imbalance.,” IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 40, no. 1, pp. 185–197, 2010. [20] H. Han, W. Wang, and B. Mao,“Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets [34] X. Zhang and B.-G. Hu, “A New Strategy of Cost-Free Learning in the Class Imbalance Problem.,” IEEE Trans. Learning.,” in ICIC (1) (D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds.), vol. 3644 of Lecture Notes in Computer Knowl. Data Eng., vol. 26, no. 12, pp. 2872–2885, 2014. Science, pp. 878–887, Springer, 2005. [35] Y. Peng, “Adaptive Sampling with Optimal Cost for Class-Imbalance Learning.,” in AAAI (B. Bonet and S. [21] S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE-Majority Weighted Minority Oversampling Tech- nique Koenig, eds.), pp. 2921–2927, AAAI Press, 2015. for Imbalanced Data Set Learning.,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 405–425, 2014. [36] W. Zong, G.-B. Huang, and Y. Chen, “Weighted extreme learning machine for imbalance learning.,” Neuro- [22] A. Agrawal, H. L. Viktor, and E. Paquet, “SCUT: Multi-Class Imbalanced Data Classification using SMOTE and computing, vol. 101, pp. 229–242, 2013. Cluster-based Undersampling.,” in KDIR (A. L. N. Fred, J. L. G. Dietz, D. Aveiro, K. Liu, and J. Filipe, eds.), pp. 226–234, SciTePress, 2015. [37] X. Gao, Z. Chen, S. Tang, Y. Zhang, and J. Li, “Adaptive weighted imbalance learning with application to abnormal activity recognition.,” Neurocomputing, vol. 173, pp. 1927–1935, 2016. [23] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning.,” in IJCNN, pp. 1322–1328, IEEE, 2008. [38] I. Nekooeimehr and S. K. Lai-Yuen, “Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets.,” Expert Syst. Appl., vol. 46, pp. 405–416, 2016. [24] S. Barua, M. M. Islam, and K. Murase, “ProWSyn: Proximity Weighted Synthetic Oversampling Technique for Imbalanced Data Set Learning.,”in PAKDD (2) (J. Pei, V. S. Tseng, L. Cao, H. Motoda, and G. Xu, eds.), vol. 7819 of [39] M. A. Tahir, J. Kittler, and F. Yan, “Inverse random under sampling for class imbalance problem and its appli- Lecture Notes in Computer Science, pp. 317–328, Springer, 2013. cation to multi-label classification.,” Pattern Recognition, vol. 45, no. 10, pp. 3738–3750, 2012. [25] Y. Dong and X. Wang, “A New Over-Sampling Approach: Random-SMOTE for Learning from Imbalanced [40] C. Elkan, “The Foundations of Cost-Sensitive Learning,” in IJCAI, pp. 973–978, 2001. Data Sets.,” in KSEM (H. Xiong and W. B. Lee, eds.), vol. 7091 of Lecture Notes in Computer Science, pp. 343–352, Springer, 2011. [41] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: special issue on learning from imbalanced data sets.,” SIGKDD Explorations, vol. 6, no. 1, pp. 1–6, 2004. [26] I. Tomek, “Two Modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 7(2), pp. 679– 772, 1976. [42] B. Zadrozny, J. Langford, and N. Abe, “Cost-Sensitive Learning by Cost-Proportionate Example Weight- ing.,” in ICDM, pp. 435–, IEEE Computer Society, 2003. [27] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A Study of the Behavior of Several Methods for Bal- ancing Machine Learning Training Data,”ACM SIGKDD Explorations Newsletter – Special issue on learning from [43] P. M. Domingos,“MetaCost: A General Method for Making Classifiers Cost-Sensitive.,”in KDD (U. M. Fayyad, imbalanced datasets, vol. 6, no. 1, pp. 20–29, 2004. S. Chaudhuri, and D. Madigan, eds.), pp. 155–164, ACM, 1999. [28] J. Laurikkala, “Improving Identification of Difficult Small Classes by Balancing Class Distribution.,” in AIME (S. [44] Y. Freund and R. E. Schapire,“Experiments with a New Boosting Algorithm,”in International Conference on Quaglini, P. Barahona, and S. Andreassen, eds.), vol. 2101 of Lecture Notes in Computer Science, pp. 63–66, Springer, Machine Learning, pp. 148–156, 1996. 2001. [45] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. W. 0007,“Cost-sensitive boosting for classification of imbalanced [29] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “SMOTEBoost: Improving Prediction of the Minority Class data.,” Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007. in Boosting.,” in PKDD (N. Lavrac, D. Gamberger, H. Blockeel, and L. Todorovski, eds.), vol. 2838 of Lecture Notes in Computer Science, pp. 107–119, Springer, 2003. [46] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan,“AdaCost: Misclassification Cost-Sensitive Boosting.,”in ICML (I. Bratko and S. Dzeroski, eds.), pp. 97–105, Morgan Kaufmann, 1999. [30] H. Guo and H. L. Viktor, “Learning from imbalanced data sets with boosting and data generation: the Data- Boost-IM approach.,” SIGKDD Explorations, vol. 6, no. 1, pp. 30–39, 2004. [47] M. a. Maloof,“Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown,”Anal- Class Imbalance Learning Class Imbalance Learning ysis, vol. 21, pp. 1263–1284, 2003. [31] H. Guo and H. L. Viktor, “Boosting with Data Generation: Improving the Classification of Hard to Learn Ex- amples.,” in IEA/AIE (R. Orchard, C. Yang, and M. Ali, eds.), vol. 3029 of Lecture Notes in Computer Science, pp. 1082– [48] K. M. Ting,“An Instance-Weighting Method to Induce Cost-Sensitive Trees.,”IEEE Trans. Knowl. Data Eng., vol. 1091, Springer, 2004. 14, no. 3, pp. 659–665, 2002. Sudarsun Santhiappan and Balaraman Ravindran 24 25 Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras [49] C. Drummond and R. C. Holte, “Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria.,” in ICML (P. Langley, ed.), pp. 239–246, Morgan Kaufmann, 2000. [64] M. Lichman, “UCI Machine Learning Repository,” 2013. [50] M. Kukar and I. Kononenko, “Cost-Sensitive Learning with Neural Networks.,” in ECAI, pp. 445–449, 1998. [65] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [51] B. Krawczyk and M. Wozniak,“Cost-Sensitive Neural Network with ROC-Based Moving Threshold for Imbal- [66] A. E. Ghoul and H. Sahbi, “Semi-supervised learning using a graph-based phase field model for imbalanced data anced Classification.,” in IDEAL (K. Jackowski, R. Burduk, K. Walkowiak, M. Wozniak, and H. Yin, eds.), vol. 9375 of set classification.,” in ICASSP, pp. 2942–2946, IEEE, 2014. Lecture Notes in Computer Science, pp. 45–52, Springer, 2015. [67] M. Rochery, I. Jermyn, and J. Zerubia, “Phase Field Models and Higher-Order Active Contours.,” in ICCV, pp. 970– [52] R. Akbani, S. Kwek, and N. Japkowicz, “Applying Support Vector Machines to Imbalanced Datasets.,” in 976, IEEE Computer Society, 2005. ECML (J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, eds.), vol. 3201 of Lecture Notes in Computer Science, pp. 39–50, Springer, 2004. [68] A. Stanescu and D. Caragea, “Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets.,” in BIBM (H. J. Zheng, W. Dubitzky, X. Hu, J.-K. Hao, D. P. Berrar, K.-H. Cho, Y. Wang, and D. R. Gil- bert, eds.), [53] F. Vilarinõ, P. Spyridonos, J. Vitrià, and P. Radeva,“Experiments with SVM and Stratified Sampling with an pp. 432–437, IEEE Computer Society, 2014. Imbalanced Problem: Detection of Intestinal Contractions.,” in ICAPR (2) (S. Singh, M. Singh, C. Apté, and P. Perner, eds.), vol. 3687 of Lecture Notes in Computer Science, pp. 783–791, Springer, 2005. [69] J. Tanha, M. van Someren, and H. Afsarmanesh, “Semi-supervised self-training for decision tree classifiers,” International Journal of Machine Learning and Cybernetics, 2015. [54] P. Kang and S. Cho, “EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems.,” in ICONIP (1) (I. King, J. Wang, L. Chan, and D. L. Wang, eds.), vol. 4232 of Lecture Notes in Computer Science, pp. 837– [70] J. Xie and T. Xiong, “Stochastic Semi-supervised Learning.,” in Active Learning and Experimental Design @ 846, Springer, 2006. AISTATS (I. Guyon, G. C. Cawley, G. Dror, V. Lemaire, and A. R. Statnikov, eds.), vol. 16 of JMLR Proceedings, pp. 85– 98, JMLR.org, 2011. [55] Y. Liu, A. An, and X. Huang,“Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles.,” in PAKDD (W. K. Ng, M. Kitsuregawa, J. Li, and K. Chang, eds.), vol. 3918 of Lecture Notes in Computer Science, pp. 107– [71] B. A. Almogahed and I. A. Kakadiaris, “Empowering Imbalanced Data in Supervised Learning: A Semisuper- 118, Springer, 2006. vised Learning Approach.,” in ICANN (S. Wermter, C. Weber, W. Duch, T. Honkela, P. D. KoprinkovaHristova, S. Magg, G. Palm, and A. E. P. Villa, eds.), vol. 8681 of Lecture Notes in Computer Science, pp. 523–530, Springer, 2014. [56] B. X. Wang and N. Japkowicz, “Boosting support vector machines for imbalanced data sets.,” Knowl. Inf. Syst., vol. 25, no. 1, pp. 1–20, 2010. [72] S. Li, Z. Wang, G. Zhou, and S. Y. M. Lee, “Semi-Supervised Learning for Imbalanced Sentiment Classifica- tion.,” in IJCAI (T. Walsh, ed.), pp. 1826–1831, IJCAI/AAAI, 2011. [57] Y. Tang and Y.-Q. Zhang, “Granular SVM with Repetitive Undersampling for Highly Imbalanced Protein Ho- mology Prediction.,” in GrC, pp. 457–460, IEEE, 2006. [73] A. Estabrooks, T. Jo, and N. Japkowicz,“A Multiple Resampling Method for Learning from Imbalanced Data Sets.,” Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004. [58] G. Wu and E. Y. Chang,“Aligning Boundary in Kernel Space for Learning Imbalanced Dataset,”in Proceed- ings of the Fourth IEEE International Conference on Data Mining, ICDM ’04, (Washington, DC, USA), pp. 265– 272, IEEE [74] F. J. Provost and G. M. Weiss,“Learning When Training Data are Costly: The Effect of Class Distribution on Tree Computer Society, 2004. Induction,” CoRR, vol. abs/1106.4557, pp. 315–354, 2011. [59] X. Hong, S. C. 0001, and C. J. Harris,“A Kernel-Based Two-Class Classifier for Imbalanced Data Sets.,”IEEE [75] X. Zhu, “Semi–Supervised Learning in Literature Survey,” Tech. Rep. 1530, Computer Sciences, University of Transactions on Neural Networks, vol. 18, no. 1, pp. 28–41, 2007. Wisconsin-Madison, 2005. [60] Y. Xu, Y. Zhang, Z. Yang, X. Pan, and G. Li, “Imbalanced and semi-supervised classification for prognosis of ACLF.,” Journal of Intelligent and Fuzzy Systems, vol. 28, no. 2, pp. 737–745, 2015. [61] M. Wu and J. Ye, “A Small Sphere and Large Margin Approach for Novelty Detection Using Training Data with Outliers.,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 2088–2092, 2009. Class Imbalance Learning [62] Jayadeva, R. Khemchandani, and S. Chandra, “Twin Support Vector Machines for Pattern Classification.,” IEEE Class Imbalance Learning Trans. Pattern Anal. Mach. Intell., vol. 29, no. 5, pp. 905–910, 2007. [63] F. Li, C. Yu, N. Yang, F. Xia, G. Li, and F. Kaveh-Yazdy, “Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data,” The Scientific World Journal, Volume 2013, Article ID 875450, 2013, Dec. 2013. Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras 26 27 Sudarsun Santhiappan and Dr. Ravindran Balaraman I.I.T, Madras

BUDDI Health - Class Imbalance Based Deep Learning Platform

Comments

Description