# A Guide to Choosing an Algorithm: For Beginners in Data Science

Are you new to the field of data science? Are you confused about how to choose a machine learning algorithm for a typical business problem?

Well, data science provides a wide variety of solutions to typical business problems.

Let me generalize it for you.

When a data scientist, especially a fresher, comes into the industry, he or she is often confused about which algorithm to choose for which kind of solution. Let me brief you on the different types of business problems that occur and how to choose an algorithm based on the problem statement.

Let’s see how to choose an algorithm for a typical problem. There are several kinds of business problems in the world, and we choose an algorithm based on the type of output.

If the problem asks how much or how many? For example, what salary should a new candidate be offered, what is the price of a house, or what is the premium amount for an insurance policy? then for these kinds of problems we go with **regression**.

Now, let’s look at what regression is. Regression estimates the strength of the relationship between a dependent variable, say Y, and one or more changing variables, say X. It is a technique used to model and analyse how variables relate to each other and how they contribute to a particular outcome. A regression problem is chosen when the output variable is real or continuous, like a salary or a weight. Many different models can be used.

The simplest is linear regression, which tries to fit the data with the best hyperplane through the points. There are different types of regression.

The first and most basic one is **linear regression**. Going further, there are **logistic regression**, **polynomial regression**, and **stepwise regression**; **ridge** and **lasso** regression are further variants.
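To make the regression idea concrete, here is a minimal sketch of simple linear regression using the closed-form least-squares solution. The salary figures are invented illustration data, not from the article.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for one-variable least squares.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Years of experience vs. salary in thousands -- toy numbers.
experience = [1, 2, 3, 4, 5]
salary = [40, 50, 60, 70, 80]

slope, intercept = fit_line(experience, salary)
predicted = slope * 6 + intercept  # estimate for a 6-year candidate
```

Because the output (a salary) is a continuous number, this is exactly the "how much" kind of problem regression is meant for.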

The next type of problem: if you want to see which **category** a new observation belongs to, then we go with a problem type called classification.

As the name suggests, **classification** is the task of sorting things into subcategories, but done by the machine. If that doesn’t sound like much, imagine your computer being able to differentiate between you and a stranger operating your computer. Sounds interesting, right? In machine learning and statistics, classification is the problem of identifying which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.

Now, classification is the process of finding or discovering a model or function which helps separate the data into multiple categories. It is similar to grades in examinations: if the marks are above 35 it’s a pass, below 35 it’s a fail, so we make the threshold 35.
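The grading example above is already a tiny classifier: a single threshold rule that maps a number to a category. Here is that rule as code (treating exactly 35 as a pass is my assumption; the article only specifies above and below).

```python
THRESHOLD = 35  # pass mark, per the grading example

def grade(marks):
    """One-rule classifier: map a continuous score to a category."""
    return "pass" if marks >= THRESHOLD else "fail"

results = [grade(m) for m in [80, 35, 20]]
```

The output is a category label rather than a number, which is what distinguishes classification from regression.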

Classification comes in two types. The first is binary classification, where we categorize the given dataset into two distinct classes.

For example, on the basis of a given health condition, we see whether a person is fit or not. Next is multiclass classification, where the number of classes is more than two. For example, classifying fruits: there can be any number of fruits, and if we want to choose among four or five fruits for a given observation, it is multiclass classification. So again, it is based on the type of output.

If you want to know whether it is A, B or C, yes or no, reject or select, and so on, then we prefer classification algorithms. There are various types of classifiers, also called classification algorithms. The first and most basic is the linear classifier, the best example of which is logistic regression. Next are tree-based classifiers, of which decision trees are the simplest example. Then there are support vector machines, artificial neural networks, Bayesian regression, and the naive Bayes classifier.
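As a taste of how such a classifier works, here is a nearest-neighbour sketch, one of the simplest classification algorithms (it is my choice of example, not one the article names): a new observation gets the label of its closest training example. The fruit measurements are invented.

```python
import math

def nearest_neighbour(point, training):
    """Return the label of the training example closest to `point`.

    `training` is a list of (features, label) pairs.
    """
    features, label = min(training, key=lambda ex: math.dist(point, ex[0]))
    return label

# Toy multiclass fruit data: (weight in g, diameter in cm) -> label.
fruits = [
    ((150, 7), "apple"),
    ((120, 6), "orange"),
    ((1000, 20), "watermelon"),
]

prediction = nearest_neighbour((140, 7), fruits)
```

With three labels this is a multiclass problem; with only two labels the exact same code would do binary classification.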

And we have **ensemble methods/techniques** like random forests, AdaBoost, and bagging classifiers.

For example, spam mail: the mails we receive are divided into spam or non-spam, which is a classic example of classification. Detecting whether a person has cancer also belongs to the classification category.
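The ensemble idea mentioned above can be sketched with a toy majority vote: several simple classifiers each predict a label, and the most common answer wins. The "spam detector" rules and the messages below are entirely made up for illustration.

```python
from collections import Counter

def majority_vote(classifiers, item):
    """Combine several classifiers' predictions by majority vote."""
    votes = [clf(item) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hand-made, deliberately weak spam rules (illustrative only).
detectors = [
    lambda msg: "spam" if "free" in msg else "not spam",
    lambda msg: "spam" if "winner" in msg else "not spam",
    lambda msg: "spam" if msg.count("!") > 3 else "not spam",
]

verdict = majority_vote(detectors, "You are a winner! Claim your free prize!")
```

Real ensembles such as random forests or bagging train many models on resampled data, but the combining step is essentially this vote.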

Now, if the problem is that we want to see which group a particular observation belongs to, then we go for **clustering**.

It is basically an unsupervised learning method, that is, a method in which we draw inferences from datasets consisting of input data without labelled responses. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.

It is basically a grouping of objects on the basis of their similarity and dissimilarity. Clustering is very important as it determines the intrinsic grouping among unlabelled data. There are no universal criteria for good clustering; it depends on the user and on which criteria satisfy their needs. Now, there are different types of clustering methods.

Let us see them one by one.

The first one is the **density-based** method. These methods treat a cluster as a dense region of points, separated from lower-density regions of the space. They have good accuracy and the ability to merge two clusters; an example is DBSCAN. Next are **hierarchical** methods, where the clusters form a tree-like structure based on a hierarchy, and new clusters are formed using previously formed ones. Hierarchical clustering is divided into two categories.

The first one is **agglomerative**, the bottom-up approach, and the second is **divisive**, the top-down approach. Next are partitioning methods: these partition the objects into **K clusters**, and each partition forms one cluster. This method optimizes an objective similarity function, such as when distance is the major parameter; an example is K-means. The last one is **grid-based**: in this method the data space is formulated into a finite number of cells that form a grid-like structure.

All the clustering operations done on these grids are fast and independent of the number of data objects; an example is STING, the statistical information grid.
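The partitioning idea is easiest to see in code. Below is a bare-bones K-means sketch on 2-D points: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The coordinates are invented so the two blobs are obvious.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels

# Two well-separated blobs -- made-up coordinates.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(points, k=2)
```

Note that no labels were given as input; the grouping emerges purely from the distances, which is what makes clustering unsupervised.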

Now, let’s see another kind of problem: if you want to ask, is this weird? We go with **anomaly detection**. Anomaly detection is the technique of identifying rare events or observations which raise suspicion by being statistically different from the rest of the observations. Such anomalous behaviour typically translates to some kind of problem, like credit card fraud, a failing machine in a server room, a cyber attack, etc.

Anomalies can be broadly categorized into three types. The first is a **point anomaly**: a tuple in the dataset is a point anomaly if it is far off from the rest of the data. The second is a **contextual anomaly**: an observation is a contextual anomaly if it is anomalous because of its context. The last is a **collective anomaly**, where a set of data instances together helps in finding an anomaly. Anomaly detection can be done using the concepts of machine learning.

It can be done in the following ways. The first is **supervised anomaly detection**: this method requires a labelled dataset containing both normal and anomalous samples in order to construct a predictive model that classifies future data points. The most commonly used algorithms for this purpose are supervised neural networks, support vector machines, K-nearest-neighbour classifiers, etc. The other is **unsupervised anomaly detection**: this method does not require any training data and instead assumes two things about the data.

That is, only a small percentage of the data is anomalous, and any anomaly is statistically different from the normal samples. Based on these assumptions, the data is then clustered using a similarity measure, and the data points which are far off from any cluster are considered anomalies.
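A minimal sketch of point-anomaly detection, under exactly those assumptions, is a z-score rule: flag any value more than a chosen number of standard deviations from the mean. The transaction amounts are made up; note that a single large outlier inflates the standard deviation, which is why a modest threshold is used here.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Everyday card payments plus one suspicious outlier (invented data).
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
anomalies = find_anomalies(amounts)
```

No labels were needed: the rule relies only on the anomaly being rare and statistically different, matching the two unsupervised assumptions above.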

So that’s a brief introduction about how to choose an algorithm based on the problem type.