As the saying goes, “Every expert was once a beginner.” Most beginners start their journey into the Data Science domain with Machine Learning. Data Scientists need to understand these algorithms because they are the fundamentals of Data Science. In this article, we discuss a few of the most popular Machine Learning algorithms.
Machine learning is a subset of Artificial Intelligence and a core part of Data Science. It gives a system the ability to learn and improve automatically from data without being explicitly programmed. It focuses on developing models that use data to learn on their own. The learning process begins with observing data; the model then finds patterns and draws conclusions based on those observations. Machine Learning is a stepping stone in Data Science.
Types of Machine Learning Algorithms
There are three types of Machine Learning Algorithms:
- Supervised Machine Learning: In Supervised Machine Learning, the data is labelled. It contains a target variable (dependent variable), which is predicted from a set of predictors (independent variables). The model learns a function that maps inputs to the desired outputs.
- Unsupervised Machine Learning: These models have only input variables. There is no target or output variable, so the model looks for structure in the data itself.
- Reinforcement Learning: It is a type of machine learning that allows an agent to decide the best next action based on its current state by learning behaviors that maximize a reward. The agent is rewarded for correct decisions and penalized for wrong ones.
Here are a few Supervised Machine Learning algorithms:
- Linear Regression:
It is one of the most popular and widely used Machine Learning algorithms. Regression refers to modelling a relationship, and linear means a straight line. Linear regression is a supervised machine-learning algorithm that fits a straight line through a set of given points. Here we establish a relationship between the independent and dependent variables by fitting a straight line. The best-fitting line is called the regression line. We can relate this to everyday observations: for example, taller people tend to weigh more, and as the size of a plot of land increases, its cost also tends to increase. These are simplified examples that ignore most of the other factors involved.
In linear regression, the relation between the dependent variable (y) and the independent variable (x) can be expressed with the help of an equation. The equation is represented as:
y = m*x + c
In the above equation:
- y – dependent variable
- x – independent variable
- m – slope of the line
- c – Intercept.
The coefficients ‘m’ and ‘c’ are chosen to minimize the sum of squared differences between the observed values of y and the values predicted by the regression line (the least squares method).
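- Example: The cost of a house depends on various features such as the size of the land, the age of the building, the location, etc. With more than one feature, the equation gets one slope coefficient per feature.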
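To make the least squares idea concrete, here is a minimal sketch in plain Python that fits y = m*x + c to a handful of points using the standard closed-form solution. The function names `fit_line` and `predict`, and the land-size and cost numbers, are made up for illustration and are not taken from any particular library or data set.

```python
def fit_line(xs, ys):
    """Return (m, c) for the line y = m*x + c that minimizes
    the sum of squared differences between observed and predicted y."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope of the regression line
    c = mean_y - m * mean_x  # intercept: the line passes through the means
    return m, c


def predict(x, m, c):
    """Evaluate the fitted line at x."""
    return m * x + c


# Hypothetical data: land size (x) and cost (y), chosen to lie on y = 2x + 1.
sizes = [1.0, 2.0, 3.0, 4.0]
costs = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(sizes, costs)
print(m, c, predict(5.0, m, c))
```

In practice a library routine would usually be used instead, but the arithmetic above is what a single-feature least squares fit comes down to.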
- Logistic Regression:
We have seen that linear regression predicts continuous values, whereas logistic regression predicts discrete values. It is mainly used for binary classification problems, e.g. yes or no, true or false, pass or fail. Here there are only two possible outcomes: the event occurs (1) or it does not occur (0). It models the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using the logistic function, which is the cumulative distribution function of the logistic distribution. It can take many input variables (independent variables) and produces one output variable (dependent variable).
Logistic regression is named after the function used to predict the output: the logistic function, also called the sigmoid function, which is given by the equation:
z = 1 / (1 + e^(-y))
In the above equation:
- z – output of function
- y – the input to the function, which is given by the equation in linear regression (y = m*x + c).
- Example: The outcome of a person being selected in an interview. Here the independent features are highest qualification, marks in aptitude tests, performance in the interview, etc., and the output variable is selected or not selected.
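The way the logistic function turns a linear score into a probability, and then into a 0/1 decision, can be sketched as follows. This is only an illustration: the coefficients `m` and `c`, the aptitude scores and the 0.5 threshold are assumed values, and a real model would learn its coefficients from labelled data.

```python
import math


def sigmoid(y):
    """Logistic function: maps any real number into the range 0 to 1."""
    # Two branches avoid overflow in math.exp for large negative y.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def predict_probability(x, m, c):
    """Probability of the positive class for input x."""
    return sigmoid(m * x + c)  # linear score first, then the sigmoid


def predict_class(x, m, c, threshold=0.5):
    """Return 1 (occurs) or 0 (does not occur)."""
    return 1 if predict_probability(x, m, c) >= threshold else 0


# Hypothetical coefficients for "selected in interview" vs. aptitude score.
m, c = 0.1, -6.0
for score in (40, 60, 80):
    print(score, round(predict_probability(score, m, c), 3),
          predict_class(score, m, c))
```

With these assumed coefficients, a score whose linear value m*x + c is positive maps to a probability above 0.5 and is classified as 1, while a negative value maps below 0.5 and is classified as 0.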
- Decision Trees:
A decision tree is a supervised learning algorithm that is widely used for classification problems. It works with both categorical and continuous data. The algorithm uses a tree-like model to predict the output. At each node, it splits the data based on a question framed on a feature; the answer to that question is always true or false. In this way, the tree is split from the root node down to the leaf nodes. The initial splits are made on the most significant features, chosen using measures such as Gini impurity, entropy and information gain. The best decision tree is the one that needs the fewest yes/no questions to assess the probability of making the right decision.
- Example: Consider the outcome of playing cricket based on the conditions of the day, such as Weather, Humidity and Wind. The data is split according to the questions framed on these features.
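The split measures mentioned above can be computed by hand. The sketch below uses the usual textbook definitions of Gini impurity, entropy and information gain; the function names and the tiny "play cricket" label lists are invented for illustration and do not come from a real data set.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical outcomes for four days: was cricket played?
play = ["yes", "yes", "no", "no"]
# Asking "is it windy?" separates the outcomes perfectly in this toy case...
split_by_wind = [["yes", "yes"], ["no", "no"]]
# ...while asking "is humidity high?" leaves both groups mixed.
split_by_humidity = [["yes", "no"], ["yes", "no"]]

print(gini(play), entropy(play))
print(information_gain(play, split_by_wind))
print(information_gain(play, split_by_humidity))
```

In this toy case the Wind question has the higher information gain, so a tree built on these measures would ask it first.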