Logistic Regression for Machine Learning
Simple explanation of the logistic regression algorithm, where to use it, & how it differs from linear regression
Whether you are new to machine learning or not, it is likely you’ve heard of logistic regression, as it is used in many fields, including machine learning. But what do machine learning practitioners and data scientists need to understand about this model? Let me give a simple introduction to what logistic regression is, including:
- How logistic regression differs from linear regression
- The three types of logistic regression
- Training data requirements for logistic regression
- Basic mathematics behind logistic regression
"[The] field of study that gives computers the ability to learn without being explicitly programmed." - Arthur Samuel, Some Studies in Machine Learning Using the Game of Checkers
Simple introduction to logistic regression
Before we dive into understanding logistic regression, let us start with some basics about the different types of machine learning algorithms.
What are the differences between supervised learning, unsupervised learning & reinforcement learning?
Machine learning algorithms are broadly classified into three categories - supervised learning, unsupervised learning, and reinforcement learning.
- Supervised Learning - Learning where data is labeled and the motivation is to classify something or predict a value. Example: Detecting fraudulent transactions from a list of credit card transactions.
- Unsupervised Learning - Learning where data is not labeled and the motivation is to find patterns in given data. In this case, you are asking the machine learning model to process the data from which you can then draw conclusions. Example: Customer segmentation based on spend data.
- Reinforcement Learning - Learning by trial and error. This is the closest to how humans learn. The motivation is to find the optimal policy for how to act in a given environment. The machine learning model explores possible actions, makes a policy that maximizes benefit, and implements the policy (trial). If there are errors from the initial policy, apply reinforcements back into the algorithm and continue to do this until you reach the optimal policy. Example: Personalized recommendations on streaming platforms like YouTube.
What are the two types of supervised learning?
As supervised learning is used to classify something or predict a value, naturally there are two types of algorithms for supervised learning - classification models and regression models.
- Classification model - In simple terms, a classification model predicts possible outcomes. Example: Predicting if a transaction is fraud or not.
- Regression model - A regression model is used to predict a numerical value. Example: Predicting the sale price of a house.
What is logistic regression?
Logistic regression is an example of supervised learning. It is used to calculate or predict the probability of a binary (yes/no) event occurring. An example of logistic regression could be applying machine learning to determine if a person is likely to be infected with COVID-19 or not. Since we have two possible outcomes to this question - yes they are infected, or no they are not infected - this is called binary classification.
In this imaginary example, the probability of a person being infected with COVID-19 could be based on the viral load, the symptoms, and the presence of antibodies. Viral load, symptoms, and antibodies would be our factors (Independent Variables), which would influence our outcome (Dependent Variable).
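To make that concrete, here is a minimal sketch using scikit-learn's LogisticRegression. The feature names and numbers are invented purely for illustration; the point is that the model learns to map the independent variables to a probability of the yes/no outcome.

```python
# Minimal sketch: binary logistic regression with scikit-learn.
# The feature names and values below are made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Independent variables: viral_load, symptom_score, antibodies_present
X = np.array([
    [0.9, 7, 0],
    [0.2, 1, 1],
    [0.7, 5, 0],
    [0.1, 0, 1],
    [0.8, 6, 0],
    [0.3, 2, 1],
])
# Dependent variable: 1 = infected, 0 = not infected
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(not infected), P(infected)] for a new person
new_person = [[0.6, 4, 0]]
print(model.predict_proba(new_person))
```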
How is logistic regression different from linear regression?
In linear regression, the outcome is continuous and can be any possible value. However, in the case of logistic regression, the predicted outcome is discrete and restricted to a limited number of values.
For example, say we are trying to apply machine learning to the sale of a house. If we are trying to predict the sale price based on the size, year built, and number of stories, we would use linear regression, as linear regression can predict a sale price of any possible value. If we are using those same factors to predict if the house sells or not, we would use logistic regression, as the possible outcomes here are restricted to yes or no.
Hence, linear regression is an example of a regression model and logistic regression is an example of a classification model.
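Here is a rough sketch of that contrast using made-up house data: the same three factors feed a linear regression to predict a price and a logistic regression to predict whether the house sells.

```python
# Sketch contrasting the two models on the same (made-up) house features:
# size in square feet, year built, number of stories.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([
    [1500, 1995, 1],
    [2400, 2010, 2],
    [1100, 1980, 1],
    [3000, 2018, 2],
])

# Regression: predict a continuous sale price (any possible value).
prices = np.array([210_000, 340_000, 150_000, 450_000])
linear = LinearRegression().fit(X, prices)
print(linear.predict([[2000, 2005, 2]]))  # a dollar amount

# Classification: predict whether the house sells (1) or not (0).
sold = np.array([1, 1, 0, 1])
logistic = LogisticRegression(max_iter=1000).fit(X, sold)
print(logistic.predict([[2000, 2005, 2]]))  # 0 or 1
```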
Where to use logistic regression
Logistic regression is used to solve classification problems, and the most common use case is binary logistic regression, where the outcome is binary (yes or no). In the real world, you can see logistic regression applied across multiple areas and fields.
- In health care, logistic regression can be used to predict if a tumor is likely to be benign or malignant.
- In the financial industry, logistic regression can be used to predict if a transaction is fraudulent or not.
- In marketing, logistic regression can be used to predict if a targeted audience will respond or not.
Are there other use cases for logistic regression aside from binary logistic regression? Yes. There are two other types of logistic regression that depend on the number of predicted outcomes.
The three types of logistic regression
- Binary logistic regression - When we have two possible outcomes, like our original example of whether a person is likely to be infected with COVID-19 or not.
- Multinomial logistic regression - When we have three or more possible outcomes with no natural ordering, say if we build out our original example to predict whether someone may have the flu, an allergy, a cold, or COVID-19 (sketched in code after this list).
- Ordinal logistic regression - When the outcome is ordered, like if we build out our original example to also help determine the severity of a COVID-19 infection, sorting it into mild, moderate, and severe cases.
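Below is a small sketch of the multinomial case with invented data; scikit-learn's LogisticRegression handles more than two classes out of the box. Ordinal logistic regression typically needs a different implementation (for example, an ordered model from a statistics package such as statsmodels).

```python
# Sketch of multinomial logistic regression: the target has more than two
# classes (flu, allergy, cold, COVID-19). Data and labels are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([
    [0.9, 7], [0.2, 3], [0.1, 1], [0.8, 6],
    [0.3, 4], [0.05, 2], [0.7, 5], [0.15, 1],
])
# 0 = flu, 1 = allergy, 2 = cold, 3 = COVID-19
y = np.array([3, 1, 2, 3, 0, 2, 0, 1])

# With more than two classes, scikit-learn fits a multinomial model;
# predict_proba returns one probability per class.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[0.5, 4]]))
```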
Training data assumptions for logistic regression
Training data that satisfies the below assumptions is usually a good fit for logistic regression.
- The predicted outcome is strictly binary or dichotomous. (This applies to binary logistic regression).
- The factors, or the independent variables, that influence the outcome are independent of each other. In other words, there is little or no multicollinearity among the independent variables (a quick check for this is sketched below).
- The independent variables are linearly related to the log odds of the outcome.
- The sample size is fairly large.
If your training data does not satisfy the above assumptions, logistic regression may not work for your use case.
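One of these assumptions, little or no multicollinearity, is easy to eyeball before training. The sketch below uses a pandas correlation matrix on hypothetical columns; pairwise correlations near +1 or -1 are a warning sign.

```python
# Rough sketch of one assumption check: look for multicollinearity among
# the independent variables with a simple correlation matrix.
# Column names and values here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "viral_load": [0.9, 0.2, 0.7, 0.1, 0.8, 0.3],
    "symptoms":   [7, 1, 5, 0, 6, 2],
    "antibodies": [0, 1, 0, 1, 0, 1],
})

# Pairwise correlations close to +1 or -1 suggest strong multicollinearity,
# which violates the independence assumption above.
print(df.corr())
```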
Mathematics behind logistic regression
Probability always ranges between 0 (does not happen) and 1 (happens). Using our COVID-19 example, in the case of binary classification, the probability of testing positive and the probability of not testing positive will sum to 1. We use the logistic function, or sigmoid function, to calculate probability in logistic regression. The logistic function is a simple S-shaped curve used to convert data into a value between 0 and 1.
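As a quick illustration, here is the sigmoid function in a few lines of Python, showing how any raw input is squeezed into the 0-to-1 probability range.

```python
# Minimal sketch of the logistic (sigmoid) function, which maps any real
# number to a probability between 0 and 1: sigmoid(z) = 1 / (1 + e^(-z)).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and 0 maps to exactly 0.5.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```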
Conclusion
In a nutshell, logistic regression is used for classification problems when the output or dependent variable is dichotomous or categorical. There are some assumptions to keep in mind while implementing logistic regression, and the right type of logistic regression to choose depends on the kind of outcome you are predicting and the training data available.
To read more about how Capital One is using logistic regression, check out these articles: