WWCode Talks Tech #19: Statistics For Machine Learning
Sneha: Statistics is a part of mathematics that provides us with tools and methods to handle and understand data. There are two broad applications of statistics. One is to describe the data, which is descriptive statistics. The other is to infer from data, which is inferential statistics. Statistics and machine learning are closely coupled. Outside the space of implementing, building, and deploying machine learning models or algorithms, it's important to understand why you're choosing that model. How is it handling your data? How are you interpreting the results? When you start applying these techniques within a real-world space, it becomes more and more relevant to have that understanding.
I like to think about it through five key areas of machine learning that utilize the understanding of statistics. With that, you don't have to be an expert. You don't have to study complex mathematical formulae or start deriving things. We aim to understand why you need a feel or intuition for statistics as you go through these different areas of machine learning. We all begin by just looking at data. Exploratory data analysis forms the first step in coming up with the right business question that you're going to handle with the help of machine learning. Data preparation benefits from statistical methods. This is where you begin to understand the different distributions of your variables. Do you have missing values? Do you have any outliers, and how should you handle those?
There is a portion where you look at your data, split it into train and test sets, and decide how you will sample them. Sampling is another statistical concept that helps us do this in a statistically sound way. The second piece is model evaluation. Understanding the different algorithms available to you, running them, and understanding the parameters that come out in your output all rely on statistical concepts. The third piece is model selection. When you look at the different model output parameters, you're making inferences, interpreting the results, and working out which model better explains the variability in your data. Those concepts are deeply rooted in statistics.
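As a minimal sketch of the train/test split mentioned above, assuming simple random sampling without replacement: the helper name `train_test_split`, the 20% test fraction, and the fixed seed are illustrative choices, not something specified in the talk.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))            # hypothetical stand-in for 100 records
train, test = train_test_split(data)
print(len(train), len(test))       # 80 20
```

Every record lands in exactly one of the two sets, which is what keeps the test set an honest check on the model.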
The next piece is model presentation. You've built the model and selected the best one. Now, you have to go and convince someone that it works best to answer this business case. You have to condense it, summarize it, and share it intuitively and understandably. This is where statistics comes to the rescue, helping you share that information more descriptively. Last, we have prediction. When you're running predictions with machine learning, you need to understand how to interpret the expected variability in your predictions and what that means when running those prediction algorithms. We have different types of data. Numerical and categorical are two broad categories that are quite popular. I want to highlight four levels of measurement which tie into those data types: nominal, ordinal, interval, and ratio.
Anjali: Statistics is essentially divided into two basic parts, which are descriptive and inferential. Descriptive statistics is the practice of organizing and summarizing data using numbers and graphs. We use graphs to summarize the data. We use different measures of central tendency. We use measures of variability and other methods to describe data in a way that is easier for the audience to understand. Inferential statistics goes a step further. Once I've given you the data and described what it indicates, inferential statistics comes into the picture. It helps us analyze the data and draw the necessary conclusions from it.
Data helps us understand, and graphs help us understand the data better in a visual manner. Suppose you have a city with a population of 100,000 people, and we can't just collect and calculate over the whole population. When you calculate descriptive and inferential statistics, you're calculating the mean, median, and mode for sample data. I want to clarify two technical terms. One is sample data, which represents the part of the population that you have surveyed and are analyzing. You can't just go around and talk to 100,000 people in the city and ask them what their preferences are about the subject. You take a sample of people and ask them what they like. Then you draw a conclusion about the whole population using the sample you surveyed. Sample data is the part of the population you are surveying, and data points are the individual values you have in a raw data set. We have a population of 100,000 people. I'm surveying 100 people out of those 100,000 because I want to know the total number of people who have jobs in the tech sector. We have a sample size of 100. Let us suppose that 34 out of 100 have jobs in tech.
We are surveying random people, and we found out that 34 people out of 100 have jobs in tech. This leads us to the conclusion that the proportion of people in the sample who have jobs in tech is 0.34. That proportion is an average, which is a measure of descriptive statistics, but then what is inferential statistics? I am drawing a conclusion about the whole 100,000 people in the city using this value of 0.34. I'm saying that 34% of the people in the city have jobs in the tech sector. Out of 100,000 people, that is an estimated 34,000 people with jobs in the tech sector. I'm drawing the inference or conclusion about the whole city using just 100 people. Increasing the sample size gives more accurate results.
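The arithmetic behind that inference can be sketched in a few lines. The numbers come from the example above; the variable names are illustrative, and the result is a point estimate only, with no margin of error attached.

```python
sample_size = 100          # people surveyed
tech_in_sample = 34        # surveyed people with tech jobs
population_size = 100_000  # people in the city

# Descriptive step: summarize the sample with a single number.
sample_proportion = tech_in_sample / sample_size

# Inferential step: project the sample proportion onto the population.
estimated_tech_workers = round(sample_proportion * population_size)

print(sample_proportion)       # 0.34
print(estimated_tech_workers)  # 34000
```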
Central tendency is a measure used in descriptive statistics. It is a descriptive summary of a data set through a single value representing the center of the data distribution. What does it mean? I have a data distribution. The middlemost or centermost number, which is self-explanatory from the name central tendency, is the number that will represent the whole data set. What single number best represents our data? We have three measures of central tendency. We have mean, median, and mode. Mean, median, and mode can give you different values. To calculate the mean, we add all data points and divide the sum by the number of data points. We take the average. To calculate the median, you must arrange your data in increasing or decreasing order. If your data has an even number of data points, you take the two middlemost numbers, add them, and divide by two. If it has an odd number of data points, pick the very middle number. Mode is the value that occurs most often in a data set.
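The three procedures just described can be written out by hand as a rough sketch. The function names and the `scores` list are hypothetical, and on a tie `mode` here simply returns the first most frequent value it encountered.

```python
from collections import Counter

def mean(values):
    # Add all data points and divide by how many there are.
    return sum(values) / len(values)

def median(values):
    # Sort first, then take the middle value (odd count)
    # or the average of the two middle values (even count).
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mode(values):
    # The value that occurs most often.
    return Counter(values).most_common(1)[0][0]

scores = [2, 3, 3, 5, 7, 10]  # hypothetical data, even number of points
print(mean(scores))           # 5.0
print(median(scores))         # 4.0
print(mode(scores))           # 3
print(median([1, 9, 4]))      # 4 (odd number of points)
```

In practice Python's standard `statistics` module provides `mean`, `median`, and `mode`, so hand-written versions like these are mainly useful for building intuition.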
Now that you have a data set, how do you know which one to use? Does mode, mean, or median best represent my data? We have this data set: 70, 72, 74, 76, 80, and 114. We have a mean of 81 and a median of 75. The mean of 81 lies between 80 and 114, above five of the six data points. The median of 75 sits at the center. It's the median that best represents this data. The mean is giving us a skewed value because it is pulled upward by the outlier, 114. It is not typical of the values in my data set. The mean is used as the measure of central tendency only when the data is balanced and not highly skewed, that is, when there are no outliers and we have a balanced graph. Mode is used when a particular score dominates the distribution. Suppose there's a class, and I ask what flavor of ice cream you like. There are 60 people in the class, and 20 just say, "Yeah, we like the mint flavor the most." So I am getting a data set where a particular score dominates the distribution. I'm getting 20 for mint, two for chocolate, three for strawberry, etc. The mint one is dominating. The mode is mint, with a count of 20, because more people chose mint than any other flavor. That's when we use mode as a representation of central tendency.
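To check these numbers, here is a small sketch using Python's standard `statistics` module on the data set from the talk. Dropping the largest value is only an illustration of how much the outlier moves the mean. The ice cream tally includes just the three counts mentioned, since the remaining votes are not listed.

```python
import statistics
from collections import Counter

scores = [70, 72, 74, 76, 80, 114]
mean_all = statistics.mean(scores)      # 81
median_all = statistics.median(scores)  # 75.0

# Remove the outlier (114) to see how each measure reacts.
without_outlier = scores[:-1]
mean_trimmed = statistics.mean(without_outlier)      # 74.4
median_trimmed = statistics.median(without_outlier)  # 74

print(mean_all, median_all)
print(mean_trimmed, median_trimmed)

# Partial tally of the ice cream survey: only the counts given in the talk.
votes = Counter({"mint": 20, "chocolate": 2, "strawberry": 3})
top_flavor, top_count = votes.most_common(1)[0]
print(top_flavor, top_count)  # mint 20
```

The mean moves from 81 to 74.4 when the outlier is removed, while the median moves only from 75 to 74, which is why the median is the more robust choice here. In the survey, the mode is the flavor (mint), and 20 is how often it occurs.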
What happens if you have the same class where 20 people say that they prefer mint, and another 20 say they prefer chocolate? There are two values dominating the data distribution. This is a bimodal distribution. Next is variability, which indicates how spread out the data points are. How different are they from one another? You would like the data set to be less variable. Highly variable data is not predictable, and it's very unstable. If the data set is unpredictable because it's highly variable, it's much less useful. The greater the variability, the less consistent and accurate the measure of central tendency. How do we measure variability? We do it through standard deviation and variance.
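Both ideas can be illustrated with the standard `statistics` module. This is a sketch under stated assumptions: the three strawberry votes and the `tight` and `spread` lists are hypothetical values chosen for illustration, and `variance` and `stdev` here are the sample versions (dividing by n - 1).

```python
import statistics

# Bimodal case: two flavors tie for the most votes.
votes = ["mint"] * 20 + ["chocolate"] * 20 + ["strawberry"] * 3
modes = statistics.multimode(votes)
print(modes)  # ['mint', 'chocolate']

# Two hypothetical data sets with the same mean (75) but different spread.
tight = [74, 75, 76]
spread = [50, 75, 100]

tight_var = statistics.variance(tight)    # 1
spread_var = statistics.variance(spread)  # 625
tight_sd = statistics.stdev(tight)        # 1.0
spread_sd = statistics.stdev(spread)      # 25.0

print(tight_var, tight_sd)
print(spread_var, spread_sd)
```

The mean of 75 describes `tight` well, since every point is within 1 of it, but says much less about `spread`, where points sit 25 away on either side.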
Runjhun: Inferential statistics is the other type of statistics that we know. Descriptive statistics would summarize the characteristics of a data set. What's the most frequent number in our data? That's a characteristic of the data set. When we need to draw conclusions from it, we use inferential statistics, which helps us come to conclusions and make predictions based on our data. It has two main uses. First, it helps in making estimates about a population. The second use of inferential statistics is testing a hypothesis to draw a conclusion about a population. A hypothesis is an assumption statement. It might be true, or it might not be. We reach a conclusion about it, and we back that conclusion up with data.
Sampling is a tool we use when we have a lot of data. If the population to survey is large, use sampling. It is the process of selecting a sample from the population. The sample is a subset of the whole population. You take a very small part of your data and ask the question that you would have asked the whole population. If I ask how many people like blue cars, say the population is 100,000. Take a subset of a hundred people. Twenty people out of the hundred like blue cars. It means 20% of the sample like blue cars, and we estimate that about 20% of the population do too. Keep in mind the margin of error because we are just estimating. That's how inferential statistics work. The margin of error decreases as I survey a larger sample. Then there is organizing, analyzing, and presenting the data meaningfully. Take account of outliers and missing values. If I manage my data well, I can make a good hypothesis.
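To make the margin of error concrete, here is a rough sketch for the blue-car example. It assumes a simple random sample, a 95% confidence level, and the normal approximation for a proportion (z = 1.96). None of these choices are stated in the talk, and the function name is illustrative.

```python
import math

Z_95 = 1.96  # assumed z-score for a 95% confidence level

def margin_of_error(proportion, sample_size, z=Z_95):
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

p = 20 / 100  # 20 of the 100 people surveyed like blue cars

moe_100 = margin_of_error(p, 100)
moe_1000 = margin_of_error(p, 1000)  # same proportion, hypothetical larger sample

print(f"n=100:  {p:.0%} +/- {moe_100:.1%}")
print(f"n=1000: {p:.0%} +/- {moe_1000:.1%}")
```

With 100 people the estimate is roughly 20% give or take about 8 percentage points. Surveying ten times as many people shrinks that to about 2.5 points, because the margin of error falls with the square root of the sample size.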