Measures of Central Tendency:
According to Professor Bowley, averages are “statistical constants which enable us to comprehend in a single effort the significance of the whole.” They give us an idea of how the values concentrate in the central part of the distribution. Plainly speaking, the average of a statistical series is the value of the variable that is representative of the entire distribution. The most common measures of central tendency used in data science are:
- Mean or Arithmetic Mean
- Median
- Mode
- Mean or Arithmetic Mean:
The mean, or arithmetic mean, is the sum of all data values divided by the number of data points. It can be viewed as the balance point of the data: the distance from the origin at which the system of forces is balanced. It is mathematically intensive, since every value enters the calculation. The mean need not be equal to any individual data value.
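- In the case of a set of n observations $x_1, x_2, \dots, x_n$, the mean is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- In the case of a frequency distribution, where the value $x_i$ occurs with frequency $f_i$, the mean is given by $\bar{x} = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i}$.
- In the case of a grouped or continuous frequency distribution, $x_i$ is taken as the mid-value of the corresponding class.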
- Suppose we have the observations 1, 2, 2, 2, 3, 2, 4, 1, 2, 1. The sum of these observations is 20, and the number of observations is 10. The mean is the sum of the observations divided by the number of observations, i.e. 20/10 = 2. Therefore the mean is 2. The overall grade we usually refer to for a school course is typically a mean of this kind.
- Suppose you want to calculate the average time a person spends waiting for a cab after booking it, in other words, the average time a cab takes to arrive at the pickup location. To calculate this, we need to record the waiting time for every person who books a cab. Once the data is collected, the mean can be calculated. This gives us an idea of how long a person typically waits for the cab to arrive.
Calculating mean using R:
> a = c(1,2,2,2,3,2,4,1,2,1)
> print(a)
 [1] 1 2 2 2 3 2 4 1 2 1
> mean(a)
[1] 2
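For a frequency distribution, the same mean can be obtained by weighting each distinct value by how often it occurs. The following is a minimal sketch that tabulates the ten observations above by hand; the variable names `x`, `f`, and `freq_mean` are illustrative and not part of the original example.

```r
# Distinct values and their frequencies for 1,2,2,2,3,2,4,1,2,1
x <- c(1, 2, 3, 4)   # distinct values
f <- c(3, 5, 1, 1)   # number of times each value occurs (sums to 10)

# Mean of a frequency distribution: sum(f * x) / sum(f)
freq_mean <- sum(f * x) / sum(f)
print(freq_mean)

# weighted.mean() computes the same quantity
print(weighted.mean(x, f))
```

Both calls should agree with `mean(a)`, since a frequency table is only a compact way of writing the same data.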
- Median:
The median is the middle data value when the data is ranked from minimum to maximum. It is easy to compute in practice. The median is usually represented by an individual data value. The median divides the area of the histogram in half. Unlike the mean, the median is not affected by extreme values. The median is a positional average. In the case of ungrouped data, if the number of observations is odd, the median is the middle value after the values have been arranged in ascending or descending order of magnitude. In the case of an even number of observations, there are two middle terms, and the median is obtained by taking the arithmetic mean of the two middle terms. The steps involved in finding the median are:
- Sort the observations in ascending or descending order.
- If the number of observations is odd, take the middle value; if it is even, take the mean of the two middle values.
- Suppose we have the observations 2, 9, 5, 7, 3. To find the median, we first sort the data in ascending or descending order. After sorting, the data is 2, 3, 5, 7, 9. The median is the middle value of the sorted data. Therefore the median of these observations is 5.
- Suppose you are shopping online, say on Amazon, and you want to select a home decor item that most people will like. Let the types of decor available be: a sofa set (liked by people aged 20-30), curtains (liked by people aged 30-40), and teapots (liked by people aged 40-50). The ages (in years) of the people who have visited this section are 20, 25, 29, 40, and 41. The mean of these ages is 31, which falls in the 30-40 age group, so you opt for the curtains. But only one person turns out to like them, while the others do not. The better alternative in such a case is to calculate the median instead of the mean. The median is the middle value of the sorted data, i.e. 29 in this case. If you buy the sofa set instead, more people like it. The next time, you can apply the concept of the median to decide which product to purchase.
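The figures in the shopping example can be checked directly in R. This is a small sketch; the variable name `ages` is chosen here for illustration.

```r
# Ages of the people who visited the home decor section
ages <- c(20, 25, 29, 40, 41)

age_mean <- mean(ages)       # 155 / 5 = 31, falls in the 30-40 group
age_median <- median(ages)   # middle of the sorted values = 29, falls in the 20-30 group

print(age_mean)
print(age_median)
```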
Calculating median using R:
> a = c(2,9,5,7,3)
> print(a)
[1] 2 9 5 7 3
> median(a)
[1] 5
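With an even number of observations, `median()` averages the two middle values. The sketch below adds a sixth value (1) to the data purely for illustration and compares a manual calculation with the built-in function; the names `b`, `sorted_b`, and `manual_median` are assumptions made for this example.

```r
# Six observations: an even count, so there are two middle values
b <- c(2, 9, 5, 7, 3, 1)

sorted_b <- sort(b)          # 1 2 3 5 7 9
n <- length(sorted_b)

# Average the two middle values (positions n/2 and n/2 + 1)
manual_median <- (sorted_b[n / 2] + sorted_b[n / 2 + 1]) / 2
print(manual_median)         # (3 + 5) / 2 = 4

# The built-in function gives the same result
print(median(b))
```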
- Mode:
The mode is the value that has the maximum frequency. It is easy to compute. The mode is always represented by an individual data value. It is the value at the highest point on the histogram. It is not affected by extreme values. It is preferred when the most commonly occurring value appropriately represents the group.
- Consider the observations 2, 4, 5, 3, 6, 8, 5, 9. The mode is the number that occurs most frequently. Here 5 occurs twice and every other value occurs once, so the mode of this set of observations is 5.
- Suppose you are an analyst at a food ordering and delivery company, say Swiggy. You have been assigned the task of determining the most frequently ordered food category in the last month. To complete the task, you retrieve the purchase data for the last month. The data is summarized in the table below.
|Customer Id|Food Ordered|
The task can be completed by determining the mode of the food ordered in the dataset. First, identify the distinct values in the dataset: here they are pizza, burgers, fries, and momos. Then calculate the frequency of each distinct value.
From the frequency table, we can see that pizza is ordered more frequently than the other food items in the dataset. In other words, pizza is the mode of the dataset.
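Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead), so a small helper built on `table()` is a common workaround. The sketch below is one possible approach: `stat_mode` is a hypothetical helper name, and the `orders` vector is made-up illustrative data, not the order data described above.

```r
# Hypothetical helper: return the most frequent value(s) as character strings
stat_mode <- function(v) {
  counts <- table(v)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # value(s) with the highest frequency
}

# Numeric example from the text
obs <- c(2, 4, 5, 3, 6, 8, 5, 9)
print(stat_mode(obs))

# Made-up food orders, for illustration only
orders <- c("Pizza", "Burger", "Pizza", "Fries", "Momos", "Pizza", "Burger")
print(table(orders))       # frequency of each food item
print(stat_mode(orders))
```

Because the helper reads the result from the names of the frequency table, it returns character strings even for numeric input, and it returns every tied value when more than one value shares the highest frequency.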
Image References: http://methods.sagepub.com/book/testing-and-measurement/n4.xml