Squared Error vs Absolute Values

Why do we square the differences instead of taking their absolute values when calculating variance?


We know that Standard Deviation is the square root of the average of the squared differences from the Mean.

\(\sigma = \sqrt{\frac{1}{N} \sum_i (x_i - m)^2}\)

The main reason for squaring the differences is to make every deviation from the mean non-negative.
e.g. say we have a data set of x coordinates [-2, -1, 0, 1, 2]. The mean here is 0. If we simply sum the signed distances of the points from the mean, the negative and positive values cancel each other out and the sum is zero, which wrongly suggests that the data has no spread. To avoid this problem we take the squares of the differences. Now things start getting interesting: why are we squaring the differences? Why not just take the absolute values?
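The cancellation is easy to check by hand, but a minimal Python sketch makes it concrete. The variable names are chosen here purely for illustration.

```python
# Data set from the example above; its mean is 0.
data = [-2, -1, 0, 1, 2]
mean = sum(data) / len(data)

# Signed deviations: positives and negatives cancel out.
signed_sum = sum(x - mean for x in data)

# Squared and absolute deviations: every term is non-negative.
squared_sum = sum((x - mean) ** 2 for x in data)
absolute_sum = sum(abs(x - mean) for x in data)

print(signed_sum, squared_sum, absolute_sum)  # 0.0 10.0 6.0
```

Both the squared and the absolute version avoid the cancellation, which is why the choice between them needs a further argument.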

\(\alpha = \frac{1}{N} \sum_i |x_i - m|\)

Here, \(\alpha \) is called the average absolute deviation or mean absolute deviation.
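As a rough sketch, both quantities can be computed directly from their formulas for the data set [-2, -1, 0, 1, 2]. The names `sigma` and `alpha` simply mirror the symbols in the formulas; `statistics.pstdev` from the Python standard library is used only as a cross-check.

```python
import math
import statistics

data = [-2, -1, 0, 1, 2]
n = len(data)
m = sum(data) / n  # the mean

# Standard deviation: square root of the mean squared deviation.
sigma = math.sqrt(sum((x - m) ** 2 for x in data) / n)

# Average absolute deviation: mean of the absolute deviations.
alpha = sum(abs(x - m) for x in data) / n

# Cross-check sigma against the standard library's population std dev.
assert math.isclose(sigma, statistics.pstdev(data))

print(f"sigma = {sigma:.4f}, alpha = {alpha:.4f}")  # sigma = 1.4142, alpha = 1.2000
```

The two measures agree that the data is spread out, but they do not give the same number.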

Squared error vs absolute error

Why is the average absolute deviation not as popular as the standard deviation?

There are various reasons for this. One of the major reasons is that \(x^2\) is differentiable everywhere, while \(|x|\) is not differentiable at \(x = 0\). This makes the variance and the standard deviation, which are built on squared values, easier to work with analytically than the average absolute deviation.
Hence, in problems where quadratic terms are present, one can differentiate them to find optimal solutions analytically. On the other hand, with \(|x|\), one often has to resort to numerical schemes to handle the absolute value.
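As a small worked illustration of that difference, consider choosing a single value \(c\) that best represents the data. The symbols \(J\), \(A\) and \(c\) are introduced here only for this sketch.

\(J(c) = \frac{1}{N} \sum_i (x_i - c)^2, \qquad \frac{dJ}{dc} = -\frac{2}{N} \sum_i (x_i - c) = 0 \;\Rightarrow\; c = \frac{1}{N} \sum_i x_i = m\)

\(A(c) = \frac{1}{N} \sum_i |x_i - c|, \qquad \frac{dA}{dc} = -\frac{1}{N} \sum_i \mathrm{sign}(x_i - c) \quad \text{for } c \neq x_i\)

The squared version yields the mean in closed form. The derivative of the absolute version is undefined whenever \(c\) equals a data point, so it cannot be solved the same way; its minimizer turns out to be a median of the data.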

The flip side of using quadratic terms is that outliers (i.e. values of x far from the mean in either direction) have a much greater influence on the \(x^2\) terms than on \(|x|\). This may be good or bad depending on your application.

Also, minimizing squared error is not the same as minimizing absolute error.
Minimizing squared error penalizes large errors more heavily, so it does a better job of preventing them. It is also more common in practice than minimizing absolute error.
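One way to see that the two criteria differ is to search for the single value that minimizes each one. The sketch below uses a brute-force grid search over a small hypothetical data set; the data, the grid and the function names are assumptions made for illustration, not part of the original argument.

```python
import statistics

data = [1, 2, 6]  # hypothetical values: mean is 3, median is 2

def squared_cost(c):
    """Total squared error if every point is represented by c."""
    return sum((x - c) ** 2 for x in data)

def absolute_cost(c):
    """Total absolute error if every point is represented by c."""
    return sum(abs(x - c) for x in data)

# Candidate values 0.00, 0.01, ..., 7.00.
candidates = [i / 100 for i in range(0, 701)]

best_squared = min(candidates, key=squared_cost)
best_absolute = min(candidates, key=absolute_cost)

print(best_squared, statistics.mean(data))     # 3.0 3
print(best_absolute, statistics.median(data))  # 2.0 2
```

On this data the squared criterion settles on the mean and the absolute criterion on the median, so the two really do pick different answers.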

e.g. say your employer’s payroll department accidentally pays each of ten employees $50 less than required. That’s a total absolute error of $500. It’s also a total absolute error of $500 if the department pays just one employee $500 less. But in terms of total squared error, it’s 25000 (\(10 \times 50^2\)) versus 250000 (\(500^2\)).
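The payroll arithmetic can be reproduced in a few lines of Python. This is only a sketch; the list and function names are made up for the example.

```python
# Underpayment per employee, in dollars, for the two scenarios.
spread_out = [50] * 10          # ten employees, each underpaid by $50
concentrated = [500] + [0] * 9  # one employee underpaid by $500

def total_absolute_error(errors):
    return sum(abs(e) for e in errors)

def total_squared_error(errors):
    return sum(e ** 2 for e in errors)

print(total_absolute_error(spread_out), total_absolute_error(concentrated))  # 500 500
print(total_squared_error(spread_out), total_squared_error(concentrated))    # 25000 250000
```

Absolute error cannot tell the two scenarios apart, while squared error rates the single large mistake as ten times worse.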

But then why not use cubed values? Because cubing keeps the sign, so errors in opposite directions would cancel each other out. It would have to be an absolute cubed error, or you would have to stick to even powers. There is no real “good” reason that squared error is used instead of higher even powers (or, indeed, non-polynomial penalty functions). It’s just easy to calculate, easy to minimize, and does the job.
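A tiny sketch shows the sign problem with odd powers. The two error values are hypothetical, picked so that they are equal in size and opposite in direction.

```python
errors = [-3, 3]  # hypothetical errors of equal size in opposite directions

cubed = sum(e ** 3 for e in errors)               # sign survives, terms cancel
absolute_cubed = sum(abs(e) ** 3 for e in errors)  # sign removed before cubing
squared = sum(e ** 2 for e in errors)             # even power, always non-negative

print(cubed, absolute_cubed, squared)  # 0 54 18
```

A plain cubed penalty would report zero total error here even though both predictions are off by 3.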


It’s not always better to use squared error. If you have a data set with an extreme outlier due to a data acquisition error, then minimizing squared error will pull the fit towards the outlier much more than minimizing absolute error would. That being said, it’s usually better to use squared error.
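The pull of an outlier can be sketched with the simplest possible "fit", a single constant: the mean is the constant that minimizes squared error and the median is one that minimizes absolute error. The readings below, including the faulty value of 1000, are hypothetical.

```python
import statistics

clean = [8, 9, 10, 11, 12]     # hypothetical well-behaved readings
with_outlier = clean + [1000]  # one faulty reading from a data acquisition error

# Least-squares constant fit (mean) vs. least-absolute constant fit (median).
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 175 10.5
print(mean_shift, median_shift)                                        # 165 0.5
```

One bad reading drags the squared-error fit far away from the bulk of the data, while the absolute-error fit barely moves.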

To sum up, the standard deviation is used far more than the average absolute deviation because the variance has much better statistical properties (it follows well-known distributions) and analytical properties (it is differentiable).

Thanks for reading. Leave a comment below if you have any questions or suggestions.

Ref: https://stats.stackexchange.com/questions/147001/is-minimizing-squared-error-equivalent-to-minimizing-absolute-error-why-squared
Image src: https://math.stackexchange.com/questions/1592151/standard-error-loss-vs-absolute-loss
