Impact of LC_ALL on Linux sort

sort is a handy Linux command-line tool for sorting text files. It can sort fairly large files without consuming too much memory. Its sorting behaviour can change depending on the locale, which can be set with the LC_ALL environment variable.

Find/change sorting rule on Linux

To find the sorting rule set on your Linux system run the following:

$ echo 1 | sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
1
_

In debug mode, sort tells us which sorting rule is being used. The above result may vary on your system depending on the environment variables set.

Now run the following to set LC_ALL=C for the sort command and see which sorting rule it uses:

$ echo 1 | LC_ALL=C sort --debug
sort: using simple byte comparison
1
_
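
For the setting to take effect, LC_ALL has to be in the environment that sort receives. The following is a minimal sketch, assuming a POSIX shell, of three ways to assign it and what a child process sees in each case. The variable names plain, exported and prefixed are only for illustration.

```shell
# Start from a clean state so the demonstration is predictable.
unset LC_ALL

# 1. Plain assignment: a shell variable only, not passed to child processes
#    (unless LC_ALL had already been exported earlier).
LC_ALL=C
plain=$(sh -c 'echo "${LC_ALL:-unset}"')

# 2. Exported: every later command started from this shell inherits it.
export LC_ALL
exported=$(sh -c 'echo "${LC_ALL:-unset}"')
unset LC_ALL

# 3. Per-command prefix: applies to that one command only.
prefixed=$(LC_ALL=C sh -c 'echo "${LC_ALL:-unset}"')

echo "plain=$plain exported=$exported prefixed=$prefixed"
# prints: plain=unset exported=C prefixed=C
```

The per-command prefix is usually the safest choice, since it does not change the locale for the rest of the session.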

Here are some examples with different LC_ALL localisation values:

Traditional sort using byte values

To do a traditional sort, set LC_ALL=C for sort. Here is an example:

$ printf "a 4\na3\n" | LC_ALL=C sort
a 4
a3

Since a space (byte value 32) comes before the digit 3 (byte value 51) in a simple byte comparison, the output has “a 4” before “a3”.
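
The byte values behind this ordering can be checked from the shell. This is a small sketch that relies on the standard printf behaviour where an argument starting with a quote character is converted to the numeric value of the character that follows it. The variable names space and three are only for illustration.

```shell
# Numeric byte values of the two characters that decide the comparison.
# A leading quote in the argument makes %d print the value of the next character.
space=$(printf '%d' "' ")
three=$(printf '%d' "'3")

echo "space=$space three=$three"
# prints: space=32 three=51
```

Both lines start with “a”, so the second character decides the order, and 32 is smaller than 51.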

sort using en_US.UTF-8

On Ubuntu Linux this is usually the default locale (typically set through LANG, which applies when LC_ALL is unset), so you may end up sorting with it. But we’ll set it explicitly.

$ printf "a 4\na3\n" | LC_ALL=en_US.UTF-8 sort
a3
a 4

Notice the different outcome this time. Using en_US.UTF-8 causes sort to ignore spaces here. It therefore treats “a 4” as “a4”, which sorts after “a3”.

It is a good idea to check which sorting rule sort is using on your system before relying on it, to avoid surprises.
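
In a script, one way to do this is to report the locale that governs sorting and to pin byte-order sorting where the result must not depend on the caller's settings. The sketch below assumes a POSIX shell. The helper names effective_collation and bytesort are hypothetical, not part of sort or any standard tool.

```shell
# Hypothetical helper: name the locale that governs sorting.
# Standard precedence: LC_ALL, then LC_COLLATE, then LANG; an empty value
# counts as unset. This sketch reports "C" when none of them is set.
effective_collation() {
    echo "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Hypothetical wrapper: always sort by byte value, whatever the caller's locale.
bytesort() {
    LC_ALL=C sort "$@"
}

echo "collation locale: $(effective_collation)"
printf 'a 4\na3\n' | bytesort
```

With bytesort the two sample lines always come out as “a 4” followed by “a3”, regardless of the locale of the calling shell.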
