Command line length limit: built-in vs executable

Per the POSIX specification, the special parameter * is defined as follows:

Expands to the positional parameters, starting from one, initially
producing one field for each positional parameter that is set. When
the expansion occurs in a context where field splitting will be
performed, any empty fields may be discarded and each of the non-empty
fields shall be further split as described in Field Splitting. When
the expansion occurs in a context where field splitting will not be
performed, the initial fields shall be joined to form a single field
with the value of each parameter separated by the first character of
the IFS variable if IFS contains at least one character, or separated
by a <space> if IFS is unset, or with no separation if IFS is set to a
null string.

Most people are aware of the famous ARG_MAX limitation:

$ getconf ARG_MAX
2621440

which may lead to:

$ cat * | sort -u > /tmp/bla.txt
-bash: /bin/cat: Argument list too long

Thankfully, the good people behind bash (and the other POSIX-like shells) provided us with printf as a built-in, so we can simply write:

printf '%s\0' * | sort -u --files0-from=- > /tmp/bla.txt

And everything is transparent for the user.

Could someone please explain why it is so trivial to bypass the ARG_MAX limitation with a built-in command, and why it is so damn hard to provide a conforming POSIX shell interpreter that gracefully handles the * special parameter when it is passed to a standalone executable:

$ cat *

Would that break something? I am not asking the bash people to provide cat as a built-in; I am solely interested in the order of operations, and in why * is expanded differently depending on whether the command is a built-in or a standalone executable.

Here are the solutions:

Solution 1

The limitation is not in the shell but in the exec() family of functions.

The POSIX standard says in relation to this:

The number of bytes available for the new process’ combined argument and environment lists is {ARG_MAX}. It is implementation-defined whether null terminators, pointers, and/or any alignment bytes are included in this total.

The shell does not need to call exec() to run utilities that are built into it, so those utilities are unaffected by this limitation.
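
The same split between a built-in and exec() can be used when the consuming tool has no --files0-from option. A minimal sketch, assuming printf is a built-in in the current shell and that xargs supports the widespread (but not universal) -0 option; cat_all is a hypothetical helper name:

```shell
# cat_all: concatenate every file matched by * in the current directory.
# Assumptions: printf is a shell built-in; xargs understands -0.
cat_all() {
    # The built-in printf writes the NUL-separated names to a pipe,
    # so the full list never passes through exec().
    # xargs then runs cat as many times as needed, keeping each
    # argument list under the system limit.
    printf '%s\0' * | xargs -0 cat --
}
```

It could be used as cat_all | sort -u > /tmp/bla.txt.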

Notice, too, that it’s not simply the length of the command line that is limited, but the combination of the length of the command, its arguments, and the current environment variables and their values.
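
To get a feel for how much of the budget the environment already uses, something like the following can help. It is only a rough estimate: as quoted above, what exactly is counted is implementation-defined, and the variable names are just for illustration.

```shell
# Rough estimate only: how the limit is counted is implementation-defined.
arg_max=$(getconf ARG_MAX)

# Size of the environment as printed by env (name=value plus a newline each).
# The arithmetic expansion strips any padding that wc may emit.
env_bytes=$(( $(env | wc -c) ))

echo "ARG_MAX:                     $arg_max"
echo "environment (approx. bytes): $env_bytes"
echo "left for arguments (approx): $((arg_max - env_bytes))"
```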

Also notice that printf is not a built-in utility in e.g. pdksh (which happens to act as sh and ksh on OpenBSD). A script that relies on printf being a built-in therefore has to take into account which shell is actually running it.
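
One way to check this at run time is to ask the shell how it would resolve printf. A small sketch using command -v (printf_kind is a hypothetical helper name, and the classification is deliberately coarse):

```shell
# printf_kind: report how the current shell would run printf.
printf_kind() {
    case $(command -v printf) in
        /*)     echo external ;;  # a path: run via exec(), so ARG_MAX applies
        printf) echo builtin ;;   # bare name: built-in (or a function of that name)
        *)      echo other ;;     # alias definition, or not found
    esac
}
```

A script could then fall back to another approach when printf_kind does not print builtin.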

Solution 2

Kusalananda’s answer explains why ARG_MAX isn’t an issue with shell built-ins.

As for implementing cat * in a way that’s not affected by ARG_MAX, doing so is trivial: all the cat implementation needs to do is use glob(3) to perform its own globbing, and you’d then run it as cat \* or cat '*' so that the shell doesn’t do its own globbing. A few commands on a Linux or Unix-style system can take care of their own globbing, at least in certain circumstances: find, tar, zip, etc. Many commands with native DOS versions also include code to handle globbing, since the shells there don’t glob external commands’ arguments themselves.

Given POSIX shell expectations, that feature would be rather surprising and hard to discover! In early Unix versions, globbing was implemented using a separate program, /etc/glob.
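
The idea of a command that receives the pattern unexpanded and globs it itself can be imitated with a shell function. This is only a sketch of the concept, not the glob(3)-based cat described above: globcat is a hypothetical name, it handles a single pattern, and it runs one cat per file, so no exec() call ever sees the full list.

```shell
# globcat: hypothetical stand-in for a cat that does its own globbing.
# The body runs in a subshell ( ... ) so the IFS change does not leak out.
globcat() (
    IFS=                    # disable field splitting; pathname expansion still happens
    for f in $1; do         # unquoted on purpose: the pattern is expanded here
        if [ -f "$f" ]; then
            cat -- "$f"     # one file per exec(), far below ARG_MAX
        fi
    done
)
```

The pattern has to be quoted by the caller, for example globcat '*' | sort -u > /tmp/bla.txt.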


All solutions were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
