Can you handle an argument?

TL;DR

This post explores some of the darker corners of command-line parsing that some may be unaware of.

You might want to grab a coffee.

Intro

No, I’m not questioning your debating skills, I’m referring to parsing command-lines!

Parsing command-line options is something most programmers need to deal with at some point. Every language of note provides some sort of facility for handling command-line options. All a programmer needs to do is skim read the docs or grab the sample code, tweak to taste, et voila!

But is it that simple? Do you really understand what is going on? I would suggest that most programmers really don’t think that much about it. Handling the parsing of command-line options is just something you bolt on to your codebase. And then you move on to the more interesting stuff. Yes, it really does tend to be that easy and everything just works… most of the time.

Most? I hit an interesting issue recently which expanded in scope somewhat. It might raise an eyebrow for some or be a minor bombshell for others.

Back-story

utfout

Back in the mists of time (~2012), I wrote a simple CLI utility in C called utfout. utfout is a simple tool that basically produces output. It’s like echo(1) or printf(3), but maybe slightly better ;)

Unsurprisingly, utfout uses the ubiquitous and venerable getopt(3) library function to parse the command-line. Specifically, utfout relies on getopt(3) to:

  • Parse the command-line arguments in strict order.
  • Handle multiple identical options as and when they occur.
  • Handle positional (non-option) arguments.

(Note: We’re going to come back to the term “in strict order” later. But, for now, let’s move on).

One interesting aspect of utfout is that it allows the specification of a repeat value so you can do something like this to display “hello” three times:

$ utfout "hello" -r 2
hellohellohello

That -r repeat option takes an integer as the repeat value. But the integer can be specified as -1 meaning “repeat forever”. Looking at such a command-line we have:

$ utfout "hello\n" -r -1

We’re going to come back to examples like this later. For now, just remember that this option can accept a numeric (negative) value.

rout

Recently, I decided to rewrite utfout in rust. Hence, rout was born (well, it’s almost been born: I’m currently writing a test suite for it, but I should be releasing it soon).

When I started working on rout, I looked for a rust command-line argument parsing crate that had the semantics of getopt(3). Although there are getopt() clones, I wanted something a little more “rust-like”. The main contenders didn’t work out for various reasons (I’ll come back to this a little later), and since I was looking for an excuse to write some more rust, I decided, in the best tradition of basically every programmer ever, to reinvent the wheel and write my own. This was fun. But more than that, I uncovered some interesting behavioural points that may be unknown to many. More on this later.

I soon had some command-line argument parsing code and, since it turned out to be useful to me, I published it as my first rust crate. It’s called ap for “argument parser”. Not a very creative name maybe, but succinct and simple, like the crate itself.

By this stage, the rout codebase was coming along nicely and it was time to add the CLI parsing. But when I added ap to rout and tried running the repeat command (-r -1), it failed. The problem? ap was assuming that -1 was a command-line option, not an option argument. Silly bug, right? Err, yes and no. Read on for an explanation!

getopt minutiae

It may not be common knowledge, but getopt(3), and in fact most argument parsing packages, support numeric option names. If you haven’t read the back-story, this means getopt(3) supports options like -7, which might be a short-hand for the long option --enable-lucky-seven-mode (whatever that means ;). And so to our first revelation:

Revelation #1:

getopt(3) supports any ASCII character as an option name, except -, ; and :.

In fact, it’s a little more subtle than that: although you can create an option called +, it cannot be the first character in optstring.

If you didn’t realise this, don’t feel too bad! You need to squint a bit when reading the man page to grasp this point, since it is almost painfully opaque on the topic of what constitutes a valid option character. Quoting verbatim from getopt(3):

optstring is a string containing the legitimate option characters.

Aside from the fact that the options are specified as a const char *, yep, that is your only clue! The FreeBSD man page is slightly clearer, but I would still say not clear enough personally. Yes, you could read the source, but I’ll warn you now, it’s not pretty or easy to grok!

But let this sink in: you can use numeric option names!

The more astute reader may be hearing faint alarm bells ringing at this point. Not to worry if that’s not you as I’ll explain all later.

An easy way to test getopt behaviour

I’ve created a simple C program called test_getopt.c that allows you to play with getopt(3) without having to create lots of test programs, or recompile a single program constantly as you tweak it.

The program allows you to specify the optstring as the first command-line argument with all subsequent arguments being passed to getopt(3).

See the README for some examples.

Real-world evidence

If you’ve ever run the ss(1) or socat(1) commands, you may have encountered numeric options as both commands accept -4 and -6 options to denote IPv4 and IPv6 respectively. I’m reasonably sure I’ve also seen a command use -# as an option but cannot remember which.

The ap bug

The real bug in ap was that it was prioritising options over argument order: it was not parsing “in strict order”.

Parsing arguments in strict order

Remember we mentioned parsing arguments “in strict order” earlier? Well “in strict order” doesn’t just mean that arguments are parsed sequentially in the order presented (first, second, third, etc), it also means that option arguments will be consumed by the “parent” (aka previous) option, regardless of whether the option argument starts with a dash or not. It’s beautifully simple and logical and crucially results in zero ambiguity for getopt(3).

To explain this, imagine your program calls getopt() like this:

getopt(argc, argv, "12:");

The program could then be invoked in any of the following ways:

$ prog -1
$ prog -2 foo

But: it could also be called like this:

$ prog -2 -1

getopt(3) parses this happily: there is no error and no ambiguity. As far as getopt(3) is concerned the user specified the -2 option passing it the value of -1. To be clear, as far as getopt(3) is concerned, the -1 option was not specified!

Revelation #2:

In argument parsing, “in strict order” means the parser considers each argument in sequential order and if a command-line argument is an option and that option requires an argument, the next command-line argument will become that option’s argument, regardless of whether it starts with a dash or not!

Going back to the revelation: consuming the next argument after an option requiring a value is a brilliantly simple design. It’s also easy to implement. And since getopt(3) is part of the POSIX standard, it’s actually the behaviour you should be expecting from a command-line parser, at least if you started out as a systems programmer. But since the details of this parsing behaviour have been somewhat shrouded in mystery, you may not be aware that you should be expecting such behaviour from other parsers!

But, alas, POSIX or not, this behaviour isn’t necessarily intuitive (see above) and indeed this is not how all command-line parsers work.

Summary of command-line argument parsers

As my curiosity was now piqued, I decided to do a quick survey of command-line parsing packages for a variety of languages. This is in no way complete and I’ve missed out many languages and packages. But it’s an interesting sample nonetheless.

The table below summarises the behaviour for various languages and parsing libraries:

language  library/package        strict ordering?
bash      getopts                Yes (uses getopt(3))
C/C++     getopt(3)              Yes (POSIX standard! ;)
Go        cli                    Yes
java      apache-commons-cli     Yes
lua       argparse               No
perl      Getopt::Std            Yes
python    argparse               No
ruby      optparse               Yes
rust      ap                     Yes
rust      clap (v2+v3)           No
swift     swift-argument-parser  No
zsh       getopts                Yes (uses getopt(3))

Note:

The libraries that do not use strict ordering (aka “the getopt way”) are not wrong or broken, they just work slightly differently! As long as you are aware of the difference, there is no problem ;)

Why are some libraries different?

It comes down to how the command-line arguments are parsed by the package.

Assume the library has just read an argument and determined definitively that it is an option and that the option requires a value. It then reads the next argument:

  • If the library is like getopt(3), it will just consume the argument as the value for the just-seen option (regardless of whether the argument starts with a dash or not).

  • Alternatively, if the library is not like getopt(3) and this new argument starts with a dash, the library will consider it an option and then raise an error, since the previous argument (the option) was expecting a value.

    The subtlety here is that “getopt()-like” implementations allow option values to look like options, which may surprise you.

So what?

We’ve had two revelations:

  1. Most argument parsers support numeric option names.
  2. Strict argument parsing means consuming the next argument, even if it starts with a dash.

You may be envisaging some of the potential problems now:

“What if my program accepts a numeric option and also has an option that accepts a numeric argument?”

There is also the slightly more subtle issue:

“What if my program has a flag option and also has an option that can accept a free-form string value?”

Indeed! Here be dragons! To make these problems clearer, we’re going to look at some examples.

Example 1: Missile control

Imagine an evil and powerful tech-savvy despot asks his minions to write a CLI program for him to launch missiles. The program uses getopt(3) with an optstring of “12n:” so that he can launch a single missile (-1), two missiles (-2), or lots (-n <count>).

Here’s how he could do his evil work:

$ fire-missiles -1
Firing 1 missile!
$ fire-missiles -2
Firing 2 missiles!

Unfortunately, the poor programmer who wrote this program didn’t check the inputs correctly. Here’s what happens when the despot decides to fire a single missile, but maybe in a drunken stupor / tab-complete fail, runs the following by mistake:

$ fire-missiles -n -1
Firing 4294967295 missiles!

He meant to run fire-missiles -1 (or indeed fire-missiles -n 1), but got confused and appears to have started Armageddon by mistake: getopt(3) dutifully handed -1 to the -n option as its value, and the program stored that negative number in an unsigned 32-bit integer without checking it.

Example 2: Get rich quick or get fired?

Another example. Imagine a program used to transfer money between banks by allowing the admin to specify two IBANs (International Bank Account Numbers), an amount and a transaction summary field. Here are the arguments the program will accept:

  • -f <IBAN>: Source account.
  • -t <IBAN>: Destination account.
  • -a <amount>: Amount of money to transfer (let’s ignore things like different currencies and exchange rates to keep it simple).
  • -s <text>: Human readable summary of the transaction.
  • -d: Dry-run mode - don’t actually send the money, just show what would be done.

We could use it to send 100 units of currency like this:

$ prog -f me -t you -a 100 -s test

For this program we specify a getopt(3) optstring of “a:df:s:t:”. Fine. But using strict ordering, if I run the program as follows, I’ll probably get fired!

$ prog -f me -t you -a 10000000000 -s -d

Oops! I meant to specify a summary, but I forgot. But hey, that’s fine as I specified to run this in dry-run mode using -d. Oh. Wait a second…

Yes, I’m in trouble: the money was sent because I didn’t actually enable dry-run mode. Due to the strict argument parsing semantics of getopt(3), I specified a summary of “-d” instead.

Example 3: Something to give you nightmares

Using the knowledge of the revelations, you can easily contrive some real horrors. Take, for example, the following abomination:

$ prog -12 3 -4 -5 -67 8 -9

How is that parsed? Is that first -12 argument a simple negative number? Or is it actually a -1 option with the option argument value 2? Or is it a -1 option and a -2 option “bundled” together?

The answer of course depends on how you’ve defined the optstring value to getopt(3). But please, please never write programs with interfaces like this! ;)

You can use the test_getopt.c program to test out various ways of parsing that horrid command-line. For example, one way to handle them might be like this:

$ test_getopt "1::45:9" -12 3 -4 -5 -67 8 -9
INFO: getopt option: '1' (optarg: '2', optind: 2, opterr: 1, optopt: 0)
INFO: getopt option: '4' (optarg: '', optind: 4, opterr: 1, optopt: 0)
INFO: getopt option: '5' (optarg: '-67', optind: 6, opterr: 1, optopt: 0)
INFO: getopt option: '9' (optarg: '', optind: 8, opterr: 1, optopt: 0)

But alternatively, it could be parsed like this:

$ test_getopt "12:4567:9" -12 3 -4 -5 -67 8 -9
INFO: getopt option: '1' (optarg: '', optind: 1, opterr: 1, optopt: 0)
INFO: getopt option: '2' (optarg: '3', optind: 3, opterr: 1, optopt: 0)
INFO: getopt option: '4' (optarg: '', optind: 4, opterr: 1, optopt: 0)
INFO: getopt option: '5' (optarg: '', optind: 5, opterr: 1, optopt: 0)
INFO: getopt option: '6' (optarg: '', optind: 5, opterr: 1, optopt: 0)
INFO: getopt option: '7' (optarg: '8', optind: 7, opterr: 1, optopt: 0)
INFO: getopt option: '9' (optarg: '', optind: 8, opterr: 1, optopt: 0)

Aside

Coincidentally, by combining test_getopt with utfout, you can prove Revelation #1 rather simply:

$ (utfout -a "\n" "\{\x21..\x7e}"; echo) |\
  while IFS= read -r char
  do
      test_getopt "x$char" -"$char"
  done

Note: The leading “x” in the specified optstring argument is to avoid having to special case the string since the first character is “special” to getopt(3). See the man page for further details.

Summary

Admittedly, these are very contrived (and hopefully unrealistic!) examples. The missile control example is also a very poor use of getopt(3) since in this scenario, a simple check on argv[1] would be sufficient to determine how many missiles to fire. However, you can now see the potential pitfalls of numeric options and strict argument parsing.

To test a parser

If you want to establish if your chosen command-line parsing library accepts numeric options and if it parses in strict order, create a program that:

  • Accepts a -1 flag option (an option that does not require an argument).

  • Accepts a -2 argument option (an option that does require an argument).

Then run the program as follows:

    $ prog -2 -1

If the program succeeds (and sets the value for your -2 option to -1), your parser is “getopt()-like” (it is parsing in strict order) and implicitly also supports numeric options.

Conclusions

Here’s what we’ve unearthed:

  • The getopt(3) man page on Linux is currently ambiguous.

    I wrote a patch to resolve this, and the patch has been accepted. Hopefully it will land in the next release of the man pages project.

  • All command-line parsing packages should document precisely how they consume arguments.

    Unfortunately, most don’t say anything about it! However, ap does. Specifically, see the documentation here.

  • getopt(3) doesn’t just support alphabetic option names: a name can be almost any ASCII character (-3, -%, -^, -+, etc).

  • Numeric options should be used with caution as they can lead to ambiguity; not for getopt(3) et al, but for the end user running the program. Worst case, there could be security implications.

  • Permitting negative numeric option values should also be considered carefully. Rather than supporting -r -1, it would be safer if utfout and rout required the repeat count to be >= 1 and if the user wants to repeat forever, support -r max or -r forever rather than -r -1.

  • Some modern command-line parsers prioritise options over argument ordering (meaning they are not “getopt()-like”).

  • You should understand how your chosen parser works before using it.

  • Parsing arguments “in strict order” does not only mean “in sequential order”: it means the parser prioritises argument order over option detection, so an option that requires a value always consumes the next argument.

  • If your chosen parsing package prioritises arguments over options (like getopt(3)), you need to take care if you use numeric options since arguments will be consumed “greedily” (and silently).

  • If your chosen parsing package prioritises options over arguments, you will probably be safer (since an incorrect command-line will generate an error), but you need to be aware that the package is not “getopt()-like”.

  • A CLI program must validate all command-line option values; command-line argument parsers provide a way for users to inject data into a program, so a wise programmer will always be paranoid!

  • The devil is in the detail ;)
