Can you handle an argument?
TL;DR
This post explores some of the darker corners of command-line parsing that some may be unaware of.
You might want to grab a coffee.
Intro
No, I’m not questioning your debating skills, I’m referring to parsing command-lines!
Parsing command-line options is something most programmers need to deal with at some point. Every language of note provides some sort of facility for handling command-line options. All a programmer needs to do is skim read the docs or grab the sample code, tweak to taste, et voila!
But is it that simple? Do you really understand what is going on? I would suggest that most programmers really don’t think that much about it. Handling the parsing of command-line options is just something you bolt on to your codebase. And then you move onto the more interesting stuff. Yes, it really does tend to be that easy and everything just works… most of the time.
Most? I hit an interesting issue recently which expanded in scope somewhat. It might raise an eyebrow for some or be a minor bomb-shell for others.
Back-story
utfout
Back in the mists of time (~2012), I wrote a simple CLI utility in C called utfout. utfout is a simple tool that basically produces output. It’s like echo(1) or printf(3), but maybe slightly better ;)
Unsurprisingly, utfout uses the ubiquitous and venerable getopt(3) library function to parse the command-line. Specifically, utfout relies on getopt(3) to:
- Parse the command-line arguments in strict order.
- Handle multiple identical options as and when they occur.
- Handle positional (non-option) arguments.
(Note: We’re going to come back to the term “in strict order” later. But, for now, let’s move on).
One interesting aspect of utfout is that it allows the specification of a repeat value so you can do something like this to display “hello” three times:
$ utfout "hello" -r 2
hellohellohello
That -r repeat option takes an integer as the repeat value. But the integer can be specified as -1 meaning “repeat forever”. Looking at such a command-line we have:
$ utfout "hello\n" -r -1
We’re going to come back to examples like this later. For now, just remember that this option can accept a numeric (negative) value.
rout
Recently, I decided to rewrite utfout in Rust. Hence, rout was born (well, it’s almost been born: I’m currently writing a test suite for it, but I should be releasing it soon).
When I started working on rout, I looked for a Rust command-line argument parsing crate that had the semantics of getopt(3). Although there are getopt() clones, I wanted something a little more “Rust-like”. The main contenders didn’t work out for various reasons (I’ll come back to this a little later), and since I was looking for an excuse to write some more Rust, I decided, in the best tradition of basically every programmer ever, to reinvent the wheel and write my own. This was fun. But more than that, I uncovered some interesting behavioural points that may be unknown to many. More on this later.
I soon had some command-line argument parsing code and since it ended up being useful to me, I published it as my first Rust crate. It’s called ap for “argument parser”. Not a very creative name maybe, but succinct and simple, like the crate itself.
By this stage, the rout codebase was coming along nicely and it was time to add the CLI parsing. But when I added ap to rout and tried running the repeat command (-r -1), it failed. The problem? ap was assuming that -1 was a command-line option, not an option argument. Silly bug right? Err, yes and no. Read on for an explanation!
getopt minutiae
It may not be common knowledge, but getopt(3), and in fact most argument parsing packages, provide support for numeric option names. If you haven’t read the back-story, this means it supports options like -7 which might be a short-hand for the long option --enable-lucky-seven-mode (whatever that means ;) And so to our first revelation:
Revelation #1:
getopt(3) supports any ASCII option name that is not -, ; or :.
In fact, it’s a little more subtle than that: although you can create an option called +, it cannot be the first character in optstring.
If you didn’t realise this, don’t feel too bad! You need to squint a bit when reading the man page to grasp this point, since it is almost painfully opaque on the topic of what constitutes a valid option character. Quoting verbatim from getopt(3):
optstring is a string containing the legitimate option characters.
Aside from the fact that the options are specified as a const char *, yep, that is your only clue! The FreeBSD man page is slightly clearer, but I would still say not clear enough personally. Yes, you could read the source, but I’ll warn you now, it’s not pretty or easy to grok!
But let this sink in: you can use numeric option names…
The more astute reader may be hearing faint alarm bells ringing at this point. Not to worry if that’s not you as I’ll explain all later.
An easy way to test getopt behaviour
I’ve created a simple C program called test_getopt.c that allows you to play with getopt(3) without having to create lots of test programs, or recompile a single program constantly as you tweak it.
The program allows you to specify the optstring as the first command-line argument with all subsequent arguments being passed to getopt(3).
See the README for some examples.
Real-world evidence
If you’ve ever run the ss(1) or socat(1) commands, you may have encountered numeric options as both commands accept -4 and -6 options to denote IPv4 and IPv6 respectively. I’m reasonably sure I’ve also seen a command use -# as an option but cannot remember which.
The ap bug
The real bug in ap was that it was prioritising options over argument order: it was not parsing “in strict order”.
Parsing arguments in strict order
Remember we mentioned parsing arguments “in strict order” earlier? Well “in strict order” doesn’t just mean that arguments are parsed sequentially in the order presented (first, second, third, etc), it also means that option arguments will be consumed by the “parent” (aka previous) option, regardless of whether the option argument starts with a dash or not. It’s beautifully simple and logical and crucially results in zero ambiguity for getopt(3).
To explain this, imagine your program calls getopt() like this:
getopt(argc, argv, "12:");
The program could then be invoked in any of the following ways:
$ prog -1
$ prog -2 foo
But: it could also be called like this:
$ prog -2 -1
getopt(3) parses this happily: there is no error and no ambiguity. As far as getopt(3) is concerned the user specified the -2 option passing it the value of -1. To be clear, as far as getopt(3) is concerned, the -1 option was not specified!
Revelation #2:
In argument parsing, “in strict order” means the parser considers each argument in sequential order and if a command-line argument is an option and that option requires an argument, the next command-line argument will become that option’s argument, regardless of whether it starts with a dash or not!
Going back to the revelation. Consuming the next argument after an option requiring a value is a brilliantly simple design. It’s also easy to implement. And since getopt(3) is part of the POSIX standard, it’s actually the behaviour you should be expecting from a command-line parser, at least if you started out as a systems programmer. But since the details of this parsing behaviour have been somewhat shrouded in mystery, you may not be aware that you should be expecting such behaviour from other parsers!
But, alas, POSIX or not, this behaviour isn’t necessarily intuitive (see above) and indeed this is not how all command-line parsers work.
Summary of command-line argument parsers
As my curiosity was now piqued, I decided to do a quick survey of command-line parsing packages for a variety of languages. This is in no way complete and I’ve missed out many languages and packages. But it’s an interesting sample nonetheless.
The table below summarises the behaviour for various languages and parsing libraries:
language | library/package | strict ordering? |
---|---|---|
bash | getopts | Yes (uses getopt(3)) |
C/C++ | getopt(3) | Yes (POSIX standard! ;) |
Go | cli | Yes |
java | apache-commons-cli | Yes |
lua | argparse | No |
perl | Getopt::Std | Yes |
python | argparse | No |
ruby | optparse | Yes |
rust | ap | Yes |
rust | clap (v2+v3) | No |
swift | swift-argument-parser | No |
zsh | getopts | Yes (uses getopt(3)) |
Note:
The libraries that do not use strict ordering (aka “the getopt way”) are not wrong or broken, they just work slightly differently! As long as you are aware of the difference, there is no problem ;)
Why are some libraries different?
It comes down to how the command-line arguments are parsed by the package.
Assume the library has just read an argument and determined definitively that it is an option and that the option requires a value. It then reads the next argument:
- If the library is like getopt(3), it will just consume the argument as the value for the just-seen option (regardless of whether the argument starts with a dash or not).
- Alternatively, if this new argument starts with a dash, the library will consider it an option and then error since the previous argument (the option) was expecting a value.
The subtlety here is that “getopt()-like” implementations allow option values to look like options, which may surprise you.
So what?
We’ve had two revelations:
- Most argument parsers support numeric option names.
- Strict argument parsing means consuming the next argument, even if it starts with a dash.
You may be envisaging some of the potential problems now:
“What if my program accepts a numeric option and also has an option that accepts a numeric argument?”
There is also the slightly more subtle issue:
“What if my program has a flag option and also has an option that can accept a free-form string value?”
Indeed! Here be dragons! To make these problems clearer, we’re going to look at some examples.
Example 1: Missile control
Imagine an evil and powerful tech-savvy despot asks his minions to write a CLI program for him to launch missiles. The program uses getopt(3) with an optstring of “12n:” so that he can launch a single missile (-1), two missiles (-2), or lots (-n <count>).
Here’s how he could do his evil work:
$ fire-missiles -1
Firing 1 missile!
$ fire-missiles -2
Firing 2 missiles!
Unfortunately, the poor programmer who wrote this program didn’t check the inputs correctly. Here’s what happens when the despot decides to fire a single missile, but maybe in a drunken stupor / tab-complete fail, runs the following by mistake:
$ fire-missiles -n -1
Firing 4294967295 missiles!
He meant to run fire-missiles -1 (or indeed fire-missiles -n 1), but got confused and appears to have started Armageddon by mistake since the program treated the -n option value as an unsigned integer, so -1 wrapped around to 4294967295.
Example 2: Get rich quick or get fired?
Another example. Imagine a program used to transfer money between banks by allowing the admin to specify two IBAN (International Bank Account Number) numbers, an amount and a transaction summary field. Here are the arguments the program will accept:
- -f <IBAN>: Source account.
- -t <IBAN>: Destination account.
- -a <amount>: Amount of money to transfer (let’s ignore things like different currencies and exchange rates to keep it simple).
- -s <text>: Human readable summary of the transaction.
- -d: Dry-run mode - don’t actually send the money, just show what would be done.
We could use it to send 100 units of currency like this:
$ prog -f me -t you -a 100 -s test
For this program we specify a getopt(3) optstring of “a:df:s:t:”. Fine. But using strict ordering, if I run the program as follows, I’ll probably get fired!
$ prog -f me -t you -a 10000000000 -s -d
Oops! I meant to specify a summary, but I forgot. But hey, that’s fine as I specified to run this in dry-run mode using -d. Oh. Wait a second…
Yes, I’m in trouble because the money was sent as in fact I didn’t specify to run in dry-run mode: I specified a summary of “-d” due to the strict argument parsing semantics of getopt(3).
Example 3: Something to give you nightmares
Using the knowledge of the revelations, you can easily contrive some real horrors. Take, for example, the following abomination:
$ prog -12 3 -4 -5 -67 8 -9
How is that parsed? Is that first -12 argument a simple negative number? Or is it actually a -1 option with the option argument value 2? Or is it a -1 option and a -2 option “bundled” together?
The answer of course depends on how you’ve defined the optstring value to getopt(3). But please, please never write programs with interfaces like this! ;)
You can use the test_getopt.c program to test out various ways of parsing that horrid command-line. For example, one way to handle them might be like this:
$ test_getopt "1::45:9" -12 3 -4 -5 -67 8 -9
INFO: getopt option: '1' (optarg: '2', optind: 2, opterr: 1, optopt: 0)
INFO: getopt option: '4' (optarg: '', optind: 4, opterr: 1, optopt: 0)
INFO: getopt option: '5' (optarg: '-67', optind: 6, opterr: 1, optopt: 0)
INFO: getopt option: '9' (optarg: '', optind: 8, opterr: 1, optopt: 0)
But alternatively, it could be parsed like this:
$ test_getopt "12:4567:9" -12 3 -4 -5 -67 8 -9
INFO: getopt option: '1' (optarg: '', optind: 1, opterr: 1, optopt: 0)
INFO: getopt option: '2' (optarg: '3', optind: 3, opterr: 1, optopt: 0)
INFO: getopt option: '4' (optarg: '', optind: 4, opterr: 1, optopt: 0)
INFO: getopt option: '5' (optarg: '', optind: 5, opterr: 1, optopt: 0)
INFO: getopt option: '6' (optarg: '', optind: 5, opterr: 1, optopt: 0)
INFO: getopt option: '7' (optarg: '8', optind: 7, opterr: 1, optopt: 0)
INFO: getopt option: '9' (optarg: '', optind: 8, opterr: 1, optopt: 0)
Aside
Coincidentally, by combining test_getopt with utfout, you can prove Revelation #1 rather simply:
$ (utfout -a "\n" "\{\x21..\x7e}"; echo) |\
  while read char
  do
    test_getopt "x$char" -"$char"
  done
Note: The leading “x” in the specified optstring argument is to avoid having to special case the string since the first character is “special” to getopt(3). See the man page for further details.
Summary
Admittedly, these are very contrived (and hopefully unrealistic!) examples. The missile control example is also a very poor use of getopt(3) since in this scenario, a simple check on argv[1] would be sufficient to determine how many missiles to fire. However, you can now see the potential pitfalls of numeric options and strict argument parsing.
To test a parser
If you want to establish if your chosen command-line parsing library accepts numeric options and if it parses in strict order, create a program that:
- Accepts a -1 flag option (an option that does not require an argument).
- Accepts a -2 argument option (that does accept an argument).
Run the program as follows:
$ prog -2 -1
If the program succeeds (and sets the value for your -2 option to -1), your parser is “getopt()-like” (is parsing in strict order) and implicitly also supports numeric options.
Conclusions
Here’s what we’ve unearthed:
- The getopt(3) man page on Linux is currently ambiguous.
  I wrote a patch to resolve this, and the patch has been accepted. Hopefully it will land in the next release of the man pages project.
- All command-line parsing packages should document precisely how they consume arguments.
  Unfortunately, most don’t say anything about it! However, ap does. Specifically, see the documentation here.
- getopt(3) doesn’t just support alphabetic option names: a name can be almost any ASCII character (-3, -%, -^, -+, etc).
- Numeric options should be used with caution as they can lead to ambiguity; not for getopt(3) et al, but for the end user running the program. Worst case, there could be security implications.
- Permitting negative numeric option values should also be considered carefully. Rather than supporting -r -1, it would be safer if utfout and rout required the repeat count to be >= 1 and if the user wants to repeat forever, support -r max or -r forever rather than -r -1.
- Some modern command-line parsers prioritise options over argument ordering (meaning they are not “getopt()-like”).
  You should understand how your chosen parser works before using it.
- Parsing arguments “in strict order” does not only mean “in sequential order”: it also means that an option requiring a value consumes the next argument, even if that argument looks like an option.
- If your chosen parsing package prioritises arguments over options (like getopt(3)), you need to take care if you use numeric options since arguments will be consumed “greedily” (and silently).
- If your chosen parsing package prioritises options over arguments, you will probably be safer (since an incorrect command-line will generate an error), but you need to be aware that the package is not “getopt()-like”.
- A CLI program must validate all command-line option values; command-line argument parsers provide a way for users to inject data into a program, so a wise programmer will always be paranoid!
The devil is in the detail ;)