How to get nothing - approaches to handle out-of-bounds access
2024-03-01Below is a snippet of pseudocode. What should it do, in your opinion? What’s the value of x
, if it
even has one?
list = [ 1, 2, 3, 4, 5 ]
x = item 10 of list
In the semantic vacuum of psuedocode, this is a difficult question to answer! How a language handles out-of-bounds container accesses can often set the scene for its wider semantics.
What options are there, how do they compare? Let’s take a look at the approaches chosen by different programming languages to handle this, for their common container types like lists and dictionaries. I’ll be focusing mostly on mainstream programming languages, though I’m sure there are novel approaches to be found on the cutting edge of language design. I’ll also share some of my opinions on them, but yours may well differ, and I’m always interested to hear those too!
Throw an exception
In many of the classic object-oriented programming languages, this is the default approach. They view out-of-bounds access as something that shouldn’t happen - you could say, an exception-al circumstance 😉
For example, in C#:
List<int> list = [ 1, 2, 3, 4, 5 ];
var x = list[10];
// System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative
// and less than the size of the collection.
// Parameter name: index
Python also uses this approach:
list = [ 1, 2, 3, 4, 5 ]
x = list[10]
# IndexError: list index out of range
Interestingly, this fits into a common Python programming idiom that it’s
“easier to ask for forgiveness than permission”.
If you would like to get an item from a dictionary, or do something else if it isn’t present,
you may choose to handle this with a try
/except
block rather than checking for the item in
advance:
def get_high_score(self, player):
try:
return self.high_scores[player]
except KeyError:
return 0
I must admit I find this kind of code structure a little unintuitive, and easier for mistakes to
slip in. If this method needs to be extended later, it’s all too easy to add additional code inside
the try
statement, without considering whether the except
is appropriate for that new code.
To provide an alternative, Python slightly breaks its
zen rule “there should be one– and preferably only one
–obvious way to do it” by providing a dict.get
method which can take a default instead:
def get_high_score(self, player):
return self.high_scores.get(player, 0)
In my opinion, this is more understandable, but also comes with some additional cognitive load.
Most languages use []
to access container items, so when I’m scanning code, I can see [...]
and
immediately recognise that’s what’s happening. Here, my brain takes a split-second longer to parse a
regular method call, and realise that this is still in fact accessing the collection, just with
different semantics.
Return a null value
This is a common approach in many dynamically-typed scripting languages, such as JavaScript or Ruby.
If you have an object in your language which represents the absence of a value - null
, nil
,
undefined
, etc - then it could make sense to return that when requesting an item which doesn’t
exist.
For example, in Ruby:
items = [ 1, 2, 3, 4, 5 ]
x = items[10]
# => nil
JavaScript’s choice to return undefined
has a notable effect on the rest of the language too.
JavaScript objects are effectively just dictionaries; a.b
is semantically equivalent to a['b']
.
So accessing a property which doesn’t exist will also return undefined
, rather than throwing an
exception like in most languages:
document.location.href
// => 'https://aaronc.cc'
document.location.ref // typo!
// => undefined
I think that returning a null value can often lead to nice ergonomics for default values, in
combination with logical ||
. You’re still using the standard []
for access, but with an
additional expression prepended, which I’d say helps the semantics scan more easily:
def get_high_score(player)
self.high_scores[player] || 0
end
On the other hand, it can lead to issues which are trickier to debug - you might unintentionally perform an out-of-bounds access, getting a null value, then pass this around for a while. Eventually, you try to use this value, which throws an exception. Tracing the original source of the out-of-bounds access can be painful in these scenarios!
This approach seen much less frequently in statically-typed languages.
Even TypeScript opts not to accurately model JavaScript’s semantics for arrays here, typing the
array subscript operator
as returning T
rather than T?
. Unwrapping a nullable type for every container access could
lead to overly-bloated code.
Blow up really hard
What do you do if your language doesn’t have nullability or exceptions? Bring the whole thing crashing down!
Accessing an out-of-bounds index of a Rust collection type will cause a panic, which is an unrecoverable fatal error which terminates the program:
let items = vec![1, 2, 3, 4, 5];
let x = items[10];
// thread 'main' panicked at src/main.rs:3:18:
// index out of bounds: the len is 5 but the index is 10
(Yes, it is technically possible to catch a panic, but not on all platforms, and this is rarely done in practice!)
This tends to be the option seen in statically-typed languages which utilise monadic Result
types
or other “errors as values” approaches instead of exceptions, including Haskell and Go. As with
nullable types, unwrapping a monadic type or checking if err != nil
every time you access a
collection could prove clunky and overly verbose, so these languages tend to circumvent this model
and fatally error for out-of-bounds accesses instead.
Rust does provide Vec::get
to safely get an item as an Option
, rather than panicking - but
this has the same scannability problems as other approaches which don’t use []
, in my opinion.
The Haskell ecosystem’s user-defined operators are a strength here. The built-in !!
operator
indexes with a panic, but various third-party packages define a !?
or !!?
operator for indexing
safely and returning a Maybe
monad. These look close enough to indicate that they serve a
similar purpose at a passing glance, while the type system ensures that you can’t mix them up by
accident.
main :: IO ()
main = putStrLn (show x)
where items = [1, 2, 3, 4, 5]
x = items !! 10
-- Main: Prelude.!!: index too large
import Relude.List
main :: IO ()
main = putStrLn (show x)
where items = [1, 2, 3, 4, 5]
x = items !!? 10
-- Nothing
Who cares!
Alright then, suppose your language doesn’t have exceptions, or nullability everywhere, or idiomatic error objects. What options do you really have left…?
Not a lot, it turns out - enter C! If you’ve used this language for any amount of time, I’m sure you’re well aware that its approach for accessing out-of-bounds elements of an array is “f*ck it”. More formally, it’s undefined behaviour, which means that absolutely anything can happen, and the language standard nor compiler make no guarantees.
Usually it causes a segmentation fault (segfault), the fancy term for trying to access memory which you aren’t allowed to. But sometimes, you might get some bonus data, where everything appears to be working just fine but you’re using incorrect data under the hood.
For example, this program ran with no error on my machine, even though it obviously performs an out-of-bounds access:
int main() {
int items[5] = { 1, 2, 3, 4, 5 };
int x = items[10];
}
$ gcc test.c ; ./a.out
$ echo $?
0
Printing the value with printf
reveals that it’s pulling a number from… somewhere:
$ tail -n2 test.c
printf("%d\n", items[10]);
}
$ gcc test.c ; ./a.out
-2092003104
Using a more outrageous index finally gets us a segmentation fault:
$ tail -n2 test.c
printf("%d\n", items[10000]);
}
$ gcc test.c ; ./a.out
[1] 24087 segmentation fault ./a.out
This approach can lead to swathes of difficult-to-debug problems and even security vulnerabilities. Sure, it’s performant - but it’s really easy to do the wrong thing quickly.
Going one level higher up, let’s take a look at the C++ STL containers. One of the most common is
std::vector
, its dynamically-sized array type. This feels like an opportunity to do some further
validation and throw an exception, especially since C++ has exceptions, unlike C - but no,
accessing out-of-bounds is still undefined behaviour.
Through the magic of undefined behaviour, both clang and GCC seem to compile this to print 0
, even
with the high index that saw a segfault for C:
int main() {
std::vector<int> items = { 1, 2, 3, 4, 5 };
std::cout << items[10000] << std::endl;
}
$ g++ test.c -std=c++11 ; ./a.out
0
Introducing some more complex types into the mix finally gets us a segfault payoff:
class Person {
public:
std::string name, location;
void introduce() {
std::cout << "Hello! I'm " << name << " from " << location << "." << std::endl;
}
};
int main() {
std::vector<Person> people;
std::cout << people[10].introduce() << std::endl;
}
$ g++ test.c ; ./a.out
[1] 25394 segmentation fault ./a.out
This is my least favourite approach of the bunch so far. C has excuses due to its limited error
handling capabilities, but C++ had an opportunity to cut away some of C’s undefined behaviour and
make the default []
accessor use some more safety for its STL types.
In fairness, C++ does have a vector::at
method, which will throw an exception if the index is out
of range. But I’m not convinced this is really a solution, more a workaround. []
is the most
“natural” syntax for most modern-day programmers, and the one which developers who aren’t intimately
familiar with a language may gravitate towards and expect to work “correctly”.
If hardcore C++ developers need a maximally-fast unchecked way of indexing into a container, I’d
rather this was flipped around - []
should bounds-check, with a separate unchecked_at
method to
skip that.
Bonus: adventures in C++ maps
On the topic of C++, let’s talk about its std::map
type. This is a standard key-value store, like
a Python dictionary. It has some interesting behaviour when a key doesn’t exist, due to decisions
on how the core operators of the language work.
We’ve just seen how an out-of-bounds access of an std::vector
is undefined behaviour. Let’s swap
it out for a std::map
instead:
class Person {
public:
std::string name, location;
void introduce() {
std::cout << "Hello! I'm " << name << " from " << location << "." << std::endl;
}
};
int main() {
std::map<std::string, Person> people;
people["Aaron"].introduce();
}
Of course, to be consistent with std::vector
, this will obviously segfau-
$ g++ test.cpp ; ./a.out
Hello! I'm from .
Ah… well. Turns out this is 100% defined behaviour!
C++ has default-constructed an
instance of Person
, with two empty strings for the name
and location
member variables. This
gives us a valid instance of Person
on which to call Person::introduce
, and this new instance
has been saved in the map for later. But why does it work like this!?
Let’s take a moment to think about the properties of std::vector
versus std::map
.
std::vector
needs to be explicitly resized for new indices to exist, and the items at those
indices are preallocated, ready to use. But with a std::map
, it’s permitted to assign to
any key, at any time, in any order, meaning that all given keys are always valid for an assignment.
This property, combined with C++’s pool of operators, means we have to make a compromise which gives
us this unintuitive result. If aaron
is a Person
instance, then this code…
people["Aaron"] = aaron;
…performs two main steps:
- Invoke
Person::operator[]
with the key"Aaron"
, to retrieve a reference to the item - Copy-assign
aaron
into that reference
The invocation of Person::operator[]
doesn’t have any context for how the result will actually
be used! It could be as a “getter” or a “setter”, but the operator implementation doesn’t know this;
it’s just being asked to return a reference.
Just in case the result is being assigned into,
std::map
needs to allocate a new Person
and return a reference to it, which can then be
copy-assigned into. This has a knock-on effect in the “getter” semantics, where requesting a key
which doesn’t exist will give you a default-constructed instance of the type.
These characteristics mean you can’t write a C++ key-value collection which can both create new
assignments and also detect out-of-bounds access through the []
operator, unless you have some
kind of proxy object being returned by your operator[]
implementation.
Rust works around this issue by implementing Index
but not IndexMut
for dictionary types like
HashMap
. This means you can access hashes for immutable access, implying the key already exists,
but not for mutable access. This removes the footgun of silently creating items, but means that
alternative methods must be used for inserting new items.
It seems that dynamic languages have this figured out, though. Both Python and Ruby provide
overridable “index assignment” operators, which can be used specifically for cases like this. For
example, suppose we wanted to implement a Ruby dictionary which throws an exception rather than
returning nil
on a non-existent key:
class Dictionary
def initialize
@items = {}
end
# Indexing operator
def [](key)
raise "no key `#{key}`" unless @items.has_key? key
@items[key]
end
# Index-assignment operator
def []=(key, value)
@items[key] = value
end
end
dict = Dictionary.new
dict["foo"] = 2
dict["foo"] # => 2
dict["bar"] # => no key `bar`
Conclusion
As with many decisions in programming language design, the right choice comes down to exactly how the rest of the language works. I think that, for many of the languages I’ve discussed here, their chosen approach is the right one for that language.
Exceptions are a regular occurrence when developing C# or Java code, and it fits Python’s “ask for forgiveness” mantra, so for those languages throwing an exception makes sense and feels consistent.
In highly-dynamic, “anything-goes” languages like JavaScript and Ruby, returning a null value which can be handled or substituted as desired fits nicely.
Panicking skips the performance overhead of exceptions, and the syntactic burden of unwrapping a result type, and type signatures of container accesses in languages which use this approach clearly indicate that panicking is the only possible behaviour for an out-of-bounds access.
And C… well, it has many other ways to cause undefined behaviour, so I suppose it’s consistent for out-of-bounds access to provide one more!
C++ may be the only decision I disagree with - I feel the STL containers were a good opportunity to add some additional safety to the language.
Hopefully this has served as an interesting insight into the semantics of some different languages, and how they approach what seems like a mundane edge-case!