Aaron Christiansen

How to get nothing - approaches to handle out-of-bounds access

2024-03-01

Below is a snippet of pseudocode. What should it do, in your opinion? What’s the value of x, if it even has one?

list = [ 1, 2, 3, 4, 5 ]
x = item 10 of list

In the semantic vacuum of psuedocode, this is a difficult question to answer! How a language handles out-of-bounds container accesses can often set the scene for its wider semantics.

What options are there, how do they compare? Let’s take a look at the approaches chosen by different programming languages to handle this, for their common container types like lists and dictionaries. I’ll be focusing mostly on mainstream programming languages, though I’m sure there are novel approaches to be found on the cutting edge of language design. I’ll also share some of my opinions on them, but yours may well differ, and I’m always interested to hear those too!

Throw an exception

In many of the classic object-oriented programming languages, this is the default approach. They view out-of-bounds access as something that shouldn’t happen - you could say, an exception-al circumstance 😉

For example, in C#:

List<int> list = [ 1, 2, 3, 4, 5 ];
var x = list[10];
// System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative
// and less than the size of the collection.
// Parameter name: index

Python also uses this approach:

list = [ 1, 2, 3, 4, 5 ]
x = list[10]
# IndexError: list index out of range

Interestingly, this fits into a common Python programming idiom that it’s “easier to ask for forgiveness than permission”. If you would like to get an item from a dictionary, or do something else if it isn’t present, you may choose to handle this with a try/except block rather than checking for the item in advance:

def get_high_score(self, player):
    try:
        return self.high_scores[player]
    except KeyError:
        return 0

I must admit I find this kind of code structure a little unintuitive, and easier for mistakes to slip in. If this method needs to be extended later, it’s all too easy to add additional code inside the try statement, without considering whether the except is appropriate for that new code.

To provide an alternative, Python slightly breaks its zen rule “there should be one– and preferably only one –obvious way to do it” by providing a dict.get method which can take a default instead:

def get_high_score(self, player):
    return self.high_scores.get(player, 0)

In my opinion, this is more understandable, but also comes with some additional cognitive load. Most languages use [] to access container items, so when I’m scanning code, I can see [...] and immediately recognise that’s what’s happening. Here, my brain takes a split-second longer to parse a regular method call, and realise that this is still in fact accessing the collection, just with different semantics.

Return a null value

This is a common approach in many dynamically-typed scripting languages, such as JavaScript or Ruby. If you have an object in your language which represents the absence of a value - null, nil, undefined, etc - then it could make sense to return that when requesting an item which doesn’t exist.

For example, in Ruby:

items = [ 1, 2, 3, 4, 5 ]
x = items[10]
# => nil

JavaScript’s choice to return undefined has a notable effect on the rest of the language too. JavaScript objects are effectively just dictionaries; a.b is semantically equivalent to a['b']. So accessing a property which doesn’t exist will also return undefined, rather than throwing an exception like in most languages:

document.location.href
// => 'https://aaronc.cc'

document.location.ref // typo!
// => undefined

I think that returning a null value can often lead to nice ergonomics for default values, in combination with logical ||. You’re still using the standard [] for access, but with an additional expression prepended, which I’d say helps the semantics scan more easily:

def get_high_score(player)
    self.high_scores[player] || 0
end

On the other hand, it can lead to issues which are trickier to debug - you might unintentionally perform an out-of-bounds access, getting a null value, then pass this around for a while. Eventually, you try to use this value, which throws an exception. Tracing the original source of the out-of-bounds access can be painful in these scenarios!

This approach seen much less frequently in statically-typed languages. Even TypeScript opts not to accurately model JavaScript’s semantics for arrays here, typing the array subscript operator as returning T rather than T?. Unwrapping a nullable type for every container access could lead to overly-bloated code.

Blow up really hard

What do you do if your language doesn’t have nullability or exceptions? Bring the whole thing crashing down!

Accessing an out-of-bounds index of a Rust collection type will cause a panic, which is an unrecoverable fatal error which terminates the program:

let items = vec![1, 2, 3, 4, 5];
let x = items[10];
// thread 'main' panicked at src/main.rs:3:18:
// index out of bounds: the len is 5 but the index is 10

(Yes, it is technically possible to catch a panic, but not on all platforms, and this is rarely done in practice!)

This tends to be the option seen in statically-typed languages which utilise monadic Result types or other “errors as values” approaches instead of exceptions, including Haskell and Go. As with nullable types, unwrapping a monadic type or checking if err != nil every time you access a collection could prove clunky and overly verbose, so these languages tend to circumvent this model and fatally error for out-of-bounds accesses instead.

Rust does provide Vec::get to safely get an item as an Option, rather than panicking - but this has the same scannability problems as other approaches which don’t use [], in my opinion.

The Haskell ecosystem’s user-defined operators are a strength here. The built-in !! operator indexes with a panic, but various third-party packages define a !? or !!? operator for indexing safely and returning a Maybe monad. These look close enough to indicate that they serve a similar purpose at a passing glance, while the type system ensures that you can’t mix them up by accident.

main :: IO ()
main = putStrLn (show x)
  where items = [1, 2, 3, 4, 5]
        x = items !! 10
-- Main: Prelude.!!: index too large

import Relude.List
main :: IO ()
main = putStrLn (show x)
  where items = [1, 2, 3, 4, 5]
        x = items !!? 10
-- Nothing

Who cares!

Alright then, suppose your language doesn’t have exceptions, or nullability everywhere, or idiomatic error objects. What options do you really have left…?

Not a lot, it turns out - enter C! If you’ve used this language for any amount of time, I’m sure you’re well aware that its approach for accessing out-of-bounds elements of an array is “f*ck it”. More formally, it’s undefined behaviour, which means that absolutely anything can happen, and the language standard nor compiler make no guarantees.

Usually it causes a segmentation fault (segfault), the fancy term for trying to access memory which you aren’t allowed to. But sometimes, you might get some bonus data, where everything appears to be working just fine but you’re using incorrect data under the hood.

For example, this program ran with no error on my machine, even though it obviously performs an out-of-bounds access:

int main() {
    int items[5] = { 1, 2, 3, 4, 5 };
    int x = items[10];
}

$ gcc test.c ; ./a.out
$ echo $?
0

Printing the value with printf reveals that it’s pulling a number from… somewhere:

$ tail -n2 test.c
    printf("%d\n", items[10]);
}

$ gcc test.c ; ./a.out
-2092003104

Using a more outrageous index finally gets us a segmentation fault:

$ tail -n2 test.c
    printf("%d\n", items[10000]);
}

$ gcc test.c ; ./a.out
[1]    24087 segmentation fault  ./a.out

This approach can lead to swathes of difficult-to-debug problems and even security vulnerabilities. Sure, it’s performant - but it’s really easy to do the wrong thing quickly.

Going one level higher up, let’s take a look at the C++ STL containers. One of the most common is std::vector, its dynamically-sized array type. This feels like an opportunity to do some further validation and throw an exception, especially since C++ has exceptions, unlike C - but no, accessing out-of-bounds is still undefined behaviour.

Through the magic of undefined behaviour, both clang and GCC seem to compile this to print 0, even with the high index that saw a segfault for C:

int main() {
    std::vector<int> items = { 1, 2, 3, 4, 5 };
    std::cout << items[10000] << std::endl;
}

$ g++ test.c -std=c++11 ; ./a.out
0

Introducing some more complex types into the mix finally gets us a segfault payoff:

class Person {
public:
    std::string name, location;
    void introduce() {
        std::cout << "Hello! I'm " << name << " from " << location << "." << std::endl;
    }
};

int main() {
    std::vector<Person> people;
    std::cout << people[10].introduce() << std::endl;
}

$ g++ test.c ; ./a.out
[1]    25394 segmentation fault  ./a.out

This is my least favourite approach of the bunch so far. C has excuses due to its limited error handling capabilities, but C++ had an opportunity to cut away some of C’s undefined behaviour and make the default [] accessor use some more safety for its STL types.

In fairness, C++ does have a vector::at method, which will throw an exception if the index is out of range. But I’m not convinced this is really a solution, more a workaround. [] is the most “natural” syntax for most modern-day programmers, and the one which developers who aren’t intimately familiar with a language may gravitate towards and expect to work “correctly”.

If hardcore C++ developers need a maximally-fast unchecked way of indexing into a container, I’d rather this was flipped around - [] should bounds-check, with a separate unchecked_at method to skip that.

Bonus: adventures in C++ maps

On the topic of C++, let’s talk about its std::map type. This is a standard key-value store, like a Python dictionary. It has some interesting behaviour when a key doesn’t exist, due to decisions on how the core operators of the language work.

We’ve just seen how an out-of-bounds access of an std::vector is undefined behaviour. Let’s swap it out for a std::map instead:

class Person {
public:
    std::string name, location;
    void introduce() {
        std::cout << "Hello! I'm " << name << " from " << location << "." << std::endl;
    }
};

int main() {
    std::map<std::string, Person> people;
    people["Aaron"].introduce();
}

Of course, to be consistent with std::vector, this will obviously segfau-

$ g++ test.cpp ; ./a.out
Hello! I'm  from .

Ah… well. Turns out this is 100% defined behaviour!

C++ has default-constructed an instance of Person, with two empty strings for the name and location member variables. This gives us a valid instance of Person on which to call Person::introduce, and this new instance has been saved in the map for later. But why does it work like this!?

Let’s take a moment to think about the properties of std::vector versus std::map. std::vector needs to be explicitly resized for new indices to exist, and the items at those indices are preallocated, ready to use. But with a std::map, it’s permitted to assign to any key, at any time, in any order, meaning that all given keys are always valid for an assignment.

This property, combined with C++’s pool of operators, means we have to make a compromise which gives us this unintuitive result. If aaron is a Person instance, then this code…

people["Aaron"] = aaron;

…performs two main steps:

  1. Invoke Person::operator[] with the key "Aaron", to retrieve a reference to the item
  2. Copy-assign aaron into that reference

The invocation of Person::operator[] doesn’t have any context for how the result will actually be used! It could be as a “getter” or a “setter”, but the operator implementation doesn’t know this; it’s just being asked to return a reference.

Just in case the result is being assigned into, std::map needs to allocate a new Person and return a reference to it, which can then be copy-assigned into. This has a knock-on effect in the “getter” semantics, where requesting a key which doesn’t exist will give you a default-constructed instance of the type.

These characteristics mean you can’t write a C++ key-value collection which can both create new assignments and also detect out-of-bounds access through the [] operator, unless you have some kind of proxy object being returned by your operator[] implementation.

Rust works around this issue by implementing Index but not IndexMut for dictionary types like HashMap. This means you can access hashes for immutable access, implying the key already exists, but not for mutable access. This removes the footgun of silently creating items, but means that alternative methods must be used for inserting new items.

It seems that dynamic languages have this figured out, though. Both Python and Ruby provide overridable “index assignment” operators, which can be used specifically for cases like this. For example, suppose we wanted to implement a Ruby dictionary which throws an exception rather than returning nil on a non-existent key:

class Dictionary
    def initialize
        @items = {}
    end

    # Indexing operator
    def [](key)
        raise "no key `#{key}`" unless @items.has_key? key
        @items[key]
    end

    # Index-assignment operator
    def []=(key, value)
        @items[key] = value
    end
end

dict = Dictionary.new
dict["foo"] = 2
dict["foo"] # => 2
dict["bar"] # => no key `bar`

Conclusion

As with many decisions in programming language design, the right choice comes down to exactly how the rest of the language works. I think that, for many of the languages I’ve discussed here, their chosen approach is the right one for that language.

Exceptions are a regular occurrence when developing C# or Java code, and it fits Python’s “ask for forgiveness” mantra, so for those languages throwing an exception makes sense and feels consistent.

In highly-dynamic, “anything-goes” languages like JavaScript and Ruby, returning a null value which can be handled or substituted as desired fits nicely.

Panicking skips the performance overhead of exceptions, and the syntactic burden of unwrapping a result type, and type signatures of container accesses in languages which use this approach clearly indicate that panicking is the only possible behaviour for an out-of-bounds access.

And C… well, it has many other ways to cause undefined behaviour, so I suppose it’s consistent for out-of-bounds access to provide one more!

C++ may be the only decision I disagree with - I feel the STL containers were a good opportunity to add some additional safety to the language.

Hopefully this has served as an interesting insight into the semantics of some different languages, and how they approach what seems like a mundane edge-case!