Don’t underestimate the dot (shared state is dangerous)

Debugging :-). Yesterday it was working. Today it said

  File "/app/conbench/app/benchmarks.py", line 19, in newest_of_many_results
    return max(results, key=lambda r: r.started_at)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence

So, in the following dictionary comprehension, the function newest_of_many_results(bmrlist) was called with an empty list:

    newest_result_by_bname = {
        bname: newest_of_many_results(bmrlist)
        for bname, bmrlist in bmrt_cache["by_benchmark_name"].items()
    }

But bmrt_cache["by_benchmark_name"] cannot have empty lists. I thought. Strong invariant in the code. So, I added a log statement to see the benchmark name that we supposedly have no result for:

        if not bmrlist:
            log.warning("wow, empty list for name: %s", bname)

Now, all this happens as part of handling an HTTP request. I keep hitting F5 in the browser, and sometimes the exception above happens, sometimes not. “meh, now it turns out to even be a race?” Ok, it happened again. Let’s see about the name:

[conbench.app.benchmarks] WARNING: wow, empty list for name: ...

What? Literally ...? That’s a fascinating benchmark name. I was working with a production database snapshot at the time. I could totally see that someone submitted a benchmark result for the benchmark with the name “…”. There’s no strict validation.

But gladly I found another hint in the log. I found the following HTTP server access log line:

"GET /c-benchmarks/... HTTP/1.1" 200 4575 "http://localhost:5000/c-benchmarks/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36" (0.007145 s)

The browser is requesting /c-benchmarks/... as part of loading /c-benchmarks/. I got a suspicion, immediately understood a lot (not everything though). Let’s take this slow.

There is another HTTP handler @app.route("/c-benchmarks/<bname>", methods=["GET"]) whose implementation starts with

    try:
        matching_results = bmrt_cache["by_benchmark_name"][bname]
    except KeyError:
        return f"benchmark name not known: `{bname}`"

As of the HTTP server’s access log this was called with bname == "...", i.e. bmrt_cache["by_benchmark_name"]["..."]. And here is the bug. bmrt_cache["by_benchmark_name"] is a defaultdict(list), and it is shared across HTTP-handling threads. Yikes.

The special behavior of Python’s collections.defaultdict() is to create a new thing (list in this case) upon KeyError. So, this specific way of asking if a key is known is not only wrong. It does not work, and it has a mean, significant side effect: it injects a new key into the dictionary, with the value being an empty list.

Side note: I naively proposed that this might be worth checking for in pylint (see pylint/issues/8666), and maybe that’s going to happen! Cool.

That key injection seems to happen at about the same time as another HTTP handler iterates through the items in the same dictionary.

So far so good, that explains the initial symptom ValueError: max() arg is an empty sequence above.

But: why does my browser request /c-benchmarks/... in the first place? I do a quick Internet search and find myself typing stuff like

chrome appends three dots to url

to no avail. My intuition tells me that something stupid is going on. Key insight: yesterday it worked, today it doesn’t — what changed? We have git diff to tell us. This somewhat comes from experience: before being too frustrated about being punished by the meanest of all mysteries (like a cosmic ray induced bitflip) typically it’s just us doing something incredibly stupid, and we just need to systematically answer “what changed?”.

I looked at the diff and found that I changed an HTML template in an attempt to play with certain “card” component in Bootstrap 5:

      <div class="card">
        <img src="..." class="card-img-top" alt="...">
        <div class="card-body">
          <h5 class="card-title">Card title</h5>
          <p class="card-text">This is a longer card with supporting text below as a natural lead-in to additional content. This content is a little bit longer.</p>
        </div>
      </div>

There you have it, those three dots came straight from an HTML snippet that I copy/pasted ignorantly w/o even looking. It has those three dots in the src attribute of an img tag. That is why the browser performed an HTTP request to /c-benchmarks/....

This is a beautiful example for how sometimes the combination of seemingly unrelated and innocent things creates a funky situation. I am glad I found this bug during local development, and not in production.

You can find these snippets here: https://getbootstrap.com/docs/5.2/components/card/#grid-cards

Leave a Reply

Your email address will not be published. Required fields are marked *

Human? Please fill this out: * Time limit is exhausted. Please reload CAPTCHA.