Too close to see the canvas – Part 3

The Pattern beyond the pixels – Change it up

In the first post we introduced the project and the problem, and in the last post we got into how reactive work never runs out and how stress affects our ability to see beyond the (seemingly) urgent. My second big learning relates to change, seeing possibilities and understanding why the people right in the middle of the problem often can’t.

”It’s impossible” is more often ”I can’t see how”

The main message of this post is this: change is usually possible. But it is perfectly possible that it feels impossible to you, with your current skillset, mandate, and level of expertise in a given area. Those are very different things — and mixing them up is one of the most common reasons good teams stay stuck.

When I started talking to people about the current state, where they saw us heading and my early findings, something became clear quite quickly. People weren’t only stuck in reactive work. They were also stuck in the belief that the current state of affairs was the only possible one – impossible to fix. A lot of the issues I raised were met with a flat ”that’s just how it is.” A lot of these people were very skilled, knew the domain well, did good work. But they lacked the competence, individually or as a team, to see a different way of doing things.

What worried me more was that they didn’t see it as a gap in competence at all. They had moved from ”we don’t know how to fix this” to ”this cannot be fixed.” The jump from one to the other is subtle and sneaky. You try a few things and they don’t work, or maybe you meet resistance from peers or management. You run out of time, money, confidence and trust capital, and gradually the problem solidifies into a wall. You stop questioning it. It becomes just another fact of life.

”Management needs to hire more testers!”
”Developers need to add IDs to all UI objects!”
”We tried fixing the test data issue, it didn’t work.”

So let’s look at some of the areas where I saw clusters of problems — and why they are more fixable than they look.

”We don’t have the infrastructure to do that!”

A substantial share of the flakiness we saw could be traced back to infrastructure and operations. Timing issues — backups, patches, batch jobs, applications going offline at specific hours. Security and network issues — firewalls, DNS, blocked sites, queue behaviour. Things like:

”Oh, our firewall doesn’t allow us to do X, so we needed to do Y to get around it.”
”The tests fail if they start late or run a bit long, because at 3am the backup kicks in for system A.”
”Yes, they crash every third Monday of the month — that’s just patch day.”

All of these are solvable. Some very easily. But if you don’t have any experience with operations, or you feel like you don’t have the mandate to change things, they might look unavoidable. Most of the time they can be sorted by explaining your problem to someone with experience and asking for help to find a solution.

The key, as with so many things, is being able to articulate why it matters. If you can show someone that fixing this will save time, reduce noise, or improve team (or customer) satisfaction, they are usually more than willing to help. Operations teams are not the enemy. They’re just solving different problems from you, and they don’t always know yours exist. Potential solutions range from rescheduling tests and separating domains to reconfiguring hardware or even throwing money at the problem.

In my case: we tweaked the timing of when we ran tests, batches and patches, and found a way to meaningfully reduce the timing conflicts. We got some adjustments to our firewall and security settings that unlocked several things we wanted to do. Some things had to wait for a bigger domain separation project. And we made a deliberate call to simply not run automated tests on nights we knew patches were rolling out — which reduced noise and made our metrics actually readable.
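The decision to skip patch nights and steer clear of the backup window is simple enough to encode in whatever triggers the nightly run. Here is a minimal Python sketch of the idea. The third-Monday patch day and the 3am backup come from the quotes earlier in this post; the one-hour backup length, the function names, and treating the start date as "the patch night" are assumptions made up for illustration.

```python
from datetime import date, datetime, time, timedelta

# Assumed for illustration: the backup starts at 03:00 and takes about an hour.
BACKUP_START = time(3, 0)
BACKUP_LENGTH = timedelta(hours=1)


def is_patch_day(day: date) -> bool:
    """True on the third Monday of the month."""
    # Monday is weekday 0; the third Monday always falls on day 15 to 21.
    return day.weekday() == 0 and 15 <= day.day <= 21


def overlaps_backup(start: datetime, expected_duration: timedelta) -> bool:
    """True if a run starting at `start` would still be going during the backup."""
    end = start + expected_duration
    # Check the backup on the start date and on the following date,
    # so runs that begin before midnight are covered too.
    for offset in (0, 1):
        backup_start = datetime.combine(
            start.date() + timedelta(days=offset), BACKUP_START
        )
        backup_end = backup_start + BACKUP_LENGTH
        if start < backup_end and end > backup_start:
            return True
    return False


def should_run_nightly(start: datetime, expected_duration: timedelta) -> bool:
    """Skip the run on patch days and whenever it would collide with the backup."""
    # Assumption: the patch night is the calendar date the run starts on.
    if is_patch_day(start.date()):
        return False
    return not overlaps_backup(start, expected_duration)
```

A scheduler could call `should_run_nightly(datetime.now(), timedelta(hours=2))` before kicking off the suite and log a clear "skipped on purpose" message otherwise, which keeps deliberate gaps from showing up as failures in the metrics.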

None of that was magic. It was conversations with the right people and a bit of coordination.

Ecosystems are complicated, delicate things

Then there was a big cluster of issues around environments, integrations and data. Comments that fell into this category included:

”The test failed because another test had already used up the data.”
”The run failed last night because someone had the environment locked.”
”That’s not a real bug — Service X was down and had the wrong data.”

These come down to things like concurrency issues, race conditions, data that gets corrupted or used up, limited capacity or number of environments, tests with unnecessary dependencies on external services that may be in the wrong state, or have the wrong version.

These are not trivially fixed. But a lot of the pain can be reduced by practices that are pretty standard today: using synthesised data where possible, removing hard dependencies with mocks and stubs where it makes sense, using virtual environments that can be spun up and torn down as needed, having tests create their own data rather than assuming they’ll find something useful already there. And, bluntly: getting management and operations to support a modern setup. Sometimes the answer is to throw money at it and that is a completely valid solution, especially if you can frame it in terms of what it costs not to do it.
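To make two of those practices concrete, here is a small Python sketch of a test that creates its own data and stubs out an external service. Everything in it is hypothetical and only there for illustration: the in-memory store stands in for a shared test environment, and `greeting_for` and the credit service stand in for real code and a real dependency. It is not how our project was built, just one common shape the idea can take.

```python
import uuid
from unittest.mock import Mock


class InMemoryCustomerStore:
    """Hypothetical stand-in for a data store shared by many tests."""

    def __init__(self):
        self._customers = {}

    def add(self, customer_id, name):
        self._customers[customer_id] = name

    def remove(self, customer_id):
        self._customers.pop(customer_id, None)

    def get(self, customer_id):
        return self._customers.get(customer_id)

    def count(self):
        return len(self._customers)


# Assumption: one store shared by every test, like a shared environment.
SHARED_STORE = InMemoryCustomerStore()


def greeting_for(customer_id, store, credit_service):
    """Hypothetical code under test: needs a store and an external service."""
    name = store.get(customer_id)
    status = credit_service.status(customer_id)
    return f"Hello {name}, your credit status is {status}"


def test_greeting_with_own_data_and_stubbed_service():
    # Create data that belongs to this test only. The random id means two
    # tests running at the same time cannot use up each other's records.
    customer_id = f"test-{uuid.uuid4()}"
    SHARED_STORE.add(customer_id, "Ada")

    # Stub the external service so its uptime and state cannot fail the test.
    credit_service = Mock()
    credit_service.status.return_value = "approved"

    try:
        result = greeting_for(customer_id, SHARED_STORE, credit_service)
        assert result == "Hello Ada, your credit status is approved"
        credit_service.status.assert_called_once_with(customer_id)
    finally:
        # Clean up so nothing is left behind for the next run.
        SHARED_STORE.remove(customer_id)
```

The trade-off is that a stubbed service no longer tells you whether the real integration works, so this kind of test complements a smaller set of true end-to-end checks rather than replacing them.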

But I also found some gems in this pile that were a bit different in nature.

”We didn’t know the code had changed”

I also found communication and process problems wearing the costume of test problems. They might look technical on the surface, but the root cause is somewhere in how people and teams talk to each other — or don’t.

”We didn’t know the code had changed, so we didn’t know the tests needed to be updated!”
”We didn’t have time to fix the tests because we were busy implementing new tests!”
”Oh, that happened because the application in that environment was version X, and we can only test version X.Y.”

Code changes that don’t get communicated to the people maintaining the tests. Changes that do get communicated but nobody has the bandwidth to act on. No mechanism for running different versions of a test suite, so you can’t easily handle both a planned release and an urgent patch at the same time.

Some of this is genuinely hard to fix, depending on the people and organisation involved. But a lot of it is not. The obvious starting point is making sure testing is part of the conversation from the very beginning — not something you bolt on at the end. Practices like review sessions with three different roles involved, pair programming and ensemble programming exist precisely to close these gaps before they become expensive.

Communication breaking down between roles and teams is a strong anti-pattern. When you see it, it should be on everyone’s priority list to fix — not just the testers’. And a surprisingly good first step is to ask the other teams directly: what information would actually help you, and how would you prefer to receive it? That conversation alone tends to improve both relationships and understanding.

One thing that still baffles me a little is how test code is versioned. Version-controlling production code has been standard practice for years. Yet I keep running into situations where test code doesn’t follow the same logic. We should always be able to run the exact version of a test or test suite that matches a given environment or release. Most organisations already have version control in place. The effort to apply it to test code is minimal. There is genuinely no good reason not to. And as a general principle: the closer your automation code lives to your production code, the better.
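As a rough illustration of what "run the version of the tests that matches the environment" can look like, here is a minimal Python sketch. It assumes the tests live in the same git repository as the production code and that every release is tagged; the `release-` tag prefix and the function names are hypothetical, so adjust them to whatever convention you already have.

```python
# Assumed tagging convention, purely for illustration.
TAG_PREFIX = "release-"


def matching_tag(deployed_version, available_tags):
    """Return the tag whose tests belong to the deployed version."""
    wanted = f"{TAG_PREFIX}{deployed_version}"
    if wanted not in available_tags:
        # Fail loudly instead of silently running tests for another version.
        raise LookupError(f"No test suite tagged {wanted}")
    return wanted


def checkout_command(tag):
    """Build the git command that puts the working tree at that tag."""
    return ["git", "checkout", tag]
```

A pipeline step could ask the environment which version it runs, pass the result of `checkout_command` to `subprocess.run`, and only then start the suite. With that in place, a planned release and an urgent patch can each be tested with their own matching tests.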

And tests that you know will fail? Disable them. Running them anyway creates noise, skews your metrics and wastes everyone’s time. Disable them, put a note on why, and focus your testing energy on those areas through other means in the meantime. Or take an informed decision to accept the risk — that’s a valid choice too, as long as it’s actually a decision and not just drift.
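Most test frameworks have a built-in way to do this. As one example, here is a minimal sketch using the `skip` decorator from Python's standard `unittest` module. The test names and the reason text are made up for illustration; the point is that the reason is stored with the result, so the note on why travels with the test.

```python
import unittest


class CheckoutTests(unittest.TestCase):
    # The reason string is reported with the result, so nobody has to guess
    # why this test is not running.
    @unittest.skip("Known failure: waiting for the test data fix; "
                   "covered by exploratory testing in the meantime")
    def test_order_with_expired_card(self):
        self.fail("would fail until the underlying issue is fixed")

    def test_order_with_valid_card(self):
        self.assertEqual(2 + 2, 4)


def run_suite():
    """Run the tests and return the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckoutTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The skipped test shows up in the report as skipped, with its reason, instead of as a red failure. That keeps the metrics readable and leaves a visible reminder that the gap exists.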

The common thread

Looking back at all three clusters — infrastructure, environments and data, communication and process — the pattern is the same. These problems look fixed and permanent from the inside. They have been there long enough that people have stopped questioning them. They have accumulated a kind of unearned authority: this is just how things are here.

But they are not permanent. They are just… stuck. And unsticking them usually requires someone who is a little bit outside of it. Someone who hasn’t yet accepted the unwritten rule that this particular problem isn’t worth fighting. Sometimes that someone walks in from outside. Sometimes it’s someone inside who decides to look up from the work long enough to ask: why, exactly, are we doing it this way?

That’s the thing about being too close to the canvas. Someone has to step back to see the picture.

In part four we will look into something I know lost me an assignment once: metrics and how to make sure you are measuring the right things.

Author

Lena Pejgan Nyström

Lena has been building software in one shape or form since 1999, when she started out as a developer building Windows desktop applications (both impressive and scary: they are still alive and kicking).

She later found her passion for testing and even though her focus has shifted to building organizations and growing people, she is still an active voice in the testing community.

Her core drive is continuous improvement and she strongly believes we all should strive to challenge ourselves, our assumptions and the way things are done.

Lena is the author and creator of “Would Heu-risk it?” (card deck and book), an avid blogger, international keynote speaker and workshop facilitator.