🎙 Develpreneur Podcast Episode


Statistical Bigotry: Avoiding this Anti-Pattern in Software Architecture


2022-03-19 • Season 16 • Episode 554 • Statistical Bigotry in Software Architecture • Podcast

Summary

In this episode, we discuss the anti-pattern of statistical bigotry in software architecture. We explore how this pattern can lead to poor design decisions and how to avoid it by taking the time to understand the problem and the solution.

Detailed Notes

Statistical bigotry is a pattern of behavior where we take small amounts of data or anecdotal data and turn it into something bigger than it is. This can lead to poor design decisions because we're not considering the full context of the problem. In software architecture, it shows up as not focusing on the right thing at the right time or in the right way. To avoid this anti-pattern, we need to spend some time understanding not only the problem we're solving, but how the solution will be used. We need to step back and look at how each component fits into the overall solution. This requires taking the time to understand the data and the context in which it will be used.

Highlights

  • Statistical bigotry is where we take small amounts of data or anecdotal data and turn it into something bigger than it is.
  • This anti-pattern is one of a couple that are related to focus: basically, not focusing on the right thing at the right time or in the right way.
  • We need to be able to step back and look at how does this component, how does this piece, how does this function fit in the context of the solution we're building?
  • We need to spend some time understanding not only the problem we're solving, the solution we're designing, and the architecture that builds it, but how it is going to be used.
  • If you take your time with that, then you can help yourself avoid getting bit, and those are painful bites to have: if they show up once you're in production and require some rearchitecting, that can be very challenging, because you now have live customers feeling the pain.

Key Takeaways

  • Statistical bigotry is a common anti-pattern in software architecture.
  • This anti-pattern can lead to poor design decisions.
  • To avoid statistical bigotry, we need to spend time understanding the problem and the solution.
  • We need to consider the full context of the problem, including how the solution will be used.
  • We need to take the time to understand the data and the context in which it will be used.

Practical Lessons

  • Take the time to understand the problem and the solution.
  • Consider the full context of the problem, including how the solution will be used.
  • Avoid making design decisions based on small amounts of data or anecdotal data.

Strong Lines

  • Statistical bigotry is where we take small amounts of data or anecdotal data and turn it into something bigger than it is.
  • We need to be able to step back and look at how does this component, how does this piece, how does this function fit in the context of the solution we're building?

Blog Post Angles

  • The dangers of statistical bigotry in software architecture and how to avoid it.
  • How to identify and avoid statistical bigotry in your own design decisions.
  • The importance of considering the full context of the problem when making design decisions.
  • The consequences of ignoring statistical bigotry and how it can lead to poor design decisions.
  • The benefits of taking the time to understand the problem and the solution, and how it can lead to better design decisions.

Keywords

  • Statistical bigotry
  • Software architecture
  • Design decisions
  • Context
  • Data

Transcript Text

Welcome to Building Better Developers, the Develpreneur podcast, where we work on getting better step by step, professionally and personally. Let's get started. Hello and welcome back. We are continuing our season where we're looking at patterns and anti-patterns for software architecture. This episode, we're going to look at a new one. We're going to call this statistical bigotry, using some nice inflammatory words, as you sometimes see within the anti-patterns, to sort of get your attention. This particular anti-pattern is one of a couple that are related to focus, basically not focusing on the right thing at the right time or in the right way, and thus working hard but not really getting the solution that you should get or that makes the customer happy, which is really the most important one. Statistical bigotry is where we take small amounts of data or anecdotal data and turn it into something bigger than it is. Now, you see examples of this all the time in sports and political talk, where people want to make a point and they'll say, such and such had this great game, and if they did that for the next 10 games, they'd be the greatest player ever. Or in politics, it's, this happened this one time, so this must be the way it is for everybody that believes that way. Just these wide, broad statements. Now, you may think that doesn't happen in technology, that doesn't happen in software, but it does, and we see it in a more specific way. Most often, what we'll see with this anti-pattern is that we spend too much time on something that does not occur very often, versus something that, when it is put into use, when it goes to production, is used frequently.
We've talked in the past about different challenges, and one of the things that you want to do in order to tune your application to properly perform, to get good response times, is look at what matters, for lack of a better term. And by that, I mean, how often is this code run? If it runs once, then if it's a little sloppy or a little off, it's not going to matter. If it runs a million times a minute, you really need to tighten that stuff up. And that's where statistical bigotry shows up. With this anti-pattern, what we do is we look at things from one particular vantage point or user type. And I'm going to go ahead and throw the developers under the bus, because we do this a lot. We look at building the solution based on how we use it, how we would see ourselves using that solution, that application. And so we are emphasizing the things that we do or that we understand, which is really a lot of the time where this comes from. We don't consider some of these other features that maybe we don't understand as much, or don't realize where they fit in the overall solution, and we don't know how often they're going to be used. And so we really don't pay as much attention to those. And the next thing you know, you have an application that goes out that works awesome for the cases the developers used it for or tested it for, but not at all for the end users. The implementation team gets complaints because the application isn't responding in the way that we would expect. And these expectations usually refer back to either this component or section of code being executed more often than we, the implementers, expected, or it can be a data sizing issue. Now, sometimes it's a minor kind of thing. For example, a minor thing would be a string field in a database that we have made too small.
We don't give enough space for it. We think that, let's say, comments or notes on this particular record are going to be short, maybe a sentence or two. So we set it at, I don't know, 250 characters. Great. And we find out in actual usage that this thing is actually more like a summary or maybe even a short essay, and so it could be hundreds, even thousands of characters. And so we have to change the space. Now, it's not just that simple. Usually we also have to handle some of the reports that we have and change those so that they display in some meaningful way. The data entry piece may have to adjust, because you don't want to be typing where you can only see 10 characters at a time. It's things like that. This is a perfect statistical bigotry example. We looked at test data, or sample data. They gave us 10 records. We looked at them. Our solution covered those 10 records. What we didn't do is dig further into the data we were given to see if it was representative of what reality would be, whether it included any outliers or other things that are essentially not happy-path data. Was it typical? Was it pushing the boundaries? What was it? Because we need to know. For example, going off this example, if we get a bunch of sample records, we need to know: are these fields maxed out? Are the values that we're seeing here the maximum and/or the minimum that we're going to see for that field, for that data structure? Or could it be different? Data types are the same way. We see numbers in this one field. Is it always going to be numbers, or could it be alphanumeric? And that's very common in IDs and things of that nature, where the ID could very well be an 18-digit number, or it could be alphanumeric, where those characters may be digits.
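That kind of sample inspection can be automated. Here is a minimal sketch, with hypothetical field names and records, of profiling a handful of sample records for maximum lengths and character types:

```python
# Sketch: profile a handful of sample records to see what the data
# looks like. Field names and records here are hypothetical.
records = [
    {"id": "000123456789012345", "notes": "Looks good."},
    {"id": "000123456789012346", "notes": "Short note."},
]

def profile(records):
    """Collect max length and whether every value was all digits, per field."""
    stats = {}
    for rec in records:
        for field, value in rec.items():
            s = str(value)
            entry = stats.setdefault(field, {"max_len": 0, "all_digits": True})
            entry["max_len"] = max(entry["max_len"], len(s))
            entry["all_digits"] = entry["all_digits"] and s.isdigit()
    return stats

stats = profile(records)
# A max_len of 18 observed on only two records says nothing about the
# true maximum -- treat it as a floor, not a ceiling.
```

The point of the final comment is the anti-pattern itself: the profile tells you what the samples contain, not what production will contain.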
They may be letters. Who knows? They could even have some special characters. Maybe there are some periods in there, like a decimal place, or dashes or underscores or spaces, all those kinds of things. So there's a lot that we need to examine with the samples that we're given. And in a similar fashion, we see these kinds of things in reports and dashboards at times, where we'll put together a report that may be very complex. Maybe it's pulling from a lot of different data, and it's doing it essentially on the fly, and in doing so it takes a while. But we figure this isn't going to be run that often. Maybe it'll be run once a day, or once a week, or once a month. Turns out, no, that thing actually gets run every hour, and it's important for it to be run on the hour. Well, that may be something where we looked at it and thought, oh, this is an outlier or a minor report. No, incorrect. It's actually something very important that we need to go back and address in a way that it can be run on a regular basis without bringing the system to its knees, and without somebody having to go get a cup of coffee every time they run that report. Now, what we're doing when we run into this anti-pattern is we are looking at situations and we're not spending the time to understand the real context around those. This would be, for example, if we were looking at a graph, a line of some sort. If you don't have some sort of metric, like an X and Y axis that actually has numbers on it, it can be very deceptive. You can see some sort of a slope that looks devastating until you realize that, in context, it's only a microscopic subset of a larger line that's actually great.
For example, you could have a situation where you have a line that is, let's say, steadily increasing, but for some short period it decreases. And if you look only at that short period, it totally misrepresents it. Think of a stock that's growing at a steady pace, a couple of percentage points a year, but it has two bad days in a row and drops, I don't know, 10 or 20% in those couple of days, and then jumps back up a few days later, or maybe over the next four weeks. In any case, that couple of days is going to make it look like you don't want to ever touch that stock, that business is going out of business in no time. Versus, when you step back and you look at the overall context, you realize that, no, that thing's fine. It's steady. Yeah, it has a bad day here and there. Who cares? Same thing with our architecture. We need to be able to step back and look at how does this component, this piece, this function fit in the context of the solution we're building? Is this something that is going to be used a lot? Is it a key cog, or is it something that's really more of a nice-to-have? And we'll get back into these sorts of requirement-digging expeditions as we talk about some of the other lack-of-focus anti-patterns. This one is more: let's take a look at it, let's slap a metric or some numbers on it, and then we just run with those numbers, not really spending the time to step back and look at what those mean. It's like a science experiment: you run it two or three times and it works great, and then you assume that it's always going to work great. That's not the case. Just toss a coin. If you toss a coin in the air and it comes up heads three times in a row, does that mean that it's always going to come up heads because it just did three times in a row? No, there are other options, and that's what you have to understand.
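The stock example and the coin toss can both be put in numbers. This sketch uses made-up prices to show how the same series supports opposite conclusions depending on the window you measure:

```python
# Sketch: the same series judged over a short window vs. the whole
# history. Prices are made-up illustration data.
prices = [100, 102, 104, 106, 85, 88, 108, 110, 112, 114]

def pct_change(series):
    """Percent change from the first to the last value in the series."""
    return (series[-1] - series[0]) / series[0] * 100

bad_days = pct_change(prices[3:5])   # 106 -> 85, looks catastrophic
overall  = pct_change(prices)        # 100 -> 114, steady growth
# Two bad days: roughly -19.8%. Overall: +14%. Same stock, opposite
# conclusions, purely from the window chosen.

# And the coin toss: three heads in a row is not proof of anything.
p_three_heads = 0.5 ** 3  # 0.125 -- it happens one time in eight
```

A one-in-eight event is routine, which is exactly why a run of three "successful" trials says very little on its own.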
You have to understand what it is that I might be missing when I'm looking at this functionality. And again, it goes to some of the things we've covered in the past as well, as far as having a solid environment: being able to properly test your code and your solution means having something that resembles reality when you do it. That means your data sizes. If you deal with a database that's normally going to have hundreds of thousands or millions of records, then testing with tens of records is almost certainly not going to be sufficient, because there are issues that you only run into once you hit certain numbers of records. One example we see a lot would be memory-leak types of problems. You could have a minor memory leak in one little function that doesn't even show up as a blip during development, and maybe even testing. But in reality, when you suddenly have a system that's not being restarted daily and has hundreds, thousands, tens of thousands, millions of users, instead of maybe a couple dozen, suddenly that little thing magnifies a lot. And us saying, yeah, there's no memory leak, would be one of those cases where we're running into this anti-pattern: we didn't see it in this little, overly clean example that we put together with a few records and ran for a little while, but it does show up later. And that's the challenge of this anti-pattern: it does not always appear to us at all until we get into a more realistic production system.
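Some back-of-the-envelope arithmetic shows how a leak that is invisible in development becomes fatal at production scale. All the numbers here are hypothetical assumptions, chosen only to illustrate the magnification:

```python
# Sketch: a "tiny" per-request leak at dev scale vs. production scale.
# Every figure below is a hypothetical assumption.
leak_per_request_kb = 2          # unnoticeable in a single run
dev_requests = 500               # a short local test session
prod_requests_per_day = 5_000_000
days_between_restarts = 30       # production is not restarted daily

dev_leak_mb = leak_per_request_kb * dev_requests / 1024
prod_leak_gb = (leak_per_request_kb * prod_requests_per_day
                * days_between_restarts) / 1024 / 1024
# Under 1 MB in development -- not even a blip. Hundreds of GB over a
# month of production -- the process dies long before the restart.
```

The same code, the same leak; only the request volume and uptime changed, which is the whole point about testing against something that resembles reality.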
This can even lead us to the ever-popular developer statement of, it worked on my machine, and it doesn't, because as soon as you move it to another machine, it doesn't have that pristine or limited environment that allows it to work in one case and not in the other. So this can be a very dangerous anti-pattern to stumble across, because it may not show up until you hit production, unless you get ahead of it. And that's the key to avoiding this anti-pattern: spend some time understanding not only the problem you're solving, and the solution and architecture you're designing, but how is this going to be used? Is this going to be something that is hit on a regular basis, every minute, with features and functions that are pulled and addressed? Is it going to have lots and lots of users, or a small number of users that only touch it once a day, once a week, something like that? What kind of data sizes are we talking about, thousands of records or millions of records? What about concurrent users: are we talking about a few, or hundreds? Think of the resources that can run out: things like disk space, processing, memory, and time, just general time to run or return results. And then, using those possible limiting factors, ask questions about the data that you're examining to see if maybe you need to adjust your context and change your opinion or your assertions based on that data. It would be like a political poll. If you go ask one person how they're going to vote, I think we all know that's not sufficient. We know that there are a lot of other voters, and they are going to have different opinions. And so you need to get to a point, and there is math behind all this too.
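The math behind "how many samples is enough" is the standard sample-size formula for a proportion. A minimal sketch, assuming the usual 95% confidence value of z = 1.96:

```python
# Sketch: minimum sample size for estimating a proportion,
# n = z^2 * p * (1 - p) / e^2, rounded up.
import math

def sample_size(confidence_z=1.96, margin=0.05, p=0.5):
    """z: z-score for the confidence level; margin: allowed error;
    p: expected proportion (0.5 is the worst case, maximizing n)."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin ** 2)

# 95% confidence with a +/-5% margin needs about 385 respondents --
# one voter's answer, or ten sample records, is nowhere close.
```

Loosening the margin to ±10% drops the requirement to under 100, which is why stating your tolerance for error up front matters as much as gathering the data.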
If you really want, you can work on getting to the right amount that is minimal but also representative enough. But in so doing, you also have to understand what it takes to make data representative of the situation we're going into: the production system and the production end users. If you take your time with that, then you can help yourself avoid getting bit, and those are painful bites to have, because if they show up once you're in production and they require some rearchitecting, that can be very challenging to work with. You now have live customers, they're feeling the pain, and you've got to find some way to change things up enough to fix it, but also to keep the train running. Something to think about when you're thinking, eh, we'll just sort of drive this one home, we'll see how it works. You may not want to do that. That being said, I think you may want to go on and check out the rest of your day, so we'll wrap this one up. And as always, go out there and have yourself a great day, a great week, and we will talk to you next time. Thank you for listening to Building Better Developers, the Develpreneur podcast. You can subscribe on Apple Podcasts, Stitcher, Amazon, anywhere that you can find podcasts. We are there. And remember, just a little bit of effort every day ends up adding into great momentum and great success. One more thing before you go: the Develpreneur podcast and site are a labor of love. We enjoy what we do, trying to help developers become better. But if you've gotten some value out of this and you'd like to help us, it would be great if you go out to develpreneur.com/donate and donate whatever feels good for you. If you got a lot of value, a lot; if you didn't get a lot of value, even a little would be awesome. In any case, we will thank you, and maybe it'll make you feel just a little bit warmer as well.
Now you can go back and have yourself a great day.