If there’s one problem every modern IT organization struggles with, it’s the problem of data silos.
Take a typical sales department. If you were the vice president of the sales department in your organization, you’d want to make sure all your leads, customer information and any other important sales data are stored in a centralized database repository that’s easily accessible to you and all your sales staff.
Of course, in order to make sales, you need to make sure you have the proper inventory in stock. So if you were the vice president of inventory, your job would be to know exactly what inventory you have, and how much of it, at any given minute. You’d make sure you have a centralized database repository containing all relevant inventory and quantity information for every product your company sells.
Similarly, if you were the vice president of your shipping department, your job would be to make sure all the orders from the sales department get properly packaged up and shipped off to the right customer and shipping address. So as the VP of shipping and fulfillment, you’d want to make sure that all data relevant to packaging and shipping is stored in a safe, easily accessible, centralized database repository of its own.
It’s also important to note that as part of the proper fulfillment of a sales order, you actually need data from the sales, inventory, and shipping departments.
A complete invoice must contain the information about the customer initiating the sales order, as well as the actual inventory items that have been ordered, and the shipping and packaging information.
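To make that concrete, here is a minimal sketch of an invoice being assembled by reading from each department’s own repository. The three dictionaries, the order ID, the SKUs, the prices and the build_invoice function are all hypothetical stand-ins for illustration; a real system would be querying actual databases.

```python
# Hypothetical stand-ins for the three departmental repositories.
SALES = {
    "SO-1001": {"customer": "Acme Corp", "items": {"WIDGET": 3, "GADGET": 1}},
}
INVENTORY = {
    "WIDGET": {"description": "Standard widget", "unit_price": 4.50},
    "GADGET": {"description": "Deluxe gadget", "unit_price": 12.00},
}
SHIPPING = {
    "SO-1001": {"address": "1 Main St, Springfield", "carrier": "Ground"},
}


def build_invoice(order_id):
    """Assemble one invoice by reading each department's own repository."""
    order = SALES[order_id]        # who ordered, and what
    shipment = SHIPPING[order_id]  # where it goes, and how

    lines = []
    for sku, quantity in order["items"].items():
        product = INVENTORY[sku]   # what the item actually is
        lines.append({
            "sku": sku,
            "description": product["description"],
            "quantity": quantity,
            "line_total": quantity * product["unit_price"],
        })

    return {
        "order_id": order_id,
        "customer": order["customer"],
        "ship_to": shipment["address"],
        "carrier": shipment["carrier"],
        "lines": lines,
        "total": sum(line["line_total"] for line in lines),
    }


invoice = build_invoice("SO-1001")
```

Notice that no single repository can produce the invoice on its own. Each one contributes a piece, and each piece is read from exactly one place.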
The inventory department is particularly interested in sales orders and shipping department activity because they need to know when to replenish inventory. They may need to analyze past data from the sales department in order to help them figure out which months of the year are busier than others, so they can adjust their inventory ahead of time and avoid inventory shortages or overstock.
In short, if you’re the CEO of the company, your job is to make sure that each of these departments not only understands its own piece of the business inside and out, but also works effectively with the other departments it needs to interact with.
Unfortunately, this is easier said than done.
I’ve been in enough companies in my career to confidently say that departments working together this smoothly is actually a rare occurrence.
And I don’t say this out of some vague gut feeling. I say this from the perspective of a software developer with experience working firsthand with many departments within my organization.
Common sense would dictate that you maximize your efficiency by ensuring you can access the data repositories from each department in a well-documented and prescribed manner.
More importantly, you make sure that none of your data is duplicated anywhere else. You want all your sales organization data stored and centralized in exactly ONE place. The same goes for the inventory and shipping departments. Duplicate data causes lots of problems. Duplicate data makes it impossible to determine what the golden “source of truth” is.
For instance, if your sales data is scattered across multiple data repositories, how do you determine which represents the “real” sales order?
This is actually one of the most important design principles in software programming.
Don’t Repeat Yourself
Or DRY for short.
When data or code exists in one and ONLY one place, then you only have to change the data or code in one place, and not a million other places. Similarly, when you have to fix a bug or identify a piece of data, you only have to find the one single data repository or piece of code… it doesn’t become a wild goose chase across the entire IT organization to find what you’re looking for.
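Here is a minimal sketch of what DRY looks like in code. The tax rate and the function names are made up for illustration; the point is only that the rate and the calculation each live in one place.

```python
# Hypothetical example: the rate is defined once, and so is the math.
SALES_TAX_RATE = 0.08


def with_tax(amount):
    """The one and only place that knows how tax is applied."""
    return round(amount * (1 + SALES_TAX_RATE), 2)


def quote_total(subtotal):
    # Quotes reuse the shared calculation instead of repeating it.
    return with_tax(subtotal)


def invoice_total(subtotal):
    # Invoices reuse it too, so the two can never disagree.
    return with_tax(subtotal)


quote = quote_total(100.0)
invoice = invoice_total(100.0)
```

If the rate ever changes, you edit one line. Had each function carried its own copy of the rate, you’d have to hunt down every copy and hope you didn’t miss one.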
While this is what SHOULD happen in an IT organization, more often than not, what really ends up happening is what I refer to as the data silo problem.
Say you’re the shipping department looking for data from the sales order department. Logic would dictate you locate the sales data repository and get the data you need.
But instead, you create a COPY of the sales data repository for your own purpose.
The same thing could happen in the opposite direction as well. If you’re part of the sales department and you need information from the shipping department, you might create a duplicate copy of the shipping department data repository.
This may sound crazy, but I assure you I’ve seen this too many times in my career.
Mirror copies of sales databases. Or inventory databases. Or shipping databases.
All scattered all over the place.
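A small sketch shows how quickly those copies stop agreeing. Everything here is hypothetical (the order ID, the fields, the find_conflicts helper); it just simulates one department copying another’s repository and the original changing afterward.

```python
import copy

# The sales department's repository: the intended source of truth.
sales_repo = {"SO-1001": {"customer": "Acme Corp", "quantity": 3}}

# Shipping makes its own private copy. This is the silo.
shipping_copy = copy.deepcopy(sales_repo)

# Later, the customer amends the order. Only the original is updated.
sales_repo["SO-1001"]["quantity"] = 5


def find_conflicts(source, mirror):
    """Return the order IDs whose records differ between the two repositories."""
    all_ids = source.keys() | mirror.keys()
    return sorted(oid for oid in all_ids if source.get(oid) != mirror.get(oid))


conflicts = find_conflicts(sales_repo, shipping_copy)
```

After a single update, conflicts contains "SO-1001", and shipping is about to pack three units for a customer who ordered five. Nothing in the copy itself tells you it has gone stale.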
Why Data Silos Happen
There are lots of reasons. Some are human problems, others are technical.
What are the human problems? Sometimes a particular department just doesn’t want to share. They may be wary of other departments within the same company wanting their data, for whatever reason. Maybe they just don’t want other departments or employees touching their data repository and accidentally wiping data, or changing data that shouldn’t be changed.
So other departments may have no choice but to make duplicate copies of the original data repository.
Other times, the reasons are technical. If the company is scattered across different regional data centers located in different parts of the country or across different countries and continents, then data and network latency becomes a concern.
If one data center is located in the United States, and another is halfway around the world, there are real and tangible network latency issues across significant geographic distances like that.
The closer together two different data repositories are physically located to each other, the less network and data latency you have to deal with.
So again, that means creating duplicate copies of the original “source of truth” data repository, so that it exists closer to your own geographic location.
But as we previously stated, data silos and redundant repositories cause huge maintenance headaches and technical problems in the long run.
To get around this, a lot of organizations have turned to the concept of the data lake.
Pros and Cons of Data Lakes
A data lake is really a fancy term for a pretty straightforward concept. In a data lake architecture, you round up all relevant data repositories into one single centralized “lake”.
You may have heard of the term “data warehouse”, which is similar to a data lake, with the primary difference being that a data warehouse contains data that has a predefined schema and structure to it.
A data lake can contain raw, unstructured data in many forms… flat files, comma delimited files, etc.
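The difference is easier to see in a toy sketch. This is only an illustration under simplifying assumptions: the SalesOrderRow schema and both loader functions are hypothetical, and real warehouses and lakes are full platforms rather than a dataclass and a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesOrderRow:
    """The warehouse's predefined schema for a sales order."""
    order_id: str
    customer: str
    quantity: int


def load_into_warehouse(record):
    """Accept a record only if it fits the predefined schema."""
    return SalesOrderRow(
        order_id=str(record["order_id"]),
        customer=str(record["customer"]),
        quantity=int(record["quantity"]),
    )


def load_into_lake(lake, raw):
    """Store the data exactly as it arrived; make sense of it later."""
    lake.append(raw)


lake = []
load_into_lake(lake, "SO-1001,Acme Corp,3")  # a comma delimited line
load_into_lake(lake, '{"order_id": "SO-1002", "note": "call customer first"}')

row = load_into_warehouse(
    {"order_id": "SO-1001", "customer": "Acme Corp", "quantity": "3"}
)

try:
    load_into_warehouse({"order_id": "SO-1002", "note": "call customer first"})
    rejected = False
except KeyError:
    # The record is missing required fields, so the warehouse refuses it.
    rejected = True
```

The lake happily takes both records in whatever shape they show up. The warehouse takes the first and refuses the second, because it doesn’t match the structure that was decided up front.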
Using a data lake means that every department can treat the data in the lake as the “golden source of truth.”
When a company or organization utilizes a data lake, they make a deliberate business decision to dictate that every department that needs data from a certain repository goes to the lake, instead of some redundant and duplicate system.
Of course, like anything else, it’s not always rainbows and unicorns. There is a price to pay for a data lake.
There’s the “all your eggs in one basket” risk of centralizing all your data in one place. If that data lake becomes unavailable, you’ve suddenly created a company-wide outage until the problem is resolved.
It also becomes a prime target for malicious hackers.
Security and uptime have to become constant, high-priority concerns. So like anything, every organization needs to analyze the benefits and costs of utilizing a data lake.
My only real concern is ever needing to interact with a “Crystal Lake” repository.
CH-CH-CH-AH-AH-AH ….