When to Build A Data Fabric?
This paper examines reasons and use cases for building a data fabric. A data fabric (sometimes called dataware or a data lake) offers reliability, scalability, and engineering freedom to teams with independent projects within a large organization. A data fabric plays the role of a distributed computer cluster housing all the data used throughout every facet of an organization. To avoid the known limitations of monolithic application architecture, we evaluate the life cycle of a distributed microservices architecture built with and without a data fabric.
Read our related post about various data fabric architectures here and here.
A Brief History
All web applications, whether we architect them for mobile, desktop, or machine interfaces as APIs, rely on data. Myriad computing devices generate incomprehensibly large volumes of data daily from myriad inputs. At any moment in time, the computational power of all the world's smartphones alone tops an estimated 10 ExaFLOPS (2 GigaFLOPS x ~5x10^9 phones ~= 10^19 FLOPS) of computational potential, though never at peak utilization. IDC estimates that by 2025 the world will produce 163 zettabytes of new data annually.
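The parenthetical estimate above is plain multiplication, and a few lines of Python make it easy to check or to rerun with different figures. The per-phone throughput and the phone count are the figures quoted in the text; the function name is ours and purely illustrative.

```python
# Back-of-the-envelope check of the smartphone estimate quoted in the text.
FLOPS_PER_PHONE = 2e9   # 2 GigaFLOPS per phone (figure from the text)
PHONES = 5e9            # ~5 x 10^9 phones (figure from the text)
EXA = 1e18              # one ExaFLOPS expressed in FLOPS


def aggregate_exaflops(flops_per_device: float, devices: float) -> float:
    """Return the combined peak throughput of all devices, in ExaFLOPS."""
    return flops_per_device * devices / EXA


total = aggregate_exaflops(FLOPS_PER_PHONE, PHONES)
print(f"{total:.0f} ExaFLOPS")
```

This is a ceiling, not a measurement: it assumes every phone runs at its peak rate at the same moment.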
Where will it all go? What will we do with it? How do we tame this ultra-dimensional Goliath to serve humanity for lifetimes? To dare an answer, we need to think globally and act locally when it comes to our data. We need a strong focus on our data logistics early on to help rein in unexpected long-term costs. Moreover, while efficient and clearly delineated boundaries for our applications using container-based systems can be a boon to exponentially scaling software deployments, those very boundaries in our data can silently set us up for mediocrity in the coming age of ultra-intelligence.
At the start, it's reasonable to assume that not all software will span the globe. Regardless of your pedigree, finances, and circle of comrades, success arrives without guarantees. Yet, as we write this, one of our clients, a socially responsible video upstart from Vancouver, Canada, is determined to follow the long arc of influence and value to humanity while risking success with a powerhouse of influencers and investors. To that end, with attention to the meticulous nature of long-term thinking, they are laser-focused on where their data goes from day one. With the confidence of a planned continuum, they ensure that the subconscious doubts of relentless growth don't become collateral damage to the small fires that every young company endures on its historic march to maturity.
To belabor the point here, another of our potential global partners has the largest small-scale clients of their own that we could ever have the privilege to support: international governments. Where will the data go for even the smallest analytical projects at that scale? Distributed databases? Mainframes? Tape? None of the above can do it all.
Can one solution really fit all sizes? We've asked that question over 14 years of software engineering and consulting, across a project history spanning ventures in Washington D.C., Silicon Valley, and New York. Along the way, we found a thesis: a solution. Indeed, our founder and principal consultant had a part in building it.
A Case For Containers
Common applications developed by the majority of self-proclaimed Full-Stack Engineers for early-stage companies typically have a few crucial components. The least of these is the data storage layer, typically a database. I write "least" because "Our app uses a killer database!" seldom underscores an elevator pitch. The database is typically one of three kinds: a SQL database, a NoSQL database (such as a document store), or an in-memory database. It's true that log files also generate critical data, but those are typically reactive, rotational data stored in flat files. They're only mission critical when shit hits the fan, or when engineering locally.
Bear with this and imagine the successful team writing their first working MVP. In the thrill of it, they focus on one thing: functional success. All the features work. The Visionary CEO is happy. Marketing is hamming it up with customers (aka users) and investors, and preparing for general availability and the big launch. Even the algorithms for machine learning are sputtering out meaningful results from the pre-trained data models leveraged from magical frameworks promising the virtue of countless jargon on the path toward a sentient future. All is gold in the land of milk and honey. Yet are these teammates considering what will happen in the weeks and months that follow the launch? Are they really considering the reality ahead, their contingencies, and the actual costs of the magic they've spun into being? Let's all hope so, but for the sake of argument, assume flatly "no, they are not".
So what happens to their data?
First, they roll out a process manager that load-balances their deployment. This works for a few dozen people and perhaps holds up for hundreds of infrequent beta-testers.
They decide, pre-launch, that a container strategy will work best for the product in production. They package all of the application code into one container that requires a second database container. They decide to put static assets on S3 or a similar cloud storage solution.
The plan is to replicate containers behind a load balancer and maximize the compute capacity of cloud service virtual machines. Everyone dubs it "horizontal scaling". Priced at a reserved license cost, the operating estimate looks reasonable and all looks well.
It isn't. Not for long.
As application use spikes, the lack of planning for system failure overwhelms the engineering team. Fire-fighting takes the place of product engineering and innovation. The company ship starts to swamp with technical debt and no one can pump the bilge fast enough to keep up.
Momentum lost, the company pivots, flails, and awakens to its real need: solid data logistics built on fundamentally performant microservices that support key application processes as containers, backed by a distributed cluster of database replicas rather than applications-as-containers bound to one database each.
A Data Fabric: Panacea or Steroid Treatment?
Imagine a database that spans machines, racks, and data centers around the world. It runs on commodity hardware either in the cloud or on-premises. It distributes data with algorithmic redundancy across disks whose occasional but certain failure will crash neither the cluster nor the production system. Expand that vision beyond the database into the filesystem. This massively scalable database and filesystem offer APIs for real-time event streaming, Apache Hadoop, Apache Hive, and Apache Drill. Now consider a deployment that hosts application microservices on compute nodes throughout the cluster.
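To make the idea of algorithmic redundancy concrete, here is a minimal sketch of one way to place copies of a data block on distinct nodes so that a single failure never takes out every copy. It uses rendezvous-style hashing, which we chose for illustration only; it does not describe how any particular data fabric product places data, and the node names, block ID, and function names are hypothetical.

```python
from __future__ import annotations

import hashlib


def score(block_id: str, node: str) -> str:
    """Deterministic per-(block, node) score used to rank candidate nodes."""
    return hashlib.sha256(f"{block_id}:{node}".encode()).hexdigest()


def place_replicas(block_id: str, nodes: list[str], copies: int = 3) -> list[str]:
    """Pick `copies` distinct nodes for a block by ranking nodes on their score."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")
    ranked = sorted(nodes, key=lambda node: score(block_id, node))
    return ranked[:copies]


def survivors(placement: list[str], failed: set[str]) -> list[str]:
    """Return the nodes in a placement that still hold a copy after failures."""
    return [node for node in placement if node not in failed]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_replicas("block-42", nodes)
remaining = survivors(placement, {placement[0]})
print(f"{len(remaining)} of {len(placement)} copies survive one node failure")
```

Because the ranking depends only on the block ID and the node names, every client computes the same placement without consulting a central lookup table.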
Our microservices mustn't hunt for data. Nor must they closely wed themselves to their own independent data storage mechanisms. Instead, they must have access to a configurable API (Application Programming Interface) that offers a blanketed storage layer with appropriate services exposed to the microservice without needing additional "glue". Moreover, the raw bytes comprising all the data throughout the system must NOT move around in order to complete their lifecycle of value for the organization. For some types of data, that lifecycle can take years depending on the tasks and dependencies of the machine learning models at play.
Data must nourish the organization at rest as much as possible. It must also be secure without compromising accessibility to authorized researchers and algorithmic services alike. Machine learning services, web services, and any other microservices must deliver data to an optimized location in the data architecture and then avoid moving it, in order to maximize architectural efficiency and minimize I/O cost. Input / Output (I/O) refers to the movement of bytes internally between disks or externally between computers over network connections.
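A rough calculation shows why keeping data at rest matters. The sketch below estimates how long it takes to move a data set across a network link, assuming an idealized link that sustains its full line rate. The 500 TB data set and the 10 Gbit/s link are assumptions chosen for illustration, not measurements, and the function name is ours.

```python
SECONDS_PER_DAY = 86_400
TB = 1e12  # bytes in a terabyte (decimal)


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move `size_bytes` over a link running at full line rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


# Assumed figures: a 500 TB data set and a 10 Gbit/s link.
days = transfer_days(500 * TB, 10e9)
print(f"{days:.1f} days per full copy")
```

Under those assumptions a single copy takes a little over four and a half days, and every export, migration, or staging step pays that cost again.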
Within a database cluster, we may deploy a host of machines to run replicas and shards full of data. What if we want to mount the data on a filesystem to process it with our newfound favorite ML toolkit? We have to export it.
Within a filesystem, we may deploy a host of buckets for low cost storage of video, audio, or log data. What if we want to transform all that data into a database of JSON data ready for website retrieval? We have to migrate it.
Within an IoT server architecture, we may deploy a host of topics for unified event stream analysis in Kafka. What if we want some data to go into our database, some to stream to enriched topic streams, and the rest onto the filesystem for archiving? Oh, I suppose we have to build three separate architectures.
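The three-way split described above can be sketched as a single routing function rather than three architectures. This is a minimal illustration in plain Python: the in-memory lists stand in for a database, an enriched topic, and a filesystem archive, and the routing rules (a "priority" field and a "device_id" field) are hypothetical, not taken from any real schema or streaming API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sinks:
    """In-memory stand-ins for the three destinations named in the text."""
    database: list = field(default_factory=list)
    enriched_topic: list = field(default_factory=list)
    archive: list = field(default_factory=list)


def route(event: dict, sinks: Sinks) -> list[str]:
    """Write one event to every matching destination; return the sink names used."""
    written = []
    if event.get("priority") == "high":        # hypothetical rule: hot data goes to the database
        sinks.database.append(event)
        written.append("database")
    if "device_id" in event:                   # hypothetical rule: device events get enriched
        sinks.enriched_topic.append({**event, "enriched": True})
        written.append("enriched_topic")
    sinks.archive.append(event)                # everything is archived
    written.append("archive")
    return written


sinks = Sinks()
print(route({"device_id": "sensor-1", "priority": "high"}, sinks))
print(route({"message": "heartbeat"}, sinks))
```

On a unified platform the three sinks would sit on the same cluster, so the routing decision is the only thing each event pipeline has to own.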
This gets more complicated as we get more realistic about the various scenarios for secondary data analysis, ML from indirect inputs, and the number of teams that will eventually work on a large scale real world project.
While these hypothetical problems illustrate a strong case for a unified solution, we must identify that solution and clarify whether this panacea cures the inevitable or acts as a steroid that feeds self-aggrandizing delusions. Tying these things up requires keen contextual analysis of organizational trajectory. Every startup might claim a hockey stick forecast and exponential returns; however, this is seldom the case.
In our work at pSilent Partners Ltd, we have thought about this ad infinitum with clients of all sizes and been in the trenches on small and large projects alike. We've learned to consider the trajectory of a project with the utmost earnest consideration and our bullshit filters turned way up to 11. In our analysis, we gauge the potential needs of a project based on the speculative value of human resources and the network effect of a given organization. In plain English, we ask:
"What's your strategy for deployment?"
"What's your anticipated user base?"
"What's your data model and how large is the data and how much exists?"
"What is your budget for infrastructure?"
"Do you need customized machine learning? Data streaming? Analytics? Business intelligence?"
"Are there any user acquisition events planned that will result in exponential adoption?"
"What is the current services architecture design?"
These questions typically determine whether a client organization is a strong candidate for a Data Fabric Microservices architecture.
For example, one client is engineering a groundbreaking microservices machine learning infrastructure for crowd-sourced mobile video production. While no success is ever guaranteed, their founders and investors will notify millions of their existing followers when they officially release their iOS app. Is that a reason to prepare for a massive ML rollout on a unified data fabric? Yes, indeed. (Read more about an ideal unified data fabric video architecture in our blog post.)
Moreover, a small team of researchers developing prototype IoT devices for a large electronics corporation must also consider a unified data fabric to support various customers on a multi-tenant SaaS IoT infrastructure accessible to 60+ design centers worldwide.
Lastly, a global organization responsible for combating poverty in various countries worldwide wants a centralized "ledger" for reconciling transactions across hundreds of disparate systems hosted by each of its national government customers. Should they consider a unified data fabric? Absolutely.
What do these examples all have in common? Data footprints exceeding hundreds of terabytes over the course of time. Some cross the petabyte threshold. This is a concern because even "simple" computations touch vast amounts of data again ...and again ...and again. The overhead of managing research, development, event streaming, machine learning and artificial intelligence across terabytes (or petabytes) of data will be devastating without the use of a unified data fabric.
On the other hand, if you run a website forum for car enthusiasts, a local online retailer, your mom's cooking blog, or any small-scale infrastructure with no anticipation of scaling to massive data sets... don't bother. The overhead of a unified data fabric cluster will be a burden at best when only a few gigabytes of data are in motion on a "good day".
The "point" here is to start small even if you know you'll go big.
MapR as Data Fabric
Our "go-to" for data fabric infrastructure is MapR's Converged Data Platform. The reason is that we helped build it. As mentioned, our founder and principal consultant was on the founding engineering team at MapR when the filesystem, control system, and NoSQL database components were released. Since then, the platform has evolved from a highly performant Hadoop MapReduce architecture into a full-scale converged data platform. Today it is a technology partner and fundamental technological solution for over a hundred global organizations. It's no wonder that this is a sleeping giant in the large-scale data industry, especially at a time when ML and AI technologies are still in their infancy in a booming and often over-hyped market for differentiating technologies. By our account, MapR was 1000x faster at batch processing and MapReduce when it came to market in 2011. It has set a TeraSort world record on Google Compute Engine. It is the key technology differentiator for any organization starting small with plans to go big in BI, ML, and AI.
We strongly advise MapR as a data fabric to our clients. Having worked there early on, we know that the MapR architecture is unparalleled in terms of reliability, performance, and ease of use. It is today even more of what it was when we released it publicly in 2011. At its essence, MapR offers the world's fastest and most dependable filesystem for collecting and storing data and for computing analytics, ML, AI, and more. It's trusted by the US Government Intelligence Community, the DoD, and several civilian agencies. Google, Amazon, and Hewlett Packard Enterprise are partners and investors. Cisco Systems uses MapR in its core operations. AMD, the chipmaker, developed its next generation of processors for data centers specifically for the MapR architecture. Although they won't disclose it publicly, Walmart and their technology company, Walmart Labs, run on MapR for analytics, ML, and more. The list goes on.
The full "stack" data fabric architecture is rather simple. The underlying technical design of MapR-FS (the MapR Filesystem) affords the platform's users the opportunity to design a microservices application architecture that adheres to best practices we pass on to our clients:
Portable Application Microservice Architecture - Applications use inter-process communication (IPC) behind the firewall. Applications running on threads communicate with one another using RPCs to exchange functional data units while sharing a common API to a cluster-wide database (MapR-DB) for sessions and high-speed data. Various functional components (e.g. Authentication, Feeds, Modeling, and Analytics) run in entirely separate containers and scale horizontally without local data storage issues. Teams supporting these components work independently, iterating independently and failing independently without total system failure (no single point of failure, or SPOF).
Event Streaming to Database and Filesystem - Application microservices in machine learning can establish their own event pipelines independently as well. Competing models can access the global data fabric and partition their contributions without creating unknown inter-dependent relationships with one another. Each can run in a separate container and scale horizontally. Railway-oriented architectures can evolve inside application containers and leverage the open-source Kafka streaming APIs from web service applications, sending a cascade of analytics on write into both the database (for high-priority data) and the filesystem (for long-run intelligence).
Shared POSIX Network File System Access - Data scientists can mount the entire cluster filesystem (or specific sandbox data replicas) and perform data cleaning and learning algorithm research on-demand over massive data sets cascading from the application. This may seem like a trivial addition, but it isn't for the sole reason that data doesn't need to move in order to be interrogated. Data Scientists can engineer their models without migrating data between local computers, staging servers, and production environments.
Encryption at Rest and Unix File Permissions - Internal multi-tenancy ensures that groups have permission to access specific filesystem volumes using group policies that ensure sensitive information stays hidden from prying eyes even within a trusted organization. Encryption at rest ensures that data volumes containing sensitive data won't reveal secrets in the unlikely event of a filesystem compromise (caused perhaps by a failure to 0-write failed drives at an on-prem or colocated data center).
High Performance on Commodity Hardware - All the features and functionality in the entire ecosystem of frameworks and architectures don't amount to much without extremely efficient performance. We live in an era of high computation on commodity hardware. The use of specialized hardware leads to lock-in and loss of revenue. Moreover, it's rarely the case that ML teams release models without evolving their efficacy over time; thus the hidden costs of ML's low-hanging fruits easily become the high-interest credit card of an organization's technical debt. This is exactly why a data fabric needs to compute efficiently on low-budget standardized hardware at scale. We simply can't be bothered with high latency and management hurdles.
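The first practice in the list above, separate components sharing one storage API instead of keeping local state, can be sketched as follows. This is a minimal illustration under stated assumptions: `SessionStore`, `InMemoryStore`, `authenticate`, and `build_feed` are hypothetical names of ours, and the in-memory class merely stands in for a cluster-wide database. It is not the MapR-DB client API.

```python
from __future__ import annotations

from typing import Protocol


class SessionStore(Protocol):
    """The one storage interface every component codes against (hypothetical)."""

    def put(self, key: str, value: dict) -> None: ...
    def get(self, key: str) -> dict | None: ...


class InMemoryStore:
    """Stand-in for the cluster-wide database, used here so the sketch runs anywhere."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self._rows[key] = dict(value)

    def get(self, key: str) -> dict | None:
        row = self._rows.get(key)
        return dict(row) if row is not None else None


def authenticate(store: SessionStore, user: str) -> str:
    """Authentication component: writes a session and keeps nothing locally."""
    session_id = f"session:{user}"
    store.put(session_id, {"user": user})
    return session_id


def build_feed(store: SessionStore, session_id: str) -> list[str]:
    """Feeds component: reads the same session through the same interface."""
    session = store.get(session_id)
    if session is None:
        return []
    return [f"welcome back, {session['user']}"]


store = InMemoryStore()
session_id = authenticate(store, "ada")
print(build_feed(store, session_id))
```

Because neither component holds state of its own, either one can be replicated or restarted in its own container without coordinating with the other.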
Overall, the architecture that we choose determines how we evolve as organizations. It's one thing to build a scrappy prototype or alpha version proof-of-concept in order to prove out a business idea, raise an angel round, and gather users around your campfire. We don't disagree with this approach. We encourage it and will even support clients with this. Yet, it's a risk to take funding and run hot on PoC prototype architectures with the promise of "one day" migrating to an industrial solution for data operations. You may get away with it as an organization by throwing people, cloud computing solutions, and budget legacy systems at the problem, but there will come a day when your organization will have to pay the piper and pony up additional funding to "fix the problem".
Thus, it's our business to help you strategize your ascent to greatness with a plan that includes a migration to a data fabric architecture, timed aptly as we anticipate a leap in scale. We exercise that capability through partnerships with MapR and Amazon Web Services, offering PoC licenses and architectures to suss out a microservices architecture that will pour gas on the right fires and avoid the need to throw water on the ones we can easily avoid.
Let us know your thoughts below. Be sure to read the related articles. Most importantly, reach out to us if you want to take the leap into strategic data operations at any stage of business.