Legacy App Migration Blog 3: With Great (Data) Power, Comes Great Responsibility (& Flexibility) - Creating data lakes as a powerful, ageless foundation for microservices

As you may have seen in our previous blog posts, we are exploring best practices and conquering the challenges associated with migrating legacy apps to the cloud.

What We Know So Far

Enterprise systems are a complex mix of business processes, analytics and more. With all that data swirling around, it once made sense to address business needs with a single, monolithic application. However, times have changed, and innovation is speeding ahead, which means the old, legacy model no longer meets the needs of the business.

REAN Cloud is here to help with that transition, migrating legacy applications into workable microservices that require less code, provide more scalability, and let you focus on doing what your company does best – innovating towards the next BIG thing.

Microservices architecture is a combination of successful and proven concepts of software engineering, such as Agile software development, SOA, API-first design and Continuous Integration and Delivery (CI/CD). A microservices approach to software development is designed to accelerate development cycles, foster innovation and ownership and improve the ability to maintain and scale applications.

At REAN Cloud, we use data lakes as the foundation for microservices architecture, creating a system where data can thrive.

Why Do I Need a Data Lake?

A data lake is a giant repository that holds raw data in its native (structured or unstructured) form until the data is needed. It then provides multiple processing options to create reliable and business-ready datasets that can be used in whole or in part to feed numerous end user applications, including data science, reporting or consumable microservices. Storing and processing data in this highly flexible manner makes it much easier to use and repurpose again and again, regardless of design needs or requirements.

Legacy apps usually have two sets of value trapped within their systems: the actual data and the business logic. Data lakes set that data free, so it is no longer siloed inside separate components.

In traditional data warehouses, data is stored only in its final format, which results in a limited amount of transactional data making it into the warehouse. This inflexible structure is constrained by the decisions made at the time of design, suppressing future agility and introducing needless workarounds. Data lakes are different! They capture all data, organize it into a variety of managed datasets, and then provide an array of tools and storage formats for flexibility. In particular, we can build microservices on the same layer to automatically use, manage and refresh data.

At its heart, a data lake is an economic and technological construct that can be used for:

  • Governing portfolios of managed datasets and Agile analytics
  • Creating value through business / mission process optimization
  • Addressing organizational constraints, liberating the use of the data
  • Minimizing the cost of data ownership through flexible sourcing and on-demand, elastic pricing

Data Lakes for Microservices Architecture

Using a data lake as the foundation for migrating legacy apps to the cloud allows us to decouple the data from the legacy app. In its most basic form, that data is then brought to light so that it may be repurposed many times over.

Continuous Integration (CI) and Continuous Delivery (CD) are then introduced through automated systems that speed the rate of change for an application. Next, serverless and microservice components are created and wrapped as APIs, which can be used repeatedly and configured in countless ways.
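To make the "wrapped as APIs" idea concrete, here is a minimal sketch of a data-publishing microservice written as an AWS Lambda-style handler behind an API gateway. The `CURATED` dictionary is a hypothetical in-memory stand-in for a curated data lake bucket; in practice the handler would read from object storage.

```python
import json

# Hypothetical curated datasets; a stand-in for a curated data lake bucket.
CURATED = {
    "customers": [
        {"id": 1, "name": "Acme Corp", "region": "us-east"},
        {"id": 2, "name": "Globex", "region": "eu-west"},
    ]
}

def handler(event, context):
    """Serve a curated dataset by name, in the response shape API Gateway expects."""
    dataset = (event.get("pathParameters") or {}).get("dataset")
    if dataset not in CURATED:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown dataset"})}
    return {"statusCode": 200, "body": json.dumps(CURATED[dataset])}
```

Because the service only publishes data and holds no business logic, it can be reused by any number of applications and reconfigured without touching the data layer.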

In the picture below, the data lake acts as a central hub for data sources, microservices, applications and analytics. This, combined with governance processes and technologies, helps the data lake act as a healthy ecosystem and an agile platform.

Creating the data lake as a foundation for a microservices application architecture sets up a virtuous cycle. It provides a robust data application layer, while creating the organic mechanisms that keep the data lake fresh and managed.

Key Benefits of Data Lakes with Microservices

  1. Minimize cost through reuse: Data lakes prevent data from being lost, or trapped where it cannot serve future needs. Instead of reinventing the whole stack or buying new products, organizations can save significantly by basing their data layer on a data lake that promotes ingesting or creating data once, then curating and using it many times.
  2. Decouple application and data: Improve scalability and operational excellence by separating application services (update data / execute business logic) and data services (publish data).
  3. Utilize smarter metadata: Understand the data intimately by providing detailed context on the origin, meaning and value of the data.
  4. Streamline feature development / increase innovation: Expedite microservice deployment by utilizing available data through existing structures.
  5. Create Big Data applications: Stream real-time data into the data lake to provide ongoing updates for applications powered by microservices sitting right there. Then use this to create intelligent applications that benefit from applying machine learning to the streaming data.

The data that is ingested into the data lake resides in a raw object storage bucket, while the transformed data resides in a curated bucket. The success of a data lake depends on the transformation from raw data to curated data, which should organize the data so that it’s easily accessible. The best way to provide that access is by developing a microservices architecture.
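The raw-to-curated transformation described above can be sketched as follows. Bucket I/O is stubbed with Python lists for illustration; in practice the raw and curated buckets would be object storage (e.g. S3 via boto3), and the field names and cleaning rules are assumptions, not a prescribed schema.

```python
import json

# Stand-in for raw JSON-lines objects landed "as-is" in the raw bucket.
RAW_BUCKET = [
    '{"cust_id": "1", "name": " Acme Corp ", "spend": "120.50"}',
    '{"cust_id": "2", "name": "Globex", "spend": "bad-value"}',
]

def curate(raw_lines):
    """Parse, clean and type raw records; quarantine anything malformed."""
    curated, rejected = [], []
    for line in raw_lines:
        try:
            rec = json.loads(line)
            curated.append({
                "customer_id": int(rec["cust_id"]),
                "name": rec["name"].strip(),   # normalize whitespace
                "spend": float(rec["spend"]),  # enforce numeric type
            })
        except (ValueError, KeyError):
            rejected.append(line)  # keep the raw copy for later reprocessing
    return curated, rejected
```

Note that rejected records are retained rather than discarded: because the raw bucket keeps everything in its native form, curation rules can be improved later and the data reprocessed.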

Microservices can be classified into two categories:

  • Analytical: Serve the data to analytical applications like BI reports and dashboards
  • Operational: Serve near real-time data to operational systems

Unlike traditional data warehouses, data lakes ingest both structured and unstructured data. It is a daunting task to interpret and understand unstructured data without extensive domain knowledge. Microservices, however, provide a structure and schema (JSON) for the data, making it a breeze to understand. In addition, microservices that are exposed through API management tools provide traceability and security. Finally, the scalability of microservices makes them an ideal solution for external users and applications to harness the data.
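As an illustration of a microservice imposing a JSON structure on unstructured data, the sketch below parses free-form log lines into schema-conformant records before serving them. The log format and field names are assumptions chosen for the example, not a standard.

```python
import re

# Assumed log format: "<timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def to_json_records(raw_lines):
    """Turn unstructured log lines into JSON-ready dicts; skip lines
    that do not match the expected shape."""
    records = []
    for line in raw_lines:
        m = LOG_LINE.match(line)
        if m:
            records.append(m.groupdict())
    return records
```

Consumers of this service never see the raw text; they receive uniformly keyed records (`ts`, `level`, `msg`) they can query and join without domain knowledge of the source format.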

Ensuring the Data Lake Doesn’t Become a Swamp

We know that data lakes can be a helpful repository for all the information contained within legacy apps, but ensuring they do not become a swamp can be challenging.

A major concept of modern, managed data lakes is that central control and quality assurance of data leads to stagnation. Starting with a wide aperture, where everything is ingested first and then moves through refinement, promotes a more vital and useful data platform. Providing data owners with the ability to submit data “as-is” and through self-service mechanisms leads to greater participation and collaboration since many do not have the time to perform complex transformations to make data available. While this is a big step forward, the data lake remains the place where the data goes after it is used by the owner. This means the data lake relies on the data owner to report changes and ensure the data continues to flow, which is rarely high on the owner’s list of priorities.

When a microservices architecture sits on top of the data lake, the services interact directly with the data and changes flow into the lake naturally. The data lake is not a swamp where data goes to die, but instead where it lives and thrives. The data lake becomes a fully functioning ecosystem, and normally reactive governance services such as Metadata Management, Data Quality Management (DQM) and Master Data Management (MDM) are integrated into live, rather than stale, data.

For Governance, DQM and MDM, the complexity is not the technology itself. These components are just enablers. Designing the processes, ceremonies and policies that govern the data lake is the most difficult part. The more often the data is generated or changed by complex, monolithic apps outside of the data lake, the more intricate these reactive processes and controls must be. Breaking monolithic apps and their data into microservices allows more of this logic to reside in smaller, easier-to-manage components governed by the small Agile teams designing and creating them. As a result, governance and management become more organic.

The REAN Cloud Approach

REAN Cloud can design, build and operationalize a data lake using AWS East, GovCloud or another cloud provider or region of choice, delivering on principles that a typical data warehouse solution could never achieve.

REAN Cloud’s customizable data foundation platform provides the flexibility to support a multitude of tools, mechanisms and services, leveraging various technologies to act as an enabler or accelerator, rather than a replacement of tools customers may already have. The graphic below shows the set of open source, AWS managed service and packaged tools (e.g. ISVs) that a data lake often includes:

REAN Cloud supports a variety of “harvester” mechanisms to consume the available datasets “as-is,” including using AWS Kinesis functionality and a wide range of code running as AWS Lambda or Step Functions. The platform can be deployed 100% with AWS services, as well as incorporate open source or packaged applications built on scalable AWS resources.
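A harvester of the kind described above can be sketched as a Lambda-style function consuming an AWS Kinesis event and landing each record "as-is" in the raw zone. The `sink` parameter and `write_to_raw_bucket` helper are hypothetical stand-ins (added for testability) for an S3 write; Kinesis does deliver record payloads base64-encoded, as shown.

```python
import base64

def write_to_raw_bucket(record, sink):
    # Stand-in for landing the payload unchanged in the raw zone,
    # e.g. an S3 put_object call in a real deployment.
    sink.append(record)

def harvest(event, context, sink):
    """Consume a Kinesis event batch; payloads arrive base64-encoded."""
    count = 0
    for rec in event.get("Records", []):
        payload = base64.b64decode(rec["kinesis"]["data"]).decode("utf-8")
        write_to_raw_bucket(payload, sink)
        count += 1
    return {"ingested": count}
```

Because the harvester performs no transformation, datasets enter the lake in their native form and all curation happens downstream, keeping ingestion simple and owners' participation cheap.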

Similar architectures are also available for Google Cloud Platform and Microsoft Azure.

REAN Cloud works with customers using an Agile methodology to develop tailored data lakes to meet unique needs and almost any operational requirement. At the outset of an engagement, REAN Cloud aligns with its customers to manage the scope and schedule, applying resources available within an engagement.

Using REAN Cloud’s platform and vast store of pre-built components to jump-start data lake implementations, we are able to begin ingesting, curating and publishing data within 30 days. This provides the ability to configure services specifically to mission and command requirements while also quickly showing value.

For additional reading on the benefits of data lakes, check out a few of these articles:

Get Started