PurpleCube AI
Data Orchestration for the Modern Enterprise
This publication may not be reproduced or distributed without Eckerson Group’s prior permission.
About the Author

Kevin Petrie is VP of Research at Eckerson Group, where he manages the research agenda and writes independent reports about topics such as data integration, generative AI, machine learning, and cloud data platforms. For 25 years Kevin has deciphered what technology means for practitioners as an industry analyst, instructor, marketer, services leader, and journalist. He launched a global data analytics services team for EMC Pivotal and ran field training at the data integration provider Attunity (now part of Qlik). A frequent public speaker and coauthor of two books about data management, Kevin loves helping start-ups educate their communities about emerging technologies.
About Eckerson Group
Eckerson Group is a global research, consulting, and advisory firm that helps organizations get more value from data. Our experts think critically, write clearly, and present persuasively about data analytics. They specialize in data strategy, data architecture, self-service analytics, master data management, data governance, and data science. Organizations rely on us to demystify data and analytics and develop business-driven strategies that harness the power of data. Learn what Eckerson Group can do for you!
About This Report
Eckerson Group provides independent and objective research on emerging technologies, techniques, and trends in the field. Although we do not recommend vendors or products, we write sponsored profiles such as this one to help practitioners understand different offerings.
Executive Summary
PurpleCube AI offers a data orchestration platform that unifies and streamlines data engineering across data centers, regions, and clouds. It helps enterprises reduce their number of pipeline tools, accelerate performance, and reduce cost as they support business intelligence (BI) and artificial intelligence and machine learning (AI/ML) projects as well as operational workloads. PurpleCube AI targets three primary use cases: data modernization, data preparation, and self-service. It differentiates its platform with cost performance, generative AI assistance, a unified platform, flexibility, and active metadata. Eckerson Group recommends that data and analytics leaders evaluate how PurpleCube AI might alleviate their data silos and pipeline bottlenecks; learn more about PurpleCube AI’s differentiators; and compare PurpleCube AI to alternative approaches with both current requirements and likely future requirements in mind.
Company
CEO Bharat Phadke cofounded PurpleCube AI in 2020 to help enterprises tackle the persistent problem of siloed, bottlenecked data. As leader of the consulting firm Edgematics, Phadke had witnessed this problem firsthand. His clients encountered project delays and cost overruns as they tried to manage heterogeneous data lakes with multiple data pipeline tools. To alleviate that pain, Phadke and his team built a data orchestration platform that unifies and streamlines data engineering across data centers, regions, and clouds. PurpleCube AI’s platform helps enterprises improve the cost performance of extract, load, and transform (ELT) workloads, with the flexibility to choose where to process their data. Today this bootstrapped venture helps Fortune 2000 companies simplify how they manage data in hybrid, cloud, and multi-cloud environments.
PurpleCube AI offers a graphical user interface (GUI), generative AI (GenAI) chatbot, and distributed architecture to help data engineers and architects, or even business analysts, prepare data for analytics. It extracts data from a myriad of sources, then loads and transforms it on targets that include Cloudera, the Databricks lakehouse, Snowflake Data Cloud, and hyperscalers such as AWS or Microsoft Azure. PurpleCube AI users can validate data quality, perform exploratory analysis, and orchestrate pipelines with scheduling and monitoring capabilities. Enterprises use PurpleCube AI to boost productivity and efficiency as they democratize data consumption throughout the organization.
Figure 1. PurpleCube AI Data Orchestration Platform

Figure 1 illustrates the range of PurpleCube AI capabilities for modern data ecosystems.
Target Customers and Use Cases
PurpleCube AI’s ideal customer is an enterprise that struggles to deliver multi-sourced data to business intelligence (BI) projects, artificial intelligence/machine learning (AI/ML) projects, and operational applications. Its data teams use different pipeline tools for different parts of their environment, reducing productivity and exacerbating data silos. These data engineers and architects fall short of cost performance objectives with heritage Hadoop data lakes in particular. Short-staffed, they cannot keep pace with business demands for structured and semistructured data from various sources. They need to consolidate tools, reduce scripting, and accelerate pipelines while also enabling self-service for a growing population of business-oriented data consumers.
Customers such as Virgin Mobile and Canadian Tire implement PurpleCube AI to meet such objectives. Data managers and consumers alike use PurpleCube AI to manipulate data sets with natural language, mouse clicks, or command lines, according to their preference. They build, orchestrate, and optimize high-performance data pipelines that span many sources, targets, and processing engines. These pipelines can transfer data in batches or streams across software-as-a-service (SaaS) applications, file systems, databases, data warehouses, and lakehouses. PurpleCube AI unifies such disparate elements with automated workflows to help companies better support analytics and operations.
PurpleCube AI targets three primary use cases: data modernization, data preparation, and self-service.
Data modernization. Enterprises use PurpleCube AI to modernize their data architectures. With its help, data engineers and architects can perform initial migrations as well as subsequent updates, using continuous integration and continuous delivery (CI/CD) capabilities to refine pipelines, change sources, and tune performance. Capabilities such as these enable enterprises to transform and validate data in support of cloud-based analytical projects and operational applications. T-Mobile, for example, uses PurpleCube AI to move data to Snowflake.
Data preparation. Data engineers use PurpleCube AI to simplify data preparation. The PurpleCube AI studio enables them to configure, schedule, and monitor the pipelines that transform data for analytics. They can discover tables or other data objects, then reformat, merge, cleanse, filter, and structure them to meet consumption requirements. With the help of GenAI, data teams can query and explore these data sets, generate quality rules, and enrich metadata. Data engineers at Home Depot use PurpleCube AI to consolidate purchase records on Google BigQuery, then synchronize dynamic source schemas to prepare for real-time and historical customer analytics.
Self-service. Data analysts, data scientists, and business analysts use PurpleCube AI to prepare data for analytics themselves rather than relying on busy data engineers. They design pipelines in the studio, then inspect, analyze, and share the resulting data product. PurpleCube AI’s GenAI capabilities play a major role in self-service, helping less technical users prepare data and query tables on their own. They can explore, inspect, and interpret data using their business knowledge, with little or no need for scripting skills.
Product Functionality
All these data and business stakeholders collaborate on the PurpleCube AI platform to integrate data for analytics and operations, using the PurpleCube AI Studio as well as additional capabilities.
PurpleCube AI Studio creates and assembles the components of a pipeline. Its primary elements are data points, data objects, data flows, job flows, and data connect (see Figure 2).
Figure 2. PurpleCube AI Studio

> Data points provide connection details for source and target systems, such as hostname, IP address, and user credentials. They also help standardize data transfer.
> Data objects encapsulate metadata—names, schemas, attributes, and so on—that describes physical objects such as tables, views, and files.
> Data flows are the ELT pipelines that use data points and data objects to map attributes between sources and targets.
> Job flows define the sequences of jobs that a pipeline executes, including their order, dependencies, and triggers.
> Data connect performs real-time ingestion via change data capture technology and integration with message brokers such as Apache Kafka.
Figure 3 illustrates a sample data pipeline based on these elements.
Figure 3. Configuring a Data Pipeline
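To make the relationships among these elements concrete, the following Python sketch models data points, data objects, data flows, and job flows as simple classes. It is an illustration only and does not reflect PurpleCube AI's actual data model or API; every class, field, hostname, and job name here is a hypothetical chosen for the example. The job flow uses a topological sort from the Python standard library to derive an execution order from job dependencies.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class DataPoint:
    """Connection details for a source or target system (hypothetical fields)."""
    name: str
    hostname: str
    user: str


@dataclass
class DataObject:
    """Metadata describing a physical table, view, or file."""
    name: str
    data_point: DataPoint
    columns: tuple[str, ...]


@dataclass
class DataFlow:
    """An ELT pipeline that maps source attributes to target attributes."""
    name: str
    source: DataObject
    target: DataObject
    mapping: dict[str, str]  # source column -> target column

    def unmapped_columns(self) -> list[str]:
        """Source columns that have no target mapping."""
        return [c for c in self.source.columns if c not in self.mapping]


@dataclass
class JobFlow:
    """Jobs and their dependencies: job name -> jobs it must wait for."""
    name: str
    dependencies: dict[str, set[str]] = field(default_factory=dict)

    def execution_order(self) -> list[str]:
        # static_order() raises graphlib.CycleError on circular dependencies.
        return list(TopologicalSorter(self.dependencies).static_order())


# Hypothetical example: one data flow and a three-job nightly job flow.
crm = DataPoint("crm_db", "crm.example.internal", "etl_user")
lake = DataPoint("lakehouse", "lake.example.internal", "etl_user")
src = DataObject("customers", crm, ("id", "full_name", "email"))
tgt = DataObject("dim_customer", lake, ("customer_id", "name"))
flow = DataFlow("load_customers", src, tgt,
                {"id": "customer_id", "full_name": "name"})
jobs = JobFlow("nightly", {
    "load_customers": set(),
    "load_orders": set(),
    "build_report": {"load_customers", "load_orders"},
})
order = jobs.execution_order()
```

In this sketch, `build_report` always runs last because it depends on both load jobs, while the two load jobs may run in either order.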

Additional capabilities
Additional product capabilities include the monitor, scheduler, SQL editor, GenAI module, admin, data stream, and command line interface (CLI).
> The monitor tracks job status and operational metrics, with visual displays and alerts as well as debugging assistance.
> The scheduler enables the user to specify task execution dates, times, and frequency, as well as events that trigger tasks.
> The SQL editor enables users to discover and explore data, eliminating the need for a separate SQL client. It provides views of tables and runs custom queries.
> The GenAI module provides a chatbot through which users instruct PurpleCube AI to perform commands based on their choice of language model (GPT or Vertex to start).
> The admin enables administrators to set up and control project parameters, application settings, user privileges, agent configurations, and connectivity settings.
> The data stream uses Spark to perform real-time data transformations on file systems or object stores.
> The command line interface (CLI) receives and executes instructions from Linux machines or applications that use REST APIs.
Differentiators
PurpleCube AI competes with data pipeline vendors such as Informatica and Talend (now part of Qlik). It differentiates its offering with cost performance, GenAI assistance, a unified platform, flexibility, and active metadata.
Cost performance. PurpleCube AI optimizes workloads to improve the cost performance of data management in several ways. Its distributed agents push the workload down to local processing engines such as Apache Spark for Databricks, massively parallel processors for Teradata, and so on. It also places the data in optimally sized files using the Hadoop Distributed File System (HDFS) during the processing phase to make the most efficient use of CPU and memory. This lightweight footprint and approach help increase throughput, decrease latency, and reduce cost. PurpleCube AI helped Scotiabank, for example, reduce tool and infrastructure costs by $1.5 million while consolidating data to detect money laundering.
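The idea of optimally sized files can be illustrated with simple arithmetic. The sketch below is not PurpleCube AI's algorithm, which this report does not describe; it assumes a target file size of 128 MiB (a common HDFS block size) and a hypothetical function name, `plan_output_files`, and shows how a planner might split a data set into files that land close to that target.

```python
import math

# Assumption for illustration: aim for files near 128 MiB each.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def plan_output_files(total_bytes: int,
                      target_file_bytes: int = DEFAULT_TARGET_BYTES) -> tuple[int, int]:
    """Return (file_count, bytes_per_file) so files land close to the target size."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    if total_bytes == 0:
        return (0, 0)
    # Round to the nearest whole number of files, but always write at least one.
    file_count = max(1, round(total_bytes / target_file_bytes))
    # Spread the bytes evenly; the last file may be slightly smaller.
    bytes_per_file = math.ceil(total_bytes / file_count)
    return (file_count, bytes_per_file)


ten_gib_plan = plan_output_files(10 * 1024 ** 3)
small_plan = plan_output_files(1000)
```

Under these assumptions, a 10 GiB data set is split into 80 files of 128 MiB each, while a tiny data set is written as a single file rather than many small ones.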
GenAI assistance. The GenAI module, in beta as of Q1 2024, both increases productivity for technical users and enables self-service for business users. On the technical side, data engineers and data analysts can type natural language commands to autogenerate queries, data quality rules, or descriptions of glossary terms. Such capabilities help these technical users complete projects faster and free up time for additional projects. Business analysts, meanwhile, can use the GenAI module to discover, format, and query data as part of exploratory analysis. This helps democratize data consumption by making insights available to more business users without increasing the workload for data engineers.
Unified platform. PurpleCube AI unifies the ingestion, transformation, and validation of structured and semistructured data sets to support various types of workloads. It integrates with more than 150 sources and targets to ingest data in periodic batches or real-time increments. These broad capabilities enable PurpleCube AI users to consolidate pipeline tools and gain productivity, simplifying data management even as the underlying data sources and targets proliferate. Later this year, PurpleCube AI plans to add support for semi- or unstructured data objects such as the text and images that feed new generative AI initiatives.
Flexibility. PurpleCube AI offers flexibility on several dimensions. For starters, customers can process data on their choice of existing infrastructure rather than having to configure, implement, or purchase separate compute resources. PurpleCube AI also supports a wide range of workloads, platforms, and users thanks to open APIs and data formats. Users can consume and interact with data themselves, for example using GenAI, rather than pulling up a standalone BI tool. In addition, they can port PurpleCube AI software between processing engines without rewriting code or reconfiguring systems. Finally, customers say PurpleCube AI’s services team speeds implementations, reduces training requirements, and rapidly resolves issues.
Active metadata. PurpleCube AI uses metadata to optimize processes and simplify the user experience. It gathers the metadata that describes users, tasks, events, systems, pipeline code, and applications. This rich metadata enables PurpleCube AI’s controller to give efficient instructions to agents and orchestrate workflows across the elements of enterprise environments. Metadata also streamlines workloads by enabling PurpleCube AI to optimize file sizes and consumption of CPU or memory resources. Users have open access to all this metadata, which gives them graphical views of data lineage to assist impact analysis and governance programs. By going further to expose its code and metadata, PurpleCube AI helps developers build their own extensions in less time.
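Lineage-based impact analysis amounts to a graph traversal: given the object that changes, find everything derived from it. The following minimal sketch assumes lineage is available as a mapping from each object to the objects built directly from it. The function name `downstream_impact` and the sample table and dashboard names are hypothetical, and the sketch does not represent how PurpleCube AI stores or exposes its metadata.

```python
from collections import deque


def downstream_impact(lineage: dict[str, list[str]], changed: str) -> list[str]:
    """Breadth-first walk over lineage edges (object -> objects derived from it)."""
    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against revisiting shared descendants
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: sources feed staging, warehouse tables, then dashboards.
lineage = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["dw.dim_customer"],
    "erp.orders": ["dw.fact_orders"],
    "dw.dim_customer": ["bi.churn_dashboard", "dw.fact_orders"],
    "dw.fact_orders": ["bi.revenue_dashboard"],
}

customer_impact = downstream_impact(lineage, "crm.customers")
orders_impact = downstream_impact(lineage, "erp.orders")
```

In this example, a schema change to `crm.customers` touches five downstream objects, including both dashboards, whereas a change to `erp.orders` affects only the orders fact table and the revenue dashboard.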
Architecture
A browser-based user interface, which includes the studio, monitor, admin, and scheduler, issues instructions to a Java-based controller. This controller serves as the brain of PurpleCube AI. It captures user instructions, maintains them in the metadata repository, and compiles them into messages that Java-based agents then execute on pipelines, sources, and targets. A broker manages communications between agents and the controller (see Figure 4).
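The controller, broker, and agent roles described above follow a familiar message-passing pattern. The sketch below illustrates that pattern in Python with in-memory queues; it is a simplified stand-in and not PurpleCube AI's implementation, which is Java-based. All class names, message fields, and job names are assumptions made for the example.

```python
import json
import queue
from dataclasses import dataclass


@dataclass
class Instruction:
    job: str
    engine: str      # name of the agent that should run the job
    statement: str   # work to execute on that agent's local engine


class Broker:
    """One in-memory queue per agent; a stand-in for a real message broker."""

    def __init__(self) -> None:
        self._queues: dict[str, queue.Queue] = {}

    def publish(self, agent: str, message: str) -> None:
        self._queues.setdefault(agent, queue.Queue()).put(message)

    def consume(self, agent: str):
        pending = self._queues.get(agent)
        while pending is not None and not pending.empty():
            yield pending.get()


class Controller:
    """Records each instruction as metadata, then compiles it into a message."""

    def __init__(self, broker: Broker) -> None:
        self.broker = broker
        self.metadata_repository: list[dict] = []

    def submit(self, instruction: Instruction) -> None:
        record = {"job": instruction.job,
                  "engine": instruction.engine,
                  "statement": instruction.statement}
        self.metadata_repository.append(record)
        self.broker.publish(instruction.engine, json.dumps(record))


class Agent:
    """Pulls messages addressed to it and records the jobs it handled."""

    def __init__(self, name: str, broker: Broker) -> None:
        self.name = name
        self.broker = broker
        self.executed: list[str] = []

    def run_pending(self) -> int:
        count = 0
        for message in self.broker.consume(self.name):
            task = json.loads(message)
            # A real agent would hand task["statement"] to its local engine here.
            self.executed.append(task["job"])
            count += 1
        return count


broker = Broker()
controller = Controller(broker)
agent = Agent("snowflake", broker)
controller.submit(Instruction("load_customers", "snowflake", "INSERT ..."))
controller.submit(Instruction("aggregate_sales", "teradata", "INSERT ..."))
handled = agent.run_pending()
```

Here the agent named `snowflake` handles only the one message addressed to it, while the controller's metadata repository retains a record of both instructions.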

As described earlier, PurpleCube AI pushes data processing workloads down to underlying platforms such as Hadoop, Teradata, Snowflake, Amazon Redshift, or Google BigQuery.
Pricing
PurpleCube AI offers express, advanced, and enterprise pricing plans:
> The express plan includes a single-user license for data ingestion across a limited number of sources and targets, along with five days of deployment consulting and 40 weekday hours of product support.
> The advanced plan includes three user licenses, support for more sources and targets, data cleansing, 10 days of deployment consulting, and 40 weekday hours of support.
> The enterprise plan comprises five user licenses, full source and target support, three custom connections, multi-environment support, and additional capabilities. It includes 15 days of deployment consulting and 24-7 support.
PurpleCube AI also offers a 30-day free trial with 24-7 support, and is available in all three cloud marketplaces.
Summary and Recommendations
PurpleCube AI’s data orchestration platform helps enterprises improve the cost performance of ELT workloads while choosing their own processing engine. PurpleCube AI offers a graphical user interface, GenAI chatbot, and distributed architecture to help data engineers and architects—or even business analysts—prepare data for analytics. It helps enterprises consolidate pipeline tools as they support BI, AI/ML, and operational workloads. PurpleCube AI targets three primary use cases: data modernization, data preparation, and self-service.
Eckerson Group recommends that data and analytics leaders take the following actions:
> Evaluate how PurpleCube AI might alleviate data silos and pipeline bottlenecks in their heterogeneous environments. They might find that PurpleCube AI can help them consolidate their tools and processes for data engineering.
> Learn more about PurpleCube AI’s differentiators—including cost performance, GenAI assistance, a unified platform, flexibility, and active metadata—and map those differentiators to their own data requirements.
> Compare PurpleCube AI to alternative approaches, including existing homegrown or commercial tools, in terms of its ability to meet both current requirements and likely future requirements.
About Eckerson Group

Wayne Eckerson, a globally known author, speaker, and consultant, formed Eckerson Group to help organizations get more value from data and analytics. His goal is to provide organizations with expert guidance during every step of their data and analytics journey.
Eckerson Group helps organizations in three ways:
> Our thought leaders publish practical, compelling content that keeps data analytics leaders abreast of the latest trends, techniques, and tools in the field.
> Our consultants listen carefully, think deeply, and craft tailored solutions that translate business requirements into compelling strategies and solutions.
> Our advisors provide competitive intelligence and market positioning guidance to software vendors to improve their go-to-market strategies.
Eckerson Group is a global research, consulting, and advisory firm that focuses solely on data and analytics. Our experts specialize in data governance, self-service analytics, data architecture, data science, data management, and business intelligence.
Our clients say we are hardworking, insightful, and humble. It all stems from our love of data and our desire to help organizations turn insights into action. We are a family of continuous learners, interpreting the world of data and analytics for you.
Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!

About the Sponsor
PurpleCube AI is a unified data orchestration platform that revolutionizes how businesses manage and utilize their data. We achieve this by directly embedding the power of Generative AI into the data engineering process. This unique approach enables us to:
> Unify data engineering: PurpleCube AI provides a single platform to manage all your data engineering needs, from structured and semi-structured data to streaming data. This eliminates the complexity and cost of managing multiple tools, along with the need for specialized skills for each platform.
> Automate complex tasks: Leveraging cutting-edge AI, PurpleCube AI allows you to streamline data integration, transformation, and processing.
> Activate insights: PurpleCube AI empowers you to unlock valuable insights from your data with complete certainty and agility. Its comprehensive metadata management ensures data quality and trust, while its intelligent agents enable you to adapt and innovate at the speed of business.
Beyond standard data lake and warehouse automation, PurpleCube AI harnesses the power of language models. This enables a range of innovative use cases, including processing various file formats, uncovering hidden insights through exploratory data analysis and natural language queries, automatically generating and enriching metadata, assessing and improving data quality, and optimizing data governance through relationship modeling.
PurpleCube AI is ideal for:
> Data architects
> Data engineers
> Data scientists
> Data analysts
Across various industries:
> Banking
> Telecommunications
> Automotive
> Healthcare
> Retail
PurpleCube AI represents a true paradigm shift in data orchestration. We believe in empowering businesses to extract maximum value from their data, driving innovation, and achieving a competitive edge in the data-driven world.