Automating data pipelines: How Upsolver aims to reduce complexity
Upsolver’s value proposition is interesting, particularly for those with streaming data needs, data lakes and data lakehouses, and shortages of accomplished data engineers. It’s the subject of a recently published book by Upsolver’s CEO, Ori Rafael, Unlock Complex and Streaming Data with Declarative Data Pipelines.
Instead of manually coding data pipelines and their plentiful intricacies, you declare what transformation is required from source to target. The underlying engine then handles the logistics largely automatically (with user input as desired), pipelining source data into a format the targets can use.
Some might call that magic, but it’s much more practical.
“The fact that you’re declaring your data pipeline, instead of hand coding your data pipeline, saves you like 90% of the work,” Rafael said.
Consequently, organizations can spend less time building, testing and maintaining data pipelines, and more time reaping the benefits of transforming data for their particular use cases. With today’s applications increasingly involving low-latency analytics and transactional systems, the reduced time to action can significantly impact the ROI of data-driven processes.
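To make the declarative idea concrete, here is a minimal Python sketch. It is not Upsolver's API or syntax; the spec keys (`select`, `where`, `rename`), the `run_pipeline` function and the sample events are all hypothetical names chosen for illustration. The point is only that the user states what the output should look like, and a generic engine decides how to produce it.

```python
from typing import Iterable, Iterator

# Hypothetical declaration: it says WHAT the target should contain,
# not HOW to loop over, filter or reshape the source rows.
PURCHASES_SPEC = {
    "select": ["user_id", "amount"],
    "where": lambda row: row.get("event") == "purchase",
    "rename": {"amount": "purchase_amount"},
}


def run_pipeline(spec: dict, rows: Iterable[dict]) -> Iterator[dict]:
    """Toy engine that interprets a spec; stands in for the automated part."""
    keep = spec.get("where", lambda row: True)
    rename = spec.get("rename", {})
    for row in rows:
        if keep(row):
            # Project the declared columns, applying any declared renames.
            yield {rename.get(col, col): row.get(col) for col in spec["select"]}


# Made-up sample events, for illustration only.
events = [
    {"user_id": 1, "event": "purchase", "amount": 30.0},
    {"user_id": 2, "event": "view", "amount": None},
]
result = list(run_pipeline(PURCHASES_SPEC, events))
```

Changing the transformation means editing the declaration rather than the engine, which is where the claimed savings in building and maintenance would come from.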
Underlying complexity of data pipelines
To the uninitiated, there are numerous aspects of data pipelines that may seem convoluted or complicated. Organizations have to account for different facets of schema, data models, data quality and more with what is oftentimes real-time event data, like that for ecommerce recommendations. According to Rafael, these complexities are readily organized into three categories: orchestration, file system management, and scale. Upsolver provides automation in each of these areas:
- Orchestration: The orchestration rigors of data pipelines are nontrivial. They involve assessing how individual jobs affect downstream ones in a web of descriptions about data, metadata, and tabular information. These dependencies are often represented in a Directed Acyclic Graph (DAG) that’s time-consuming to populate. “We are automating the process of creating the DAG,” Rafael revealed. “Not having to work to do the DAGs themselves is a big time saver for users.”
- File System Management: Here, Upsolver can manage the file system format (like that of Oracle, for example). There are also nuances of compressing files into usable sizes and syncing the metadata layer with the data layer, all of which Upsolver does for users.
- Scale: The automation pertaining to scale for pipelining data includes provisioning resources to ensure low-latency performance. “You need to have enough clusters and infrastructure,” Rafael explained. “So now, if you get a big [surge], you are already ready to handle that, as opposed to just starting to spin-up [resources].”
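The orchestration point can be sketched in a few lines of Python. This is an illustrative assumption about how a DAG could be derived automatically, not a description of Upsolver's engine: the job names, the `reads`/`writes` keys and the `build_dag` helper are hypothetical. Each job only declares which tables it reads and writes; the dependencies and a valid run order are then inferred with the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical job declarations: each job names the tables it reads
# and the single table it writes. Nobody writes the edges by hand.
JOBS = {
    "ingest_events": {"reads": [], "writes": "raw_events"},
    "clean_events": {"reads": ["raw_events"], "writes": "clean_events"},
    "load_users": {"reads": [], "writes": "users"},
    "enrich": {"reads": ["clean_events", "users"], "writes": "enriched"},
}


def build_dag(jobs: dict) -> dict:
    """Map each job to the set of jobs that produce the tables it reads."""
    producer = {job["writes"]: name for name, job in jobs.items()}
    return {
        name: {producer[table] for table in job["reads"]}
        for name, job in jobs.items()
    }


dag = build_dag(JOBS)
# static_order() yields every job after all of its upstream jobs.
order = list(TopologicalSorter(dag).static_order())
```

Adding a new job that reads `enriched` would slot into the order without anyone editing the graph, which is the kind of time saving Rafael describes.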
Other than the advent of cloud computing and the distribution of IT resources outside organizations’ four walls, the most significant data pipeline driver is data integration and data collection. Typically, no matter how effective a streaming source of data is (such as events in a Kafka topic illustrating user behavior), its true merit is in combining that data with other types for holistic insight. Use cases for this span anything from adtech to mobile applications and software-as-a-service (SaaS) deployments. Rafael articulated a use case for a business intelligence SaaS provider, “with lots of users that are generating hundreds of billions of logs. They want to know what their users are doing so they can improve their apps.”
Data pipelines can combine this data with historic records for a comprehensive understanding that fuels new services, features, and points of customer interaction. Automating the complexity of orchestrating, managing the file systems, and scaling those data pipelines lets organizations transition between sources and business requirements to spur innovation. Another facet of automation that Upsolver handles is the indexing of data lakes and data lakehouses to support real-time data pipelining between sources.
“If I’m looking at an event about a user in my app right now, I’m going to go to the index and tell the index what do I know about that user, how did that user behave before?” Rafael said. “We get that from the index. Then, I’ll be able to use it in real time.”
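The lookup Rafael describes can be pictured with a small Python sketch. It assumes a simple in-memory index keyed by user ID; the `UserIndex` class, the `enrich` function and the event fields are hypothetical and say nothing about how Upsolver actually stores or serves its index.

```python
from collections import defaultdict


class UserIndex:
    """Toy in-memory index of past events per user (illustrative only)."""

    def __init__(self) -> None:
        self._history = defaultdict(list)

    def lookup(self, user_id) -> list:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._history.get(user_id, []))

    def record(self, event: dict) -> None:
        self._history[event["user_id"]].append(event)


def enrich(event: dict, index: UserIndex) -> dict:
    """Attach what the index already knows about the user to a live event."""
    prior = index.lookup(event["user_id"])
    enriched = {
        **event,
        "prior_event_count": len(prior),
        "last_action": prior[-1]["action"] if prior else None,
    }
    index.record(event)  # the current event becomes history for the next one
    return enriched


index = UserIndex()
first = enrich({"user_id": 7, "action": "view"}, index)
second = enrich({"user_id": 7, "action": "purchase"}, index)
```

The second event arrives already annotated with the user's earlier behavior, so downstream logic can act on it in real time without a separate batch join.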
Upsolver’s major components for making data pipelines declarative instead of complicated include its streaming engine, indexing and architecture. Its cloud-ready approach encompasses “a data pipeline platform for the cloud and… we made it decoupled so compute and storage would not be dependent on each other,” Rafael remarked.
That architecture, with the automation furnished by the other aspects of the solution, has the potential to reshape data engineering from a tedious, time-consuming discipline to one that liberates data engineers.