ORNL, Google, Snowflake working on watermarks for tracking streaming data


A team of collaborators from the U.S. Department of Energy’s Oak Ridge National Laboratory (ORNL), Google, Snowflake, and Ververica has tested a computing concept that could help speed up real-time processing of data streaming on mobile and other electronic devices.

The concept explores the function of watermarks, a mechanism for tracking how complete streaming data processing is. Watermarks allow new tasks to be processed immediately after prior tasks are completed, according to ORNL.
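To make the idea concrete, here is a minimal Python sketch of one common way a watermark can be derived and used. It is an illustration under stated assumptions, not the API of Flink or Cloud Dataflow: the names `WatermarkTracker` and `tumbling_window_counts` are hypothetical, timestamps are plain integers, and the watermark is assumed to trail the largest event time seen so far by a fixed allowed delay.

```python
class WatermarkTracker:
    """Hypothetical helper: watermark = largest event time seen - allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        # The watermark never moves backwards, even if events arrive out of order.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.max_event_time - self.max_delay


def tumbling_window_counts(event_times, window_size, max_delay):
    """Count events per fixed-size window, emitting a window only once
    the watermark says no more on-time events for it are expected."""
    tracker = WatermarkTracker(max_delay)
    open_windows = {}  # window start -> running count
    results = []       # (window start, final count), in emission order

    for t in event_times:
        watermark = tracker.observe(t)
        start = (t // window_size) * window_size

        # Assumption: an event whose window the watermark has already
        # passed is treated as late and dropped.
        if start + window_size <= watermark:
            continue

        open_windows[start] = open_windows.get(start, 0) + 1

        # Emit every open window whose end the watermark has reached.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                results.append((s, open_windows.pop(s)))

    return results


if __name__ == "__main__":
    # Out-of-order event times; windows are 10 units wide, allowed delay is 2.
    print(tumbling_window_counts([1, 3, 2, 11, 9, 12, 25], 10, 2))
```

In this sketch the event at time 9 arrives after the event at time 11 but is still counted, because the watermark has not yet reached the end of its window. Real systems must also propagate watermarks across many pipeline stages and compute nodes, which this single-process example does not attempt.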

“There hasn’t been a clear, efficient mechanism for tracking phenomena of interest in a data stream over time and across different data processing pipelines,” said Edmon Begoli, AI Systems section head in ORNL’s National Security Sciences Directorate. “Watermarking is an up-and-coming concept that advances the state-of-the-art in stream processing frameworks.”

To determine how different platforms might effectively process real-time data, the team compared watermarks on the two platforms that currently offer the most advanced implementations: Apache Flink, an open-source stream- and batch-processing framework, and Google Cloud Dataflow, a streaming analytics service.

The researchers found that Cloud Dataflow’s watermark propagation tends to have higher latencies (delays in transferring data) and that Flink’s latency grows nonlinearly as the pipeline depth and compute node count increase. However, both systems, which were built by the same community, provide a similar user experience.

In the context of DOE and ORNL research, watermarks will be useful for analyzing complex cyber events as well as collecting data from multiple sources and over various time scales, such as from sensors that measure health stats, human behaviors and movements, or environmental interactions, according to a statement by ORNL.

[Image courtesy: ORNL]