What SQL users should know about time series data
SQL often struggles when it comes to managing massive amounts of time series data, but it’s not because of the language itself. The main culprit is the architecture that SQL typically works in, namely relational databases, which quickly become inefficient because they’re not designed for analytical queries of large volumes of time series data.
Traditionally, SQL is used with relational database management systems (RDBMS) that are inherently transactional. They are structured around the concept of maintaining and updating records based on a rigid, predefined schema. For a long time, the most widespread type of database was relational, with SQL as its inseparable companion, so it’s understandable that many developers and data analysts are comfortable with it.
However, the arrival of time series data brings new challenges and complexities to the field of relational databases. Applications, sensors, and an array of devices produce a relentless stream of time series data that does not neatly fit into a fixed schema, as relational data does. This ceaseless data flow creates colossal data sets, leading to analytical workloads that demand a different type of database. It is in these situations that developers tend to shift toward NoSQL and time series databases to handle the vast quantities of semi-structured or unstructured data generated by edge devices.
While the design of traditional SQL databases is ill-suited for handling time series, a purpose-built time series database that supports SQL offers developers a lifeline. SQL users can now use this familiar language to develop real-time applications, and to collect, store, manage, and analyze the burgeoning volumes of time series data.
However, despite this new capability, SQL users must consider certain characteristics of time series data to avoid potential issues or challenges down the road. Below I discuss four key considerations to keep in mind when diving head-first into SQL queries of time series data.
Time series data is inherently non-relational
That means it may be necessary to reorient the way we think about using time series data. For example, an individual time series data point on its own doesn’t have much use. It is the rest of the data in the series that provides the critical context for any single datum. Therefore, users look at time series observations in groups, but individual observations are all discrete. To quickly uncover insights from this data, users need to think in terms of time and be sure to define a window of time for their queries.
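To make the idea of a time window concrete, here is a minimal sketch in Python. It uses the standard library's SQLite module purely as a stand-in so the SQL can run anywhere; it is not a time series database, and the table name `readings`, the function `average_in_window`, and the sample values are all hypothetical. The query restricts the data to a half-open window of time and then averages the values in one-minute buckets.

```python
import sqlite3


def average_in_window(rows, start, end, bucket_seconds=60):
    """Average readings per time bucket for start <= ts < end.

    rows is an iterable of (ts, value) pairs, with ts in whole seconds.
    Returns a list of (bucket_start, average_value) tuples ordered by time.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE readings (ts INTEGER, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    query = """
        SELECT (ts / :bucket) * :bucket AS bucket_start, AVG(value)
        FROM readings
        WHERE ts >= :start AND ts < :end   -- the window of time
        GROUP BY bucket_start              -- one output row per bucket
        ORDER BY bucket_start
    """
    params = {"bucket": bucket_seconds, "start": start, "end": end}
    result = conn.execute(query, params).fetchall()
    conn.close()
    return result


# Hypothetical sensor readings: (seconds since start, temperature)
sample = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (120, 30.0)]
print(average_in_window(sample, start=0, end=120))
# [(0, 21.0), (60, 22.0)]
```

The reading at 120 seconds falls outside the window and is ignored, which is the point: the window, not the individual row, is the unit of analysis.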
Since the value of each data point is directly influenced by other data points in the sequence, time series data is increasingly used to perform real-time analytics to identify trends and patterns, allowing developers and tech leaders to make informed decisions very quickly. This is much more challenging with relational data due to the time and resources it can take to query related data from multiple tables.
Scalability is of paramount importance
As we connect more and more equipment to the internet, the amount of generated data grows exponentially. Once these data workloads grow beyond trivial—in other words, when they enter a production environment—a transactional database will not be able to scale. At that point, data ingestion becomes a bottleneck and developers can’t query data efficiently. And none of this can happen in real time, because of the latency due to database reads and writes.
A time series database that supports SQL can provide sufficient scalability and speed for large data sets. Strong ingest performance allows a time series database to continuously ingest, transform, and analyze billions of time series data points per second without limitations or caps. As data volumes continue to grow at exponential rates, a database that can scale is critical to developers managing time series data. For apps, devices, and systems that create huge amounts of data, storing the data can be very expensive. Leveraging high compression reduces data storage costs and enables up to 10x more storage without sacrificing performance.
SQL can be used to query time series
A purpose-built time series database enables users to leverage SQL to query time series data. A database that uses Apache DataFusion, an extensible SQL query engine written in Rust that uses Apache Arrow as its in-memory format, will be even more effective. DataFusion is an open source project that allows users to efficiently query data within specific windows of time using SQL statements.
Apache DataFusion is part of the Apache Arrow ecosystem, which also includes Flight SQL, a protocol for interacting with SQL databases built on top of Apache Arrow Flight, and Apache Parquet, a columnar storage file format. Flight SQL provides a high-performance SQL interface to work with databases using the Arrow Flight RPC framework. Because data is exchanged in Arrow format and needs no conversion along the way, it allows for faster data access and lower latencies. A Flight SQL client must be set up before data is available for queries or analytics. To provide ease of access between Flight SQL and clients, the open source community created a FlightSQL driver, a lightweight wrapper around the Flight SQL client written in Go.
Additionally, the Apache Arrow ecosystem is based on columnar formats for both the in-memory representation (Apache Arrow) and the durable file format (Apache Parquet). Columnar storage is well suited to time series data because time series data typically contains runs of identical values over time. For example, if a user is gathering weather data every minute, consecutive temperature values will often be the same.
These repeated values provide an opportunity for cheap compression, which enables high cardinality use cases. This also enables faster scan rates using the SIMD instructions found in most modern CPUs. Depending on how data is sorted, users may only need to look at the first column of data to find the maximum value of a particular field.
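Run-length encoding is one simple way to see why repeated values in a column compress so cheaply. The sketch below is an illustration only, not how any particular database or file format implements compression; the function names and temperature values are hypothetical.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical per-minute temperature column that barely changes
temperatures = [21.5, 21.5, 21.5, 21.5, 21.6, 21.6, 21.5]

encoded = run_length_encode(temperatures)
print(encoded)
# [(21.5, 4), (21.6, 2), (21.5, 1)]

# Decoding restores the column exactly, so nothing is lost
assert run_length_decode(encoded) == temperatures
```

Seven stored values shrink to three pairs here. The same trick does little for row-oriented data, where neighboring values belong to different fields and rarely repeat.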
Contrast this with row-oriented storage, which requires users to look at every field, tag set, and timestamp to find the maximum field value. In other words, users have to read the first row, parse the record into columns, include the field values in their result, and repeat. Apache Arrow provides a much faster and more efficient process for querying and writing time series data.
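The difference between the two layouts can be sketched in a few lines of plain Python. This is a conceptual illustration that uses dictionaries and lists rather than Apache Arrow itself; the record fields, values, and function names are hypothetical.

```python
def max_from_rows(records, field):
    """Row-oriented scan: visit every record and pull one field out of each."""
    best = None
    for record in records:  # each record also carries fields we do not need
        value = record[field]
        if best is None or value > best:
            best = value
    return best


def to_columns(records):
    """Pivot a list of row dictionaries into a dictionary of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def max_from_columns(columns, field):
    """Column-oriented scan: read only the one contiguous column of interest."""
    return max(columns[field])


# Hypothetical weather records
rows = [
    {"time": 1, "location": "roof", "temperature": 21.5, "humidity": 40.0},
    {"time": 2, "location": "roof", "temperature": 23.1, "humidity": 41.0},
    {"time": 3, "location": "roof", "temperature": 22.4, "humidity": 39.5},
]

columns = to_columns(rows)
print(max_from_rows(rows, "temperature"))        # 23.1
print(max_from_columns(columns, "temperature"))  # 23.1
```

Both functions return the same answer, but the columnar version never touches the time, location, or humidity values, which is the property a columnar engine exploits at scale.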
A language-agnostic software framework offers many benefits
The more work developers can do on data within their applications, the more efficient those applications can be. Adopting a language-agnostic framework, such as Apache Arrow, lets users work with data closer to the source. A language-agnostic framework not only eliminates or reduces the need for extract, transform, and load (ETL) processes, but also makes working on large data sets easier.
Specifically, Apache Arrow works with Apache Parquet, Apache Arrow Flight SQL, Apache Spark, NumPy, PySpark, pandas, and other data processing libraries. It also includes native libraries in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Working in this type of framework means that all systems use the same memory format, there is little or no serialization overhead when it comes to cross-system communication, and interoperable data exchange is standard.
High time for time series
Time series data includes everything from events, clicks, and sensor data to logs, metrics, and traces. The sheer volume and diversity of insights that can be extracted from such data are staggering. Time series data allows for a nuanced understanding of patterns over time and opens new avenues for real-time analytics, predictive analysis, IoT monitoring, application monitoring, and devops monitoring, making time series an indispensable tool for data-driven decision making.
Having the ability to use SQL to query that data removes a significant barrier to entry and adoption for developers with RDBMS experience. A time series database that supports SQL helps to close the gap between transactional and analytical workloads by providing familiar tooling to get the most out of time series data.
In addition to providing a more comfortable transition, a SQL-supported time series database built on the Apache Arrow ecosystem expands the interoperability and capabilities of time series databases. It allows developers to effectively manage and store high volumes of time series data and take advantage of several other tools to visualize and analyze that data.
The integration of SQL into time series data processing not only brings together the best of both worlds but also sets the stage for the evolution of data analysis practices—bringing us one step closer to fully harnessing the value of all the data around us.
Rick Spencer is VP of products at InfluxData.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.