Column: Former Google Software Engineer Christopher Olson On Treode Database

November 12, 2014

Christopher Olson is a software industry veteran with over two decades of experience, starting with Oracle in 1994. Mr. Olson quit his job with Google to start his own database company, Treode.

In a guest column with TechTaffy, Mr. Olson introduces his startup, and explains why you should try TreodeDB.

BY CHRISTOPHER OLSON

TreodeDB is a resilient and consistent database for petabytes. It can be replicated for high-availability and sharded for high-scalability. TreodeDB is a transactional database; it provides atomic multirow writes, even across replicas and shards.

Why use TreodeDB?

The site db-engines.com lists over 200 databases to choose from. Why do we need another? Some entrepreneurs and investors might say this signifies a crowded field, but we believe it indicates that the incumbents are failing to meet an essential combination of needs. First, database users want to analyze petabytes of data in near real time. Second, users want the system to be available 24×7. Third, many applications require database transactions. The enthusiasm surrounding BigData has exposed first two desires, but the third requirement remains hidden in back-office applications that still use traditional databases.

Newer databases, for example MongoDB and Cassandra, have served many applications well, and they have supported very lucrative web properties. However, applications in manufacturing, finance and health care remain tied to SQL databases. Let’s be clear: businesses in those industries are using NoSQL platforms like Hadoop for analytics, but the transactional element of their applications still rely on SQL databases. Why? When these applications perform a write to the database, they update multiple rows together. They need that update to happen as an atomic unit; they need all rows to be updated or none of them. To have only some of the rows updated would leave the database in an inconsistent state. By that, we do not mean negligible effects on statistics, and it’s not okay to reconcile the discrepancy later. We mean a vendor might spend $30,000 to ship $2M of heavy equipment after the customer had cancelled the order, or the bank might lose $5M in a botched transfer between accounts.

Some product managers argue that legacy code, not lack of transactions, ties these applications to SQL databases, and there is a little bit of truth to that. However, intrepid companies in these industries, and upstart tech companies servicing these industries, are experimenting with NoSQL databases. This activity shows that some users can begin rewriting their applications, they want to move to open source, they want a simple programming API, and they want availability and scalability. Cassandra and HBase have responded by adding a limited type of transaction, and one can perform an atomic multirow write so long as all the rows happen to land on the same server. MySQL and PostgreSQL have responded by offering a mechanism to shard data, but again the rows of a transaction must all land on the same shard. Unfortunately, the applications cannot always structure their behavior to meet this limitation.

On the one hand, an application developer can choose a traditional SQL database. It may be proprietary and expensive, require effort to replicate, lack the ability to scale, or provide limited transactions. On the other hand, that developer can choose aNoSQL database. They are usually open source, can tolerate failures of individual machines, have demonstrated their ability to scale, but again have limited transactions. Enter TreodeDB: it fills the gap between these choices. It provides an open source, replicated, and sharded database with support for atomic multirow writes.

How to Use TreodeDB

TreodeDB is a key-value store, so the API is dead simple: read, write, scan. It’s a large hash-map, but with a twist: its versioned and uses optimistic transactions, which fits smoothly with HTTP ETag and dovetails nicely into RESTful architecture.

read (read-timestamp, keys) → timestamped-values
write (condition-timestamp, rows) → write-timestamp
scan (table, window, slice) → timestamped-rows

To illustrate how one would use this API, lets make a movie database. It will keep two tables: one of movies and one of actors. For each table we need a key and a value. For the key for movies, we will derive a URL safe string from the title, like “star-wars” or “weird-science”. For the value, we will use an object that includes the title, release date and cast members. We will do something similar for actors. Our database will have entries like this:

movies:star-wars → {
“id”: “star-wars”,
“title”: “Star Wars”,
“cast”: [
{ “actorId”: “mark-hamill”, “actor”: “Mark Hamill”, “role”: “Luke Skywriter” },
{ “actorId”: “harrison-ford”, “actor”: “Harrison Ford”, “role”: “Han Solo” },
{ “actorId”: “carrie-fisher”, “actor”: “Carrie Fisher”, “role”: “Princess Leia Organa” }
] }

movies:mark-hamill → {
“id”: “mark-hamill”,
“name”: “Mark Hamill”,
“roles”: [
{ “movieId”: “star-wars”, “title”: “Star Wars”, “role”: “Luke Skywriter” },
{ “movieId”: “empire-strikes-back”, “title”: “Star Wars: The Empire Strikes Back”, “role”: “Luke Skywalker” },
{ “movieId”: “return-of-the-jodi”, “title”: “Star Wars: Return of the Jedi”, “role”: “Luke Skywalker” }
]}

Notice that the entries are redundant. On the suspicion that reads will outnumber writes, we chose this design to answer reads quickly, but writes may need to change multiple rows. In the above example, we misspelled Luke’s name, probably thanks to autocorrect. To fix the mistake, we would update both the row for star-wars and for mark-hamill:

retry {

rt ← now // read timestamp

((a, t1), (b, t2)) ← read (rt, movie:star-wars, actor:mark-hamill)
// result is
// a: object for star-wars
// b: object for mark-hamill
// t1, t2: timestamps from each row

a’ → … // fixup Luke’s name in star-wars
b’ → … // fixup Luke’s name in mark-hamill
write (max (t1, t2), movie:star-wars → a’, actor:mark-hamill → b’)

} until success

The timestamps let us update the movie and actor in a way that detects write-write conflicts. If two writers attempt to update a row at the same time, one will succeed and the other will fail. The failed writer has the option to kick the error further up the stack, or to retry its read-modify-write, restarting with its reads to fetch the new data. This is the essence of optimistic concurrency: a writer does not pessimistically lock data. Rather, a writer optimistically assumes that the data will remain stable, and it conditionally applies its writes in one batch at the end, if the data in fact remained unchanged. Optimistic concurrency works because usually the data does remain stable, and usually the writer does not need to retry—the application developer must take care to avoid creating hot items like global counters, but that’s true of pessimistic concurrency too. Optimistic concurrency avoids the overhead of locking, and the headaches of recovering locks from apparently failed clients in a distributed system. These benefits offset the cost of occasional retries.

Treode uses timestamps to implement the condition, and it uses Lamport clocks so that the client needs to track only the maximum timestamp. When a client retrieves a value, it supplies a read timestamp. Treode does to things: it finds the value for the row as of that time, that is with a timestamp on or before that time, and it arranges that writes will use a greater timestamp. For the condition timestamp on the write, a client can safely use any value between the maximum timestamp of the retrieved values (inclusive) and the timestamp used for the read (inclusive).

These timestamps correspond to HTTP headers. When reading, the maximum timestamp of the read values can serve as the standard HTTP header ETag. When writing, the condition timestamp can come from the header If-Unmodified-Since. Many HTTP proxies will react appropriately to these tags, so you can use a content distribution network (CDN) to serve reads near your web clients, and those clients can issue updates to the origin repository. If a client bases an update on stale data from a cache, TreodeDB will report it to the client so that it can request fresh data and reattempt the update.

The scan method allows one to scan all the rows of a table over some window of time, and over some slice of the data. The window of time can mean: scan only the latest value for each row, or scan all the updates to the row between two points in time. The slice allows one to run the scan in parallel. One can use the scan method to implement an InputFormat for Hadoop MapReduce, or a Resilient Distributed Dataset (RDD) for Spark. While using TreodeDB for your transactions, you can leverage the Hadoop ecosystem for analytics.

How to Deploy TreodeDB

To use TreodeDB, you’re probably expecting that you can install a Debian package, or push a button in AWS to setup a cell. Also, you undoubtedly want client libraries for a variety of programming languages. We will get there some day, but for now we have packaged TreodeDB as a Scala library for implementing a storage server. An additional layer is needed to expose the library as a server. That layer would add code for encryption, authentication, authorization, auditing, rate limiting, and so on. There are many standards to choose from, and different database users will want to use different components to address each of these. We have packaged TreodeDB as a server side library because that provides a clean modular interface between those components and the distributed, replicated, transactional key-value repository.

Most importantly though, the layer that connects the key-value library to a server interface will probably impose some data model. For example, one model might decide that they keys are strings and the values JSON objects. The model might also add support for secondary indexes. Another model might follow the lead of relational databases, in which keys can be composite and values are tuples of primitive datatypes. Other models might be tailored to specific business concepts like customers, vendors, orders and shipments. Or, some creative developers might explore novel modeling concepts.

The language bindings would also be tied to the model. With the JSON model, the language binding might be a RESTful library implemented using the language’s tools for HTTP. For the relational model, the language binding might be some sophisticated object-relational mapper. The Hadoop and Spark connectors would also be tied to the model. In these two examples, one set of connectors would need to account for nested JSON objects, while the other would just expose tuples of primitive datatypes.

Summary

TreodeDB is a prototype for highly available transactions; it reveals they really are possible. TreodeDB exposes a programming interface that meshes with RESTful architectures. It is very flexible, letting you chose the model and client protocol. Implementing distributed transactions is a tricky business. Creating a useful generic data model and client protocol is also a tricky business. TreodeDB handles the first part, and provides a modular way to let other components handle the other parts.

Treode, Inc. is a prefunded company. We have one primary lead, one part-time contributor, and one part-time business advisor. We are talking with early stage investors, since some seed money would let us hire developers to work full-time on documentation, packaging, testing, performance and features. However, we haven’t asked those investors for money yet. Before that, we would like to see some traction. We’d like to see developers star the [repository on GitHub](https://github.com/Treode/store) and users ask questions on the [online forum](https://forum.treode.com). We’d like to find an other full-time contributor. And we’d like two or three marquee companies to informally endorse us.

Checkout TreodeDB on GitHub. How would you use it? Are you interested in contributing? You can ask questions online at our forum or email us.

[Image courtesy: Treode]

(Updated: 19 September 2022)