NoSQL Improved Upon Record Based Databases

NoSQL databases (known as key/value, key/JSON, or “document” database) provide data that is self-describing, which is a huge improvement compared to older “record-based” databases. NoSQL databases are often compared to (purely) relational databases, but this is primarily because relational databases hold the biggest market. Record-based databases are really a non-entity these days, so there wouldn’t be much benefit in trying to “compete” against them.

The best comparison of NoSQL is against record-based databases.

Record Based Databases

Old fashioned “record-based” databases contained the data, while the structure of data was contained within the application. The database would contain a set of data values such as the following:

123, Chris, Craft, 123-555-1212

The application would know that these data attributes were as follows:

Record Key, First Name, Last Name, Phone Number

The application would typically contain a structure definition for the language the application was written in. If you go back far enough in history, this would have been Assembler or possibly COBOL language. The C/C++ language(s) would have used a struct contained in a header file.

This was fine in the days when ALL of the application code that touched the data was written in the same language. It was a huge pain when multiple applications accessed a common record-based database using different languages, such as a COBOL application that created data, and a C/C++ or Java application needed to access the same data. Someone needed to maintain structure definitions that were in-sync with each other, and it was a huge problem when changes were made to the structure.

Record-based databases included things like VSAM and ISAM. Neither of those databases used SQL, so they would certainly fit the definition of a NoSQL database. I realize the NoSQL people in this world aren’t thinking of VSAM and ISAM to be part of their orbit. Of course that’s a basic problem with the “NoSQL” term itself. Anything defined by what it’s not is very poorly defined.

Self-Describing Data

NoSQL databases typically contain self-describing data. JSON (Java Script Object Notation) is the most commonly used form of self-describing data, but there are other forms as well including XML. Rather than describing data in a struct within the application code, those structure definitions are kept within the data. The database provides a single source of truth for data structure, but also allows for much simpler modification of the data structure and code that access the data as the application and database evolve over time.

A JSON structure matching the previous example might look like this:

{
    "recordkey":"123",
    "firstname":"Chris",
    "lastname":"Craft",
    "phonenumber":"123-555-1212"
}

Of course every single record contains the field names, which consume more storage than the data in this example. Storage is cheap these days and we have much greater processing power than years ago, so there isn’t as much “data pressure” that would force us to define the structure inside the application. The reduction of “data pressure” allows us to use a method that’s less storage efficient to gain advantages in the application development side.

Data Pressure Considerations

The term “data pressure” essentially refers to the cost of storage and the cost of processing larger and larger volumes of data.  Record-based database were extremely efficient in terms of the space they required and compute power to process that data. Less data volume meant less data store and less data to process, so nobody would have considered making data self-describing.

Record-based databases were never self-describing because that simply was not practical decades ago. The cost of storage was simply too high and putting data descriptions into the data itself would have resulted in a huge amount of data duplication (duplication of the data definitions stored within each record), and there would not have been enough compute power on earth to work with that data.

Comparing NoSQL to Relational

NoSQL databases are more about the structure and representation of data than about the  absence of SQL language. Part of this is the same issue of object/relational impedance mismatch that has been around for decades. The rise of object-oriented programming resulted in application code that simply didn’t fit well with data modeled relationally.

Oracle Database supports Key/JSON

The SQL language does not work only with relational data. We can certainly use SQL to access a 2 column table containing a KEY (column #1) and JSON structure (column #2). Oracle Database includes the ability to model data in key/JSON format. Oracle databases might contain nothing but key/JSON data, some relational and some key/JSON, or JSON can be used to extend a relational model.

Oracle is a Converged Database

Oracle Database is a called a “converged” database because it supports virtually any data modeling approach and supports virtually any workload, all within a single database or deployed as separate databases.  Oracle is a multi-model database that supports relational, graph, time series, key/JSON, row-oriented, column-oriented (on Exadata), Geospatial data, normalized models, dimensional models (star schema, snowflake schema, etc.) and others.  Oracle Database supports all workload types including OLTP, DW, and IoT within a single database.

Oracle database uses the SQL language to process what some consider to be “NoSQL” format data (like JSON).

All Data Has Relationships

All data has relationships to other data. You can easily see this by simply looking at any piece of data.  All data has relationships to other data such as:

  • Customers place Orders
  • Orders are for Products
  • Products are in Warehouses
  • Warehouses have Employees

The list of relationships goes on and on, ad infinitum. Ensuring relationships are tracked is critical to data integrity. What happens if a customer orders a product that doesn’t exist? What happens if products go missing from a warehouse?

Databases Simplify Applications

The raison d’être for databases has always been to simplify application code. Anything done by the database is something the application code doesn’t have to handle. The basic functions of a database through the history of IT has been these:

  • Preventing Data Loss
  • Controlling multiple user access to data
  • Enforcing data values are correct
  • Ensuring data relationships are valid
  • Data Security

These functions have been done by databases for decades because it simplifies application code.

Preventing Data Loss

Databases protect data, which is the lifeblood of an organization. Preventing loss of data has been key to survival even in the days of paper records and stone tablets. Electronic records are subjected to loss from a wide variety of failures, and databases are responsible for providing protection against these failures. Data loss might mean loss of the ENTIRE database, but also means changes from one user getting destroyed by another user.

Controlling Multiple User Access to Data

Without databases, applications would face extreme complexity in controlling access to data by multiple users. Access to data includes reading data, but also includes changing that data. Applications would be horribly complex without the concurrency controls provided by databases. Users also shouldn’t have access to changes made by others until those changes are complete.

Enforcing Data Values are Correct

The correctness of data values includes enforcing number attributes contain numeric values, date attributes contain valid dates, JSON strings contain valid JSON, GPS locations are valid locations, countries are valid countries, etc. Without this enforcement by the database, errors in application code will introduce corrupt data, which will have a cascading impact on other application code. If the database contains an order date of February 30th, that’s clearly wrong and it’s difficult to determine how that value got inserted into the database. It’s much better to catch that when it happens and prevent the change.

Ensuring Data Relationships are Valid

Relationships represent a special class of data values that are enforced by the database. This goes to the heart of why databases were created, and why the relational database became the leader. Hierarchical databases could only enforce certain data relationships. Network databases were a little better, but also had limitations. Relational databases (invented by Edgar F. Codd) provided much greater capability to model data relationships and ensure those relationships were reflected in the data.

Data Security

Of course databases play a central role in the security of data, including who can view data as well as who can make changes to data. This includes not simply access to the entire database, but down to individual attributes as well. There are different access controls over who can view customers, who can change customer data, and who can view or change customer addresses and credit cards.

All of this complexity can be pushed into the application code, but we often want separate data access controls enforced by the database. The principal is known as “defense in depth”. Fort Knox has the gold inside a vault, inside a building, surrounded by a fence, patrolled by guards. You certainly wouldn’t want a Fort Knox with the gold sitting in the middle of a field with guards running around trying to protect it. Defense. In. Depth.

Normalization & Denormalization

Data often begins in a denormalized form or at least we often start to think about data that’s not normalized, so we typically “normalize” the data when designing a relational data model. We then “de-normalize” data primarily for performance reasons. One example might be phone numbers. Customers often have multiple phone numbers such as home, work, and cellular phones.

In a normalized data model, each of a customers’ phone numbers would be stored in separate rows of a phone number table related to the customer. However, that means accessing extra rows of data when you want to show them all. To improve performance of accessing ALL of the rows, we might simply put separate home, work, and cellular phone numbers on the customer table. However, what happens when a customer has 2 cell phones?

Phone numbers are a great case for the combination of JSON within an otherwise relational database.  Rather than modeling N number of phone numbers in the customer table, we might create a PHONE_NUMBERS column as a JSON structure.  That design allows a variable number of phone numbers for each customer.

Database Optimization

I heard someone say recently that relational databases aren’t optimized for anything, which is a complete misrepresentation of reality. Every table in a normalized relational data model should have a primary key. That primary key will have an index. Access to a row of data through the index is “optimized” and will give good performance.

The database “engine” does not provide the logical structure of data. Database “engines” (historically) have not automatically generated indexes beyond those needed to enforce constraints such as primary keys and unique keys. Building of secondary indexes (those beyond what’s needed to enforce constraints) has always been the responsibility of the database administrator, data modeler, or application developer. So yes, in a sense the database is not “optimized” out of the box if you don’t optimize it!

Oracle Autonomous Database is the first on the market that is doing this sort of optimization and automatically handling what has traditionally been the responsibility of the DBA, data modeler, or application developer.