Hackolade

Sunday, June 3, 2018

Data Modeling is Dead... Long Live Schema Design!

Reality or not, the perception nowadays is that data modeling has become a bottleneck and doesn't fit in an agile development approach. Plus with NoSQL being "schema-less", perception often is that there is no need for data modeling ahead of coding. You may pretend that it is not happening. Or blame complexity, speed of change, culture, or developers' mentality. Or argue that data modeling is actually agile.

In the meantime, data modelers feel left out of the development process... because they are! They fear for their jobs, long term if not sooner. This is a recurring theme we sense at every Fortune 500 company across the US and Europe when we give our training 'Agile Query-Driven Data Modeling for NoSQL'.

The reality is that data modeling needs to be re-invented in order to remain relevant. And since there is so much baggage associated with the term "data modeling", maybe we should give it a less threatening name, such as "schema design"?

Here, the purists generally stop me to say: "Wait, you can't go straight into physical modeling without doing first the conceptual then logical models." Well... maybe, but that's part of the issue. If you can't demonstrate that you facilitate speed to market, then you're viewed as being in the way, and autonomous agile teams will try to get around you.

Logical modeling is counter-productive (for NoSQL)

Working our way backwards in the traditional sequence: conceptual -> logical -> physical, we all know by now that schema design is actually more important with NoSQL than with relational databases, since JSON is so powerful and flexible, but not so forgiving.

Traditional Data Modeling Process

Logical modeling makes sense when aiming to achieve an application-agnostic database design, which is still best served by relational database technology. But when designing a NoSQL database, which should be application-specific to leverage the benefits of the technology, it becomes apparent that logical modeling is a counter-productive step. Since logical modeling is supposed to be normalized while NoSQL schema design will be mostly denormalized, why go through the logical modeling exercise at all?

From Traditional Data Modeling to NoSQL Schema Design

Some sort of conceptual modeling continues to be required to document the understanding and blueprint of the business. But when dealing with NoSQL and agile development, we propose that Domain-Driven Design should replace conceptual modeling. Then, driven by business rules and application screens, reports and queries, we can map directly from domain aggregates in bounded contexts of DDD to the design of the NoSQL physical schema, thereby bypassing logical modeling.

Domain-Driven Design helps avoid "big balls of mud"

Creating an enterprise model is achievable for the initial incarnation of software systems. But without care and attention, inherent domain and technical complexity will, over time, turn monolithic applications into a pattern known as the "big ball of mud". Change is risky, and the best developers spend valuable time fixing technical complexity and technical debt, instead of adding value in domain evolution.

Domain-Driven Design is a language- and domain-centric approach to software design for complex problem domains. It recognizes that over time, an enterprise conceptual model will lose integrity as it grows in complexity, as multiple teams work on it, and as language becomes ambiguous. With DDD you decompose complex problems so you can be effective at modeling bounded contexts that are defined with unity and consistency. DDD promotes the use of a Ubiquitous Language to minimize the cost of translation between business and technical terminology and to enable deep insights into the domain thanks to a shared language and collaborative exploration during the modeling phase.

DDD consists of a collection of patterns, principles, and practices that enable teams to focus on what's core to the success of the business while crafting software that tackles the complexity in both the business and the technical spaces. One such pattern is an aggregate, a cluster of domain objects that can be treated as a single unit, for example an order and its order lines.

Domain-Driven Design maps directly to the concepts of Agile and NoSQL

There's nothing in agile to suggest that one should skip design. It suggests that design should be evolutionary and iterative. DDD also encourages an iterative process, first at a strategic level to divide the work and focus on what's important to the business, then at a tactical level to understand the details of each bounded context.

On the database side, relational modeling is vastly different than the types of structures that application developers use. Database joins slow down performance and lead to object-relational impedance mismatch, causing developers to move away from relational modeling and towards aggregate models. When an aggregate is retrieved from the database, the developer gets all the necessary related data, thereby facilitating manipulations.

A NoSQL document structure corresponds to the structure of a programming object in a much better way than a relational database does, and at the same time, can closely represent DDD aggregates of domain objects.
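As a minimal sketch of that correspondence, the order-and-order-lines aggregate mentioned earlier can be written either as the normalized rows a relational model would use, or as one nested document. All field names and values below are hypothetical, and `load_aggregate` is only an illustration of the application-side re-assembly, not a recommended implementation.

```python
import json

# Normalized, relational-style rows (hypothetical data): two tables linked by order_id.
orders = [{"order_id": 1001, "customer": "Acme Corp", "status": "shipped"}]
order_lines = [
    {"order_id": 1001, "line_no": 1, "sku": "A-42", "qty": 2, "unit_price": 9.50},
    {"order_id": 1001, "line_no": 2, "sku": "B-07", "qty": 1, "unit_price": 30.00},
]


def load_aggregate(order_id):
    """Re-assemble the aggregate from normalized rows: the application-side 'join'."""
    order = next(o for o in orders if o["order_id"] == order_id)
    lines = [
        {k: v for k, v in line.items() if k != "order_id"}
        for line in order_lines
        if line["order_id"] == order_id
    ]
    return {**order, "lines": lines}


# The same aggregate stored as a single document: one read returns the order and its lines.
order_document = {
    "order_id": 1001,
    "customer": "Acme Corp",
    "status": "shipped",
    "lines": [
        {"line_no": 1, "sku": "A-42", "qty": 2, "unit_price": 9.50},
        {"line_no": 2, "sku": "B-07", "qty": 1, "unit_price": 30.00},
    ],
}

print(json.dumps(order_document, indent=2))
```

The document form is already shaped like the object the application code manipulates, whereas the normalized form has to be re-assembled on every read.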

DDD maps directly to NoSQL document DB concepts

Back to our proposal that logical modeling should be avoided, why would you break down domain aggregates into normalized entities, only to re-assemble them again during the physical schema design process?

Logical modeling when DDD and NoSQL are used together

If you had a logical model, how would you go about doing your NoSQL schema design with no knowledge of what queries and reports will look like? In other words, how would you perform entity aggregation without the context of the application screens and their content?

Document schema design

Having defined the aggregates of a bounded context, it is necessary to create additional artifacts: mainly a pragmatic charting of workflows and business rules (not a full BPMN that would be hard to produce, maintain, and digest), plus mockups (or wireframes) for application screens and reports. What's important here is to not fall into the same traps as reviewed earlier with enterprise data models! But the creation of these artifacts tends to reveal points of attention that may have been overlooked in the DDD phase.

Domain- and Query-Driven Schema Design for NoSQL

Based on the above streamlined process, the actual schema design step should be clearer. But the flexibility and power of JSON are the next challenge. It seems so intuitive at first that it is easy to overlook the potential traps.

Say you've agreed to denormalize and aggregate information into one document. The next question is "how?" There are probably as many different ways to do it as you have members on your team: do you embed locally all related entity data? Or do you embed a partial duplicate or snapshot of remote entity data? Or do you refer to remote entity data, with one- or two-way referencing?

Here are a few factors influencing choices in relationship expression:

cardinality: does high cardinality lead to practical or technical issues?
strength of entity relationships: do they all conceptually belong together?
query atomicity: what info needs to be returned together?
update atomicity: must it all change together?
update complexity: what's the impact if data is duplicated? How do we avoid data inconsistency?
document size: how much time will it take to load? Are we in a mobile environment where data traffic matters? Will the document size grow indefinitely?
coding complexity: does it all make sense in the code?
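To make the three options concrete, here is a rough sketch of one order document relating to a customer entity in each style. The field names and the `rename_customer` helper are hypothetical; the point is only to show where the update-complexity question comes from.

```python
import copy

# Hypothetical "remote" entity that lives in its own collection.
customer = {"_id": "c1", "name": "Acme Corp", "email": "ap@acme.example", "tier": "gold"}

# Option 1: embed all related entity data in the order document.
order_embedded = {"_id": "o1", "total": 49.0, "customer": copy.deepcopy(customer)}

# Option 2: embed a partial duplicate (snapshot) of the fields the order screen needs.
order_snapshot = {
    "_id": "o1",
    "total": 49.0,
    "customer": {"_id": customer["_id"], "name": customer["name"]},
}

# Option 3: reference the remote entity by key (one-way reference).
order_reference = {"_id": "o1", "total": 49.0, "customer_id": customer["_id"]}


def rename_customer(new_name, orders_with_copies):
    """Update the source entity, then every document that duplicates the name."""
    customer["name"] = new_name
    touched = 0
    for order in orders_with_copies:
        order["customer"]["name"] = new_name
        touched += 1
    return touched


# Duplicated data must be rewritten everywhere it was copied; the referencing
# document needs no change, but reading the name costs a second lookup.
updated = rename_customer("Acme Corporation", [order_embedded, order_snapshot])
print(f"documents rewritten because of duplication: {updated}")
```

Whether a snapshot should follow the rename at all is itself a business decision: an invoice, for instance, may need to keep the name as it was at the time of the order.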

The added-value of Data Modelers

Beyond the provocative nature of the headline, the exercise of designing a NoSQL database is obviously far from trivial. The dynamic and evolving nature of a JSON structure is a wonderful opportunity that should not be spoiled by a careless approach. While developers are certainly capable of doing their own schema design, is it really the best allocation of resources? In enterprises dealing with any kind of application complexity, it quickly becomes obvious that data modelers can be tremendous contributors to the quality of agile development.

Years of experience in data modeling of relational databases have trained them to naturally:

focus on the core business use case
create pragmatic models without being over ambitious or perfectionist
reveal hidden insights and simplify
experiment with different designs to reach a flexible solution
challenge assumptions and look at things from a different perspective
facilitate the dialog between application stakeholders

Data modeling is no longer an exercise taking place just in the early stage of an application lifecycle. Data modeling is now part of the iterative agile development and continuous integration loop, adding value every step of the way.

Data Modeling has a role in every step of the agile development process, including in production

Even in production, data modeling is used to reverse-engineer all production NoSQL databases to discover new fields and structures that may have been added, providing unique documentation of unstructured and semi-structured data - a critical factor in the context of GDPR and privacy regulations.
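As an illustration only (this is not a description of how any particular tool works), discovering fields that were added in production can be sketched as inferring field paths and types from sampled documents, then comparing them with the documented model. The sample documents and the `documented` set below are hypothetical.

```python
def field_paths(doc, prefix=""):
    """Yield (dotted_path, type_name) for every field, descending into sub-documents and arrays."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        yield path, type(value).__name__
        if isinstance(value, dict):
            yield from field_paths(value, prefix=f"{path}.")
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    yield from field_paths(item, prefix=f"{path}[].")


def infer_schema(docs):
    """Map each field path to the set of types observed across the sample."""
    schema = {}
    for doc in docs:
        for path, type_name in field_paths(doc):
            schema.setdefault(path, set()).add(type_name)
    return schema


# Hypothetical documents sampled from a production collection.
sample = [
    {"name": "Ann", "address": {"city": "Ghent"}},
    {"name": "Bob", "address": "12 Main St", "email": "bob@example.com"},
]

# Field paths already present in the documented model (hypothetical).
documented = {"name", "address", "address.city"}

observed = infer_schema(sample)
new_fields = sorted(set(observed) - documented)
print("undocumented fields:", new_fields)
print("multi-type fields:", sorted(p for p, t in observed.items() if len(t) > 1))
```

Here the undocumented `email` field is exactly the kind of personal data that a privacy review would want to know about.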

As usual when a major shift is under way, there are 2 possible approaches: resist change, or embrace it. Data modelers should not fear agile development. They should enthusiastically embrace change, become the developers' best friends, and demonstrate their tremendous added value in achieving higher-quality applications together.

Thursday, March 29, 2018

Is NoSQL dead?

TL;DR -- "Reports of NoSQL's death are greatly exaggerated!..."

Every article introducing NoSQL usually starts by explaining that the term is a misnomer, as it really stands for "Not Only SQL", etc... And back in 2014, some analysts predicted that "By 2017, the "NoSQL" label will cease to distinguish DBMSs, which will reduce its value and result in it falling out of use." This was pleasing news for traditional DBMS vendors, and also "multi-model" vendors.

For sure, we've seen some convergence. RDBMS vendors all allow storage of JSON documents, and MongoDB has recently announced support for multi-document ACID transactions.

But full convergence and the disappearance of NoSQL would not be such a good thing for users. Incumbents might like it if the buzz about NoSQL levels off. But it seems in the interest of NoSQL vendors to maintain a striking differentiator, while demonstrating their maturity as enterprise solutions. The term "NoSQL" carries tremendous marketing power, and vendors would be foolish to stop leveraging that.

After that, the situation resembles the debate of best-of-breed versus integrated platforms, ranging from hi-fidelity sound systems to ERPs. There will always be fervent proponents of each philosophical approach. The only question is: do you want the right tool for the job? For companies that have adopted NoSQL, few today use just a single database technology. They may use one platform for operational big data, another one for search, yet another for caching, and one more to power their recommendation engine.

Enterprises are embracing more and more a variety of best-of-breed NoSQL solutions to solve their specific challenges. They want proper data governance for their unstructured and semi-structured data, particularly in the context of GDPR and privacy concerns. They need a single tool to perform the data modeling of the top NoSQL vendors with a powerful and user-friendly interface. Hackolade provides just that:

- document-oriented: MongoDB, Couchbase, Cosmos DB, Elasticsearch, Firebase, Firestore

- key-value: DynamoDB; with Redis coming at a later date

- column-oriented: HBase, Cassandra

- graphs: we're actively developing a new version to support property graph databases, starting with Neo4j, and RDF triples

- RDBMS with JSON: we also plan support for JSON modeling in Oracle, MySQL, MS SQL Server, and PostgreSQL
- JSON and APIs: there's high demand for us to apply our data modeling to GraphQL, Swagger 2, OpenAPI 3, and LoopBack.

NoSQL is dead, long live NoSQL!

Current Hackolade DB targets

Tuesday, January 30, 2018

Schema validation for a schemaless database: is it a contradiction?

MongoDB recently introduced, with its version 3.6, a validation capability using JSON Schema syntax. As we keep hearing that one of the great benefits of NoSQL is the absence of schema, isn’t this new feature an admission of the limitations of NoSQL databases? The answer is a resounding NO: schema validation actually brings the best of both worlds to NoSQL databases!

Previously with version 3.2, MongoDB had introduced a validation capability, using their Aggregation Framework syntax. This was in response to the request of enterprises wishing to leverage the benefits of NoSQL, without risk of losing control of their data. JSON Schema is the schema definition standard for JSON files, sort of the equivalent of XSD for XML files. So, it was only natural that MongoDB would adopt the JSON Schema standard. There are multiple reasons to leverage this capability:

1) Enforcing schema only when it matters: with JSON Schema, you can declare fields where you want enforcement to take place, and let other fields be added with no enforcement at all by using the ‘additionalProperties’ keyword. Some fields are more important than others in a document. In particular in the context of privacy laws and GDPR, you may want to track some aspects of your schema and ensure consistency. You may also want to control data quality with field constraints such as string length or regular expression, numeric upper and lower limits, etc…

2) JSON polymorphism: having a schema declared and enforced does not at all limit you in your ability to have multi-type fields or flexible polymorphic structures. It only makes sure that they do not occur as a result of development mistakes. JSON Schema, with oneOf/anyOf/allOf/not choices, lets you declare in your validation rules exactly what is allowed and what is not allowed.

3) Degree of enforcement: MongoDB lets you decide, for each collection, the validation level (off, strict or moderate), and the validation action to be returned by the database through the driver (warning or error.)
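The three points above can be sketched together in a single validator definition. The collection name, field names, and constraints below are hypothetical; the option names (`validator`, `$jsonSchema`, `validationLevel`, `validationAction`) are the ones MongoDB uses, and the example only builds and prints the command document rather than connecting to a server.

```python
import json

# JSON Schema for a hypothetical "customers" collection.
customer_schema = {
    "bsonType": "object",
    # 1) Enforcement only where it matters: these fields are mandatory and constrained.
    "required": ["name", "email"],
    "properties": {
        "name": {"bsonType": "string", "maxLength": 100},
        "email": {"bsonType": "string", "pattern": "^.+@.+$"},
        "age": {"bsonType": "int", "minimum": 0, "maximum": 150},
        # 2) Polymorphism: either a plain string or a structured sub-document.
        "address": {
            "oneOf": [
                {"bsonType": "string"},
                {
                    "bsonType": "object",
                    "required": ["city"],
                    "properties": {"city": {"bsonType": "string"}},
                },
            ]
        },
    },
    # Fields not listed above remain allowed, with no enforcement.
    "additionalProperties": True,
}

# Body of a `collMod` database command applying the validator to the collection.
coll_mod_command = {
    "collMod": "customers",
    "validator": {"$jsonSchema": customer_schema},
    # 3) Degree of enforcement.
    "validationLevel": "moderate",  # one of: off, strict, moderate
    "validationAction": "warn",     # one of: warn, error
}

print(json.dumps(coll_mod_command, indent=2))
```

With a driver such as PyMongo, a command document like this would typically be passed to the database object's `command` method; treat that wiring as an assumption to check against your driver version.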

In effect, the $jsonSchema validator becomes the equivalent of a DDL (data definition language) for NoSQL databases, letting you apply just the right level of control to your database.

Hackolade model dynamically generates MongoDB $jsonSchema validator

Since Hackolade was built from the ground up on JSON Schema, it has been quite easy to maintain MongoDB certification as a result of this v3.6 enhancement. No JSON Schema knowledge is required! You build your collection model with a few mouse clicks, and Hackolade dynamically generates the JSON Schema script for creation or update of the collection validator.

Friday, July 28, 2017

MongoDB Developer Productivity in an Agile World

Most database developers will tell you that traditional relational SQL databases are not ideally suited for agile development. They require a schema defined upfront and subsequent (costly) database migrations as the structure changes -- all things that don’t fit well within the two- to three-week sprint cycles of an agile or Continuous Integration approach.

MongoDB on the other hand is built to free developers from upfront schema specification, even when changes occur. It supports dynamic schemas, which can evolve along with the application, reducing both development effort and expensive migrations, making companies more reactive and agile.

Similarly, IT departments and vendors can no longer impose the illusion of one-size-fits-all tools and approaches to self-managed teams. Developers want, and should get, the freedom to use the tools and features that can let them be the most productive.

New-generation developer productivity tools allow for taking full advantage of agile development and MongoDB: IDEs (Integrated Development Environments), GUIs (Graphical User Interfaces), and data modeling software.

Multi-threaded Skills for the Future

Sharing a common development environment is a great way to work faster when you’re on a team. It helps get you closer to that state of meditative bliss known as “Deploy at Will.” Achieving daily or even better, hourly, deployments to production means reducing code inventory - lines of code still log-jammed in the Dev/Test/CI/QA pipeline that are not yet delivering value to customers.

SQL query builder

Studio 3T is an IDE that helps teams work better with MongoDB, irrespective of physical location or technical level. Realizing that “one size never fits” for even one user let alone all users, it offers multiple querying options, including a drag-and-drop query builder, an auto-completion IntelliShell, an Aggregation Pipeline builder, and in the most recent release, the option to write and run traditional SQL queries against your NoSQL collections. A basic GUI (such as Robomongo) may be sufficient for an individual working on their personal project, but as soon as you have a team of three or more, with different needs, preferences and technical skill levels, and working to commercial deadlines to boot, then having a shared IDE to work in is really indispensable.

Visual Data Modeling for MongoDB Schemas

By the same token, while schemas for small applications may be simple enough that no documentation is necessary, the power and flexibility of JSON makes physical data modeling even more important. It has been demonstrated time and again that data modeling accelerates development, significantly reduces maintenance, increases application quality, and lowers execution risks across the enterprise.

A new generation of data modeling tools is available on the market to properly represent physical data models for MongoDB collections and views. Hackolade is the pioneer for data modeling for NoSQL and multi-model databases. It was built from the ground-up to support the polymorphic and evolving structure of JSON documents. It helps the onboarding of NoSQL technology in corporate IT landscapes via a user-friendly visual interface and a more bottom-up, continuously changing, agile approach.

Scripts and documentation generation

Schema documentation provides the necessary map of the data to guide users through building the queries with Studio 3T.

One Size Fits None

The sign of mature growth in a platform is the richness of associated tooling that emerges over time. While platform vendors such as Microsoft, Oracle, and now MongoDB in their wake all naturally focus on a unitary enterprise use case, the availability of quality tools makes the platform many times more flexible for a far wider variety of end users. Like major RDBMS vendors before it, MongoDB has become the clear leader in the NoSQL market, in large part because of a healthy and growing ecosystem.

Wednesday, April 19, 2017

The Tao of NoSQL Data Modeling

The idea for Hackolade came from my own personal need for a data modeling tool for NoSQL databases. I searched the web, and couldn’t find one that would satisfy my needs. I tried really hard to use existing tools! After all, all I wanted was to give my credit card number and download the right tool to do my job. The last thing on my mind was to embark on a new entrepreneurial adventure...

There is a short explanation for why I was not satisfied with the existing tools, and there's also a long answer below. The short answer is simple and holds (almost) in this one picture:


Reverse-engineering of Yelp Challenge dataset using traditional ERD tool

Periodically, Yelp awards prize money for interesting insights out of the analysis of their sample dataset. In the past, it has led to hundreds of academic papers. As the data is provided in JSON format, any NoSQL document database is a good candidate to store the data, and several blogs explain how to use MongoDB for the analysis. Using a data modeling tool to discover the data structure should be a great first step...

Only problem is: the Yelp dataset is made of just five data collections in MongoDB, yet the traditional ER tools finish their reverse-engineering process by showing these stats:

If there are just five collections in the database, you would expect only five entities in the Entity Relationship diagram, one for each of the collections in MongoDB, right? Something more like this:

Reverse-engineering of Yelp Challenge dataset using Hackolade

Besides the more orderly aspect, this second diagram is also a lot easier to understand. It is a closer representation of the physical storage, displaying nested JSON sub-objects as indentations rather than as separate boxes (entities) in the ERD -- in a manner similar to what you would find in a JSON document.

And if you're developing or maintaining your own model, it is a lot easier to deal with the entire JSON structure in just one view, including all nested objects (arrays and sub-documents), than if you need to open a new entity for each nested object (like in the following picture representing the structure of just one of the Yelp documents...)

Yelp Business collection represented by a traditional ER tool

No wonder some developers of NoSQL applications don't want to hear about data modeling, when the diagram that is supposed to help understand and structure things, is actually more confusing, and doesn't look anywhere close to the physical documents being committed to the database! A more natural view would be this one:

Yelp Businesses collection represented by Hackolade

To manage object metadata, Hackolade provides a second view -- a hierarchical tree view -- similar to the familiar XSD tree:

Hierarchical tree view in Hackolade

One of the great benefits of this tree view is the handling of the polymorphic nature of JSON, letting the user define choices between different structures.

The reason for the difficulty with traditional ER tools in representing JSON nested structures is actually simple and logical: they were originally designed for relational databases, and their own persistence data model (how they store objects and metadata) is itself relational.

As a user, if you use a traditional ER diagramming tool for the data modeling of relational databases and apply it to a NoSQL database (MongoDB in this case), you are constrained by the original purpose and underlying data model of the tool itself. And while it is quite creative of the vendor to make its tool "compatible" with MongoDB, it is clearly an afterthought, and it ends up not being very useful.

Just like NoSQL databases are built differently than relational databases, data modeling tools for NoSQL databases need to be engineered from the ground up to leverage the power and flexibility of JSON, with its ability to support nested semi-structured polymorphic data. And to do that, the modeling tool cannot store its own data in flat relational tables!

Hackolade stores data model metadata in JSON (actually in JSON Schema, the JSON equivalent of XSD for XML), making it easy to represent JSON structures in a hierarchical manner that is close to the physical storage of the data. And the user interface was built according to the specific nature and power of JSON. This is why Hackolade is the pioneer for the data modeling of NoSQL and multi-model databases!

Longer answer

The challenges in modeling JSON with tools made for flat database structures are as follows:

- similarity between JSON and its GUI representation:
    - structure
    - sequence
    - indentation
- clarity of complex models
- meaning of relationship lines
- representation of polymorphism

Structure

Contrary to conceptual modeling, JSON is a representation of the physical storage in the database as implemented, or intended to be implemented, in a NoSQL database (or multi-model DBMS.) Entity Relationship modeling theory has worked wonders for the normalization of relational databases, in its ability to represent in diagrams: conceptual, logical, and physical models. But ER theory has to be stretched for the purpose of NoSQL because of the power and flexibility provided by embedding, denormalization, and polymorphism.

If the ERD is going to represent conceptual entities, then each embedded object in a JSON document could (maybe sometimes) be represented by 1 box in the ERD. However, we’re dealing here with physical storage, and therefore in that case, it is preferable to have:

1 JSON document = 1 entity = 1 box in the ERD

That way, the contextual unity of the document can be preserved.
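To get a feel for the difference, here is a small sketch that counts how many boxes a diagram would need if every nested sub-document, and every array of sub-documents, were split out as its own entity. The counting rule is a simplification of the behavior described in this post, and the sample document is hypothetical (loosely shaped like a business listing); real tools may count differently.

```python
def count_traditional_boxes(doc):
    """Count the boxes drawn when every sub-document and every array of
    sub-documents is split out as its own entity (the root counts as one)."""
    boxes = 1
    for value in doc.values():
        if isinstance(value, dict):
            boxes += count_traditional_boxes(value)
        elif isinstance(value, list):
            sub_docs = [item for item in value if isinstance(item, dict)]
            if sub_docs:
                # One box for the array's item structure, however many items it holds.
                merged = {}
                for item in sub_docs:
                    merged.update(item)
                boxes += count_traditional_boxes(merged)
    return boxes


# Hypothetical document with a few levels of nesting.
business = {
    "business_id": "b-001",
    "name": "Corner Cafe",
    "hours": {"monday": {"open": "08:00", "close": "17:00"}},
    "attributes": {"wifi": "free", "parking": {"garage": False, "street": True}},
    "categories": ["Cafe", "Breakfast"],
    "reviews": [{"stars": 5, "user": {"id": "u-9"}}],
}

print("boxes when nested objects are split out:", count_traditional_boxes(business))
print("boxes when 1 document = 1 entity:", 1)
```

Even this small document needs seven boxes under the splitting rule, against a single box when the document is kept whole.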

Sequence

Preserving in the ERD the sequence of the physical document helps legibility and understanding.

As a consequence of splitting embedded objects from the main document, the ERD drawn with traditional tools makes things harder for the observer by not displaying the same sequence of fields in the diagram as in the physical JSON.

On the other hand, Hackolade's views (ERD and the hierarchical tree) both respect the physical sequence of the document:

Indentation

Indentation of embedded objects in JSON (arrays and sub-documents) helps legibility. As another consequence of splitting embedded objects from the main document, the ERD drawn with traditional tools does not preserve the indentation of JSON that would make it easy to read.

Clarity of complex models

Take a look at an example of the structure of a real document from a real customer (with some field names obfuscated on purpose...)

Complex JSON document

The ER rendering of such a document by a traditional ER tool would result in so many boxes that it becomes nearly impossible to work with. And that’s with a single document. Imagine what an ERD would look like for an application comprised of dozens of such collections.

Meaning of relationship lines

As yet another consequence of splitting embedded objects from the main document, the ERD drawn with traditional tools displays relationship lines of different nature:

Relationships resulting from the embedding of objects
Traditional foreign key relationships [even though we are dealing with so-called ‘non-relational’ DBs, there are often implicit relationships in NoSQL data]

This makes for a confusing picture as true foreign key relationships are hard to distinguish from embedding relationships (even though there can be dashed and solid lines.)

All this does not leave much room for a useful third type of relationship: those resulting from denormalization (i.e., redundancy of data, which is useful in NoSQL to improve the read performance of the database.)

Polymorphism

One of the great features of JSON as applicable to NoSQL and Big Data, is the ability to deal with evolving and flexible schemas, both at the level of the general document structure, and at the level of the type of a single field. This is known as "schema combination", and can be represented in JSON Schema with the use of subschemas and the keywords: anyOf, allOf, oneOf, not.

Let’s take the example of a field that evolves from being just a string type to becoming a sub-document, or with the co-existence of both field types. Traditional ER tools have a hard time dealing graphically with subschemas (let's be frank, they're simply unable to deal with it...), whereas with Hackolade:

Polymorphism in 2 Hackolade views
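In JSON Schema terms, the string-or-sub-document field from this example can be sketched with `oneOf`. The field layout is hypothetical, and `matches` below is a deliberately tiny stand-in that understands only `type`, `required`, `properties`, and `oneOf`; it is there to make the example self-contained, not to replace a real JSON Schema validator.

```python
# JSON Schema fragment for a field that may be a string or a sub-document.
address_schema = {
    "oneOf": [
        {"type": "string"},
        {
            "type": "object",
            "required": ["street", "city"],
            "properties": {"street": {"type": "string"}, "city": {"type": "string"}},
        },
    ]
}

# Only the two JSON types needed by this example are mapped.
PYTHON_TYPES = {"string": str, "object": dict}


def matches(value, schema):
    """Tiny stand-in validator covering only type, required, properties, oneOf."""
    if "oneOf" in schema:
        # oneOf: the value must satisfy exactly one of the subschemas.
        return sum(matches(value, sub) for sub in schema["oneOf"]) == 1
    expected = schema.get("type")
    if expected and not isinstance(value, PYTHON_TYPES[expected]):
        return False
    if isinstance(value, dict):
        if any(name not in value for name in schema.get("required", [])):
            return False
        for name, sub in schema.get("properties", {}).items():
            if name in value and not matches(value[name], sub):
                return False
    return True


# Old-style and new-style values can co-exist; anything else is rejected.
print(matches("12 Main St, Ghent", address_schema))
print(matches({"street": "12 Main St", "city": "Ghent"}, address_schema))
print(matches(42, address_schema))
```

The first two calls return `True` and the last returns `False`: both shapes of the field are legitimate, while an accidental third shape is caught.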

Conclusion

Besides the above demonstration, Hackolade has many other advantages. For example, reverse-engineering is done through a truly native access to the NoSQL database, not via a "native" 3rd-party connector (is that not a contradiction in terms?...) Hackolade provides useful developer aids such as the ability to generate sample documents and forward-engineering scripts specific to each supported NoSQL database vendor. And Hackolade supports other NoSQL vendors than just MongoDB: DynamoDB, Couchbase, Cosmos DB, Elasticsearch, Apache HBase, Cassandra, Google Firebase and Firestore, with many others coming up.

Data is a corporate asset, and insight into the data is even more strategic. Sometimes overlooked as a best practice, data modeling is critical to understanding data, its interrelationships, and its rules.

Hackolade lets you harness the power and flexibility of dynamic schemas. It provides a map for applications, a way to engage the conversation between project stakeholders around a picture. Proper data modeling collaboration between analysts, architects, designers, developers, and DBAs will increase data agility, help get to market faster, increase quality, lower costs, and lower risks.

Friday, March 31, 2017

Even non-relational databases have relationships

Hackolade CEO Pascal Desmarets Speaking at EDW17 and NoSQL Now! in Atlanta;

Company is a Sponsor and Will Be Exhibiting

Company Will Demo its Data Modeling Tools for Various NoSQL Databases

Isn't it ironic that a technology that bears the label of “schema-less” is also known for the fact that schema design is one of its toughest challenges? Aside from the well-known scalability and cost benefits of NoSQL databases, schema flexibility frees up users from many of the constraints of normalization rules in relational databases. The JSON-based dynamic-schema nature of NoSQL is a fantastic opportunity for application developers: ability to start storing and accessing data with minimal effort and setup, flexibility, fast and easy evolution. But while flexibility brings power, it also brings dangers for designers and developers new to NoSQL or less experienced.

This is why the NoSQL database vendors counter their marketing department’s simplicity message by devoting countless pages, blogs, and videos to the subject of schema design (e.g., MongoDB, DynamoDB, Couchbase, Cassandra, etc…)

To make matters worse, each NoSQL document database adopts a different storage strategy, even if pretty much all of them use JSON. For example, MongoDB assumes the definition of one “collection” for each entity, while Couchbase encourages mixing different entities in as few “buckets” as possible, ideally just one. Each vendor also prescribes a different approach for the definition and usage of the primary key (e.g., DynamoDB’s hash and range keys vs. MongoDB’s system-generated ObjectIds vs. Couchbase’s user-defined IDs).
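A rough sketch of how the same data might be shaped for each of the three stores follows. Everything here is illustrative: the entity and field names are hypothetical, the `type` discriminator and the `entity::id` key pattern are one common convention rather than a vendor requirement, and a random hex string stands in for a real ObjectId.

```python
import secrets

# One hypothetical customer entity, shaped for three different stores.
customer = {"name": "Acme Corp", "country": "BE"}

# MongoDB style: one collection per entity; `_id` is system-generated when omitted.
# A random 24-hex-character string stands in for a real ObjectId here.
mongodb_customers = [{"_id": secrets.token_hex(12), **customer}]

# Couchbase style: different entity types share one bucket, so each document
# carries a type discriminator and the key is defined by the application.
couchbase_bucket = {
    "customer::acme": {"type": "customer", **customer},
    "order::1001": {"type": "order", "customer_key": "customer::acme", "total": 49.0},
}

# DynamoDB style: a composite primary key made of a hash (partition) attribute
# and a range (sort) attribute.
dynamodb_key_schema = [
    {"AttributeName": "customer_id", "KeyType": "HASH"},
    {"AttributeName": "order_date", "KeyType": "RANGE"},
]
dynamodb_items = [
    {"customer_id": "acme", "order_date": "2017-03-31", "total": 49.0},
    {"customer_id": "acme", "order_date": "2017-04-02", "total": 12.5},
]

# In the shared bucket, entity types are told apart by the discriminator field.
customer_keys = [key for key, doc in couchbase_bucket.items() if doc["type"] == "customer"]
print("customer keys in the shared bucket:", customer_keys)

# With a hash + range key, items are unique per (customer_id, order_date) pair.
primary_keys = {(item["customer_id"], item["order_date"]) for item in dynamodb_items}
print("distinct composite keys:", len(primary_keys))
```

The same business entity ends up with three different key designs, which is part of why schema design knowledge does not transfer automatically from one NoSQL store to another.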

All these factors create a steeper learning curve and sometimes an unnecessary barrier to the adoption of NoSQL. A number of negative stories have appeared on the web, but when you read between the lines, failure is always due to a misunderstanding or a lack of experience with the design of the data model. Additional difficulties start appearing with increased complexity of the data and scale.

All this is compounded by the fact that the data structure is tacitly described -- in the application code. And examining the code is not the most productive way to engage in a fruitful dialog between analysts, architects, designers, developers, and DBAs.

This is where data modeling comes into play as a best practice. A database model describes the business. A database model is the blueprint of the application. Such a map helps evaluate design options beforehand, think through the implications of different alternatives, and recognize potential hurdles before committing sizable amounts of development effort. Even more so in an Agile development approach, a database model helps plan ahead, in order to minimize later rework. In the end, the modeling process accelerates development, increases quality of the application, and reduces execution risks.

We’ll be discussing this at much greater length during my presentation at EDW Atlanta titled, you guessed it, “Even non-relational databases have relationships,” and from our booth at the show. Come to the presentation and say hello, stop by our booth and let me know your thoughts.

Entity Relationship Diagram for NoSQL with embedded entities and foreign key relationships