A Small Overview of Big Data

Last updated: 26 Dec 2014

"Basically what you can do with information is transmit it through space, and we call that communication. You can transmit it through time; we call that storage. Or you can transform it, manipulate it, change the meaning of it, and we call that computation." Martin Hilbert, PhD

Introduction

Over the last decades, the sheer amount of digital data generated and stored in the world has grown by orders of magnitude. One estimation holds that the amount of digital data stored worldwide has grown at an average rate of 23% every year from 1986 to 2007. One would expect this rate to have gathered even more momentum since then due, among other factors, to the massive decrease in the cost of storage hardware. One particular source citing commercial prices for hardware claims that the cost of storing one Megabyte in hard disks has decreased from US$ 90 to US$ 0,00287 to in the same period (1986-2007) and has picked up speed since then.

The cost of storing one Megabyte in hard disks has decreased by a factor of 31,000 in roughly 20 years.

One example of the sheer scale at which data is kept nowadays is the amount of data held by Facebook, whose data warehousing facility stored upwards of 300 Petabytes (300 million Gigabytes) worth of data as of April 2014. Such growth appears to be nonlinear and is therefore more difficult for humans to comprehend and reason about since most day-to-day interactions and phenomena appear (at a human scale) to be linear.

After trying, at first, to address the new amounts of data with traditional technologies (RDBMS systems for the most part), decision makers and engineers soon found out that a paradigm shift was inevitable.

"Until around 2005, performance concerns were resolved by purchasing faster processors. In time, the ability to increase processing speed was no long an option. As chip density increased, heat could no longer dissipate fast enough without chip overheating. This phenomenon, known as the power wall, forced system designers to shift their focus from increasing speed on a single chip to using more processors working together." Dan McReary

A tipping point seems to be reached when it becomes too cumbersome or downright impossible to store all data for a single application or business objective in a single disk or database, even with current top of the line hardware. It becomes evident that the physical limits on how much more powerful you can make a single machine just cannot be ignored anymore.

In most scenarios, it is (even when scaling up is still an option) much more cost-effective to scale out (i.e. linking together many low-cost “commodity” computers and forming a distributed cluster) than to scale up (i.e. adding more resources, such as processing power and memory capacity, to a single node).

In this context, Big Data can be thought of as an all-encompassing term given to techniques used in situations where traditional approaches to handling data (albeit scaled up) are not effective anymore in dealing with the large amounts being produced by today’s industry and academia. Moreover, these solutions tend to favour distributed architectures that can scale more easily as more nodes are attached. (Drawing from "Undefined by Data: a Survey of Big Data Definitions" )

Big Data is an all-encompassing term given to techniques used in situations where traditional approaches to handling data are not effective anymore.

These new developments are causing major changes in the way we collect, handle and store digital data. This is important because, in the near future, all but the smallest organizations dealing with digital data will fit the criteria for Big Data.

Every business or startup in the digital world has the potential to go global in a short span of time. This is possible because there are no physical boundaries for internet services – like there are in the physical world. This means that the economic rewards for companies that can harness their data for productive ends (using Big Data technologies) can be very large indeed and, similarly, companies that do not do so will be at a crippling disadvantage.

NoSQL

The term NoSQL appears frequently whenever one starts looking beyond traditional forms of accessing data. Even though the term itself may be difficult to define (much like Big Data) and slightly misleading (the SQL language itself is hardly ever the bottleneck when dealing with large datasets), the core idea embedded in this name is very relevant. In most cases, it is used to refer to data storing solutions other than traditional RDBMS systems.

"NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability and agility." Dan McReary

In this document, the term Big Data will be used to refer to any technologies in the context of dealing with large dataset and the term NoSQL for techniques for storing and searching only.

Current Solution Landscape

Aggregate- vs Relation-based Models

The distinction between aggregate-based and relation-based is useful for understanding currently existing solutions and the differences between them.

The relational model is the one most of us is already familiar with because it is the one used in traditional RDBMS systems. Data is often normalized to avoid unnecessary replication and relationships (via constructs such as joins) must be declared at querying time.

Document-oriented Storage

The document-oriented paradigm is an example of the aggregate model. Documents are generally stored in JSON format and, in most cases, have no predefined schema. In general, documents having any structure (even with different attributes) can be added to the store.

MongoDB is a document-oriented database. It supports horizontal scaling (scaling out) with replicas and shards and allows users to perform transformations using MapReduce as a means to do aggregations not otherwise supported. Queries are performed using a slightly modified version of JSON.

It is currently (as of November 2014) the most popular NoSQL solution in use throughout the industry.

Lucene is a popular search server primarily intended for full-text search. There at mainly two fully-fledged products that use Lucene and build upon it. One is Solr and the other is Elasticsearch. Both store documents as text-based JSON objects.

Elasticsearch is a newer product and was designed from the ground up to embrace the distributed paradigm. New features are being added currently and there is at least one major player using Elasticsearch not just as a search index but also as their primary data store.

Key-value Storage

Also a type of aggregate-based data store, key-value databases save data using a method similar to that of associative arrays (also called dictionaries or maps in some programming languages). This means that every piece of data has a unique key which is used for indexing and fast lookup.

The distinction between key-value stores and document databases is sometimes blurry so it is not uncommon to see other products suchs as MongoDB also listed as a key-value store.

Two examples of key-value stores are Redis and Riak.

Column-family storage

Column-family storage is a type of data store that is useful for very large amounts of data, which do not change very often, i.e. mostly read-only.

They originated in the new famous Google BigTable paper, which outlines the data store infrastructure used by some of Google’s most data-intensive projects, such as Google Earth and Google Finance.

Some examples of column-family storage are Cassandra and HBase.

Graph Databases

Graph databases are a little different from the examples given so far insofar as they do not address the inability of traditional databases to deal with large datasets. They are designed to deal with datasets where the connections between nodes are more important than the nodes (and the data contained therein) themselves. In these cases, traditional database solutions don’t scale well either, because the way they deal with relationships (i.e. using joins and foreign keys) is very costly and makes response times prohibitively long.

Some examples of graph databases are Neo4J and Titan.

Other Relevant Technologies

These are some examples of areas that, although not directly related to storage and search, are nonetheless relevant to whoever needs to handle large amounts of data and extract useful information from them. Examples of widely used tools are also given.

Data Visualization

When data size surpasses our ability to reason about them, it can be useful to show them in alternative ways. The field of data visualization is itself fast moving and full of interesting ideas to display complex data in a way that is understandable by humans. The main contender here is D3.js, a comprehensive Javascript library containing useful primitives that can be built upon for original data visualizations on a web browser.

Virtualization

All the techniques shown here deal with data but are ultimately used by humans. Virtualization techniques help engineers and scientists see a large, sometimes heterogeneous mass of hardware and software as one logical unity, thereby helping them reason and make decisions. Some vendors provide solutions for virtualizing a whole data centre.

Flow of Information in Networks

Before data is available on a place where it can usefully be analysed and/or searched, it first needs to get there. The process of moving large amounts of data around a distributed infrastructure (which is prone to failures, high latency and other hurdles) sometimes cannot be satisfyingly accomplished with traditional tools such as rsync. New technologies like Apache Kafka and Apache Flume are some examples of tools being used for these purposes.

Batch Processing and Storage with the Hadoop Stack

The Hadoop Stack is a framework for writing distributed processing jobs. Its main composing modules are the MapReduce programming model and the HDFS file system, on which data is stored. Hadoop is used to turn large, monolithic tasks (the type of which would be previously be run on a single, powerful computer) into smaller ones, called jobs, each of which is sent to a separate node and executed in parallel.

As of December 2012, it was estimated that half of the Fortune 50 companies (the 50 largest companies in the US) use Hadoop.

Opportunities for Academia and the Industry

Reactive Systems

Reactive systems, as defined by the Reactive Manifesto, are systems that display properties that make them ideally suited for being used in distributed systems. Reactive systems favour asynchronous communication and fault-tolerance, properties that help it work better under unreliable conditions such as those found in heterogeneous computer clusters made of so-called commodity components (low-cost hardware).

Functional Programming Paradigm

The functional programming paradigm naturally emphases ways of processing data and performing computations which eliminate some difficulties when designing and implementing distributed systems.

By eliminating (or at least reducing) side effects, systems designed using functional programming languages are easier to reason about and do not depend upon the order they run, reducing the need for locks and other explicit concurrency controls, which are often a source of problems.

Data Mining and Machine Learning

One of the key uses for data (not only Big Data) lies in identifying existing patterns and predicting future behaviour. Large datasets may introduce new challenges that must be addressed in order to extract useful information from them.

Enforcing Constraints in Distributed Data Stores

The unique trade-offs posed by sets of properties such as the CAP Theorem or the ACID acronym pose difficult challenges when viewed from the viewpoint of Big Data. While it may not be possible to provide the same guarantees as those present in traditional RDBMS systems, it may be possible to design systems such that at least some desirable properties are present.

Energy Efficiency for data Centres

Data centres are ultimately large buildings that keep various amounts of hardware running at high frequencies. Processors and other pieces of hardware generate heat and consume electricity and require cooling in order to function well. Designing systems so as to keep energy usage low (for example, by using efficient algorithms) include disciplines such as Data Centre design, Algorithmic efficiency, Resource allocation and Power management.

Domain-specific Systems Leveraging Diverse Technologies

Since it may be hard for end-users to develop their own solutions based upon the previously mentioned technologies and products, there may be room for companies developing products that leverage different technologies and connect them together in a simple, coherent package, optimized for distinct domains, such as financial analysis, genomic sequencing and log management, just to name a few.

References

In addition to the websites linked to in the text, I have drawn considerably from books such as M. Fowler's NoSQL Distilled and also Dan McReary's Making Sense of NoSQL

Felipe 26 Dec 2014 26 Dec 2014 big data