How do big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency?
This problem is solved by Big Data technologies.
Big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings. Like traditional analytics, it can also support internal business decisions. The technologies and concepts behind big data allow organizations to achieve a variety of objectives, but most of the organizations we interviewed were focused on one or two. The chosen objectives have implications not only for the outcome and financial benefits from big data, but also for the process — who leads the initiative, where it fits within the organization, and how to manage the project.
One widely cited comparison puts the cost of storing a terabyte of data for a year at $37,000 for a traditional relational database, $5,000 for a database appliance, and only $2,000 for a Hadoop cluster. Of course, these figures are not directly comparable, in that the more traditional technologies may be somewhat more reliable and more easily managed. Data security approaches, for example, are not yet fully developed in the Hadoop cluster environment.
A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premises distributed storage systems like Cloudian HyperStore.
Distributed storage systems can store several types of data:
- Files — a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
- Block storage — a block storage system splits data into fixed-size chunks called blocks and stores them in volumes. This is an alternative to a file-based structure that can provide higher performance. A common distributed block storage architecture is the Storage Area Network (SAN).
- Objects — a distributed object storage system wraps data into objects, each identified by a unique ID or hash (see the sketch after this list).
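To make the object model concrete, here is a minimal sketch of writing and reading an object through Amazon S3 (mentioned above) using the boto3 client. The bucket name, key, and payload are hypothetical placeholders, and credentials are assumed to come from the standard AWS configuration.

```python
# Minimal sketch: storing and retrieving an object in Amazon S3 via boto3.
# The bucket name, key, and payload are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Write: the service distributes and replicates the object across storage
# nodes behind the scenes; callers only see the flat bucket/key namespace.
s3.put_object(
    Bucket="example-bucket",
    Key="logs/2020-01-01/pageviews.json",
    Body=b'{"page": "/home", "views": 1024}',
)

# Read it back by the same unique key.
response = s3.get_object(Bucket="example-bucket", Key="logs/2020-01-01/pageviews.json")
print(response["Body"].read())
```

Because every object is addressed by its key rather than by a location on a specific machine, the cluster can grow without clients changing how they read or write.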
Distributed storage systems have several advantages:
- Scalability — the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
- Redundancy — distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes (see the placement sketch after this list).
- Cost — distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
- Performance — distributed storage can offer better performance than a single server in some scenarios; for example, it can store data closer to its consumers or enable massively parallel access to large files.
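To make the scalability and redundancy points concrete, here is a minimal sketch of consistent hashing with a replication factor — one common way a distributed store decides which nodes hold the copies of a given object. This is an illustrative technique with assumed node names, not the exact placement algorithm of any product named above.

```python
# Illustrative sketch: consistent-hashing placement with a replication factor.
import bisect
import hashlib

def _hash(value: str) -> int:
    # Stable hash so placement is deterministic across processes.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=3, vnodes=100):
        self.replicas = replicas
        # Each physical node gets many virtual positions on the ring,
        # which smooths the key distribution across nodes.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def nodes_for(self, key: str):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect(self.ring, (_hash(key),))
        owners = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.replicas:
                break
        return owners

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("user/42/avatar.png"))  # three distinct replica holders
```

Adding a node remaps only the keys adjacent to its positions on the ring, which is what makes scaling out by adding storage nodes cheap.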
With tens of millions of users and more than a billion page views every day, Facebook ends up accumulating massive amounts of data. One of the challenges we have faced since the early days is developing a scalable way of storing and processing all these bytes, since using this historical data is a very big part of how we can improve the user experience on Facebook. This can only be done by empowering our engineers and analysts with easy-to-use tools to mine and manipulate large data sets.
About a year back we began playing around with an open source project called Hadoop. Hadoop provides a framework for large scale parallel processing using a distributed file system and the map-reduce programming paradigm. Our hesitant first steps of importing some interesting data sets into a relatively small Hadoop cluster were quickly rewarded as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
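As a flavor of the map-reduce model described above, here is a toy word count in that style. It runs in a single Python process purely for illustration; a real Hadoop job would execute the same map and reduce logic in parallel across the cluster's nodes.

```python
# Toy word count in the map-reduce style (single-process illustration).
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Reduce phase: the framework sorts pairs by key between the phases;
    # here we sort and group ourselves before summing each word's counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
for word, count in reducer(mapper(lines)):
    print(word, count)
```

The framework's real contribution is everything between the two phases — partitioning, sorting, and shuffling the pairs across machines so that each reducer sees one key's group at a time.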
We have come a long way from those initial days. Facebook has multiple Hadoop clusters deployed now — with the biggest having about 2,500 CPU cores and 1 petabyte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated — from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality. An amazingly large fraction of our engineers have run Hadoop jobs at some point (which is also a great testament to the quality of technical talent here at Facebook).
The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programs in the language of their choice. Second, we have embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop’s file system is published as Tables. Developers can explore the schemas and data of these tables much like they would do with a good old database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset. Operations on datasets can be written as map and reduce scripts or using standard query operators (like joins and group-bys) or as a mix of the two. Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.
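As an illustration of how a standard query operator can be expressed as map and reduce scripts, here is a sketch of a reduce-side join followed by a group-by count. The table and column names are hypothetical, and the query in the comment is only a paraphrase of the idea, not Facebook's actual schema.

```python
# Sketch: a reduce-side join plus group-by in the map-reduce style.
# Rough query equivalent (hypothetical tables):
#   SELECT u.country, COUNT(*) FROM page_views v
#   JOIN users u ON v.user_id = u.id GROUP BY u.country
from collections import defaultdict

users = [(1, "US"), (2, "BR")]                           # (id, country)
page_views = [(1, "/home"), (1, "/faq"), (2, "/home")]   # (user_id, url)

def mapper(users, page_views):
    # Tag each record with its source table, keyed by the join key, so
    # matching rows from both tables meet in the same reduce group.
    for uid, country in users:
        yield uid, ("user", country)
    for uid, url in page_views:
        yield uid, ("view", url)

def reducer(pairs):
    # Group records by user id (the framework would sort/shuffle for us),
    # then join within each group and count views per country.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    counts = defaultdict(int)
    for records in groups.values():
        countries = [v for tag, v in records if tag == "user"]
        views = [v for tag, v in records if tag == "view"]
        for country in countries:
            counts[country] += len(views)
    return dict(counts)

print(reducer(mapper(users, page_views)))  # {'US': 2, 'BR': 1}
```

Writing the same operation as a short SQL query instead of the two scripts above is exactly the convenience the Hive layer provides.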
At Facebook, it is incredibly important that we use the information generated by and from our users to make decisions about improvements to the product. Hadoop has enabled us to make better use of the data at our disposal.