A Hadoop cluster provides two main functions:
* HDFS: the Hadoop Distributed File System
* Data processing: processing data in a distributed fashion.
A cluster is built from multiple machines running in tandem to complete a specific task. Every node executes the same piece of “framework code”. Once one understands how a single node works, the concepts and abstractions built by distributed systems become much easier to follow.
A single node provides four resources:
DISK
RAM
CPU
NETWORK
DISK: A file on a machine is saved on persistent storage, namely an HDD (hard disk drive). A file that looks sequential to a user (for example, when opened from the hard disk with gedit) can in reality be scattered across multiple platters and sectors of the HDD (http://www.cyberciti.biz/tips/understanding-unixlinux-filesystem-superblock.html). One can compare this with a linked list: the link information (metadata) is maintained by the filesystem, and the actual data is stored in physical blocks.
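The idea of metadata pointing at scattered physical blocks can be sketched with a toy model. This is only an illustration, not how ext2/ext3 is implemented: the class `ToyFileSystem`, the 4-byte `BLOCK_SIZE` and the file names are all made up for the example.

```python
BLOCK_SIZE = 4  # bytes per block; unrealistically small, chosen for illustration


class ToyFileSystem:
    """A 'disk' of fixed-size blocks plus metadata mapping files to blocks."""

    def __init__(self, num_blocks):
        self.disk = [None] * num_blocks  # physical blocks; None means free
        self.metadata = {}               # file name -> ordered block numbers

    def _free_blocks(self):
        return [i for i, block in enumerate(self.disk) if block is None]

    def write(self, name, data):
        # Split the data into block-sized chunks and place each chunk in the
        # lowest-numbered free block; the chunks need not end up contiguous.
        chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        free = self._free_blocks()
        if len(chunks) > len(free):
            raise OSError("disk full")
        block_numbers = free[:len(chunks)]
        for number, chunk in zip(block_numbers, chunks):
            self.disk[number] = chunk
        self.metadata[name] = block_numbers

    def delete(self, name):
        for number in self.metadata.pop(name):
            self.disk[number] = None

    def read(self, name):
        # Follow the metadata to reassemble the file in its logical order.
        return b"".join(self.disk[n] for n in self.metadata[name])


fs = ToyFileSystem(num_blocks=8)
fs.write("a.txt", b"AAAAAAAA")      # occupies blocks 0 and 1
fs.write("b.txt", b"BBBB")          # occupies block 2
fs.delete("a.txt")                  # frees blocks 0 and 1
fs.write("c.txt", b"hello world!")  # needs 3 blocks, so it skips over block 2

print(fs.metadata["c.txt"])  # [0, 1, 3]
print(fs.read("c.txt"))      # b'hello world!'
```

The file `c.txt` reads back as one sequential byte string even though its blocks are not adjacent on the toy disk; only the metadata knows the real layout.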
Various Linux commands can be used to find out where a file is physically placed on a disk:
sudo hdparm --fibmap fileName
filefrag -b512 -v fileName
The filesystem is what maintains the logical binding of this scattered data. On Linux, ext2 and ext3 are examples of filesystems that provide an abstraction over the underlying storage. The filesystem therefore needs to maintain metadata recording how a file's blocks are logically linked.
The metadata maps the filesystem's logical ID for a file (its inode number) to the physical storage locations on the HDD.
On Linux, the filesystem metadata can be inspected using:
df -hk : lists the mounted filesystems and the disk devices behind them
dumpe2fs /dev/HDD_name | grep -i superblock
Let's analyse the same concept in terms of the Hadoop distributed file system.
* Instead of a single machine, the file system spans multiple machines.
* The metadata is so large that it requires a separate machine to hold it.
* The machine holding the metadata is known as the NameNode.
* All other nodes run a daemon process known as the DataNode. The NameNode and DataNodes work in a master-slave relationship, with the NameNode as the master.
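The split between metadata and data described in the bullets above can be sketched as follows. This is a simplified model under stated assumptions, not the HDFS implementation or its API: `ToyNameNode`, `ToyDataNode`, `put`, `get`, the round-robin placement and the tiny block size are all hypothetical, and the block replication that real HDFS performs is left out.

```python
from itertools import cycle


class ToyDataNode:
    """Stores actual block contents; knows nothing about files."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes


class ToyNameNode:
    """Stores only metadata: which blocks form a file and where each one lives."""

    def __init__(self, datanodes):
        self.datanodes = datanodes  # node name -> ToyDataNode
        self.block_map = {}         # file name -> [(block id, node name), ...]
        self._round_robin = cycle(sorted(datanodes))

    def pick_datanode(self):
        # Assumed placement policy: plain round-robin over the node names.
        return next(self._round_robin)


def put(namenode, name, data, block_size=4):
    """Split data into blocks, store each on a DataNode, record it in the NameNode."""
    locations = []
    for offset in range(0, len(data), block_size):
        block_id = f"{name}#{offset // block_size}"
        node_name = namenode.pick_datanode()
        namenode.datanodes[node_name].blocks[block_id] = data[offset:offset + block_size]
        locations.append((block_id, node_name))
    namenode.block_map[name] = locations


def get(namenode, name):
    """Ask the NameNode where the blocks are, then fetch them from the DataNodes."""
    return b"".join(
        namenode.datanodes[node_name].blocks[block_id]
        for block_id, node_name in namenode.block_map[name]
    )


cluster = {"dn1": ToyDataNode(), "dn2": ToyDataNode(), "dn3": ToyDataNode()}
namenode = ToyNameNode(cluster)
put(namenode, "log.txt", b"0123456789")

print(namenode.block_map["log.txt"])
# [('log.txt#0', 'dn1'), ('log.txt#1', 'dn2'), ('log.txt#2', 'dn3')]
print(get(namenode, "log.txt"))  # b'0123456789'
```

This mirrors the single-node picture: the NameNode plays the role of the filesystem metadata, and the DataNodes play the role of the physical blocks.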
Processing
The processing power of a machine is determined by the number of CPUs it has and the amount of RAM at the disposal of those CPUs.
Various commands can help inspect these resources:
top
free -h
iostat -x -c 3
mpstat -P ALL
On a single node the data to be processed is on the HDD. A program has to find out where the data is saved by interacting with the filesystem, and then process or transform the data using the CPU.
In a distributed cluster (Hadoop) the data is spread across multiple machines, and so are the CPUs and RAM.
An efficient way of processing the data is to move the processing to where the data is located, which reduces network transfer. With this strategy one can use more CPUs, RAM and HDD bandwidth to process the data, rather than fetching the data to a single machine and processing it there.
Hadoop provides the JobTracker and TaskTracker daemons to coordinate the distributed processing.
The JobTracker and TaskTrackers work in a master-slave relationship. Given a job, the NameNode is consulted for the nodes on which the data is available, and the JobTracker spawns tasks on those nodes (on a best-effort basis), each processing part of the data. Monitoring the individual tasks on individual nodes is taken care of by the TaskTracker.
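The "best effort" placement of processing next to the data can be illustrated with a small greedy sketch. It is not Hadoop's actual scheduler: the function `schedule`, the idea of per-node "free slots" as a plain counter, and the block and node names are assumptions made for the example.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores the block.

    block_locations: block id -> list of node names holding that block
    free_slots:      node name -> number of tasks the node can still accept
    Returns:         block id -> (node name, True if the task is data-local)
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Best effort failed: fall back to any node with capacity,
            # which means the block has to travel over the network.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


block_locations = {"b0": ["dn1"], "b1": ["dn2"], "b2": ["dn1"]}
free_slots = {"dn1": 1, "dn2": 1, "dn3": 1}

plan = schedule(block_locations, free_slots)
print(plan)
# {'b0': ('dn1', True), 'b1': ('dn2', True), 'b2': ('dn3', False)}
```

Two of the three tasks run where their data lives; the third is sent to `dn3` because `dn1` has no capacity left, which is the kind of compromise "best effort" implies.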