A distributed file system, similar to GFS or NDFS, written in C++.
Introduction
CGFS is a set of software for storing very large stream-oriented files across a set of commodity computers. Files are replicated across machines for safety, and load is balanced fairly across the machine set.
For more detail, please see the paper below.
System Design
There are three types of machines in the CGFS system:
- Master, which manages the file namespace
- ChunkServers, which actually store blocks of data
- Clients, which actually use the files
Master
In the version 1 design, there is a single Master, which is responsible for storing the entire namespace and filesystem layout.
This basically consists of three data structures that need to be written to disk:
FsImage --> Version, Count, [Filename, NodeLength,[BlockID]NodeLength]Count
FsImage.add --> [Filename, NodeLength, [BlockID]NodeLength]
FsImage.del --> [Filename]
Version: Int32
Count: VLong
Filename: string
NodeLength: Int32
BlockID: string
The FsImage.del and FsImage.add files are merged into FsImage periodically.
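The periodic merge can be pictured with a small in-memory sketch. The names `FileEntry`, `Namespace`, and `mergeLogs` are hypothetical and not taken from the CGFS sources, and the sketch assumes that additions are applied before deletions. The on-disk encoding (Version, VLong Count) is deliberately left out.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of one FsImage record:
// a filename plus its ordered list of BlockIDs.
struct FileEntry {
    std::string filename;
    std::vector<std::string> blockIds;  // NodeLength corresponds to blockIds.size()
};

// Hypothetical in-memory namespace: Filename -> list of BlockIDs.
using Namespace = std::map<std::string, std::vector<std::string>>;

// Fold the pending additions and deletions into the image.
// Assumption: additions are applied first, then deletions.
void mergeLogs(Namespace& image,
               const std::vector<FileEntry>& added,
               const std::vector<std::string>& deleted) {
    for (const FileEntry& entry : added) {
        image[entry.filename] = entry.blockIds;  // insert or overwrite
    }
    for (const std::string& name : deleted) {
        image.erase(name);  // no effect if the file is not present
    }
}
```

After a merge, the add and del logs could be truncated and the merged image written back to disk in the FsImage layout described above.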
The Master is a single point of failure, but it should not become a load bottleneck. It does very little actual work, mainly serving to guide the large team of ChunkServers.
The Master periodically communicates with each ChunkServer through HeartBeat messages to give it instructions and collect its state.
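One way the Master might keep track of HeartBeat messages is sketched below. This is an illustration only: `HeartBeatTable`, `ChunkServerState`, the reported `blockCount` field, and the timeout rule for treating a ChunkServer as unavailable are all assumptions, not part of the CGFS design described here.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical state collected from a ChunkServer in a HeartBeat.
struct ChunkServerState {
    std::size_t blockCount;
    Clock::time_point lastSeen;
};

class HeartBeatTable {
public:
    // Assumption: a server is considered unavailable once no HeartBeat
    // has arrived for longer than `timeout`.
    explicit HeartBeatTable(std::chrono::seconds timeout) : timeout_(timeout) {}

    // Record a HeartBeat received from `serverId` at time `now`.
    void onHeartBeat(const std::string& serverId, std::size_t blockCount,
                     Clock::time_point now) {
        servers_[serverId] = ChunkServerState{blockCount, now};
    }

    // List the servers whose last HeartBeat is older than the timeout.
    std::vector<std::string> unavailable(Clock::time_point now) const {
        std::vector<std::string> result;
        for (const auto& kv : servers_) {
            if (now - kv.second.lastSeen > timeout_) {
                result.push_back(kv.first);
            }
        }
        return result;
    }

private:
    std::chrono::seconds timeout_;
    std::map<std::string, ChunkServerState> servers_;
};
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the sketch easy to test.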
ChunkServers
There can be an arbitrary number of ChunkServers, all of which are configured to communicate with the single Master.
ChunkServers are responsible for actually storing data. A datastore consists of a table of the following tuples:
BlockID_X --> [array of bytes, no longer than BLOCK_SIZE]
BlockID_Y --> [array of bytes, no longer than BLOCK_SIZE]
etc.
This is the only structure that the ChunkServer needs to keep on disk. It can reconstruct everything else at runtime.
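The block table can be sketched as a map from BlockID to bytes. This is a minimal in-memory illustration, not the on-disk store: the `BlockStore` class and its methods are hypothetical, and the value given to `BLOCK_SIZE` is a placeholder because the actual size is not stated here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Placeholder value only: the real BLOCK_SIZE is not specified in this document.
constexpr std::size_t BLOCK_SIZE = 64 * 1024 * 1024;

class BlockStore {
public:
    explicit BlockStore(std::size_t maxBlockSize = BLOCK_SIZE)
        : maxBlockSize_(maxBlockSize) {}

    // Store a block. Returns false and stores nothing if the data
    // is longer than the configured maximum block size.
    bool put(const std::string& blockId, std::vector<unsigned char> data) {
        if (data.size() > maxBlockSize_) {
            return false;
        }
        blocks_[blockId] = std::move(data);
        return true;
    }

    // Look up a block. Returns nullptr if the BlockID is unknown.
    const std::vector<unsigned char>* get(const std::string& blockId) const {
        auto it = blocks_.find(blockId);
        if (it == blocks_.end()) {
            return nullptr;
        }
        return &it->second;
    }

    // List every BlockID held locally, e.g. for reporting to the Master.
    std::vector<std::string> blockIds() const {
        std::vector<std::string> ids;
        for (const auto& kv : blocks_) {
            ids.push_back(kv.first);
        }
        return ids;
    }

private:
    std::size_t maxBlockSize_;
    std::map<std::string, std::vector<unsigned char>> blocks_;
};
```

A real store would persist each block to the local disk; the map here only mirrors the shape of the table.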
Upon startup, all ChunkServers contact the central Master. Each one reports to the Master the list of blocks it holds on its local disk. The Master thus builds a picture of where to find each copy of every block in the system. This picture will always be a little bit out of date, as ChunkServers might become unavailable at any time.
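The picture the Master builds from these reports can be sketched as a map from BlockID to the set of ChunkServers holding a copy. The `BlockLocations` class and its method names are hypothetical, and the sketch assumes each report fully replaces what was previously known about that server.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

class BlockLocations {
public:
    // Apply a block report from one ChunkServer.
    // Assumption: a report replaces everything previously known about that server.
    void onBlockReport(const std::string& serverId,
                       const std::vector<std::string>& blockIds) {
        removeServer(serverId);
        for (const std::string& id : blockIds) {
            locations_[id].insert(serverId);
        }
    }

    // Forget a server, e.g. after it becomes unavailable.
    void removeServer(const std::string& serverId) {
        for (auto it = locations_.begin(); it != locations_.end();) {
            it->second.erase(serverId);
            if (it->second.empty()) {
                it = locations_.erase(it);  // no known copies remain
            } else {
                ++it;
            }
        }
    }

    // Return the servers believed to hold a copy of the block (possibly none).
    std::set<std::string> locate(const std::string& blockId) const {
        auto it = locations_.find(blockId);
        if (it == locations_.end()) {
            return std::set<std::string>();
        }
        return it->second;
    }

private:
    std::map<std::string, std::set<std::string>> locations_;
};
```

Because the map is rebuilt entirely from reports, the Master would not need to persist it, which fits the note that the picture is always slightly out of date.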
Clients
CGFS client code, linked into each application, implements the file system API and communicates with the Master and ChunkServers to read or write data on behalf of the application. Clients interact with the Master for metadata operations, but all data-bearing communication goes directly to the ChunkServers.
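The split between metadata and data traffic on the read path can be illustrated roughly as follows. Everything named here (`BlockLocation`, `LookupFn`, `FetchFn`, `readFile`) is hypothetical and stands in for whatever RPC interface CGFS actually uses. Trying replicas in order until one succeeds is also an assumption.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical metadata returned by the Master: one block of a file
// and the ChunkServers believed to hold a copy of it.
struct BlockLocation {
    std::string blockId;
    std::vector<std::string> servers;
};

// Hypothetical stand-ins for the RPCs to the Master and to a ChunkServer.
using LookupFn =
    std::function<std::vector<BlockLocation>(const std::string& filename)>;
using FetchFn = std::function<bool(const std::string& server,
                                   const std::string& blockId,
                                   std::string& out)>;

// Read a whole file: one metadata call to the Master, then the block data
// is fetched directly from ChunkServers, trying each replica in turn.
bool readFile(const std::string& filename, const LookupFn& askMaster,
              const FetchFn& fetchBlock, std::string& contents) {
    contents.clear();
    for (const BlockLocation& loc : askMaster(filename)) {
        bool fetched = false;
        std::string data;
        for (const std::string& server : loc.servers) {
            if (fetchBlock(server, loc.blockId, data)) {
                fetched = true;
                break;
            }
        }
        if (!fetched) {
            return false;  // no replica of this block could be read
        }
        contents += data;
    }
    return true;
}
```

Note that no file data passes through the Master in this sketch; it only answers the lookup.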
TODO
Version 1
In this version, we will implement the basic functions of a distributed file system: files can be written only once, and there are no user permissions.
Version 2
Implement all features described in the GFS paper. Support other OS platforms, such as Windows, on the ChunkServers. Improve performance and reliability.
Last modified by foxhawk. 15/11/2005