ZFS


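ZFS
Developer: Sun Microsystems
Full name: ZFS
Introduced: November 2005 (OpenSolaris)
Partition identifier:
Structures
Directory contents: Extensible hash table
File allocation:
Bad blocks:
Limits
Maximum file size: 16 exabytes
Maximum number of files: 2^48
Maximum filename length:
Maximum volume size: 16 exabytes
Allowed characters in filenames:
Features
Dates recorded:
Date range:
Forks: Yes (called Extended Attributes)
Attributes: POSIX
File system permissions: POSIX
Transparent compression:
Transparent encryption:
Supported operating systems: Sun Solaris, Apple Mac OS X 10.5, FreeBSD, Linux (via a userspace file system)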

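ZFS is a file system developed by Sun Microsystems for the Solaris operating system. It offers high storage capacity, integrates the concepts of file system and volume management, uses a novel on-disk structure, and provides lightweight file systems together with convenient storage pool management. ZFS is an open source project licensed under the terms of the Common Development and Distribution License (CDDL).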

History

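ZFS was designed and developed by a team at Sun led by Jeff Bonwick. It was first announced on September 14, 2004.[1] The code was merged into the main trunk of Solaris development on October 31, 2005,[2] and released as part of OpenSolaris build 27 on November 16, 2005. In June 2006, one year after the opening of the OpenSolaris community, Sun integrated ZFS into the 6/06 update of Solaris 10.[3]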

The name ZFS originally stood for "Zettabyte File System", but it is now simply an initialism with no expansion.[4]

Capacity

ZFS is a 128-bit file system, which means it can store 18 billion billion (18.4 × 10^18) times more data than current 64-bit file systems. The limits of ZFS are designed to be so large that they may never be reached in practice. Project leader Bonwick said: "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans."[1]

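Some theoretical limits in ZFS are:

  • 2^48: number of snapshots in any file system (2 × 10^14)
  • 2^48: number of files in any individual file system (2 × 10^14)
  • 16 exabytes (2^64 bytes): maximum size of a file system
  • 16 exabytes (2^64 bytes): maximum size of a single file
  • 16 exabytes (2^64 bytes): maximum size of any attribute
  • 3 × 10^23 bytes (2^78 bytes, about 3 × 10^8 petabytes): maximum size of any zpool
  • 2^56: number of attributes of a file (in practice limited to 2^48 by the number of files in a ZFS file system)
  • 2^56: number of files in a directory (in practice limited to 2^48 by the number of files in a ZFS file system)
  • 2^64: number of devices in any zpool
  • 2^64: number of zpools in a system
  • 2^64: number of file systems in a zpool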

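To put these numbers in perspective: at 1,000 new files created every second, it would take about 9,000 years to reach the ZFS limit on the number of files.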
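The 9,000-year figure can be checked with a few lines of arithmetic. This is only a rough sketch: the year length (365.25 days) and the variable names are assumptions made for illustration.

```python
# Rough check of the file-count limit: 2^48 files at 1,000 new files per second.
FILE_LIMIT = 2 ** 48
FILES_PER_SECOND = 1_000
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: 365.25-day year

years_to_limit = FILE_LIMIT / FILES_PER_SECOND / SECONDS_PER_YEAR
print(f"about {years_to_limit:,.0f} years")
```

The result is a little under 9,000 years, which matches the rounded figure in the text.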

Explaining the connection between filling ZFS and boiling the oceans, Bonwick wrote:

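Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some fundamental limits on the computation rate and information capacity of any physical device. In particular, it has been shown that 1 kilogram of matter confined to 1 liter of space can perform at most 10^51 operations per second on at most 10^31 bits of information [see Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)]. A fully populated 128-bit storage pool would contain 2^128 blocks = 2^137 bytes = 2^140 bits; therefore the minimum mass required to hold the bits would be (2^140 bits) / (10^31 bits/kg) = 136 billion kg.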

To operate at the 10^31 bits/kg limit, however, the entire mass of the computer must be in the form of pure energy. By E=mc², the rest energy of 136 billion kg is 1.2 × 10^28 J. The mass of the oceans is about 1.4 × 10^21 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about 2.4 × 10^6 J/kg * 1.4 × 10^21 kg = 3.4 × 10^27 J. Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.[5]
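The arithmetic in the quotation can be re-derived as follows. All physical inputs come from the quotation itself; the rounded speed of light and the variable names are assumptions of this sketch. Direct division gives a mass of about 1.39 × 10^11 kg, slightly above the quoted 136 billion kg, but the conclusion is unaffected.

```python
# Re-deriving the figures in the quotation (inputs are taken from it).
BITS_IN_FULL_POOL = 2 ** 140        # 2^128 blocks = 2^137 bytes = 2^140 bits
BITS_PER_KG = 1e31                  # information limit per kilogram
SPEED_OF_LIGHT = 3.0e8              # m/s, rounded (assumption of this sketch)

mass_kg = BITS_IN_FULL_POOL / BITS_PER_KG
rest_energy_j = mass_kg * SPEED_OF_LIGHT ** 2       # E = mc^2

OCEAN_MASS_KG = 1.4e21
HEAT_TO_BOILING_J_PER_KG = 4.0e5    # about 400,000 J from freezing to boiling
LATENT_HEAT_J_PER_KG = 2.0e6        # latent heat of vaporization
boil_energy_j = (HEAT_TO_BOILING_J_PER_KG + LATENT_HEAT_J_PER_KG) * OCEAN_MASS_KG

print(f"mass needed:        {mass_kg:.3g} kg")
print(f"rest energy:        {rest_energy_j:.3g} J")
print(f"energy to boil sea: {boil_energy_j:.3g} J")
print("rest energy exceeds boiling energy:", rest_energy_j > boil_energy_j)
```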

Storage pools

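Unlike traditional file systems, which reside on a single device or need a volume manager to use more than one device, ZFS is built on top of virtual storage pools called zpools. Each pool is made up of virtual devices (vdevs). A vdev can be a raw disk, a RAID 1 mirror, or a multi-disk group using a non-standard RAID level. File systems on the zpool can then draw on the combined storage capacity of all of these vdevs.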

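A quota can be set to limit the space that an individual file system in the pool may occupy, and a reservation can be set to guarantee that space will be available to it.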

Copy-on-write transactional model

ZFS uses a copy-on-write, transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required.
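A toy model may make the two ideas in this paragraph concrete: updates always go to freshly allocated blocks, and every pointer carries a checksum that is verified on read. This is a minimal sketch, not ZFS code. `ToyStore` and `BlockPointer` are hypothetical names, SHA-256 is used only as a convenient 256-bit stand-in checksum, and the metadata tree, transaction groups, and intent log are left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPointer:
    address: int      # where the target block lives in the toy store
    checksum: bytes   # 256-bit checksum of the target block


class ToyStore:
    """Append-only block store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}
        self.next_address = 0

    def write(self, data: bytes) -> BlockPointer:
        # Always allocate a fresh block; existing blocks stay untouched.
        address = self.next_address
        self.next_address += 1
        self.blocks[address] = data
        return BlockPointer(address, hashlib.sha256(data).digest())

    def read(self, pointer: BlockPointer) -> bytes:
        data = self.blocks[pointer.address]
        # Verify the checksum stored in the pointer on every read.
        if hashlib.sha256(data).digest() != pointer.checksum:
            raise IOError(f"checksum mismatch at block {pointer.address}")
        return data


store = ToyStore()
old_ptr = store.write(b"version 1")
new_ptr = store.write(b"version 2")          # an "update" goes to a new block
print(store.read(old_ptr), store.read(new_ptr))

store.blocks[new_ptr.address] = b"bit rot!!"  # simulate silent corruption
corruption_detected = False
try:
    store.read(new_ptr)
except IOError as exc:
    corruption_detected = True
    print("detected:", exc)
```

Because the checksum lives in the pointer rather than next to the data, a damaged block cannot vouch for itself.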

Snapshots and clones

The ZFS copy-on-write model has another powerful advantage: when ZFS writes new data, instead of releasing the blocks containing the old data, it can instead retain them, creating a snapshot version of the file system. ZFS snapshots are created very quickly, since all the data comprising the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.
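The block sharing described above can be illustrated with a small model in which a snapshot copies only pointers, never data. This is a sketch under simplifying assumptions: `ToyFileSystem` is a hypothetical class, a file is a single block, and freeing of unreferenced blocks is not modeled.

```python
class ToyFileSystem:
    """Maps file names to block ids; block contents live in a shared pool."""

    def __init__(self, pool, table=None):
        self.pool = pool                  # shared dict: block id -> bytes
        self.table = dict(table or {})    # file name -> block id

    def write(self, name, data):
        # Copy-on-write: new data always goes to a new block id.
        block_id = len(self.pool)
        self.pool[block_id] = data
        self.table[name] = block_id

    def read(self, name):
        return self.pool[self.table[name]]

    def snapshot(self):
        # A snapshot copies only the pointers, never the data blocks.
        return dict(self.table)

    def clone(self, snapshot):
        # A clone is a writable file system that starts from a snapshot.
        return ToyFileSystem(self.pool, snapshot)


pool = {}
fs = ToyFileSystem(pool)
fs.write("a.txt", b"alpha")
fs.write("b.txt", b"bravo")

snap = fs.snapshot()
clone = fs.clone(snap)
clone.write("b.txt", b"BRAVO")            # the clone diverges from the original

shared = set(fs.table.values()) & set(clone.table.values())
print("blocks in pool:", len(pool))
print("blocks shared by fs and clone:", len(shared))
```

Only the modified file costs an extra block; the unchanged file is stored once and referenced by both file systems.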

Dynamic striping

ZFS stripes dynamically across all devices to maximize throughput: as additional devices are added to the zpool, the stripe width automatically expands to include them. All disks in a pool are therefore used, which balances the write load across them.
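The effect of a widening stripe can be sketched with a toy pool that spreads writes over whatever devices are currently present. The round-robin rule and the `ToyPool` class are assumptions made for illustration; the real allocation policy is not described here.

```python
from collections import Counter
from itertools import count


class ToyPool:
    """Spreads block writes round-robin over whatever devices are present."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.placement = {}          # block number -> device name
        self._next_block = count()

    def add_device(self, name):
        # The stripe widens immediately; existing blocks are not moved.
        self.devices.append(name)

    def write_block(self):
        block = next(self._next_block)
        device = self.devices[block % len(self.devices)]
        self.placement[block] = device
        return device


pool = ToyPool(["disk1", "disk2"])
before = Counter(pool.write_block() for _ in range(4))
pool.add_device("disk3")
after = Counter(pool.write_block() for _ in range(6))

print("before adding disk3:", dict(before))
print("after adding disk3: ", dict(after))
```

In this model new writes use the added disk at once, while blocks written earlier stay where they were.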

Variable block sizes

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used as certain workloads do not perform well with large blocks. Automatic tuning to match workload characteristics is contemplated.

If compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve I/O throughput (though at the cost of increased CPU use for the compression and decompression operations).
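The trade-off can be sketched as a function that decides how much disk space one logical block would take. This is an illustration only: zlib stands in for the actual ZFS compression code, and the 512-byte allocation granularity and the function name `on_disk_size` are assumptions.

```python
import os
import zlib

SECTOR = 512                 # assumption: allocation granularity of this sketch
MAX_BLOCK = 128 * 1024       # the 128-kilobyte maximum block size


def on_disk_size(data: bytes) -> int:
    """Return the bytes a toy allocator would reserve for one logical block."""
    if len(data) > MAX_BLOCK:
        raise ValueError("logical block exceeds the maximum block size")
    compressed = zlib.compress(data)
    # Keep the compressed form only if it actually saves space.
    payload = min(len(compressed), len(data))
    # Round up to a whole number of sectors.
    return -(-payload // SECTOR) * SECTOR


text_block = b"all work and no play " * 6000      # highly compressible
random_block = os.urandom(MAX_BLOCK)              # essentially incompressible

print("text block:  ", len(text_block), "->", on_disk_size(text_block))
print("random block:", len(random_block), "->", on_disk_size(random_block))
```

Compressible data ends up in a much smaller on-disk block, while incompressible data is stored at its original size.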

Lightweight filesystem creation

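In ZFS, manipulating a file system within a storage pool is easier than volume management in traditional file systems. Creating or resizing a ZFS file system is closer to managing a directory than to managing a volume in traditional systems.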

Additional capabilities

  • Explicit I/O priority with deadline scheduling.
  • Globally optimal I/O sorting and aggregation.
  • Multiple independent prefetch streams with automatic length and stride detection.
  • Parallel, constant-time directory operations.
  • End-to-end checksumming, allowing data corruption detection and recovery (if you have redundancy in the pool).
  • Intelligent scrubbing and resilvering.[6]
  • Load and space usage sharing between disks in the pool.[7]
  • Ditto blocks: Metadata is replicated inside the pool, two or three times (according to metadata importance).[8] If the pool has several devices, ZFS tries to replicate over different devices. So a pool without redundancy can lose data if you find bad sectors, but metadata should be fairly safe even in this scenario.
  • The ZFS design (copy-on-write + uberblocks) is safe when using disks with write cache enabled, if they "obey" the "cache flush" commands issued by ZFS. This feature provides safety and a considerable performance boost compared with other filesystems.
  • Given the previous point, when entire disks are given to a ZFS pool, ZFS automatically enables their write cache. This is not done if ZFS manages only discrete slices of the disk, since it doesn't know whether other slices are managed by filesystems that are not write-cache safe, such as UFS (and most others).
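The checksum-based detection and recovery mentioned in the list can be sketched for a two-way mirror: a read returns the first copy whose checksum matches and rewrites any copy that does not. This is a toy illustration, not ZFS code; `self_healing_read` is a hypothetical helper and SHA-256 is a stand-in checksum.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def self_healing_read(copies: list, expected: bytes) -> bytes:
    """Return the first copy matching the checksum and repair the others."""
    good = next((c for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        raise IOError("all copies are corrupt; nothing left to recover from")
    for i, copy in enumerate(copies):
        if checksum(bytes(copy)) != expected:
            copies[i] = bytearray(good)     # rewrite the damaged copy
    return bytes(good)


original = b"important data"
expected = checksum(original)               # kept with the parent pointer
mirror = [bytearray(original), bytearray(original)]

mirror[0][0] ^= 0xFF                        # flip bits on one side of the mirror
data = self_healing_read(mirror, expected)

print("read back:", data)
print("mirror repaired:", all(bytes(c) == original for c in mirror))
```

Without a second copy the corruption would still be detected, but it could not be repaired, which is why the list item qualifies recovery with "if you have redundancy in the pool".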

Cache Management

ZFS also introduces the ARC, a new method for cache management instead of the traditional Solaris virtual memory page cache.

Limitations

ZFS does not yet support transparent encryption (as NTFS does), but an OpenSolaris project is working on this feature.[9]

ZFS does not support per-user or per-group quotas. Instead, it is possible to create user-owned filesystems, each with its own size limit. The low overhead of ZFS filesystems makes this practical even with many users (but, as noted under the current implementation issues, it may slow system startup considerably). There is no practical quota solution for file systems shared among several users (such as team projects), where the data cannot be separated per user, although one could be implemented on top of the ZFS stack.

Capacity expansion is normally achieved by adding groups of disks as a vdev (stripe, RAID-Z, RAID-Z2, or mirrored). Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself. The heal time depends on the amount of stored information, not the disk size. One should refrain from taking snapshots during the process, as this will cause the heal to be restarted.

It is currently not possible to reduce the number of vdevs in a pool or otherwise reduce pool capacity. However, this is expected to be implemented in the near future.[citation needed]


It is not possible to add a disk to a raidz or raidz2 vdev. This feature appears very difficult to implement. It should also be noted that adding a disk to a raidz would degrade the data protection by reducing the proportion of parity to data bits.

Reconfiguring storage requires copying data offline, destroying the pool, and recreating the pool with the new policy.

Current implementation issues

The current ZFS implementation (Solaris 10 11/06) has some issues that administrators should know about before deploying it. These issues are not inherent to ZFS and might be solved in future releases:

  • ZFS is currently not available as a root filesystem on Solaris 10, since there is no ZFS boot support. The ZFS Boot project recently added boot support to the OpenSolaris project, and it is available in recent builds of Solaris Nevada.[10][11] As of 2007-02-08, ZFS boot is planned for a Solaris 10 update in late 2007.
  • If a Solaris Zone is put on ZFS, the system cannot be upgraded — the OS will need to be reinstalled. This issue is planned to be addressed in a Solaris 10 update in 2007.
  • A file "fsync" will commit to disk all pending modifications on the filesystem. That is, an "fsync" on a file will flush out all deferred (cached) operations to the filesystem (not the pool) in which the file is located. This can make some fsync() calls slow when running alongside a workload which writes a lot of data to filesystem cache.[12] The issue is currently fixed in Solaris Nevada.
  • New vdevs can be added to a storage pool, but they cannot be removed. A vdev can be exchanged for a bigger new one, but it cannot be removed to reduce the total pool storage size, even if the pool has enough unused space. The ability to shrink a zpool is a work in progress, currently targeted for a Solaris 10 update in late 2007.
  • ZFS encourages creation of many filesystems inside the pool (for example, for quota control), but importing a pool with thousands of filesystems is a slow operation (can take minutes).
  • ZFS filesystem on-the-fly compression/decompression is single-threaded. So, only one CPU per zpool is used. The issue is now fixed in Solaris Nevada.
  • ZFS eats a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being solved: a) Translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) Current partial-block update code is very inefficient.[13]
  • ZFS Copy-on-Write operation can degrade on-disk file layout (file fragmentation) when files are modified, decreasing performance.
  • ZFS blocksize is configurable per filesystem, currently 128KB by default. If your workload reads/writes data in fixed sizes (blocks), for example a database, you should (manually) configure ZFS blocksize equal to the application blocksize, for better performance and to conserve cache memory and disk bandwidth.
  • ZFS only offlines a faulty hard disk if it can't be opened. Read/write errors or slow/timed-out operations are not currently used in the faulty/spare logic.
  • When listing ZFS space usage, the "used" column only shows non-shared usage. So if some of your data is shared (for example, between snapshots), you don't know how much is there. You don't know, for example, which snapshot deletion would give you more free space.
  • There is work in progress to provide automatic and periodic disk scrubbing, in order to provide corruption detection and early disk-rotting detection. Currently data scrubbing must be done manually with the "zpool scrub" command.
  • Current ZFS compression/decompression code is very fast, but the compression ratio is not comparable to gzip or similar algorithms. There is a project to add new compression modules to ZFS.[14][15][16]
  • When taking or destroying a snapshot while the zpool is scrubbing/resilvering, the process will be restarted from the beginning.[17]

Platforms

ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

Nexenta OS, a complete GNU-based open source operating system built on top of the OpenSolaris kernel and runtime, includes a ZFS implementation, added in version alpha1.

Apple Computer is porting ZFS to their Mac OS X operating system, according to a post by a Sun employee on the opensolaris.org zfs-discuss mailing list, and previewed screenshots of the next version of Apple's Mac OS X.[18] As of Mac OS X 10.5 (Developer Seed 9A321), support for ZFS has been included, but lacks the ability to act as a root partition, noted above. Also, attempts to format local drives using ZFS are unsuccessful; this is a known bug.[19]

Porting ZFS to Linux is complicated by the fact that the GNU General Public License, which governs the Linux kernel, prevents linking with code under other licenses, such as the CDDL, the license ZFS is released under.[20] To work around this problem, the Google Summer of Code program is sponsoring a port of ZFS to Linux's FUSE system so the filesystem will run in userspace instead.[21] However, running a file system outside the kernel on traditional Unix-like systems has a significant performance impact.

There are no plans to port ZFS to HP-UX or AIX.[22]

Pawel Jakub Dawidek has ported and committed ZFS to FreeBSD for inclusion in FreeBSD 7.0, due to be released in 2007.[23]

Adaptive Endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, even between systems implementing different byte orders. The ZFS block pointer format allows for filesystem metadata to be stored in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness doesn't match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data itself: as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
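The read-side behavior can be sketched with Python's `struct` module: metadata is written in the writer's byte order together with a marker of that order, and a reader swaps bytes only when the stored order differs from its own. The one-byte order tag and the function names are hypothetical and do not reflect the actual ZFS block pointer layout.

```python
import struct
import sys

NATIVE = "<" if sys.byteorder == "little" else ">"


def write_metadata(values, order=NATIVE):
    """Pack 64-bit words in the writer's byte order, tagged with that order."""
    tag = b"L" if order == "<" else b"B"      # hypothetical one-byte order tag
    return tag + struct.pack(f"{order}{len(values)}Q", *values)


def read_metadata(blob):
    """Unpack words; struct swaps bytes when the stored order is foreign."""
    order = "<" if blob[:1] == b"L" else ">"
    word_count = (len(blob) - 1) // 8
    return list(struct.unpack(f"{order}{word_count}Q", blob[1:]))


words = [1, 2 ** 48, 0xDEADBEEF]
little = write_metadata(words, "<")   # as written on a little-endian system
big = write_metadata(words, ">")      # as written on a big-endian system

print(read_metadata(little) == read_metadata(big) == words)
```

The two blobs differ byte for byte on disk, yet any reader recovers the same values, which is the property that lets a pool move between architectures.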

References

  1. ZFS: the last word in file systems - Sun Microsystems (September 14, 2004) - Retrieved 2006-04-30.
  2. Jeff Bonwick (October 31, 2005) - ZFS: The Last Word in Filesystems - Jeff Bonwick's Blog - Retrieved 2006-04-30.
  3. Sun Celebrates Successful One-Year Anniversary of OpenSolaris - Sun Microsystems (June 20, 2006).
  4. Jeff Bonwick (2006-05-04) - You say zeta, I say zetta - Jeff Bonwick's Blog - Retrieved 2006-09-08.
  5. Jeff Bonwick (September 25, 2004) - 128-bit storage: are you high? - Sun Microsystems - Retrieved 2006-07-12.
  6. Smokin' Mirrors - Jeff Bonwick's Weblog (2006-05-02) - Retrieved 2007-02-23.
  7. ZFS Block Allocation - Jeff Bonwick's Weblog (2006-11-04) - Retrieved 2007-02-23.
  8. Ditto Blocks - The Amazing Tape Repellent - Flippin' off bits Weblog (2006-05-12) - Retrieved 2007-03-01.
  9. OpenSolaris Project: ZFS on disk encryption support - OpenSolaris Project - Retrieved 2006-12-13.
  10. Latest ZFS add-ons - milek's blog (2007-03-28) - Retrieved 2007-03-29.
  11. ZFS Bootable datasets - happily rumbling - Tim Foster's blog (2007-03-29) - Retrieved 2007-04-01.
  12. The Dynamics of ZFS - Roch Bourbonnais' Weblog (2006-06-21) - Retrieved 2007-02-19.
  13. Implementing fbarrier() on ZFS - zfs-discuss (2007-02-13) - Retrieved 2007-02-13.
  14. gzip for ZFS update - Adam Leventhal's Weblog (2007-01-31) - Retrieved 2007-03-09.
  15. gzip compression support - zfs-discuss (2007-03-23) - Retrieved 2007-04-01.
  16. Gzip compression for ZFS - zfs-discuss (2007-03-29) - Retrieved 2007-04-01.
  17. scrub/resilver has to start over when a snapshot is taken - OpenSolaris Bug Tracker (2005-10-30) - Retrieved 2007-03-14.
  18. Porting ZFS to OSX - zfs-discuss (April 27, 2006) - Retrieved 2006-04-30.
  19. Mac OS X 10.5 9A326 Seeded - InsanelyMac Forums (December 14, 2006) - Retrieved 2006-12-14.
  20. Jeremy Andrews (April 19, 2007) - Linux: ZFS, Licenses and Patents - Retrieved 2007-04-21.
  21. Ricardo Correia (May 26, 2006) - Announcing ZFS on FUSE/Linux - Retrieved 2006-07-15.
  22. Fast Track to Solaris 10 Adoption: ZFS Technology - Solaris 10 Technical Knowledge Base - Sun Microsystems - Retrieved 2006-04-24.
  23. Dawidek, Pawel (April 6, 2007) - ZFS committed to the FreeBSD base - Retrieved 2007-04-06.
