8 年前 · 454b588f73
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -200,7 +200,7 @@ Now we could just leave files here, copying the entire file on write
 
															 provides the synchronization without the duplicated memory requirements
														
 
															 of the metadata blocks. However, we can do a bit better.
														
 
															-## CTZ linked-lists
														
 
															+## CTZ skip-lists
														
 
															 There are many different data structures for representing the actual
														
 
															 files in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg)
														
@@ -246,19 +246,19 @@ runtime to just _read_ a file? That's awful. Keep in mind reading files are
 
															 usually the most common filesystem operation.
														
 
															 To avoid this problem, the littlefs uses a multilayered linked-list. For
														
 
															-every block that is divisible by a power of two, the block contains an
														
 
															-additional pointer that points back by that power of two. Another way of
														
 
															-thinking about this design is that there are actually many linked-lists
														
 
															-threaded together, with each linked-lists skipping an increasing number
														
 
															-of blocks. If you're familiar with data-structures, you may have also
														
 
															-recognized that this is a deterministic skip-list.
														
 
															+every nth block where n is divisible by 2^x, the block contains a pointer
														
 
															+to block n-2^x. So each block contains anywhere from 1 to log2(n) pointers
														
 
															+that skip to various sections of the preceding list. If you're familiar with
														
 
															+data-structures, you may have recognized that this is a type of deterministic
														
 
															+skip-list.
														
 
															-To find the power of two factors efficiently, we can use the instruction
														
 
															-[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros),
														
 
															-which is where this linked-list's name comes from.
														
 
															+The name comes from the use of the
														
 
															+[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros)
														
 
															+instruction, which allows us to calculate the power-of-two factors efficiently.
														
 
															+For a given block n, the block contains ctz(n)+1 pointers.
														
 
															 ```
														
 
															-Exhibit C: A backwards CTZ linked-list
														
 
															+Exhibit C: A backwards CTZ skip-list
														
 
															 .--------.  .--------.  .--------.  .--------.  .--------.  .--------.
														
 
															 | data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
														
 
															 |        |<-|        |--|        |<-|        |--|        |  |        |
														
@@ -266,6 +266,9 @@ Exhibit C: A backwards CTZ linked-list
 
															 '--------'  '--------'  '--------'  '--------'  '--------'  '--------'
														
 
															 ```
														
 
															+The additional pointers allow us to navigate the data-structure on disk
														
 
															+much more efficiently than in a single linked-list.
														
 
															+
														
 
															 Taking exhibit C for example, here is the path from data block 5 to data
														
 
															 block 1. You can see how data block 3 was completely skipped:
														
 
															 ```
														
@@ -285,15 +288,57 @@ The path to data block 0 is even more quick, requiring only two jumps:
 
															 '--------'  '--------'  '--------'  '--------'  '--------'  '--------'
														
 
															 ```
														
 
															-The CTZ linked-list has quite a few interesting properties. All of the pointers
														
 
															-in the block can be found by just knowing the index in the list of the current
														
 
															-block, and, with a bit of math, the amortized overhead for the linked-list is
														
 
															-only two pointers per block.  Most importantly, the CTZ linked-list has a
														
 
															-worst case lookup runtime of O(logn), which brings the runtime of reading a
														
 
															-file down to O(n logn). Given that the constant runtime is divided by the
														
 
															-amount of data we can store in a block, this is pretty reasonable.
														
 
															-
														
 
															-Here is what it might look like to update a file stored with a CTZ linked-list:
														
 
															+We can find the runtime complexity by looking at the path to any block from
														
 
															+the block containing the most pointers. Every step along the path divides
														
 
															+the search space for the block in half. This gives us a runtime of O(log n).
														
 
															+To get to the block with the most pointers, we can perform the same steps
														
 
															+backwards, which keeps the asymptotic runtime at O(log n). The interesting
														
 
															+part about this data structure is that this optimal path occurs naturally
														
 
															+if we greedily choose the pointer that covers the most distance without passing
														
 
															+our target block.
														
 
															+
														
 
															+So now we have a representation of files that can be appended trivially with
														
 
															+a runtime of O(1), and can be read with a worst case runtime of O(n logn).
														
 
															+Given that the the runtime is also divided by the amount of data we can store
														
 
															+in a block, this is pretty reasonable.
														
 
															+
														
 
															+Unfortunately, the CTZ skip-list comes with a few questions that aren't
														
 
															+straightforward to answer. What is the overhead? How do we handle more
														
 
															+pointers than we can store in a block?
														
 
															+
														
 
															+One way to find the overhead per block is to look at the data structure as
														
 
															+multiple layers of linked-lists. Each linked-list skips twice as many blocks
														
 
															+as the previous linked-list. Or another way of looking at it is that each 
														
 
															+linked-list uses half as much storage per block as the previous linked-list.
														
 
															+As we approach infinity, the number of pointers per block forms a geometric
														
 
															+series. Solving this geometric series gives us an average of only 2 pointers
														
 
															+per block.
														
 
															+
														
 
															+![overhead per block](https://latex.codecogs.com/gif.latex?%5Clim_%7Bn%5Cto%5Cinfty%7D%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D0%7D%5E%7Bn%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29&plus;1%5Cright%29%20%3D%20%5Csum_%7Bi%3D0%7D%5E%7B%5Cinfty%7D%5Cfrac%7B1%7D%7B2%5Ei%7D%20%3D%202)
														
 
															+
														
 
															+Finding the maximum number of pointers in a block is a bit more complicated,
														
 
															+but since our file size is limited by the integer width we use to store the
														
 
															+size, we can solve for it. Setting the overhead of the maximum pointers equal
														
 
															+to the block size we get the following equation. Note that a smaller block size
														
 
															+results in more pointers, and a larger word width results in larger pointers.
														
 
															+
														
 
															+![maximum overhead](https://latex.codecogs.com/gif.latex?B%20%3D%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%5Clceil%5Clog_2%5Cleft%28%5Cfrac%7B2%5Ew%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%29%5Cright%5Crceil)
														
 
															+
														
 
															+where:  
														
 
															+B = block size in bytes  
														
 
															+w = word width in bits  
														
 
															+
														
 
															+Solving the equation for B gives us the minimum block size for various word
														
 
															+widths:  
														
 
															+32 bit CTZ skip-list = minimum block size of 104 bytes  
														
 
															+64 bit CTZ skip-list = minimum block size of 448 bytes  
														
 
															+
														
 
															+Since littlefs uses a 32 bit word size, we are limited to a minimum block
														
 
															+size of 104 bytes. This is a perfectly reasonable minimum block size, with most
														
 
															+block sizes starting around 512 bytes. So we can avoid the additional logic
														
 
															+needed to avoid overflowing our block's capacity in the CTZ skip-list.
														
 
															+
														
 
															+Here is what it might look like to update a file stored with a CTZ skip-list:
														
 
															 ```
														
 
															                                       block 1   block 2
														
 
															                                     .---------.---------.
														
@@ -367,7 +412,7 @@ v
 
															 ## Block allocation
														
 
															 So those two ideas provide the grounds for the filesystem. The metadata pairs
														
 
															-give us directories, and the CTZ linked-lists give us files. But this leaves
														
 
															+give us directories, and the CTZ skip-lists give us files. But this leaves
														
 
															 one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg)
														
 
															 of a question. How do we get those blocks in the first place?
														
@@ -653,9 +698,17 @@ deorphan step that simply iterates through every directory in the linked-list
 
															 and checks it against every directory entry in the filesystem to see if it
														
 
															 has a parent. The deorphan step occurs on the first block allocation after
														
 
															 boot, so orphans should never cause the littlefs to run out of storage
														
 
															-prematurely.
														
 
															+prematurely. Note that the deorphan step never needs to run in a readonly
														
 
															+filesystem.
														
 
															+
														
 
															+## The move problem
														
 
															-And for my final trick, moving a directory:
														
 
															+Now we have a real problem. How do we move things between directories while
														
 
															+remaining power resilient? Even looking at the problem from a high level,
														
 
															+it seems impossible. We can update directory blocks atomically, but atomically
														
 
															+updating two independent directory blocks is not an atomic operation.
														
 
															+
														
 
															+Here's the steps the filesystem may go through to move a directory:
														
 
															 ```
														
 
															          .--------.
														
 
															          |root dir|-.
														
@@ -716,18 +769,135 @@ v
 
															      '--------'
														
 
															 ```
														
 
															-Note that once again we don't care about the ordering of directories in the
														
 
															-linked-list, so we can simply leave directories in their old positions. This
														
 
															-does make the diagrams a bit hard to draw, but the littlefs doesn't really
														
 
															-care.
														
 
															+We can leave any orphans up to the deorphan step to collect, but that doesn't
														
 
															+help the case where dir A has both dir B and the root dir as parents if we
														
 
															+lose power inconveniently.
														
 
															+
														
 
															+Initially, you might think this is fine. Dir A _might_ end up with two parents,
														
 
															+but the filesystem will still work as intended. But then this raises the
														
 
															+question of what do we do when the dir A wears out? For other directory blocks
														
 
															+we can update the parent pointer, but for a dir with two parents we would need
														
 
															+work out how to update both parents. And the check for multiple parents would
														
 
															+need to be carried out for every directory, even if the directory has never
														
 
															+been moved.
														
 
															+
														
 
															+It also presents a bad user-experience, since the condition of ending up with
														
 
															+two parents is rare, it's unlikely user-level code will be prepared. Just think
														
 
															+about how a user would recover from a multi-parented directory. They can't just
														
 
															+remove one directory, since remove would report the directory as "not empty".
														
 
															+
														
 
															+Other atomic filesystems simple COW the entire directory tree. But this
														
 
															+introduces a significant bit of complexity, which leads to code size, along
														
 
															+with a surprisingly expensive runtime cost during what most users assume is
														
 
															+a single pointer update.
														
 
															+
														
 
															+Another option is to update the directory block we're moving from to point
														
 
															+to the destination with a sort of predicate that we have moved if the
														
 
															+destination exists. Unfortunately, the omnipresent concern of wear could
														
 
															+cause any of these directory entries to change blocks, and changing the
														
 
															+entry size before a move introduces complications if it spills out of
														
 
															+the current directory block.
														
 
															+
														
 
															+So how do we go about moving a directory atomically?
														
 
															+
														
 
															+We rely on the improbableness of power loss.
														
 
															+
														
 
															+Power loss during a move is certainly possible, but it's actually relatively
														
 
															+rare. Unless a device is writing to a filesystem constantly, it's unlikely that
														
 
															+a power loss will occur during filesystem activity. We still need to handle
														
 
															+the condition, but runtime during a power loss takes a back seat to the runtime
														
 
															+during normal operations.
														
 
															+
														
 
															+So what littlefs does is unelegantly simple. When littlefs moves a file, it
														
 
															+marks the file as "moving". This is stored as a single bit in the directory
														
 
															+entry and doesn't take up much space. Then littlefs moves the directory,
														
 
															+finishing with the complete remove of the "moving" directory entry.
														
 
															+
														
 
															+```
														
 
															+         .--------.
														
 
															+         |root dir|-.
														
 
															+         | pair 0 | |
														
 
															+.--------|        |-'
														
 
															+|        '--------'
														
 
															+|        .-'    '-.
														
 
															+|       v          v
														
 
															+|  .--------.  .--------.
														
 
															+'->| dir A  |->| dir B  |
														
 
															+   | pair 0 |  | pair 0 |
														
 
															+   |        |  |        |
														
 
															+   '--------'  '--------'
														
 
															+
														
 
															+|  update root directory to mark directory A as moving
														
 
															+v
														
 
															+
														
 
															+        .----------.
														
 
															+        |root dir  |-.
														
 
															+        | pair 0   | |
														
 
															+.-------| moving A!|-'
														
 
															+|       '----------'
														
 
															+|        .-'    '-.
														
 
															+|       v          v
														
 
															+|  .--------.  .--------.
														
 
															+'->| dir A  |->| dir B  |
														
 
															+   | pair 0 |  | pair 0 |
														
 
															+   |        |  |        |
														
 
															+   '--------'  '--------'
														
 
															+
														
 
															+|  update directory B to point to directory A
														
 
															+v
														
 
															+
														
 
															+        .----------.
														
 
															+        |root dir  |-.
														
 
															+        | pair 0   | |
														
 
															+.-------| moving A!|-'
														
 
															+|       '----------'
														
 
															+|    .-----'    '-.
														
 
															+|    |             v
														
 
															+|    |           .--------.
														
 
															+|    |        .->| dir B  |
														
 
															+|    |        |  | pair 0 |
														
 
															+|    |        |  |        |
														
 
															+|    |        |  '--------'
														
 
															+|    |     .-------'
														
 
															+|    v    v   |
														
 
															+|  .--------. |
														
 
															+'->| dir A  |-'
														
 
															+   | pair 0 |
														
 
															+   |        |
														
 
															+   '--------'
														
 
															+
														
 
															+|  update root to no longer contain directory A
														
 
															+v
														
 
															+     .--------.
														
 
															+     |root dir|-.
														
 
															+     | pair 0 | |
														
 
															+.----|        |-'
														
 
															+|    '--------'
														
 
															+|        |
														
 
															+|        v
														
 
															+|    .--------.
														
 
															+| .->| dir B  |
														
 
															+| |  | pair 0 |
														
 
															+| '--|        |-.
														
 
															+|    '--------' |
														
 
															+|        |      |
														
 
															+|        v      |
														
 
															+|    .--------. |
														
 
															+'--->| dir A  |-'
														
 
															+     | pair 0 |
														
 
															+     |        |
														
 
															+     '--------'
														
 
															+```
														
 
															+
														
 
															+Now, if we run into a directory entry that has been marked as "moved", one
														
 
															+of two things is possible. Either the directory entry exists elsewhere in the
														
 
															+filesystem, or it doesn't. This is a O(n) operation, but only occurs in the
														
 
															+unlikely case we lost power during a move.
														
 
															-It's also worth noting that once again we have an operation that isn't actually
														
 
															-atomic. After we add directory A to directory B, we could lose power, leaving
														
 
															-directory A as a part of both the root directory and directory B. However,
														
 
															-there isn't anything inherent to the littlefs that prevents a directory from
														
 
															-having multiple parents, so in this case, we just allow that to happen. Extra
														
 
															-care is taken to only remove a directory from the linked-list if there are
														
 
															-no parents left in the filesystem.
														
 
															+And we can easily fix the "moved" directory entry. Since we're already scanning
														
 
															+the filesystem during the deorphan step, we can also check for moved entries.
														
 
															+If we find one, we either remove the "moved" marking or remove the whole entry
														
 
															+if it exists elsewhere in the filesystem.
														
 
															 ## Wear awareness
														
@@ -955,18 +1125,18 @@ So, to summarize:
 
															 1. The littlefs is composed of directory blocks
														
 
															 2. Each directory is a linked-list of metadata pairs
														
 
															-3. These metadata pairs can be updated atomically by alternative which
														
 
															+3. These metadata pairs can be updated atomically by alternating which
														
 
															    metadata block is active
														
 
															 4. Directory blocks contain either references to other directories or files
														
 
															-5. Files are represented by copy-on-write CTZ linked-lists
														
 
															-6. The CTZ linked-lists support appending in O(1) and reading in O(n logn)
														
 
															-7. Blocks are allocated by scanning the filesystem for used blocks in a
														
 
															+5. Files are represented by copy-on-write CTZ skip-lists which support O(1)
														
 
															+   append and O(n logn) reading
														
 
															+6. Blocks are allocated by scanning the filesystem for used blocks in a
														
 
															    fixed-size lookahead region is that stored in a bit-vector
														
 
															-8. To facilitate scanning the filesystem, all directories are part of a
														
 
															+7. To facilitate scanning the filesystem, all directories are part of a
														
 
															    linked-list that is threaded through the entire filesystem
														
 
															-9. If a block develops an error, the littlefs allocates a new block, and
														
 
															+8. If a block develops an error, the littlefs allocates a new block, and
														
 
															    moves the data and references of the old block to the new.
														
 
															-10. Any case where an atomic operation is not possible, it is taken care of
														
 
															+9. Any case where an atomic operation is not possible, mistakes are resolved
														
 
															    by a deorphan step that occurs on the first allocation after boot
														
 
															 That's the little filesystem. Thanks for reading!
														
--- a/SPEC.md
+++ b/SPEC.md
@@ -121,13 +121,18 @@ Here's the layout of entries on disk:
 
															 **Entry type** - Type of the entry, currently this is limited to the following:
														
 
															 - 0x11 - file entry
														
 
															 - 0x22 - directory entry
														
 
															-- 0xe2 - superblock entry
														
 
															+- 0x2e - superblock entry
														
 
															-Additionally, the type is broken into two 4 bit nibbles, with the lower nibble
														
 
															+Additionally, the type is broken into two 4 bit nibbles, with the upper nibble
														
 
															 specifying the type's data structure used when scanning the filesystem. The
														
 
															-upper nibble clarifies the type further when multiple entries share the same
														
 
															+lower nibble clarifies the type further when multiple entries share the same
														
 
															 data structure.
														
 
															+The highest bit is reserved for marking the entry as "moved". If an entry
														
 
															+is marked as "moved", the entry may also exist somewhere else in the
														
 
															+filesystem. If the entry exists elsewhere, this entry must be treated as
														
 
															+though it does not exist.
														
 
															+
														
 
															 **Entry length** - Length in bytes of the entry-specific data. This does
														
 
															 not include the entry type size, attributes, or name. The full size in bytes
														
 
															 of the entry is 4 + entry length + attribute length + name length.
														
@@ -175,7 +180,7 @@ Here's the layout of the superblock entry:
 
															 | offset | size                   | description                            |
														
 
															 |--------|------------------------|----------------------------------------|
														
 
															-| 0x00   | 8 bits                 | entry type (0xe2 for superblock entry) |
														
 
															+| 0x00   | 8 bits                 | entry type (0x2e for superblock entry) |
														
 
															 | 0x01   | 8 bits                 | entry length (20 bytes)                |
														
 
															 | 0x02   | 8 bits                 | attribute length                       |
														
 
															 | 0x03   | 8 bits                 | name length (8 bytes)                  |
														
@@ -208,7 +213,7 @@ Here's an example of a complete superblock:
 
															 (32 bits) revision count   = 3                    (0x00000003)
														
 
															 (32 bits) dir size         = 52 bytes, end of dir (0x00000034)
														
 
															 (64 bits) tail pointer     = 3, 2                 (0x00000003, 0x00000002)
														
 
															-(8 bits)  entry type       = superblock           (0xe2)
														
 
															+(8 bits)  entry type       = superblock           (0x2e)
														
 
															 (8 bits)  entry length     = 20 bytes             (0x14)
														
 
															 (8 bits)  attribute length = 0 bytes              (0x00)
														
 
															 (8 bits)  name length      = 8 bytes              (0x08)
														
@@ -220,7 +225,7 @@ Here's an example of a complete superblock:
 
															 (32 bits) crc              = 0xc50b74fa
														
 
															 00000000: 03 00 00 00 34 00 00 00 03 00 00 00 02 00 00 00  ....4...........
														
 
															-00000010: e2 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00  ................
														
 
															+00000010: 2e 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00  ................
														
 
															 00000020: 00 04 00 00 01 00 01 00 6c 69 74 74 6c 65 66 73  ........littlefs
														
 
															 00000030: fa 74 0b c5                                      .t..
														
 
															 ```
														
@@ -262,15 +267,19 @@ Here's an example of a directory entry:
 
															 Files are stored in entries with a pointer to the head of the file and the
														
 
															 size of the file. This is enough information to determine the state of the
														
 
															-CTZ linked-list that is being referenced.
														
 
															+CTZ skip-list that is being referenced.
														
 
															 How files are actually stored on disk is a bit complicated. The full
														
 
															-explanation of CTZ linked-lists can be found in [DESIGN.md](DESIGN.md#ctz-linked-lists).
														
 
															+explanation of CTZ skip-lists can be found in [DESIGN.md](DESIGN.md#ctz-skip-lists).
														
 
															 A terribly quick summary: For every nth block where n is divisible by 2^x,
														
 
															-the block contains a pointer that points x blocks towards the beginning of the
														
 
															-file. These pointers are stored in order of x in each block of the file
														
 
															-immediately before the data in the block.
														
 
															+the block contains a pointer to block n-2^x. These pointers are stored in
														
 
															+increasing order of x in each block of the file preceding the data in the
														
 
															+block.
														
 
															+
														
 
															+The maximum number of pointers in a block is bounded by the maximum file size
														
 
															+divided by the block size. With 32 bits for file size, this results in a
														
 
															+minimum block size of 104 bytes.
														
 
															 Here's the layout of a file entry:
														
@@ -286,7 +295,7 @@ Here's the layout of a file entry:
 
															 | 0xc+a  | name length bytes      | directory name                     |
														
 
															 **File head** - Pointer to the block that is the head of the file's CTZ
														
 
															-linked-list.
														
 
															+skip-list.
														
 
															 **File size** - Size of file in bytes.