Updated SPEC.md and DESIGN.md based on recent changes

- Added math behind CTZ limits
- Added documentation over atomic moves
Christopher Haster 8 years ago
parent
commit
454b588f73
2 changed files with 232 additions and 53 deletions
  1. DESIGN.md (+211 -41)
  2. SPEC.md (+21 -12)

+ 211 - 41
DESIGN.md

@@ -200,7 +200,7 @@ Now we could just leave files here, copying the entire file on write
 provides the synchronization without the duplicated memory requirements
 of the metadata blocks. However, we can do a bit better.
 
-## CTZ linked-lists
+## CTZ skip-lists
 
 
 There are many different data structures for representing the actual
 files in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg)
@@ -246,19 +246,19 @@ runtime to just _read_ a file? That's awful. Keep in mind reading files are
 usually the most common filesystem operation.
 
 To avoid this problem, the littlefs uses a multilayered linked-list. For
-every block that is divisible by a power of two, the block contains an
-additional pointer that points back by that power of two. Another way of
-thinking about this design is that there are actually many linked-lists
-threaded together, with each linked-lists skipping an increasing number
-of blocks. If you're familiar with data-structures, you may have also
-recognized that this is a deterministic skip-list.
+every nth block where n is divisible by 2^x, the block contains a pointer
+to block n-2^x. So each block contains anywhere from 1 to log2(n)+1 pointers
+that skip to various sections of the preceding list. If you're familiar with
+data-structures, you may have recognized that this is a type of deterministic
+skip-list.
 
 
-To find the power of two factors efficiently, we can use the instruction
-[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros),
-which is where this linked-list's name comes from.
+The name comes from the use of the
+[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros)
+instruction, which allows us to calculate the power-of-two factors efficiently.
+For a given block n, the block contains ctz(n)+1 pointers.
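
As a rough illustration of the rule above, here is a minimal Python sketch of which blocks a given block n points back to. The names `ctz` and `pointer_targets` are hypothetical helpers for this example and are not taken from the littlefs source; block 0 is assumed to hold no pointers since nothing precedes it.

```python
def ctz(n):
    """Count the trailing zero bits of a positive integer."""
    if n <= 0:
        raise ValueError("ctz is only defined here for n > 0")
    # n & -n isolates the lowest set bit; its position is the trailing zero count
    return (n & -n).bit_length() - 1


def pointer_targets(n):
    """Return the blocks that block n points back to, in increasing order of x.

    Block n holds one pointer to block n - 2^x for every x where 2^x divides n,
    which gives ctz(n) + 1 pointers in total.
    """
    if n == 0:
        return []  # assumption: the first block has nothing to point back to
    return [n - (1 << x) for x in range(ctz(n) + 1)]


if __name__ == "__main__":
    for block in range(1, 9):
        print(block, pointer_targets(block))
```

For example, block 4 would hold three pointers (to blocks 3, 2 and 0), while block 5 would hold only one (to block 4).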
 
 
 ```
-Exhibit C: A backwards CTZ linked-list
+Exhibit C: A backwards CTZ skip-list
 .--------.  .--------.  .--------.  .--------.  .--------.  .--------.
 | data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
 |        |<-|        |--|        |<-|        |--|        |  |        |
@@ -266,6 +266,9 @@ Exhibit C: A backwards CTZ linked-list
 '--------'  '--------'  '--------'  '--------'  '--------'  '--------'
 ```
 
+The additional pointers allow us to navigate the data-structure on disk
+much more efficiently than in a single linked-list.
+
 Taking exhibit C for example, here is the path from data block 5 to data
 block 1. You can see how data block 3 was completely skipped:
 ```
@@ -285,15 +288,57 @@ The path to data block 0 is even more quick, requiring only two jumps:
 '--------'  '--------'  '--------'  '--------'  '--------'  '--------'
 ```
 
-The CTZ linked-list has quite a few interesting properties. All of the pointers
-in the block can be found by just knowing the index in the list of the current
-block, and, with a bit of math, the amortized overhead for the linked-list is
-only two pointers per block.  Most importantly, the CTZ linked-list has a
-worst case lookup runtime of O(logn), which brings the runtime of reading a
-file down to O(n logn). Given that the constant runtime is divided by the
-amount of data we can store in a block, this is pretty reasonable.
-
-Here is what it might look like to update a file stored with a CTZ linked-list:
+We can find the runtime complexity by looking at the path to any block from
+the block containing the most pointers. Every step along the path divides
+the search space for the block in half. This gives us a runtime of O(log n).
+To get to the block with the most pointers, we can perform the same steps
+backwards, which keeps the asymptotic runtime at O(log n). The interesting
+part about this data structure is that this optimal path occurs naturally
+if we greedily choose the pointer that covers the most distance without passing
+our target block.
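
The greedy choice described above can be sketched in a few lines of Python. This is an illustrative model that works on block indices only, not on real disk blocks; `ctz` and `greedy_path` are hypothetical names for this example.

```python
def ctz(n):
    """Count the trailing zero bits of a positive integer."""
    return (n & -n).bit_length() - 1


def greedy_path(start, target):
    """Walk backwards from block `start` to block `target`.

    At each block we take the pointer that covers the most distance
    without passing the target, and record every block we visit.
    """
    if not 0 <= target <= start:
        raise ValueError("target must lie between block 0 and the start block")
    path = [start]
    current = start
    while current > target:
        # Largest x such that block `current` has a pointer skipping 2^x blocks
        # (x <= ctz(current)) and the skip does not overshoot the target.
        x = min(ctz(current), (current - target).bit_length() - 1)
        current -= 1 << x
        path.append(current)
    return path


if __name__ == "__main__":
    print(greedy_path(5, 1))
    print(greedy_path(5, 0))
```

With the six blocks of exhibit C, the walk from block 5 to block 1 visits 5, 4, 2, 1 (skipping block 3), and the walk to block 0 visits 5, 4, 0.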
+
+So now we have a representation of files that can be appended trivially with
+a runtime of O(1), and can be read with a worst case runtime of O(n logn).
+Given that the runtime is also divided by the amount of data we can store
+in a block, this is pretty reasonable.
+
+Unfortunately, the CTZ skip-list comes with a few questions that aren't
+straightforward to answer. What is the overhead? How do we handle more
+pointers than we can store in a block?
+
+One way to find the overhead per block is to look at the data structure as
+multiple layers of linked-lists. Each linked-list skips twice as many blocks
+as the previous linked-list. Or another way of looking at it is that each
+linked-list uses half as much storage per block as the previous linked-list.
+As the number of blocks approaches infinity, the number of pointers per block
+forms a geometric series. Solving this geometric series gives us an average
+of only 2 pointers per block.
+
+![overhead per block](https://latex.codecogs.com/gif.latex?%5Clim_%7Bn%5Cto%5Cinfty%7D%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D0%7D%5E%7Bn%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29&plus;1%5Cright%29%20%3D%20%5Csum_%7Bi%3D0%7D%5E%7B%5Cinfty%7D%5Cfrac%7B1%7D%7B2%5Ei%7D%20%3D%202)
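
The limit above can be checked numerically. The following Python sketch counts pointers over the first n blocks, starting from block 1 on the assumption that block 0 holds no pointers (ctz(0) is undefined); `average_pointers` is a hypothetical helper name.

```python
def ctz(n):
    """Count the trailing zero bits of a positive integer."""
    return (n & -n).bit_length() - 1


def total_pointers(block_count):
    """Total number of pointers stored in blocks 1..block_count."""
    return sum(ctz(i) + 1 for i in range(1, block_count + 1))


def average_pointers(block_count):
    """Average number of pointers per block over blocks 1..block_count."""
    return total_pointers(block_count) / block_count


if __name__ == "__main__":
    for n in (8, 64, 1024, 65536):
        print(n, average_pointers(n))
```

When the block count n is a power of two, the total works out to 2n-1 pointers, so the average approaches 2 from below as n grows.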
+
+Finding the maximum number of pointers in a block is a bit more complicated,
+but since our file size is limited by the integer width we use to store the
+size, we can solve for it. Setting the overhead of the maximum pointers equal
+to the block size we get the following equation. Note that a smaller block size
+results in more pointers, and a larger word width results in larger pointers.
+
+![maximum overhead](https://latex.codecogs.com/gif.latex?B%20%3D%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%5Clceil%5Clog_2%5Cleft%28%5Cfrac%7B2%5Ew%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%29%5Cright%5Crceil)
+
+where:  
+B = block size in bytes  
+w = word width in bits  
+
+Solving the equation for B gives us the minimum block size for various word
+widths:  
+32 bit CTZ skip-list = minimum block size of 104 bytes  
+64 bit CTZ skip-list = minimum block size of 448 bytes  
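
One way to reproduce these numbers is to search for the smallest B that satisfies the equation above. The Python sketch below uses only integer arithmetic, relying on the identity ceil(log2(2^w / x)) = w - floor(log2(x)) for an integer x >= 1; the function names are hypothetical and chosen for this example.

```python
def max_pointer_bytes(block_size, word_bits):
    """Right-hand side of the equation: (w/8) * ceil(log2(2^w / (B - 2*w/8)))."""
    word_bytes = word_bits // 8
    usable = block_size - 2 * word_bytes
    if usable < 1:
        raise ValueError("block size must exceed two words")
    # floor(log2(usable)) is usable.bit_length() - 1 for a positive integer
    return word_bytes * (word_bits - (usable.bit_length() - 1))


def min_block_size(word_bits):
    """Smallest block size in bytes that can hold the worst-case pointers."""
    block_size = 2 * (word_bits // 8) + 1
    # The pointer overhead shrinks as blocks grow, so the first block size
    # that fits its own worst case is the minimum.
    while max_pointer_bytes(block_size, word_bits) > block_size:
        block_size += 1
    return block_size


if __name__ == "__main__":
    print(min_block_size(32))
    print(min_block_size(64))
```

Under these assumptions the search stops at 104 bytes for 32 bit words and 448 bytes for 64 bit words, matching the values listed above.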
+
+Since littlefs uses a 32 bit word size, we are limited to a minimum block
+size of 104 bytes. This is a perfectly reasonable minimum block size, with most
+block sizes starting around 512 bytes. So we can skip the additional logic
+needed to avoid overflowing our block's capacity in the CTZ skip-list.
+
+Here is what it might look like to update a file stored with a CTZ skip-list:
 ```
                                       block 1   block 2
                                     .---------.---------.
@@ -367,7 +412,7 @@ v
 ## Block allocation
 
 So those two ideas provide the grounds for the filesystem. The metadata pairs
-give us directories, and the CTZ linked-lists give us files. But this leaves
+give us directories, and the CTZ skip-lists give us files. But this leaves
 one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg)
 of a question. How do we get those blocks in the first place?
 
@@ -653,9 +698,17 @@ deorphan step that simply iterates through every directory in the linked-list
 and checks it against every directory entry in the filesystem to see if it
 has a parent. The deorphan step occurs on the first block allocation after
 boot, so orphans should never cause the littlefs to run out of storage
-prematurely.
+prematurely. Note that the deorphan step never needs to run in a read-only
+filesystem.
+
+## The move problem
 
 
-And for my final trick, moving a directory:
+Now we have a real problem. How do we move things between directories while
+remaining power resilient? Even looking at the problem from a high level,
+it seems impossible. We can update directory blocks atomically, but
+updating two independent directory blocks is not an atomic operation.
+
+Here are the steps the filesystem may go through to move a directory:
 ```
          .--------.
          |root dir|-.
@@ -716,18 +769,135 @@ v
      '--------'
 ```
 
-Note that once again we don't care about the ordering of directories in the
-linked-list, so we can simply leave directories in their old positions. This
-does make the diagrams a bit hard to draw, but the littlefs doesn't really
-care.
+We can leave any orphans up to the deorphan step to collect, but that doesn't
+help the case where dir A has both dir B and the root dir as parents if we
+lose power inconveniently.
+
+Initially, you might think this is fine. Dir A _might_ end up with two parents,
+but the filesystem will still work as intended. But then this raises the
+question of what do we do when dir A wears out? For other directory blocks
+we can update the parent pointer, but for a dir with two parents we would need
+to work out how to update both parents. And the check for multiple parents would
+need to be carried out for every directory, even if the directory has never
+been moved.
+
+It also presents a bad user-experience. Since the condition of ending up with
+two parents is rare, it's unlikely user-level code will be prepared. Just think
+about how a user would recover from a multi-parented directory. They can't just
+remove one directory, since remove would report the directory as "not empty".
+
+Other atomic filesystems simply COW the entire directory tree. But this
+introduces a significant bit of complexity, which leads to increased code size,
+along with a surprisingly expensive runtime cost during what most users assume
+is a single pointer update.
+
+Another option is to update the directory block we're moving from to point
+to the destination with a sort of predicate that we have moved if the
+destination exists. Unfortunately, the omnipresent concern of wear could
+cause any of these directory entries to change blocks, and changing the
+entry size before a move introduces complications if it spills out of
+the current directory block.
+
+So how do we go about moving a directory atomically?
+
+We rely on the improbability of power loss.
+
+Power loss during a move is certainly possible, but it's actually relatively
+rare. Unless a device is writing to a filesystem constantly, it's unlikely that
+a power loss will occur during filesystem activity. We still need to handle
+the condition, but runtime during a power loss takes a back seat to the runtime
+during normal operations.
+
+So what littlefs does is inelegantly simple. When littlefs moves an entry, it
+first marks the source entry as "moving". This is stored as a single bit in the
+directory entry and doesn't take up much space. Then littlefs adds the entry to
+its destination, finishing with the complete removal of the "moving" directory
+entry.
+
+```
+         .--------.
+         |root dir|-.
+         | pair 0 | |
+.--------|        |-'
+|        '--------'
+|        .-'    '-.
+|       v          v
+|  .--------.  .--------.
+'->| dir A  |->| dir B  |
+   | pair 0 |  | pair 0 |
+   |        |  |        |
+   '--------'  '--------'
+
+|  update root directory to mark directory A as moving
+v
+
+        .----------.
+        |root dir  |-.
+        | pair 0   | |
+.-------| moving A!|-'
+|       '----------'
+|        .-'    '-.
+|       v          v
+|  .--------.  .--------.
+'->| dir A  |->| dir B  |
+   | pair 0 |  | pair 0 |
+   |        |  |        |
+   '--------'  '--------'
+
+|  update directory B to point to directory A
+v
+
+        .----------.
+        |root dir  |-.
+        | pair 0   | |
+.-------| moving A!|-'
+|       '----------'
+|    .-----'    '-.
+|    |             v
+|    |           .--------.
+|    |        .->| dir B  |
+|    |        |  | pair 0 |
+|    |        |  |        |
+|    |        |  '--------'
+|    |     .-------'
+|    v    v   |
+|  .--------. |
+'->| dir A  |-'
+   | pair 0 |
+   |        |
+   '--------'
+
+|  update root to no longer contain directory A
+v
+     .--------.
+     |root dir|-.
+     | pair 0 | |
+.----|        |-'
+|    '--------'
+|        |
+|        v
+|    .--------.
+| .->| dir B  |
+| |  | pair 0 |
+| '--|        |-.
+|    '--------' |
+|        |      |
+|        v      |
+|    .--------. |
+'--->| dir A  |-'
+     | pair 0 |
+     |        |
+     '--------'
+```
+
+Now, if we run into a directory entry that has been marked as "moved", one
+of two things is possible. Either the directory entry exists elsewhere in the
+filesystem, or it doesn't. Checking which is the case is an O(n) operation,
+but it only occurs in the unlikely case we lost power during a move.
 
 
-It's also worth noting that once again we have an operation that isn't actually
-atomic. After we add directory A to directory B, we could lose power, leaving
-directory A as a part of both the root directory and directory B. However,
-there isn't anything inherent to the littlefs that prevents a directory from
-having multiple parents, so in this case, we just allow that to happen. Extra
-care is taken to only remove a directory from the linked-list if there are
-no parents left in the filesystem.
+And we can easily fix the "moved" directory entry. Since we're already scanning
+the filesystem during the deorphan step, we can also check for moved entries.
+If we find one, we either remove the "moved" marking or remove the whole entry
+if it exists elsewhere in the filesystem.
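
To make the recovery rule concrete, here is a toy Python model of it. It is only a sketch under simplifying assumptions: directories are plain dicts held in memory, each entry is a dict with hypothetical `name`, `target` and `moved` fields, and two entries refer to the same thing when their `target` values are equal. None of this reflects the on-disk format.

```python
def resolve_moved(directories):
    """Fix up entries left marked as "moved" by an interrupted move.

    `directories` maps a directory name to its list of entries. For every
    entry marked as moved, we scan the whole filesystem (the O(n) step):
    if the same target exists elsewhere the move completed, so the marked
    entry is dropped; otherwise the move never happened and only the
    marking is cleared.
    """
    for entries in directories.values():
        for entry in list(entries):  # iterate over a copy while editing
            if not entry["moved"]:
                continue
            exists_elsewhere = any(
                other is not entry and other["target"] == entry["target"]
                for other_entries in directories.values()
                for other in other_entries
            )
            if exists_elsewhere:
                entries[:] = [e for e in entries if e is not entry]
            else:
                entry["moved"] = False


if __name__ == "__main__":
    fs = {
        "root": [{"name": "A", "target": "dir A pair 0", "moved": True}],
        "B": [{"name": "A", "target": "dir A pair 0", "moved": False}],
    }
    resolve_moved(fs)
    print(fs)
```

In this example power was lost after directory B gained its entry for directory A, so the stale entry in the root directory is removed and directory A ends up with a single parent.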
 
 
 ## Wear awareness
 
@@ -955,18 +1125,18 @@ So, to summarize:
 
 
 1. The littlefs is composed of directory blocks
 2. Each directory is a linked-list of metadata pairs
-3. These metadata pairs can be updated atomically by alternative which
+3. These metadata pairs can be updated atomically by alternating which
    metadata block is active
 4. Directory blocks contain either references to other directories or files
-5. Files are represented by copy-on-write CTZ linked-lists
-6. The CTZ linked-lists support appending in O(1) and reading in O(n logn)
-7. Blocks are allocated by scanning the filesystem for used blocks in a
+5. Files are represented by copy-on-write CTZ skip-lists which support O(1)
+   append and O(n logn) reading
+6. Blocks are allocated by scanning the filesystem for used blocks in a
    fixed-size lookahead region that is stored in a bit-vector
-8. To facilitate scanning the filesystem, all directories are part of a
+7. To facilitate scanning the filesystem, all directories are part of a
    linked-list that is threaded through the entire filesystem
-9. If a block develops an error, the littlefs allocates a new block, and
+8. If a block develops an error, the littlefs allocates a new block, and
    moves the data and references of the old block to the new.
-10. Any case where an atomic operation is not possible, it is taken care of
+9. In any case where an atomic operation is not possible, mistakes are resolved
    by a deorphan step that occurs on the first allocation after boot
 
 That's the little filesystem. Thanks for reading!

+ 21 - 12
SPEC.md

@@ -121,13 +121,18 @@ Here's the layout of entries on disk:
 **Entry type** - Type of the entry, currently this is limited to the following:
 - 0x11 - file entry
 - 0x22 - directory entry
-- 0xe2 - superblock entry
+- 0x2e - superblock entry
 
 
-Additionally, the type is broken into two 4 bit nibbles, with the lower nibble
+Additionally, the type is broken into two 4 bit nibbles, with the upper nibble
 specifying the type's data structure used when scanning the filesystem. The
-upper nibble clarifies the type further when multiple entries share the same
+lower nibble clarifies the type further when multiple entries share the same
 data structure.
 
+The highest bit is reserved for marking the entry as "moved". If an entry
+is marked as "moved", the entry may also exist somewhere else in the
+filesystem. If the entry exists elsewhere, this entry must be treated as
+though it does not exist.
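
A small Python sketch of how the 8 bit entry type might be picked apart, based only on the description above. It assumes the "moved" marking is bit 7 of the type byte (mask 0x80) and that it is masked off before the nibbles are read; the field names in the result are hypothetical.

```python
MOVED_BIT = 0x80  # assumption: "highest bit" of the 8 bit entry type


def decode_entry_type(raw):
    """Split a raw entry type byte into its documented parts."""
    if not 0 <= raw <= 0xFF:
        raise ValueError("entry type must fit in 8 bits")
    base = raw & ~MOVED_BIT & 0xFF
    return {
        "moved": bool(raw & MOVED_BIT),  # entry may also exist elsewhere
        "type": base,                    # e.g. 0x11, 0x22 or 0x2e
        "struct": base >> 4,             # upper nibble: data structure
        "subtype": base & 0x0F,          # lower nibble: further clarification
    }


if __name__ == "__main__":
    print(decode_entry_type(0x2E))
    print(decode_entry_type(0x22 | MOVED_BIT))
```

Under this reading, a superblock entry (0x2e) shares its upper nibble with a directory entry (0x22), and a directory entry with the marking set would appear on disk as 0xa2.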
+
 **Entry length** - Length in bytes of the entry-specific data. This does
 not include the entry type size, attributes, or name. The full size in bytes
 of the entry is 4 + entry length + attribute length + name length.
@@ -175,7 +180,7 @@ Here's the layout of the superblock entry:
 
 
 | offset | size                   | description                            |
 |--------|------------------------|----------------------------------------|
-| 0x00   | 8 bits                 | entry type (0xe2 for superblock entry) |
+| 0x00   | 8 bits                 | entry type (0x2e for superblock entry) |
 | 0x01   | 8 bits                 | entry length (20 bytes)                |
 | 0x02   | 8 bits                 | attribute length                       |
 | 0x03   | 8 bits                 | name length (8 bytes)                  |
@@ -208,7 +213,7 @@ Here's an example of a complete superblock:
 (32 bits) revision count   = 3                    (0x00000003)
 (32 bits) dir size         = 52 bytes, end of dir (0x00000034)
 (64 bits) tail pointer     = 3, 2                 (0x00000003, 0x00000002)
-(8 bits)  entry type       = superblock           (0xe2)
+(8 bits)  entry type       = superblock           (0x2e)
 (8 bits)  entry length     = 20 bytes             (0x14)
 (8 bits)  attribute length = 0 bytes              (0x00)
 (8 bits)  name length      = 8 bytes              (0x08)
@@ -220,7 +225,7 @@ Here's an example of a complete superblock:
 (32 bits) crc              = 0xc50b74fa
 
 00000000: 03 00 00 00 34 00 00 00 03 00 00 00 02 00 00 00  ....4...........
-00000010: e2 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00  ................
+00000010: 2e 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00  ................
 00000020: 00 04 00 00 01 00 01 00 6c 69 74 74 6c 65 66 73  ........littlefs
 00000030: fa 74 0b c5                                      .t..
 ```
@@ -262,15 +267,19 @@ Here's an example of a directory entry:
 
 
 Files are stored in entries with a pointer to the head of the file and the
 size of the file. This is enough information to determine the state of the
-CTZ linked-list that is being referenced.
+CTZ skip-list that is being referenced.
 
 
 How files are actually stored on disk is a bit complicated. The full
-explanation of CTZ linked-lists can be found in [DESIGN.md](DESIGN.md#ctz-linked-lists).
+explanation of CTZ skip-lists can be found in [DESIGN.md](DESIGN.md#ctz-skip-lists).
 
 
 A terribly quick summary: For every nth block where n is divisible by 2^x,
-the block contains a pointer that points x blocks towards the beginning of the
-file. These pointers are stored in order of x in each block of the file
-immediately before the data in the block.
+the block contains a pointer to block n-2^x. These pointers are stored in
+increasing order of x in each block of the file preceding the data in the
+block.
+
+The maximum number of pointers in a block is bounded by the base-2 logarithm
+of the maximum file size divided by the block size. With 32 bits for file
+size, this results in a minimum block size of 104 bytes.
 
 
 Here's the layout of a file entry:
 
@@ -286,7 +295,7 @@ Here's the layout of a file entry:
 | 0xc+a  | name length bytes      | directory name                     |
 
 **File head** - Pointer to the block that is the head of the file's CTZ
-linked-list.
+skip-list.
 
 
 **File size** - Size of file in bytes.