Squeeze

Problem The problem-definition.

If you've never archived files off-line (& don't suffer from OCD), then you'll probably never have found the requirement to find the subset of them which best fits the available space. Let's say you've got a collection of files, which you can no longer afford to store on your computer's HDD, or which you want to backup (as one tends to after suffering one's first catastrophic disk-failure). The backup-medium could be a CD, DVD, or flash-drive; it makes no difference, the available space is finite & of well-defined size. Adding the total space-requirement of the files, & dividing it by the space available on a DVD, tells you the minimum number of DVDs that will be required if no space is wasted, but when you try to group the files into subsets (each of which must fit within the space available on one DVD, without wasting any), you find it unexpectedly laborious.

Analysis Analysis of the problem.

An analogy of dubious validity.

If all the files were the same size, then the number of files in each subset would be known1, & the choice of files in any subset would only be relevant for reasons outside the domain of this problem, since this choice doesn't affect the space which is wasted. The fact that they're of different sizes, makes the choice of files in each subset, the determining factor in the amount of space wasted when copied to the off-line storage-medium.

Though one could gzip the files to reduce the required space, that merely changes the parameters of the problem, rather than side-stepping it. Since the size of any subset is independent of the order in which the files are listed, we're not interested in the permutations of files into subsets for archiving; just the combinations.

Demonstration of exponential growth.

This is one of a set of problems with O(cn) time-complexity. Such problems are computationally intractable since the time required to solve them explodes as the size of the input-data increases. Having a computer (or brain) which is c times as fast, only allows one to solve problems with just one more input datum, in a given time-period. The solution-time grows like the apocryphal grains of wheat on a chess-board; or in this case, the number of files to store.

This problematic time-complexity has a corresponding advantage; with so many file-combinations from which to select one's solution, the likelihood of a good fit can be very high, & with a favourable distribution of file-sizes relative to the available space, exact matches are not uncommon
 … you might just have to wait until the heat-death of the universe4 before it's found.

If one doesn't take care to dispose of sub-optimal solutions as soon as possible, retaining only those of significance, then one may stumble over the potential for O(2n) space-complexity, & the resulting insatiable appetite for RAM will quickly cripple your computer.

Solution The problem-solution.

The exponential time-complexity of the problem, makes an exhaustive search of possible solutions tractable only when the number of files is small. The algorithm coded in squeeze deals with the it by a variety of strategies:

Example

Take a directory containing thirty video files, in this case raw MPEG-2 transport-streams, arranged into five sub-directories:

Film-reel Wall Switch.
$ ls -l */*
-rw-r--r-- 1 al users 1370849940 2010-02-11 01:14 Deadwood_1/Deadwood_1_06_Plague.m2t
-rw-r--r-- 1 al users 1579066144 2010-02-18 01:14 Deadwood_1/Deadwood_1_07_BullockReturnsToCamp.m2t
-rw-r--r-- 1 al users 1628667500 2010-02-25 01:19 Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t
-rw-r--r-- 1 al users 1475753752 2010-03-04 01:24 Deadwood_1/Deadwood_1_09_NoOtherSonsOrDaughters.m2t
-rw-r--r-- 1 al users 1455273972 2010-03-11 01:19 Deadwood_1/Deadwood_1_10_MisterWu.m2t
-rw-r--r-- 1 al users 1536891916 2010-03-18 02:24 Deadwood_1/Deadwood_1_11_JewelsBootIsMadeForWalking.m2t
-rw-r--r-- 1 al users 1506550596 2010-03-25 01:24 Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t
-rw-r--r-- 1 al users 1316473008 2009-12-23 01:14 Deadwood_3/Deadwood_3_02_IAmNotTheFineManYouTakeMeFor.m2t
-rw-r--r-- 1 al users 1354711456 2009-12-24 01:14 Deadwood_3/Deadwood_3_03_TrueColors.m2t
-rw-r--r-- 1 al users 1317241928 2009-12-25 01:14 Deadwood_3/Deadwood_3_04_FullFaithAndCredit.m2t
-rw-r--r-- 1 al users 1339957592 2009-12-26 01:44 Deadwood_3/Deadwood_3_05_ATwoHeadedBeast.m2t
-rw-r--r-- 1 al users 1330129328 2009-12-27 01:44 Deadwood_3/Deadwood_3_06_ARichFind.m2t
-rw-r--r-- 1 al users 1134503672 2009-12-28 01:09 Deadwood_3/Deadwood_3_07_UnauthorizedCinnamon.m2t
-rw-r--r-- 1 al users 1279459944 2009-12-29 01:14 Deadwood_3/Deadwood_3_08_LeviathanSmiles.m2t
-rw-r--r-- 1 al users 1258818672 2009-12-30 01:09 Deadwood_3/Deadwood_3_09_AmateurNight.m2t
-rw-r--r-- 1 al users 1175977224 2009-12-31 01:09 Deadwood_3/Deadwood_3_10_AConstantThrob.m2t
-rw-r--r-- 1 al users 1364831120 2010-01-01 01:14 Deadwood_3/Deadwood_3_11_TheCatBirdSeat.m2t
-rw-r--r-- 1 al users 1260693784 2010-01-02 01:09 Deadwood_3/Deadwood_3_12_TellHimSomethingPretty.m2t
-rw-r--r-- 1 al users 3525287076 2010-06-06 00:54 Films/BatmanBegins.m2t
-rw-r--r-- 1 al users 1736242604 2010-06-15 00:59 Films/OfficeSpace.m2t
-rw-r--r-- 1 al users 2086362148 2010-07-18 23:24 Films/TheAmericanPresident.m2t
-rw-r--r-- 1 al users 2631969732 2010-07-27 00:39 Films/TheImportanceOfBeingEarnest.m2t
-rw-r--r-- 1 al users 2112377964 2010-07-10 00:19 Films/ThePaper.m2t
-rw-r--r-- 1 al users  914662864 2010-03-01 20:04 LeadBalloon_1/LeadBalloon_1_02_Wayne.m2t
-rw-r--r-- 1 al users  831265500 2010-03-15 20:04 LeadBalloon_1/LeadBalloon_1_03_5000Pounds.m2t
-rw-r--r-- 1 al users  795453568 2010-03-29 20:04 LeadBalloon_1/LeadBalloon_1_04_Allergic.m2t
-rw-r--r-- 1 al users  914268816 2010-04-19 03:04 LeadBalloon_1/LeadBalloon_1_05_Pistachio.m2t
-rw-r--r-- 1 al users 1113087652 2010-04-20 03:04 LeadBalloon_1/LeadBalloon_1_06_Fatty.m2t
-rw-r--r-- 1 al users  724333356 2010-07-26 03:04 LeadBalloon_2/LeadBalloon_2_06_Debacle.m2t
-rw-r--r-- 1 al users  605369024 2010-07-22 02:49 LeadBalloon_2/LeadBalloon_2_07_Rita.m2t
DVD

Suppose one wants to find the subset of them, which fits into the 4.7GB available on a DVD, with minimal waste. The total space-requirement is ten disks5, which represents the barrier beneath which no packing-solution can go, & which the optimal solution can only approach.

If one just filled DVDs, arbitrarily using the order in which the files are listed above, then the sum of the wasted tail-ends would be equivalent to nearly two disks.

DVD Video Files Bytes Wasted
1 Deadwood_1_06_Plague Deadwood_1_07_BullockReturnsToCamp Deadwood_1_08_SufferTheLittleChildren 121,416,416
2 Deadwood_1_09_NoOtherSonsOrDaughters Deadwood_1_10_MisterWu Deadwood_1_11_JewelsBootIsMadeForWalking 232,080,360
3 Deadwood_1_12_SoldUnderSin Deadwood_3_02_IAmNotTheFineManYouTakeMeFor Deadwood_3_03_TrueColors 522,264,940
4 Deadwood_3_04_FullFaithAndCredit Deadwood_3_05_ATwoHeadedBeast Deadwood_3_06_ARichFind 712,671,152
5 Deadwood_3_07_UnauthorizedCinnamon Deadwood_3_08_LeviathanSmiles Deadwood_3_09_AmateurNight 1,027,217,712
6 Deadwood_3_10_AConstantThrob Deadwood_3_11_TheCatBirdSeat Deadwood_3_12_TellHimSomethingPretty 898,497,872
7 BatmanBegins 1,174,712,924
8 OfficeSpace TheAmericanPresident 877,395,248
9 TheImportanceOfBeingEarnest 2,068,030,268
10 ThePaper LeadBalloon_1_02_Wayne LeadBalloon_1_03_5000Pounds LeadBalloon_1_04_Allergic 46,240,104
11 LeadBalloon_1_05_Pistachio LeadBalloon_1_06_Fatty LeadBalloon_2_06_Debacle LeadBalloon_2_07_Rita 1,342,941,152
Total Bytes Wasted 9,023,468,148

If you were expecting some bloated GUI, then prepare yourself for disappointment; "squeeze" is deliberately just a command-line utility, to better integrate with other such utilities, & because a graphical interface, besides looking good, adds no obvious value.

$ cabal install squeeze
Resolving dependencies…
Configuring squeeze-1.0.4.8…
Building squeeze-1.0.4.8…
Preprocessing library squeeze-1.0.4.8…
[1 of 5] Compiling Squeeze.Control.Concurrent.DivideAndConquer ( src-lib/Squeeze/Control/Concurrent/DivideAndConquer.hs, dist/build/Squeeze/Control/Concurrent/DivideAndConquer.o )
[2 of 5] Compiling Squeeze.Data.File ( src-lib/Squeeze/Data/File.hs, dist/build/Squeeze/Data/File.o )
[3 of 5] Compiling Squeeze.Data.FileCombination ( src-lib/Squeeze/Data/FileCombination.hs, dist/build/Squeeze/Data/FileCombination.o )
[4 of 5] Compiling Squeeze.Data.CommandOptions ( src-lib/Squeeze/Data/CommandOptions.hs, dist/build/Squeeze/Data/CommandOptions.o )
[5 of 5] Compiling Squeeze.Squeeze  ( src-lib/Squeeze/Squeeze.hs, dist/build/Squeeze/Squeeze.o )
In-place registering squeeze-1.0.4.8…
Preprocessing executable 'squeeze' for squeeze-1.0.4.8…
[1 of 3] Compiling Squeeze.Test.Performance ( src-exe/Squeeze/Test/Performance.hs, dist/build/squeeze/squeeze-tmp/Squeeze/Test/Performance.o )
[2 of 3] Compiling Paths_squeeze    ( dist/build/autogen/Paths_squeeze.hs, dist/build/squeeze/squeeze-tmp/Paths_squeeze.o )
[3 of 3] Compiling Main             ( src-exe/Main.hs, dist/build/squeeze/squeeze-tmp/Main.o )
Linking dist/build/squeeze/squeeze …
Installing library in /home/al/.cabal/lib/squeeze-1.0.4.8/ghc-7.6.3
Installing executable(s) in /home/al/.cabal/bin
Registering squeeze-1.0.4.8 …
Installed squeeze-1.0.4.8

$ ~/.cabal/bin/squeeze --maximumBytes=4700000000 */*.m2t
4672110012	["Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t","Deadwood_1/Deadwood_1_11_JewelsBootIsMadeForWalking.m2t","Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t"]
4683487396	["Deadwood_1/Deadwood_1_07_BullockReturnsToCamp.m2t","Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t","Deadwood_1/Deadwood_1_09_NoOtherSonsOrDaughters.m2t"]
4698067172	["Deadwood_1/Deadwood_1_10_MisterWu.m2t","Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t","Films/OfficeSpace.m2t"]
4699892276	["Deadwood_3/Deadwood_3_03_TrueColors.m2t","Deadwood_3/Deadwood_3_09_AmateurNight.m2t","Films/TheAmericanPresident.m2t"]

The fact that each of the solutions presented, contains exactly three files, is merely a coincidence promoted by the statistical likelihood of this type of solution, resulting from the average file-size relative to the available storage-space; other possibilities were tested, but none made the grade.

Grouping Related Files

The solution found so far is adequate, but one may want to keep the episodes of a series together in one archive. In their current file-format, this is unlikely to work; the files are too large to fit onto one DVD in any order. The file-size can be dramatically reduced, by snipping out the advert-breaks using dvbcut, & then using ffmpeg & x.264, to convert the video stream to the more efficient H.264-format; in this case wrapped with the original audio stream inside a MKV-container. For the purposes of this example, I'll just edit & re-encode a single directory of video files, corresponding to the episodes of one season of a series.

$ ls -l 'Deadwood_3/'
-rw-r--r-- 1 al users 315054028 2010-01-05 08:57 Deadwood_3_02_IAmNotTheFineManYouTakeMeFor.mkv
-rw-r--r-- 1 al users 287254198 2010-01-05 08:57 Deadwood_3_03_TrueColors.mkv
-rw-r--r-- 1 al users 288151700 2010-01-05 08:57 Deadwood_3_04_FullFaithAndCredit.mkv
-rw-r--r-- 1 al users 340534828 2010-01-05 08:57 Deadwood_3_05_ATwoHeadedBeast.mkv
-rw-r--r-- 1 al users 303210766 2010-01-05 08:57 Deadwood_3_06_ARichFind.mkv
-rw-r--r-- 1 al users 186886160 2010-01-05 08:58 Deadwood_3_07_UnauthorizedCinnamon.mkv
-rw-r--r-- 1 al users 263138794 2010-01-05 08:58 Deadwood_3_08_LeviathanSmiles.mkv
-rw-r--r-- 1 al users 324496358 2010-01-05 08:58 Deadwood_3_09_AmateurNight.mkv
-rw-r--r-- 1 al users 228694008 2010-01-05 08:58 Deadwood_3_10_AConstantThrob.mkv
-rw-r--r-- 1 al users 313909470 2010-01-05 08:58 Deadwood_3_11_TheCatBirdSeat.mkv
-rw-r--r-- 1 al users 285970588 2010-01-05 08:58 Deadwood_3_12_TellHimSomethingPretty.mkv

This directory is just 22%6 of the original size, which now represents 67% of the space available on a DVD, leaving space into which a subset of the remaining files can fit.

One can now re-run the test, requiring that any solution includes either this directory of associated files in its entirety, or doesn't include any member of it at all, to see if it can now be packed acceptably with the remaining uncompressed MPEG-2 files. This requirement is communicated to the executable "squeeze", merely by specifying the directory-name, rather than, as before, specifying the individual file-names it contains.

~/.cabal/bin $ time squeeze 'Deadwood_3/' */*.m2t	#NB: by specifying a directory, we've implicitly requested atomicity.
4674192814	["Deadwood_3/","Deadwood_1/Deadwood_1_11_JewelsBootIsMadeForWalking.m2t"]
4683487396	["Deadwood_1/Deadwood_1_07_BullockReturnsToCamp.m2t","Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t","Deadwood_1/Deadwood_1_09_NoOtherSonsOrDaughters.m2t"]
4698067172	["Deadwood_1/Deadwood_1_10_MisterWu.m2t","Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t","Films/OfficeSpace.m2t"]

This time, I didn't explicitly specify the available space, but rather relied on a hard-coded default-value.

An acceptable (though sub-optimal) solution was found, which includes the entire re-encoded directory, & which wasted only 0.55% of the available space. As a side-effect of the request to treat eleven files, as one atomic directory, the total number of file-combinations has been reduced by a factor of 210 (i.e. 1024), & in consequence, the time required for an exhaustive search reduced proportionately.

Performance

Execution-time vs. File-count. Execution-time vs. File-count vs. (Available Space / Mean File-size), with Standard-deviation = Mean.

My usual test-platform was used, to gather data for these graphs. Though from version 1.0.3.0, the application is multi-threaded, it was run using only a single thread. The measured execution-time is the time required to find the optimal solution; an acceptable solution is typically returned much more quickly.

The tests were performed by asking the application to artificially generate the file-data it normally reads from the filesystem, which permits testing with an arbitrary number of files, without requiring them to actually exist. The frequency-distribution observed using fishfood, for the size of Matroska video-files in a real file-system (mine), has a mean value of about a twelfth of the space on a DVD & a standard-deviation approximately equal to the mean file-size. "A Large-Scale Study of File-System Contents by John R. Douceur and William J. Bolosky" concludes that for arbitrary file-types, file-size matches a log-normal distribution.

The execution-time has approximately exponential O(1.7n) time-complexity with respect to the number of files. The exponential curve appears straight, since the ordinate is plotted on a logarithmic scale.

The 3-D plot has contours of constant execution-time, projected onto the base. These contours are separated by a constant time-ratio, rather than a constant absolute difference, in accordance with the logarithmic scale.

The second graph illustrates the effect on the execution-time, of the number of files (as before), & also the ratio of the total available space to the mean file-size.

Conclusions

Finding the optimal solution is only feasible with a small number of files or directories; though a sub-optimal solution is typically adequate & can be found much more quickly.

For a given mean file-size, the required execution-time is highly dependent on the file-size distribution. Though not depicted above, changing the assumption of a log-normal distribution of file-sizes to:

Footnotes

Footnotes
  1. (Space available ÷ file-size), rounded-down.
  2. Relatively.
  3. The 0-1 Knapsack-problem assigns a value to each of the items which can become part of the solution, whereas this problem values all files equally.
  4. 2038 on UNIX or GNU/Linux.
  5. 42,676,531,852 bytes, i.e. 9.08011316 DVDs, which naturally must be rounded-up.
  6. (3,137,300,898 ÷ 14,132,797,728) bytes.