What is the largest size a gzip file (say 10 kB, for the sake of an example) can be decompressed to?
asked May 9 '10 at 11:47 – Zombies
Accepted answer (46 votes):
It very much depends on the data being compressed. A quick test with a 1 GB file full of zeros gives a compressed size of ~120 kB, so your 10 kB file could potentially expand into ~85 MB.
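A quick way to run the same sort of test yourself (a sketch assuming a GNU/Linux shell, where dd accepts bs=1M):

    dd if=/dev/zero bs=1M count=1024 2>/dev/null | gzip -9 | wc -c
    # pipes 1 GB of zeros through gzip and prints the compressed byte count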
If the data has low redundancy to start with, for instance if the archive contains image files in a format that is compressed natively (gif, jpg, png, ...), then gzip may add no further compression at all. For binary files like program executables you might see up to 2:1 compression; for plain text, HTML or other markup, 3:1 or 4:1 or more is not unlikely. You might see 10:1 in some cases, but the ~8700:1 seen with a file filled with a single symbol is something you are not going to see outside similarly artificial circumstances.
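To see where your own data falls in that range, you can compare raw and compressed byte counts directly (a sketch; somefile is a stand-in for whatever file you want to test):

    wc -c < somefile                 # uncompressed size in bytes
    gzip -9 -c somefile | wc -c     # compressed size; divide the two for the ratio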
You can check how much data would result from unpacking a gzip file, without actually writing its uncompressed content to disk, with gunzip -c file.gz | wc --bytes. This uncompresses the file but does not store the result, instead passing it to wc, which counts the bytes as they pass and then discards them. If the compressed content is a tar file containing many small files, you might find that noticeably more disk space is required to unpack the full archive, but in most circumstances the count returned from piping gunzip output through wc is going to be as accurate as you need.
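If the uncompressed data is under 4 GB, gzip's own listing mode gives a similar figure without streaming the whole file; the size field is stored in the gzip trailer modulo 2^32, so it wraps around for larger data:

    gzip -l file.gz
    # columns: compressed size, uncompressed size, ratio, uncompressed name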
answered May 9 '10 at 13:11 – David Spillett
Nice. Limiting case, discussion of common cases and a "how-to" on answering the question at hand. A model for a good answer. – dmckee May 9 '10 at 15:09
I've seen HTML expand to 10x (of course x3 and x4 were the most common!)... perhaps a lot of redundant data for those ones that were exploding +8x. I think the page in question that was doing that was a php info page. – Zombies May 10 '10 at 12:10
Repetitive markup, as seen in the output of phpinfo(), compresses very well. The technical information in that output contains more direct repetition than the average chunk of natural language would too, and the alphabet distribution is probably less smooth, which could help the Huffman stage get better results. – David Spillett May 10 '10 at 12:55
This answer doesn't account for intentionally malicious compressed data. One can craft a malicious zip file around 10 kB that can expand to a bit over 4 GB. – David Schwartz Jan 2 '13 at 2:36
Zip bombs of that scale rely on nested archives, though, so as a human unpacking the file you would notice something odd before long. They can be used as an effective DoS attack against automated scanners (on mail services and so forth), however. – David Spillett Jan 2 '13 at 11:47
Answer (7 votes):
Usually you don't get more than 95% compression (so 10 kB of gzipped data would decompress to ~200 kB), but there are specially crafted files that expand exponentially. Look for 42.zip; it decompresses to a few petabytes of (meaningless) data.
answered May 9 '10 at 12:04 – liori
Wikipedia says 42.zip contains "five layers of nested zip files in sets of 16", so that is not a valid example for plain decompression (only for recursive decompression). – Tgr Jul 10 '13 at 13:59
Indeed, 42.zip is specifically a danger to tools that automatically scan zip files recursively, for example virus scanners. – thomasrutter Feb 5 '14 at 0:42
Answer (3 votes):
The compression ratio of any compression algorithm will be a function of the data being compressed (besides the length of that data).
Here is an analysis at MaximumCompression. Look at one of the samples, such as the summary of the multiple file compression benchmark tests:

File type : Multiple file types (46 in total)
# of files to compress in this test : 510
Total File Size (bytes) : 316,355,757
Average File Size (bytes) : 620,305
Largest File (bytes) : 18,403,071
Smallest File (bytes) : 3,554
answered May 9 '10 at 12:03 – nik
Answer (2 votes):
A huge file containing only one symbol will compress very well.
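For instance (a sketch assuming GNU head and tr are available; the choice of 'a' is arbitrary):

    head -c 10000000 /dev/zero | tr '\0' 'a' | gzip -9 | wc -c
    # 10 MB of a single repeated symbol shrinks to roughly 10 kB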
answered May 9 '10 at 12:44 – geek
Answer (1 vote):
A 10 MB file of zeros compresses with gzip -9 to 10,217 bytes, so the maximum ratio looks to be around 1000:1.
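That measurement can be reproduced along these lines (a sketch assuming GNU dd; the file name zeros is just an example):

    dd if=/dev/zero of=zeros bs=1M count=10 2>/dev/null   # 10 MB of zeros
    gzip -9 zeros                                         # produces zeros.gz
    wc -c < zeros.gz                                      # roughly 10,217 bytes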
answered Apr 7 '13 at 13:12 – nikos