15-418 · Parallel Computer Architecture

LZ Compression

Jack Woodson (jrwoodso) & Alex Li (alexli3)

Summary

We will investigate how to parallelize the serial compression algorithms LZ77 and LZW. As we parallelize, we will also measure how much performance (compression ratio) we can retain while speeding up the computation. We will target the GHC machines, then extend to the PSC machines to test scalability at higher core counts.

  • 8 GHC cores
  • 128 PSC cores
  • 4–6× target speedup
  • <15% ratio degradation

LZ Compression

LZ-family compression algorithms form the foundation of widely-used formats like ZIP, PNG, and GIF. Both algorithms exploit repetition in data: rather than storing repeated sequences in full, they encode references to prior occurrences, achieving significant size reduction.

LZ compression comes in several variants, each trading off speed, compression ratio, and memory consumption differently. The two algorithms whose performance we will improve are LZ77 and LZW.

LZ77

Uses a sliding-window technique: the current position is compared against previously processed data within the window, and repeated sequences are encoded as references back into that window.
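As a concrete illustration, here is a minimal sketch of the sliding-window idea (hypothetical names; a greedy longest-match search over a bounded window, with no entropy coding on the emitted tokens):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One LZ77 token: a backward (offset, length) reference plus the next literal.
struct Token { std::size_t offset, length; char next; };

// Greedy longest-match search over a bounded sliding window.
std::vector<Token> lz77_compress(const std::string& in, std::size_t window = 4096) {
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::size_t best_off = 0, best_len = 0;
        std::size_t start = pos > window ? pos - window : 0;
        for (std::size_t cand = start; cand < pos; ++cand) {
            std::size_t len = 0;
            // A match may extend past pos (overlapping copies are allowed).
            while (pos + len + 1 < in.size() && in[cand + len] == in[pos + len])
                ++len;
            if (len > best_len) { best_len = len; best_off = pos - cand; }
        }
        char next = in[pos + best_len];  // literal that terminates the match
        out.push_back({best_off, best_len, next});
        pos += best_len + 1;
    }
    return out;
}

std::string lz77_decompress(const std::vector<Token>& toks) {
    std::string out;
    for (const Token& t : toks) {
        std::size_t start = out.size() - t.offset;
        for (std::size_t i = 0; i < t.length; ++i)
            out += out[start + i];  // byte-by-byte so overlapping copies work
        out += t.next;
    }
    return out;
}
```

A real implementation would replace the linear candidate scan with a hash-chain or suffix structure; this sketch only shows the window semantics.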

LZW

Reads a stream of symbols and replaces recurring sequences with shorter codes. Codes and the sequences they stand for are stored in a dictionary for cheap lookup; when a sequence appears again, it is substituted with its code.
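A minimal sketch of this dictionary scheme (hypothetical names; codes grow unbounded here, whereas real implementations cap the code width and reset or freeze the dictionary):

```cpp
#include <map>
#include <string>
#include <vector>

// Encode: emit the code for the longest known prefix, then add that prefix
// plus the next symbol as a new dictionary entry.
std::vector<int> lzw_encode(const std::string& in) {
    std::map<std::string, int> dict;
    for (int c = 0; c < 256; ++c) dict[std::string(1, (char)c)] = c;
    std::vector<int> out;
    std::string w;
    for (char c : in) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;                        // keep extending the current phrase
        } else {
            out.push_back(dict[w]);
            int code = (int)dict.size();   // next free code
            dict[wc] = code;
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}

// Decode rebuilds the same dictionary one step behind the encoder.
std::string lzw_decode(const std::vector<int>& codes) {
    std::map<int, std::string> dict;
    for (int c = 0; c < 256; ++c) dict[c] = std::string(1, (char)c);
    std::string out, w;
    for (int code : codes) {
        // Special case: the encoder can emit a code it defined this very step.
        std::string entry = dict.count(code) ? dict[code] : w + w[0];
        out += entry;
        if (!w.empty()) {
            int next = (int)dict.size();
            dict[next] = w + entry[0];
        }
        w = entry;
    }
    return out;
}
```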

Challenge

These algorithms rely heavily on finding repeated codes or symbols early in the compression process so that later repetitions can be shortened. With multiple threads this becomes much harder: either each thread builds its own codes for the same symbols, inflating memory use and hurting the compression ratio, or threads wait on shared state, creating synchronization bottlenecks.

Our project will walk this fine line: optimizing our code for multiple threads while maintaining a good compression ratio.

  • Memory access characteristics
  • Communication-to-computation ratio
  • Divergent execution and load balancing
  • Compression ratio performance constraints

Goals and Deliverables

Baseline Success
  • Working sequential LZ77 and LZW with competitive compression ratios on standard benchmarks
  • Parallel LZ77 with at least 4–6× speedup on 8 GHC cores
  • Less than 15% degradation in compression ratio
  • Extend to LZW and begin PSC scaling exploration
  • Analysis of what limits scaling (memory, synchronization, load imbalance)
Hope to Achieve
  • The baseline plan is already aggressive and largely covers what we hope to achieve
  • Alternative pre-initialization strategies
  • Speculative approaches to further improve speed and performance
Fallback
  • Focus only on LZ77 if work is significantly harder to implement or parallelize
  • Refine parallel LZ77 implementation
  • Deeper analysis on LZ77 with surface-level observations on LZW

Platform Choice

Our work will primarily be done on the GHC machines, which are readily available to us as students and let us experiment with different thread counts on these compression algorithms. Since we are aiming for the most parallel-efficient algorithms possible, we also plan to use the PSC machines to test how our implementations run at higher thread counts (16–128).

A multicore CPU is also a better fit for this workload than a GPU. LZ compression involves irregular, backward-looking memory accesses (LZ77's sliding window) and dynamic dictionary lookups (LZW), both of which map poorly to GPU architectures that require coalesced memory access patterns to perform well.

GHC Machines

Up to 8 cores. Primary development platform, readily available to us as students for experimenting with different thread counts.

PSC Machines

Up to 128 cores. Used to demonstrate how algorithms scale with significantly more threads and increased computing power.

Resources

We plan to write this code from scratch while making use of helpful references. No additional resources beyond GHC and PSC should be needed. We anticipate potential issues with scaling on the PSC that will not be noticed on the GHC machines at 8 cores.

Schedule

Mar 30 – Apr 4

Working sequential LZ77 and LZW implementations that yield competitive compression ratios.

A solid sequential baseline is important for measuring the speedup of our parallel efforts.

Additionally, build a good test set comprising files of varied content (highly repetitive through non-repetitive) and varied sizes.

Apr 5 – Apr 11

A working parallel LZ77 implementation that achieves relatively high speedup at lower thread counts and is heading in the right direction on workload balance and scaling at higher thread counts.

Since LZ77 and LZW share similarities, we will focus on LZ77 first; many of the ideas for increased performance should translate to LZW.

Stay on top of the milestone requirements due April 14.

Apr 12 – Apr 18

Have an LZ77 compression algorithm with good scalability and a good compression ratio, plus some small improvements.

Keep website up to date with milestone and changes made.

Apr 19 – Apr 30

Have an LZW compression algorithm with good scalability and a good compression ratio, plus some small improvements. Compare the two algorithms on compression ratio, speedup from low to high thread counts, and memory usage.

Additionally, meet all other requirements for website.

15-418 · Milestone Report 1

Progress Update

Jack Woodson (jrwoodso) & Alex Li (alexli3) · April 2025

Work Done So Far

Our goal is to compare LZW and LZ77 across compression ratio, runtime, and speedup at higher core counts. Both serial and parallel implementations are working.

LZ77

Parallel implementation using OpenMP chunking is working. Experimenting with window sizes across file types and thread counts. Single-core compression ratio is ~0.4× original size; the ratio degrades slightly with more threads. Speedup is near-linear on the GHC machines.
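A sketch of the chunking pattern (hypothetical names; the per-chunk kernel is an identity stand-in so the sketch runs on its own, where the real code calls the serial LZ77 routine). Because each thread's matches are confined to its own chunk, the compression ratio dips slightly as the chunk count grows:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the serial per-chunk compressor (LZ77 in our project); the
// identity body is a placeholder only, so this sketch is self-contained.
std::string compress_chunk(const std::string& chunk) { return chunk; }

// Split the input into nchunks slices, compress each independently in
// parallel, then concatenate the results in order. Matches cannot cross
// chunk boundaries, which is the source of the small ratio degradation.
std::string parallel_compress(const std::string& in, int nchunks) {
    std::vector<std::string> parts(nchunks);
    std::size_t sz = (in.size() + nchunks - 1) / nchunks;  // ceiling division
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nchunks; ++i) {
        std::size_t begin = static_cast<std::size_t>(i) * sz;
        if (begin < in.size())
            parts[i] = compress_chunk(in.substr(begin, sz));
    }
    std::string out;
    for (const auto& p : parts) out += p;  // deterministic, thread-count independent
    return out;
}
```

Writing into a per-chunk slot and concatenating afterward keeps the output deterministic regardless of thread scheduling.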

LZW

Baseline serial implementation matches Unix compress in both speed and ratio. Parallel version uses OpenMP over file sections. On 1 GB Wikipedia data, achieves ~5× speedup with 8 GHC cores; overhead dominates on small files.

Goals and Deliverables

We have made good progress and are confident we can achieve the goals from our proposal. Working serial and parallel implementations of both algorithms exist.

  • ~6× speedup @ 8 GHC cores
  • ~32× speedup @ 64 PSC cores
  • 0.430 LZW compression ratio
  • 0.449 Unix compress ratio
LZ77 Goal

Parallelize enough to compare and compete with other LZ algorithms while maintaining the simplicity of the sliding-window approach.

LZW Goal

Move beyond the naive blocked approach. Improve performance on small files and fix compression-ratio degradation on highly repetitive inputs like novels.

Preliminary Results

LZ77 — Canterbury Corpus

Runtime scales well with thread count. Compression ratio stays approximately constant (~0.45) across 1, 2, 4, and 8 threads, with only slight degradation. Raw runtime still lags behind other LZ algorithms — a key focus for the next two weeks.

Fig 1 — LZ77 runtime (s) vs. thread count on Canterbury Corpus
Fig 2 — LZ77 compression ratio vs. thread count on Canterbury Corpus

LZW — Enwik9 (1 GB)

Naive chunked approach scales well up to 8 cores. Some deterioration from 4→8 cores; expected to continue at 16, 32, and 64. Our compression ratio of 0.430 slightly beats Unix compress at 0.449.

Fig 3 — Unix compress vs. parallel LZW runtime (s) on Enwik9

Updated Schedule

Apr 13 – Apr 16
  • Alex: Explore a parallel dictionary for LZW beyond the naive blocked approach.
  • Jack: Experiment with CUDA threads for string matching in LZ77.
Apr 17 – Apr 19
  • Alex: Test LZW on small files and high core counts; improve scaling.
  • Jack: Focus on better compression ratios after achieving better runtime with CUDA LZ77.
Apr 20 – Apr 23
  • Both: Tune both algorithms. Compare LZ77 vs LZW, push each toward its extreme.
Apr 24 – Apr 26
  • Jack: Finalize LZ77 code.
  • Alex: Finalize LZW code.
  • Both: Ensure the codebase is well organized and documented.
Apr 27 – Apr 30
  • Both: Finalize report, complete testing, generate graphs.
  • Update website with findings.
  • Create and rehearse presentation poster.

Presentation Plan

We plan to show comparison graphs of runtime and compression ratio for serial and parallel implementations at nprocs ∈ {1, 2, 4, 8}, with higher core counts if PSC time is available. For LZW we will also include Unix compress as a reference. All results will be presented graphically.

Runtime Graphs

Runtime vs. thread count for LZ77 and LZW. Unix compress shown as a flat reference line for LZW.

Compression Ratio Graphs

Ratio vs. thread count showing how much quality is preserved as we scale up parallelism.

Issues and Concerns

No specific problems meeting deadlines so far. The main open question is whether the final results will be sufficiently differentiated between the two algorithms to tell a compelling story.

  • Poor scaling on small files where parallelism overhead dominates.
  • Insufficient work per core at very high thread counts (64+ cores on PSC).
  • Fast concurrent access to a shared LZW dictionary — likely the hardest remaining implementation challenge and biggest bottleneck for hitting scaling targets.