CS 561 · Fall 2023 · Boston University
Cloud Computing — PageRank on a Mini-Internet
Generated a synthetic internet of cross-linked HTML pages, parsed the link graph, and computed PageRank to convergence, both locally and on GCP, to see where cloud compute actually pays off.
● What I built
- Page generator emits N HTML pages with random link distributions.
- Adjacency matrix built from regex-extracted hrefs across the corpus.
- Iterative PageRank with damping factor d = 0.85, convergence at ‖Pₜ − Pₜ₋₁‖ < ε.
- Local (numba JIT) vs. GCP execution — quantified the breakeven crossover.
PageRank is a deceptively simple algorithm: rank web pages by how often a random surfer would visit them. The elegance hides a lot of practical complexity—especially when your graph has 10,000 nodes and you're deciding whether to compute locally or push to the cloud.
In CS 561 (Cloud Computing), I implemented PageRank on a synthetically generated web graph, exploring both the mathematics of convergence and the engineering trade-off between local hardware and cloud elasticity. The assignment forced you to actually think about when cloud compute makes sense (spoiler: not always).
The algorithm
PageRank models web surfing as a random walk. At each step, a surfer either:
- Follows a random outgoing link from the current page (with probability d, the damping factor)
- Jumps to any page uniformly at random (with probability 1 − d, usually 0.15)
This gives the recurrence:
PR(A) = (1 - d) / N + d * sum(PR(T) / C(T) for T → A)
where T → A means T links to A, C(T) is T's outgoing link count, and N is the total number of pages.
The algorithm repeats this update until convergence—when the change in the sum of all PageRanks falls below epsilon (typically 0.5%).
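A minimal sketch of that recurrence on a toy graph (the function and the three-page graph are mine, not the assignment's; convergence here uses the per-page norm criterion from the summary above rather than the sum-change check, and the project's optimized version comes later):

import numpy as np

def pagerank_naive(links: dict[int, list[int]], d: float = 0.85, eps: float = 0.005) -> np.ndarray:
    n = len(links)
    out_degree = {page: len(targets) for page, targets in links.items()}
    pr = np.full(n, 1.0 / n)  # start uniform
    while True:
        old = pr.copy()
        for a in range(n):
            incoming = [t for t, targets in links.items() if a in targets]
            pr[a] = (1 - d) / n + d * sum(old[t] / out_degree[t] for t in incoming)
        if np.abs(pr - old).max() <= eps:  # ‖Pₜ − Pₜ₋₁‖ below epsilon
            return pr

# Toy graph: 0 -> 1 -> 2 -> 0, a cycle, so all three ranks converge to 1/3.
print(pagerank_naive({0: [1], 1: [2], 2: [0]}))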
Parsing HTML and building the graph
The first practical challenge: extracting the link graph from generated HTML files. Each file i.html contained <a href="j.html"> tags. A regex extract-and-parse pipeline with concurrent file I/O was essential to avoid an I/O bottleneck:
import re

def extract_numbers_from_html(html_file_path: str, pattern: str = r'HREF="(\d+)\.html"') -> list[int]:
    # Return the page numbers this HTML file links to, by regex over its href targets.
    with open(html_file_path, "r", encoding="utf-8") as f:
        html_content = f.read()
    matches = re.findall(pattern, html_content)
    return [int(match) for match in matches]
With 10,000 HTML files, serial parsing was unacceptable. The solution: ThreadPoolExecutor with workers pegged at cpu_count() * 5, extracting in parallel while a progress bar tracked throughput:
links_dict: dict[int, list[int]] = {}
with ThreadPoolExecutor(max_workers=os.cpu_count() * 5) as executor:
    futures = {executor.submit(extract_numbers_from_html, blob.name): blob.name for blob in blobs}
    for future in tqdm(as_completed(futures), total=len(futures)):
        file_path = futures[future]  # map each finished future back to its source file
        links_dict[int(file_path.split("/")[-1].split(".")[0])] = future.result()
This was the hidden work: parsing scales linearly, but I/O doesn't parallelize by default.
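For the local run, blobs is effectively just a list of file paths. For the GCP run, the iterable comes from the Cloud Storage client, roughly like this (a sketch using the google-cloud-storage library; the bucket name is a placeholder):

from google.cloud import storage

# List every generated page object in the (hypothetical) bucket.
client = storage.Client()
bucket = client.bucket("cs561-pagerank-pages")
blobs = list(bucket.list_blobs())

# In the cloud path the page content is read from the blob itself rather than from disk:
# html_content = blob.download_as_text()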
The adjacency matrix
Once you have the link graph as a dictionary links_dict[page_i] = [list of pages i links to], construct an N × N adjacency matrix:
def construct_adjacency_mat(links_dict: dict[int, list[int]]) -> np.ndarray:
    adj_mat = np.zeros((MAX_NUM_FILES, MAX_NUM_FILES))
    for key, value in links_dict.items():
        for val in value:
            adj_mat[key, val] = 1  # page `key` links to page `val`
    return adj_mat
Filling the matrix is O(E) in the number of edges, but storing it densely costs O(N²) memory no matter how sparse the graph is. The payoff is that the precomputed adjacency matrix enables the optimization that matters most.
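For contrast, a sparse construction keeps memory proportional to E. This isn't what the assignment code did; it's a minimal sketch with scipy.sparse over the same links_dict:

import numpy as np
from scipy.sparse import csr_matrix

def construct_sparse_adjacency(links_dict: dict[int, list[int]], n: int) -> csr_matrix:
    # One (row, col) pair per edge; both comprehensions walk the dict in the same order.
    rows = [src for src, targets in links_dict.items() for _ in targets]
    cols = [dst for targets in links_dict.values() for dst in targets]
    data = np.ones(len(rows))
    return csr_matrix((data, (rows, cols)), shape=(n, n))

With ~10 outgoing links per page, a 10K-page graph stores roughly 100K nonzeros instead of 10⁸ dense cells.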
Convergence with smart precomputation
The naive PageRank loop is O(N² * iterations), which is brutal. The optimizations:
- Precompute incoming links per page. For each column i, find which rows have a 1 (those are incoming links):
elems_dict = {}
for i in range(MAX_NUM_FILES):
    elems_dict[i] = np.where(adj_mat[:, i] == 1)[0]  # rows with a 1 in column i are the pages linking to i
- Precompute outgoing counts. For each page, sum its row to get the outgoing degree:
sums_dict = {}
for elem in elems_dict:
    sums_dict[elem] = adj_mat[elem].sum()  # outgoing degree of page `elem`
- The inner loop then processes only incoming edges:
for i in range(MAX_NUM_FILES):
    sum_rows = 0.0
    for elem in elems_dict[i]:  # Only incoming links
        sum_rows += page_rank_mat[elem] / sums_dict[elem]
    page_rank_mat[i] = 0.15 + (0.85 * sum_rows)  # (1 - d) + d * sum, with d = 0.85
This drops the inner loop from O(N) per page to O(degree), which matters for sparse graphs. The whole pass is now O(E) instead of O(N²).
The outer loop runs until convergence:
while True:
    old_mat = page_rank_mat.copy()
    # ... update all pages ...
    percent_change = abs(page_rank_mat.sum() - old_mat.sum()) / old_mat.sum()
    if percent_change <= epsilon:
        return page_rank_mat
Typically 20–40 iterations on a well-formed web graph.
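Assembled into one routine, the optimized version looks roughly like this. A sketch only: the function name and the all-ones initialization are mine, and it assumes the dense adj_mat built above.

def compute_pagerank(adj_mat: np.ndarray, epsilon: float = 0.005, d: float = 0.85) -> np.ndarray:
    n = adj_mat.shape[0]
    # Precompute once: incoming-link lists and outgoing degrees.
    elems_dict = {i: np.where(adj_mat[:, i] == 1)[0] for i in range(n)}
    sums_dict = {i: adj_mat[i].sum() for i in range(n)}
    page_rank_mat = np.ones(n)  # initialization is an assumption; the fragments above don't show it
    while True:
        old_mat = page_rank_mat.copy()
        for i in range(n):
            sum_rows = sum(page_rank_mat[elem] / sums_dict[elem] for elem in elems_dict[i])
            page_rank_mat[i] = (1 - d) + d * sum_rows
        percent_change = abs(page_rank_mat.sum() - old_mat.sum()) / old_mat.sum()
        if percent_change <= epsilon:
            return page_rank_mat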
Local vs cloud: the trade-off
The assignment had you implement two paths:
Local: Load the HTML files from disk, parse, compute. This is fast if you have enough RAM and your SSD is responsive. With good thread pooling, parsing 10K files took ~30s on modern hardware.
Cloud: Upload the HTML to Google Cloud Storage, use Compute Engine or run in a container. Cloud wins when:
- You don't want to maintain local infrastructure
- You need to scale to 100K+ pages
- You're integrating into a larger pipeline
But cloud loses on latency and cost:
- Network round-trips add up fast
- Reading from Cloud Storage inside GCP is cheap; egress back to your machine is what you pay for
- A 2-minute local run costs $0; a 2-minute GCP run costs a few cents
I measured both in the assignment. Local was ~3x faster for the 10K case, but cloud scaled predictably. The lesson: benchmark your graph size, don't assume.
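The measurement itself doesn't need anything fancy; a wall-clock harness along these lines is enough (a sketch; run_local and run_gcp are stand-ins for the two pipelines, not functions from the assignment):

import time

def timed(label: str, fn, *args):
    # Coarse wall-clock timing of one full pipeline run.
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# ranks_local = timed("local: parse + pagerank", run_local, "pages/")
# ranks_cloud = timed("gcp: upload + parse + pagerank", run_gcp, "gs://bucket/pages")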
What I learned
On the algorithm:
- The damping factor is crucial. Set it too low (d = 0.5) and low-indegree pages get inflated scores. Too high (d = 0.99) and computation crawls. d = 0.85 is a sweet spot because it balances random jumps with following links.
- Convergence is usually fast. Despite N² potential, the sum-change criterion typically exits in 20–50 iterations. This is why PageRank is practical.
- Sparse > dense. A 10K × 10K dense matrix is 800 MB. The actual web is sparse; the average page has ~10 outgoing links. Exploit it.
On systems:
- Precomputation wins big. The difference between an "obvious" O(N²) loop and a careful O(E) loop is 10–100x in real benchmarks.
- Profiling before optimizing saves days. The threaded I/O was the bottleneck, not the PageRank math. I almost optimized the wrong thing.
- Local + cloud is a spectrum. It's not binary. You can parse locally, push to cloud for distributed aggregation, pull results back. The assignment showed the endpoints; real systems live in the middle.
References
The PageRank algorithm is from Brin & Page (1998): "The Anatomy of a Large-Scale Hypertextual Web Search Engine." MapReduce (Dean & Ghemawat 2004) is the standard approach for running PageRank at scale across many machines. Dremel (Melnik et al. 2010) shows how columnar storage cuts I/O for interactive analysis of web-scale datasets.
The code is at github.com/ArkashJ/CloudComputing if you want to see the full pipeline: link extraction, matrix construction, timing harness, and GCP integration.
What I'd do differently
- Use sparse matrices from the start. scipy.sparse.csr_matrix would have saved memory and made the precomputation cleaner (see the sketch after this list). I didn't know about it at the time.
- Stream, don't batch. The current approach loads all 10K links into memory. For 100K+ pages, a streaming aggregator (like a Kafka topic) is better.
- Benchmark network latency explicitly. I measured total wall time but not the GCP upload/download. A breakdown would have clarified the cloud trade-off better.
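To make the sparse-matrix point concrete: with a csr_matrix adjacency (as sketched earlier), the precomputation collapses to a few lines. A sketch, with precompute_sparse being an illustrative name:

import numpy as np
from scipy.sparse import csr_matrix

def precompute_sparse(adj: csr_matrix):
    # Outgoing degree of every page: row sums of the adjacency matrix.
    out_degree = np.asarray(adj.sum(axis=1)).ravel()
    # Incoming links: column i of the CSC form lists every page that links to i.
    csc = adj.tocsc()
    incoming = {i: csc.indices[csc.indptr[i]:csc.indptr[i + 1]] for i in range(adj.shape[0])}
    return incoming, out_degree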
● Papers
- → Dremel: Interactive Analysis of Web-Scale Datasets · Melnik et al. (2010)
- → Kafka: A Distributed Messaging System for Log Processing · Kreps, Narkhede, Rao (2011)
● Code
Note: Code excerpts illustrate concepts. Full homework solutions are not published.