Let us use our new knowledge to create a multi-threaded link checker. It should start at a webpage and check that links on the page are valid. It should recursively check other pages on the same domain and keep doing this until all pages have been validated.
For this, you will need an HTTP client such as `reqwest`. You will also
need a way to find links; we can use `scraper` for that. Finally, we'll need some
way of handling errors; we will use `thiserror`.
Create a new Cargo project and `reqwest` it as a dependency with:
```shell
cargo new link-checker
cd link-checker
cargo add --features blocking reqwest
cargo add scraper
cargo add thiserror
```
> If `cargo add` fails with `error: no such subcommand`, then please edit the
> `Cargo.toml` file by hand. Add the dependencies listed below.
The `cargo add` calls will update the `Cargo.toml` file to look like this:
```toml
[package]
name = "link-checker"
version = "0.1.0"
edition = "2024"
publish = false

[dependencies]
reqwest = { version = "0.13.1", features = ["blocking"] }
scraper = "0.25.0"
thiserror = "2.0.18"
```
You can now download the start page. Try with a small site such as
`https://www.google.org/`.
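
If you first want to confirm that fetching works at all before filling in the full skeleton below, a minimal blocking fetch could look like the following sketch. It is not part of the exercise code; the error handling and output are simplified:

```rust
use reqwest::blocking::Client;
use reqwest::Url;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let url = Url::parse("https://www.google.org").unwrap();

    // Fetch the start page and report where we ended up and how big it is.
    let response = client.get(url).send()?;
    let final_url = response.url().clone();
    let body = response.text()?;
    println!("Fetched {final_url} ({} bytes)", body.len());
    Ok(())
}
```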
Your `src/main.rs` file should look something like this:
```rust
# // Copyright 2024 Google LLC
# // SPDX-License-Identifier: Apache-2.0
#
{{#include link-checker.rs:setup}}

{{#include link-checker.rs:visit_page}}

fn main() {
    let client = Client::new();
    let start_url = Url::parse("https://www.google.org").unwrap();
    let crawl_command = CrawlCommand { url: start_url, extract_links: true };
    match visit_page(&client, &crawl_command) {
        Ok(links) => println!("Links: {links:#?}"),
        Err(err) => println!("Could not extract links: {err:#}"),
    }
}
```
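
The `{{#include}}` directives pull in the exercise's starter definitions for `CrawlCommand` and `visit_page`. If you cannot see the rendered includes, the sketch below gives one plausible shape for them; the error variants and the details of the link extraction are assumptions, not necessarily the actual starter code:

```rust
use reqwest::blocking::Client;
use reqwest::Url;
use scraper::{Html, Selector};
use thiserror::Error;

// Illustrative error type; `thiserror` derives `std::error::Error` for us.
#[derive(Error, Debug)]
enum Error {
    #[error("request error: {0}")]
    ReqwestError(#[from] reqwest::Error),
    #[error("bad http response: {0}")]
    BadResponse(String),
}

// `extract_links` lets the caller validate a page without queueing
// further work from it.
#[derive(Debug)]
struct CrawlCommand {
    url: Url,
    extract_links: bool,
}

fn visit_page(client: &Client, command: &CrawlCommand) -> Result<Vec<Url>, Error> {
    let response = client.get(command.url.clone()).send()?;
    if !response.status().is_success() {
        return Err(Error::BadResponse(response.status().to_string()));
    }

    let mut link_urls = Vec::new();
    if !command.extract_links {
        return Ok(link_urls);
    }

    let base_url = response.url().clone();
    let body = response.text()?;
    let document = Html::parse_document(&body);
    let selector = Selector::parse("a").unwrap();
    for element in document.select(&selector) {
        // Keep only hrefs that resolve against the page we fetched.
        if let Some(href) = element.value().attr("href") {
            match base_url.join(href) {
                Ok(url) => link_urls.push(url),
                Err(err) => println!("On {base_url}: ignored unparsable {href:?}: {err}"),
            }
        }
    }
    Ok(link_urls)
}
```

Together with the `main` function above, this compiles as a complete program; the real starter code may differ in naming and in how it classifies response errors.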
Run the code in `src/main.rs` with

```shell
cargo run
```
Then extend the program:

- Use threads to check the links in parallel: send the URLs to be checked to a
  channel and let a few threads check the URLs in parallel (one possible
  structure is sketched below).
- Extend this to recursively extract links from all pages on the
  `www.google.org` domain. Put an upper limit of 100 pages or so so that you
  don't end up being blocked by the site.
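
For the threaded version, one possible wiring (a sketch using placeholder types and made-up names, not a solution to the exercise) is to send commands down an `mpsc` channel, share the receiver behind a `Mutex`, and let a handful of worker threads report results on a second channel:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Placeholder for the real `CrawlCommand` / `visit_page` from the
// exercise skeleton; here a command is just a URL string.
type Command = String;

fn main() {
    let (command_tx, command_rx) = mpsc::channel::<Command>();
    let (result_tx, result_rx) = mpsc::channel::<(Command, bool)>();
    // `mpsc::Receiver` is not `Sync`, so the workers share it behind a mutex.
    let command_rx = Arc::new(Mutex::new(command_rx));

    for _ in 0..4 {
        let command_rx = Arc::clone(&command_rx);
        let result_tx = result_tx.clone();
        thread::spawn(move || {
            loop {
                // The lock is held only while receiving, so workers stay
                // parallel during the actual (slow) HTTP request.
                let cmd = match command_rx.lock().unwrap().recv() {
                    Ok(cmd) => cmd,
                    Err(_) => break, // channel closed: no more work
                };
                // In the real exercise this would call `visit_page` and send
                // newly discovered links back for further crawling.
                let ok = !cmd.is_empty();
                let _ = result_tx.send((cmd, ok));
            }
        });
    }
    drop(result_tx); // keep only the clones owned by the workers

    for url in ["https://www.google.org/", "https://www.google.org/about/"] {
        command_tx.send(url.to_string()).unwrap();
    }
    drop(command_tx); // let the workers' `recv` calls fail once the queue drains

    // Collect results; the real version should also cap the crawl at ~100 pages.
    for (url, ok) in result_rx {
        println!("{url}: {}", if ok { "ok" } else { "broken" });
    }
}
```

Dropping the original senders is what lets the loops terminate: the workers stop when the command channel closes, and the result iteration stops once all worker-owned senders are gone.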