Archive Team

Warrior/Tracker system

Archive Team is a loose community of independent contributors.^[13]^[14]^[15] Its archival process relies on the "Warrior", a virtual machine environment that individuals run on their own desktop computers to download content without needing technical expertise. Work is coordinated by a centrally managed Tracker, which allocates items to Warriors. The Tracker also monitors user upload activity and displays a leaderboard.^[16]
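
The division of labour between Tracker and Warriors can be illustrated with a small sketch. This is not Archive Team's actual software: the `Tracker` class, its method names, the item names, and the byte counts are all hypothetical, chosen only to show a central queue handing out items and tallying uploads for a leaderboard.

```python
from collections import Counter, deque


class Tracker:
    """Toy stand-in for a central tracker (hypothetical, not Archive Team's code)."""

    def __init__(self, items):
        self.todo = deque(items)   # items not yet handed out
        self.claimed = {}          # item -> warrior currently working on it
        self.uploaded = Counter()  # warrior -> bytes uploaded (leaderboard data)

    def request_item(self, warrior):
        """Hand the next unclaimed item to a warrior, or None if the queue is empty."""
        if not self.todo:
            return None
        item = self.todo.popleft()
        self.claimed[item] = warrior
        return item

    def report_done(self, warrior, item, num_bytes):
        """Record a finished item; reject reports for items the warrior never claimed."""
        if self.claimed.get(item) != warrior:
            raise ValueError(f"{item!r} is not claimed by {warrior!r}")
        del self.claimed[item]
        self.uploaded[warrior] += num_bytes

    def leaderboard(self):
        """Return (warrior, bytes) pairs, largest uploader first."""
        return self.uploaded.most_common()


# Two hypothetical warriors work through three items.
tracker = Tracker(["item-1", "item-2", "item-3"])
first = tracker.request_item("alice")
second = tracker.request_item("bob")
third = tracker.request_item("alice")
tracker.report_done("alice", first, 500)
tracker.report_done("bob", second, 1200)
tracker.report_done("alice", third, 300)
board = tracker.leaderboard()
```

Keeping the queue and the upload tally in one place is what lets the Warriors stay simple: each only asks for work and reports back, without knowing about the others.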

Warrior Projects

There are several long-running Warrior projects:

Imgur: The image host Imgur updated their terms of service on April 19, 2023. This update focused on removing old, unused, and inactive content that is not tied to a user account, along with NSFW content.^[17]
Blogger: In May 2023, Google announced that inactive accounts would be deleted starting on 2023-12-01 across their platform, including Blogger blogs.^[18]
Reddit: Reddit has been banning communities that generate bad PR for Reddit Inc., and it restricted access to its APIs and data on June 19, 2023.^[19]
Russian invasion of Ukraine: Archiving various .ua sites in the wake of the Russian government's invasion.^[20]
Telegram: Archiving public messages in various newsworthy and/or otherwise notable Telegram channels.^[21]
GitHub: When it was bought by Microsoft in 2018, many archivists and users were worried the site would become more restrictive. This project archives the UI parts of GitHub and the code of each repository.^[22]
MediaFire: On 2020-12-18, users reported receiving emails from MediaFire stating that, starting in January, accounts failing to meet certain criteria would be classified as abandoned.^[23]
Coronavirus Outbreak: Documenting and preserving data, events, and impacts of COVID-19 on society.^[24]
YouTube: Saving metadata, thumbnails, comments, and selected videos. Videos and channels are limited to channels that may be deleted because the company behind them went bankrupt, the channel owner died, or YouTube is banning certain content, as well as channels related to world events and politics.^[25]
WikiTeam: Saving wiki XML dumps.^[26]
URLTeam: Saving URL shorteners.^[27]
URLs: Archiving URLs from various sources.^[28]

As of 12 December 2024, the largest Archive Team project is URLs, with over 10 petabytes archived.^[29]^[b]

ArchiveBot

ArchiveBot is a web archiving system operated by the Archive Team for conducting curated crawls of websites. Controlled through an IRC channel, ArchiveBot allows volunteers to submit URLs for archiving, typically in response to site shutdowns, policy changes, or other events threatening online data.

Jobs are processed by a network of worker systems known as pipelines, which crawl and save content in the WARC (Web ARChive) format. Volunteers monitor active crawls (jobs) via a public dashboard and may apply ignore rules to handle problematic areas of websites, such as calendars, infinite scroll, or session-based content, that can disrupt recursive crawling.^[30]
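
How an ignore rule keeps a recursive crawl out of such traps can be sketched as follows. The example assumes ignore rules are regular expressions matched against each discovered URL; the patterns, the function names, and the example.org URLs are illustrative assumptions, not rules taken from ArchiveBot.

```python
import re

# Hypothetical ignore patterns, written as regular expressions.
IGNORE_PATTERNS = [
    r"/calendar/\d{4}/\d{2}",  # endless month-by-month calendar pages
    r"[?&](sessionid|sid)=",   # session IDs that make every URL look new
    r"[?&]page=\d{3,}",        # very deep pagination (page 100 and beyond)
]
IGNORE_RES = [re.compile(pattern) for pattern in IGNORE_PATTERNS]


def should_crawl(url):
    """Return True unless the URL matches at least one ignore pattern."""
    return not any(rx.search(url) for rx in IGNORE_RES)


def filter_frontier(urls):
    """Drop ignored URLs from a list of newly discovered links, keeping order."""
    return [url for url in urls if should_crawl(url)]


discovered = [
    "https://example.org/about",
    "https://example.org/calendar/2031/07",
    "https://example.org/forum?sid=abc123",
    "https://example.org/news?page=2",
    "https://example.org/news?page=1500",
]
kept = filter_frontier(discovered)
```

In this sketch only the "about" page and the second news page survive. The calendar, session, and deep-pagination links are discarded before they can generate further links.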

The results of ArchiveBot crawls are uploaded to the Internet Archive and are typically accessible through the Wayback Machine, where they can be viewed by the public.^[31] ArchiveBot has been used to preserve a wide range of content, including user-generated platforms, news outlets, and government websites.^[32]
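
As an illustration of how an archived page is looked up, the sketch below builds a Wayback Machine address from an original URL and a point in time, using the service's public `https://web.archive.org/web/<timestamp>/<url>` pattern with a 14-digit UTC timestamp. The helper name `wayback_url` and the example page are hypothetical, and the sketch makes no claim that a capture of that page exists.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build a Wayback Machine URL for the capture closest to `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    rendered as the 14-digit YYYYMMDDhhmmss timestamp used in Wayback URLs.
    """
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Hypothetical page and date, used only to show the URL shape.
link = wayback_url(
    "https://example.org/news",
    datetime(2024, 12, 12, 8, 30, 0, tzinfo=timezone.utc),
)
```

When no capture exists at exactly that moment, the Wayback Machine redirects to the nearest one it holds for the page.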
