While I had every intention of writing more about music today, an article about Common Crawl came across my feed this morning.
Before reading the article, I had never heard of Common Crawl (which is apparently fairly normal: the very first sentence of the article acknowledges that basically nobody knows about Common Crawl outside of Silicon Valley). Tax-exempt since 2009, Common Crawl crawls the internet, makes an archive of the information, and provides it to researchers for free.
In recent years, it has also provided these archives to AI companies to train their large language models. Most of the big names in AI, including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon, have used Common Crawl’s data as training material.
Common Crawl claims that it doesn’t archive paywalled data or data that otherwise requires a sign-in (though its executives seem to really want to). But Alex Reisner found that Common Crawl not only archives paywalled content and provides it to AI companies, it also ignores requests to take that content down.
So anyway, is archiving paywalled material and providing it to researchers and AI companies consonant with tax exemption? If I were Common Crawl, I’d be at least a little worried.
Because, while I’m not intimately familiar with the Computer Fraud and Abuse Act, it’s not out of the realm of possibility that accessing this paywalled data is a criminal offense. (Remember that about a decade ago, the federal government indicted and prosecuted Aaron Swartz for downloading and sharing paywalled JSTOR articles.)
Now, there’s no indication that the government is planning on prosecuting anybody associated with Common Crawl. But it’s also worth remembering that an exempt organization cannot be organized for an illegal purpose. And if Common Crawl is violating the CFAA, I’d be at least a little nervous that it’s organized, in part, for an illegal purpose.
Photo by Nahrizul Kadri on Unsplash