Known Bugs On My List
The journalist’s name field can often come up wrong, sometimes quite wrong
As the web becomes harder to scrape, many news sites obscure certain metadata from crawlers. We do better than most because our users load the page themselves before it is sent for archiving, but we still miss some fields. This is especially common on Zerohedge.com, where names are often scraped from the article text itself, so the subject of an article is sometimes recorded as its author.
We’re working on improving our scraping algorithms and intend to correct wrong records. Corrections will be made transparently, since nothing can be removed from the blockchain metadata records.
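To illustrate the kind of fix we have in mind, here is a minimal sketch of preferring structured metadata over names found in the article body. It is not the extension's actual code: `AuthorCandidates`, `AuthorGuess`, and `pickAuthor` are hypothetical names, and the sketch assumes the candidate strings have already been collected from the page.

```typescript
// Hypothetical shape: values a scraper might have collected from one page.
interface AuthorCandidates {
  jsonLdAuthor?: string; // e.g. a schema.org "author" value, if the site exposes one
  metaAuthor?: string;   // e.g. the content of <meta name="author">
  bylineText?: string;   // text scraped from the article body (least reliable)
}

interface AuthorGuess {
  name: string;
  source: "json-ld" | "meta" | "byline";
  lowConfidence: boolean; // true when the name came from the article body
}

// Collapse runs of whitespace and trim; a missing value becomes "".
function clean(s?: string): string {
  return (s ?? "").replace(/\s+/g, " ").trim();
}

// Prefer structured metadata; fall back to the body byline only as a last
// resort, and flag that result so it can be reviewed or corrected later.
function pickAuthor(c: AuthorCandidates): AuthorGuess | null {
  const jsonLd = clean(c.jsonLdAuthor);
  if (jsonLd) return { name: jsonLd, source: "json-ld", lowConfidence: false };

  const meta = clean(c.metaAuthor);
  if (meta) return { name: meta, source: "meta", lowConfidence: false };

  // Strip a leading "By " or "Authored by " from the scraped byline.
  const byline = clean(c.bylineText).replace(/^(authored\s+)?by\s+/i, "");
  if (byline) return { name: byline, source: "byline", lowConfidence: true };

  return null; // nothing usable found; leave the field empty
}
```

Flagging body-derived names as low confidence would not stop the subject of an article from being picked up, but it would make those records easy to find when we go back to correct them.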
The date (or headline, or any other metadata) is wrong!
See above. These fields are affected by the same scraping problems as the journalist’s name.
None of the fields are filling in
Sometimes a function on the page hangs. Because the scraping function waits for the page to load fully, it never runs. Reload the page and click the Scribes of Alexandria button again, and it should work fine.
I plan to add a timeout so that the scraper runs even if the page hasn’t reported an idle status yet.
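A rough sketch of how such a timeout could work, under assumptions: `scrapeWhenIdleOrTimeout`, `pageIdle`, and `scrape` are hypothetical names, and the extension's real idle signal may look different. The idea is simply to run the scraper once, on whichever comes first: the page going idle or the timer expiring.

```typescript
// Runs `scrape` exactly once: when `pageIdle` settles, or after `timeoutMs`
// milliseconds, whichever happens first.
// `pageIdle` is a stand-in for however the page's idle status is reported.
function scrapeWhenIdleOrTimeout<T>(
  pageIdle: Promise<void>,
  scrape: () => T,
  timeoutMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let done = false;
    let timer: ReturnType<typeof setTimeout> | undefined;

    const run = (): void => {
      if (done) return; // guard against running twice
      done = true;
      if (timer !== undefined) clearTimeout(timer);
      try {
        resolve(scrape());
      } catch (err) {
        reject(err);
      }
    };

    // Scrape anyway once the timeout expires, even if the page is still busy.
    timer = setTimeout(run, timeoutMs);

    // Scrape as soon as the page reports idle (or the idle signal itself fails).
    pageIdle.then(run, run);
  });
}
```

With something like this in place, a hung page would produce a possibly incomplete scrape after the timeout instead of no scrape at all, which is why a reload may still be worth trying if fields come back empty.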