← BackJan 8, 2026

Show HN: Open database of link metadata for large-scale analysis

Skip to content Navigation Menu AI CODE CREATIONGitHub CopilotWrite better code with AIGitHu...

Skip to content Navigation Menu AI CODE CREATIONGitHub CopilotWrite better code with AIGitHub SparkBuild and deploy intelligent appsGitHub ModelsManage and compare promptsMCP RegistryNewIntegrate external toolsView all featuresPricing Provide feedback Saved searches Use saved searches to filter your results more quickly Sign up Appearance settings Notifications You must be signed in to change notification settings Fork 2 Star 28 Folders and filesNameNameLast commit messageLast commit dateLatest commitHistory359 Commits20252025LICENSELICENSEREADME.mdREADME.mdsources.jsonsources.jsonREADMEGPL-3.0 licenseLink database for year 2025 This repository contains link metadata: title, description, publish date, etc. Suite of projects Captured using Django application: https://github.com/rumca-js/Django-link-archive Bookmarked links https://github.com/rumca-js/RSS-Link-Database daily RSS Git repository for the year 2026 https://github.com/rumca-js/RSS-Link-Database-2026 daily RSS Git repository for the year 2025 https://github.com/rumca-js/RSS-Link-Database-2025 daily RSS Git repository for the year 2024 https://github.com/rumca-js/RSS-Link-Database-2024 daily RSS Git repository for the year 2023 https://github.com/rumca-js/RSS-Link-Database-2023 daily RSS Git repository for the year 2022 https://github.com/rumca-js/RSS-Link-Database-2022 daily RSS Git repository for the year 2021 https://github.com/rumca-js/RSS-Link-Database-2021 daily RSS Git repository for the year 2020 https://github.com/rumca-js/RSS-Link-Database-2020 Goal Archive purposes Data analysis - possible to verify link rot, etc. Inspirations I Tracked Everything I Read on the Internet for a Year https://www.tdpain.net/blog/a-year-of-reading. Automating a Reading List https://zanshin.net/2022/09/11/automating-a-reading-list/ Google Search Is Dying https://dkb.io/post/google-search-is-dying Luke Smith: Search Engines are Totally Useless Now... https://www.youtube.com/watch?v=N8P6MTOQlyk Luke Smith: Remember to Consoom Next Content on YouTube https://www.youtube.com/watch?v=nI3GVw2JSEI. As a society we provide news instead of building a data base of important information Ryan George What Google Search Is Like In 2022 https://www.youtube.com/watch?v=NT7_SxJ3oSI Data Daily Data Are stored in the year directory. data are in %Y%M%Y-%M-%D directories, where %Y stands for year, %M for month, %D for day Most links are captured via RSS. Some entries were added manually for each source two files are provided: JSON and markdown markdown file is provided for data preview this repo contains many links captured via automated process. They are here, but that does not mean I endorse them all Sources file: sources.json provides information about sources, like title, url, langugage Domains file: domains.json provides information about domains, like title, url, langugage Data analysis With these data we can perform further analysis: analysis of links: how many of old links are not any longer valid analysis of RSS source: how often it publishes data analysis of RSS source: what kind of data it produces, is it reliable analysis of RSS source: is it a content farm, does it contain many links outside of the domain? analysis of domains: is the domain correctly configured? analysis of topics: who was the first to report on certain topic analysis of topics: which source uses which words? For example it seems that left leaning sites, and white leaning sites have a different vocublary. There are different kind of words and ideads in. With this file history, you can analyze which sites have which ideas Problems, notes This solution does not replace Internet Archive. We do not store all link data Internet Archive (archive.org) is sometimes slow This solution does not replace Google. This would be futile. However Google provides only 31 pages of news (in news filter) and around 10 pages for ordinary search. This is a very small number. It is like looking through keyhole at the Internet You cannot discover new content using Google. For example write 'blog' into search. You will not be able to find new blogs. You cannot discover new content using YouTube. For example write 'trailer' into search. It shows me certain amount of new trailers from time span of 10 days, nothing older, or just a few older titles. I prefer capturing data from a movie trailer channel and continue to use it's data Link rot is real. Many github pages do not work at all. Many pages stop working after significant amount of time I am using raspberry PI at the moment. With it I cannot track all of the sources in the world. Therefore I track only 'well established' sources, or the ones I am really interested in Is the data relevant, or useful for anyone but me? Q & A Q: Why are you using sources like daily mail? This does not make any sense! A: This project aims to capture time capsule of day, or year. In statistics sometimes it is too late to capture "new" data. For analysis only credible, and reliable sources could be used Ending notes All links belong to us!