For context: I created a video search engine last year, then shut it down and put the data online. You can read about it here: https://www.bendangelo.me/2024/07/16/failed-attempt-at-creating-a-video-search-engine/

I put that project on hold because of scaling issues. Anyway, I’m back with another idea. I’ve been frustrated with how AI slop is ruining the internet, and recently it’s been hitting YouTube pretty hard with AI videos. I’m brainstorming a tool for people to self-host:

Self-hosted crawler: Pick which sites/videos to index (blogs, forums, YT channels, etc.).
AI chat interface: Ask questions like, “Show me Rust tutorials from 2023” or “Summarize recent posts about homelab backups.”
Optional sharing: Pool indexes with trusted friends/communities.
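
To make the "pick which sites to index" part concrete, here is a minimal Python sketch, not the actual prototype. It assumes the crawler hands over a URL plus the page's plain text; the host names, `is_allowed`, and `LocalIndex` are all made up for illustration, and the AI chat layer is left out entirely. It only shows an allowlist check in front of a tiny keyword index.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical allowlist: the only hosts the crawler may index.
ALLOWED_HOSTS = {"example-blog.org", "forum.example.net"}


def is_allowed(url: str) -> bool:
    """Return True if the URL's host is on the allowlist."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalIndex:
    """Tiny in-memory inverted index: word -> set of URLs containing it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, url: str, text: str) -> bool:
        """Index a page if its host is allowed; report whether it was kept."""
        if not is_allowed(url):
            return False
        for word in tokenize(text):
            self.postings[word].add(url)
        return True

    def search(self, query: str) -> set[str]:
        """Return URLs that contain every word of the query."""
        words = tokenize(query)
        if not words:
            return set()
        results = [self.postings.get(w, set()) for w in words]
        return set.intersection(*results)
```

Usage would look something like `idx = LocalIndex()`, then `idx.add(url, text)` for each crawled page and `idx.search("rust tutorials")` for a query. A real version would need persistent storage and ranking; this only shows the filtering step.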

Why?
No Google/YouTube spam: only content you choose.
Works offline (archive forums, videos, docs).
Local AI (Mistral) or cloud (paid) for smarter searches.

Would this be useful to you? What sites would you crawl? Any killer features I’m missing?

Prototype in progress—just testing interest!

  • Zwuzelmaus@feddit.org
    3 days ago

    No. Never would I self-host a search engine.

    The crawler would eat up far more resources than I am ever willing to spend.

    • daniskarma@lemmy.dbzer0.com
      3 days ago

      I’ve been thinking about this for years. Maybe there’s a way to build a collaborative crawler and indexer, similar to how collaborative science is done, and probably using P2P protocols.

      Get a bunch of people together to create the perfect search engine in these dire times.

    • wise_pancake@lemmy.ca
      3 days ago

      Potentially, that would be a good application of federation and distributed computing.

      An Internet Archive-like distributed tool that then feeds into local tokenization and indexing.

      Alternatively, a centralized service that generates the indices, which are then queried locally, would save a lot of energy.