• flux@lemmy.ml
    3 days ago

    This is not about training data, though.

    Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.

    Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.

    I think the solution is quite clear, though: either use the user's identity to waltz through the blocks, or even use the user's browser to make the requests. When a captcha appears, let the user solve it.

    Making all this work flawlessly is technically quite a big task, though.

    • snooggums@lemmy.world
      3 days ago

      Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.

      They are one of the sources!

      The AI scraping that happens when a user enters a prompt is DDoSing sites, in addition to the scraping for training data that is DDoSing sites. These shitty companies are slamming the same sites over and over again in the least efficient way, because they don't use the data scraped for training when they process a user prompt that does a web search.

      Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
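      For what it's worth, the "least efficient way" part could in principle be softened by caching on the assistant's side. A minimal sketch, assuming a hypothetical `fetch` function and a made-up `TTLPageCache` class (nothing here describes how any real assistant backend works): pages fetched for one prompt are reused for later prompts until they expire.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Hypothetical cache: reuse a fetched page until it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # assumed function that downloads a URL
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0         # how many requests actually reached the site

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        # Serve from the cache while the stored copy is still fresh.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise hit the origin site once and remember the result.
        self.origin_hits += 1
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

      With something like this, many prompts that touch the same page inside the TTL window cost the site one request instead of one per prompt. The trade-off is freshness: a short TTL keeps answers current but sends more traffic to the site.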

      • flux@lemmy.ml
        3 days ago

        When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won’t retrieve all the pages of a site. That is hardly different from a user using a search engine and opening the 5 topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn’t a DDoS attack.

        Constructing the training material in the first place is a different matter, but if you’re asking about fresh events or new APIs, the training data just doesn’t cut it. The training, and consequently the material retrieval for it, was done a long time ago.