# Website Crawling

The application needs to be able to read data from established hentai sites. This includes the manga themselves, their metadata (tags, author, etc.), and user data (lists, ratings, etc.).

This is derived from user stories [#4], [#5], [#6].

None of these sites has an official API, which means the data needs to be read from the HTML.
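As a rough illustration of what reading from HTML could look like, the sketch below pulls a title and a tag list out of a gallery page. The markup it expects (`<title>` for the title, `<a class="tag">` for tags) is a made-up assumption, not the real structure of any of these sites, and the names `MangaMetadata` and `parseGalleryPage` are hypothetical. It uses regular expressions only to stay dependency-free; a real crawler would more likely rely on a proper HTML parser.

```typescript
// Hypothetical shape of the metadata we want; the field names are assumptions.
interface MangaMetadata {
  title: string | null;
  tags: string[];
}

// Decode the handful of HTML entities likely to appear in titles and tags.
// `&amp;` is handled last so that already-escaped text is not decoded twice.
function decodeEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Extract the title and tags from a gallery page.
// ASSUMPTION: the title lives in <title> and every tag is an <a> element
// whose class list contains "tag". Real sites will differ.
function parseGalleryPage(html: string): MangaMetadata {
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = titleMatch ? decodeEntities(titleMatch[1].trim()) : null;

  const tags: string[] = [];
  const linkPattern = /<a\b([^>]*)>([\s\S]*?)<\/a>/gi;
  let match: RegExpExecArray | null;
  while ((match = linkPattern.exec(html)) !== null) {
    // Only keep links whose class attribute lists "tag" as a whole word.
    const classMatch = /\bclass="([^"]*)"/i.exec(match[1]);
    const classes = classMatch ? classMatch[1].split(/\s+/) : [];
    if (classes.indexOf("tag") === -1) continue;

    // Strip nested markup inside the link, then decode entities.
    const text = decodeEntities(match[2].replace(/<[^>]*>/g, "").trim());
    if (text.length > 0) tags.push(text);
  }
  return { title, tags };
}
```

Each site would need its own variant of this function, since the page structure is different everywhere and can change without notice.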

Depending on whether the sites use CAPTCHAs, authentication could also be difficult. It might be easier to use the Chromium instance bundled with Electron to load the sites themselves and read the data from the rendered pages. On the other hand, this might leave the app vulnerable to dubious redirects to ad sites. Another idea is to embed the sites in an `<iframe>`.
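One possible way to limit the redirect risk when loading sites inside Electron is to cancel any navigation that leaves an allow-list of hosts. The sketch below is only an assumption about how this could be done: `isAllowedNavigation` and `guardNavigation` are hypothetical helpers, and `NavigableContents` is a minimal stand-in for the part of Electron's `webContents` that the sketch relies on (its `will-navigate` event), so that the code compiles without importing Electron.

```typescript
// Minimal structural stand-ins for the Electron types used here (assumption:
// only the `will-navigate` event with an (event, url) listener is needed).
interface NavigationEvent {
  preventDefault(): void;
}
interface NavigableContents {
  on(
    event: "will-navigate",
    listener: (event: NavigationEvent, url: string) => void
  ): unknown;
}

// Parse a URL, returning null instead of throwing on invalid input.
function parseUrl(url: string): URL | null {
  try {
    return new URL(url);
  } catch {
    return null;
  }
}

// True when `url` is http(s) and its host is an allowed host or a subdomain of one.
function isAllowedNavigation(url: string, allowedHosts: string[]): boolean {
  const parsed = parseUrl(url);
  if (parsed === null) return false;
  if (parsed.protocol !== "https:" && parsed.protocol !== "http:") return false;
  const hostname = parsed.hostname;
  return allowedHosts.some(
    (host) => hostname === host || hostname.endsWith("." + host)
  );
}

// Cancel every page-initiated navigation that would leave the allowed hosts.
function guardNavigation(contents: NavigableContents, allowedHosts: string[]): void {
  contents.on("will-navigate", (event, url) => {
    if (!isAllowedNavigation(url, allowedHosts)) event.preventDefault();
  });
}
```

In the main process this would presumably be wired up as `guardNavigation(win.webContents, ["nhentai.net"])` on a `BrowserWindow` named `win`. This covers only navigations started by the page or the user; pop-up windows and server-side redirects would need separate handling.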

| website                            | provides api | user data | metadata | torrent download |
| ---------------------------------- | :----------: | :-------: | :------: | :--------------: |
| [nhentai](https://nhentai.net)     |      -       |     ✓     |    ✓     |        ✓         |
| [Tsumino](https://www.tsumino.com) |      -       |     ✓     |    ✓     |        -         |
| [E-Hentai](https://e-hentai.org)   |      -       |     ✓     |    ✓     |        ✓         |
| [Hentai Cafe](https://hentai.cafe) |      -       |     -     |    ✓     |        -         |
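
If the crawler needs to know these capabilities at runtime, the table could be mirrored as a small data structure. The following is a minimal sketch under that assumption; the names `SiteCapabilities`, `SITES`, and `sitesWith` are hypothetical, and the values are copied from the table above.

```typescript
// One entry per row of the table; the flags mirror its columns.
interface SiteCapabilities {
  name: string;
  url: string;
  providesApi: boolean;
  userData: boolean;
  metadata: boolean;
  torrentDownload: boolean;
}

type Feature = "providesApi" | "userData" | "metadata" | "torrentDownload";

const SITES: SiteCapabilities[] = [
  { name: "nhentai", url: "https://nhentai.net", providesApi: false, userData: true, metadata: true, torrentDownload: true },
  { name: "Tsumino", url: "https://www.tsumino.com", providesApi: false, userData: true, metadata: true, torrentDownload: false },
  { name: "E-Hentai", url: "https://e-hentai.org", providesApi: false, userData: true, metadata: true, torrentDownload: true },
  { name: "Hentai Cafe", url: "https://hentai.cafe", providesApi: false, userData: false, metadata: true, torrentDownload: false },
];

// Names of all sites that support the given feature, in table order.
function sitesWith(feature: Feature): string[] {
  return SITES.filter((site) => site[feature]).map((site) => site.name);
}
```

For example, `sitesWith("torrentDownload")` would let the application offer torrent downloads only for the sites that support them.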

**nhentai**

- probably the most popular (either this or E-Hentai)

**Tsumino**

- provides a direct zip download, but it is locked behind Google's reCAPTCHA
- the normal image view seems to involve some kind of authentication key shenanigans as well

**E-Hentai**

- https://exhentai.org/
- will probably be archived in the near future

**Hentai Cafe**

- the most bare-bones functionality, probably the easiest to crawl

[#4]: ../user-stories.md#4
[#5]: ../user-stories.md#5
[#6]: ../user-stories.md#6