Feature Request: Allow room character dialogue to be dumped #1

Open
opened 2023-01-01 07:29:21 +01:00 by Goy288 · 4 comments

I don't care too much for room dialogue, as it's mostly more of a gimmicky thing, but it could be very useful in getting more training data for AI models. The main caveat in getting training data the typical way is that half the effort is still gonna be on the user's part to type out proper roleplay interactions for the bot.

However, using rooms, character data acquirement could not only be more automated, but effectively multiplied via the multiple characters. I think that training data could be more effectively acquired via this method.

Owner

Thank you for the suggestion!

> [...] using rooms, character data acquirement could not only be more automated, but effectively multiplied via the multiple characters. I think that training data could be more effectively acquired via this method.

Indeed, you're absolutely right about that.

My concern though is that a significant part of our training data is already half synthetic (for example: normal CAI conversations are always human + bot, so we can expect ~50% of that data to be synthetic).

If we start from the assumption that synthetic data is inferior to real data (because AIs aren't as smart as humans), we should instead be striving to lower its amount in the training dataset, so accepting room dialogue would be counterproductive.
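To put rough numbers on that concern, here is a minimal sketch of the arithmetic. It assumes regular conversations are 50% bot-written (the estimate above) and, as a simplifying assumption of this example only, that room dialogue is entirely bot-written. The function and parameter names are hypothetical.

```javascript
// Share of bot-written messages in a dataset that mixes regular
// conversations with room dumps.
// Assumptions: regular chats are ~50% synthetic (estimate from this thread);
// roomSyntheticRate defaults to 1.0, i.e. rooms are treated as fully
// bot-written, which is a simplification made for this sketch.
function syntheticShare(regularCount, roomCount, roomSyntheticRate = 1.0) {
  const regularSyntheticRate = 0.5;
  const total = regularCount + roomCount;
  if (total === 0) return 0;
  const synthetic =
    regularCount * regularSyntheticRate + roomCount * roomSyntheticRate;
  return synthetic / total;
}

console.log(syntheticShare(1000, 0));    // 0.5
console.log(syntheticShare(1000, 1000)); // 0.75
```

Under those assumptions, matching the regular data one-to-one with room messages would move the dataset from half synthetic to three quarters synthetic.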

That being the case, I don't plan on implementing room dumping at the moment. I'm open to accepting pull requests if anyone's interested in implementing it, but even then I'd likely keep that data out of the training dataset for the reasons described above.


You're forgetting that a room still has an element of human curation. The human still swipes to pick the best response, and anyone contributing to this project probably wants to make sure they're providing you with quality. I have some pretty damn good room convos I could send over if you just let me.
Not to mention, you've seen these threads: not every anon has better data to input than the AI could produce just by talking to itself. A lot of these people are uhh... not exactly strong writers. There's no reason to assume synthetic data is inferior to organic data in this case.

Owner

Those are all very good points. Very well, I'll investigate how room history works to see how feasible it is to implement into the userscript. Thanks for the input, everyone.

Owner

Quick update: while I am still interested in getting this implemented, it'd be quite a bit trickier than the regular conversations or character definitions.

For those, there's an endpoint which returns all the data we need, so simply intercepting the request is enough to download everything. In the case of rooms though, I couldn't find any endpoint like that. The room history endpoint is paginated, and manually firing off requests from the userscript to fetch the next page(s) results in HTTP 401 Unauthorized responses.
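One possible workaround, sketched here for anyone considering a PR, is to keep intercepting passively: let the page itself request each history page as the user scrolls, and merge whatever comes back. This is only a sketch under assumptions. The response shape (`messages`, `id`, `created`), the URL matcher, and the idea that the site loads history through `fetch` are all hypothetical and are not taken from the real API or the existing userscript.

```javascript
// Accumulates passively intercepted room-history pages.
// ASSUMPTION: each page looks like { messages: [{ id, created, ... }] }.
class RoomHistoryCollector {
  constructor() {
    this.byId = new Map();
  }

  // Merge one page; a message seen twice (same id) is stored only once.
  // Returns the number of distinct messages collected so far.
  addPage(page) {
    for (const msg of page.messages ?? []) {
      this.byId.set(msg.id, msg);
    }
    return this.byId.size;
  }

  // Everything collected so far, oldest first.
  dump() {
    return [...this.byId.values()].sort((a, b) => a.created - b.created);
  }
}

// Wraps a fetch-like function so that JSON responses from matching URLs
// are also handed to the collector. The page's own request is untouched.
function wrapFetch(originalFetch, collector, isHistoryUrl) {
  return async function wrappedFetch(input, init) {
    const response = await originalFetch(input, init);
    const url = typeof input === "string" ? input : (input.url ?? String(input));
    if (isHistoryUrl(url)) {
      // Clone so the page can still read the original body.
      response
        .clone()
        .json()
        .then((page) => collector.addPage(page))
        .catch(() => {}); // ignore non-JSON or malformed responses
    }
    return response;
  };
}

// Hypothetical wiring inside a userscript (the path is a placeholder):
// const collector = new RoomHistoryCollector();
// window.fetch = wrapFetch(window.fetch.bind(window), collector,
//   (url) => url.includes("/hypothetical/room/history"));
```

The trade-off is that the user has to scroll through the whole room for every page to be seen, but no extra requests are fired from the userscript, so the Unauthorized problem does not come up.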

This shouldn't be too hard to implement, but at the moment I'm working on releasing bigger models and an improved chatting UI (6B model weights are already public on HF, Colab notebook with the UI should be coming soon), so this is way down on the list of priorities for now.

I'll be keeping the issue open in case I have the time to come back to this though. And for any programmers interested, I'm open to reviewing PRs tackling this - feel free to open one at https://github.com/0x000011b/characterai-dumper (we're currently in the process of migrating stuff to a GitHub org with proper separate repositories).

Reference: waifu-collective/toolbox#1