Commit Graph

4 Commits

Author SHA1 Message Date
11b 96b41dee60 feat: improve handling of special tokens in the Kajiwoto dataset 2022-12-27 12:52:08 -03:00
11b 5dbde00d27 feat: bring down target word count per episode 2022-12-26 17:31:28 -03:00
    After tokenization, most stuff was going over the 2048 context window so let's bring this down a little.
11b 60e649f57a feat: some minor filtering to hopefully improve CAI data 2022-12-26 12:04:04 -03:00
11b e0552639fa feat: update CAI dataset/module to handle userscript dumps and use definitions 2022-12-23 16:38:13 -03:00