Commit Graph

65 Commits

Author SHA1 Message Date
11b 50ae8816a1 refactor: archive the old repo 2023-01-08 17:32:42 -03:00
11b beec9ba31f chore: update gitignore since training code is no longer here 2023-01-08 17:09:48 -03:00
11b 23eb4a6ab2 refactor: move non-data related stuff to other repositories in the org 2023-01-08 16:31:37 -03:00
11b 7d385ec13c chore: add packages required by the SODA dataset 2023-01-08 15:49:58 -03:00
TearGosling ea162de2e0 feat: add SODA dataset
* Very first prototype of SODA dataset support

I'm also bringing over the version of PromptConstants from the dev branch due to needing CHAT_START_TOKEN

* More flexibility when fetching speaker names

* Make SODA a PDM instead of a VDM

* Swap order of speakers based on relation

* Oh, and fix a typo too

* Bugfix
2023-01-08 15:48:52 -03:00
11b eb997a3d3f chore: point CAI dumper userscript to the GitHub repo instead 2023-01-08 12:16:47 -03:00
11b 9a3719127c refactor: delete old training code
Now archived under the "colossalai-training-code" repository.
2023-01-08 11:46:16 -03:00
11b 5e34b105dc feat: alternative way of handling and augmenting episode data (wip) 2023-01-04 09:05:51 -03:00
11b 46a552ad28 chore: add link to roadmap on the README 2023-01-01 11:51:46 -03:00
11b 1409bafd2b chore: update ROADMAP 2023-01-01 11:50:30 -03:00
11b 53494a6567 chore: fix linter/style problems 2023-01-01 11:50:23 -03:00
11b e4594338d2 feat: changes to log and discard some not-so-great data 2023-01-01 11:34:31 -03:00
11b 9f55ecfca7 feat: attempt to detect looping in CAI logs and discard from final dataset 2023-01-01 11:32:57 -03:00
11b aebd405bbd feat: proper checkpoint resume in CLM fine-tune script 2022-12-27 13:21:20 -03:00
11b e99277ec52 feat: log LR in CLM fine-tune script 2022-12-27 13:21:00 -03:00
11b 96b41dee60 feat: improve handling of special tokens in the Kajiwoto dataset 2022-12-27 12:52:08 -03:00
11b b95b30cf88 feat: implement arg to skip over episodes when debugging data build 2022-12-27 12:46:36 -03:00
11b 3e798f6767 fix: rename folder so import actually works 2022-12-26 20:44:35 -03:00
11b 93e283daee feat: implement utility to convert ColossalAI checkpoints to HF pre-trained model 2022-12-26 20:43:01 -03:00
11b b79ac657a4 fix: haru's sft being incompatible with the ColossalAI fine-tune script 2022-12-26 20:42:48 -03:00
11b 5dbde00d27 feat: bring down target word count per episode
After tokenization, most stuff was going over the 2048 context window so let's bring this down a little.
2022-12-26 17:31:28 -03:00
11b bcbf0910b4 feat: add supervised fine-tuning code based on haru's work
Warning: Absolutely atrocious code quality. I did just the bare minimum to make it run.
2022-12-26 17:31:00 -03:00
11b 60e649f57a feat: some minor filtering to hopefully improve CAI data 2022-12-26 12:04:04 -03:00
11b 4f794489ac feat: add support for fine-tuning GPT-NeoX-based models, save optimizer and LR scheduler to checkpoint 2022-12-25 15:42:59 -03:00
11b 186df60691 feat: update inference code for pythia/cai data-based models 2022-12-25 15:39:28 -03:00
11b 3bfb623f26 fix: human/bot messages being incorrectly labeled as each other 2022-12-24 17:58:33 -03:00
11b 5b26097905 feat: implement Gradio UI for proper model inference (WIP) 2022-12-24 12:12:55 -03:00
11b cef8f54fc4 fix: ignore invalid CAI JSON dumps 2022-12-23 16:45:18 -03:00
11b d91367e902 chore: update module list in build_dataset.py 2022-12-23 16:45:18 -03:00
11b a16673ebe0 refactor: adjust Kajiwoto modules to use the proper prompt constants 2022-12-23 16:45:18 -03:00
11b 60e0a21a3c chore: add pdbpp for better debugging experience 2022-12-23 16:38:13 -03:00
11b 3d6def871d refactor: use LIGHT as PDM instead of VDM, ignore actions 2022-12-23 16:38:13 -03:00
11b 1f273f13f3 chore: bump pdm version 2022-12-23 16:38:13 -03:00
11b e0552639fa feat: update CAI dataset/module to handle userscript dumps and use definitions 2022-12-23 16:38:13 -03:00
11b aef9289678 chore: update ROADMAP to add links about contributing with CAI dumps 2022-12-23 10:59:58 -03:00
11b 69aeea85b9 chore: reorganize CAI dumper README 2022-12-21 20:05:15 -03:00
11b 7087f39d5a fix: cai dumper crashing if chat had no messages 2022-12-21 20:05:01 -03:00
11b d6e05e6e5b chore: add changelog to the CAI dumper 2022-12-21 16:24:35 -03:00
11b e612386424 fix: handle edge-case regarding extra whitespace on char name 2022-12-21 16:15:49 -03:00
11b d638bb5625 feat: update userscript to allow dumping of definitions as well 2022-12-21 16:03:13 -03:00
11b ecf2e65e76 fix: anonymization within message text
Hopefully for reals this time.
2022-12-21 14:25:50 -03:00
11b 21ebd5834e chore: fix path links in the README 2022-12-21 13:51:20 -03:00
11b a8dfd396cc fix: more aggressive anonymization within message text 2022-12-21 13:43:46 -03:00
11b 6bc2a03ff9 chore: update ROADMAP 2022-12-20 21:52:28 -03:00
11b 1ddd991471 chore: clarify what DHT means in the Discord module 2022-12-20 21:44:16 -03:00
11b cec59a5511 docs: update CAI dumper README 2022-12-20 21:43:57 -03:00
11b 4f78bb73cb feat: implement userscript to dump CAI chats/basic bot info 2022-12-20 21:34:39 -03:00
11b ecd4efe3ce chore: run isort 2022-12-20 17:55:17 -03:00
11b 009c837439 feat: implement Discord dialogue module 2022-12-20 17:55:05 -03:00
11b b42131191a fix: don't write file when printing the dataset for debugging 2022-12-20 17:41:08 -03:00