11b
50ae8816a1
refactor: archive the old repo
2023-01-08 17:32:42 -03:00
11b
beec9ba31f
chore: update gitignore since training code is no longer here
2023-01-08 17:09:48 -03:00
11b
23eb4a6ab2
refactor: move non-data related stuff to other repositories in the org
2023-01-08 16:31:37 -03:00
11b
7d385ec13c
chore: add packages required by the SODA dataset
2023-01-08 15:49:58 -03:00
TearGosling
ea162de2e0
feat: add SODA dataset
* Very first prototype of SODA dataset support
I'm also bringing over the version of PromptConstants from the dev branch due to needing CHAT_START_TOKEN
* More flexibility when fetching speaker names
* Make SODA a PDM instead of a VDM
* Swap order of speakers based on relation
* Oh, and fix a typo too
* Bugfix
2023-01-08 15:48:52 -03:00
11b
eb997a3d3f
chore: point CAI dumper userscript to the GitHub repo instead
2023-01-08 12:16:47 -03:00
11b
9a3719127c
refactor: delete old training code
Now archived under the "colossalai-training-code" repository.
2023-01-08 11:46:16 -03:00
11b
5e34b105dc
feat: alternative way of handling and augmenting episode data (wip)
2023-01-04 09:05:51 -03:00
11b
46a552ad28
chore: add link to roadmap on the README
2023-01-01 11:51:46 -03:00
11b
1409bafd2b
chore: update ROADMAP
2023-01-01 11:50:30 -03:00
11b
53494a6567
chore: fix linter/style problems
2023-01-01 11:50:23 -03:00
11b
e4594338d2
feat: changes to log and discard some not-so-great data
2023-01-01 11:34:31 -03:00
11b
9f55ecfca7
feat: attempt to detect looping in CAI logs and discard from final dataset
2023-01-01 11:32:57 -03:00
11b
aebd405bbd
feat: proper checkpoint resume in CLM fine-tune script
2022-12-27 13:21:20 -03:00
11b
e99277ec52
feat: log LR in CLM fine-tune script
2022-12-27 13:21:00 -03:00
11b
96b41dee60
feat: improve handling of special tokens in the Kajiwoto dataset
2022-12-27 12:52:08 -03:00
11b
b95b30cf88
feat: implement arg to skip over episodes when debugging data build
2022-12-27 12:46:36 -03:00
11b
3e798f6767
fix: rename folder so import actually works
2022-12-26 20:44:35 -03:00
11b
93e283daee
feat: implement utility to convert ColossalAI checkpoints to HF pre-trained model
2022-12-26 20:43:01 -03:00
11b
b79ac657a4
fix: haru's sft being incompatible with the ColossalAI fine-tune script
2022-12-26 20:42:48 -03:00
11b
5dbde00d27
feat: bring down target word count per episode
After tokenization, most episodes were going over the 2048-token context window, so let's bring this down a little.
2022-12-26 17:31:28 -03:00
11b
bcbf0910b4
feat: add supervised fine-tuning code based on haru's work
Warning: Absolutely atrocious code quality. I did just the bare minimum to make it run.
2022-12-26 17:31:00 -03:00
11b
60e649f57a
feat: some minor filtering to hopefully improve CAI data
2022-12-26 12:04:04 -03:00
11b
4f794489ac
feat: add support for fine-tuning GPT-NeoX-based models, save optimizer and LR scheduler to checkpoint
2022-12-25 15:42:59 -03:00
11b
186df60691
feat: update inference code for pythia/cai data-based models
2022-12-25 15:39:28 -03:00
11b
3bfb623f26
fix: human/bot messages being incorrectly labeled as each other
2022-12-24 17:58:33 -03:00
11b
5b26097905
feat: implement Gradio UI for proper model inference (WIP)
2022-12-24 12:12:55 -03:00
11b
cef8f54fc4
fix: ignore invalid CAI JSON dumps
2022-12-23 16:45:18 -03:00
11b
d91367e902
chore: update module list in build_dataset.py
2022-12-23 16:45:18 -03:00
11b
a16673ebe0
refactor: adjust Kajiwoto modules to use the proper prompt constants
2022-12-23 16:45:18 -03:00
11b
60e0a21a3c
chore: add pdbpp for better debugging experience
2022-12-23 16:38:13 -03:00
11b
3d6def871d
refactor: use LIGHT as PDM instead of VDM, ignore actions
2022-12-23 16:38:13 -03:00
11b
1f273f13f3
chore: bump pdm version
2022-12-23 16:38:13 -03:00
11b
e0552639fa
feat: update CAI dataset/module to handle userscript dumps and use definitions
2022-12-23 16:38:13 -03:00
11b
aef9289678
chore: update ROADMAP to add links about contributing with CAI dumps
2022-12-23 10:59:58 -03:00
11b
69aeea85b9
chore: reorganize CAI dumper README
2022-12-21 20:05:15 -03:00
11b
7087f39d5a
fix: CAI dumper crashing if chat had no messages
2022-12-21 20:05:01 -03:00
11b
d6e05e6e5b
chore: add changelog to the CAI dumper
2022-12-21 16:24:35 -03:00
11b
e612386424
fix: handle edge case regarding extra whitespace on char name
2022-12-21 16:15:49 -03:00
11b
d638bb5625
feat: update userscript to allow dumping of definitions as well
2022-12-21 16:03:13 -03:00
11b
ecf2e65e76
fix: anonymization within message text
Hopefully for reals this time.
2022-12-21 14:25:50 -03:00
11b
21ebd5834e
chore: fix path links in the README
2022-12-21 13:51:20 -03:00
11b
a8dfd396cc
fix: more aggressive anonymization within message text
2022-12-21 13:43:46 -03:00
11b
6bc2a03ff9
chore: update ROADMAP
2022-12-20 21:52:28 -03:00
11b
1ddd991471
chore: clarify what DHT means in the Discord module
2022-12-20 21:44:16 -03:00
11b
cec59a5511
docs: update CAI dumper README
2022-12-20 21:43:57 -03:00
11b
4f78bb73cb
feat: implement userscript to dump CAI chats/basic bot info
2022-12-20 21:34:39 -03:00
11b
ecd4efe3ce
chore: run isort
2022-12-20 17:55:17 -03:00
11b
009c837439
feat: implement Discord dialogue module
2022-12-20 17:55:05 -03:00
11b
b42131191a
fix: don't write file when printing the dataset for debugging
2022-12-20 17:41:08 -03:00