If you're interested in the project's current status, please take a look at the [ROADMAP](./ROADMAP.md) instead, or join the Matrix server for more frequent updates.
- These are essentially classes responsible for giving us raw data. They might, for example, download a `.zip` off the internet, unzip it, read a `.json` file from inside, and return its contents.
- These are heavily inspired by the papers that introduced LaMDA and BlenderBot3 (and their relevant supporting papers as well).
- In general, each module is responsible for taking a dataset as input and processing that data down into text to be used in the fine-tuning process.
- A final data file is produced by concatenating the outputs of all the modules. This file is used as an input for the fine-tuning process.
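As a rough sketch of the dataset → module flow described above (class and method names here are illustrative assumptions, not the project's actual API):

```python
# Illustrative sketch of the dataset -> module pipeline.
# Names are assumptions for illustration, not the project's real classes.
import json
import zipfile
from pathlib import Path


class ZipJsonDataset:
    """Data source: reads a .json file out of a downloaded .zip archive."""

    def __init__(self, zip_path: Path, member: str):
        self.zip_path = zip_path
        self.member = member

    def load(self) -> list[dict]:
        with zipfile.ZipFile(self.zip_path) as zf:
            return json.loads(zf.read(self.member))


class DialogueModule:
    """Module: turns raw records into plain-text fine-tuning episodes."""

    def episodes(self, records: list[dict]):
        for record in records:
            yield "\n".join(
                f"{turn['speaker']}: {turn['text']}" for turn in record["turns"]
            )
```

With several such modules, building the final file amounts to iterating over each module's episodes and concatenating them into a single output.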
File to write the dataset to. Should not include a file extension.
-m MODULES, --modules MODULES
List of modules to use, comma-separated.
-p PRINT, --print PRINT
If given, print this many episodes instead of writing out to a file.
-v, --verbose Enable verbose logging.
```
The default behavior is to write a file called `rev-{GIT_REVISION_HASH}-args{HASH_OF_USED_ARGS}.jsonl` to the current directory, with all modules enabled. This behavior can be customized via the flags shown above.
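The default filename can be thought of as the current Git revision plus a hash of the arguments the script was invoked with. A minimal sketch of that idea (the function name and hashing details are assumptions, not the project's actual code):

```python
# Hypothetical sketch of deriving the default output filename from the
# Git revision and a hash of the CLI arguments. Details are illustrative.
import hashlib


def default_output_name(git_rev: str, args: list[str]) -> str:
    # Hash a canonical representation of the arguments so the same
    # invocation always maps to the same filename.
    args_hash = hashlib.sha256(",".join(sorted(args)).encode()).hexdigest()[:8]
    return f"rev-{git_rev}-args{args_hash}.jsonl"
```

This makes it easy to tell, from the filename alone, which code revision and which argument set produced a given data file.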
The script also has an option to print some examples instead of writing to a file, for debugging/dev purposes. Example usage:
The tower is the largest section of the castle. It contains an observatory for nighttime scouting, but is also used by the wise men to study the stars. Armed
guardsmen are always to be found keeping watch.
There's an alarm horn here.
A soldier is here. You are carrying nothing.
Court Wizard: A quiet night this evening...
Soldier: Yes it is
Court Wizard: *ponder* Have any else come up this eve? I had hoped for a quiet night to examine the stars
Soldier: *nod* Yes, a few came through, but it is a cold night for me, I am used to warmer weather
Court Wizard: *sigh* Well, you are but a common soldier. No doubt you are used to such a lot. Thankfully I have my spells to keep me warm.
Soldier: *grin* I am a soldier doing my job
Court Wizard: Yes... Well... Very well then. See that you do! No slacking off while your betters are about.
Soldier: No sir
Court Wizard: When, for example, was this horn last tested? It looks dented. How can we be sure it will work?
Soldier: A year ago, test it out or cause a need to use it
Court Wizard: *frown* Mayhap I will speak to the king about such lackness. Or perhaps I can sell him a spell that will serve just as well.
Soldier: Good idea, I agree, go do that *hug court wizard*
Court Wizard: Get off of me, you fool! Who gave you permission to touch me! *hit soldier*
Due to hardware limitations (read: lack of GPUs with massive amounts of VRAM), I need to make use of ColossalAI's optimizations to be able to fine-tune models. However, their example code for fine-tuning OPT lacks some important features, notably metric logging (so we can tell what is going on) and checkpoint saving/loading.
I've gone ahead and, using [their example scripts](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/opt) as a starting point, made a slightly adjusted version that's actually usable in real-world scenarios. All of that is inside the [training folder](./training/).
If you don't want to mess with anything, all you need to do is put the built data file at `/training/data/train.json` and invoke [finetune.bash](./training/finetune.bash). To see metrics, you can use TensorBoard by visiting http://localhost:6006 after starting the server like this: