class PromptConstants:
    """String constants related to prompt engineering."""

    # Prefix for user messages.
    USER_PREFIX = "You"

    # Token to be replaced with the user's display name within bot messages.
    USER_TOKEN = "<USER>"

    # Token to be replaced by the bot's name.
    BOT_TOKEN = "<BOT>"

    # Should be kept in sync with the relevant model that will be trained.
    # This is taken from EleutherAI's Pythia (so, GPT-NeoX).
    EOS_TOKEN = "<|endoftext|>"

    # Token to separate prompt trickery from actual dialogue.
    CHAT_START_TOKEN = "<START>"

    # Global target word count. The word count is chosen in such a way that we
    # can fit all the required prompt trickery into the model's input, but
    # still leave enough space for the user's input message and the inference
    # result.
    TARGET_WORD_COUNT_PER_EPISODE = 1024

    @staticmethod
    def pdm_prefix_for(name: str) -> str:
        """Build the Persona Dialogue Module prefix for the given `name`."""
        return name + "'s Persona"