- Parse only known sentence structures in order to avoid the irbO-problem
- Language separation
- Structure of sentence must be stored in xml
- Prohibited references to avoid false friends
- Use Levenshtein distance or a similar metric to match and reference words in different languages
- Keep a few references «in mind» in order to detect logically coherent topics
- Data format of memories is xml, structured in grammatical order:
  <entry sha1="...">
    <grammar verb="yes" noun="yes" adjective="no" adverb="no">
      <structure>...</structure>
      <verb>
        <infinitive>...</infinitive>
        ...
      </verb>
      <noun> ... </noun>
    </grammar>
    <references>
      <reference author="..." timestamp="1234" lang="fr" file="913fc115a31323....xml" />
      <reference prohibited="..." ... />
    </references>
    <synonyms>
      <synonym lang="..." file="..." />
      <synonym context="contextfile.xml" lang="..." file="..." />
    </synonyms>
  </entry>
- Port references from synonyms to other languages
- Maybe use sha1 hashes and symlinks to create filenames to speed up lookups
- Maybe add common typos with contextual references to an extra section
- Attributes section to cache common reference lookups?
- Ask before registering changes <- maybe in debug mode only?
- Ask about unknown sentences/structures
- Ask whenever unsure or when information is missing -> Maybe use a score to decide how sure to be, with a threshold that triggers a question even if information is available?
- Dontref attributes for common words like «is», «and», etc.
- Data format example for sentences (to parse and generate sentences):
  <sentence>
    <noun person="1" tense="singular"/>
    <verb person="1" tense="singular"/>
    <adjective person="1" tense="singular"/>
  </sentence>
- Alternate idea: references="index" -> easier way to identify person and tense
- Maybe add tempus to references in sentence data format
- Maybe add lookup function to find suitable sentence structures to express things
- Special case: try to detect questions (special mark in sentence structures?) and try to answer them based on the information gathered
- Ignore patterns for stuff like «ok» and «heh»
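
A rough sketch, in Python, of how a few of the ideas above might fit together: Levenshtein matching between words, prohibited references to skip false friends, sha1-based file names, and a score threshold that decides when to ask. Every name here (levenshtein, similarity, entry_filename, best_match, ASK_THRESHOLD), the "lang:word" hashing scheme and the threshold value 0.75 are assumptions made for illustration; none of them is fixed by these notes.

```python
import hashlib


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def entry_filename(word: str, lang: str) -> str:
    """Hypothetical naming scheme: sha1 of 'lang:word' plus '.xml'."""
    digest = hashlib.sha1(f"{lang}:{word}".encode("utf-8")).hexdigest()
    return digest + ".xml"


ASK_THRESHOLD = 0.75  # assumed value; would need tuning on real data


def best_match(word, candidates, prohibited=frozenset()):
    """Return (candidate, score, should_ask) for the closest allowed candidate.

    Candidates listed in `prohibited` are skipped (the false-friend case).
    should_ask is True when nothing matched or the score is below the threshold.
    """
    scored = [(similarity(word, c), c) for c in candidates if c not in prohibited]
    if not scored:
        return None, 0.0, True
    score, candidate = max(scored)
    return candidate, score, score < ASK_THRESHOLD
```

With these definitions, best_match("house", ["haus", "maus"]) picks "haus" but still flags the match as one to ask about, because its score stays under the assumed threshold. Passing prohibited={"Gift"} when matching the English word "gift" against the German "Gift" leaves no candidate at all, which is the behaviour the prohibited-reference idea is meant to give.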