Word breaking for the languages mentioned requires a linguistic approach, for example one that uses a dictionary along with an understanding of basic stemming rules.
I've heard of relatively successful full-text search applications for Chinese that simply treat every single character as a separate word, and apply the same "tokenization" to the search criteria supplied by the end users. The search engine then ranks higher the documents that contain these character-words in the same order as the search criteria.
I'm not sure this could be extended to languages such as Japanese, as the Hiragana and Katakana character sets make the text more akin to European languages with a short alphabet.
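To make the character-as-word idea concrete, here is a minimal sketch. The function names (`char_tokens`, `order_score`, `search`) are made up for illustration, and scoring by shared adjacent character pairs is only one assumed way to reward "same order". It is not a description of how any particular search engine does it.

```python
def char_tokens(text):
    """Treat every non-whitespace character as its own 'word'."""
    return [ch for ch in text if not ch.isspace()]


def order_score(query, document):
    """Count adjacent query-character pairs that are also adjacent in the document."""
    q = char_tokens(query)
    d = char_tokens(document)
    doc_pairs = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_pairs)


def search(query, documents):
    """Keep documents containing every query character; best order match first."""
    wanted = set(char_tokens(query))
    hits = [doc for doc in documents if wanted <= set(char_tokens(doc))]
    return sorted(hits, key=lambda doc: order_score(query, doc), reverse=True)


if __name__ == "__main__":
    docs = ["京东在北方", "我在北京", "上海"]
    # Both of the first two documents contain 北 and 京, but only one has them in order.
    print(search("北京", docs))
```

A real index would store character positions rather than rescanning each document, but the ranking principle would be the same.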
EDIT:
Resources
This word breaking problem, as well as related issues, is so non-trivial that whole books are written about it. See for example CJKV Information Processing (CJKV stands for Chinese, Japanese, Korean and Vietnamese; you may also use the CJK keyword, for in many texts, Vietnamese is not discussed). See also Word Breaking in Japanese is hard for a one-pager on this topic.
Understandably, the majority of the material covering this topic is written in one of the underlying native languages, and is therefore of limited use to people without relative fluency in these languages. For that reason, and also to help you validate the search engine once you start implementing the word breaker logic, you should seek the help of a native speaker or two.
Various ideas
Your idea of identifying characters which systematically imply a word break (say quotes, parentheses, hyphen-like characters and such) is good, and that is probably one heuristic used by some of the professional-grade word breakers. Yet you should seek an authoritative source for such a list, rather than assembling one from scratch based on anecdotal findings.
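One possible starting point for such a list, offered as an assumption rather than a recommendation from a Japanese-specific source, is the Unicode general category of each character, which Python exposes through `unicodedata.category`. The sketch below treats every punctuation (`P*`) and separator (`Z*`) character as a hard break. The helper names are hypothetical.

```python
import unicodedata


def is_hard_break(ch):
    """True for characters whose Unicode general category is punctuation or separator."""
    return unicodedata.category(ch)[0] in ("P", "Z")


def split_on_hard_breaks(text):
    """Split text into segments at punctuation and separator characters."""
    segments, current = [], []
    for ch in text:
        if is_hard_break(ch):
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments


if __name__ == "__main__":
    print(split_on_hard_breaks("「東京」へ行く。"))
```

The segments produced this way are still longer than words. They only guarantee that no word spans one of these characters, so a finer heuristic or a dictionary is still needed inside each segment.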
A related idea is to break words at Kana-to-Kanji transitions (but I'm guessing not the other way around), and possibly at Hiragana-to-Katakana or vice-versa transitions.
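The transition rule can be sketched as follows. The script classification uses the Unicode Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (U+4E00 to U+9FFF) blocks. Everything else, including the function names and the decision to break around any "other" character, is an assumption made for illustration.

```python
KANA = {"hiragana", "katakana"}


def script_of(ch):
    """Classify a character by Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def is_break(prev, cur):
    """Break on any script change except kanji followed by kana."""
    if prev == cur:
        return False
    if prev == "kanji" and cur in KANA:
        return False  # kanji-to-kana: keep together (the guess made in the text)
    return True


def split_on_script_transitions(text):
    """Cut text wherever is_break() says a new chunk starts."""
    chunks = []
    prev = None
    for ch in text:
        cur = script_of(ch)
        if chunks and not is_break(prev, cur):
            chunks[-1] += ch
        else:
            chunks.append(ch)
        prev = cur
    return chunks


if __name__ == "__main__":
    print(split_on_script_transitions("私は東京でラーメンを食べる"))
```

Note that the chunks keep trailing particles attached (for example 東京で), so this is a coarse pre-segmentation rather than true word breaking.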
Unrelated to word breaking proper, the index may [ -or may not- ;-)] benefit from the systematic conversion of every, say, hiragana character to the corresponding katakana character. Just an uneducated idea! I do not know enough about the Japanese language to know if that would help; intuitively, it would be loosely akin to the systematic conversion of accented letters and such to the corresponding unaccented letters, as practiced with several European languages.
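For what it is worth, the conversion itself is mechanical: in Unicode, the hiragana letters U+3041 to U+3096 sit exactly 0x60 code points below their katakana counterparts. The sketch below folds hiragana to katakana with `str.translate`, and shows the European analogue (dropping combining accents after NFD normalization) for comparison. Whether this folding actually improves Japanese search results is the open question raised above, and the function names are illustrative.

```python
import unicodedata

# Hiragana U+3041..U+3096 map to katakana U+30A1..U+30F6 at a fixed offset.
HIRAGANA_TO_KATAKANA = {code: code + 0x60 for code in range(0x3041, 0x3097)}


def fold_kana(text):
    """Replace every hiragana letter with the corresponding katakana letter."""
    return text.translate(HIRAGANA_TO_KATAKANA)


def strip_accents(text):
    """European analogue: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_kana("らーめん"))
    print(strip_accents("café"))
```

The same folding would have to be applied both at indexing time and to the search criteria, otherwise the two sides would no longer match.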
Maybe the idea I mentioned earlier, of systematically indexing individual characters (and of ranking the search results based on how closely their order matches the search criteria), can be slightly altered, for example by keeping consecutive kana characters together, and then adding a few other rules... and produce an imperfect but practical enough search engine.
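A rough sketch of this hybrid tokenization: every character is its own token, except that runs of consecutive kana are kept together. Keeping hiragana runs and katakana runs separate from each other is an extra assumption of this sketch, as are the names `kind` and `hybrid_tokens`.

```python
from itertools import groupby


def kind(ch):
    """Label a character as hiragana, katakana, or 'single' (indexed on its own)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "single"


def hybrid_tokens(text):
    """One token per character, except that kana runs are kept as one token."""
    tokens = []
    visible = (ch for ch in text if not ch.isspace())
    for label, group in groupby(visible, key=kind):
        chars = list(group)
        if label == "single":
            tokens.extend(chars)            # e.g. each kanji separately
        else:
            tokens.append("".join(chars))   # a whole kana run
    return tokens


if __name__ == "__main__":
    print(hybrid_tokens("東京でラーメンを食べる"))
```

The resulting tokens could then feed the same order-sensitive ranking as the purely character-based approach.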
Do not be disappointed if that is not the case... As stated, this is far from trivial, and pausing to read a book or two may save you time and money in the long run. Another reason to learn more of the "theory" and the best practices is that at the moment you seem to be focused on word breaking, but soon the search engine may also benefit from stemming-awareness; indeed these two issues are, linguistically at least, related, and may benefit from being handled in tandem.
Good luck with this vexing but worthy endeavor.