40 lines
1.7 KiB
Markdown
40 lines
1.7 KiB
Markdown
# 17.13. Re-space
|
|
|
|
## The spaces from a sentence have been removed. Most words are present in the dictionary but some aren't. Given a dictionary (a list of strings) and the document (a string), unconcatenate the document in a way that minimizes the number of unrecognized characters
|
|
|
|
> example: 'jesslookedjustliketimherbrother' -> 'jess looked just like tim her brother'
|
|
|
|
## First idea
|
|
|
|
This is complicated... I could start first by unconcatenating the sentences considering they're all present in the dictionary, and then optimize. How do I unconcatenate in the first place? .tolower() first
|
|
|
|
* Easiest case: 'iamalive' -> 'i am alive', only one possibility
|
|
* Concatenation: 'theylookedible' -> 'they look edible', but 'looked' and 'edible' concatenate.
|
|
|
|
I could do swaps of n-grams, start with the longest words and go downwards:
|
|
|
|
> example: 'iamalive'
|
|
|
|
1. swap of len(s) = 8, 'iamalive' in dict? no
|
|
2. swap = 7, 'iamaliv', 'amalive' in dict? no
|
|
3. swap = 6, 'iamali', 'amaliv', 'malive' in dict? no
|
|
4. swap = 5, ¡'iamal', ..., 'alive'. 'alive' found and saved. How?
|
|
5. swap = 4, nothing
|
|
6. swap = 3, nothing
|
|
7. swap = 2, 'am' found
|
|
8. swap = 1, 'i' found.
|
|
|
|
All the string is 'covered', done.
|
|
|
|
> example: 'theylookedible'
|
|
|
|
1. swap of len(s) = 10, nothing
|
|
2. wap = 6, 'looked', 'edible' saved
|
|
3. swap = 4, 'they', 'look'
|
|
4. swap = 3, 'the'
|
|
5. swap = 2, 'he' found
|
|
6. swap = 1, nothing
|
|
|
|
How to resolve the conflict of swap = 6? We could look for 'look' and 'ible'. If we find none, choose one randomly and 4 unrecognized words. If we find 'look', assign 'look' and 'edible' and discard 'looked'. How to save these? Don't know. This algorithm is O(n<sup>2</sup>)
|
|
|
|
Also, once the substrings are saved, how to check for string completeness to break the loop and return? |