820.Short Encoding of Words¶

820.Short Encoding of Words¶

Tags: String Medium

Links: https://leetcode.com/problems/short-encoding-of-words/

Given a list of words, we may encode it by writing a reference string S and a list of indexes A.

For example, if the list of words is ["time", "me", "bell"], we can write it as S = "time#bell#" and indexes = [0, 2, 5].

Then for each index, we will recover the word by reading from the reference string from that index until we reach a "#" character.

What is the length of the shortest reference string S possible that encodes the given words?

Example:

Input: words = ["time", "me", "bell"]
Output: 10
Explanation: S = "time#bell#" and indexes = [0, 2, 5].

Note:

1 <= words.length <= 2000.
1 <= words[i].length <= 7.
Each word has only lowercase letters.

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        std::ios_base::sync_with_stdio(false);
        cin.tie(NULL);
        cout.tie(NULL);

        sort(words.begin(), words.end(), [](const auto & s1, const auto &s2){
            return s1.size() > s2.size();
        });    

        int n = words.size(); int pre = 0;
        vector<bool> used(n, false);
        vector<string> res;
        for (int i = 0; i < n; ++i) {
            if (!used[i]) {
                res.push_back(words[i]);
                used[i] = true;
                for (int j = i + 1; j < n; ++j) {
                    if (used[j]) continue;
                    int pos = words[i].size() - words[j].size();
                    if (words[i].substr(pos) == words[j]) {
                        used[j] = true;
                    }
                }
            }
        }
        int len = 0;
        for (const auto & e : res) len += e.size();
        len += res.size(); //每一个单词后面加上'#'

        return len;
    }
};

最初的暴力写法，将字符串按长度排序，长的排在前面，因为很显然，只有短的字符串可以是长的字符串的后缀，然后用一个标记数组去标记每个单词是否被用过了，时间复杂度 $O(n^2)$ 。

上面的思路起始已经很接近最终结果了，仍然是先按字符串长度排序，维护一个最终结果的字符串，对于一个新的单词，去查找这个新单词在结果字符串里面的位置，看看其是否能够成某个字符串的后缀，找到了就无需添加，没有找到就加入到结果字符串里面。

就拿样例来说，排序后是

time bell me
//初始结果字符串res为空，访问time，不在res里面
res += "time" + "#"
//访问bell，不在res里面，加入
res += "bell" + "#
//访问me，在res里面，不做处理
//于是得到最终结果是"time#bell#"，长度为10

另外，不能迷信测试用例，下面这种写法也可以AC，但是可以用下面这个测试用例把它hack掉：

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        std::ios_base::sync_with_stdio(false);
        cin.tie(NULL);
        cout.tie(NULL);

        sort(words.begin(), words.end(), [](const auto & s1, const auto &s2){
            return s1.size() > s2.size();
        });    

        string res;
        for (auto & e: words) {
            auto pos = res.find(e);
            if (pos == string::npos || res[pos + e.size()] != '#') {
                res += e + "#";
            }
        }

        return res.size();
    }
};

["smetmhh", "time", "me"]
因为find会找第一个匹配的，所以先搜索到的是smetmhh里面的me，显然不满足，于是就会把me当作一个无法完成匹配的加入到res里面，但很显然，LeetCode上的测试用例并没有构造这种类型，很显然，找第一个匹配和最后一个匹配，都有办法构造用例使其失效，所以上面的方法仍然需要改进。

所以需要改进上面的算法，既然后缀不好处理，那就把每个字符串反转，于是后缀变前缀，这时候我们就可以利用标准库的排序了，因为如果能够成匹配关系的，一定是长度小的在前面，长度大的在后面，于是只需要检测前一个字符串能否构成后一个字符串的前缀，那么只需要用一个整型数res来记录长度，免去了构造字符串。注意最后一个字符串属于边界部分，单独处理。

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        int res = 0, n = words.size();
        for (int i = 0; i < n; ++i) reverse(words[i].begin(), words[i].end());
        sort(words.begin(), words.end());

        for (int i = 0; i < n - 1; ++i) {
            res += (words[i] == words[i + 1].substr(0, words[i].size())) ? 0  : words[i].size() + 1;
        }
        return res + words.back().size() + 1;
    }
};

另外看到一种很巧妙的办法，先把字符串存入到一个set里面，自然把重复的字符串都去掉了，然后对于set里的每个单词，去遍历自身的所有后缀，如set里面存在就erase。

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        int res = 0;
        unordered_set<string> st(words.begin(), words.end());
        for (const string &word : st) {
            for (int i = 1; i < word.size(); ++i) {
                st.erase(word.substr(i));
            }
        }
        for (const string &word : st) res += word.size() + 1;
        return res;
    }
};