跳转至

820.Short Encoding of Words

Tags: String Medium

Links: https://leetcode.com/problems/short-encoding-of-words/


Given a list of words, we may encode it by writing a reference string S and a list of indexes A.

For example, if the list of words is ["time", "me", "bell"], we can write it as S = "time#bell#" and indexes = [0, 2, 5].

Then for each index, we will recover the word by reading from the reference string from that index until we reach a "#" character.

What is the length of the shortest reference string S possible that encodes the given words?

Example:

Input: words = ["time", "me", "bell"]
Output: 10
Explanation: S = "time#bell#" and indexes = [0, 2, 5].

Note:

  1. 1 <= words.length <= 2000.
  2. 1 <= words[i].length <= 7.
  3. Each word has only lowercase letters.

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        std::ios_base::sync_with_stdio(false);
        cin.tie(NULL);
        cout.tie(NULL);

        sort(words.begin(), words.end(), [](const auto & s1, const auto &s2){
            return s1.size() > s2.size();
        });    

        int n = words.size(); int pre = 0;
        vector<bool> used(n, false);
        vector<string> res;
        for (int i = 0; i < n; ++i) {
            if (!used[i]) {
                res.push_back(words[i]);
                used[i] = true;
                for (int j = i + 1; j < n; ++j) {
                    if (used[j]) continue;
                    int pos = words[i].size() - words[j].size();
                    if (words[i].substr(pos) == words[j]) {
                        used[j] = true;
                    }
                }
            }
        }
        int len = 0;
        for (const auto & e : res) len += e.size();
        len += res.size(); //每一个单词后面加上'#'

        return len;
    }
};

最初的暴力写法,将字符串按长度排序,长的排在前面,因为很显然,只有短的字符串可以是长的字符串的后缀,然后用一个标记数组去标记每个单词是否被用过了,时间复杂度O(n^2)

上面的思路起始已经很接近最终结果了,仍然是先按字符串长度排序,维护一个最终结果的字符串,对于一个新的单词,去查找这个新单词在结果字符串里面的位置,看看其是否能够成某个字符串的后缀,找到了就无需添加,没有找到就加入到结果字符串里面。

就拿样例来说,排序后是

time bell me
//初始结果字符串res为空,访问time,不在res里面
res += "time" + "#"
//访问bell,不在res里面,加入
res += "bell" + "#
//访问me,在res里面,不做处理
//于是得到最终结果是"time#bell#",长度为10

另外,不能迷信测试用例,下面这种写法也可以AC,但是可以用下面这个测试用例把它hack掉:

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        std::ios_base::sync_with_stdio(false);
        cin.tie(NULL);
        cout.tie(NULL);

        sort(words.begin(), words.end(), [](const auto & s1, const auto &s2){
            return s1.size() > s2.size();
        });    

        string res;
        for (auto & e: words) {
            auto pos = res.find(e);
            if (pos == string::npos || res[pos + e.size()] != '#') {
                res += e + "#";
            }
        }

        return res.size();
    }
};
["smetmhh", "time", "me"]
因为find会找第一个匹配的,所以先搜索到的是smetmhh里面的me,显然不满足,于是就会把me当作一个无法完成匹配的加入到res里面,但很显然,LeetCode上的测试用例并没有构造这种类型,很显然,找第一个匹配和最后一个匹配,都有办法构造用例使其失效,所以上面的方法仍然需要改进。

所以需要改进上面的算法,既然后缀不好处理,那就把每个字符串反转,于是后缀变前缀,这时候我们就可以利用标准库的排序了,因为如果能够成匹配关系的,一定是长度小的在前面,长度大的在后面,于是只需要检测前一个字符串能否构成后一个字符串的前缀,那么只需要用一个整型数res来记录长度,免去了构造字符串。注意最后一个字符串属于边界部分,单独处理。

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        int res = 0, n = words.size();
        for (int i = 0; i < n; ++i) reverse(words[i].begin(), words[i].end());
        sort(words.begin(), words.end());

        for (int i = 0; i < n - 1; ++i) {
            res += (words[i] == words[i + 1].substr(0, words[i].size())) ? 0  : words[i].size() + 1;
        }
        return res + words.back().size() + 1;
    }
};

另外看到一种很巧妙的办法,先把字符串存入到一个set里面,自然把重复的字符串都去掉了,然后对于set里的每个单词,去遍历自身的所有后缀,如set里面存在就erase

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        int res = 0;
        unordered_set<string> st(words.begin(), words.end());
        for (const string &word : st) {
            for (int i = 1; i < word.size(); ++i) {
                st.erase(word.substr(i));
            }
        }
        for (const string &word : st) res += word.size() + 1;
        return res;
    }
};