820.Short Encoding of Words¶
Tags: String
Medium
Links: https://leetcode.com/problems/short-encoding-of-words/
Given a list of words, we may encode it by writing a reference string S
and a list of indexes A
.
For example, if the list of words is ["time", "me", "bell"]
, we can write it as S = "time#bell#"
and indexes = [0, 2, 5]
.
Then for each index, we will recover the word by reading from the reference string from that index until we reach a "#"
character.
What is the length of the shortest reference string S possible that encodes the given words?
Example:
Input: words = ["time", "me", "bell"]
Output: 10
Explanation: S = "time#bell#" and indexes = [0, 2, 5].
Note:
1 <= words.length <= 2000
.1 <= words[i].length <= 7
.- Each word has only lowercase letters.
class Solution {
public:
int minimumLengthEncoding(vector<string>& words) {
std::ios_base::sync_with_stdio(false);
cin.tie(NULL);
cout.tie(NULL);
sort(words.begin(), words.end(), [](const auto & s1, const auto &s2){
return s1.size() > s2.size();
});
int n = words.size(); int pre = 0;
vector<bool> used(n, false);
vector<string> res;
for (int i = 0; i < n; ++i) {
if (!used[i]) {
res.push_back(words[i]);
used[i] = true;
for (int j = i + 1; j < n; ++j) {
if (used[j]) continue;
int pos = words[i].size() - words[j].size();
if (words[i].substr(pos) == words[j]) {
used[j] = true;
}
}
}
}
int len = 0;
for (const auto & e : res) len += e.size();
len += res.size(); //每一个单词后面加上'#'
return len;
}
};
最初的暴力写法,将字符串按长度排序,长的排在前面,因为很显然,只有短的字符串可以是长的字符串的后缀,然后用一个标记数组去标记每个单词是否被用过了,时间复杂度O(n^2)。
上面的思路起始已经很接近最终结果了,仍然是先按字符串长度排序,维护一个最终结果的字符串,对于一个新的单词,去查找这个新单词在结果字符串里面的位置,看看其是否能够成某个字符串的后缀,找到了就无需添加,没有找到就加入到结果字符串里面。
就拿样例来说,排序后是
time bell me
//初始结果字符串res为空,访问time,不在res里面
res += "time" + "#"
//访问bell,不在res里面,加入
res += "bell" + "#
//访问me,在res里面,不做处理
//于是得到最终结果是"time#bell#",长度为10
另外,不能迷信测试用例,下面这种写法也可以AC,但是可以用下面这个测试用例把它hack
掉:
class Solution {
public:
int minimumLengthEncoding(vector<string>& words) {
std::ios_base::sync_with_stdio(false);
cin.tie(NULL);
cout.tie(NULL);
sort(words.begin(), words.end(), [](const auto & s1, const auto &s2){
return s1.size() > s2.size();
});
string res;
for (auto & e: words) {
auto pos = res.find(e);
if (pos == string::npos || res[pos + e.size()] != '#') {
res += e + "#";
}
}
return res.size();
}
};
["smetmhh", "time", "me"]
因为find会找第一个匹配的,所以先搜索到的是smetmhh里面的me,显然不满足,于是就会把me当作一个无法完成匹配的加入到res里面,但很显然,LeetCode上的测试用例并没有构造这种类型,很显然,找第一个匹配和最后一个匹配,都有办法构造用例使其失效,所以上面的方法仍然需要改进。
所以需要改进上面的算法,既然后缀不好处理,那就把每个字符串反转,于是后缀变前缀,这时候我们就可以利用标准库的排序了,因为如果能够成匹配关系的,一定是长度小的在前面,长度大的在后面,于是只需要检测前一个字符串能否构成后一个字符串的前缀,那么只需要用一个整型数res
来记录长度,免去了构造字符串。注意最后一个字符串属于边界部分,单独处理。
class Solution {
public:
int minimumLengthEncoding(vector<string>& words) {
int res = 0, n = words.size();
for (int i = 0; i < n; ++i) reverse(words[i].begin(), words[i].end());
sort(words.begin(), words.end());
for (int i = 0; i < n - 1; ++i) {
res += (words[i] == words[i + 1].substr(0, words[i].size())) ? 0 : words[i].size() + 1;
}
return res + words.back().size() + 1;
}
};
另外看到一种很巧妙的办法,先把字符串存入到一个set
里面,自然把重复的字符串都去掉了,然后对于set
里的每个单词,去遍历自身的所有后缀,如set
里面存在就erase
。
class Solution {
public:
int minimumLengthEncoding(vector<string>& words) {
int res = 0;
unordered_set<string> st(words.begin(), words.end());
for (const string &word : st) {
for (int i = 1; i < word.size(); ++i) {
st.erase(word.substr(i));
}
}
for (const string &word : st) res += word.size() + 1;
return res;
}
};