Implementing a sensitive-word filter in PHP

2024-07-26 13:55:01


I spent some free time over the weekend hacking together a sensitive-word filter. Here is how it is implemented.

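Sensitive words come from two sources: restrictions imposed by China's internet regulators, and content we may want to filter ourselves, such as personal attacks or advertising. Word lists are easy to find with a Google search; there are plenty of them.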

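Filtering with a simple loop of str_replace calls is inefficient: every word in the list costs another pass over the text, so the running time grows with the size of the word list. Plain replacement also cannot catch words that do not match exactly. The usual answer is to build a dictionary tree (trie) first. A plain trie uses a fair amount of memory; a Double-Array Trie or a Ternary Search Tree keeps the speed while saving some of that space. A sensitive-word list is rarely large, though, and a few thousand or even tens of thousands of words are no real burden, so this implementation simply builds a plain trie and then matches the text character by character.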
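To show the shape of the data structure before the full class, here is a minimal sketch of a trie stored as nested PHP arrays. The function names `splitChars`, `buildTrie` and `matchesAt` are made up for this illustration and are not part of the class; the sketch assumes valid UTF-8 input and only covers exact matches.

```php
<?php
// Split a UTF-8 string into characters (PCRE with the /u modifier).
// preg_split() returns false on invalid UTF-8, hence the fallback to [].
function splitChars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Build a trie as nested arrays; the 'end' key marks a complete word.
function buildTrie(array $words): array
{
    $trie = [];
    foreach ($words as $word) {
        $node = &$trie;
        foreach (splitChars($word) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['end'] = true;
        unset($node); // break the reference before the next word
    }
    return $trie;
}

// Return true if some word in the trie starts exactly at position $start.
function matchesAt(array $trie, array $chars, int $start): bool
{
    $node = $trie;
    $count = count($chars);
    for ($i = $start; $i < $count; $i++) {
        if (!isset($node[$chars[$i]])) {
            return false;
        }
        $node = $node[$chars[$i]];
        if (isset($node['end'])) {
            return true;
        }
    }
    return false;
}

$trie = buildTrie(['ab', 'abc', 'xy']);
$chars = splitChars('zabz');
var_dump(matchesAt($trie, $chars, 1)); // a word starts at index 1
var_dump(matchesAt($trie, $chars, 0)); // no word starts at index 0
```

One lookup walks at most as many nodes as the longest word has characters, no matter how many words the trie holds. That is the advantage over looping str_replace across the whole list.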

There isn't much code, so here it is in full.

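<?php

class SensitiveWordFilter
{
    // Trie as nested arrays: character => child node; the 'end' key marks a complete word.
    private $dict;
    private $dictPath;

    public function __construct($dictPath)
    {
        $this->dict = array();
        $this->dictPath = $dictPath;
        $this->initDict();
    }

    // Build the trie from the dictionary file (UTF-8, one word per line).
    private function initDict()
    {
        $handle = fopen($this->dictPath, 'r');
        if (!$handle) {
            throw new RuntimeException('open dictionary file error.');
        }
        while (($line = fgets($handle, 128)) !== false) {
            $word = trim($line);
            if ($word === '') {
                continue;
            }
            $uWord = $this->unicodeSplit($word);
            $pdict = &$this->dict;
            $count = count($uWord);
            for ($i = 0; $i < $count; $i++) {
                if (!isset($pdict[$uWord[$i]])) {
                    $pdict[$uWord[$i]] = array();
                }
                $pdict = &$pdict[$uWord[$i]];
            }
            $pdict['end'] = true;
        }
        fclose($handle);
    }

    // Replace every matched character of a sensitive word with '*'.
    // $maxDistance is how many characters after the previous match are searched
    // for the next character of a word (1 means adjacent characters only).
    // Note: unicodeSplit() lowercases its input, so the returned text is lowercased too.
    public function filter($str, $maxDistance = 5)
    {
        if ($maxDistance < 1) {
            $maxDistance = 1;
        }
        $uStr = $this->unicodeSplit($str);
        $count = count($uStr);
        for ($i = 0; $i < $count; $i++) {
            if (isset($this->dict[$uStr[$i]])) {
                $pdict = &$this->dict[$uStr[$i]];
                $matchIndexes = array();
                // Number of entries in $matchIndexes when the last complete word was seen;
                // -1 means no complete word yet. Without this, a short word is missed
                // when the text continues along the path of a longer word.
                $endCount = isset($pdict['end']) ? 0 : -1;
                for ($j = $i + 1, $d = 0; $d < $maxDistance && $j < $count; $j++, $d++) {
                    if (isset($pdict[$uStr[$j]])) {
                        $matchIndexes[] = $j;
                        $pdict = &$pdict[$uStr[$j]];
                        $d = -1; // reset the distance; the loop increment brings it back to 0
                        if (isset($pdict['end'])) {
                            $endCount = count($matchIndexes);
                        }
                    }
                }
                if ($endCount >= 0) {
                    $uStr[$i] = '*';
                    foreach (array_slice($matchIndexes, 0, $endCount) as $k) {
                        if ($k - $i == 1) {
                            $i = $k; // skip past contiguous matched characters
                        }
                        $uStr[$k] = '*';
                    }
                }
            }
        }
        return implode('', $uStr);
    }

    // Lowercase the string and split it into UTF-8 characters.
    // Bytes that do not form a valid 2- to 4-byte sequence are dropped.
    public function unicodeSplit($str)
    {
        $str = strtolower($str);
        $ret = array();
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);
            if ($c & 0x80) {
                if (($c & 0xf8) == 0xf0 && $len - $i >= 4) {
                    // 4-byte sequence: 11110xxx followed by three 10xxxxxx bytes
                    if ((ord($str[$i + 1]) & 0xc0) == 0x80 && (ord($str[$i + 2]) & 0xc0) == 0x80 && (ord($str[$i + 3]) & 0xc0) == 0x80) {
                        $ret[] = substr($str, $i, 4);
                        $i += 3;
                    }
                } elseif (($c & 0xf0) == 0xe0 && $len - $i >= 3) {
                    // 3-byte sequence: 1110xxxx followed by two 10xxxxxx bytes
                    if ((ord($str[$i + 1]) & 0xc0) == 0x80 && (ord($str[$i + 2]) & 0xc0) == 0x80) {
                        $ret[] = substr($str, $i, 3);
                        $i += 2;
                    }
                } elseif (($c & 0xe0) == 0xc0 && $len - $i >= 2) {
                    // 2-byte sequence: 110xxxxx followed by one 10xxxxxx byte
                    if ((ord($str[$i + 1]) & 0xc0) == 0x80) {
                        $ret[] = substr($str, $i, 2);
                        $i += 1;
                    }
                }
            } else {
                // Single-byte (ASCII) character
                $ret[] = $str[$i];
            }
        }
        return $ret;
    }
}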

Usage

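<?php
require 'SensitiveWordFilter.php';

/*
 * Pass the path of the dictionary file to the constructor.
 * The file holds one word per line, for example:
 *   敏感1
 *   敏感2
 * Only UTF-8 encoding is supported for now.
 */
$filter = new SensitiveWordFilter(__DIR__ . '/sensitive_words.txt');

/*
 * The first argument is the string to filter; the second is the maximum
 * distance between consecutive characters of a word. For example, if '枪支'
 * (firearms) is a sensitive word and '枪||||支' should be filtered as well,
 * the distance must be at least 5, because '支' is the fifth character after
 * '枪'. Choose a value that suits your content; characters farther apart than
 * that are not filtered.
 * Every matched character of a sensitive word is replaced with '*'.
 */
$filter->filter('这是一个敏感词', 10);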
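To make the distance parameter concrete, the sketch below applies the same rule to a single word without the trie. `matchesWithGap` is a hypothetical helper written for this illustration, not a method of the class. It assumes valid UTF-8 and only checks a match that begins at the first character of the text.

```php
<?php
// Hypothetical helper: does $word match at the start of $text when filler
// characters are allowed between consecutive characters of the word?
// Each next character of the word must appear within $maxDistance
// positions after the previously matched one.
function matchesWithGap(string $word, string $text, int $maxDistance): bool
{
    $w = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $t = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    if ($w === [] || $t === [] || $w[0] !== $t[0]) {
        return false;
    }
    $wordLen = count($w);
    $textLen = count($t);
    $next = 1; // index of the next word character to find
    $d = 0;    // characters examined since the last match
    for ($j = 1; $next < $wordLen && $j < $textLen && $d < $maxDistance; $j++, $d++) {
        if ($t[$j] === $w[$next]) {
            $next++;
            $d = -1; // reset; the loop increment brings it back to 0
        }
    }
    return $next === $wordLen;
}

var_dump(matchesWithGap('枪支', '枪||||支', 5)); // '支' is 5 positions after '枪'
var_dump(matchesWithGap('枪支', '枪||||支', 4)); // distance too small
```

A larger distance catches more obfuscated spellings but also raises the chance of false positives, since unrelated characters that happen to follow one another at a distance start to count as a match.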

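I haven't benchmarked it in detail, but it is fast enough for typical use. The cost is mostly CPU. The built trie can be JSON-encoded and stored in Redis or Memcached, then fetched and decoded on the next request instead of being rebuilt.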
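A minimal sketch of that caching idea follows. Everything here is an assumption for illustration: the function names, the cache key `sensitive_trie`, and the `$get`/`$set` closures, which stand in for whatever Redis or Memcached client is in use (an in-memory array plays the cache below). `JSON_THROW_ON_ERROR` needs PHP 7.3 or later.

```php
<?php
// Serialize a trie (nested arrays) to JSON for storage in a cache.
function encodeTrie(array $trie): string
{
    return json_encode($trie, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
}

// Restore the nested-array trie from its JSON form.
function decodeTrie(string $json): array
{
    return json_decode($json, true, 512, JSON_THROW_ON_ERROR);
}

// Hypothetical loader: try the cache first, rebuild and store on a miss.
// $get returns the cached string or false; $set stores a string; $build
// produces the trie from the dictionary.
function loadTrie(callable $get, callable $set, callable $build): array
{
    $json = $get('sensitive_trie');
    if (is_string($json) && $json !== '') {
        return decodeTrie($json);
    }
    $trie = $build();
    $set('sensitive_trie', encodeTrie($trie));
    return $trie;
}

// In-memory stand-in for the cache client.
$store = [];
$get = function (string $key) use (&$store) {
    return $store[$key] ?? false;
};
$set = function (string $key, string $value) use (&$store) {
    $store[$key] = $value;
};
// Stand-in for building the trie from the dictionary file.
$build = function () {
    return ['a' => ['b' => ['end' => true]]];
};

$first = loadTrie($get, $set, $build);  // cache miss: builds and stores
$second = loadTrie($get, $set, $build); // cache hit: decoded from JSON
```

To use this with the class above, it would need a way to export and import its private `$dict`, which the original code does not provide.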

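PHP serving web requests does not run as a long-lived daemon, so the data structure cannot stay resident in memory between requests. C, C++, or Java are arguably a better fit; if performance requirements are strict, the filter can be written as a separate service in another language. PHP does have Swoole for this, but personally I am not optimistic about it.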

