Tools for morphological analysis
These split text into meaningful words (morphemes) and also assign a part of speech to each one.
MeCab
Official site: http://taku910.github.io/mecab/
Get the source archive from the "Download" link on that page.
[root@c ~]# gzip -cd mecab-0.996.tar.gz | tar xf -
[root@c ~]# cd mecab-0.996
[root@c mecab-0.996]#
[root@c mecab-0.996]# ./configure --with-charset=utf8
[root@c mecab-0.996]# make && make check && make install
Dictionary
[root@c ~]# mkdir /opt/mecab && cd $_
[root@c mecab]# git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
[root@c mecab]# cd mecab-ipadic-neologd/
[root@c mecab-ipadic-neologd]# ./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] : find => ok
[install-mecab-ipadic-NEologd] : sort => ok
[install-mecab-ipadic-NEologd] : head => ok
:
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /opt/mecab/mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 383 0 383 0 0 171 0 --:--:-- 0:00:02 --:--:-- 171
0 0 0 11.6M 0 0 410k 0 --:--:-- 0:00:29 --:--:-- 468k
[make-mecab-ipadic-NEologd] : Decompress original mecab-ipadic file
:
[install-mecab-ipadic-NEologd] : Do you want to install mecab-ipadic-NEologd? Type yes or no. <-- type yes
yes
[install-mecab-ipadic-NEologd] : OK. Let's install mecab-ipadic-NEologd.
:
[install-mecab-ipadic-NEologd] : Finish..
[root@c mecab-ipadic-neologd]#
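The PHP script further down this page passes the dictionary directory to MeCab with -d. As a rough sketch, that path can be derived rather than hard-coded: according to the mecab-ipadic-NEologd README, the installer places the dictionary in a mecab-ipadic-neologd directory under the path printed by `mecab-config --dicdir`. The function name neologd_dir is made up for illustration, and its optional argument exists only so the function can be tried without MeCab installed.

```shell
#!/bin/sh
# neologd_dir is a hypothetical helper name, not part of MeCab or NEologd.
# It prints the directory where the NEologd installer is assumed to have
# put the dictionary: <MeCab dictionary dir>/mecab-ipadic-neologd.
neologd_dir() {
    # Use the first argument if given, otherwise ask mecab-config.
    dicdir="${1:-$(mecab-config --dicdir)}"
    printf '%s/mecab-ipadic-neologd\n' "$dicdir"
}
```

With the /usr/local build above, `neologd_dir` should resolve to /usr/local/lib/mecab/dic/mecab-ipadic-neologd, so a quick check could look like `echo "some text" | mecab -d "$(neologd_dir)"`.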
php-mecab
Official repository: https://github.com/rsky/php-mecab
[root@c ~]# yum --enablerepo=epel install re2c
[root@c ~]# which php
/bin/php
[root@c ~]# php -v
PHP 5.4.16 (cli) (built: Apr 12 2018 19:02:01)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2013 Zend Technologies
[root@c ~]#
[root@c ~]# cd src/
[root@c src]# git clone https://github.com/rsky/php-mecab.git
[root@c src]# cd php-mecab/mecab/
[root@c mecab]#
[root@c mecab]# phpize
Configuring for:
PHP Api Version: 20100412
Zend Module Api No: 20100525
Zend Extension Api No: 220100525
[root@c mecab]#
[root@c mecab]# which mecab-config
/usr/local/bin/mecab-config
[root@c mecab]# which php-config
/bin/php-config
[root@c mecab]# ./configure --with-php-config=/usr/bin/php-config --with-mecab=/usr/local/bin/mecab-config
[root@c mecab]# make && make install
Installing shared extensions: /usr/lib64/php/modules/
[root@c mecab]# ls -l /usr/lib64/php/modules/mecab.so
-rwxr-xr-x 1 root root 241552 7月 1 21:54 /usr/lib64/php/modules/mecab.so
[root@c mecab]# echo extension=mecab.so > /etc/php.d/mecab.ini
[root@c mecab]# systemctl reload httpd
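Before writing any PHP, it may be worth confirming that the extension is actually loaded. `php -m` lists the loaded modules one per line, and the sketch below checks such a list for a mecab entry. The function name has_mecab_module is hypothetical, and the assumption that the module is listed as "mecab" is inferred from the mecab.so file name, not taken from the php-mecab documentation.

```shell
#!/bin/sh
# has_mecab_module is a hypothetical helper name.
# It reads a module list (one name per line) on stdin and succeeds
# if one of the lines is exactly "mecab", ignoring case.
has_mecab_module() {
    # -q: no output, -i: ignore case, -x: match whole lines only
    grep -qix 'mecab'
}
```

Usage would be along the lines of `php -m | has_mecab_module && echo "mecab extension loaded"`. This checks the CLI configuration; it is assumed here that httpd reads the same /etc/php.d directory.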
To run morphological analysis on the content of some web page, try something like the following.
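<?php
// Fetch the page and strip the HTML tags, leaving only the text.
$str = strip_tags(file_get_contents('http://xxxxxxxxxxxxxxxx.lg.jp/'));
// Point MeCab at the mecab-ipadic-NEologd dictionary installed above.
$options = array('-d', '/usr/local/lib/mecab/dic/mecab-ipadic-neologd');
$mecab = new MeCab_Tagger($options);
$nodes = $mecab->parseToNode($str);
echo "<pre>\n";
foreach ($nodes as $n) {
    // Keep nouns only. The feature string begins with the part of speech
    // (名詞 = noun); testing for position 0 avoids also matching entries
    // such as 接頭詞,名詞接続 (prefixes that attach to nouns).
    if (strpos($n->getFeature(), "名詞") === 0) {
        echo $n->getSurface();
        echo "\t" . $n->getFeature() . "\n";
    }
}
echo "</pre>";
?>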
This does analyze the text shown on the web page, but the same words are printed over and over.
So a step that removes the duplicates first may be needed.
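One possible way to remove the duplicates, sketched here on the command line instead of in PHP: the mecab command prints one token per line as the surface form, a tab, then the comma-separated feature string, so a short awk filter can keep only the nouns and print each surface once. The function name unique_nouns is hypothetical.

```shell
#!/bin/sh
# unique_nouns is a hypothetical helper name.
# It reads mecab's default output on stdin and prints each noun surface
# once, in order of first appearance.
unique_nouns() {
    # -F'\t' splits each line into the surface ($1) and the feature string ($2).
    # Keep lines whose feature starts with 名詞 (noun), and print a surface
    # only the first time it is seen. The EOS line has no tab and is skipped.
    awk -F'\t' '$2 ~ /^名詞/ && !seen[$1]++ { print $1 }'
}
```

For example, `mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd page.txt | unique_nouns`, where page.txt is a hypothetical file holding the page text. Inside the PHP script, the same effect could probably be had by collecting the surfaces as array keys before printing.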