首頁技術文章正文

Java Training Hands-On Tutorial: A First Look at Lucene

Updated: 2015-12-29 13:26    Source: 傳智播客 Java Training Academy

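Full-Text Search Scenarios
When you search for information on Baidu or Google, or for products on Taobao or JD.com, do you know which technology finds what you want so quickly? It is full-text search.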
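The Concept of Full-Text Search
Full-text search retrieves any piece of content from an entire book or article. It can return information about chapters, sections, paragraphs, sentences, and words as needed. A program scans every word in the text and builds an index entry for each one, recording how many times and at which positions the word occurs. When a user issues a query, the program looks the word up in that index, much like finding a character through the lookup table of a dictionary.
Over the years, full-text search has evolved from simple string-matching programs into large software systems that manage unstructured data such as very large texts, audio, images, and video.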
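The per-word index described above (each word with its occurrence count and positions) can be sketched in plain Java. This is a minimal illustration only: the class name WordPositionIndex and the rule of splitting on anything that is not a letter or digit are assumptions made for the example, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Lucene.
class WordPositionIndex {

    // Maps each lowercased word to the 0-based word positions where it occurs.
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> index = new TreeMap<String, List<Integer>>();
        // Split on every run of characters that are neither letters nor digits.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a leading separator yields one empty string
            }
            List<Integer> positions = index.get(word);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                index.put(word, positions);
            }
            positions.add(position++);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build("Lucene is a library. Lucene is fast.");
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            // The list size is the occurrence count of the word.
            System.out.println(entry.getKey() + " -> count=" + entry.getValue().size()
                    + ", positions=" + entry.getValue());
        }
    }
}
```

Looking up a word is then a single map access, which is what makes a query against the index cheaper than rescanning the text.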
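What Is Lucene
Lucene is an open-source full-text search toolkit from Apache. It provides a complete query engine and indexing engine. Its goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system.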
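Case Description
We use one case to study the full-text search process: implementing search for a file explorer. The user searches by keyword, and every file whose name or content contains the keyword must be found.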
Development Environment
Download Lucene 4.10.3 from the official Lucene website (http://lucene.apache.org/) and unpack it.

Lucene 4.10.3 requires JDK 1.7 or later; this tutorial uses 1.7.0_72.
Development tool: Eclipse Indigo

Lucene JARs:
lucene-core-4.10.3.jar --- Lucene core
lucene-analyzers-common-4.10.3.jar --- Lucene analyzers
lucene-queryparser-4.10.3.jar --- Lucene query parser

Other:
commons-io-2.4.jar --- reads file content from disk
junit-4.9.jar --- unit testing

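The Lucene Full-Text Search Process
Full-text search consists of two processes, indexing and searching: first build an index over the information to be searched, then search that index.
As shown in the figure:
1. Yellow marks the indexing process, which builds an index from the original content to be searched. It consists of:
Determine the original content to search --> acquire documents --> create documents --> analyze documents --> index documents

2. Blue marks the search process, which searches the index. It consists of:
User enters a query in the search UI --> create query --> execute search against the index --> render search results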

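Step 1: Determine the Original Content
The original content is the content to be indexed and searched. It can include web pages on the internet, data in a database, files on disk, and so on.
In this case the original content is files on disk (this tutorial searches only .txt files), as shown in the figure: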

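Step 2: Acquire the Original Content
Acquiring the original information to be searched from the internet, a database, the file system, and so on is called information collection. Its purpose is to make the original content available for indexing.
Lucene itself does not provide information collection. Here we read the content of files on disk with Java I/O streams.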
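As a rough sketch of this collection step using only the JDK (java.nio.file, available since JDK 1.7), the class below gathers the content of every .txt file directly inside one folder. The class name TxtFileCollector and the choice of UTF-8 are assumptions made for the example; the tutorial's own code uses commons-io instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Lucene or commons-io.
class TxtFileCollector {

    // Returns file name -> file content for each regular *.txt file in the folder.
    static Map<String, String> collect(Path folder) throws IOException {
        Map<String, String> contents = new TreeMap<String, String>();
        // The glob restricts the listing to names ending in .txt (not recursive).
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    byte[] bytes = Files.readAllBytes(file);
                    // Assumption: the files are UTF-8 encoded.
                    contents.put(file.getFileName().toString(),
                            new String(bytes, StandardCharsets.UTF_8));
                }
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, String> entry : collect(folder).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().length() + " chars");
        }
    }
}
```

Reading whole files into memory is fine for small text files like the ones in this case; very large sources would call for streaming instead.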

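Step 3: Create Documents
The original content is acquired in order to index it. Before indexing, the original content must be turned into documents (Document). A document contains individual fields (Field), and the fields hold the content.
Here we treat each file on disk as one Document. The Document contains several Fields (fileName for the file name, filePath for the file path, fileSize for the file size, fileContent for the file content), as shown in the figure:

Note: each Document can have multiple Fields, different Documents can have different Fields, and one Document can contain the same Field more than once (same field name and same field value).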
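To make the note concrete, here is a toy model of a document as an ordered list of name/value pairs. SimpleDocument and SimpleField are hypothetical stand-ins written for this example; they are not Lucene's Document and Field classes and carry none of their indexing options.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; Lucene's real classes are Document and Field.
class SimpleDocument {

    // One name/value pair.
    static final class SimpleField {
        final String name;
        final String value;

        SimpleField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // A list rather than a map, so the same field name (and value) may repeat.
    private final List<SimpleField> fields = new ArrayList<SimpleField>();

    void add(String name, String value) {
        fields.add(new SimpleField(name, value));
    }

    // Returns every value added under the given field name, in insertion order.
    List<String> getValues(String name) {
        List<String> values = new ArrayList<String>();
        for (SimpleField field : fields) {
            if (field.name.equals(name)) {
                values.add(field.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        SimpleDocument doc = new SimpleDocument();
        doc.add("fileName", "springmvc.txt");
        doc.add("tag", "spring");
        doc.add("tag", "spring"); // same name and same value is allowed
        System.out.println(doc.getValues("tag"));
    }
}
```

Storing fields in a list is the design choice that allows duplicates; a map keyed by field name would silently keep only one value per name.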

The following code reads files from disk and creates documents:

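import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class IndexUtils {

   // Create one Document for each .txt file in the given folder
   public static List<Document> file2Document(String folderPath)
           throws IOException {

       List<Document> list = new ArrayList<Document>();

       File folder = new File(folderPath);
       // listFiles() returns null when the path is not a readable directory
       File[] files = folder.listFiles();
       if (files == null) {
           return list;
       }
       for (File file : files) {
           // File name
           String fileName = file.getName();
           System.out.println(fileName);
           // Skip subdirectories and anything that does not end with .txt
           if (!file.isFile() || !fileName.endsWith(".txt")) {
               continue;
           }

           // File content, read with the platform default encoding
           String fileContent = FileUtils.readFileToString(file);
           // File path
           String filePath = file.getAbsolutePath();
           // File size in bytes
           long fileSize = FileUtils.sizeOf(file);

           // Create the document
           Document doc = new Document();

           // File name: indexed as a single term (not tokenized) and stored
           Field field_fileName = new StringField("fileName", fileName, Store.YES);
           // File content: tokenized and indexed, but not stored
           Field field_fileContent = new TextField("fileContent", fileContent, Store.NO);
           // File size: indexed as a number and stored
           Field field_fileSize = new LongField("fileSize", fileSize, Store.YES);
           // File path: stored only, not indexed (StoredField takes no Store argument)
           Field field_filePath = new StoredField("filePath", filePath);

           // Add the Fields to the document
           doc.add(field_fileName);
           doc.add(field_fileContent);
           doc.add(field_fileSize);
           doc.add(field_filePath);
           list.add(doc);
       }

       return list;
   }
}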

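Step 4: Analyze Documents

Once the original content has been turned into documents (Document) containing fields (Field), the content of the fields must be analyzed. Analysis extracts words from the original document, converts letters to lowercase, removes punctuation, removes common words (stop words), and so on, to produce the final tokens. A token can be thought of as a single word.
For example, the document below is analyzed as follows:
Original document content: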
Lucene is a Java full-text search engine. Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

Tokens after analysis:
lucene, java, full, text, search, engine, ...
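The analysis steps just described (extract words, lowercase, drop punctuation, drop common words) can be approximated in a few lines of plain Java. This is only a sketch: ToyAnalyzer is a hypothetical class, its stop-word list is a small hand-picked sample, and real Lucene analyzers tokenize with more elaborate rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not a Lucene Analyzer.
class ToyAnalyzer {

    // Assumption: a small sample stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
            "a", "an", "and", "be", "but", "is", "not", "that", "the", "to"));

    // Lowercases the text, splits it on non-alphanumeric runs, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, java, full, text, search, engine]
        System.out.println(analyze("Lucene is a Java full-text search engine."));
    }
}
```

Because the same analysis is applied to the indexed text and, typically, to the query text, a search for "Lucene" and one for "lucene" end up looking for the same token.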

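Step 5: Create the Index

The tokens produced by analyzing all documents are indexed. The purpose of the index is search: in the end, a search looks only at the indexed tokens and from them finds the Documents.
Note: the index is built over tokens, so documents are found through words. This structure is called an inverted index.
The traditional approach locates each file, reads its content, and matches the search keyword against that content. This is sequential scanning, and it is slow when the data volume is large.
An inverted index instead goes from content (words) to documents, as shown in the figure:

Using the index dictionary on the left, you can find the documents that contain a word. The term "springmvc.txt" occurs in Document1 (springmvc.txt), while "web" and "spring" occur in both Document1 and Document2. Terms are linked to Documents through Fields.

An inverted index (also called a reverse index) consists of two parts, the index and the documents. The index is the vocabulary, which is relatively small, while the document collection is relatively large.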
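The word-to-document direction of an inverted index can be shown with an in-memory map from each term to the set of documents containing it. ToyInvertedIndex, the sample file contents, and the simple tokenization are assumptions for illustration; Lucene's on-disk index is far more elaborate.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration; not Lucene's index format.
class ToyInvertedIndex {

    // term -> names of the documents that contain the term
    private final Map<String, Set<String>> postings = new TreeMap<String, Set<String>>();

    // Tokenizes the content and records the document under each of its terms.
    void addDocument(String docName, String content) {
        for (String term : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    // Looks the term up directly; no document content is scanned at query time.
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        // Sample contents invented for the example.
        index.addDocument("springmvc.txt", "Spring MVC is a web framework");
        index.addDocument("spring.txt", "Spring powers many web applications");
        System.out.println("web -> " + index.search("web"));
        System.out.println("mvc -> " + index.search("mvc"));
    }
}
```

The cost of a lookup depends on the size of the vocabulary rather than on the total amount of text, which is the advantage over sequential scanning.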

The code for analyzing with an analyzer and creating the index is as follows:

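import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class IndexTest {

    // Index source, i.e. the directory holding the source data
    private static String searchSource = "F:\\develop\\lucene\\searchsource";

    // Directory where the index is written
    private static String indexFolder = "F:\\develop\\lucene\\indexdata";

    @Test
    public void testCreateIndex() throws IOException {

        // Read the files in the source directory and create Documents
        List<Document> docs = IndexUtils.file2Document(searchSource);
        // Create the analyzer; StandardAnalyzer ships in lucene-analyzers-common
        Analyzer standardAnalyzer = new StandardAnalyzer();
        // Specify the directory where the index is stored
        Directory directory = FSDirectory.open(new File(indexFolder));
        // Create the index writer configuration
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_3,
                standardAnalyzer);

        // Create the IndexWriter that performs the indexing
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        try {
            // Add each document to the index
            for (Document document : docs) {
                indexWriter.addDocument(document);
            }
        } finally {
            // Commit and close the writer even if indexing fails
            indexWriter.close();
        }
    }
}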

Step 6: Search Files
Searching for a file by file name involves the following steps:
1) Specify the index directory; searching means looking up matching terms in the index
// Index directory
    private static String indexFolder = "F:\\develop\\lucene\\indexdata";

2) Create a Query to express the search (it plays a role similar to SQL in a relational database)
// Create the query: match documents whose fileName field equals the given file name
       Query query = new TermQuery(new Term("fileName", "springmvc_test.txt"));

3) Create an IndexReader to read the index files
// Specify the index directory
       Directory directory = FSDirectory.open(new File(indexFolder));

       // Create the IndexReader
       IndexReader reader = DirectoryReader.open(directory);

4) Create an IndexSearcher to execute the search
// Create the IndexSearcher
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       // Execute the search, returning at most the top 100 hits
       TopDocs topDocs = indexSearcher.search(query, 100);

5) Get the search results from TopDocs
// Extract the search results
       ScoreDoc[] scoreDocs = topDocs.scoreDocs;

6) Iterate over the results and read Field content from each Document
for (ScoreDoc scoreDoc : scoreDocs) {
           // Document id
           int docID = scoreDoc.doc;
           // Load the document
           Document doc = indexSearcher.doc(docID);
           // Print the stored fields
           System.out.println("------------------------------");
           System.out.println("File name = " + doc.get("fileName"));
           System.out.println("File size = " + doc.get("fileSize"));
           // fileContent was indexed with Store.NO, so doc.get("fileContent") would return null
           System.out.println("File path = " + doc.get("filePath"));
       }

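The complete code is as follows:
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class SearchTest {
    // Index directory
    private static String indexFolder = "F:\\develop\\lucene\\indexdata";

    // Query method
    @Test
    public void testTermQuery() throws IOException {

       // Create the query: match documents whose fileName field equals the given file name
       Query query = new TermQuery(new Term("fileName", "springmvc_test.txt"));

       // Specify the index directory
       Directory directory = FSDirectory.open(new File(indexFolder));

       // Create the IndexReader
       IndexReader reader = DirectoryReader.open(directory);
       try {
           // Create the IndexSearcher
           IndexSearcher indexSearcher = new IndexSearcher(reader);
           // Execute the search, returning at most the top 100 hits
           TopDocs topDocs = indexSearcher.search(query, 100);
           // Extract the search results
           ScoreDoc[] scoreDocs = topDocs.scoreDocs;

           System.out.println("Total hits: " + topDocs.totalHits);

           for (ScoreDoc scoreDoc : scoreDocs) {
               // Document id
               int docID = scoreDoc.doc;
               // Load the document
               Document doc = indexSearcher.doc(docID);
               // Print the stored fields
               System.out.println("------------------------------");
               System.out.println("File name = " + doc.get("fileName"));
               System.out.println("File size = " + doc.get("fileSize"));
               // fileContent was indexed with Store.NO, so doc.get("fileContent") would return null
               System.out.println("File path = " + doc.get("filePath"));
           }
       } finally {
           // Release the index files
           reader.close();
       }

    }
}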
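The search call above asks for at most 100 hits, ordered by score. The idea of "score every matching document, then keep the best n" can be sketched without Lucene as follows. ToyTopDocs and its Hit class are hypothetical, and the score used here is simply the raw term frequency taken from a prebuilt postings map, which is an assumption for the example and not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not Lucene's TopDocs.
class ToyTopDocs {

    // One search hit: a document name and its score.
    static final class Hit {
        final String docName;
        final int score;

        Hit(String docName, int score) {
            this.docName = docName;
            this.score = score;
        }
    }

    // postings maps term -> (document name -> number of occurrences of the term).
    // Returns at most n hits for the term, highest score first.
    static List<Hit> search(Map<String, Map<String, Integer>> postings, String term, int n) {
        List<Hit> hits = new ArrayList<Hit>();
        Map<String, Integer> docs = postings.get(term);
        if (docs == null) {
            return hits; // the term is not in the index
        }
        for (Map.Entry<String, Integer> entry : docs.entrySet()) {
            hits.add(new Hit(entry.getKey(), entry.getValue()));
        }
        // Sort by descending score.
        Collections.sort(hits, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                return Integer.compare(b.score, a.score);
            }
        });
        // Keep only the first n hits.
        return hits.size() > n ? new ArrayList<Hit>(hits.subList(0, n)) : hits;
    }
}
```

Limiting the result to n hits keeps the rendering step cheap even when a term matches a very large number of documents.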

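Copyright of this article belongs to 傳智播客 Java Training Academy. Reposting is welcome; please credit the author and source. Thank you!
Author: 傳智播客 Java Training Academy
First published at: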

http://metathetuscanyresort.com/javaee