User:Dan Nessett/Technical/Notes on setting up Sphinx Search Engine

From Citizendium


The account of this former contributor was not re-activated after the server upgrade of March 2022.


Installing Sphinx Search Engine Daemon

Installing on Ubuntu

The instructions below are adapted from the "install sphinx using postgres" article, updated for a newer version of Sphinx.

Install build toolchain

sudo aptitude install build-essential checkinstall

Install Postgres

sudo aptitude install postgresql postgresql-client \
postgresql-client-common postgresql-contrib \
postgresql-server-dev-8.3

Note to self: The first two shouldn't be necessary, since postgresql should already be installed. However, the last one probably is required:

sudo aptitude install postgresql-server-dev-8.3

Get Sphinx source

cd /usr/local/src
sudo mkdir sphinx
cd sphinx
sudo wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
sudo tar xzvf sphinx-0.9.9.tar.gz
cd sphinx-0.9.9

Configure and make

sudo ./configure --without-mysql --with-pgsql \
--with-pgsql-includes=/usr/include/postgresql/ \
--with-pgsql-lib=/usr/lib/postgresql/8.3/lib/
sudo make

Run checkinstall

sudo mkdir /usr/local/var
sudo checkinstall

Sphinx is now installed in /usr/local. Check out /usr/local/etc/ for configuration info.
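
A quick check after checkinstall finishes can confirm that the binaries used later in these notes landed where expected. The following is a minimal sketch, assuming the default /usr/local prefix; the function name missing_sphinx_binaries is made up for illustration.

```shell
# Print the name of each Sphinx binary that is missing (or not executable)
# under PREFIX/bin. PREFIX defaults to /usr/local, as used in the build above.
missing_sphinx_binaries() {
    prefix="${1:-/usr/local}"
    for name in indexer searchd search; do
        if [ ! -x "$prefix/bin/$name" ]; then
            echo "$name"
        fi
    done
}

# Report the result for the default prefix.
missing="$(missing_sphinx_binaries /usr/local)"
if [ -z "$missing" ]; then
    echo "all Sphinx binaries found"
else
    echo "missing binaries:"
    echo "$missing"
fi
```

If anything is reported missing, the checkinstall step probably did not complete.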

Installing on CentOS 5.4

Installing and Configuring Sphinx Search Extension

The instructions given below are from the Install Sphinx Extension page, modified to give specific directories.

Configure Sphinx

Download and extract the extension to a temporary directory. Copy the sphinx.conf file from this download to /usr/local/etc/sphinx.conf. This directory should not be web-accessible, so you should not use the extensions folder. Make sure to adjust all values to suit your setup:

Set the correct database, username, and password for your MediaWiki database.
Update the table names in the SQL queries if your MediaWiki installation uses a prefix (backslash line breaks may need to be removed if the indexer step below fails).
Update the file paths (/var/data/sphinx/..., /var/log/sphinx/...) and create folders as necessary.
If your wiki is very large, you may want to consider specifying a query range in the conf file.
If your wiki is not in English, you will need to change (or remove) the morphology attribute.
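
If the indexer does fail on the backslash line breaks mentioned above, they can be removed mechanically rather than by hand. This is only a sketch: the function name join_continuations is hypothetical, and it assumes each continued line ends with a backslash immediately before the newline.

```shell
# Join every line that ends in a backslash with the line that follows it,
# dropping the backslash. Reads the named files, or stdin if none are given.
join_continuations() {
    awk '{
        if (sub(/\\$/, "")) {
            printf "%s", $0    # continued line: print without a newline
        } else {
            print              # ordinary line: print unchanged
        }
    }' "$@"
}

# Example use (writes a joined copy and leaves the original untouched):
#   join_continuations /usr/local/etc/sphinx.conf > /tmp/sphinx.conf.joined
```

Writing to a separate file first makes it easy to compare the result with the original before replacing it.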

An example source definition for postgres is:

source src_wiki_main
{
	# data source
	type		= pgsql
	sql_host	= localhost
	sql_user	= cz
	sql_pass	= 
	sql_db		= cz
	sql_port	= 5432

	# pre-query, executed before the main fetch query
	sql_query_pre	= SET NAMES 'utf8'

	# main document fetch query - change the table names if you are using a prefix
	sql_query	= SELECT page_id, page_title, page_namespace, old_id, old_text FROM page, revision, text WHERE rev_id=page_latest AND old_id=rev_text_id 

	# attribute columns
	sql_attr_uint	= page_namespace
	sql_attr_uint	= old_id

	# uncomment next line to collect all category ids for a category filter
	#sql_attr_multi  = uint category from query; SELECT cl_from, page_id AS category FROM categorylinks, page WHERE page_title=cl_to AND page_namespace=14

	# optional - used by command-line search utility to display document information
	sql_query_info	= SELECT page_title, page_namespace FROM page WHERE page_id=$id
}
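
Before running the indexer it can help to confirm that the connection settings in the source block are the ones intended. The helper below is a rough sketch, not part of Sphinx: conf_value is a hypothetical name, and it assumes settings are written as "key = value" with whitespace around the equals sign, as in the example above.

```shell
# Print the value of the first "KEY = value" line found in FILE.
# Commented-out settings are skipped because their first field starts with "#".
conf_value() {
    key="$1"
    file="$2"
    awk -v key="$key" '
        { sub(/^[ \t]+/, "") }            # drop leading indentation
        $1 == key && $2 == "=" {
            sub(/^[^=]*=[ \t]*/, "")      # keep only the text after "="
            print
            exit
        }
    ' "$file"
}

# Example use, assuming psql is installed and the config path from these notes:
#   db="$(conf_value sql_db /usr/local/etc/sphinx.conf)"
#   user="$(conf_value sql_user /usr/local/etc/sphinx.conf)"
#   psql -h localhost -U "$user" -d "$db" -c 'SELECT count(*) FROM page'
```

If the psql command in the example fails, the indexer will most likely fail for the same reason.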

To enable an incremental index use the following source in sphinx.conf:

# data source definition for the incremental index 
source src_wiki_incremental : src_wiki_main
{
	# adjust this query based on the time you run the full index
	# in this case, full index runs at 3 AM (server time) which translates to 7 AM UTC
	sql_query	= SELECT page_id, page_title, page_namespace, old_id, old_text FROM page, revision, text WHERE rev_id=page_latest AND old_id=rev_text_id AND page_touched>=DATE_FORMAT(CURDATE(), '%Y%m%d070000')
}

Note: the timestamp in the WHERE clause determines which changes the incremental index picks up. The example above assumes that the full index runs each night at 07:00 UTC. If the main indexing runs at some other time, adjust the DATE_FORMAT specification accordingly. DATE_FORMAT and CURDATE are MySQL functions, so a PostgreSQL source needs an equivalent expression.
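
To see what value the cutoff expression is meant to produce, the same 14-digit timestamp (today's date followed by the hour of the full index run) can be built in the shell. This is an illustrative sketch only; incremental_cutoff is a made-up name, and it uses the local date of the machine it runs on, which may differ from the database server's date.

```shell
# Print the cutoff timestamp (YYYYMMDDHHMMSS) for today at HOUR:00:00.
# HOUR is a plain integer from 0 to 23, written without a leading zero.
incremental_cutoff() {
    hour="$1"
    printf '%s%02d0000\n' "$(date +%Y%m%d)" "$hour"
}

# Cutoff for a full index that runs at 07:00.
incremental_cutoff 7
```

Comparing this output with the value the database computes is a simple way to catch a time zone mismatch.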

It is also necessary to create the following two directories and make them writable by the user that runs the apache2 server (www-data on Ubuntu):

sudo mkdir /var/log/sphinx
sudo chown www-data:www-data /var/log/sphinx
sudo mkdir /var/data
sudo mkdir /var/data/sphinx
sudo chown www-data:www-data /var/data/sphinx
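
Once the directories exist, their permissions can be checked with a few lines of shell. The following is a minimal sketch with a hypothetical function name, unwritable_dirs; it tests writability for whichever user runs it, so it needs to run as www-data to be meaningful.

```shell
# Print each directory from the argument list that is missing or not
# writable by the user running this script.
unwritable_dirs() {
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            echo "$dir"
        fi
    done
}

# Check the two directories created above; no output means both are usable.
unwritable_dirs /var/log/sphinx /var/data/sphinx
```

Saved to a file, the check can be run as the web server user with sudo -u www-data sh followed by the file name.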

Run Sphinx Indexer

Run the sphinx indexer to prepare for searching:

/usr/local/bin/indexer --config /usr/local/etc/sphinx.conf --all

Once again, make sure to replace the paths to match your installation. Indexing is usually fast, but the time it takes depends on how large your wiki is. Be patient and watch the screen for updates.

Test Out Sphinx

When the indexer is finished, test that sphinx searching is actually working:

/usr/local/bin/search --config /usr/local/etc/sphinx.conf "search string"

You will see the result stats immediately (Sphinx is fast). Note that the article data you see at this point comes from the sql_query_info setting in sphinx.conf. In the extension we can get to the actual article content because the text table's old_id is available as an extra attribute. Fetching article content on the command line would be slow (it would require joining the page, revision, and text tables), so only page_title and page_namespace are fetched at this point.

Start Sphinx Daemon

To speed up searching on the wiki, Sphinx must run in daemon mode. Add the following to whatever server startup script you have access to (e.g. /etc/rc.local):

/usr/local/bin/searchd --config /usr/local/etc/sphinx.conf &

Note: without the daemon running, searching will not work. That is why it is critical to make sure the daemon process is started every time the server is restarted.
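
Because search depends on the daemon, it is worth having a way to check that it is still alive. The sketch below relies on the PID file that searchd writes; the path /var/log/sphinx/searchd.pid is an assumption and should match the pid_file setting in your sphinx.conf, and pidfile_alive is a hypothetical helper name.

```shell
# Succeed if the PID recorded in PIDFILE belongs to a live process.
# kill -0 sends no signal; it only tests whether the process can be signalled,
# so run this as root or as the user that owns searchd.
pidfile_alive() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1
    pid="$(cat "$pidfile")"
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Assumed PID file location; adjust to the pid_file value in sphinx.conf.
if pidfile_alive /var/log/sphinx/searchd.pid; then
    echo "searchd is running"
else
    echo "searchd is not running"
fi
```

A check like this could be run from cron to restart searchd if it has stopped.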

Configure Incremental Updates

To keep the search index up to date, the indexer must be scheduled to run at regular intervals. On most UNIX systems, edit your crontab file by running the command:

crontab -e

Add this line to set up a cron job for the full index - for example once every night:

0 3 * * * /usr/local/bin/indexer --quiet --config /usr/local/etc/sphinx.conf wiki_main --rotate >/dev/null 2>&1

Add this line to set up a more frequent cron to update the smaller index regularly:

0 9,15,21 * * * /usr/local/bin/indexer --quiet --config /usr/local/etc/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1

As before, make sure to adjust the paths to suit your configuration. Note that the --rotate option is needed if the searchd daemon is already running, so that the indexer does not modify the index files while they are in use. Instead, it writes new index files and signals searchd to switch to them once indexing is done.
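
The rule about --rotate can be captured in a small wrapper, so that the same command works whether or not searchd is up. This is a sketch under stated assumptions: indexer_command is a hypothetical helper, the paths are the ones used in these notes, and the caller decides how to detect a running searchd.

```shell
# Print the indexer command line for INDEX. The --rotate flag is added only
# when SEARCHD_RUNNING is "yes", because rotation hands the new index files
# over to a running searchd.
indexer_command() {
    index="$1"
    searchd_running="$2"
    cmd="/usr/local/bin/indexer --quiet --config /usr/local/etc/sphinx.conf $index"
    if [ "$searchd_running" = "yes" ]; then
        cmd="$cmd --rotate"
    fi
    echo "$cmd"
}

# Show the command a cron job would run while searchd is up.
indexer_command wiki_incremental yes
```

Printing the command rather than running it makes the wrapper easy to check before it goes into a crontab.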

Extension Preparation - Sphinx PHP API

Create the extensions/SphinxSearch directory and copy the Sphinx API file, sphinxapi.php, there. This file is part of the Sphinx download, under the api/ directory. You will need to copy this file again each time you update the Sphinx engine.

Extension Installation - PHP Files

Copy all remaining files (SphinxSearch.php, SphinxSearch_body.php, etc.) from the temporary directory you extracted the code to in the "Configure Sphinx" step to your extensions/SphinxSearch directory.

Extension Installation - Local Settings

Add the following text to your LocalSettings.php:

require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
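
To confirm that the line is actually present and not commented out, a small grep-based check is enough. The following is a minimal sketch; sphinxsearch_enabled is a made-up name, and it only recognises comments that start a line with # or //.

```shell
# Succeed if FILE has a non-commented line that loads SphinxSearch.php.
sphinxsearch_enabled() {
    file="$1"
    grep -Ev '^[[:space:]]*(#|//)' "$file" 2>/dev/null |
        grep -Fq 'extensions/SphinxSearch/SphinxSearch.php'
}

# Run from the wiki's root directory, where LocalSettings.php lives.
if sphinxsearch_enabled LocalSettings.php; then
    echo "SphinxSearch is enabled"
else
    echo "SphinxSearch is not enabled"
fi
```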