Sphider
Users Guide
Versions 2.4, 2.4-PDO
and 3.x-MB
Contents
Introduction
About Sphider
Installation
Using the Admin Panel
Settings Tab
Sites Tab
Feed Tab
Categories Tab
Index Tab
Clean Tables Tab
Statistics Tab
Database Tab
Log Out Tab
Using the Search Features
Using Sphider Search
Searching Site Contents
Searching RSS Feeds
Searching Images
Miscellaneous Subjects
Sphidering from the command prompt
Database.php
Auth.php
Creating your own templates
Preventing indexing
Indexing tips
Introduction
Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its backend
database. It is a great tool for adding search functionality to your web site or building your custom
search engine. Sphider is small, easy to set up and modify, and is used by thousands of websites across
the world.
Sphider not only supports all standard search options, but also includes a plethora of advanced features
such as word auto-completion and spelling suggestions. The sophisticated administration interface makes
administering the system easy. The full list of Sphider features can be seen on the About Sphider page.
The last official version, 1.3.6, was released on 6 April 2013 and was only a security update to
address a critical issue. The last release with any functional changes was 1.3.5, which dates to 2009.
Version 1.3.6 may be obtained from the Sphider PHP search engine site. The official version a) is no
longer supported (see footnote 1), b) is built upon earlier versions of PHP and contains much deprecated
code, c) is highly vulnerable to SQL injection attacks as well as other forms of remote code execution,
d) uses a suggest system which has grown increasingly unstable and unreliable as browsers change, and
e) has several uncorrected bugs.
This version, 3.1.0, has been updated to use prepared statements and works with the latest PHP
version (7.2 at this writing) and MySQL 5.6. The PDO version can work with other databases,
such as SQLite and PostgreSQL, with modification.
All queries, which in the official version use the now deprecated MySQL extension, have been updated
since 1.5.1 to use either MySQLi/MySQLnd or PDO prepared statements, virtually eliminating the risk of
SQL injection attacks. The unstable and insecure SuggestFramework has been replaced by jQuery, making
spelling suggestions dependable once again. All HTML is now HTML5 compliant. Configuration
settings are now stored in the database. This eliminates the serious danger presented when an entire
page was rewritten using unfiltered $_GET data every time the configuration settings changed.
Windows operating systems, which were only partially supported in the official versions, are now fully
supported. And this represents only SOME of the improvements made in 1.5.1 and later.
1 The official Sphider site also has a forum, which is supposed to provide support, although much of the “advice” is
aimed at directing individuals to a paid Sphider-plus version, rather than giving genuine help and discussion for the free
version.
About Sphider
Sphider is a popular open-source web spider and search engine. It includes an automated crawler,
which can follow links found on a site, and an indexer which builds an index of all the search terms
found in the pages. It also catalogs images occurring on each page (link) scanned and can store links
found in an RSS feed. It is written in PHP and uses MySQL as its back-end database
(requires version .5 or above for both). For the standard 2.4.x and 3.x.x-MB versions, both MySQLi
and MySQLnd are required. The PDO version requires the PDO module to be installed.
NOTE ABOUT THE ALTERNATE 2.4.x-PDO VERSION:
Not all web hosts support MySQLnd, particularly on shared hosting accounts. Generally, however, they
do support PDO (PHP Data Objects) instead. For this reason, an alternate build of Sphider has been
built to accommodate such a scenario. Version 2.4.x-PDO requires PDO support in lieu of MySQLi and
MySQLnd. Under no circumstances should files from the two builds be interchanged! The
method of database access differs between the two. The standard version is preferred because,
while the PDO version is more versatile, it carries some performance overhead.
The PDO version is also able to be ported to database types other than MySQL. SQLite and
PostgreSQL are two examples. To use databases other than MySQL, some code modification is
required.
Features
Spidering and indexing
Performs full text indexing.
Can index both static and dynamic pages.
Finds links in href, frame, area and meta tags, and can also follow links given in JavaScript as strings via window.location and window.open.
Respects the robots.txt protocol, and nofollow and noindex tags.
Follows server-side redirections.
Allows spidering to be limited by depth (i.e. maximum number of clicks from the starting page), by (sub)domain or by directory.
Allows spidering only the URLs matching (or not matching) certain keywords or regular expressions.
Supports indexing of PDF, DOC, XLS, and PPT files (using external binaries, which are NOT included, for file conversion).
Allows spidering from a site’s sitemap.xml file, if one exists. This can speed up the spidering process.
Allows resuming paused spidering.
Can exclude common words from being indexed.
Indexes images occurring either directly or by reference to each link spidered.
Indexes RSS feed links.
Searching
Default search
Supports AND, OR and Phrase searches.
Supports excluding words (by putting a ‘-’ in front of a word, any page including that word will be omitted from the results).
Supports wildcard (*) searches.
Option to add and group sites into categories.
Possible to limit searches to a given category and its subcategories.
Possible to search all or a single specified domain.
“Did you mean” search suggestion on mistyped queries.
Context-sensitive auto-completion on search terms (a la Google Suggest).
Word stemming for English (searching for “run” finds “running”, “runs”, etc.).
RSS search
Supports AND and OR searches.
Supports wildcard (*) searches.
Can search all publication dates, a specific date, or a date range.
Can retrieve all feed items by leaving the query blank.
Possible to search all feed sources or a specific one.
Image search
Can search by the occurrence of a word in the image name, in the image URL, or in the image ‘alt’ tag.
Can retrieve all images by leaving the query blank.
Supports wildcard (*) searches.
Possible to search all indexed sites or a specified site.
Search language
The default search language is set on the Settings tab, but may be changed on the fly. If the default
language is English, entering a parameter in the browser bar such as “search.php?sl=es” will change the
search language to Spanish. This change will persist until either another language is chosen or the
browser is closed.
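As a rough illustration of the language switch described above, the following Python sketch builds a search URL carrying the `sl` parameter. The host and path (`https://example.com/sphider/search.php`) and the helper name `search_url` are assumptions made for the example; only the `sl` parameter itself comes from this guide.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def search_url(base, language):
    """Return `base` with the search-language parameter `sl` set.

    `base` is a hypothetical address of search.php on your own site.
    """
    parts = urlsplit(base)
    query = urlencode({"sl": language})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


# Switch the search language to Spanish.
print(search_url("https://example.com/sphider/search.php", "es"))
```

Opening the printed address in a browser should have the same effect as typing the parameter into the browser bar by hand.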
Administering
Includes a sophisticated web based administrative interface.
Supports indexing via the web interface as well as from the command line.
Easy to set up as a cron job (or in Windows Task Scheduler).
Comprehensive site and search statistics.
Simple template system – easy to integrate into a site.
Installation
New installation
1. Unpack the files, and copy them to the server, for example to /home/youruser/public_html/sphider.
This will be the '[path_of_sphider]'.
2. On the server, create a database in MySQL to hold Sphider data.
a) at command prompt type (to log into MySQL):
mysql -u <your username> -p
Enter your password when prompted.
b) in MySQL, type:
CREATE DATABASE `sphider_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Of course, you can use some other name for the database instead of sphider_db.
c) Use exit to exit MySQL.
At this point, it is advisable to create a separate MySQL user and password for use in the next step. For
more information on how to create a database and grant the necessary permissions, check
MySQL.com.
3. In the settings directory, edit the database.php file and change $database, $mysql_user, $mysql_password
and $mysql_host to the correct values. If you don't know what $mysql_host should be, it should probably
stay as it is: 'localhost'. There is also $mysql_table_prefix, which defaults to an empty value. If you
change it, the names of the soon to be created tables will all begin with the value of
$mysql_table_prefix. For example, if you set $mysql_table_prefix = "sph_", the table "keywords" will
be created as "sph_keywords". The prefix is optional.
4. Open the install.php script (in the admin directory) in your browser. It will create the tables necessary
for Sphider to operate.
Alternatively, the tables can be created by hand using the tables.sql script provided in the sql directory of
the Sphider distribution. At the prompt, type:
mysql -u USERNAME -p sphider_db < [path_of_sphider]/sql/tables.sql
You will be prompted for your password.
NOTE: Tables created in this manner will NOT use any prefix designated by
$mysql_table_prefix in the database.php file.
5. In the admin directory, edit auth.php to change the administrator user name and password (default
values are 'admin' and 'admin').
6. It is highly recommended that the admin directory be password protected. If at all possible, the
admin directory should also be set to only allow SSL access. When logging into the admin directory
using standard http access, your directory user name and password are not encrypted. With https access,
these items are encrypted and the risk of unauthorized access to the admin directory is greatly reduced.
The common_template, include, and settings directories should also be protected. Do NOT restrict
js_suggest or templates!
7. On Linux machines, you should check to be sure your web server has read/write/delete permission
for the admin/log, admin/tmp, and admin/backup directories. There is also another tmp directory in
Sphider home that needs access as well.
8. Open admin/admin.php in a browser and start using Sphider.
9. The first step to take after getting the admin screen should be to click on the "Database" tab to ensure
that all 29 tables have been successfully created.
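The permission check in step 7 can be sketched as a small script. This is only an illustration, not part of Sphider: the function name `check_writable` and the example path are assumptions, and the check reflects the account that runs the script, so it is only meaningful when run as the same user your web server runs under.

```python
import os

# Directories named in step 7, relative to the Sphider home directory.
REQUIRED_DIRS = ["admin/log", "admin/tmp", "admin/backup", "tmp"]


def check_writable(sphider_home):
    """Map each required directory to True if it exists and the current
    user can read, write, and enter it; otherwise False."""
    results = {}
    for rel in REQUIRED_DIRS:
        path = os.path.join(sphider_home, rel)
        results[rel] = os.path.isdir(path) and os.access(
            path, os.R_OK | os.W_OK | os.X_OK
        )
    return results


if __name__ == "__main__":
    # Hypothetical install location; substitute your own [path_of_sphider].
    for rel, ok in check_writable("/home/youruser/public_html/sphider").items():
        print(f"{rel}: {'ok' if ok else 'missing or not accessible'}")
```

Any directory reported as missing or not accessible would need its ownership or permissions adjusted before spidering, logging, or backups are attempted.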
Upgrading an existing installation
1. If you already have an earlier installation of Sphider, you should first make a backup of your
existing database and store it in a safe place.
2. On the server, alter your database in MySQL to current standards.
a) at command prompt type (to log into MySQL):
mysql -u <your username> -p
Enter your password when prompted.
b) in MySQL, type:
ALTER DATABASE `sphider_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Use your current database name in place of sphider_db.
c) Use exit to exit MySQL.
3. Delete these current directories and their contents:
admin
include
common_template
js_suggest
languages
settings
sql
templates
upgrade (if it exists)
Then delete the current files: changelog, install.txt, search.php, and SphiderUserGuid.pdf.
4. Unpack the new files to your existing sphider directory which you have just cleared out.
5. In the settings directory, edit the database.php file and change $database, $mysql_user, $mysql_password
and $mysql_host to the correct values. If you don't know what $mysql_host should be, it should probably
stay as it is: 'localhost'. There is also $mysql_table_prefix, which defaults to an empty value. If your
existing tables were created with a prefix, set $mysql_table_prefix to that same value, since all table
names begin with it. For example, with $mysql_table_prefix = "sph_", the table "keywords" is
named "sph_keywords". The prefix is optional.
6. Open the update_rollup.php script (in the admin directory) in your browser. It will update the tables
necessary for Sphider to operate. Your existing data should be preserved.
7. In the admin directory, edit auth.php to change the administrator user name and password (default
values are 'admin' and 'admin').
8. It is highly recommended that the admin directory be password protected. If at all possible, the
admin directory should also be set to only allow SSL access. When logging into the admin directory
using standard http access, your directory user name and password are not encrypted. With https access,
these items are encrypted and the risk of unauthorized access to the admin directory is greatly reduced.
The common_template, include, and settings directories should also be protected. Do NOT restrict
js_suggest or templates!
9. Open admin/admin.php in a browser and start using your updated Sphider.
NOTE ABOUT UPGRADING - The changelog lists which files have changed. It may be tempting to
replace ONLY the changed files and be done with it. While this may work at a basic level, if you do
so, PLEASE still run update_rollup.php. It will make needed changes to your database. (In
Sphider 3.x-MB, run version_update.php.)
FINAL NOTE ABOUT INSTALLATION - When you have completed installing or upgrading Sphider
2.4.0, the install.php and update_rollup.php scripts should be deleted. You won't be needing them and
there is no sense leaving them around for someone else to misuse.
Using the Admin Panel
Settings Tab
There are 58 user configurable settings on this page.
GENERAL SETTINGS
Language: A drop-down list of available languages is provided. This is the default language which will appear to the user on the various search pages.
Search template: This drop-down list shows the available templates. Each template uses a CSS file to determine the look of the search and results pages.
Administrator e-mail address: The e-mail address to which spidering log files may be sent.
Print spidering logs to standard out: If this is checked, the spidering results will also be displayed in the browser window as spidering progresses.
Temporary directory: This is the name and relative or absolute path to the temporary directory. This directory is used by Sphider while parsing URLs during indexing. If a Windows path containing backslashes is used, the next setting, Windows OS, must be enabled. The path must exist. Backslashes used in Windows environments do NOT need to be escaped.
Windows OS: Check this box if Sphider is to be run in a Windows environment. If Sphider is located on a Linux system but administered via browser on a Windows system, do NOT check this box!
LOGGING SETTINGS
Log spidering results: If checked, a log file will be created for each occurrence of indexing and re-indexing.
Log directory: This is the name and relative or absolute path to the log file directory. This directory is where spidering log files are stored. If a Windows path containing backslashes is used, the backslashes should not be escaped, and Windows OS must be checked in the General Settings. The path must exist.
Log file format: The log file may be in either HTML or text format.
Send spidering log to e-mail: If checked, the spidering log will be e-mailed to the Administrator.
SPIDER SETTINGS
Required number of words in a page in order to be indexed: This sets the minimum number of words which must appear on a page for it to be indexed.
Minimum word length in order to be indexed: This sets the minimum length of a word before it can be indexed.
Keyword weight depending on the number of times it appears in a page is capped at this value: A keyword’s weight is increased by the number of times it appears on a page. This caps the weight of a keyword.
Index numbers: If checked, numbers will be indexed. (They are subject to minimum word length rules.)
Index decimal numbers: If checked, decimal numbers will be indexed. (This setting will be ignored if “Index numbers” is not also checked.)
Index words in domain name and url path: If checked, words appearing in the domain name or path to a page will be indexed.
Index meta keywords: If enabled, keywords appearing in meta tags are indexed.
Index images: If checked, each page being indexed will be checked for images, and if found, the images will also be indexed.
Minimum image width: If Sphider can determine size, this is the minimum width which will be accepted.
Minimum image height: If Sphider can determine size, this is the minimum height which will be accepted.
Index PDF files: If checked, PDF files will be parsed and indexed.
Index DOC files: If checked, DOC files will be parsed and indexed.
Index XLS files: If checked, XLS files will be parsed and indexed.
Index PPT files: If checked, PPT files will be parsed and indexed.
Full executable path to PDF converter: This is the full path to the PDF converter. If ‘Windows OS’ is checked in the General settings, the path may contain unescaped backslashes.
Full executable path to DOC converter: This is the full path to the catdoc converter. If ‘Windows OS’ is checked in the General settings, the path may contain unescaped backslashes.
Full executable path to XLS converter: This is the full path to the XLS converter. If ‘Windows OS’ is checked in the General settings, the path may contain unescaped backslashes.
Full executable path to PPT converter: This is the full path to the PPT converter. If ‘Windows OS’ is checked in the General settings, the path may contain unescaped backslashes.
User agent string: This is the user agent string as it will appear in the log files of the domain being spidered and indexed. It can be up to 50 characters in length.
Minimal delay between page downloads: The minimum time, in seconds, between page downloads during spidering. Increasing this number will increase the amount of time required to spider a site, but may reduce the number of time-out errors.
Use word stemming: If used, this should be enabled BEFORE initial indexing. It allows, for example, a search for the word “run” to also return “runs” and “running”.
Language to stem: Each language has its own algorithm.
Strip session ids: If enabled (recommended), session ids are removed from spidering results.
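To make the word-filtering and weight-cap settings above more concrete, here is a minimal Python sketch of how such rules could interact. It is not Sphider's actual indexing code: the function name `keyword_weights`, the default values, the tokenizing rule, and the choice to use the raw occurrence count as the weight are all assumptions made for illustration.

```python
import re
from collections import Counter


def keyword_weights(text, min_word_length=3, weight_cap=20, index_numbers=True):
    """Count keyword occurrences under assumed spider settings.

    - words shorter than `min_word_length` are skipped
    - purely numeric words are skipped unless `index_numbers` is True
      (numbers are still subject to the minimum length)
    - each keyword's count is capped at `weight_cap`
    """
    counts = Counter()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) < min_word_length:
            continue
        if word.isdigit() and not index_numbers:
            continue
        counts[word] += 1
    return {word: min(count, weight_cap) for word, count in counts.items()}


page = "spider spider spider web 42 a"
print(keyword_weights(page, min_word_length=2, weight_cap=2))
```

In this sketch the cap keeps a word repeated many times on one page from counting for more than the configured maximum, which matches the stated purpose of the setting.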
SEARCH SETTINGS
Default results per page: This sets the number of results shown per page to 10, 20, or 50. (It can be overridden on the search screen.)
Number of columns in category list: If categories are shown on the search page, this determines the number of columns to be used in their display.
Bound number of search results: This limits the number of search results returned. When set to 0, the limit is removed.
The length of the description string: This limits the length of the description string retrieved from the database. Visually, it will have no impact on the length of the description shown in search results unless the value is less than “Maximum length of page summary” (below). A value of 0 removes the limit.
Number of links shown to “previous” and “next” pages: This limits the number of links shown for “Previous” and/or “Next” pages when the number of results returned exceeds the maximum number of results per page.
Floor for query scores: Limits search results to this minimum score. 0 means no limit. (3.2+)
Show meta description in a results page: If enabled, the meta description will be used if it is available. If not available, the normal page extract will be shown in the results descriptions.
Advanced search: Changes the default AND-only search to an AND/OR/Phrase search.
Show result number: Toggles the result number on the results report.
Show index date: Displays the index date of the page reported.
Show query scores: This shows the query scores (chance of relevance) for each returned search result.
Show stars: Shows scores using a 5 star system. Show query scores must also be enabled. (3.2+)
Show categories: If enabled, categories will be displayed on the search form.
Maximum length of a page summary: This controls the length of the page summary for each search result.
Enable spelling suggestions (Did you mean...): If enabled, when a search returns empty but Sphider finds a similar word or phrase in the database, it will be suggested.
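The "0 means no limit" convention shared by the score floor and the result bound can be sketched as follows. This is an illustration only, not Sphider's search code: the function name `apply_limits`, the `(url, score)` tuple layout, and the sample data are assumptions.

```python
def apply_limits(results, score_floor=0, max_results=0):
    """Rank (url, score) pairs and apply the two limits.

    A `score_floor` of 0 keeps every score; a `max_results` of 0
    returns every remaining result.
    """
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    if score_floor > 0:
        ranked = [item for item in ranked if item[1] >= score_floor]
    if max_results > 0:
        ranked = ranked[:max_results]
    return ranked


hits = [("a.html", 10), ("b.html", 50), ("c.html", 30)]
print(apply_limits(hits, score_floor=20))  # drops a.html
print(apply_limits(hits, max_results=1))   # keeps only the top hit
```

Setting both values to 0 leaves the ranked list untouched, which corresponds to removing both limits on the Settings tab.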