Sphider
User's Guide
Versions 5.5
and Lite 2.6
Contents
Introduction
About Sphider
Installation
Using the Admin Panel
    Settings Tab
    Sites Tab
    Feed Tab
    Categories Tab
    Index Tab
    Clean Tables Tab
    Statistics Tab
    Database Tab
    Log Out Tab
Using the Search Features
    Using Sphider Search
    Searching Site Contents
    Searching RSS Feeds
    Searching Images
Miscellaneous subjects
    Spidering from the command prompt
    Database.php
    My.cnf
    Auth.php
    Creating your own templates
    Preventing indexing
    Indexing tips
    About robots.txt
    About common text languages
    ER Diagram
Introduction
Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its back
end database. It is a great tool for adding search functionality to your web site or building your
custom search engine. Sphider is small, easy to set up and modify, and is used in thousands of
websites across the world.
Sphider not only supports all standard search options, but also includes a plethora of advanced
features such as word auto-completion, spelling suggestions etc. The sophisticated administration
interface makes administering the system easy. The full list of Sphider features can be seen on
the About Sphider page.
The current official version is 1.3.6, released 6 April 2013 solely as a security update to
address a critical issue. The last release with any functional changes was 1.3.5, which dates to
2009. Version 1.3.6 may be obtained from the Sphider PHP search engine site. The official
version a) is no longer supported (see footnote 1), b) is built for earlier versions of PHP and
relies on much deprecated code, c) is highly vulnerable to SQL injection attacks as well as to
remote code execution, d) uses a suggest system which has grown increasingly unstable and
unreliable as browsers change, and e) has several uncorrected bugs.
This version, 5.x.x or Lite 2.x.x, has been updated to use prepared statements and works with the
current PHP release (8.1 at this writing) and MySQL 5.6 or greater. MariaDB may be used in lieu
of MySQL.
All queries, which in the official version use the old MySQL extension (deprecated in PHP 5.5 and
removed in PHP 7), have been updated since 1.5.1 to use the MySQLi/MySQLnd extension and
prepared statements, virtually eliminating the risk of SQL injection attacks. The unstable and
insecure SuggestFramework has been replaced by jQuery, making spelling suggestions dependable
once again. All HTML is now HTML5 compliant. Configuration settings are now contained in the
database, eliminating the horrendous danger presented when an entire page was completely
rewritten using unfiltered $_GET data every time the configuration settings changed.
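To illustrate why prepared statements matter, here is a minimal sketch. It is not Sphider code: it uses Python's built-in sqlite3 module rather than PHP and MySQLi, and the keywords table and the find_keyword and demo functions are hypothetical names chosen only for this example.

```python
import sqlite3

def find_keyword(conn, word):
    """Look up a keyword using a bound parameter instead of string concatenation."""
    # The "?" placeholder keeps user input out of the SQL text itself,
    # so quotes in the input cannot change the meaning of the query.
    cur = conn.execute("SELECT keyword FROM keywords WHERE keyword = ?", (word,))
    return [row[0] for row in cur.fetchall()]

def demo():
    """Build a throwaway in-memory database with a hypothetical keywords table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords (keyword TEXT)")
    conn.executemany("INSERT INTO keywords VALUES (?)", [("spider",), ("search",)])
    return conn
```

With a bound parameter, input such as x' OR '1'='1 is treated as a literal string to look up, not as part of the SQL statement.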
Windows operating systems, which were only partially supported in the official versions, are now
fully supported. These are only SOME of the improvements made in 1.5.1 and later.
____________________
1. The official Sphider site also has a forum, which supposedly provides support, although much of the
“advice” is aimed at directing individuals to a paid Sphider-plus version rather than giving genuine help and
discussion for the free version.
About Sphider
Sphider is a popular open-source web spider and search engine. It includes an automated crawler,
which can follow links found on a site, and an indexer, which builds an index of all the search
terms found in the pages. It also catalogs images occurring on each page (link) scanned and can
store links found in an RSS feed. It is written in PHP and uses MySQL as its back end database
(requires version 5.5 or above for both). For the standard 5.x.x and Lite 2.x.x versions, both
MySQLi and MySQLnd are required.
Features
Spidering and indexing
Performs full text indexing.
Can index both static and dynamic pages.
Finds links in href, frame, area and meta tags, and can also follow links given in
javascript as strings via window.location and window.open.
Respects robots.txt protocol, and nofollow and noindex tags.
Follows server side redirections.
Allows spidering to be limited by depth (i.e., maximum number of clicks from the starting
page), by (sub)domain or by directory.
Allows spidering only the urls matching (or not matching) certain keywords or regular
expressions.
Supports indexing of pdf, doc, xls and ppt files (using external binaries for file
conversion).
Allows resuming paused spidering.
Possibility to exclude common words from being indexed.
Indexes images occurring either directly or by reference to each link spidered.
Indexes RSS feed links.
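As an illustration of what respecting the robots.txt protocol involves, the following sketch filters a list of URLs against a set of robots.txt rules. It uses Python's standard urllib.robotparser module and is not Sphider's own implementation; the allowed_urls function and the example.com URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

def allowed_urls(robots_lines, user_agent, urls):
    """Return only the URLs that the given robots.txt rules permit for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(robots_lines)
    return [u for u in urls if parser.can_fetch(user_agent, u)]

# Example rules: every crawler is asked to stay out of /private/.
EXAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]
EXAMPLE_URLS = [
    "http://example.com/index.html",
    "http://example.com/private/page.html",
]
```

Calling allowed_urls(EXAMPLE_ROBOTS, "Sphider", EXAMPLE_URLS) keeps the first URL and drops the one under /private/.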
Searching
Default search
Supports AND, OR and Phrase searches.
Supports excluding words (by putting a ‘-’ in front of a word, any page including that word
will be omitted from the results).
Supports wildcard (*) searches.
Option to add and group sites into categories.
Possible to limit searches to a given category and its subcategories.
Possible to search all or a single specified domain.
“Did you mean” search suggestion on mistyped queries.
Context-sensitive auto-completion on search terms (a la Google Suggest).
Word stemming for English (searching for “run” finds “running”, “runs”, etc.).
Optional word stemming for eleven other languages, such as French, German, Italian, or
Spanish.
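The exclusion and wildcard rules listed under Default search can be sketched as follows. This is only an illustration of the behavior described, assuming a plain AND search over a set of lowercase words per page; parse_query and page_matches are hypothetical helpers, not Sphider functions.

```python
from fnmatch import fnmatchcase

def parse_query(query):
    """Split a query into required terms and terms excluded with a leading '-'."""
    required, excluded = [], []
    for term in query.lower().split():
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            required.append(term)
    return required, excluded

def page_matches(words, query):
    """AND search: every required term must match, and no excluded term may match."""
    required, excluded = parse_query(query)

    def has(pattern):
        # fnmatchcase treats '*' as a wildcard, so "sph*" matches "sphider".
        return any(fnmatchcase(word, pattern) for word in words)

    return all(has(t) for t in required) and not any(has(t) for t in excluded)
```

For a page containing the words sphider, search, and engine, the query "search -engine" does not match, while "sph*" does.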
RSS search
Supports AND and OR searches.
Supports wildcard (*) searches.
Can search all publication dates, a specific date, or a date range.
Can retrieve all feed items by leaving the query blank.
Possible to search all feed sources or a specific one.
Image search
Can search by the occurrence of a word in the image name, in the image URL, or in the
image ‘alt’ tag.
Can retrieve all images by leaving the query blank.
Supports wildcard (*) searches.
Possible to search all indexed sites or a specified site.
Administering
Includes a sophisticated web based administration interface.
Supports indexing via a web interface as well as from command line.
Easy to schedule with cron jobs (or, in Windows, with Task Scheduler).
Comprehensive site and search statistics.
Simple template system - easy to integrate into a site.
Installation
New installation
1. Unpack the files, and copy them to the server, for example to
/home/youruser/public_html/sphider. This will be the '[path_of_sphider]'.
2. On the server, create a database in MySQL to hold Sphider data.
a) At the command prompt, type (to log into MySQL):
mysql -u <your username> -p
Enter your password when prompted.
b) In MySQL, type:
CREATE DATABASE `sphider_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Of course, you can use some other name for the database instead of sphider_db. Note that the
utf8mb4_0900_ai_ci collation was introduced in MySQL 8.0; on a server that does not provide it,
substitute a utf8mb4 collation the server supports, such as utf8mb4_unicode_ci.
c) Use exit to exit MySQL.
At this point, it would be advisable to create another user and password for use in the next step.
For more information on how to create a database and give/get the necessary permissions, check
MySQL.com.
Note that creating the database can also be done using phpMyAdmin, if available.
3. In the settings directory, edit the database.php file and change $database, $mysql_user,
$mysql_password and $mysql_host to the correct values. If you don't know what $mysql_host
should be, it should probably stay as it is - 'localhost'. There is also $mysql_table_prefix,
which defaults to a null value. If you change this, the names of the soon to be created tables
will all begin with the value of $mysql_table_prefix. For example, if you set $mysql_table_prefix
= "sph_", the table "keywords" will be created as "sph_keywords". The prefix is optional.
4. In the settings directory, edit the my.cnf file with the appropriate host, user, and password.
5. Open the install.php script (in the admin directory) in your browser. It will create the tables
necessary for Sphider to operate.
Alternatively, the tables can be created by hand using the tables.sql script provided in the sql
directory of the Sphider distribution. At the prompt, type:
mysql -u USERNAME -p sphider_db < [path_of_sphider]/sql/tables.sql
You will be prompted for your password.
** Note that creating the tables in this manner will NOT apply any prefix designated by
$mysql_table_prefix in the database.php file.
6. In admin directory, edit auth.php to change the administrator user name and password
(default values are 'admin' and 'admin').
7. It is highly recommended that the admin and settings directories be password protected. If at
all possible, the admin directory should also be set to only allow SSL access. When logging into
the admin directory using standard http access, your directory user name and password are not
encrypted. With https access, these items are encrypted and the risk of unauthorized access to the
admin directory is greatly reduced. The common_template and include directories should also
be protected. Do NOT restrict js_suggest or templates!
8. On Linux machines, you should check to be sure your web server has read/write/delete
permissions for the admin/backup, admin/log, admin/reports, admin/sitemaps, and
admin/tmp directories. There is also a tmp directory in Sphider home that needs web server
permissions. (Not all of these directories exist in SphiderLite.)
9. Open admin/admin.php in a browser and start using Sphider.
10. The first step to take after getting the admin screen should be to click on the "Database" tab
to ensure that all 29 tables (26 tables in SphiderLite) have been successfully created.
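The permission check in step 8 can be sketched as a small script. This is an assumption-laden illustration, not part of Sphider: the missing_or_unwritable function is hypothetical, and the result is only meaningful when the script runs as the same user as the web server.

```python
import os

# Directories named in step 8, relative to the Sphider home directory.
WRITABLE_DIRS = [
    "admin/backup", "admin/log", "admin/reports",
    "admin/sitemaps", "admin/tmp", "tmp",
]

def missing_or_unwritable(sphider_home, subdirs=WRITABLE_DIRS):
    """Return (subdir, problem) pairs for directories that are absent or lack permissions."""
    problems = []
    for sub in subdirs:
        path = os.path.join(sphider_home, sub)
        if not os.path.isdir(path):
            problems.append((sub, "missing"))
        # Read, write, and search permission are needed to create and delete files.
        elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
            problems.append((sub, "insufficient permissions"))
    return problems
```

An empty list means every listed directory exists and is usable by the current user. In SphiderLite some entries will be reported as missing simply because those directories are not part of that distribution.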
Upgrading an existing installation
1. If you already have an earlier installation of Sphider, you should first make a backup of your
existing database and store it in a safe place.
2. On the server, alter your database in MySQL (or use phpMyAdmin) to current standards.
a) At the command prompt, type (to log into MySQL):
mysql -u <your username> -p
Enter your password when prompted.
b) In MySQL, type:
ALTER DATABASE `sphider_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Use your current database name in place of sphider_db.
c) Use exit to exit MySQL.
3. Delete these current directories and their contents:
admin
include
common_template
js_suggest
languages
settings
sql
templates
upgrade (if it exists)
Then delete the current files: changelog, install.txt, search.php, and SphiderUserGuid.pdf.
4. Unpack the new files to your existing sphider directory which you have just cleaned out.
5. In settings directory, edit database.php file and change $database, $mysql_user,
$mysql_password and $mysql_host to correct values. If you don't know what $mysql_host
should be, it should probably stay as it is - 'localhost'. There is also $mysql_table_prefix,
defaulted to a null value. If you desire to change this, the names of the soon to be created tables
will all begin with the value of $mysql_table_prefix. For example, if you set $mysql_table_prefix
= "sph_", the table "keywords" will be created as "sph_keywords". The prefix is optional. Edit
the my.cnf file to match values in database.php.
6. Open version_update.php script (admin directory) in your browser, which will update the
tables necessary for Sphider to operate. Your existing data should be preserved.
7. In admin directory, edit auth.php to change the administrator user name and password
(default values are 'admin' and 'admin').
8. It is highly recommended that the admin directory be password protected. If at all possible,
the admin directory should also be set to only allow SSL access. When logging into the admin
directory using standard http access, your directory user name and password are not encrypted.
With https access, these items are encrypted and the risk of unauthorized access to the admin
directory is greatly reduced. The common_template and include directories should also be
protected. Do NOT restrict js_suggest or templates!
9. Open admin/admin.php in a browser and start using your updated Sphider.
NOTE ABOUT UPGRADING - The changelog lists which files have changed. It may be
tempting to replace ONLY the changed files and be done with it. While this may work at a
basic level, if you do so, PLEASE DO RUN version_update.php. It will make needed
changes to your database.
FINAL NOTE ABOUT INSTALLATION - When you have completed installing or upgrading
Sphider, the install.php and update_rollup.php scripts should be deleted. You won't be needing
them and there is no sense leaving them around for someone else to misuse.
Using the Admin Panel
Settings Tab
Figure 1: Settings tab
There are 68 user configurable settings (62 in SphiderLite) on this page.
GENERAL SETTINGS
Language
    A drop-down list of available languages is provided. This is the language which will appear to
    the user on the search page.
Search template
    This drop-down list shows available templates. Each template uses a CSS file to determine the
    look of the user search and search results pages.
Administrator e-mail address
    The e-mail address to which spidering log files may be sent.
Print spidering logs to standard out
    If this is checked, the spidering results will be displayed in the browser as spidering
    progresses.
Temporary directory
    This is the name and relative or absolute path to the temporary directory. This directory is
    used by Sphider during the parsing of URLs during indexing. If a Windows path containing
    backslashes is used, the next setting, Windows OS, must be enabled. The path must exist.
Windows OS
    Check this box if Sphider is to be run in a Windows environment.
Figure 2: Example of a spidering log printed to standard output
LOGGING SETTINGS
Log spidering results
    If checked, a log file will be created for each occurrence of indexing or re-indexing.
Log directory
    This is the name and relative or absolute path to the log file directory. This directory is
    where spidering log files are stored. If a Windows path containing backslashes is used, the
    Windows OS setting in General Settings must be enabled. The path must exist.
Log file format
    The log file may be in either HTML or text format.
Send spidering log to e-mail
    If checked, the spidering log will be e-mailed to the Administrator.
SPIDER SETTINGS
Required number of words in a page to be indexed
    This sets the minimum number of words which must appear on a page for it to be indexed.
Minimum word length in order to be indexed
    This sets the minimum length of a word before it can be indexed.
Keyword weight depending on the number of times it appears in a page is capped at this value
    A keyword's weight is increased by the number of times it is used on a page. This caps the
    weight of a keyword.
Index numbers
    If checked, numbers will be indexed. (They are subject to minimum word length rules.)
Index decimal numbers
    If checked, decimal numbers will be indexed. (This setting will be ignored if the above
    'Index numbers' is not also checked.)
Decimal separator
    Decimal period is the default, but decimal comma may be chosen. The choice affects the
    thousands separator.
Index words in domain name and url path
    If checked, words appearing in the domain name or path to a page will be indexed.
Index meta keywords
    If enabled, keywords appearing in meta tags are indexed.
Index images
    If checked, each page being indexed will be checked for images, and if found, the images will
    also be indexed. (Not in SphiderLite)
Minimum image width
    If Sphider can determine this size, this is the minimum width which will be accepted.
    (Not in SphiderLite)
Minimum image height
    If Sphider can determine this size, this is the minimum height which will be accepted.
    (Not in SphiderLite)
Index PDF files
    If checked, PDF files will be parsed and indexed.
Index DOC files
    If checked, DOC, DOCX, and ODT files will be parsed and indexed.
Index XLS files
    If checked, XLS files will be parsed and indexed.
Index PPT files
    If checked, PPT files will be parsed and indexed.
Full executable path to PDF converter
    This is the full path to the PDF converter. For a Windows OS, backslashes may be used.
    NOTE: The converter is not provided as a part of Sphider.
Full executable path to catdoc converter
    This is the full path to the catdoc converter. For a Windows OS, backslashes may be used.
    NOTE: The converter is not provided as a part of Sphider.
Full executable path to XLS converter
    This is the full path to the XLS converter. For a Windows OS, backslashes may be used.
    NOTE: The converter is not provided as a part of Sphider.
Full executable path to PPT converter
    This is the full path to the PPT converter. For a Windows OS, backslashes may be used.
    NOTE: The converter is not provided as a part of Sphider.
Full executable path to Pandoc converter
    This is the full path to the Pandoc converter. For a Windows OS, backslashes may be used.
    NOTE: The converter is not provided as a part of Sphider. Pandoc is needed to convert DOCX
    or ODT files.
User agent string
    This is the user agent string which will appear in the log files of the domain being spidered
    and indexed. It can be up to 50 characters in length.
Minimal delay between page downloads
    The minimum time, in seconds, between page downloads during spidering. Increasing this
    number will increase the amount of time required to spider a site, but may reduce the number
    of time-out errors.
Pause
    When checked, Sphider will pause for a selected interval (1, 2, or 5 minutes) after indexing a
    selected number of pages (10, 20, 30, or 50).
Use word stemming
    If used, this should be enabled BEFORE indexing. It allows, for example, a search for the
    word "run" to also return "runs" or "running".
Language to stem
    The language used for stemming. Each language has its own algorithm.
Strip session ids
    If enabled (recommended), session ids are removed from spidering results.
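To show how the minimum word length, Index numbers, and keyword weight cap settings interact, here is a rough sketch. It assumes a keyword's weight is simply its occurrence count limited by the cap, which is a simplification; the index_words function and its default values are hypothetical and do not come from Sphider.

```python
import re
from collections import Counter

def index_words(text, min_word_length=3, weight_cap=5, index_numbers=True):
    """Count the words worth indexing and cap each count at weight_cap."""
    counts = Counter()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) < min_word_length:
            continue  # too short; numbers are subject to the same length rule
        if word.isdigit() and not index_numbers:
            continue  # numbers are skipped unless 'Index numbers' is enabled
        counts[word] += 1
    # Repeating a keyword many times cannot raise its weight past the cap.
    return {word: min(n, weight_cap) for word, n in counts.items()}
```

For example, a page that repeats "run" five times gets a weight of 3 for that keyword when the cap is 3.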
SEARCH SETTINGS
Default results per page
    This sets the number of results shown per page to 10, 20, or 50. (It can be overridden on the
    search screen.)
Number of columns in category list
    If categories are shown on the search page, this determines the number of columns to be used
    in their display.
Bound number of search results
    This limits the number of search results returned. When set to 0, the limit is removed.
The length of the description string
    This limits the length of the description string retrieved from the database. Visually, it
    will have no impact on the length of the description shown in search results unless the value
    is less than "Maximum length of page summary" (below). A 0 removes the limit.
Number of links shown to "previous" and "next" pages
    This limits the number of links shown for "Previous" and/or "Next" pages when the number of
    results returned exceeds the maximum number of results per page.
Floor for query scores
    Limits results to this minimum score. 0 means no limit.
Show meta description in a results page
    If enabled, the meta description will be used if available. If not available, the normal page
    extract will be shown in the result descriptions.
Advanced search
    Changes the default AND search to an AND/OR/Phrase search.
Show result number
    Toggles the result number on the results report.
Show index date
    Displays the index date of the page reported.
Show url
    Shows the url of the reported result.
Show query scores
    This shows the query scores (chance of relevance) for each returned search result.
Show stars
    Shows scores using a 5 star system. Show query scores must also be enabled.
Show categories
    If enabled, categories will be displayed on the search form.
Maximum length of page summary
    This controls the length of the page summary for each search result.
Enable spelling suggestions (Did you mean...)
    If enabled, when a search returns empty but Sphider finds a similar word or phrase in the
    database, it will be suggested.
Show the 2 most relevant links from each site
    If enabled, only the 2 most relevant links in each domain are returned.
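The effect of "Floor for query scores" and "Show the 2 most relevant links from each site" can be sketched as a filter over scored results. This is an illustration under assumptions: results are taken to be (url, score) pairs, and filter_results is a hypothetical helper rather than a Sphider function.

```python
from urllib.parse import urlparse

def filter_results(results, score_floor=0, per_site_limit=None):
    """Drop results below the floor and keep at most per_site_limit per domain."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    kept, per_site = [], {}
    for url, score in ranked:
        if score_floor and score < score_floor:
            continue  # a floor of 0 means no limit
        host = urlparse(url).netloc
        if per_site_limit is not None and per_site.get(host, 0) >= per_site_limit:
            continue  # this domain already has its most relevant links
        per_site[host] = per_site.get(host, 0) + 1
        kept.append((url, score))
    return kept
```

Because the results are ranked before filtering, the links kept for each domain are that domain's highest scoring ones.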
FORM SELECTION (Full version only)
Display the classic search form
    This will make the classic search form available. If no forms are selected, classic will
    auto-select.
Display the RSS search form
    This will make the RSS search form available.
Display the image search form
    This will make the image search form available.
SUGGEST
Enable Sphider Suggest
    This turns the suggestion feature on. If unchecked, none of the next five items have any
    effect.
Search for suggestions in query log
    This enables suggestions from the query log. Only successful queries appear. Query log
    suggestions take priority over keyword or phrase suggestions.
Search for suggestions in keywords
    Enables suggestions from keywords. By default, suggestions are returned alphabetically.
Use weighting when suggesting keywords
    If suggestions from keywords are enabled, this alters their return from alphabetical to
    weighted. Keywords are weighted by frequency of occurrence.
Search for suggestions in phrases
    Enables suggestions from keyword phrases. This setting overrides any keyword settings,
    although phrase suggestions do not occur unless more than one word is entered.
Limit number of suggestions
    Controls the number of suggestions in the drop down from the query.
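The difference between alphabetical and weighted keyword suggestions can be sketched as follows. This assumes suggestions are keywords that begin with the typed text and that a keyword's weight is its frequency of occurrence; the suggest function is hypothetical and not Sphider's code.

```python
def suggest(prefix, keyword_counts, limit=10, weighted=False):
    """Return up to limit keywords starting with prefix, alphabetical or by frequency."""
    prefix = prefix.lower()
    matches = [k for k in keyword_counts if k.startswith(prefix)]
    if weighted:
        # Most frequent first; ties fall back to alphabetical order.
        matches.sort(key=lambda k: (-keyword_counts[k], k))
    else:
        matches.sort()
    return matches[:limit]
```

With counts of sea=1, search=5, and season=9, typing "sea" yields sea, search, season alphabetically, and season, search, sea when weighting is on.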
WEIGHTS
Relative weight of a word in the title of a webpage
    Assigns a relative weight to words appearing in a page title.
Relative weight of a word in the domain name
    Assigns a relative weight to words appearing in the domain name.
Relative weight of a word in the path name
    Assigns a relative weight to words appearing in a url path name.
Relative weight of a word in meta_keywords
    Assigns a relative weight to words appearing in meta tag keywords.
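This guide does not state the formula Sphider uses to combine these weights, so the following is only a sketch of how relative weights could contribute to a keyword's score. The additive formula, the page layout, and the keyword_score function are all assumptions made for illustration.

```python
def keyword_score(word, page, weights):
    """Add a bonus for each place the word appears, on top of its body count."""
    # Hypothetical page layout: a dict of body word counts plus one set of
    # words for each weighted location.
    score = page["body_counts"].get(word, 0)
    for field in ("title", "domain", "path", "meta_keywords"):
        if word in page[field]:
            score += weights[field]
    return score

EXAMPLE_PAGE = {
    "body_counts": {"sphider": 2},
    "title": {"sphider", "guide"},
    "domain": {"example"},
    "path": set(),
    "meta_keywords": {"sphider"},
}
# Illustrative values only; these are not Sphider's defaults.
EXAMPLE_WEIGHTS = {"title": 20, "domain": 60, "path": 10, "meta_keywords": 2}
```

Under these assumptions, raising the title weight relative to the others makes title matches count for more in the final ranking.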
If any of the options in the Settings tab are altered, click the "Save Settings" button at the bottom
of the page. The page will automatically refresh with the new settings.
Sites Tab
This tab shows information on all sites in the database. If this is a new installation, this tab
will appear as in Figure 3. When one or more sites have been added, you will see each site, one per
line, showing Site name, URL, Indexing status, and a link to Options so you may edit the site. On
the upper left of the Sites tab, you will initially have an additional link, Add site. Once one or
more sites have been added to the database, a second link, Reindex all, will appear. See Figure 4.
Figure 3: Initial appearance of the Sites screen
Figure 4: Sites tab after several sites have been added
Add site:
Figure 5: Add a site screen
From this screen, you can add sites to the database. For URL, enter the complete url of the site
you want to add, for example, "http://www.bobbuilder.com/".
For the Title, enter the title of the site, for example, "Bob the Builder".
The Short description is a description of the site, for example, "Bob Smith, builder of fine custom
homes in the Red River Valley".
If any categories exist, they will be displayed and you may choose which category or categories
best fit this site.
Click "Add" to save the site. You will be taken to a new page showing the information you have
entered about the site. Except for the “Site added” caption, this is the Options page accessed
from the main Sites screen with each site listed. See Figure 6.
Figure 6: Site added screen showing options
On the right, there will be several options.
Edit takes you to the Edit site screen (Figure 7) which allows you to make changes to the site.
You can change the title, description, or even change the selected categories. Most importantly,
there are several other changes which may be made. Spidering options allows you to control how
deep into a web site you wish to spider. The default, 2, means the spider will search no more than
two clicks away from the home page. Setting this option to Full removes any limitation.
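The meaning of spidering depth can be sketched with a breadth-first walk over a link graph. This is an illustration, not Sphider's crawler: the links dictionary stands in for fetching pages and extracting their links, and the crawl function is hypothetical.

```python
from collections import deque

def crawl(start, links, max_depth=2):
    """Return each reachable URL with its click distance from start.

    links maps a URL to the URLs it links to. A max_depth of None
    corresponds to the Full option (no depth limit).
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in depth_of:
                depth_of[nxt] = depth + 1
                queue.append(nxt)
    return depth_of
```

With a depth of 2, a page that is three clicks from the starting page is never reached.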
Index using a sitemap, if available causes the site to be indexed by using the contents of the site's
sitemap.xml (if it exists and is valid) instead of crawling and following links.
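For readers unfamiliar with sitemaps, the following sketch shows how page URLs can be read from a sitemap.xml file using Python's standard xml.etree.ElementTree module. It assumes the standard sitemaps.org format and is not Sphider's own parser; sitemap_urls is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

EXAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/</loc></url>"
    "<url><loc>http://example.com/about</loc></url>"
    "</urlset>"
)
```

An invalid sitemap raises xml.etree.ElementTree.ParseError here, which corresponds to the case where the sitemap exists but is not valid.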
Ignore robots.txt for images causes the spider to ignore, for images, the same robots.txt rules it
applies to the indexing of links. Some sites may allow a page to be indexed but ask that you keep
hands off indexing images. This option allows that request to be overridden, but doing so is not
recommended. Respect the site owners. (If you ARE the owner, then go ahead and index away!)
(Not in SphiderLite.)
Spider can leave domain means the spider may follow links to other sites.
Index foreign images allows you to index referenced images which are not native to the domain
being indexed. (Not in SphiderLite.)
Common text language: This allows you to choose the common text language for the selected
website. Common text words are excluded from indexing.
URL's must include is a list, one per line, of strings which a URL must contain in order to be
spidered. For example, you may want www.mysite.com/gotta-see-this to be indexed, so you would
enter "/gotta-see-this" in the text box.
Figure 7: Edit site screen
URL's must not include is a list, one per line, of strings which exclude a URL from spidering.
If you have a set of pages in www.mysite.com/donot-search-here, you would enter
"/donot-search-here" in the text box.
Both the must and must not lists may optionally use Perl style regular expressions in lieu of
literal strings. Any entry beginning with a '*' is treated as a regular expression, so that
'*/[a]+/' matches a string with one or more a's in it. The delimiter used does not need to be a '/'
(slash), but it is recommended that the character used not be one occurring in the regular
expression.
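The must and must not lists can be sketched as follows. This is an approximation for illustration only: it assumes a URL passes when it matches at least one must entry (if any are given) and no must not entry, it uses Python's re module rather than PHP's Perl style engine, and it ignores any modifiers after the closing delimiter. The entry_matches and url_allowed functions are hypothetical.

```python
import re

def entry_matches(entry, url):
    """Test one list entry against a URL, as a regex if it starts with '*'."""
    if entry.startswith("*"):
        pattern = entry[1:]
        if len(pattern) < 2:
            return False  # no delimiters, so nothing to match
        # Strip the delimiters, e.g. "/[a]+/" becomes "[a]+".
        delimiter = pattern[0]
        body = pattern[1:pattern.rindex(delimiter)]
        return re.search(body, url) is not None
    return entry in url  # literal entries are plain substring tests

def url_allowed(url, must_include, must_not_include):
    """Apply both lists to decide whether a URL should be spidered."""
    if must_include and not any(entry_matches(e, url) for e in must_include):
        return False
    return not any(entry_matches(e, url) for e in must_not_include)
```

For example, url_allowed("http://www.mysite.com/donot-search-here/x", [], ["/donot-search-here"]) is False, so that page would be skipped.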
When finished editing the site, be sure to click "Update" to save your changes. This will take
you back to the main page on the Sites tab.
The Index (or Re-index) option takes you to a page where you may enter or change indexing
options. This is initially a subset of the spider options given on the Edit page. Advanced options
in the upper left will expand to show all indexing options. When you are ready, click "Start
indexing". Be patient. It may appear nothing is happening, but you may notice your browser
indicating activity. If you enabled "Print spidering logs to standard out" on the Settings tab,
you will soon begin to see the spidering log appear. It will indicate when spidering is complete. If
you did not enable "Print spidering logs to standard out", just wait it out. Depending on the size
of the site being crawled, it may take from a minute to an hour or more. Indexing images
can add significantly to the time required.
Clear site allows all links and keywords associated with the site to be deleted. This essentially
resets the site to a “Not indexed” status. (Images associated with the site are NOT deleted.) Clear
site may be absent if the site hasn’t yet been indexed.
The Browse pages option lets you view a list of pages indexed on the site. If there is a long list,
there is a filter which you can use to narrow the results. For example, putting "/contacts" in the
filter and clicking the "Filter" button will restrict the pages listed to those containing "/contacts"
in the url. You can change the number of urls listed per page. The default is 10. You also have
the option to delete an indexed page from the database.
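The filter and paging behavior of Browse pages can be sketched in a few lines. This is an illustration, assuming the filter is a plain substring test on the url; browse_pages is a hypothetical function, not part of Sphider.

```python
def browse_pages(urls, url_filter="", per_page=10, page=1):
    """Return one page of the urls that contain url_filter."""
    matching = [u for u in urls if url_filter in u]
    start = (page - 1) * per_page
    return matching[start:start + per_page]
```

An empty filter matches every url, so browse_pages(urls) simply returns the first ten entries.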