1. Core Project:
a. Main application:
Our main application is a search engine that deals with structured and semi-structured data types. Our main objective is to find and retrieve information from big data.
We will have live components that push instant changes to our web-based application, such as weather, trending tweets, and stock prices. There will also be a news section, which is another constantly changing category. Information such as weather and news cannot be collected by continuous crawling because it changes too quickly; therefore, the data communication with these live sources has to be kept stable.
The main application should communicate with the database. When a query is created, it should look for the right keyword across the cluster of tables, then retrieve the matching information as a paginated list.
We will have several instances for each part of the project, and web services will help us create the communication between them.
Both the client side and the server side will be structured on an event-driven architecture (microservices), because the concept itself requires that every application or service have its own database. This mechanism gives us data consistency across services.
It will also help us keep a clean separation between the data routes (structured data comes from our database, semi-structured data comes from live services).
i. Client Side:
1. Get requests: Live data should be consumed by the client end; however, it needs to be mixed with our database results.
2. Post requests: They should take the user's input and pass it to our server side (back end).
ii. Server Side:
1. Get requests: Collect all the semi-structured data from the targeted APIs. The server also needs to be able to get every item in the database.
2. Post requests: Receive the proper queries and send the results back for the client side to consume (sketched below).
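To make this request flow concrete, here is a minimal sketch of a server-side endpoint, assuming Python with Flask, requests, and SQLite. The route name, table layout, and live weather URL are illustrative assumptions rather than part of the design, and in the real microservice setup this logic would be split across separate services.

# Minimal sketch: one endpoint that takes the user's query (POST),
# mixes structured database results with live semi-structured data,
# and returns both to the client side.
import sqlite3

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search", methods=["POST"])   # hypothetical route name
def search():
    payload = request.get_json(silent=True) or {}
    query = payload.get("q", "")

    # Structured data: look the keyword up in our own database.
    conn = sqlite3.connect("core.db")      # hypothetical database file
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE title LIKE ? LIMIT 10",
        (f"%{query}%",),
    ).fetchall()
    conn.close()

    # Semi-structured data: fetch from a live service (placeholder URL).
    live = requests.get("https://example.com/weather", timeout=5).json()

    return jsonify({"db_results": rows, "live": live})

if __name__ == "__main__":
    app.run()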
2. Bot/Crawler Project:
It will travel around the web, collect unstructured data, and prepare it for our core project. In this part our main target is unstructured data, which the bots should consume from web sites. Each visit should be anonymous and should respect the web site's robots.txt file. Every time a bot visits a web page, it needs to create its own user agent. Therefore, more than one technology has to be applied.
a. URL finder:
The URL finder is our main and key component for collecting data. It should find every clickable item (attribute: href) on the web site, create a folder that carries the same name as the site being crawled, and finally list all the links.
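A minimal sketch of the URL finder, assuming Python with the standard library's HTMLParser plus the requests package; the links.txt file inside the per-site folder is an illustrative choice.

# Minimal sketch: collect every href on a page, create a folder named
# after the crawled site, and list all the links in it.
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

class LinkFinder(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Every clickable item is an <a> tag with an href attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def find_urls(site_url):
    folder = urlparse(site_url).netloc      # folder named after the site
    os.makedirs(folder, exist_ok=True)

    finder = LinkFinder(site_url)
    finder.feed(requests.get(site_url, timeout=10).text)

    with open(os.path.join(folder, "links.txt"), "w") as out:
        out.write("\n".join(finder.links))
    return finder.links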
b. Text finder:
Later, using multithreading, our scrapers should visit each item in the crawled list and grab the information inside the web site. They also need to prepare this information for the next step, which is parsing it into a JSON file. The text finder should act like a client and post the JSON objects to our database. Once the data becomes semi-structured, it is easier to manipulate and find in the database.
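A possible sketch of the text finder, assuming Python's concurrent.futures thread pool for the multithreading and requests as the HTTP client; the ingest endpoint URL is a hypothetical placeholder for the core project's database service.

# Minimal sketch: visit each crawled link on worker threads, wrap the page
# text as a JSON object, and post it so it becomes a semi-structured record.
from concurrent.futures import ThreadPoolExecutor

import requests

INGEST_URL = "http://localhost:8000/documents"   # hypothetical endpoint

def scrape_and_post(url):
    page = requests.get(url, timeout=10)
    document = {"url": url, "text": page.text}
    requests.post(INGEST_URL, json=document, timeout=10)

def run(crawled_links):
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(scrape_and_post, crawled_links)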
3. Search Engine Project:
a. Create Index:
Our corpus (document collection), as mentioned earlier, is a mix of semi-structured and structured data. To simplify our work and focus on the core algorithms, we won't write the crawler to get the articles from web sites such as Wikipedia, CNN, Bloomberg, and AccuWeather. We will use a prebuilt file which contains approximately 300,000 articles from fields ranging from computer science to the arts. The structure of the file is as follows:
<page>
<id> page-ID (an integer) </id>
<title> title of the page </title>
<text>
Contents of the article
Can span multiple lines
May contain non-alphanumeric characters and links
</text>
</page>
<page>
<id> another page-ID (every page-ID is distinct) </id>
<title> title of another page </title>
<text>
Body of the article
</text>
</page>
As we can see, there is an XML-like structure in the collection. Every article lies between <page> and </page> tags, where the page-ID, title, and text are separated by the corresponding tags. We will write our own routine to parse this structure using regular expressions.
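A minimal sketch of that parsing routine, assuming Python's re module and relying on the assumption below that the special tags never appear inside article bodies.

# Minimal sketch: extract (page-ID, title, text) from the XML-like file.
import re

PAGE_RE = re.compile(
    r"<page>\s*<id>\s*(.*?)\s*</id>\s*<title>\s*(.*?)\s*</title>\s*"
    r"<text>\s*(.*?)\s*</text>\s*</page>",
    re.DOTALL,
)

def parse_collection(raw_text):
    # Yield one (page_id, title, text) tuple per article.
    for match in PAGE_RE.finditer(raw_text):
        page_id, title, text = match.groups()
        yield int(page_id), title, text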
We will assume that these
special tags won’t appear in the body of the articles. However, the articles
may contain non-alphanumeric characters and capital letters that we will pay
special attention to while building the index. For example, we don’t want to
index apple, Apple, and APPLE differently.
They are basically all apple.
Or considering the grammatical structure of the language, we don’t want to
index research, researches, and researching separately. They are all about
research. When we search for one of these terms we would expect results
containing any of those variations. Additionally, we wouldn’t like to index
words such as ‘the’, ‘a’, ‘an’ because they appear in almost every document and
they don’t give us very much information about the document or the query. These
very common words are called stop words.
i. Inverted Index:
An inverted index is a data structure that we build while
parsing the documents that we are going to answer the search queries on. Given
a query, we use the index to return the list of documents relevant for this
query. The inverted index contains mappings from terms (words) to the documents
that those terms appear in. Each vocabulary term is a key in the index whose
value is its postings list. A term’s postings list is the list of documents
that the term appears in. To illustrate with an example, if we have the
following documents:
Document 1: Information Retrieval and Web Search
Document 2: Search Engine Ranking
Document 3: Web Search Course
Then the postings list of the term ‘web’ would be the list
[1, 3], meaning the term ‘web’ appears in documents with IDs 1 and 3. Similarly
the postings list of the term ‘search’ would be [1, 2, 3], and for the term
‘course’ the postings list would be [3]. We may want to keep additional
information in the index such as the number of occurrences of the term in the
whole collection (its term frequency), or the number of different documents
that the term appears in (its document frequency), the positions of the term’s
occurrences within a document etc. The amount of information we keep in our
index will grow as we add more functionality to our search engine.
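To illustrate, here is a minimal sketch that builds such an index for the three example documents above (Python; real postings would also carry term frequencies, document frequencies, and positions).

# Minimal sketch: map each term to the list of document IDs it appears in.
from collections import defaultdict

documents = {
    1: "Information Retrieval and Web Search",
    2: "Search Engine Ranking",
    3: "Web Search Course",
}

index = defaultdict(list)                 # term -> postings list
for doc_id, text in sorted(documents.items()):
    for term in sorted(set(text.lower().split())):
        index[term].append(doc_id)

print(index["web"])      # [1, 3]
print(index["search"])   # [1, 2, 3]
print(index["course"])   # [3]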
ii. Query Types:
So, what types of queries will our search engine answer? We will answer the types of queries that we use while searching every day. Namely:
1) One Word Queries: Queries that consist of a single word, such as movie or hotel.
2) Free Text Queries: Queries that contain a sequence of words separated by spaces, such as godfather movie or hotels in San Francisco.
3) Phrase Queries: These are more advanced queries that again consist of a sequence of words separated by spaces, but they are typed inside double quotes and we want the matching documents to contain the terms of the query exactly in the specified order, such as "godfather movie".
iii. Parsing The Collection:
While parsing the document collection we will decide which words will be the terms in the index. As mentioned above, we don't want every word to be a term. So, while parsing the Wikipedia articles we will perform the following operations on each page, in this order (sketched after the list):
1) Concatenate the title and the text of the
page.
2) Lowercase all words.
3) Get all tokens, where a token is a string of
alphanumeric characters terminated by a non-alphanumeric character. The
alphanumeric characters are defined to be [a-z0-9]. So, the tokens for the word
‘apple+orange’ would be ‘apple’ and ‘orange’.
4) Filter out all the tokens that are in the
stop words list, such as ‘a’, ‘an’, ‘the’.
5) Stem each token using the Porter stemmer to finally obtain the stream of terms. The Porter stemmer removes common endings from words. For example, the stemmed versions of the words fish, fishes, fishing, fisher, and fished are all fish.
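A minimal sketch of this five-step pipeline, assuming Python with NLTK's Porter stemmer; the stop-word set here is a tiny illustrative subset.

# Minimal sketch: turn one page into its stream of index terms.
import re

from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the"}        # in practice a much longer list
TOKEN_RE = re.compile(r"[a-z0-9]+")    # alphanumeric characters are [a-z0-9]
stemmer = PorterStemmer()

def page_to_terms(title, text):
    content = (title + " " + text).lower()    # steps 1 and 2
    tokens = TOKEN_RE.findall(content)        # step 3: 'apple+orange' -> 'apple', 'orange'
    return [stemmer.stem(t)                   # step 5
            for t in tokens
            if t not in STOP_WORDS]           # step 4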
b. Query Index:
This is the second part of our implementing-a-search-engine project. The first part was about creating the inverted index. Now, we will use the index to answer actual search queries. There is more to say about queries such as phrase queries and free text queries, and we will discuss these when we get to that point.
i. Query Types:
Let's first recall the query types. Our search engine is going to answer the 3 types of queries that we generally use while searching (the first two are sketched after this list).
1) One Word Queries (OWQ): An OWQ consists of only a single word, such as computer or university. The matching documents are the ones containing the single query term.
2) Free Text Queries (FTQ): An FTQ contains a sequence of words separated by spaces, like an actual sentence, such as computer science or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): A PQ also contains a sequence of words just like an FTQ, but it is typed within double quotes. The meaning is that we want to see all the query terms in the matching documents, exactly in the order specified, such as "Turing Award" or "information retrieval and web search".
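A minimal sketch of how OWQ and FTQ can be answered against the inverted index sketched earlier; phrase queries additionally need positional postings, which are omitted here.

# Minimal sketch: OWQ returns the term's postings list; FTQ returns the
# union of the postings lists of all query terms.
def one_word_query(index, term):
    return set(index.get(term, []))

def free_text_query(index, terms):
    result = set()
    for term in terms:
        result |= set(index.get(term, []))
    return result

# With the example index above:
# one_word_query(index, "web")                 -> {1, 3}
# free_text_query(index, ["web", "ranking"])   -> {1, 2, 3}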
c. Ranking:
We basically have a search engine that can answer search queries on a given corpus, but the results are not ranked. Now, we will add ranking to obtain an ordered list of results, which is one of the most challenging and interesting parts.
The first ranking scheme we will implement is tf-idf. In later stages, we'll analyze Okapi BM25, which is a variant of tf-idf, and we will also implement Google's PageRank. Then we will explore machine learning techniques such as the Naive Bayes classifier, linear regression, neural networks, clustering, decision trees, and so forth.
Tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered more important. It is one of the most popular weighting schemes in Information Retrieval.
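A minimal sketch of the weighting itself, assuming the common tf * log(N / df) formulation; real engines often normalize for document length as well.

# Minimal sketch: weight of one term in one document.
import math

def tf_idf(tf, df, n_docs):
    # tf: occurrences of the term in the document
    # df: number of documents containing the term
    # n_docs: total number of documents in the corpus
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# Example: a term appearing 3 times in a document and in 10 of 1000 documents.
print(tf_idf(3, 10, 1000))   # 3 * log(100) ≈ 13.8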
Conclusion:
There are many small applications that communicate with each other and help each other to build the core project / main application. We will keep adding to and improving the core project. So far we know how to structure our database based on the types of information we consume throughout the internet, and for each user search (each created query) the engine returns similar and accurate results.