1. Core Project:
a. Main application:
Our main application is a search engine that deals with structured and semi-structured data types. Our main objective is to find and retrieve information from big data.
We will have live components that push instant changes to our web-based application, such as weather, trending tweets, and stock prices. There will also be a news section, which is another constantly changing category. Information such as weather and news cannot be collected by continuous crawling because it changes too quickly; therefore, the data communication with these live sources has to be kept stable.
The main application should communicate with the database. When a query is created, it should look for the right keyword across the cluster of tables, then retrieve the matching information as a paginated list.
We will have several instances for each part of the project, and web services will help us create the communication between them.
Both the client side and the server side will be structured on an event-driven architecture (microservices), because the concept itself requires that every application or service have its own database. This mechanism gives us data consistency across services.
It will also help us keep a clean separation between the data routes (structured data comes from our database, semi-structured data comes from live services).
i. Client Side:
1. Get requests: Live data should be consumed by the client end; however, it needs to be mixed with our database results.
2. Post requests: They should take the user's input and pass it to our server side (back end).
ii. Server Side:
1. Get requests: Collect all the semi-structured data from the targeted APIs. The server also needs to be able to get every item in the database.
2. Post requests: Receive the proper queries and send the results back for the client side to consume (sketched below).
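To make this request flow concrete, here is a minimal sketch of a server-side endpoint, assuming Python with Flask, requests, and SQLite. The route name, table layout, and live weather URL are illustrative assumptions rather than part of the design, and in the real microservice setup this logic would be split across separate services.

# Minimal sketch: one endpoint that takes the user's query (POST),
# mixes structured database results with live semi-structured data,
# and returns both to the client side.
import sqlite3

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search", methods=["POST"])   # hypothetical route name
def search():
    payload = request.get_json(silent=True) or {}
    query = payload.get("q", "")

    # Structured data: look the keyword up in our own database.
    conn = sqlite3.connect("core.db")      # hypothetical database file
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE title LIKE ? LIMIT 10",
        (f"%{query}%",),
    ).fetchall()
    conn.close()

    # Semi-structured data: fetch from a live service (placeholder URL).
    live = requests.get("https://example.com/weather", timeout=5).json()

    return jsonify({"db_results": rows, "live": live})

if __name__ == "__main__":
    app.run()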
2. Bot/Crawler Project:
It will travel around the web, collect unstructured data, and prepare it for our core project. In this part our main target is unstructured data, which the bots should consume from web sites. Each visit should be anonymous and should respect the web site's robots.txt file. Every time a bot visits a web page, it needs to create its own user agent. Therefore, more than one technology has to be applied.
a. URL finder:
The URL finder is our main and key component for collecting data. It should find every clickable item (attribute: href) on the web site, create a folder that carries the same name as the site being crawled, and finally list all the links.
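A minimal sketch of the URL finder, assuming Python with the standard library's HTMLParser plus the requests package; the links.txt file inside the per-site folder is an illustrative choice.

# Minimal sketch: collect every href on a page, create a folder named
# after the crawled site, and list all the links in it.
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

class LinkFinder(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Every clickable item is an <a> tag with an href attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def find_urls(site_url):
    folder = urlparse(site_url).netloc      # folder named after the site
    os.makedirs(folder, exist_ok=True)

    finder = LinkFinder(site_url)
    finder.feed(requests.get(site_url, timeout=10).text)

    with open(os.path.join(folder, "links.txt"), "w") as out:
        out.write("\n".join(finder.links))
    return finder.links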
b. Text finder:
Later, using multithreading, our scrapers should visit each item in the crawled list and grab the information inside the web site. They also need to prepare this information for the next step, which is parsing it into a JSON file. The text finder should act like a client and post the JSON objects to our database. Once the data becomes semi-structured, it is easier to manipulate and find in the database.
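A possible sketch of the text finder, assuming Python's concurrent.futures thread pool for the multithreading and requests as the HTTP client; the ingest endpoint URL is a hypothetical placeholder for the core project's database service.

# Minimal sketch: visit each crawled link on worker threads, wrap the page
# text as a JSON object, and post it so it becomes a semi-structured record.
from concurrent.futures import ThreadPoolExecutor

import requests

INGEST_URL = "http://localhost:8000/documents"   # hypothetical endpoint

def scrape_and_post(url):
    page = requests.get(url, timeout=10)
    document = {"url": url, "text": page.text}
    requests.post(INGEST_URL, json=document, timeout=10)

def run(crawled_links):
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(scrape_and_post, crawled_links)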
3. Search Engine Project:
a. Create Index:
Our corpus (document collection), as mentioned earlier, is a mix of semi-structured and structured data. To simplify our work and focus on the core algorithms, we won't write the crawler to get the articles from web sites such as Wikipedia, CNN, Bloomberg, and AccuWeather. We will use a prebuilt file which contains approximately 300,000 articles from fields ranging from computer science to the arts. The structure of the file is as follows:
<page>
<id> page-ID (an integer) </id>
<title> title of the page </title>
<text>
Contents of the article
Can span multiple lines
May contain non-alphanumeric characters and links
</text>
</page>
<page>
<id> another page-ID (every page-ID is distinct) </id>
<title> title of another page </title>
<text>
Body of the article
</text>
</page>
As we can see, there is an XML-like structure in the collection. Every article lies between <page> and </page> tags, where the page-ID, title, and text are separated by the corresponding tags. We will write our own routine to parse this structure using regular expressions.
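A minimal sketch of that parsing routine, assuming Python's re module and relying on the assumption below that the special tags never appear inside article bodies.

# Minimal sketch: extract (page-ID, title, text) from the XML-like file.
import re

PAGE_RE = re.compile(
    r"<page>\s*<id>\s*(.*?)\s*</id>\s*<title>\s*(.*?)\s*</title>\s*"
    r"<text>\s*(.*?)\s*</text>\s*</page>",
    re.DOTALL,
)

def parse_collection(raw_text):
    # Yield one (page_id, title, text) tuple per article.
    for match in PAGE_RE.finditer(raw_text):
        page_id, title, text = match.groups()
        yield int(page_id), title, text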
We will assume that these
special tags won’t appear in the body of the articles. However, the articles
may contain non-alphanumeric characters and capital letters that we will pay
special attention to while building the index. For example, we don’t want to
index apple, Apple, and APPLE differently.
They are basically all apple.
Or considering the grammatical structure of the language, we don’t want to
index research, researches, and researching separately. They are all about
research. When we search for one of these terms we would expect results
containing any of those variations. Additionally, we wouldn’t like to index
words such as ‘the’, ‘a’, ‘an’ because they appear in almost every document and
they don’t give us very much information about the document or the query. These
very common words are called stop words.
i. Inverted Index:
An inverted index is a data structure that we build while
parsing the documents that we are going to answer the search queries on. Given
a query, we use the index to return the list of documents relevant for this
query. The inverted index contains mappings from terms (words) to the documents
that those terms appear in. Each vocabulary term is a key in the index whose
value is its postings list. A term’s postings list is the list of documents
that the term appears in. To illustrate with an example, if we have the
following documents:
Document 1: Information Retrieval and Web Search
Document 2: Search Engine Ranking
Document 3: Web Search Course
Then the postings list of the term ‘web’ would be the list
[1, 3], meaning the term ‘web’ appears in documents with IDs 1 and 3. Similarly
the postings list of the term ‘search’ would be [1, 2, 3], and for the term
‘course’ the postings list would be [3]. We may want to keep additional
information in the index such as the number of occurrences of the term in the
whole collection (its term frequency), or the number of different documents
that the term appears in (its document frequency), the positions of the term’s
occurrences within a document etc. The amount of information we keep in our
index will grow as we add more functionality to our search engine.
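To illustrate, here is a minimal sketch that builds such an index for the three example documents above (Python; real postings would also carry term frequencies, document frequencies, and positions).

# Minimal sketch: map each term to the list of document IDs it appears in.
from collections import defaultdict

documents = {
    1: "Information Retrieval and Web Search",
    2: "Search Engine Ranking",
    3: "Web Search Course",
}

index = defaultdict(list)                 # term -> postings list
for doc_id, text in sorted(documents.items()):
    for term in sorted(set(text.lower().split())):
        index[term].append(doc_id)

print(index["web"])      # [1, 3]
print(index["search"])   # [1, 2, 3]
print(index["course"])   # [3]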
ii. Query Types:
So, what types of queries will our search engine answer? We will answer the types of queries that we use while searching every day. Namely:
1) One Word Queries: Queries that consist of a single word, such as movie or hotel.
2) Free Text Queries: Queries that contain a sequence of words separated by spaces, such as godfather movie or hotels in San Francisco.
3) Phrase Queries: These are more advanced queries that again consist of a sequence of words separated by spaces, but they are typed inside double quotes and we want the matching documents to contain the terms of the query exactly in the specified order, such as "godfather movie".
iii. Parsing The Collection:
While parsing the document collection we will decide which words will be the terms in the index. As mentioned above, we don't want every word to be a term. So, while parsing the Wikipedia articles we will perform the following operations on each page, in this order (sketched after the list):
1) Concatenate the title and the text of the
page.
2) Lowercase all words.
3) Get all tokens, where a token is a string of
alphanumeric characters terminated by a non-alphanumeric character. The
alphanumeric characters are defined to be [a-z0-9]. So, the tokens for the word
‘apple+orange’ would be ‘apple’ and ‘orange’.
4) Filter out all the tokens that are in the
stop words list, such as ‘a’, ‘an’, ‘the’.
5) Stem each token using the Porter stemmer to finally obtain the stream of terms. The Porter stemmer removes common endings from words. For example, the stemmed versions of the words fish, fishes, fishing, fisher, and fished are all fish.
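A minimal sketch of this five-step pipeline, assuming Python with NLTK's Porter stemmer; the stop-word set here is a tiny illustrative subset.

# Minimal sketch: turn one page into its stream of index terms.
import re

from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the"}        # in practice a much longer list
TOKEN_RE = re.compile(r"[a-z0-9]+")    # alphanumeric characters are [a-z0-9]
stemmer = PorterStemmer()

def page_to_terms(title, text):
    content = (title + " " + text).lower()    # steps 1 and 2
    tokens = TOKEN_RE.findall(content)        # step 3: 'apple+orange' -> 'apple', 'orange'
    return [stemmer.stem(t)                   # step 5
            for t in tokens
            if t not in STOP_WORDS]           # step 4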
b. Query Index:
This is the second part of our implementing-a-search-engine project. The first part was about creating the inverted index. Now, we will use the index to answer actual search queries. There is more to say about queries such as phrase queries and free text queries, and we will discuss these when we get to that point.
i. Query Types:
Let's first recall the query types. Our search engine is going to answer the 3 types of queries that we generally use while searching (the first two are sketched after this list).
1) One Word Queries (OWQ): An OWQ consists of only a single word, such as computer or university. The matching documents are the ones containing the single query term.
2) Free Text Queries (FTQ): An FTQ contains a sequence of words separated by spaces, like an actual sentence, such as computer science or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): A PQ also contains a sequence of words just like an FTQ, but it is typed within double quotes. The meaning is that we want to see all the query terms in the matching documents, exactly in the order specified, such as "Turing Award" or "information retrieval and web search".
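A minimal sketch of how OWQ and FTQ can be answered against the inverted index sketched earlier; phrase queries additionally need positional postings, which are omitted here.

# Minimal sketch: OWQ returns the term's postings list; FTQ returns the
# union of the postings lists of all query terms.
def one_word_query(index, term):
    return set(index.get(term, []))

def free_text_query(index, terms):
    result = set()
    for term in terms:
        result |= set(index.get(term, []))
    return result

# With the example index above:
# one_word_query(index, "web")                 -> {1, 3}
# free_text_query(index, ["web", "ranking"])   -> {1, 2, 3}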
c. Ranking:
We basically have a search engine that can answer search queries on a given corpus, but the results are not ranked. Now, we will add ranking to obtain an ordered list of results, which is one of the most challenging and interesting parts.
The first ranking scheme we will implement is tf-idf. In later stages, we'll analyze Okapi BM25, which is a variant of tf-idf, and we will also implement Google's PageRank. Then we will explore machine learning techniques such as the Naive Bayes classifier, linear regression, neural networks, clustering, decision trees, and so forth.
Tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered more important. It is one of the most popular weighting schemes in Information Retrieval.
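A minimal sketch of the weighting itself, assuming the common tf * log(N / df) formulation; real engines often normalize for document length as well.

# Minimal sketch: weight of one term in one document.
import math

def tf_idf(tf, df, n_docs):
    # tf: occurrences of the term in the document
    # df: number of documents containing the term
    # n_docs: total number of documents in the corpus
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# Example: a term appearing 3 times in a document and in 10 of 1000 documents.
print(tf_idf(3, 10, 1000))   # 3 * log(100) ≈ 13.8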
Conclusion:
There are many small applications that communicate with each other and help each other to build the core project / main application. We will keep adding to and improving the core project. So far we know how to structure our database based on the types of information we consume throughout the internet, and for each user search (each created query) the engine returns similar and accurate results.