Friday 26 December 2014

PROJECT EULER SOLUTION # 107

Project Euler Problem # 107
The following undirected network consists of seven vertices and twelve edges with a total weight of 243.

The same network can be represented by the matrix below.

        A    B    C    D    E    F    G
   A    -   16   12   21    -    -    -
   B   16    -    -   17   20    -    -
   C   12    -    -   28    -   31    -
   D   21   17   28    -   18   19   23
   E    -   20    -   18    -    -   11
   F    -    -   31   19    -    -   27
   G    -    -    -   23   11   27    -
However, it is possible to optimise the network by removing some edges and still ensure that all points on the network remain connected. The network which achieves the maximum saving keeps only the edges A–C (12), A–B (16), B–D (17), D–E (18), D–F (19) and E–G (11). It has a weight of 93, representing a saving of 243 − 93 = 150 from the original network.

Using network.txt (right click and 'Save Link/Target As...'), a 6K text file containing a network with forty vertices, and given in matrix form, find the maximum saving which can be achieved by removing redundant edges whilst ensuring that the network remains connected.
Project Euler Solution # 107 in Python



MAX_INT = 2147483647   # stands in for infinity
tree = []              # vertices already added to the spanning tree
numberOfVertices = 40

class GraphNode:
    def __init__(self, nodeNumber):
        self.key = MAX_INT
        self.adjacencyList = []
        self.nodeNumber = nodeNumber
        self.visited = 0

    def updateAdjacencyList(self, nodeList, adjacencyMatrix):
        # A -1 entry in the matrix means there is no edge.
        for i in range(numberOfVertices):
            if adjacencyMatrix[self.nodeNumber][i] != -1:
                self.adjacencyList.append(nodeList[i])

def extractMin(Q, adjacencyMatrix, answer):
    # Scan every edge leaving the tree and pick the cheapest one
    # that reaches a vertex not yet visited.
    minNode = None
    minNodev = None
    minWeight = MAX_INT
    for v in tree:
        for u in v.adjacencyList:
            if u.visited == 0 and adjacencyMatrix[v.nodeNumber][u.nodeNumber] < minWeight:
                minWeight = adjacencyMatrix[v.nodeNumber][u.nodeNumber]
                minNode = u
                minNodev = v
    answer += adjacencyMatrix[minNodev.nodeNumber][minNode.nodeNumber]
    Q.remove(minNode)
    minNode.visited = 1
    return [minNode, answer]

def isNodeInQ(v, Q):
    return 1 if v in Q else 0
def MST_Prim(nodeList, adjacencyMatrix):
    # Grow a minimum spanning tree from vertex 0, accumulating the
    # weight of each edge as it is added.
    answer = 0
    nodeList[0].key = 0
    nodeList[0].visited = 1
    tree.append(nodeList[0])
    Q = [nodeList[i] for i in range(1, numberOfVertices)]
    while len(Q) != 0:
        u, answer = extractMin(Q, adjacencyMatrix, answer)
        tree.append(u)
        # Textbook Prim bookkeeping: lower the key of every
        # neighbour still outside the tree.
        for v in u.adjacencyList:
            if isNodeInQ(v, Q) == 1 and adjacencyMatrix[u.nodeNumber][v.nodeNumber] < v.key:
                v.key = adjacencyMatrix[u.nodeNumber][v.nodeNumber]
    return answer

test_file = open("a.txt", "r+")   # the problem's network.txt, saved locally as a.txt

# Parse the 40x40 matrix; '-' marks a missing edge and becomes -1.
adjacencyMatrix = [line.split(',') for line in test_file.read().split('\n') if line]

totalWeight = 0
for i in range(numberOfVertices):
    for j in range(numberOfVertices):
        if adjacencyMatrix[i][j] != '-':
            adjacencyMatrix[i][j] = int(adjacencyMatrix[i][j])
            totalWeight += adjacencyMatrix[i][j]
        else:
            adjacencyMatrix[i][j] = -1

nodeList = [GraphNode(i) for i in range(numberOfVertices)]

# updating the adjacency list of each node
for node in nodeList:
    node.updateAdjacencyList(nodeList, adjacencyMatrix)

# applying Prim's Algorithm for finding the Minimum Spanning Tree
reducedWeight = MST_Prim(nodeList, adjacencyMatrix)

# The matrix counts every edge twice, hence the division by 2.
print "Savings :", totalWeight / 2 - reducedWeight
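The same answer can be reached more compactly with a heap-based variant of Prim's algorithm. This is a minimal sketch, not a restatement of the solution above: it assumes adjacencyMatrix has already been parsed into ints and -1 entries exactly as in the script, and the function name prim_total is introduced here for illustration.

import heapq

def prim_total(matrix):
    # Lazy Prim: grow the tree from vertex 0, always taking the
    # cheapest edge that reaches a vertex not yet in the tree.
    n = len(matrix)
    visited = [False] * n
    heap = [(0, 0)]   # (edge weight, vertex); the root costs nothing
    total = 0
    while heap:
        w, u = heapq.heappop(heap)
        if visited[u]:
            continue
        visited[u] = True
        total += w
        for v in range(n):
            if matrix[u][v] != -1 and not visited[v]:
                heapq.heappush(heap, (matrix[u][v], v))
    return total

# totalWeight / 2 - prim_total(adjacencyMatrix) then gives the same
# saving, since each edge appears twice in the matrix.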


Wednesday 24 December 2014

PROJECT EULER SOLUTION # 83

Project Euler Problem # 83

In the 5 by 5 matrix below, the minimal path sum from the top left to the bottom right, by moving left, right, up, and down, is equal to 2297.

131  673  234  103   18
201   96  342  965  150
630  803  746  422  111
537  699  497  121  956
805  732  524   37  331

Find the minimal path sum, in matrix.txt (right click and "Save Link/Target As..."), a 31K text file containing an 80 by 80 matrix, from the top left to the bottom right by moving left, right, up, and down.

Project Euler Solution # 83 in Python



numberOfNodes = 80

class GraphNode:
    def __init__(self, row, col, data):
        self.row = row
        self.col = col
        self.data = data
        self.shortestDistance = 0   # distance from source
        self.parent = None
        self.adjacencyList = []

    def printGraphNode(self):
        print self.data,

    def updateAdjacencyList(self, nodeList):
        # Neighbours are the cells directly above, below, left
        # and right, when they exist.
        self.adjacencyList = []
        if self.row - 1 != -1:
            self.adjacencyList.append(nodeList[self.row - 1][self.col])
        if self.row + 1 != len(nodeList):
            self.adjacencyList.append(nodeList[self.row + 1][self.col])
        if self.col - 1 != -1:
            self.adjacencyList.append(nodeList[self.row][self.col - 1])
        if self.col + 1 != len(nodeList):
            self.adjacencyList.append(nodeList[self.row][self.col + 1])

    def printAdjacencyList(self):
        for node in self.adjacencyList:
            print node.data,


# class ended here...
def initialiseSingleSource(nodeList, source):
    # Every node starts infinitely far away; the source's distance
    # is its own cell value (entering the grid costs grid[0][0]).
    for row in nodeList:
        for node in row:
            node.shortestDistance = 2147483647
            node.parent = None
    source.shortestDistance = source.data

def extractMin(Q):
    # Linear scan for the unsettled node with the smallest
    # tentative distance.
    mini = 0
    for i in range(len(Q)):
        if Q[i].shortestDistance < Q[mini].shortestDistance:
            mini = i
    return Q.pop(mini)

def relax(u, v):
    # Stepping into v costs v.data, so a route through u may be shorter.
    if v.shortestDistance > u.shortestDistance + v.data:
        v.shortestDistance = u.shortestDistance + v.data
        v.parent = u

def dijkstra(nodeList, source):
    initialiseSingleSource(nodeList, source)
    Q = []
    for row in nodeList:
        Q.extend(row)
    while len(Q) != 0:
        u = extractMin(Q)
        for v in u.adjacencyList:
            relax(u, v)

test_file = open("a.txt", "r+")   # the problem's matrix.txt, saved locally as a.txt

# Parse the 80x80 matrix of integers.
myList = [line.split(',') for line in test_file.read().split('\n') if line]
for i in range(numberOfNodes):
    for j in range(numberOfNodes):
        myList[i][j] = int(myList[i][j])

# Build a node for every cell of the grid...
nodeList = []
for i in range(numberOfNodes):
    nodeList.append([])
    for j in range(numberOfNodes):
        nodeList[i].append(GraphNode(i, j, myList[i][j]))

# ...and wire up each node's neighbours.
for i in range(numberOfNodes):
    for j in range(numberOfNodes):
        nodeList[i][j].updateAdjacencyList(nodeList)

dijkstra(nodeList, nodeList[0][0])

print "answer is : ", nodeList[numberOfNodes - 1][numberOfNodes - 1].shortestDistance
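
The linear-scan extractMin above makes the search O(V^2); Python's heapq module gives the usual O(E log V) version. A minimal sketch, assuming the 80x80 matrix has already been parsed into the list of int lists built above (the function name grid_dijkstra is introduced here for illustration):

import heapq

def grid_dijkstra(grid):
    # Shortest path sum from the top-left to the bottom-right
    # cell, moving up, down, left, or right.
    n = len(grid)
    INF = float('inf')
    dist = [[INF] * n for _ in range(n)]
    dist[0][0] = grid[0][0]
    heap = [(grid[0][0], 0, 0)]   # (distance so far, row, col)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue   # stale heap entry, already improved
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and d + grid[nr][nc] < dist[nr][nc]:
                dist[nr][nc] = d + grid[nr][nc]
                heapq.heappush(heap, (dist[nr][nc], nr, nc))
    return dist[n - 1][n - 1]

# grid_dijkstra(myList) returns the same minimal path sum.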

Tuesday 23 December 2014

WEB ANALYTICS

WEB ANALYTICS:
Web analytics is the measurement, collection, analysis and reporting of web data for purposes of understanding and optimizing web usage.
Web analytics is a tool for:
1.    Measuring web traffic
2.    Business research
3.    Market research

Web analytics is not just a tool for measuring web traffic; it can also be used for business and market research, and to assess and improve the effectiveness of a website. Web analytics applications can also help companies measure the results of traditional print or broadcast advertising campaigns. For example, it helps one estimate how the traffic to a website changes after the launch of a new advertising campaign. Web analytics provides information about the number of visitors to a website and the number of page views, which helps gauge traffic and popularity trends and is useful for market research.
There are two categories of web analytics:
1.    Off-site web analytics
2.    On-site web analytics

Off-site web analytics refers to web measurement and analysis regardless of whether you own or maintain a website. It includes the measurement of a website's potential audience (opportunity), share of voice (visibility), and buzz (comments) that is happening on the Internet as a whole.

On-site web analytics measure a visitor's behavior once on your website. This includes its drivers and conversions; for example, the degree to which different landing pages are associated with online purchases. On-site web analytics measures the performance of your website in a commercial context. 

MULTI-MEDIA DATA MINING

Multimedia Data Mining
Multimedia data mining refers to the analysis of large amounts of multimedia information in order to find patterns or statistical relationships. Once data is collected, computer programs are used to analyze it and look for meaningful connections.
Applications of Multimedia Data Mining
1.    Information obtained by multimedia data mining is often used by governments to improve social systems.
2.    It can also be used in marketing to discover consumer habits.
3.    Information obtained by multimedia data mining can be used to discover previously unknown patterns and trends related to markets and business.
Multimedia data mining requires the collection of huge amounts of data. The sample size is important when analyzing data because predicted trends and patterns are more likely to be inaccurate with a smaller sample. This data can be collected from a number of different media, including videos, sound files, and images. Some experts also consider spatial data and text to be multimedia. Information from one or more of these media is the focus of data collection.
Whereas an analysis of numerical data can be straightforward, multimedia data analysis requires sophisticated computer programs which can turn it into useful numerical data. There are a number of computer programs available that make sense of the information gathered from multimedia data mining. These computer programs are used to search for relationships that may not be apparent or logically obvious.

SEARCH ENGINES

SEARCH ENGINES:
  • A web search engine is a software system that is designed to search for information on the World Wide Web.
  • The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files.
  • Some search engines also mine data available in databases or open directories.
  • Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.
A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. This process is called web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data.

There are differences in the ways various search engines work, but they all perform three basic tasks:
  • They search the Internet -- or select pieces of the Internet -- based on important words.
  • They keep an index of the words they find, and where they find them.
  • They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day.
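
The index-and-lookup part of those three tasks fits in a few lines. A toy sketch (the pages dictionary stands in for crawled content; the URLs are made up):

from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
pages = {
    "http://example.com/a": "data mining and web data",
    "http://example.com/b": "web search engines index pages",
}

# Inverted index: word -> set of URLs containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    # Pages containing every word of the query.
    postings = [index[w] for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("web data"))   # only http://example.com/a has both words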

KNOWLEDGE MANAGEMENT SYSTEM

KNOWLEDGE MANAGEMENT SYSTEM:
Knowledge management (KM) is the process of capturing, developing, sharing, and effectively using organizational knowledge. It refers to a multi-disciplined approach to achieving organizational objectives by making the best use of knowledge.
Knowledge management systems refer to any kind of IT system that stores and retrieves knowledge, improves collaboration, locates knowledge sources, mines repositories for hidden knowledge, captures and uses knowledge, or in some other way enhances the KM process.

Knowledge is a familiarity, awareness or understanding of someone or something, such as facts, information, descriptions, or skills, which is acquired through experience or education by perceiving, discovering, or learning. Knowledge can refer to a theoretical or practical understanding of a subject.
1.    Knowledge Management System (KM System) refers to a (generally IT based) system for managing knowledge in organizations for supporting creation, capture, storage and dissemination of information. It can comprise a part (neither necessary nor sufficient) of a Knowledge Management initiative.
2.    The idea of a KM system is to enable employees to have ready access to the organization's documented base of facts, sources of information, and solutions.
3.    KM systems deal with information, so they are a class of information system and may build on or utilize other information sources.

Purpose of a KMS
  • Improved performance
  • Competitive advantage
  • Innovation
  • Sharing of knowledge
  • Integration
  • Continuous improvement by:
    • Driving strategy
    • Starting new lines of business
    • Solving problems faster
    • Developing professional skills
    • Recruiting and retaining talent
Activities in Knowledge Management
  • Start with the business problem and the business value to be delivered first.
  • Identify what kind of strategy to pursue to deliver this value and address the KM problem.
  • Think about the system required from a people and process point of view.
  • Finally, think about what kind of technical infrastructure is required to support the people and processes.
  • Implement system and processes with appropriate change management and iterative staged release.
Levels of Knowledge Management
For an organizational Knowledge Management (KM) strategy to be most effective, it should hit various levels:
  • Personal or individual
  • Department, project or team
  • Organization-wide
  • Inter-organization

DATA MARTS

Data-Marts
A data mart is the access layer of the data warehouse environment, used to get data out to the users. The data mart is a subset of the data warehouse, usually oriented to a specific business line or team. Data marts are small slices of the data warehouse.

A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data-subjects, that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.

Reasons for creating a data mart
1.    Easy access to frequently needed data
2.    Creates a collective view for a group of users
3.    Improves end-user response time
4.    Ease of creation
5.    Lower cost than implementing a full data warehouse
6.    Potential users are more clearly defined than in a full data warehouse

TEXT MINING

Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning.
Text mining is the analysis of data contained in natural language text. Text mining works by transposing words and phrases in unstructured data into numerical values which can then be linked with structured data in a database and analyzed with traditional data mining techniques.
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text Mining involves:
1.    Structuring the input text
2.    Deriving patterns in the input text
3.    Evaluation and interpretation of the output.
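
The structuring step can be as simple as turning each document into a word-frequency vector (a bag of words). A minimal sketch using only the standard library; the two sample documents are made up for illustration:

from collections import Counter

documents = [
    "text mining derives patterns from text",
    "data mining finds patterns in data",
]

# Structure each document as a word -> count mapping.
vectors = [Counter(doc.lower().split()) for doc in documents]

# The counts are ordinary numbers, so standard data mining
# techniques apply, e.g. finding terms shared by two documents.
shared = set(vectors[0]) & set(vectors[1])
print(sorted(shared))   # ['mining', 'patterns']
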
Steps in the Text Analysis Process
1.    Information retrieval or identification of a corpus is a preparatory step
2.    Named entity recognition
3.    Recognition of pattern-identified entities
4.    Co-reference
5.    Relationship, fact, and event extraction
6.    Sentiment analysis
7.    Quantitative text analysis, a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns.

NOISE AND DATA REDUNDANCY

Redundancy
Redundancy in information theory is the number of bits used to transmit or store a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit or store certain data. Data compression is a way to reduce or eliminate unwanted redundancy, while checksums are a way of adding desired redundancy for purposes of error detection when communicating over a noisy channel of limited capacity.
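
The definition turns into a small calculation: count the bits actually used and subtract an estimate of the information content, here the Shannon entropy of the character distribution. A rough sketch of the idea, with the fixed 8-bits-per-character encoding as an assumption:

from collections import Counter
from math import log

def redundancy_bits(message):
    # Bits used: fixed-width 8 bits per character.
    used = 8 * len(message)
    # Bits of information: per-character entropy times length.
    counts = Counter(message)
    n = float(len(message))
    entropy = -sum((c / n) * log(c / n, 2) for c in counts.values())
    return used - entropy * len(message)

print(redundancy_bits("aaaaaaab"))   # repetitive text is mostly redundancy
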
Data redundancy occurs in database systems which have a field that is repeated in two or more tables. For instance, when customer data is duplicated and attached to each product bought, redundancy is a known source of inconsistency, since a customer might appear with different values for a given attribute.

Definition - What does Data Redundancy mean?
Data redundancy is a condition created within a database or data storage technology in which the same piece of data is held in two separate places.
This can mean two different fields within a single database, or two different spots in multiple software environments or platforms. Whenever data is repeated, this basically constitutes data redundancy. This can occur by accident, but is also done deliberately for backup and recovery purposes.
Within the general definition of data redundancy, there are different classifications based on what is considered appropriate in database management, and what is considered excessive or wasteful. Wasteful data redundancy generally occurs when a given piece of data does not have to be repeated, but ends up being duplicated due to inefficient coding or process complexity.
A positive type of data redundancy works to safeguard data and promote consistency. Many developers consider it acceptable for data to be stored in multiple places. The key is to have a central, master field or space for this data, so that there is a way to update all of the places where data is redundant through one central access point. Otherwise, data redundancy can lead to big problems with data inconsistency, where one update does not automatically update another field. As a result, pieces of data that are supposed to be identical end up having different values.
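
The central-access-point idea is easy to picture in code: keep one master record per customer and let every order reference it by key, so one update propagates everywhere. A toy sketch (the names and fields are illustrative, not from any particular system):

# One master copy of each customer's data...
customers = {1: {"name": "Alice", "city": "Pune"}}

# ...and orders that reference the customer by key instead of
# duplicating the name and city in every row.
orders = [
    {"order_id": 101, "customer_id": 1, "product": "book"},
    {"order_id": 102, "customer_id": 1, "product": "pen"},
]

# Updating the master record is seen by every order at once.
customers[1]["city"] = "Mumbai"
for order in orders:
    print(customers[order["customer_id"]]["city"])   # Mumbai, both times
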
Noisy Data
Noisy data is meaningless data. The term has often been used as a synonym for corrupt data. However, its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines, such as unstructured text. Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy.
Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis. Statistical analysis can use information gleaned from historical data to weed out noisy data and facilitate data mining.

Noisy data can be caused by hardware failures, programming errors and gibberish input from speech or optical character recognition (OCR) programs. Spelling errors, industry abbreviations and slang can also impede machine reading.