  • Customer Analysis for Software XploRe

    From Data Mining to Marketing Strategy

    Diploma thesis

    submitted for the academic degree of

    Master of Science

    at the Faculty of Business and Economics (Wirtschaftswissenschaftliche Fakultät)

    of Humboldt-Universität zu Berlin

    Submitted by

    Jianqiu Wang

    on 27 May 2003

    Matriculation no.: 161426

    Examiner: Prof. Dr. Wolfgang Härdle

  • Contents

    Abstract

    Introduction

    1. Customer analysis
    1.1 Customer Behaviour
    1.1.1 Customer's Black Box
    1.1.2 Consumer buying process
    1.1.3 Customer behaviour model
    1.1.4 Factors influencing customer buying behaviour
    1.2 Market Segmentation and Profiling
    1.2.1 Market segmentation
    1.2.2 Customer profiling
    1.3 Market Targeting and Positioning
    1.3.1 Market Targeting
    1.3.2 Positioning

    2. Data Mining
    2.1 The Process of Data Mining
    2.1.1 Data Collection and Selection
    2.1.2 Data Preparation
    2.1.3 Mining
    2.1.4 Result Interpretation
    2.2 The Aspects of Data Mining
    2.2.1 Applications
    2.2.2 Operations
    2.2.3 Data Mining Techniques

    3. XploRe user and customer analysis
    3.1 About XploRe
    3.2 XploRe user (2002) and customer descriptive analysis
    3.2.1 Data collection
    3.2.2 Data cleaning and preparation
    3.2.3 Data descriptive analysis and result
    3.2.4 Comparing the user and customer of XploRe
    3.2.5 Measures of Improvement
    3.3 Cluster analysis for XploRe user data 2002
    3.3.1 Cluster analysis of categorical data
    3.3.2 Clustering with IBM Intelligent Miner
    3.3.3 Cluster analysis with XploRe
    3.3.4 Comparison of Cluster Analysis Results: IBM Intelligent Miner versus XploRe
    3.4 Analysis of the latest User data (2003)
    3.4.1 Results of analysis of 2003 data
    3.4.2 Comparison of historical user data
    3.5 Complementary analysis
    3.5.1 Analysis of regrouped data
    3.5.2 Analysis of highly profitable sector

    4. Suggested marketing strategy for XploRe
    4.1 Marketing Strategy and Marketing Mix
    4.1.1 Marketing strategy
    4.1.2 Marketing Mix
    4.2 Develop the marketing strategy for XploRe
    4.2.1 Niche market strategy
    4.2.2 Target Market
    4.2.3 Product position of XploRe
    4.2.4 General XploRe marketing strategy pyramids
    4.2.5 General Marketing Mix
    4.2.6 Special marketing mix for clusters
    4.2.7 Marketing research - suggestions for further analysis

    References

    Appendix
    Appendix 1: User 220702 Frequency Analysis
    Appendix 2: Customer Frequency Analysis (Nov. 05)
    Appendix 3: Customer Registration Form
    Appendix 4: Characteristics of User 220702 Clusters by XploRe
    Appendix 5: User 130303 Frequency Analysis
    Appendix 6: User 13032003 Intelligent Miner Cluster Analysis
    Appendix 7: Comparison of User and Regrouped User Data
    Appendix 8: User 130303 (Regrouped) Frequency Analysis
    Appendix 9: Regrouped User Intelligent Miner Cluster Analysis
    Appendix 10: Institute Users Frequency Analysis

    Declaration of authorship

  • List of Figures

    1.1 The customer's black box
    1.2 A sequential model of the buying process
    1.3 Consumer behaviour model
    1.4 Factors influencing consumer behaviour
    1.5 The process of market segmentation
    1.6 Alternative consumer demand categories
    1.7 SAGACITY
    1.8 Targeting strategies
    3.1 Sample of online survey questionnaire
    3.2 Clustering of users 2002
    3.3 Clustering of users 2003
    3.4 Software used in 2000 and 2003
    3.5 Information resources in 2000 and 2003
    3.6 Clustering of regrouped user data
    4.1 4P of marketing mix

  • List of Tables

    1.1 Broad-based ACORN classifications
    1.2 National readership survey socio-economic groups
    2.1 The aspects of data mining
    3.1 Summary and description of the variables of User 22/07/02 data
    3.2 Summary and description of the variables for customer data
    3.3 Comparison of XploRe's users and customers
    3.4 Characteristics of User IBM Intelligent Miner clusters (2002)
    3.5 Comparison of clustering results with IBM Intelligent Miner and XploRe
    3.6 Summary and description of the variables for User data 2003
    3.7 Comparison of User 220702 and User 130303
    3.8 Comparison of software used in 2000 and 2003
    3.9 Comparison of information resources in 2000 and 2003
    3.10 Comparison of country in 2000 and 2003
    3.11 Comparison of continent in 2000 and 2003
    3.12 Comparison of User clusters of 2000 and 2003
    3.13 Summary and description of the variables of regrouped User data 2003
    3.14 Comparison of Institute user and General user

  • Abstract

    This thesis presents a case study of customer analysis whose purpose is to
    develop a marketing strategy for the statistical software XploRe. The
    customers analysed include the users, who downloaded the free trial version
    of XploRe from the web site, and the actual customers, who bought XploRe.
    Descriptive analysis was conducted for both data sets and led to the
    conclusion that research institutes are the highly profitable sector for
    XploRe. For the user data, the data mining method of clustering was applied
    to identify the customer segments. Two different clustering methods were
    tested on the same user data set with different software, IBM Intelligent
    Miner and XploRe. As a result, the users of XploRe were divided into four
    clusters by both methods: Internet surfer, Academia, Linux user and Home
    worker. Through the comparison of the historical user data of 2003 and 2000,
    more facts and trends of the XploRe market and its customers were discovered
    regarding the software used, information resources, new markets and the
    ongoing changes in customer segments. Based on the results of the customer
    analysis, suggestions for marketing strategy, marketing mix and further
    analysis are outlined.

    Key words: customer analysis, market segmentation, data mining, clustering,
    marketing strategy, marketing mix


  • Introduction

    Customer analysis is a crucial step in the development of a marketing
    strategy. Only when a company has a clear view of its customers can the
    proper strategy and actions be undertaken to gain competitive advantage in
    the market.

    Today, with the development of digital data management systems, the
    capability for gathering, storing and accessing information has improved
    dramatically. This trend creates difficulties for companies when they
    confront the huge amount of data. Data mining is an important technology for
    companies that need to conduct customer analysis on large data sets. It
    discovers valuable information which is useful for marketing.

    The research presented in this paper tries to segment the customers and to
    find the trends and facts of the XploRe market, so that suggestions for a
    marketing strategy can be derived from the results. XploRe is a statistical
    software package which aims at sophisticated users who are looking for a
    flexible, programmable statistics package with an emphasis on more advanced
    procedures.[1] It is important for the XploRe marketer to understand its
    customers and market. The customer data studied here include the data of
    XploRe users (the potential customers) and actual customers (the buyers).
    The user data was collected through an online questionnaire preceding the
    download of the XploRe trial version, while the customer data was gathered
    through the returned registration forms. For the purpose of comparison, two
    sets of user data were analysed and two clustering methods were tested with
    two software packages, IBM Intelligent Miner and XploRe. The user data 2002
    covers October 11, 2001 to July 22, 2002 and contains 1734 profiles. The raw
    user data 2003 contains 2593 profiles and was collected from October 11,
    2002 to March 13, 2003. The customer data includes 32 profiles from July 1,
    2000 to August 30, 2002.

    Only descriptive analysis was undertaken for the customer data because of
    its small number of records. For the user data, the data mining process of
    clustering was conducted to segment the market. The mining run for the user
    data consists of several steps: cleaning the raw data with MS Excel,
    transferring the data to IBM Intelligent Miner or XploRe, and performing the
    cluster analysis. The clustering identified four groups of XploRe customers,
    namely Internet surfer, Academia, Linux user and Home worker. Each cluster
    has its own distinguishing features.

    [1] Härdle, Klinke and Müller, 1999, p. 17.

    The comparison of the customers and the users of 2002 led to the discovery
    of the highly profitable sector, research institutes. XploRe and IBM
    Intelligent Miner (IM) delivered similar clustering results for the user
    data, but IM performed better in visualisation and computational efficiency.
    Comparing the historical results of user data 2003 and user data 2000, some
    trends were identified. More professional users switched to command-driven
    software. XploRe made progress in communication channels. Asia, especially
    Japan, emerged as a new market. As for the segments, Internet surfer is a
    brand-new group in 2003, which indicates the arrival of the Internet age.
    The appearance of Home worker in 2003 instead of Researcher in 2002 hints at
    a problem in the survey questionnaire. More members of Academia use
    non-personal channels to get information. This again confirms the
    improvement made by XploRe in communication channels. Linux users were very
    stable during the period.

    Based on the findings of the analysis, some suggestions for marketing
    strategy and further analysis are made for the XploRe marketer.

    This paper consists of four main parts. The first two sections following the
    introduction lay the theoretical foundation for customer analysis and data
    mining. Section three presents the analysis and results. Marketing strategy
    and suggestions are developed in the fourth section. At the end, the summary
    gives a brief overview of the whole paper.

  • 1. Customer analysis

    In the current marketplace, competition is intense. The market abounds with
    all kinds of products. To win customers' decisions in favour of their
    products, companies should gain deep insight into what customers really need
    and how to influence their purchasing decisions. Therefore, companies should
    have a customer focus, conducting business with an emphasis on understanding
    the customers and the market.

    Customer analysis is the study of customers and their behaviour, which is
    central to achieving a customer focus.[2] The purpose of conducting customer
    analysis is to achieve marketing goals such as the following:[3]

    Customer acquisition: finding new customers
    Customer cross-sell: further sales of different products to the same customer
    Customer up-sell: the customer makes greater use of the same product or service
    Customer retention: keeping the customer loyal

    1.1 Customer Behaviour

    In order to understand customer buying behaviour, we should first understand
    customer behaviour.

    1.1.1 Customer's Black Box

    Customer behaviour here means the behaviour of individuals who purchase for
    private or household consumption. These customers buy goods which are not
    part of a value chain, and the purpose of purchasing is not to generate
    profit. Buying behaviour depends on the individual's reaction to internal
    and external stimuli; therefore, it is difficult to predict. The black box
    is the term used for the customer's purchasing decision process, which is
    difficult to observe but crucial in determining the purchase.

    In order to develop appropriate products that are attractive to customers,
    firms need insight into what happens in the black box. Figure 1.1 presents
    the customer's black box. Inside the black box, the customer gathers
    information, evaluates and compares, and then comes to a decision; this is
    called the consumer buying process.

    [2] WWW14
    [3] Heygate, Richard, 1998.

    [Figure 1.1, labels recovered from the diagram. Black box: identification of
    needs; evaluation of offers that satisfy the need; comparison of substitute
    products and brands; purchase; post-purchase evaluation. Aspirations,
    motivation, education, personality, beliefs. External stimuli: social
    pressure, legal requirements, physical factors, economic cycle. Consumer.
    Marketer, 7Ps: people, place, promotion, product, price, process, physical
    environment.]

    Fig. 1.1: The customer's black box.

    1.1.2 Consumer buying process

    Buying decision process

    The buying process starts with the customer's desire for a product. This
    want might be the result of internal stimuli, like hunger and thirst, or of
    external stimuli, such as advertisements.

    The next step is the search for information. Consumers may collect
    information consciously or unconsciously from various sources. There are
    four kinds of information sources:

    1. Personal sources such as family, friends, colleagues and neighbours;
    2. Public sources such as the mass media and consumer organisations;
    3. Commercial sources such as advertising, sales staff and brochures;
    4. Experiential sources such as handling or trying the product.

    Through information gathering, customers become aware of the various
    products and brands in the market; they then evaluate the alternatives and
    finally make the purchase decision.

    [Figure 1.2, stages shown in sequence: recognition of the problem; the
    search for information; evaluation of the alternatives; the purchase
    decision; post-purchase behaviour.]

    Fig. 1.2: A sequential model of the buying process

    [3] Bannes, E., McClelland, B., etc., 1997, p. 139.

    After purchasing major items, many people experience cognitive dissonance,
    also called post-purchase anxiety. They wonder whether they have made the
    correct purchasing decision. To reduce this anxiety, they look for
    confirmation. For example, they might ask friends to confirm that their
    purchase was the right choice.

    Figure 1.2 summarises the stages of the consumer buying process: recognition
    of the problem, the search for information, evaluation of the alternatives,
    the purchase decision and post-purchase behaviour.

    Companies should be present at each stage of the buying process and try to
    stand out among the products and brands of competitors. For a brand or
    product to be the customer's final choice, companies need a clear
    understanding of the evaluative criteria used by consumers in comparing
    products (see Section 1.1.3).

    [3] Wilson, R. W. S. and Gilligan, C., p. 170.

    Five buying roles

    The purchase process normally involves several persons, each with a distinct
    role. The roles need not be played by different persons; one person can play
    several roles in a purchasing process.

    The five roles in a purchasing process are:

    The initiator: the person who suggests buying the product or service.
    The influencer: the person whose comments can affect the purchasing decision.
    The decider: the person who decides whether to buy and which product to buy.
    The buyer: the person who executes the purchase.
    The user: the final consumer of the product or service.

    For example, a mother buys ice cream for her child. The child is the user;
    the mother is the decider and buyer. The company should understand the
    function of each role in the buying process in order to influence customers'
    buying decisions effectively through proper action.

    1.1.3 Customer behaviour model

    The customer behaviour model describes the procedure and the basic elements
    of what happens inside the customer's black box, that is, the consumer
    buying process.

    The most basic, simplest and best-known model of buyer behaviour is AIDA,
    which stands for Awareness, Interest, Desire and Action.[4]

    The model introduced here is composed of six interrelated components.[5]

    1. Information or facts: the percept caused by a stimulus.
    2. Product recognition: the extent to which the buyer knows enough about the
       product to distinguish it from other products.
    3. Attitude towards the product: what the customer expects from the product
       to satisfy their particular needs.
    4. Confidence in judging the product: the customer's degree of certainty
       that his or her evaluative judgement of a product is correct.
    5. Intention to buy: the mental state that reflects the customer's plan to
       buy some specific number of products of a particular brand in some
       specified time period.
    6. Purchase: caused by the intention to buy. It is defined as the point at
       which the customer has paid for a product or has made some financial
       commitment to buy some specified amount during some specified time period.

    [Figure 1.3: diagram linking the components F, R, A, C, I and P.]

    Fig. 1.3: Consumer behaviour model.

    F: Information, R: Product recognition, C: Confidence, A: Attitude,
    I: Intention, P: Purchase

    [4] Baker, M. and Hart, S., 1999, p. 63.
    [5] Howard, J. A., 1994, pp. 31-56.

    When consumers evaluate a product, they also employ certain evaluative
    criteria, which have several aspects:

    1. The product's attributes, such as its price, performance, quality and
       styling.
    2. Their relative importance to the consumer.
    3. The consumer's perception of each brand's image.
    4. The consumer's utility function for each of the attributes.

    These evaluative criteria overlap with the elements of the consumer
    behaviour model. For instance, product recognition, attitude towards the
    product and confidence in judgement are the three parts of the buyer's image
    of a product. They all have a vital impact on the consumer's buying
    decision.
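
    The four aspects of the evaluative criteria can be illustrated with a small
    sketch. It assumes, purely for illustration, that the consumer combines the
    criteria additively: each perceived attribute level is passed through a
    utility function and weighted by its relative importance. The additive rule,
    the function name brand_score, the weights and the brand ratings are all
    hypothetical; the text does not specify how consumers combine the criteria.

```python
# Illustrative sketch only: an additive multi-attribute score.
# The additive rule, attribute weights and brand ratings are assumptions.

def brand_score(perceived, importance, utility):
    """Weighted sum of per-attribute utilities for one brand.

    perceived  -- the consumer's perception of the brand: attribute -> rating
    importance -- relative importance of each attribute: attribute -> weight
    utility    -- attribute -> function mapping a rating to a utility value
    """
    total_weight = sum(importance.values())
    return sum(
        importance[a] / total_weight * utility[a](perceived[a])
        for a in importance
    )


# Hypothetical consumer: ratings on a 0-10 scale, utilities on a 0-1 scale.
importance = {"price": 4, "performance": 3, "quality": 2, "styling": 1}
utility = {
    "price": lambda r: 1 - r / 10,      # a higher price rating lowers utility
    "performance": lambda r: r / 10,
    "quality": lambda r: r / 10,
    "styling": lambda r: r / 10,
}
brands = {
    "Brand A": {"price": 8, "performance": 9, "quality": 8, "styling": 6},
    "Brand B": {"price": 4, "performance": 6, "quality": 7, "styling": 5},
}

scores = {name: brand_score(p, importance, utility) for name, p in brands.items()}
preferred = max(scores, key=scores.get)
```

    With these made-up numbers the cheaper Brand B scores slightly higher than
    Brand A, because price carries the largest weight for this consumer.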

    [Figure 1.4, factor groups surrounding the buyer. Cultural: culture,
    sub-culture, social class. Environmental: economic cycle, social pressure,
    legal requirement, new technology. Social: reference groups, family, roles
    and status. Psychological: motivation, learning, perception, beliefs and
    attitudes. Personal: age and life cycle stage, occupation, economic
    circumstance, lifestyle and personality.]

    Fig. 1.4: Factors influencing consumer behaviour.

    1.1.4 Factors influencing customer buying behaviour

    Various factors influence customer buying behaviour. Generally we can put
    them into five categories: psychological factors, cultural factors, social
    factors, personal factors and environmental factors.[6][7][8]

    1. Psychological factors

    Human needs include basic needs, like shelter, food and drink, and
    higher-level needs, such as friendship and achievement. People purchase
    goods to satisfy their needs. Purchasing behaviour can be considered the
    result of internal and external stimuli.

    Maslow (1943) suggested that behaviour can be explained by a hierarchy of
    needs. He grouped people's needs into five levels and argued that when a
    person has satisfied one level of needs, he will strive for the next level.
    Maslow's five levels of needs are physiological needs, safety needs, social
    needs, esteem needs and self-actualisation needs.[9]

    Physiological needs are the basic needs that human beings must meet to
    survive, such as food and drink. Only after these needs are satisfied will
    the other levels of needs be desired.

    [6] WWW11
    [7] Bannes, E., etc., 1997, pp. 139-149.
    [8] Environmental factors are external factors, while the other four factor
        categories are internal factors that influence consumer buying behaviour.
    [9] Bannes, E., McClelland, B., etc., 1997, pp. 139-184.

    Safety needs refer to people's needs for security, stability and
    predictability. Services such as insurance and guarantees are products that
    satisfy safety needs.

    Social needs express the human desire for love and a sense of belonging. At
    this level, people seek to join associations and clubs.

    Esteem needs show themselves in the search for status, esteem, achievement
    and recognition. To satisfy this level of needs, people turn to luxury
    products, like perfumes, high-tech products and cars.

    Self-actualisation is the highest level of needs. Only after people have
    satisfied all the other levels of needs will they turn to the realisation of
    their potential, which is expressed in concern for external issues, like
    volunteer work.

    2. Personal factors

    Personal factors are the set of the buyer's personal characteristics,
    including age, occupation, lifestyle, personality and economic
    circumstances.

    3. Cultural factors

    Cultural factors include culture, sub-culture and social class.

    Culture is a set of shared values which shape people's behaviour. Language
    is the best example of cultural difference: using a language incorrectly
    causes misunderstanding. There are also differences in attitude between
    eastern and western cultures towards the family and the individual.

    A large society or culture is normally divided into subculture groups, which
    define more subtle behavioural norms. Subculture groups include ethnic,
    religious, racial and geographical groups. They differ in cultural
    preferences, ethnic tastes, attitudes, lifestyles and taboos.

    Social class is also called socio-economic group. It is determined by income
    level, education and occupation. The often-used social class model divides
    society into upper class, upper middle class, lower class, upper working
    class, working class and others.

    4. Social factors

    Social factors include reference groups, family, social roles and status.

    Reference groups are defined as all groups that have a direct (face-to-face)
    or indirect influence on a person's attitude or behaviour.[10] Reference
    groups can be divided into four types.

    1. Primary membership groups are generally informal, and their members
       interact with one another, such as family, neighbours, colleagues and
       friends.
    2. Secondary membership groups are more formal than primary membership
       groups, and their members interact less. These include religious groups,
       professional groups and trade unions.
    3. Aspirational groups are groups that one would like to belong to.
    4. Dissociative groups are groups whose values and behaviour the individual
       rejects.

    5. Environmental factors

    Environmental factors consist of economic, social, political and
    technological aspects. The economic cycle, social pressure, legal
    requirements and new technology all influence consumers' decisions on which
    product to buy and how to buy it.

    1.2 Market Segmentation and Profiling

    When firms try to sell their products in consumer markets, they should not
    only try to identify the factors that influence the customer's black box,
    but also estimate whether there are enough customers who need their offer.
    It is important for companies to compare their capabilities with the
    objectives of customers, so that they can decide whether they are able to
    serve the market profitably with appropriate products. Therefore, firms must
    identify the market need, segment the total market into potential customer
    groups that are likely and able to purchase the offer, and position the
    product or service as an attractive alternative to other offers for the
    target groups.

    [10] Wilson, Gilligan and Person, 1994, p. 160.

    1.2.1 Market segmentation

    Market segmentation is the subdivision of a market into distinct subsets of
    customers, where any subset may conceivably be selected as a target market
    to be reached with a distinct marketing mix.[11]

    Market segmentation is inspired by Kotler's target marketing. As Kotler put
    it, in target marketing the seller distinguishes the major market segments,
    targets one or more of these segments, and develops products and services
    tailored to each selected segment.[12]

    Because individuals differ in preferences, characteristics, tastes and
    interests, their buying behaviour patterns are varied and heterogeneous, and
    it is almost impossible or unprofitable for a company or a single product to
    serve all needs. Furthermore, communicating the marketing mix to a
    non-homogeneous group is inefficient. Therefore, companies search for groups
    with attractive attributes and then concentrate on them, developing specific
    products and services and using specific marketing resources to gain the
    maximal market return.

    Segmentation identifies the subsets of buyers who share similar needs and
    demonstrate similar buying behaviour. It subdivides a heterogeneous total
    customer market into smaller, manageable and homogeneous clusters according
    to chosen criteria. Within each cluster there are similar patterns of
    buyers' needs and buying behaviour, which are identifiable and relevant to
    the buying decision.
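
    As a rough illustration of subdividing a heterogeneous market by criteria,
    the sketch below groups customer records by the values of chosen
    segmentation variables. The records, the field names (occupation, os) and
    the helper segment_by are hypothetical and are not taken from the XploRe
    data; real segmentation bases would be chosen as discussed in the rest of
    this section.

```python
from collections import defaultdict


def segment_by(customers, criteria):
    """Group customer records by the values of the chosen criteria.

    customers -- list of dicts, one per customer
    criteria  -- tuple of field names used as the segmentation base
    Returns a dict mapping each combination of values to its customers.
    """
    segments = defaultdict(list)
    for customer in customers:
        key = tuple(customer[field] for field in criteria)
        segments[key].append(customer)
    return dict(segments)


# Hypothetical records, for illustration only.
customers = [
    {"id": 1, "occupation": "student", "os": "Windows"},
    {"id": 2, "occupation": "researcher", "os": "Linux"},
    {"id": 3, "occupation": "student", "os": "Windows"},
    {"id": 4, "occupation": "researcher", "os": "Windows"},
]

segments = segment_by(customers, ("occupation", "os"))
sizes = {key: len(members) for key, members in segments.items()}
```

    Each key is one candidate segment, and its size gives a first indication of
    whether the segment is large enough to be worth serving.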

    Customer segmentation brings major benefits to companies:[13]

    Efficiency: Because the customers are subdivided, companies can focus on the
    markets of interest only. Therefore, they can allocate and use their
    resources more efficiently.

    Effectiveness: Through segmentation, the needs of each customer segment can
    be better identified and examined. Thus, the understanding and awareness of
    customer needs are enhanced. Companies can tailor their products and
    marketing measures to meet customer needs more effectively. Owing to the
    improved marketing effectiveness, the customer response rate increases, and
    with it the return and profit from marketing investment.

    New market: Segmentation can help companies to identify new market
    opportunities. The needs and characteristics of the total market are so
    diverse that the unique features of a small group are not distinguishable.
    After segmentation, a company can discover markets with unique features,
    which offer valuable opportunities to enter new markets.

    [Figure 1.5, steps shown in sequence: defining the market; selecting the
    base for segmentation; dividing the market and profiling.]

    Fig. 1.5: The process of market segmentation.

    [11] Kotler, 1995, p. 286.
    [12] Kotler, 1991, p. 262.
    [13] WWW29.

    The process of market segmentation14

    The process of market segmentation is composed of three steps.

    1. Defining the market

    The total market for a product or service comprise oses all of the consumers who

    14Bannes, E., McClelland, B., and Meyer, R, 1997, P181-185.

  • 1. Customer analysis 15

    [Figure panels:

    Homogeneous demand: Consumers have relatively similar needs or desires for a product or service category.

    Diffused demand: Consumers' needs and desires are so diverse that no clear clusters (segments) can be identified.

    Clustered demand: Consumers' needs and desires can be grouped into two or more identifiable clusters (segments), each with its own set of purchase criteria.]

    Fig. 1.6: Alternative consumer demand categories.

    desire or potentially desire it, and are willing and able to buy it. It is necessary

    to analyse the market in terms of its size and pattern of demand.

    There are three patterns of demand categories: 15

    1. Homogeneous demand

    All consumers in a market have similar needs and wants.

    2. Diffused demand

    Consumers' needs are diverse and no clear segments can be identified. This

    suggests the need for customisation.

    3. Clustered demand

    Consumers' needs and desires can be grouped into several identifiable

    segments. Each has its own set of purchase criteria.

    2. Selecting the approach and bases for segmentation

    Identification of market segmentation could be conducted based on detailed mar-

    ket research, or on basic analysis of customer data held within a company. Many

    companies keep customer records detailing information such as age and gender.

    15Bannes, E., McClelland, B, etc. , P181-183.


    There are generally two types of methods for market segmentation.16 17

    1. A Priori methods:

    In an a priori approach, the basis for segmentation is set in advance, so primary

    market research is not necessary. Instead, the analysis of secondary data sources,

    the customer information at hand, managers' intuition and other methods are

    employed to set the segmentation basis for the buyers according to their usage

    patterns (heavy, medium, light and non-user), demographic characteristics (age,

    sex, income) or psychographic profiles (personality). After the basis has been set,

    research is conducted to identify the size, location and potential of each

    segment. The marketing decision then concerns the segment on which the marketing

    efforts should be concentrated. For example, classification is an a priori approach.
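    As a minimal sketch of an a priori basis, the snippet below assigns buyers to usage
    segments whose boundaries are fixed before any data is examined. The thresholds
    (`LIGHT_MAX`, `MEDIUM_MAX`), the function name and the toy customer data are
    illustrative assumptions, not values taken from the text.

```python
from collections import Counter

# Hypothetical thresholds (purchases per year), fixed in advance of any analysis.
LIGHT_MAX = 4
MEDIUM_MAX = 12

def usage_segment(purchases_per_year):
    """Map a yearly purchase count to a predefined usage segment."""
    if purchases_per_year == 0:
        return "non-user"
    if purchases_per_year <= LIGHT_MAX:
        return "light"
    if purchases_per_year <= MEDIUM_MAX:
        return "medium"
    return "heavy"

# Toy customer file: customer id -> purchases per year (made-up numbers).
customers = {"c1": 0, "c2": 3, "c3": 10, "c4": 25, "c5": 2}

segments = {cid: usage_segment(n) for cid, n in customers.items()}
segment_sizes = Counter(segments.values())  # size of each predefined segment
```

    Counting the members of each segment corresponds to the follow-up research step
    that identifies the size of each segment once the basis has been set.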

    2. Post hoc methods:

    The post hoc approach segments the market on the basis of research findings, rather

    than deciding the segmentation basis in advance. Primary market research is

    conducted to collect the classification and descriptor variables. Segments are

    defined only after all the relevant information has been collected and analysed. The

    research might highlight particular attributes, attitudes or benefits with which

    particular groups of customers are concerned. The result then becomes the basis

    for dividing the market.

    3. Dividing the market and profiling the segments

    Based on the data gathered, the market is divided into identifiable

    market segments. The information obtained gives details regarding

    the nature of the customer segments. This is called segment profiling.

    Profiling associates each segment with certain characteristics; it aggregates

    customers with similar characteristics into a group and separates them

    from those with different characteristics.

    Criteria of customer segmentation

    A market could be segmented in various ways. There are problems with segmen-

    tation, such as the relevance and quality of the data, intuition, continuous process

    16 WWW31. 17 Han, J. and Kamber, M., 2001, P281-319.

    and over-segmentation. A good segmentation should be relevant for buying be-

    haviour and satisfy the following requirements:18 19

    Size: The market should be big enough to guarantee a good segmentation. It is dangerous to over-segment an already very small market.

    Difference: Differences between the members of the segments should exist and be measurable through data collection.

    Measurability: The company is able to collect information that measures the nature of buying behaviour for the segmentation.

    Substantiality: The selected segmentation should be profitable with regard to the marketing mix resources designed especially for it.

    Accessibility: The extent to which the marketing effort can reach the segmentation.

    Stability over time: The segmentation should last for a certain period without dramatic change in its major features.

    Responsiveness to communication means: The segmentation is sensitive to the marketing mix and communication means.

    Variables for customer segmentation

    Almost all factors that affect customers' buying process and decisions can be

    used as variables for customer segmentation. Generally, the variables for

    customer segmentation can be put into five categories: demographic;

    socio-economic grade; psychographics and lifestyle; behavioural; and geographic and

    geo-demographic.20 21

    1. Demographic variables

    Demographic variables categorise the market according to the population char-

    acteristics and population profiles. Customers are subdivided into groups based

    on one or more demographic variables such as age, sex, religion, race, nationality,

    family size and stage of family life cycle. For example, the custom seller groups

    18 WWW20. 19 Wilson, R. and Gilligan, C., 1997, P275. 20 Kalakota, R. and Whinston, A. B. 21 McDonald, M. and Dunbar, I., P85-91.

    ACORN group                                   1981 population      %
    A  Agricultural areas                               1,811,485    3.4
    B  Modern family housing, higher incomes            8,667,137   16.2
    C  Older housing of intermediate status             9,420,477   17.6
    D  Older terraced housing                           2,320,846    4.3
    E  Better-off council estates                       6,976,570   13.0
    F  Less well-off council estates                    5,032,657    9.4
    G  Poorest council estates                          4,048,658    7.6
    H  Multi-racial areas                               2,086,026    3.9
    I  High-status non-family areas                     2,248,207    4.2
    J  Affluent suburban housing                        8,514,878   15.9
    K  Better-off retirement areas                      2,041,338    3.8
    U  Unclassified                                       388,632    0.7

    Tab. 1.1: Broad-based ACORN classifications 23

    customers by age. Customers aged 20-30, for example, are

    more likely to purchase trendy items.

    2. Geographic and Geo-demographics

    Geographic segmentation divides the market into different geographic units such

    as countries, regions, counties, cities and postcodes. A geographic system is

    based on the proposition that the neighbourhood area in which you live will

    be reflected in your professional status, income, life stage and behaviour. The

    neighbourhood types are initially identified using national census data.

    ACORN (A Classification of Residential Neighbourhoods) is an example of a

    geographic system. ACORN classifies consumers into 43 demographically and

    behaviourally distinct clusters. The clusters are based on the type of neighbourhood,

    socio-economic status, and buying behaviour and preferences.22 A broad-based

    ACORN classification was conducted in Great Britain in 1981. It segments

    the residents of Great Britain into 12 categories.

    3. Socio-economic Grade

    Buying behaviour is often influenced by a person's social class. The

    factors include income, status, education, etc. The National Readership Survey scale

    22 Kurs, M., Ryan, B., Lamb, G. etc., 2001. 23 Bannes, E., McClelland, etc., 1997, P201.

    Grade  Social classification   Occupation
    A      Upper middle class      Higher managerial, professional or administrative jobs
    B      Middle class            Middle managerial, professional or
    C1     Lower middle class      Supervisory or clerical jobs, junior management
    C2     Skilled working class   Skilled manual workers
    D      Working class           Unskilled and semi-skilled manual workers
    E      Subsistence level       Pensioners, unemployed, casual or low-grade workers

    Tab. 1.2: National Readership Survey socio-economic groups 24

    is one of the popular classifications and is based on the occupation of the

    main wage earner of the household.

    A further development of the socio-economic grade model is SAGACITY,

    developed by Research Services Ltd. This model combines life stages with

    income and social class.

    4. Psychographic variables

    Psychographics attempts to classify individuals by their attitudes, personality

    and life styles.

    (1)Personality

    Personality is used as a variable to segment the market. The earliest segmentation

    of this kind was conducted by Riesman et al. (1950). It identified three distinct

    types of social characterisation and behaviour: 25

    1. Tradition-directed behaviour, which changes little over time and which, as

    a result, is easy to predict and is used as a basis for segmentation.

    2. Other-directedness, in which the individual attempts to fit in and adapt to

    the behaviour of the peer group.

    3. Inner-directedness, where the individual is seemingly indifferent to the

    behaviour of others.

    (2) Attitude

    Attitude includes the customers' attitudes towards risk, degree of loyalty, the

    24 Kurs, M., Ryan, B., Lamb, G. etc., 2001. 24 Blois Keith, 2000, P389. 25 Wilson, Gilligan and Pearson, 1994, P291.

    [Figure: SAGACITY grid. Life cycle: Dependent, Pre-family, Family, Late. Income: Better off, Worse off. Occupation: White-collar, Blue-collar.]

    Fig. 1.7: SAGACITY.


    likelihood of adopting new products, etc. Many of the personality variables can

    also be used as descriptors of attitude.

    (3) Lifestyle

    Consumers' behaviour is also determined by the way we live our lives. It

    arises from a complex relationship between our aspirations, current situation,

    perception of self, income and attitudes. Lifestyle market segmentation offers a

    detailed view of buyers because it is composed of numerous characteristics related

    to their activities, interests and opinions. Lifestyle consists mainly of three

    dimensions: 26

    1. Activities: Work, hobbies, social events, vacations, entertainment, club

    membership, community, shopping, sports.

    2. Interests: Family, home, job, community, recreation, fashion, food, media,

    and achievements.

    3. Opinions: Selves, social issues, politics, business, economics, education,

    products, future, culture.

    5. Behavioural variables

    (1) Benefit sought variables

    This group of variables for segmenting customers considers the motive for a

    purchase. It groups consumers according to the specific benefits that they seek in a

    product. Even if two customers buy exactly the same product, the benefits

    they expect may vary. Benefit segmentation is therefore based on behaviour

    processes, involving thought and action, as opposed to age and socio-economic

    class, which are defined according to individual characteristics. It closely

    identifies the customers' needs and represents a powerful method of understanding and

    influencing behaviour.

    In applying this approach, a company should begin by attempting to measure

    consumers' value systems and their perceptions of various brands within a given

    product class. The information gathered is then used as the basis of market

    segmentation. Benefit segmentation begins by determining the principal

    benefits that the customers are seeking in the product, the kinds of people who look

    for each benefit and the benefits delivered by each brand. For example, in the

    toothpaste market, four segments are identified according to the benefit sought:

    economy, decay prevention, cosmetic and taste benefits.

    26 McDonald, M. and Dunbar, I., 2000, P89.

    (2) User status

    The market can be divided into five segments according to user status:

    non-users, ex-users, potential users, first-time users and regular users. First-time users

    and potential users can be further subdivided on the basis of usage rate.

    (3) Loyalty Status and Brand Enthusiasm

    Loyalty status categorises customers on the basis of the extent and depth

    of their loyalty to particular brands or products. Most typically there are four

    categories: hard-core loyals, soft-core loyals, shifting loyals and switchers.27

    1. Hard-core loyals are customers who consistently buy the same brand or

    product.

    2. Soft-core loyals are those who are willing to choose from a limited brand

    set. Their loyalty is divided among the limited brands or products.

    3. Shifting loyals are consumers who shift their loyalty from one brand

    to another. After they shift brand, they no longer buy the former brand.

    4. Switchers are those who show no loyalty to any single brand. Their

    buying pattern is typically determined either by the special offers available

    or by their search for variety.
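    The four loyalty categories can be sketched as a rule-based labelling of a
    customer's purchase history. The rules below are one possible reading of the
    definitions above, not a method given in the source: the function name, the
    `max_brand_set` parameter (how many brands still count as a "limited brand set")
    and the toy histories are all assumptions.

```python
from itertools import groupby

def loyalty_status(purchases, max_brand_set=3):
    """Classify one customer's chronological sequence of purchased brands."""
    if not purchases:
        raise ValueError("at least one purchase is required")
    brands = set(purchases)
    # Collapse consecutive repeats, e.g. A A B B -> [A, B]
    runs = [brand for brand, _ in groupby(purchases)]
    if len(brands) == 1:
        return "hard-core loyal"      # always the same brand
    if len(runs) == 2:
        return "shifting loyal"       # moved once and never went back
    if len(brands) <= max_brand_set:
        return "soft-core loyal"      # rotates within a small brand set
    return "switcher"                 # no loyalty to any brand

# Toy purchase histories (one letter per purchased brand).
histories = {
    "c1": "AAAA",
    "c2": "AAABBB",
    "c3": "ABAB",
    "c4": "ABCDE",
}
statuses = {cid: loyalty_status(seq) for cid, seq in histories.items()}
```

    With real data, the cut-off between soft-core loyals and switchers would need to
    be chosen per product category rather than fixed at three brands.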

    (4) Critical events

    Major or critical events generate needs, which can be satisfied by the

    provision of a special collection of products and/or services. Typical examples are

    marriage, the death of someone in the family, unemployment, illness, retirement

    and moving house.

    1.2.2 Customer profiling

    Customer segmentation and customer profiling are two elements of Customer

    Relationship Management (CRM). Customer profiling is performed after customer

    segmentation. Its aim is to locate clusters within the customer file

    that outperform the average.28 It creates a customer segment profile, which labels

    27 Wilson, Gilligan and Pearson, 1994, P291. 28 WWW18.

    the customers with their attributes.

    Identifying the characteristics of the customers helps the company to decide which

    segments will respond best to its marketing effort. When companies have a

    clearer overview of the attributes and demands of the customer segments,

    they can decide what action should be taken and what resources should be allocated

    to the selected customer segments. Furthermore, with pre-built models,

    customer profiling can also be used to find potential customers and to delete inactive

    or bad customers.
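    The idea of locating clusters "that outperform the average" can be illustrated
    with a small calculation. The sketch below assumes that each customer record
    carries a segment label and a flag for whether the customer responded to a
    campaign; both the record layout and the toy data are assumptions made for
    illustration.

```python
# (segment label, responded to the last campaign?) per customer; toy data.
records = [
    ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", False), ("B", True),
    ("C", False), ("C", False),
]

def response_rates(rows):
    """Return the response rate of each segment."""
    totals, hits = {}, {}
    for segment, responded in rows:
        totals[segment] = totals.get(segment, 0) + 1
        hits[segment] = hits.get(segment, 0) + int(responded)
    return {s: hits[s] / totals[s] for s in totals}

rates = response_rates(records)
# Overall response rate across the whole customer file.
average = sum(responded for _, responded in records) / len(records)
# Segments whose response rate exceeds the overall average.
outperformers = sorted(s for s, rate in rates.items() if rate > average)
```

    In this toy file only segment A responds more often than the file as a whole,
    so it would be the natural candidate for additional marketing effort.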

    The profiling attributes are similar to the segmentation attributes. For example,

    the profiling attributes include: geographic, cultural and ethnic, economic

    conditions (income and/or purchasing power), age, values, attitudes, beliefs,

    knowledge and awareness, lifestyle, media, and recruitment method. For

    acquired customers, customer behaviour variables can also be employed as

    profiling variables, such as shopping frequency, complaint frequency,

    degree of satisfaction and preferences.

    1.3 Market targeting and Positioning

    1.3.1 Market Targeting

    The next task after customer segmentation and profiling is market targeting.

    Companies choose one segment or several segments as the target market. The

    target market is the market that the company decides to serve. A specific marketing

    mix and resources are developed to serve the target market.

    Companies normally adopt one of three targeting strategies:29

    Undifferentiated strategy: The company ignores the differences between customer

    segments and regards the whole market as a single market. A single

    marketing mix is adopted for the whole market. This is so-called mass

    marketing.

    Differentiated strategy: The whole market is divided into several segments. The company develops a different marketing mix for each segment.

    28 Keith Blois, 2000, P398. 29 Armstrong, G. and Kotler, P., 2002, P255-258.

    [Figure: three panels, Undifferentiated strategy, Differentiated strategy and Concentrated strategy, each relating the organisation and its marketing mix (or marketing mixes 1 to 3) to the entire market or to segments 1 to 3.]

    Fig. 1.8: Targeting strategies.

    Concentrated strategy: The company chooses one or several market segments, but adopts only a single marketing mix. Under this strategy, the

    company tries to gain a high market share in one or several niche markets,

    instead of struggling for a small share of the whole market. For

    firms with limited resources, this strategy is very appealing.

    1.3.2 Positioning

    The purpose of target marketing is to focus on the selected target market and

    fine-tune the marketing mix to provide a group of potential customers with superior

    value, and thereby to build up a unique position for the product in the customers' view.

    A product's position is the complex set of perceptions, impressions and feelings

    that it induces in consumers, compared with competing products.30 Positioning

    refers to how customers think about proposed and/or present brands in a

    market.31 The fundamental idea of positioning is competitive advantage.32 Through

    30 Bannes, McClelland, Meyer and Wiesehofer, 1997, P230. 31 WWW33. 32 WWW30.

    the differentiated marketing mix, the special needs and demands of customers can

    be satisfied. Thus, the customers will view the product or brand as superior to

    the others and give it a distinct position. To position

    a product, the marketer must appeal strongly to the target customers with its

    strengths and differences, using a proper marketing mix.

  • 2. Data Mining

    Data mining, which is also known as Knowledge Discovery in Databases (KDD),33

    is a powerful new technology that helps companies to identify the important

    information in the sea of data. Data mining technology is commonly used

    for customer analysis.

    Fayyad defined data mining as a non-trivial process aimed at identifying valid,

    novel, potentially useful and ultimately understandable patterns in data.34

    Grameier and Rudolph, in contrast, consider data mining in terms of all methods and

    techniques that allow very large data sets to be analysed in order to extract and discover

    previously unknown structures and relations from such huge heaps of details. This

    information is filtered, prepared and classified so that it will be a valuable aid for

    decisions and strategies.35

    Data mining extracts implicit, previously unknown and potentially useful information

    from the data in order to automate the process of discovering significant

    patterns and trends.

    2.1 The process of Data mining

    The process of data mining can be summarised in four stages: data

    collection and selection, data preparation, data mining, and result interpretation.36 37

    2.1.1 Data Collection and Selection

    The ways of collecting data include:

    In-house customer database: Companies normally keep records of customers. Customer information can be gathered from mailing lists,

    receipts, memberships, warranty registrations, etc.

    33 Kotala, P., Perera, A., Kai Zhou, J., etc. 34 Fayyad, U., Piatetsky-Shapiro, G. et al., P6. 35 Grameier, J. and Rudolph, A. 36 IBM's Data Mining Technology, 1996. 37 Bounsaythip, C. and Rinta-Runsala, E., 2001.

    External resources: There are external resources from which one can obtain information such as demographic information.

    Research survey: A frequently used way to collect particular information is to conduct a survey. The survey can be conducted through face-to-face

    interviews, telephone interviews, postal questionnaires or via the Internet.

    During the collection of data, two types of variables should be collected:38

    Classification variables classify the data set into groups. Most demographic,

    geographic, psychographic or behavioural variables can be used to classify customers

    into segments.

    Demographic variables: Age, gender, income, ethnicity, marital status, education, occupation, household size, length of residence, type of residence,

    etc.

    Geographic variables: City, state, zip code, census tract, county, region, metropolitan or rural location, population density, climate, etc.

    Psychographic variables: Attitudes, lifestyle, hobbies, risk aversion, personality traits, leadership traits, magazines read, television programmes

    watched, etc.

    Behavioural variables: Brand loyalty, usage level, benefits sought, distribution channels used, reaction to marketing factors, etc.

    Descriptor variables are variables used to describe each sub-group in a data set

    and to distinguish it from the others. In other words, the descriptor variables

    represent the characteristics of the data set. Descriptor variables must

    be easily obtainable variables that already exist in, or have been appended to, the customer

    files. Many classification variables can be used as descriptor variables.

    The data is normally stored in a data warehouse. As the data warehouse contains

    diverse types of data, the first step in conducting data mining is to select

    the data that will be used in the analysis.

    38WWW7


    2.1.2 Data Preparation

    Before the data can be analysed, the originally collected data must be

    prepared in order to make it suitable for the analysis. Data preparation

    consists of the following stages:

    1. Data cleaning:

    Check for abnormal, out-of-bounds or ambiguous items.

    Strip out unwanted fields or items. Some attributes are useless for analysis purposes, such as version numbers, email addresses, etc.

    Resolve inconsistent data formats, data encodings, geographical spellings, abbreviations and punctuation.

    2. Data description:

    Supply metadata such as row or value counts or variables.

    3. Data transformation:

    Convert string variables into numeric categorical variables, or interpret or replace codes with text.

    Check for missing values. Delete them or replace them with default values.

    Add computed fields as inputs or targets.

    Combine data from multiple sources under a common code.

    Identify fields that are used multiple times.

    Convert continuous variables into categorical variables for some methods.

    Convert nominal data into metric data.


    4. Data Sampling39

    Required for training or model building

    5. Data pruning

    Identify dependent, independent and correlated columns or variables
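    A few of the preparation stages above can be sketched on a toy customer file.
    The field names, the gender coding, the age bounds, the default age and the
    age bands below are all illustrative assumptions; treating an out-of-bounds age
    like a missing value is also a simplification chosen for this sketch.

```python
import random

# Toy raw customer records with typical defects (made-up data).
raw = [
    {"age": 34, "gender": "F", "city": "berlin "},
    {"age": None, "gender": "female", "city": "Berlin"},   # missing age
    {"age": 230, "gender": "m", "city": "BERLIN"},         # out-of-bounds age
    {"age": 51, "gender": "M", "city": "Potsdam"},
]

# Assumed encoding of inconsistent gender strings into numeric codes.
GENDER_CODES = {"f": 0, "female": 0, "m": 1, "male": 1}

def prepare(rows, default_age=40):
    """Clean and transform the raw rows into analysis-ready records."""
    cleaned = []
    for row in rows:
        age = row["age"]
        # Cleaning: replace missing or out-of-bounds ages by a default value.
        if age is None or not 0 <= age <= 120:
            age = default_age
        cleaned.append({
            "age": age,
            # Transformation: continuous variable -> categorical variable.
            "age_band": "under 40" if age < 40 else "40 and over",
            # Transformation: string variable -> numeric code.
            "gender": GENDER_CODES[row["gender"].lower()],
            # Cleaning: resolve inconsistent spellings of place names.
            "city": row["city"].strip().title(),
        })
    return cleaned

prepared = prepare(raw)

# Sampling: shuffle reproducibly, then split into training and test sets.
random.seed(0)
shuffled = random.sample(prepared, k=len(prepared))
train, test = shuffled[:3], shuffled[3:]
```

    Whether defective values are replaced or the whole record is deleted is a
    judgement call that depends on how many records are affected.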

    2.1.3 Mining

    At the mining stage, various techniques can be used to extract valuable

    information from the prepared data. For example, suppose the aim is to create an accurate,

    symbolic classification model to predict whether a reader will continue to subscribe

    to a newspaper. First, a clustering technique is used to segment

    the subscriber database; then, rule induction is used to create a classification

    model automatically for each desired cluster, through which one can predict

    the behaviour of a customer.

    2.1.4 Result Interpretation

    Result interpretation is not only to visualise (graphically or logically) the output

    of data mining, but also to filter the information and identify the most valuable

    and proper result, which will help in the decision making. If the interpreted result

    is not satisfactory, the data mining stage or even the whole data mining procedure

    should be repeated. The final extracted information must be comprehensible.

    2.2 The Aspects of Data Mining

    Data mining can be viewed in terms of its applications, operations,

    techniques and algorithms.40 41

    39 Ferguson, Mike. 40 WWW4. 41 IBM's Data Mining Technology, 1996.

    Applications   Database marketing, Customer segmentation, Customer retention,
                   Fraud detection, Credit checking, Web site analysis
    Operations     Prediction and classification modelling, Link analysis,
                   Database segmentation, Deviation detection
    Techniques     Supervised induction, Clustering, Association discovery,
                   Sequence discovery

    Tab. 2.1: The aspects of data mining

    2.2.1 Applications

    Data mining is widely used in customer analysis and marketing. The following

    areas cover the main applications of data mining.42

    Customer segmentation: Data mining tools automate the process of finding

    predictive information in large databases. Companies, especially retailers and

    banks, are interested in knowing whether there are sub-groups of customers who exhibit

    certain characteristics. They can use data mining to cluster the customers and

    discover groups of interest. For example, companies use data mining to analyse

    historical mailing lists in order to find the groups with a high return on investment,

    so that they can determine the new mailing target groups. Banks and credit

    companies classify credit scores to identify the customer segments that

    have lower risks.

    Relationship management: Data mining discovers and identifies previously

    unknown relationships hidden in the data. The buying patterns of customers

    are of interest to retailers and advertisers. Combined with customer

    segmentation, data mining can help them to find the relationship between

    purchased product items and customer types, or to improve the conduct

    of an advertising campaign in particular media for a specific group of customers.

    42Carbone, Patricia L.


    2.2.2 Operations

    Predictive and classification modelling: A predictive model uses the contents of a database, which reflect historical data, to automatically generate a model

    that can predict future behaviour. Classification sub-divides a data set

    according to a number of special outcomes. The goal of the modelling operation

    is to create a generalised description of the characteristics of the data.

    For instance, a marketing executive may be interested in predicting whether

    a particular consumer will switch to a new product.

    Link analysis: The goal of link analysis is to establish relationships between the records in a database. Retailers want to know which items

    a customer will purchase together in order to make decisions about

    item layout and goods purchasing. For instance, if it is found that customers

    buy a CD after purchasing a CD player, the store manager

    may decide to put the CD counter close to the CD player counter.

    Database segmentation: A database often contains various types of data, so it is often necessary to segment the data into small groups of

    related records. The purpose could be either to obtain a general description

    of each collection or to prepare for a further analysis, such as model

    creation or link analysis. Suppose the store manager wants to know the

    combination of goods purchased by customers in a particular period.

    The database could first be segmented according to a time period attribute,

    such as "Christmas sale". Then link analysis could be conducted to

    find the relationships between the combined goods.

    Deviation detection: The aim of deviation detection is to identify the outliers in a particular data set and to determine whether their presence is due to noise,

    impurities or causal reasons. This operation is the opposite of database

    segmentation and is often carried out together with it. Because outliers

    express deviation from some known expectation or norm,

    deviation detection is often the source of true discovery.
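    Deviation detection can be sketched with a simple statistical rule. The example
    below flags values that lie more than a chosen number of sample standard
    deviations from the mean; the threshold of 2.0, the function name and the toy
    basket values are assumptions, and the source does not prescribe this particular
    rule.

```python
from statistics import mean, stdev

def deviations(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return []          # all values identical: nothing deviates
    return [v for v in values if abs(v - m) / s > threshold]

# Toy basket values per store visit; the last one is unusually large.
basket_values = [20, 22, 19, 21, 23, 20, 18, 22, 21, 150]
outliers = deviations(basket_values)
```

    A flagged value is only a candidate: whether it reflects noise, a data entry
    error or a genuinely interesting customer still has to be judged separately.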

    2.2.3 Data Mining Techniques

    Numerous techniques support the operations of data mining to find the desired

    groups or relationships.


    Classification and predictive modelling is supported by supervised induction tech-

    niques. Clustering supports database segmentation. Association discovery and

    sequence discovery are used for the link analysis. The deviation detection is

    supported by statistical techniques.

    The desired relationships to be discovered by data mining are:43

    Classes: in which the data items are allocated to predetermined groups.

    Clusters: in which the data items are grouped by logical relationships.

    Associations: data is mined to identify associations.

    Sequential patterns: data is mined to anticipate the behaviour patterns and

    trends.
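    To make the notion of an association concrete, the sketch below computes the two
    measures commonly used in association discovery, support and confidence, for the
    rule "CD player implies CD". The baskets are made-up toy data and the function
    names are illustrative.

```python
# Each basket is the set of items bought in one store visit (toy data).
baskets = [
    {"CD player", "CD"},
    {"CD player", "CD", "headphones"},
    {"CD player"},
    {"CD"},
    {"headphones"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

rule_support = support({"CD player", "CD"}, baskets)           # 2 of 5 baskets
rule_confidence = confidence({"CD player"}, {"CD"}, baskets)   # 2 of 3 baskets
```

    A rule is usually considered interesting only if both measures exceed minimum
    values chosen by the analyst.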

    Supervised Induction

    Supervised induction is the process of automatically creating a classification model

    from a set of records (examples),44 which is called the training set. The records

    in the training set must belong to a set of pre-defined classes. Each class has a

    distinguishable pattern, which is generated from the existing records. Once the

    model has been induced, a new record can automatically be put into a class

    according to its pattern.

    Supervised induction comprises the steps of classification and prediction, which put

    elements into predetermined groups according to some criterion. The

    number of subgroups and the features of each subgroup are defined at the beginning.

    The features of an observation are then compared with the criterion, and the

    observation is put into the corresponding group.45 This is usually done in two steps:

    Step 1: Build a model to describe the predetermined data groups or classes. The model contains a set of classification rules (labels).

    Step 2: If the accuracy of the model or classifier is acceptable, the model can be used to classify new unlabelled data groups or elements.
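    The two steps can be sketched with a deliberately simple rule learner: for one
    chosen attribute, it learns one rule per attribute value that predicts the
    majority class. This is a stand-in chosen for illustration, not the induction
    method used in the source; the attribute `usage`, the class labels, the 0.8
    accuracy bar and the toy records are assumptions.

```python
from collections import Counter, defaultdict

def induce_rules(records, attribute, label="class"):
    """Step 1: learn one rule per attribute value (its majority class)."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[attribute]][r[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def classify(rules, record, attribute, default=None):
    """Apply the learned rules to a single record."""
    return rules.get(record[attribute], default)

def accuracy(rules, records, attribute, label="class"):
    """Share of labelled records that the rules classify correctly."""
    hits = sum(classify(rules, r, attribute) == r[label] for r in records)
    return hits / len(records)

training = [
    {"usage": "heavy", "class": "renews"},
    {"usage": "heavy", "class": "renews"},
    {"usage": "light", "class": "cancels"},
    {"usage": "light", "class": "cancels"},
    {"usage": "light", "class": "renews"},
]
holdout = [
    {"usage": "heavy", "class": "renews"},
    {"usage": "light", "class": "cancels"},
]

rules = induce_rules(training, "usage")
prediction = None
# Step 2: use the model on unlabelled data only if its accuracy is acceptable.
if accuracy(rules, holdout, "usage") >= 0.8:
    prediction = classify(rules, {"usage": "heavy"}, "usage")
```

    Measuring accuracy on records that were not used for training gives a more
    honest estimate than re-using the training set.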

    Clustering

    Clustering is a method of grouping data elements into homogeneous

    groups. It divides a heterogeneous data set into disjoint sub-groups, so that the

    elements in any one cluster are highly similar, while the elements in different

    43 Chung, H. M., Gray, P. and Manino, M., 1998. 44 IBM's Data Mining Technology, 1996. 45 Han, J. and Kamber, M., 2001, P279-325.

    clusters are highly dissimilar according to some criterion. Clustering is an

    unsupervised technique and is employed when you want to find groups of similar

    records without any preconditions. The

    difference between clustering and classification is that in clustering, the number

    of subgroups and the features (labels) of each subgroup are unknown in advance,

    while in classification, the number of subgroups and the features of each subgroup

    are defined at the beginning.

    Cluster analysis has two steps:46

    Choose a proximity measure: A proximity measure decides the similarity or
    closeness of objects. Homogeneous objects are more similar and closer.

    Choose a clustering strategy: In this step, the clustering algorithm and/or
    initial parameters are decided. According to the chosen proximity measure and
    method, the whole data set is divided into groups (clusters). The elements within
    a group should be as close as possible and the dissimilarity between groups
    should be as large as possible.

    After the clusters are built, some descriptive methods are normally employed
    to describe each cluster in order to get a comprehensive overview of the
    dissimilarity between clusters.

    1. Proximity measure

    The commonly used proximity measures include Jaccard, Tanimoto, Simple
    Matching, Minkowski, Kulczynski and Euclidean distance.
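    Three of the measures named above can be sketched in a few lines. This is a minimal illustration, not the thesis's own implementation: the function names are hypothetical, the binary measures assume vectors of 0/1 values, and the Tanimoto and Kulczynski coefficients are left out because their definitions vary between sources.

```python
def jaccard(x, y):
    """Jaccard similarity for binary vectors: joint presences / positions where either is 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Two all-zero vectors are treated as identical (an assumption of this sketch).
    return both / either if either else 1.0


def simple_matching(x, y):
    """Simple matching similarity: share of positions on which the two vectors agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def minkowski(x, y, p=2):
    """Minkowski distance of order p; p=2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


u, v = [1, 1, 0, 0], [1, 0, 1, 0]
print(jaccard(u, v), simple_matching(u, v), minkowski((0, 0), (3, 4)))
```

    Note that the first two functions return similarities (larger means closer), while the Minkowski family returns distances (smaller means closer).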

    2. Clustering strategy (method)

    The clustering methods generally belong to several major families:47

    1. Hierarchical algorithms

    2. Iterative partitioning

    3. Density search

    46 Härdle, W. and Simar, L., P295-313.
    47 Aldenderfer, M. S. and Blashfield, R. K., P35.


    4. Factor analytic

    5. Clumping

    6. Graph theoretic

    Here we only discuss two basic clustering methods: hierarchical algorithms
    and iterative partitioning algorithms.

    (1) Hierarchical algorithms

    Hierarchical clustering can be performed using two different types of
    procedures: the agglomerative procedure and the splitting procedure.

    The agglomerative procedure starts from the finest partition. It considers each
    observation as a cluster, then puts groups together to form new clusters.
    At each stage in the procedure, the number of clusters is reduced by one
    through the joining or fusing of the two groups that are considered to be
    the closest or most similar. The agglomerative algorithm is a frequently used
    procedure. It contains the following steps:48 49

    1. Construct the finest partition. Normally each observation is a group.

    2. Compute the distance or dissimilarity matrix.

    3. Find out the closest or most similar groups.

    4. Put the two most similar groups together to form a cluster.

    5. Compute the distance or dissimilarity between the new groups to get a
    reduced distance or dissimilarity matrix.

    6. Repeat steps 3 to 5 until the optimal clusters are formed.
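    The six steps can be sketched as a naive loop. This is only an illustration under stated assumptions: it uses Euclidean distance with single linkage, it recomputes group distances on demand instead of maintaining a reduced matrix, and it stops at a requested number of clusters (the hypothetical parameter `n_clusters`, at least 1) because the text does not fix a rule for when the clusters are "optimal".

```python
def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def agglomerative(points, n_clusters):
    """Naive agglomerative clustering with single linkage (assumed here)."""
    # Step 1: the finest partition, one cluster per observation (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    def group_distance(g, h):
        # Steps 2 and 5: distance between groups, here the smallest pairwise distance.
        return min(euclidean(points[a], points[b]) for a in g for b in h)

    while len(clusters) > n_clusters:
        # Step 3: find the closest pair of groups.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: group_distance(clusters[ij[0]], clusters[ij[1]]))
        # Step 4: fuse them into one cluster; the loop then repeats (step 6).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(data, 3))
```

    Each pass reduces the number of clusters by exactly one, as described above.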

    The splitting procedure is the opposite of the agglomerative procedure. It considers
    the whole data set as a cluster to start with, then splits the cluster into subgroups
    to form new clusters.

    The linkage for the agglomerative algorithm: There are many linkages to measure
    the proximity or similarity of elements and groups. The frequently used
    linkages are:

    48 Mardia, K.V., Kent, J.T. and Bibby, J.M., 1979, P360-390.
    49 Everitt, B. S. and Dunn, G., 1991, P99-126.


    Single linkage defines the distance between two groups as the smallest distance
    between their individual members.

    Complete linkage is the opposite of single linkage: it defines the distance between
    two groups as the largest distance between their individual members.

    Average linkage (non-weighted and weighted) computes the average distance.

    Centroid linkage uses the natural geometrical distance as the distance between
    groups.

    Median linkage chooses the median of the individual distances as the distance
    between groups.

    Ward linkage is related to the centroid linkage, but it uses an inertia
    distance rather than a geometric distance.
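    Four of these linkages can be written directly as distances between two groups of points. The sketch below assumes Euclidean distance between individuals and the non-weighted form of average linkage; the function names are illustrative, and median and Ward linkage are omitted.

```python
from itertools import product


def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def centroid(group):
    """Coordinate-wise mean of the points in a group."""
    return tuple(sum(coord) / len(group) for coord in zip(*group))


def single_linkage(g, h):
    return min(euclidean(a, b) for a, b in product(g, h))


def complete_linkage(g, h):
    return max(euclidean(a, b) for a, b in product(g, h))


def average_linkage(g, h):
    return sum(euclidean(a, b) for a, b in product(g, h)) / (len(g) * len(h))


def centroid_linkage(g, h):
    return euclidean(centroid(g), centroid(h))


g, h = [(0, 0), (0, 2)], [(3, 0), (5, 0)]
print(single_linkage(g, h), complete_linkage(g, h),
      average_linkage(g, h), centroid_linkage(g, h))
```

    For any two groups the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance.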

    (2) Iterative Partitioning algorithms

    Partitioning algorithms start with given groups. Elements are then exchanged
    between groups until the highest homogeneity within groups and the highest
    heterogeneity between groups, or some other criterion, is reached.

    The iterative partitioning algorithms are normally carried out according to the
    following steps:50

    1. Begin with an initial partition into a chosen number of clusters.
    Compute the centroids of these clusters.

    2. Allocate each data point to the cluster that has the closest centroid.

    3. Compute the new centroids of the new clusters. The clusters are not updated
    until a complete pass through the data has been made.

    4. Iterate steps (2) and (3) until no data points change clusters and
    the highest similarity inside the clusters is reached.
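    A k-means style loop is one common instance of these steps. The sketch below is an assumption-laden illustration rather than the procedure of the cited source: it starts from given initial centroids instead of an initial partition, uses squared Euclidean distance, and adds a hypothetical `max_iter` safeguard.

```python
def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Iterative partitioning: assign to the closest centroid, then recompute centroids."""
    centroids = list(initial_centroids)  # step 1 (centroids are given in this sketch)
    assignment = None
    for _ in range(max_iter):
        # Step 2: allocate every point to the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Step 4: stop when a full pass changes no assignment.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute centroids only after the complete pass through the data.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(pts, [(0, 0), (0, 1)]))
```

    The result can depend on the initial centroids, which is why the choice of the initial partition is listed as a step of its own.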

    Association rule discovery

    Association rule discovery is an iterative approach, also known as level-wise
    search. Association rule methods try to discover interesting relationships
    between the items in data and identify customers' behaviour patterns. A
    typical association rule example is market basket analysis. This analysis
    tries to find out, when customers go shopping, what kinds of products are

    50Aldenderfer M. S. and Blashfield, R. K., P45-49.


    more likely to be put into the shopping basket together. Through this analysis,

    retailers are able to identify which items are frequently purchased together by the

    customers.

    An association rule is a relationship of the form $X \Rightarrow Y$, where X is the
    antecedent item set and Y is the consequent item set. For example: customers
    who purchased item X are very likely also to purchase item Y at the same time.51

    There are two measures for each rule: support and confidence.52

    Support (or prevalence) indicates the occurrence frequency of an itemset:
    $s(A \Rightarrow B) = P(A \cup B)$, that is, the share of transactions that contain
    both A and B.

    Confidence (certainty or predictability) measures the validity of the pattern.
    It denotes the strength of the relationship between the items, and to what
    degree an item depends on the others.

    For example: only 5% of all customers are students who buy a computer. But if a
    customer is a student, the probability of his buying a computer is 20%. In this
    rule, 5% is the support and 20% is the confidence.

    Two other important measures for association rule discovery are:

    Expected confidence - the probability of an item being purchased regardless of
    what other items have been bought together with it. For instance, if customers buy
    a computer 40% of the time, 40% is the expected confidence.

    Lift - the difference between the confidence of a rule and the expected
    confidence, either in the form of an absolute difference or in the form of a ratio.
    When the lift is negative (difference form) or less than one (ratio form), the items
    of the rule are unlikely to occur together, i.e. the two products are unlikely to be
    purchased at the same time.
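    The four measures can be computed directly from a list of transactions. The sketch below uses a small made-up basket data set and hypothetical function names; each transaction is assumed to be a Python set of item labels.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


def expected_confidence(transactions, consequent):
    """How often the consequent is bought, regardless of anything else."""
    return support(transactions, consequent)


def lift_ratio(transactions, antecedent, consequent):
    return confidence(transactions, antecedent, consequent) / expected_confidence(transactions, consequent)


def lift_difference(transactions, antecedent, consequent):
    return confidence(transactions, antecedent, consequent) - expected_confidence(transactions, consequent)


# Toy data, invented for illustration only.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"milk"}, {"bread", "milk", "butter"}]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
print(lift_ratio(baskets, {"bread"}, {"milk"}))
```

    If the two figures quoted in the text are read as belonging to the same rule (confidence 20%, expected confidence 40%), the ratio form of the lift is 0.2 / 0.4 = 0.5 and the difference form is -20 percentage points, so the rule would count as a negative association.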

    The goal of association discovery is to find all the associations with at least
    s% support and c% confidence in the transaction data.

    1. Data format

    Two types of format are used to form the data for association discovery:

    1. Horizontal format: each entry is a row, each attribute is a column.

    51 Kotala, P. K., Perera, A., Kai Zhou, J., etc., 2001
    52 WWW4


    2. Vertical format: Only one column for attributes. Different entries are
    denoted by different IDs. Attributes belonging to the same entry are
    assigned the same ID number.
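    The difference between the two formats is easiest to see by converting one into the other. The sketch below assumes that the horizontal format stores one 0/1 flag per attribute column; the column names and the function name are invented for illustration.

```python
def horizontal_to_vertical(columns, rows):
    """Turn (entry_id, [flag per column]) rows into (entry_id, attribute) pairs."""
    return [
        (entry_id, column)
        for entry_id, flags in rows
        for column, flag in zip(columns, flags)
        if flag  # keep only the attributes that are present in the entry
    ]


columns = ["bread", "milk", "butter"]
horizontal = [(1, [1, 1, 0]), (2, [0, 1, 1])]
for entry_id, attribute in horizontal_to_vertical(columns, horizontal):
    print(entry_id, attribute)
```

    In the vertical result, entry 1 appears on two lines because it contains two attributes, and both lines carry the same ID.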

    2. Apriori Algorithm

    The most often used association rule algorithm is the Apriori algorithm. It
    uses the prior knowledge of itemset features to explore their further associations.
    The steps are as follows:

    Step 1: Set the minimal support and confidence as s% and c%.

    Step 2: Find all the items with a frequency above the set minimal support.

    Step 3: Based on the set of frequent items, generate the associations that reach
    at least the set confidence level.

    Step 4: Scan all the items to identify those which have at least s% support.
    Denote them as $L_1$.

    Step 5: Form item pairs from $L_1$ and denote this candidate set as $C_2$.

    Step 6: Scan all the item pairs to find all the pairs in $C_2$ with at least s% support
    and c% confidence. Denote these sets as $L_2$.

    Step 7: Iteration: repeat Step 5 and Step 6 until there are no more
    sets satisfying the constraints.

    The general description for Step 5 and Step 6 is:

    Build sets of k items from $L_{k-1}$ and let this candidate set be $C_k$.

    Scan all transactions and find all frequent sets in $C_k$ with at least s%
    support and c% confidence level; let this be $L_k$.
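    A compact sketch of the level-wise search follows. It is one possible reading of the steps, not the thesis's own code: frequent itemsets are found first using the support threshold only, and the confidence threshold is applied afterwards when rules are generated (as in Steps 2 and 3). The function names and the toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, built level by level."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    # L_1: all single items with enough support.
    level = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]
    frequent = {}
    k = 1
    while level:
        frequent.update((s, support(s)) for s in level)
        k += 1
        known = set(level)
        # C_k: unions of two members of L_{k-1} that contain exactly k items ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... keeping only those whose (k-1)-subsets are all frequent (the prior knowledge).
        candidates = {c for c in candidates
                      if all(frozenset(sub) in known for sub in combinations(c, k - 1))}
        # L_k: candidates that reach the minimal support.
        level = [c for c in candidates if support(c) >= min_support]
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for every rule above the threshold."""
    found = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return found


# Toy data, invented for illustration only.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(baskets, min_support=0.5)
for antecedent, consequent, conf in rules(frequent, min_confidence=0.7):
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 2))
```

    The pruning step is where the "prior knowledge" is used: an itemset can only be frequent if all of its subsets are frequent, so most candidates never need to be counted.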


    Sequential pattern discovery

    Sequential pattern methods can be seen as an extended association rule method
    that analyses sequenced data. They extend association by adding time to the
    transactions. For each transaction, there is a transaction time. Therefore, not
    only the attributes of each transaction but also the time when the transaction
    took place should be taken into account. Sequential analysis searches for
    temporal links between items, rather than relationships between items in a
    single transaction.53

    The sequential pattern method can find relationship patterns between the
    items or itemsets in a time episode. For example, a typical sequence pattern
    could be "Six percent of customers who bought a CD player bought a CD within
    a week."

    1. Data format

    To start a sequential pattern discovery, each time series is converted into a
    multi-item entry and duplicated items are deleted. Afterwards, the association
    rule method can be used. The constraint is that all sequential patterns must
    satisfy the user-specified minimal support.

    The sequential data is composed of sequences, or customer sequences. Each
    sequence is a list of customer orders (transactions). Each transaction contains
    a set of items. The length of a sequence is the number of itemsets that are
    contained in it. A sequence of length k is called a k-sequence.

    2. Procedure

    Sequential pattern discovery can be conducted using the following steps:54

    Step 1: Sort phase. Sort the database according to customer id and transaction id.

    Step 2: Itemset phase. Find all large sequences of length 1.

    Step 3: Transformation phase. Transform each item in the sequence into an integer.

    Step 4: Sequence phase. Find all large sequences.

    Step 5: Maximal phase. Delete all non-maximal sequences.
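    The sort phase and the support count of a candidate sequence can be sketched as follows. This is a partial illustration only: it does not implement the itemset, transformation or maximal phases, it sorts by a transaction time instead of a transaction id, and the record layout `(customer_id, time, items)` and all names are assumptions.

```python
from collections import defaultdict


def build_sequences(transactions):
    """Sort phase: order by customer and time, then collect one sequence per customer."""
    sequences = defaultdict(list)
    for customer_id, _, items in sorted(transactions, key=lambda t: (t[0], t[1])):
        sequences[customer_id].append(frozenset(items))
    return dict(sequences)


def contains(customer_sequence, pattern):
    """True if the pattern's itemsets occur, in order, inside the customer's transactions."""
    position = 0
    for itemset in pattern:
        # Advance to the next transaction that includes this itemset.
        while position < len(customer_sequence) and not itemset <= customer_sequence[position]:
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # later itemsets must match strictly later transactions
    return True


def sequence_support(sequences, pattern):
    """Share of customers whose sequence contains the pattern."""
    return sum(contains(seq, pattern) for seq in sequences.values()) / len(sequences)


# Toy data, invented for illustration only.
records = [
    (1, 2, {"CD"}), (1, 1, {"CD player"}),
    (2, 1, {"CD player", "cable"}), (2, 3, {"CD"}),
    (3, 1, {"CD"}), (3, 2, {"CD player"}),
]
pattern = [frozenset({"CD player"}), frozenset({"CD"})]
print(sequence_support(build_sequences(records), pattern))
```

    A pattern is "large" when this support reaches the specified minimum. Customer 3 in the toy data bought the same two items but in the opposite order, so that sequence does not support the pattern.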

    53 Wojciechowski, Marek
    54 Han, J. and Kamber, M., 2001, P225-271.

  • 3. XploRe user and customer analysis55

    3.1 About XploRe

    XploRe is a professional statistical software package for high-end statistical
    analysis, advanced research and interactive teaching. It was developed in 1999 by
    Prof. Wolfgang Härdle and his team at Humboldt University of Berlin, Germany.
    XploRe is a module-structured, command-driven software. The statistical methods of
    XploRe are supported by various libraries. Therefore, users can incorporate their
    own methods in XploRe and easily extend the environment. The competitive
    advantage of XploRe lies in rather advanced methods, particularly smoothing.

    The purpose of XploRe lies in the exploration and analysis of data. According to
    Prof. Härdle (1999), it aims at sophisticated users who are looking for a flexible,
    programmable statistical package with emphasis on more advanced procedures.

    The Internet is currently the main marketing instrument of XploRe. A free trial
    version of XploRe (with limitations) can be downloaded from the net.

    3.2 XploRe user (2002) and customer descriptive analysis

    3.2.1 Data collection

    XploRe user data collection

    XploRe users refer to the XploRe downloaders, who have downloaded XploRe
    from the website. They are the potential customers of XploRe.

    The collected raw data of XploRe users consists of 1734 profiles of individuals
    who downloaded the statistical software XploRe from October 11, 2001 to
    July 22, 2002. The data was collected through an online survey. A free trial
    version of XploRe could be downloaded via the homepage http://www.xplore-stat.de.

    55 User refers to the person who downloaded XploRe from the Internet, while Customer refers to the person who bought XploRe.


    The trial versions of XploRe (except for the Linux local version) do not include all
    functions and commands of XploRe, expire after two months, and are limited
    to 1000 observations. The Linux local version has no expiration date and no limit
    on the number of observations.

    Fig. 3.1: Sample of online survey questionnaire.

    Before downloading, users are asked to participate in an online survey.
    The online questionnaire mainly has two parts. All questions (except
    for the e-mail address) are answered by selecting from a set of possible
    responses.

    The first part of the questionnaire is Personal information, in which information
    about personal identity and preferences is requested. Some questions in this
    part, such as e-mail address and country, ask for the downloader's identity.
    We call them identity questions. The other kind of questions
    inquire about the preferences of downloaders, such as the way they learnt about
    XploRe, the work place where they use XploRe, the software they currently use,
    and the statistical methods they look for in XploRe. The answers to these
    questions are important to reveal the preferences of users and play a prominent
    role in user analysis. We call these questions substantive questions, because
    they provide the basic factors needed to subdivide the total user group into small
    homogeneous groups for our statistical user analysis.

    The second part of the questionnaire contains technical questions. The


    downloaders are asked to choose the preferred version of XploRe56 and the
    operating system on which XploRe will be installed, such as Windows, Linux, Sun
    etc. An example questionnaire is attached in the Appendix.

    During downloading, the date and IP address are automatically recorded. They
    are very helpful in the data cleaning procedure.

    XploRe Customer data collection

    XploRe customer here refers to those who have actually bought XploRe. I
    also call them actual customers. The data of XploRe customers is collected
    through registration forms, which are sent to customers together with XploRe.
    The return of the registration form is not compulsory. The customer data covers
    the period from 1 July 2000 to 30 August 2002. Because of a change in the
    registration form, the data after this date was not used. The new registration
    form is attached in the Appendix for reference.

    The registration form includes questions about the identity of the customer,
    such as country and language, and questions about their fields, as well as their
    operating systems.

    As a result, we get 8 variables of customer data: country, federal state (Germany),
    language, title, operating system, profile sector, profile branch and sex.

    3.2.2 Data cleaning and preparation

    An analysis based on poor quality or wrong data could deliver erroneous results
    no matter how sophisticated the statistical method is. Therefore, the raw data
    were thoroughly cleaned before being used for analysis.

    XploRe user data cleaning

    When people download XploRe, they obviously would like to complete the download
    process as quickly as possible and answer the questions as promptly as possible.
    If the questionnaire is too tedious or too complicated, downloaders may get
    impatient and give wrong or incomplete answers. In addition, in surveys

    56 XploRe has three versions: Local version, Java-Client version and ReX, which is an Excel add-in.


    it often happens that the respondents are not very serious about their answers and
    don't give actual information.

    To avoid including false information in the data, I used the personal questions
    as indicators of how seriously the questionnaire was taken and of the
    possibility of false answers. Many people gave obviously wrong answers to the
    personal questions. I assume that, if people gave false answers to the personal
    questions, they would give false answers to the substantive questions as well.
    Furthermore, based on the recorded IP addresses, suspicious observations were
    inspected and then deleted according to a set of criteria.

    The cleaning process was carried out mainly automatically with the Excel Visual Editor.
    However, the whole process of data cleaning could hardly be carried out fully
    automatically. Therefore, manual cleaning was also undertaken to delete
    the false information that the computer program could not identify, for instance
    the matching of IP addresses and the deletion of the profiles of members of the
    XploRe team. In the end, there were 1181 profiles for analysis after the cleaning.

    XploRe customer data cleaning

    The cleaning procedure for customer data is relatively simple. We suppose that
    customers know their answers will help XploRe to improve its service and
    therefore intend to provide correct information. The cleaning process, therefore,
    only includes the deletion of duplicated customer information.

    3.2.3 Data descriptive analysis and result

    In the first step, the descriptive analysis was conducted with XploRe to give an

    overview of the data.

    XploRe User descriptive analysis

    From the Table in Appendix 1, XploRe user frequency analysis, we can see the

    frequency and percentage of each variable.

    Concerning the sources through which users got to know XploRe, WWW/Newsgroup is
    the main source: 42.9% of the downloaders first learnt about XploRe through the
    Internet. The second main source is Publications and Journals; 20% of users
    learnt about XploRe through these channels.


    49.4% of users work in a university, and 9.1% of users work in a research institute.
    Users from private, non-research companies account for 6.6%. The
    interesting point is that a high percentage of users work at home. With 28.9% of
    the users, this is the second biggest group in this category.

    Excel is the most popular software, used by 25.1% of all users. Next are
    SPSS and MatLab, with 11.2% and 10.4% of users respectively. XploRe
    is a command-driven software, competitive in rather advanced statistical methods.
    Software such as S-Plus and GAUSS is more similar to XploRe in features and
    scope; their users comprise 5.5% and 4% of the total respectively. This
    fact shows that most users are more likely to choose more standard software
    such as Excel and SPSS, because of the higher programming requirements and
    difficulties in using a programmable, matrix-oriented software like XploRe. But
    the relatively high percentage of MatLab users is a sign of opportunity for
    XploRe, because MatLab is also a program-oriented software. There is a chance for
    XploRe marketing to win this type of customer.

    A large part of XploRe users work in the field of Econometrics. The other popular
    work fields are Mathematical Statistics, Finance and actuarial science, and
    Physics and engineering. Each accounts for about 10% of users.

    The most often used statistical methods, corresponding to the users' work, are
    Time series, followed by Basic statistics, Multivariate methods and Linear models.
    But regarding the methods that the users look for in XploRe, there are some
    differences. The most wanted statistical methods are Time series and Multivariate
    methods, while Non- and semiparametric methods and Graphics and exploratory
    data analysis are ranked as the third and fourth most wanted methods, respectively.
    This difference indicates that the existing statistical software is weak at Non-
    and semiparametric methods and Graphic/Exploratory methods. Therefore, the
    users try to discover more powerful instruments for these two methods.
    XploRe could emphasise its strength in these two analysis methods and thus expand
    its customer base.

    86.5% of users downloaded the local version of XploRe, and 9.3% downloaded the ReX
    version of XploRe, which is a statistical Microsoft Excel 2000 add-in. Only 4.1%
    of users downloaded the XploRe Java-Client version.

    Windows NT is the dominant platform for the local version, with 84.1% of users.
    Linux is also relatively popular: 13.2% of users downloaded the XploRe Linux version.
    For the Client version, Windows NT is still the dominant platform. Linux
    only accounts for 6.1%. Other platforms account for very small fractions.


    Name               Type         Modal Value     Modal Freq.  No. of Values
    First Learn        Categorical  WWW, Newsgroup  42.9%        5
    Work Place         Categorical  University      49.4%        6
    Software           Categorical  Excel           25.1%        17
    Work Field         Categorical  Econometrics    24.1%        10
    Method Used        Categorical  Time Series     18.7%        12
    Method Looked for  Categorical  Time Series     17.3%        12
    Xversion           Categorical  Local           86.5%        3
    Platform L         Categorical  Windows NT      84.1%        4
    Platform C         Categorical  Windows NT      87.8%        4
    OS Platform        Categorical  Windows NT      84.2%        4
    Country            Categorical  Germany         16.9%        77
    Continent          Categorical  Europe          52.7%        4

    Tab. 3.1: Summary and description of the variables of User 22/07/02 data

    XploRe users have various national backgrounds. Users from Germany
    (16.9%), USA (15.7%) and Japan (8.6%) together make up about 41% of the population.

    More than half of the users are from Europe (52.7%), followed by America and
    Asia-Pacific with 24.5% and 20.5% respectively. The reason might be that
    XploRe originates from Germany, so information and marketing are more active
    in Europe than in other areas.

    Since the variables are categorical, we can draw a picture of the typical user of
    XploRe. The modal user of XploRe is someone who is from Germany, works in
    a university and learnt about XploRe through the Internet. He uses Excel as his
    main statistical software, and he works in the field of econometrics. Time series
    is his main analysis method, and he looks for software that performs better in
    Time series methods. He downloads the local version of XploRe, and Windows NT
    is his platform.
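    The modal value and modal frequency reported for each categorical variable can be obtained with a simple frequency count. The sketch below is illustrative only: the function name is hypothetical and the sample answers are invented, not taken from the survey data.

```python
from collections import Counter


def modal_value(values):
    """Return the most frequent category and its relative frequency."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Invented answers to a "work place" question, for illustration only.
work_place = ["University", "Home", "University", "Research institute"]
print(modal_value(work_place))
```

    Applying such a count to every variable and combining the modal values gives the profile of the "modal user" described above.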

    XploRe Customer descriptive analysis

    The result of the descriptive analysis of XploRe customer is summarised in the

    Table of Appendix 2.

    The customers of XploRe come mostly from Germany; German customers make up 34.4%
    of the total. Customers from the USA are the second biggest group, with a


    Name           Type         Modal Value         Modal Freq.  Missing value
    State          Categorical  Germany             34.4%        3.1%
    Federal State  Categorical  Baden-Württemberg   3.1%         84.4%
    Sex            Categorical  Man                 21.9%        0.0%
    Language       Categorical  English             18.8%        59.4%
    Title          Categorical  Prof.               9.4%         78.1%
    OS Platform    Categorical  Windows             31.1%        68.8%
    Sector         Categorical  Research Institute  34.4%        62.5%
    Branch         Categorical  Economics           9.4%         78.1%

    Note: 1. Federal state refers to the states of Germany.
          2. Federal state has no modal value, because all the values have the
             same percentage (3.1%).

    Tab. 3.2: Summary and description of the variables for customer data

    percentage of 25%. Japanese customers follow with 9.4%. Customers
    from Italy make up 6.2% of the total. There are also customers from
    Denmark, France, Norway, the Netherlands, the UK, China and Taiwan, each
    accounting for 3.1% of the customers. Therefore, Europe is the main customer
    market of XploRe, followed by America and Asia.

    78.1% of XploRe customers are men. Women have a relatively lower percentage,
    only 21.9%. This corresponds to the findings for the XploRe users.

    English is the main language used among the customers, followed by German,
    French and Italian.

    The customers of XploRe are highly educated: 21.8% of them hold the title of
    Prof., Dr. or Prof. Dr.

    34.4% of customers work in research institutes. 3.1% of them work in companies.

    Windows is the most popular platform. 21.3% of the customers use Windows as
    their computing platform.

    The professional fields in which the customers work are diverse. Econometrics
    has the highest share among them, 9.4%. The other professional fields
    indicated in the data are statistics, biostatistics, mathema

