Graph Data Science 101: Detecting Fraud with Graph Data Science

Tech First
9 min read · Nov 29, 2020


In this chapter, we walk you through an example of applying graph data science (GDS) techniques to investigate and predict financial fraud. After we familiarize you with a sample financial transaction dataset, you remove the outlier information that may skew your results and identify suspicious clusters of clients. After that, you visually explore one of the clusters for graph-based indicators of fraud and look at how graph-based features can help predict fraudulent behavior in the larger dataset.

Finding a Good Fraud Dataset

To simulate a good fraud dataset, you want realistic, synthetic data describing fraudulent transactions, so in this section, we give you a model of a finance network in which users make transactions with merchants and with each other via mobile devices. This network has patterns similar to the traditional credit card networks common in the United States, Canada, and Europe. Figure 5–1 is an example graph that uses a subset of the available nodes and relationships from data that we modified with additional identifiers.

This example uses the following node labels:

» Clients: People who have personally identifiable information (PII) such as Social Security Numbers (SSNs), phone numbers, and email addresses

» Mules: Clients who are known to have fraudulently transferred money

» Clients’ PII:

  • SSN: Social Security Numbers
  • Phone: Phone numbers
  • Email: Email addresses

These nodes are connected by the following relationship types:

» (Client)-[:HAS_PHONE]->(Phone)

» (Client)-[:HAS_SSN]->(SSN)

» (Client)-[:HAS_EMAIL]->(Email)

The analysis performed here is focused on the above information, but the dataset also contains additional information, such as transactions performed to banks, merchants, and clients.
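To make this model concrete, here is a minimal, hypothetical sketch of how one client and its identifier nodes might be created in Cypher. The property names (email, phoneNumber, ssn) mirror those used in the queries later in this chapter; the values themselves are invented for illustration:

```cypher
// Hypothetical sample data: one client linked to its PII nodes.
// Property names match the later queries; the values are invented.
MERGE (c:Client {id: 'c-001'})
MERGE (e:Email {email: 'alice@example.com'})
MERGE (p:Phone {phoneNumber: '555-010-0001'})
MERGE (s:SSN   {ssn: '123-45-6789'})
MERGE (c)-[:HAS_EMAIL]->(e)
MERGE (c)-[:HAS_PHONE]->(p)
MERGE (c)-[:HAS_SSN]->(s)
```

Fraud patterns emerge when several such clients end up pointing at the same Email, Phone, or SSN node.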

Removing Outliers

An important first step when performing fraud analysis is to check the quality of the data. Fraud datasets may contain outliers that aren’t relevant to your analysis. Outliers are rare events or items that raise suspicion by differing significantly from the majority of the data. In a graph, outliers are identified by their connectivity and the graph’s topology rather than by a property value.

The Degree Centrality algorithm measures the number of relationships that a node has. Running the Degree Centrality algorithm is therefore a good way of finding potential outliers.

Because you should expect different types of nodes to have different connectivity, you need to project a graph that consists of a single node type so you can look for outliers relevant to one type of node at a time. For example, each bank should receive many deposits (so it would have a high degree centrality score), but an individual account holder should receive far fewer. You want to compare banks against banks and accounts against accounts.

To find potential outliers, you can run the Degree Centrality algorithm against the fraud dataset with the following query:

CALL gds.alpha.degree.stream({
  nodeQuery: 'MATCH (n) WHERE n:Phone OR n:Email OR n:SSN
    RETURN id(n) AS id',
  relationshipQuery: 'MATCH
    (n1)<-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c:Client),
    (n2)<-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c)
    RETURN id(n1) AS source, id(n2) AS target'
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, nodeId, score
RETURN labels(node) AS label, nodeId, score,
       node.email, node.phoneNumber, node.ssn
ORDER BY score DESC
LIMIT 10

When you execute this query, you get many connections to fake identifiers. The output is shown in Figure 5–2. See the appendix for a full-featured view of this figure.

Four big outliers stand out with high scores (column three) for the number of connections. These outliers are nodes that represent fake email accounts, SSNs, and phone numbers.

Exclude these fake nodes from your analysis: more than likely, these people simply chose not to fill in the form accurately, rather than engaging in fraudulent activity. If not excluded, you’d find many false positives based on people sharing common bogus filler information, such as an email of “fake@fake.com.”

Next, update the labels on these nodes so they’ll be easier to exclude from future analysis. The following queries remove the original labels while adding the new “Bad” labels:

MATCH (n:Email)
WHERE n.email = 'fake@fake.com' OR n.email = 'no@gmail.com'
SET n:BadEmail REMOVE n:Email;

MATCH (n:SSN)
WHERE n.ssn = '000-00-0000'
SET n:BadSSN REMOVE n:SSN;

MATCH (n:Phone)
WHERE n.phoneNumber = '000-000-0000'
SET n:BadPhone REMOVE n:Phone;

Finding Suspicious Clusters

Want to find some actual fraudsters? Now is your time! In first-party fraud, fake accounts are created with no intention of repaying loans or debts. A common way of finding these fakesters is to look for accounts that share identifiers, like SSNs, phone numbers, and email addresses.

Islands of interacting nodes that have little connection to the larger graph aren’t representative of typical financial behavior. You can use this observation, together with the Weakly Connected Components algorithm, to find disjoint subgraphs that suspiciously share common identifiers.

The Weakly Connected Components algorithm is a community detection algorithm that finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set.

The following query runs the Weakly Connected Components over a projected graph of clients:

CALL gds.wcc.write({
  nodeQuery: 'MATCH (c:Client) RETURN id(c) AS id',
  relationshipQuery: 'MATCH
    (c1:Client)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->(intermediate)
      <-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c2:Client)
    WHERE NOT (intermediate:BadSSN)
      AND NOT (intermediate:BadEmail)
      AND NOT (intermediate:BadPhone)
    RETURN id(c1) AS source, id(c2) AS target',
  writeProperty: 'componentId'
});

In the “Removing Outliers” section earlier in this chapter, you assumed that many non-fraudsters use similar bogus form information, so now when you’re building that projected graph, you need to exclude the bad SSNs, email addresses, and phone numbers identified in the preceding section. Otherwise, you’ll end up with just one large cluster due to the commonality of bogus form data.

This Weakly Connected Components algorithm matches clients that share an email, phone number, or SSN, and writes a cluster identifier to the componentId property on each client node. Nodes that have the same componentId value are considered to be in the same cluster.

After that, you can use the following query to see a distribution of the cluster sizes returned by this algorithm:

MATCH (c:Client)
WITH c.componentId AS componentId, count(*) AS size
WITH size, count(*) AS count
RETURN CASE WHEN 1 <= size <= 2 THEN '1-2'
            WHEN 3 <= size <= 5 THEN '3-5'
            WHEN 6 <= size <= 9 THEN '6-9'
            ELSE '>= 10' END AS size,
       sum(count)
ORDER BY size

The result of this query, illustrated in Figure 5–3, shows that most clients are in small clusters of only 1 or 2 clients. See the appendix for a full-featured view of this figure.

The majority of clients are in clusters of 1 or 2 clients; fewer are in clusters of 3 to 5 or 6 to 9, and very few belong to clusters of ten clients or larger. You might expect this; most people don’t share personal identifiers. But there are nine clusters with ten or more clients sharing at least one identifier. To zoom in on just those clusters and explore them further, run the following query:

MATCH (c:Client)
WITH c.componentId AS componentId, count(*) AS numberOfClients,
     collect(c) AS clients
WHERE numberOfClients >= 10
WITH componentId, numberOfClients,
     // Find all the identifiers of clients in a cluster
     apoc.coll.toSet(apoc.coll.flatten(
       [client IN clients |
         [(client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->(id) | id]]
     )) AS ids,
     clients
RETURN componentId, numberOfClients,
       // Find out how many of those identifiers are shared
       // Only return identifiers shared by > 1 client in the cluster
       size([record IN [id IN ids | {
         id: id,
         sharedClients: size([(id)<--(client:Client)
                              WHERE client IN clients | client])
       }] WHERE record.sharedClients > 1 | record]) AS sharedIdentifiers
ORDER BY numberOfClients DESC

These query results are shown in Figure 5–4. See the appendix for a full-featured view of this figure.

Cluster 106 looks like an interesting one to explore further because it has a large number of clients and five shared identifiers between them. In the next section, you visually investigate this cluster.

Visually Exploring a Suspicious Cluster

Exploring cluster 106 in a tool like Neo4j Bloom, which we cover in Chapter 4, can help you further understand this group. You can visualize the relationships between the clients in that cluster with a Bloom search phrase. A Bloom search phrase is a way that you can define a natural language construct that executes a query against the database for you. The search phrase “explore cluster 106” finds the relationships in Figure 5–5 between clients in cluster 106.
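If you prefer to stay in Cypher rather than Bloom, a query along these lines (a sketch, assuming the componentId property written earlier) returns the same neighborhood that the search phrase visualizes:

```cypher
// Sketch: the paths an "explore cluster 106" search phrase
// might return — clients in the cluster and their identifiers.
MATCH path = (c:Client {componentId: 106})
             -[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(id)
RETURN path
```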

The nodes with horseshoe icons represent mules, the ones with people icons are clients, the ones with mail icons are email addresses, and the others are SSNs.

In this cluster, you have four mules, and you can also see three email addresses that are shared by 13 clients. At this point, you probably want to send a list of the people in this cluster to a domain expert to explore further.

From this visualization, you can see that most of the clients in this cluster share just three email accounts. A couple of people sharing an email address is plausible, but more than that may be something to explore further.

Within the cluster, some of these nodes seem more important, acting as local bridges between clients in different areas of Figure 5–5. You can then use the Betweenness Centrality algorithm to confirm your suspicions.

The Betweenness Centrality algorithm calculates the shortest paths between every pair of nodes; each node then receives a score based on the number of those shortest paths that pass through it. Nodes that most frequently lie on shortest paths have higher betweenness centrality scores.

Run this algorithm by executing the following query:

CALL gds.betweenness.write({
  nodeQuery: 'MATCH (c:Client) WHERE c.componentId = 106
    RETURN id(c) AS id',
  relationshipQuery: 'MATCH
    (c1:Client)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->(intermediate)
      <-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c2:Client)
    WHERE NOT (intermediate:BadSSN)
      AND NOT (intermediate:BadEmail)
      AND NOT (intermediate:BadPhone)
    RETURN id(c1) AS source, id(c2) AS target',
  writeProperty: 'betweennessCentrality'
})

This query stores a betweenness centrality score in the betweennessCentrality property on each client node in this cluster. After that, you can update the styling rules in Neo4j Bloom to inspect the results in Figure 5–6.

The largest nodes are the most influential nodes in the cluster. These nodes represent mules that are known to commit fraud. At this point, you’ve identified suspicious behaviors and clusters. After your fraud analysts confirm this likely nefarious activity, you can use this information to predict mules in the larger dataset.
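Instead of restyling Bloom, you can also rank the cluster’s clients directly in Cypher. This is a sketch, assuming the betweennessCentrality property written by the previous query (the Mule label check shows which high scorers are known fraudsters):

```cypher
// Sketch: list cluster 106's clients by betweenness score,
// most influential first, flagging known mules.
MATCH (c:Client)
WHERE c.componentId = 106
RETURN id(c) AS clientId,
       c:Mule AS isKnownMule,
       c.betweennessCentrality AS score
ORDER BY score DESC
```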

Predicting Fraudsters Using Graph Features

In a real dataset, you wouldn’t actually know who the mules are, but in the dataset we use, they’re identified. This identification allows you to test your hypothesis that a higher betweenness centrality score is predictive of fraud across the whole graph. A quick check shows that clients with the Mule label have an average betweenness centrality score of 0.9685, significantly higher than non-mule scores, as shown in Figure 5–7. See the appendix for a full-featured view of this figure.
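A query along these lines (a sketch, assuming betweennessCentrality has been written for every client rather than just cluster 106) produces that comparison:

```cypher
// Sketch: compare average betweenness for mules vs. other clients.
MATCH (c:Client)
RETURN c:Mule AS isMule,
       avg(c.betweennessCentrality) AS avgScore,
       count(*) AS clients
```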

Although this is a considerable indicator, the deviation and distribution of scores mean there’s overlap that could lead to false positives and negatives. In this situation, you’d want to combine the betweenness centrality score with other predictive elements and work with a data scientist to create an ML model.

One ML approach is to extract graph features for use in a binary classifier that predicts mules. Examples of graph features include:

» The betweenness centrality score

» The number of clients sharing identifiers

» The weighting of shared identifiers

» The number of known mules within <n> hops

» The size of clusters

Figure 5–8 shows several graph features we might extract for different people in the graph. See the appendix for a full-featured view of this figure.

These features can be extracted into a tabular format for training an ML model.
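For example, a query in this spirit (a sketch; the exact feature set is illustrative, and it assumes the componentId and betweennessCentrality properties written earlier) could emit one feature row per client for export:

```cypher
// Sketch: one feature row per client, ready for tabular export.
MATCH (c:Client)
OPTIONAL MATCH (c)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(id)
               <-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(other:Client)
WITH c, count(DISTINCT other) AS clientsSharingIds
RETURN id(c) AS clientId,
       c.betweennessCentrality AS betweenness,
       clientsSharingIds,
       c.componentId AS clusterId,
       c:Mule AS isMule  // the training label
```

Each row can then be fed to a binary classifier alongside any non-graph features you already have.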

After you’re happy with your fraud detection model, you can use it in production to identify other mules as your graph evolves. As new information is added to real-world graphs, it’s common to iterate on this process and create new graph features and update models.
