Cloud Spanner: Graph Intelligence using Spanner Graph Algorithms

1. Case-study: Intelligent Retail

For the case-study we take a Retail Customer with a fast-growing digital marketplace. Customer's traditional data view is limited because it shows what people buy, but not how they are connected. This gap leads to missed opportunities and rising fraud. Now, they are pivoting to a Network-First philosophy to value social and logistics connections in addition to transactional data.

Core Business Challenges to address

You have four critical challenges that require understanding how customers and logistics are interconnected:

Challenge

The Problem

The Goal

Influence Gap

Broad advertising yields low ROI; currently It's not possible to identify the real trendsetters (influencers).

Identify Influencers; that are central to the community through their connection in a connected network of customers.

Logistics Resilience

The supply chain can be vulnerable (considering they operate in different geos). If one key hub fails, the entire region can potentially lose product access.

Identify Gatekeepers; those who are critical to bridge the logistics networks together.

Ghost Networks

Fraud rings use fake profiles and shared addresses to coordinate theft and inflate ratings.

Expose Isolated Islands; hyper-connected groups with no ties to the legitimate community.

Paradox of Choice

Current suggestion-/recommendation- engine is rudimentary, generic and often ignored (e.g., "Customers who bought this also bought ...").

Build Behavioral Twins; i.e. recommendations based on similar shipping patterns and social circles.

Mapping Business Challenges to a Technical Strategy (Rows → Relationships)

In a traditional database, data is stored in isolated silos: customers in one table, transactions in another, shipping in a third. While SQL is perfect for answering "Who bought what?", it struggles to answer network-based questions.

To solve these challenges, technical strategy is to shift this perspective:

  • The Relational View (The "What"): Treats every customer as an isolated row. Finding a connection between a customer and a friend's purchase requires multiple complex "joins", which become exponentially slower as the network grows.
  • The Graph View (The "How"): Treats relationships as first-class citizens. Instead of searching through lists, we navigate a map. We can see instantly that Customer A is connected to Customer B, who ships to Location Z.

Deep diving into requirements

Solution architects come to a conclusion that business requirements and technical strategy requires a multi-model approach, and identify the following key requirements.

How Cloud Spanner fits those technical requirements

Cloud Spanner is chosen as the heart of this transformation. It allows the Customer to keep its rock-solid relational foundation while simultaneously unlocking deep graph insights.

Here is a quick rundown of how Cloud Spanner addresses technical requirements and more.

On top of that Cloud Spanner provides a Future proof technical architecture

2. Setting up Data foundation

Following the business case we now move into the implementation phase. In this section, we define our data architecture, explore the limitations of the traditional relational model, and introduce the Property Graph as our primary tool for uncovering deep insights.

Setup Cloud Spanner Enterprise Instance

Step 1: Enable Cloud Spanner API

In the Google Cloud Console, click the Menu icon on the top left of the screen for the left navigation. Scroll down and select "Spanner", or alternatively search for "Spanner"

You should now see the Cloud Spanner UI, and assuming you are using a project that hasn't enabled the Cloud Spanner API yet, you will see a dialog asking you to enable it. If you have already enabled the API, you can skip this step.

Click on "Enable" to continue:

1d9ce329b076805a.png

Step 2: Create Cloud Spanner Instance

First, you will create a Cloud Spanner instance. In the UI, click on "Create a Provisioned Instance" to create a new instance.

In the first step you have to select an edition. Please note that you can upgrade Edition afterwards as well. In order to use multi-model capabilities (Spanner Graph) we can go with Enterprise Edition.

c44a0b5b5ba24877.png

Naming your instance

9cf487f702beece3.png

Select a deployment configuration and select a regional of your choice.

f2c4c364703aecf.png

You can also compare various configuration options. For instance, the deployment configuration has at minimum 3 R/W replicas in 3 separate zones of your selected region. i.e. even if you go with a single node deployment you have 3 copies through 3 R/W replicas. In addition even with Regional deployment configuration you can further extend by having additional R/O replicas in your deployment topology.

Upon configuration of capacity, you can either start at full-node and autoscaling at nodes; or you can use a granular instance (Processing Units; 1000 PUs = 1 node). Optionally you can also set autoscaling targets of instance. For low latency workloads, we recommend 65% for regional instances and 45% for multi-region instances.

6325a8561bbc61c7.png

Step 3: Create a Database

Once your instance is provisioned click "Create Database" to create a database for the rest of your codelab.

422b0317859ec7df.png

Setting-up a Relational Foundation

Our journey begins with the core tables that store operational data. In Cloud Spanner, we use Interleaving to physically co-locate related data, such as a customer's friendships and transactions directly with the customer record. This ensures high-performance access and physical locality.

DDL: Creating the Tables

Copy and execute the following blocks to establish your relational schema:

-- NODE: Customer (Parent)
CREATE TABLE Customer (
  customer_id STRING(60) NOT NULL,
  customer_email STRING(32),

  -- Placeholder fields for Algorithm results
  pagerank_score FLOAT64,        
  centrality_score FLOAT64,      
  community_id INT64            
) PRIMARY KEY(customer_id);

-- EDGE: CustomerFriendship (Interleaved in Customer)
CREATE TABLE CustomerFriendship (
  customer_id STRING(60) NOT NULL,
  friend_id STRING(60) NOT NULL,
  friendship_strength FLOAT64,
  created_at TIMESTAMP,
  CONSTRAINT FK_Friend FOREIGN KEY(friend_id) REFERENCES Customer(customer_id)
) PRIMARY KEY(customer_id, friend_id),
  INTERLEAVE IN PARENT Customer ON DELETE CASCADE;

-- NODE: Product
CREATE TABLE Product (
  product_id STRING(60) NOT NULL,
  product_name STRING(32),
  unit_price FLOAT64,
  pagerank_score FLOAT64
) PRIMARY KEY(product_id);

-- NODE: Shipping
CREATE TABLE Shipping (
  shipping_id STRING(60) NOT NULL,
  city STRING(32),
  country STRING(32)
) PRIMARY KEY(shipping_id);

-- EDGE: Transactions (Interleaved in Customer)
CREATE TABLE Transactions (
  customer_id STRING(60) NOT NULL,
  row_id STRING(36) DEFAULT (GENERATE_UUID()),
  product_id STRING(60) NOT NULL,
  shipping_id STRING(60) NOT NULL,
  transaction_date TIMESTAMP,
  amount FLOAT64,
  CONSTRAINT FK_Prod FOREIGN KEY(product_id) REFERENCES Product(product_id),
  CONSTRAINT FK_Ship FOREIGN KEY(shipping_id) REFERENCES Shipping(shipping_id)
) PRIMARY KEY(customer_id, row_id),
  INTERLEAVE IN PARENT Customer ON DELETE CASCADE;

Seeding the Network

With our tables ready, we must populate them with the users, products, and connections that define The Customer's ecosystem.

-- Populate Products & Shipping
INSERT INTO Product (product_id, product_name, unit_price) VALUES
('P1', 'Smartphone Pro', 999.00), ('P2', 'Wireless Earbuds', 150.00),
('P3', 'USB-C Cable', 25.00), ('P4', '4K Monitor', 450.00),
('P5', 'Ergonomic Chair', 300.00), ('P6', 'Desk Lamp', 45.00);

INSERT INTO Shipping (shipping_id, city, country) VALUES
('S1', 'New York', 'USA'), ('S2', 'London', 'UK'), ('S3', 'Tokyo', 'Japan'),
('S4', 'San Francisco', 'USA'), ('S5', 'Berlin', 'Germany');

-- Populate Customers
INSERT INTO Customer (customer_id, customer_email) VALUES
('C1', 'alice@example.com'), ('C2', 'bob@example.com'), ('C3', 'charlie@example.com'),
('C4', 'david@example.com'), ('C5', 'eve@example.com'), ('C6', 'frank@example.com'),
('C7', 'grace@example.com'), ('C8', 'heidi@example.com'), ('C9', 'ivan@example.com'),
('C10', 'judy@example.com'), ('C11', 'mallory@example.com'), ('C12', 'trent@example.com');

-- Populate Friendships
INSERT INTO CustomerFriendship (customer_id, friend_id, friendship_strength, created_at) VALUES
('C1', 'C2', 1.0, CURRENT_TIMESTAMP()), ('C1', 'C3', 1.0, CURRENT_TIMESTAMP()), 
('C2', 'C1', 0.8, CURRENT_TIMESTAMP()), ('C3', 'C1', 0.9, CURRENT_TIMESTAMP()),
('C3', 'C4', 0.5, CURRENT_TIMESTAMP()), ('C4', 'C5', 0.5, CURRENT_TIMESTAMP()),
('C5', 'C6', 1.0, CURRENT_TIMESTAMP()), ('C5', 'C7', 0.8, CURRENT_TIMESTAMP()), 
('C7', 'C8', 0.7, CURRENT_TIMESTAMP()), ('C8', 'C5', 0.6, CURRENT_TIMESTAMP()),
('C11', 'C1', 1.0, CURRENT_TIMESTAMP()), ('C11', 'C5', 1.0, CURRENT_TIMESTAMP()), 
('C11', 'C7', 1.0, CURRENT_TIMESTAMP()), ('C11', 'C12', 0.5, CURRENT_TIMESTAMP()),
('C1', 'C11', 0.9, CURRENT_TIMESTAMP()), ('C5', 'C11', 0.9, CURRENT_TIMESTAMP()),
('C9', 'C10', 1.0, CURRENT_TIMESTAMP()), ('C10', 'C9', 1.0, CURRENT_TIMESTAMP());

-- Populate Transactions
INSERT INTO Transactions (customer_id, product_id, shipping_id, amount, transaction_date) VALUES
('C1', 'P1', 'S1', 999.00, CURRENT_TIMESTAMP()), ('C2', 'P1', 'S1', 999.00, CURRENT_TIMESTAMP()),
('C11', 'P4', 'S4', 450.00, CURRENT_TIMESTAMP()), ('C11', 'P5', 'S4', 300.00, CURRENT_TIMESTAMP()),
('C7', 'P5', 'S5', 300.00, CURRENT_TIMESTAMP()), ('C8', 'P6', 'S5', 45.00, CURRENT_TIMESTAMP()),
('C9', 'P1', 'S1', 999.00, CURRENT_TIMESTAMP()), ('C10', 'P1', 'S1', 999.00, CURRENT_TIMESTAMP());

Relational Challenge

Before we introduce the graph, let's see how traditional SQL handles The Customer's challenges. Run this query to find "Social Spenders"customers who spend significantly and have several friends.

SELECT 
    c.customer_id, 
    c.customer_email,
    SUM(t.amount) AS total_spent,
    COUNT(DISTINCT f.friend_id) AS friend_count
FROM Customer AS c
LEFT JOIN Transactions AS t ON c.customer_id = t.customer_id
LEFT JOIN CustomerFriendship AS f ON c.customer_id = f.customer_id
GROUP BY c.customer_id, c.customer_email
HAVING total_spent > 500
ORDER BY total_spent DESC;

The Limitations of the Relational Approach

Overcoming Relational Challenges through a Property Graph

To overcome these limits, we define a Property Graph. This creates an "overlay" that allows us to treat relationships as first-class citizens without moving our data out of Spanner.

DDL: Creating the Property Graph

This DDL defines our Nodes (Entities) and Edges (Relationships). In this example we are following a schematized graph, however Spanner Graph allows allows modeling schemaless graphs to enable flexible, rapid iterative development and to handle evolving data models without constant DDL (Data Definition Language) changes.

CREATE OR REPLACE PROPERTY GRAPH RetailTransactionGraph
NODE TABLES (
  Customer KEY (customer_id),
  Product KEY (product_id),
  Shipping KEY (shipping_id)
)
EDGE TABLES (
  CustomerFriendship AS IsFriendsWith
    SOURCE KEY (customer_id) REFERENCES Customer (customer_id)
    DESTINATION KEY (friend_id) REFERENCES Customer (customer_id)
    LABEL IsFriendsWith,
    
  Transactions AS Purchased
    SOURCE KEY (customer_id) REFERENCES Customer (customer_id)
    DESTINATION KEY (product_id) REFERENCES Product (product_id)
    LABEL Purchased,

  Transactions AS LivesAt
    SOURCE KEY (customer_id) REFERENCES Customer (customer_id)
    DESTINATION KEY (shipping_id) REFERENCES Shipping (shipping_id)
    LABEL LivesAt
);

Now that our graph is defined, we can use Graph Query Language (GQL) to perform multi-hop traversals with a simple, readable syntax.

Exploration 1: Collaborative Discovery

This query traverses the graph to find products purchased by your friends and serves as the foundation of a recommendation engine.

GRAPH RetailTransactionGraph
MATCH (me:Customer)-[:IsFriendsWith]->(friend:Customer)-[:Purchased]->(p:Product)
WHERE me.customer_id = 'C1'
RETURN 
    me.customer_id AS my_id, 
    friend.customer_id AS friend_id, 
    p.product_name AS recommendation

Exploration 2: The Hybrid Query (Relational + Graph)

Spanner allows you to embed GQL patterns inside a standard SQL FROM clause using the GRAPH_TABLE function. This query finds customers who live at the same location as their friendsa "diamond" pattern match.

SELECT *
FROM GRAPH_TABLE(RetailTransactionGraph
  MATCH (a:Customer)-[:IsFriendsWith]-(b:Customer),
        (a)-[:LivesAt]->(loc:Shipping),
        (b)-[:LivesAt]->(loc)
  RETURN a.customer_id AS user_A, b.customer_id AS user_B, loc.city
)

Visualizing The Customer's Connections

Finally, let's use GQL to visualize our network. These queries wrap the path results in SAFE_TO_JSON, allowing visualizers to draw the nodes and lines.

Visualizing the Super-Influencer

This highlights Mallory (C11) and her direct social reach.

GRAPH RetailTransactionGraph
MATCH p = (c:Customer {customer_id: 'C11'})-[:IsFriendsWith]->(f:Customer)
RETURN SAFE_TO_JSON(p) AS social_paths

f0c713fc048cd0a5.png

Visualizing Potential Fraud Patterns

This query roots out the "Isolated Cluster" (Ivan & Judy) to see where their products are being shipped.

GRAPH RetailTransactionGraph
MATCH p = (c:Customer)-[:Purchased]->(prod:Product),
      q = (c)-[:LivesAt]->(loc:Shipping)
WHERE c.customer_id IN ('C9', 'C10')
RETURN SAFE_TO_JSON(p) AS purchase_path, SAFE_TO_JSON(q) AS shipping_path

3. Introduction to Spanner Graph Algorithms

To prepare for our deep dive into Graph Intelligence, this section outlines the technical architecture and foundational rules of Cloud Spanner Graph Algorithms. Understanding these principles is key to moving from simple traversals to petabyte-scale relationship analysis.

The Algorithm Portfolio

Cloud Spanner currently supports 14 industry-standard graph algorithms, categorized into four functional groups to solve diverse business problems:

Category

Supported Algorithms

Business Use Case

Centrality

PageRank, Personalized PageRank, Betweenness, Closeness

Identify influencers, hubs, and bottlenecks.

Community

WCC, Label Propagation, Clique Finding, Correlation Clustering

Detect fraud rings, social communities, and silos.

Similarity

Jaccard, Cosine, Common Neighbors, Total Neighbors

Power recommendation engines and entity resolution.

Path Finding

Set-to-set Shortest Path, GA Path helpers

Optimize logistics and traversal proximity.

Crucial Schema & Query Considerations

In order to ensure efficient Graph algorithms execution, Spanner Graph needs to adhere to these rules:

Requirement 1. Physical Data Locality (Interleaving)

The most critical requirement for high-performance graph traversal is Interleaving. This ensures that edge data is physically stored on the same server split as the source node, minimizing network latency during algorithm execution.

  • The Rule: Edge tables MUST be interleaved in their source node tables.
  • Forward Traversal: Interleaving the edge table into the source node table ensures cache locality for outgoing links.
  • Reverse Traversal: For efficient "incoming" link analysis, use Foreign Keys to automatically create backing indexes, or create a secondary index interleaved in the destination table.

Requirement 2. Unique Labeling Requirements

Every table participating in the property graph must have a unique identity. Algorithms rely on these labels to correctly identify and load the subgraphs they need to analyze.

  • The Rule: Each input table must have a uniquely identifying label within the property graph.
  • The Conflict: You cannot map a single label to multiple tables if you intend to run algorithms on them.

Logic

Example

Result

❌ Bad

NODE TABLES (Person LABEL Entity, Account LABEL Entity)

Invalid: The algorithm cannot distinguish between a Person and an Account.

✅ Good

NODE TABLES (Person LABEL Customer, Account LABEL Account)

Valid: Each entity has a distinct, unique label.

Requirement 3. Algorithm Query Structure (The MATCH Clause)

When calling an algorithm, the MATCH clause follows more restrictive rules than standard GQL queries to ensure the execution engine can optimize the analytical pipeline.

  • One Pattern Per MATCH: Each MATCH statement can only name one variable.
  • No Multi-Node Patterns: You cannot define a relationship pattern (e.g., (a)-[e]->(b)) directly inside a MATCH clause intended for an algorithm call.
  • Literal Filters Only: While you can use WHERE clauses to filter nodes (e.g., WHERE a.id > 400), query parameters (@param) are not currently supported in graph algorithm queries.

Requirement 4. The RETURN Clause (Scalars Only)

The RETURN clause in an algorithm query acts as the bridge between the graph world and the relational world. It is strictly limited to returning scalars and constants.

  • The Rule: You cannot return a "Graph Element" (the raw node or edge object).
  • No Transformations: You cannot perform mathematical operations or apply functions to the properties being returned within the RETURN statement itself.

RETURN Clause Restrictions

✅ Supported

❌ Not Supported

RETURN node.id, score

RETURN node, score (Cannot return Graph Element)

RETURN PATH_LENGTH(p)

RETURN node.id + 1, score (No operations on properties)

RETURN node.name

RETURN JSON_OBJECT(node.id, score) (No functions)

Requirement 5. Data Integrity: Eliminating Dangling Edges

A "Dangling Edge" occurs when an edge points to a destination node that does not exist in the graph. This causes algorithm execution to fail because the graph structure is inconsistent.

  • The Solution: Use referential constraints (Foreign Keys) and ON DELETE CASCADE to maintain graph integrity.
  • Query Safety: When calling an algorithm, you must ensure that all nodes referred to by the selected edges are also included in the node_labels argument.

Persistent Output: EXPORT DATA Options

Because graph algorithms are compute-intensive, they are executed in Scale-up execution mode using the EXPORT DATA statement. This leverages Data Boost, using independent serverless compute resources to prevent any lag on your production transactions.

Option 1: Persist back to Cloud Spanner

To push results directly back into your tables (e.g., saving a PageRank score), use format = ‘CLOUD_SPANNER'.

  • update_ignore_all: Only updates rows for keys that already exist in the target table.
  • upsert_ignore_all: Updates existing rows or inserts new rows if the keys are missing.
EXPORT DATA OPTIONS (
  format = 'CLOUD_SPANNER',
  table = 'Customer',
  write_mode = 'update_ignore_all'
) AS
GRAPH RetailTransactionGraph
CALL PageRank(...)
RETURN node.customer_id, score;

Option 2: Persist results to Google Cloud Storage (GCS)

For large-scale offline analysis, you can export to GCS in CSV, Avro, or Parquet formats.

  • Wildcards: Use uri => 'gs://bucket/file_*.csv' to enable sharded output, allowing Spanner to write to multiple files in parallel for massive datasets.
  • Compression: Supports GZIP, SNAPPY, and ZSTD to optimize storage costs.
EXPORT DATA OPTIONS (
  uri = 'gs://bucket/pagerank_*.csv',
  format = 'CSV',
  overwrite = true
) AS
GRAPH RetailTransactionGraph
CALL PageRank(...)
RETURN node.customer_id, score;

4. Challenge 1: Influence Gap (PageRank)

In this section, we tackle The Customer's first business hurdle: the "Influence Gap." We will move from a basic "popularity contest" to a mathematically driven map of true social influence.

Problem Statement: The Customer's marketing team has a problem. They are spending millions on broad advertising with shrinking returns because they cannot identify the "Social Superstars", those rare individuals whose endorsements ripple through the entire network.

To solve this, we need to rank our customers by Influence.

Relational Solution (Degree Centrality)

In a standard database, the easiest way to find an influencer is to simply count their followers (a metric known as Degree Centrality).

Run this query to find most "popular" users:

SELECT 
    friend_id AS customer_id, 
    COUNT(*) AS follower_count
FROM CustomerFriendship
GROUP BY friend_id
ORDER BY follower_count DESC;

customer_id

follower_count

C1

3

C5

3

C11

2

C7

2

C10

1

C12

1

C2

1

C3

1

C4

1

C6

1

C8

1

C9

1

Graph Intelligence (PageRank)

To find the real leaders, we use PageRank. This is the same algorithm that powered early web search; it measures a node's importance based on the quantity AND the quality of incoming links .

  • Random Surfer Model: PageRank simulates a user moving through the graph. The Damping Factor (default 0.85) represents the probability they continue clicking; otherwise, they "teleport" to a random node .
  • Power of Association: A link from an influential person (like Mallory) is worth significantly more than a link from someone with no other connections.

We will execute PageRank Algorithm and use EXPORT DATA to save the results directly into our pagerank_score column .

EXPORT DATA OPTIONS (
  format = 'CLOUD_SPANNER',
  table = 'Customer',
  write_mode = 'update_ignore_all' -- Updates existing rows
) AS
GRAPH RetailTransactionGraph
CALL PageRank(
  node_labels => ['Customer'],      -- Target our Customer nodes 
  edge_labels => ['IsFriendsWith'], -- Analyze the social ties
  damping_factor => 0.85,           -- Standard decay
  max_iterations => 10              -- Higher iterations for better precision
)
YIELD node, score               
RETURN node.customer_id, score as pagerank_score;

"Influence" Dashboard using PageRank

Now that the scores are persisted, let's compare our "Before" (Follower Count) with our "After" (PageRank Score).

-- Note that Higher PageRank score means more influential
SELECT 
    c.customer_id, 
    c.customer_email,
    count_query.follower_count,
    c.pagerank_score
FROM Customer c
JOIN (
    SELECT friend_id, COUNT(*) AS follower_count 
    FROM CustomerFriendship GROUP BY friend_id
) AS count_query ON c.customer_id = count_query.friend_id
ORDER BY c.pagerank_score DESC;

customer_id

customer_email

follower_count

pagerank_score

C5

eve@example.com

3

0.158392489

C10

judy@example.com

1

0.1093561724

C9

ivan@example.com

1

0.1093561724

C1

alice@example.com

3

0.1000888124

C8

heidi@example.com

1

0.09759821743

C11

mallory@example.com

2

0.09466411918

C7

grace@example.com

2

0.08016719669

C6

frank@example.com

1

0.06022448093

C2

bob@example.com

1

0.0547891818

C3

charlie@example.com

1

0.0547891818

C12

trent@example.com

1

0.04029225558

C4

david@example.com

1

0.04028172791

Analysis: Who are the Real Superstars?

By analyzing the output, you can now make three critical marketing discoveries:

Business Takeaway

Instead of blindly emailing everyone with more than five followers, The Customer's marketing team can now focus exclusively on those with the highest pagerank_score. These individuals are the true "Social Superstars" capable of driving systemic virality across the entire marketplace.

Now lets try to identify the Gatekeepers that keep The Customer's logistics network running.

5. Challenge 2: Logistic Resilience (BetweennessCentrality)

In this section, we address Logistics Resilience. We will move beyond measuring success by "volume" to identifying the vital "gatekeepers" that keep the network connected.

Relational Solution (Volume-Based Analysis)

In a standard relational setup, a "critical" shipping hub is typically defined as the one processing the most orders or generating the most revenue.

Run this query to identify "top" hubs by transaction count:

-- Identify "Critical" hubs by transaction volume
SELECT 
    s.city, 
    s.country, 
    COUNT(t.row_id) AS transaction_count,
    SUM(t.amount) AS total_revenue
FROM Shipping s
JOIN Transactions t ON s.shipping_id = t.shipping_id
GROUP BY s.city, s.country
ORDER BY transaction_count DESC;

city

country

transaction_count

total_revenue

New York

USA

4

3996

Berlin

Germany

2

345

San Francisco

USA

2

750

To address the mismatch, we will use both IsFriendsWith and LivesAt edges. This transforms our analysis from a transaction hub to also include social check.

Graph Intelligence (Betweenness Centrality)

To find the real bottlenecks, we use Betweenness Centrality. This algorithm quantifies how often a node acts as a "bridge" along the shortest paths between all other pairs of nodes in the graph. High scores pinpoint the true gatekeepers who control the flow of goods or information.

Running & Persisting Betweenness Centrality

We will execute the algorithm using EXPORT DATA and save the scores into the centrality_score column. We use Data Boost to ensure this heavy "shortest path" calculation has near-zero impact on The Customer's live operations.

EXPORT DATA OPTIONS (
  format = 'CLOUD_SPANNER',
  table = 'Customer',
  write_mode = 'update_ignore_all' 
) AS
GRAPH RetailTransactionGraph
CALL BetweennessCentrality(
  -- We include both Customer and Shipping nodes for a full ecosystem view
  node_labels => ['Customer', 'Shipping'], 
  -- We factor in social ties AND physical shipping locations
  edge_labels => ['IsFriendsWith', 'LivesAt'], 
  num_source_nodes => 100
)
YIELD node, score
-- We only persist scores for Customers; Shipping node results are safely ignored
RETURN node.customer_id, score as centrality_score;

Analysis: Identifying the "Hidden Bottlenecks"

Now, we compare our structural risk (centrality_score) with our transactional volume (order_count) to find the nodes that The Customer's leadership should worry about.

SELECT 
    c.customer_id, 
    c.customer_email,
    c.centrality_score,
    count_query.order_count
FROM Customer c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS order_count 
    FROM Transactions GROUP BY customer_id
) AS count_query ON c.customer_id = count_query.customer_id
ORDER BY c.centrality_score DESC;

customer_id

customer_email

centrality_score

order_count

C11

mallory@example.com

44.5

2

C1

alice@example.com

35.5

1

C5

eve@example.com

35.5

C7

grace@example.com

12

1

C8

heidi@example.com

10

1

C3

charlie@example.com

6

C4

david@example.com

3.5

C10

judy@example.com

0

1

C12

trent@example.com

0

C2

bob@example.com

0

1

C6

frank@example.com

0

C9

ivan@example.com

0

1

By analyzing these results, The Customer makes three startling discoveries:

Business Takeaway

The Customer can now prioritize its logistics redundancy and security protocols based on multi-modal structural risk. Mallory, Alice, and Eve are the gatekeepers who must be protected to ensure the stability of the logistical network.

Now let's try to isolate fraud islands.

6. Challenge 3: Ghost Networks (WCC)

In this section, we tackle the third business hurdle: The "Ghost Networks." We will move from simple "hotspot" detection to uncovering sophisticated, isolated fraud rings using community detection. The challenge here is that bad actors create fake profiles that share shipping addresses or interact in closed loops to coordinate thefts and inflate product ratings. But they are often completely isolated from the legitimate The Customer community.

To solve this, we need to expose these "Isolated Islands."

Without graph algorithms, the standard way to find fraud is to look for "hotspots" of shared data, such as multiple customers shipping to the exact same address .

Run this query to find customers linked by a shared shipping location:

SELECT 
    shipping_id, 
    COUNT(DISTINCT customer_id) AS customer_count,
    ARRAY_AGG(customer_id) AS linked_customers
FROM Transactions
GROUP BY shipping_id
HAVING customer_count > 1;

shipping_id

customer_count

linked_customers

S1

4

["C1","C10","C2","C9"]

S5

2

["C7","C8"]

To find the fraud networks, we need to understand transitive reachability.

Graph Intelligence (Weakly Connected Components)

To find the full extent of these rings, we use Weakly Connected Components (WCC). WCC is a clustering algorithm that identifies sets of nodes where a path exists between any two nodes, regardless of the direction of the edges.

  • Reachability Zones: It effectively carves the graph into "islands" or "reachability zones".
  • Unified Entity View: By analyzing both social ties (IsFriendsWith) and logistics ties (LivesAt) simultaneously, we can group fragmented profiles into a single, unified "Impact Cluster".

Running & Persisting WCC

We will execute the WCC algorithm and save the results into the community_id column. We use Data Boost to ensure this deep reachability analysis happens on independent compute resources.

EXPORT DATA OPTIONS (
  format = 'CLOUD_SPANNER',
  table = 'Customer',
  write_mode = 'update_ignore_all' 
) AS
GRAPH RetailTransactionGraph
CALL WeaklyConnectedComponents(
  node_labels => ['Customer', 'Shipping'], 
  edge_labels => ['IsFriendsWith', 'LivesAt']
)
YIELD node, cluster
-- node.customer_id will be NULL for Shipping nodes; 
-- EXPORT DATA will safely ignore those rows.
RETURN node.customer_id, cluster AS community_id;

Analysis: Fraud Rings

Now, let's run a validation query to see our isolated communities. Legitimate users typically belong to the "Mainland," while fraudsters are often stranded on small "Islands".

SELECT 
    community_id, 
    COUNT(*) AS member_count,
    ARRAY_AGG(customer_email) AS members
FROM Customer
GROUP BY community_id
ORDER BY member_count ASC;

community_id

member_count

members

1

2

["judy@example.com","ivan@example.com"]

0

10

["alice@example.com","mallory@example.com","trent@example.com","bob@example.com","charlie@example.com","david@example.com","eve@example.com","frank@example.com","grace@example.com","heidi@example.com"]

By running this community detection, you can identify a critical anomaly:

Business Takeaway

The Customer can now automate its security responses. Instead of manually chasing individual accounts, they can write a simple rule: "If a community_id has fewer than three members, flag the entire group for manual KYC (Know Your Customer) review"

.

With our fraud rings exposed, we can solve the "Behavioral Twin."

7. Challenge 4: Behavioral Twin (JaccardSimilarity)

In this final challenge, we address the fourth hurdle: the "Paradox of Choice"/"Behavioral Twin". We will move from generic "frequently bought together" lists to hyper-personalized recommendations based on behavioral "fingerprints."

The Customer's current product suggestions are too generic. Recommending a popular USB cable to every customer is safe, but it isn't personal. The Customer wants to build "Behavioral Twin" recommendations identifying customers who share unique shipping patterns and social circles to suggest products with high-precision matching.

To solve this, we need to calculate the "Proximity" between users.

Relational Solution (Absolute Overlap)

In a standard relational setup, you might look for people who ship to the same locations as a reference user, like Alice (C1).

Run this query to find Alice's geographic neighbors:

SELECT 
    t2.customer_id AS similar_customer,
    COUNT(DISTINCT t1.shipping_id) AS shared_locations
FROM Transactions t1
JOIN Transactions t2 ON t1.shipping_id = t2.shipping_id
WHERE t1.customer_id = 'C1' AND t2.customer_id != 'C1'
GROUP BY similar_customer
ORDER BY shared_locations DESC;

similar_customer

shared_locations

C2

1

C10

1

C9

1

Graph Intelligence (Jaccard Similarity)

To find true behavioral twins, we use Jaccard Similarity. This algorithm calculates a normalized score (0.0 to 1.0) by dividing the number of shared neighbors (Intersection) by the total number of unique neighbors (Union).

Here a "Behavioral Twin" is defined by more than just a shared shipping address. By analyzing the intersection of physical footprints (LivesAt) and social ecosystems (IsFriendsWith), we can identify users who share the same lifestyle and community influence, leading to far more accurate product recommendations.

First create a Mapping Table

Because similarity is a pairwise relationship (Customer A is similar to Customer B), we create a dedicated table interleaved in Customer to store these mappings.

CREATE TABLE CustomerSimilarity (
  customer_id STRING(60) NOT NULL, -- Renamed from source_id to match Parent PK
  target_id STRING(60) NOT NULL,
  similarity_score FLOAT64,
  CONSTRAINT FK_SourceCustomer FOREIGN KEY(customer_id) REFERENCES Customer(customer_id),
  CONSTRAINT FK_TargetCustomer FOREIGN KEY(target_id) REFERENCES Customer(customer_id)
) PRIMARY KEY(customer_id, target_id),
  INTERLEAVE IN PARENT Customer ON DELETE CASCADE;

Now Run Jaccard Similarity

We will now execute the algorithm. Note: This query includes a common "Guardrail" lesson. If you only select Customer nodes but use the LivesAt edge (which points to Shipping nodes), the query will fail citing a "Dangling Edge" . To fix this, we must include both node labels.

EXPORT DATA OPTIONS (
  format = 'CLOUD_SPANNER',
  table = 'CustomerSimilarity', 
  write_mode = 'upsert_ignore_all' 
) AS
GRAPH RetailTransactionGraph
CALL JaccardSimilarity(
  node_labels => ['Customer', 'Shipping'], -- Added Shipping to avoid dangling edges
  edge_labels => ['LivesAt', 'IsFriendsWith'], -- Use both logistics and social edges for holistic similarity
  source_nodes => ARRAY(
    SELECT s FROM GRAPH_TABLE(RetailTransactionGraph 
      MATCH (s:Customer {customer_id: 'C1'}) 
      RETURN s)
  ),
  target_nodes => ARRAY(
    SELECT t FROM GRAPH_TABLE(RetailTransactionGraph 
      MATCH (t:Customer) 
      WHERE t.customer_id != 'C1' 
      RETURN t)
  )
)
YIELD source_node, target_node, similarity
RETURN 
    source_node.customer_id AS customer_id, 
    target_node.customer_id AS target_id, 
    similarity AS similarity_score;

Analysis: "Behavioral Twin" Check

Now that the analytical job is complete, we run a validation query. By joining our new mapping table (CustomerSimilarity) with our original Customer metadata, we can see exactly who Alice's "Behavioral Twins" are.

Run this query to inspect Alice's similarity rankings:

SELECT 
    c.customer_email AS peer_email,
    s.similarity_score,
    c.community_id,
    c.pagerank_score
FROM CustomerSimilarity s
JOIN Customer c ON s.target_id = c.customer_id
WHERE s.customer_id = 'C1'
ORDER BY s.similarity_score DESC;

peer_email

similarity_score

community_id

pagerank_score

judy@example.com

0.200000003

1

0.1093561724

bob@example.com

0.200000003

0

0.0547891818

ivan@example.com

0.200000003

1

0.1093561724

eve@example.com

0.1666666716

0

0.158392489

mallory@example.com

0

0

0.09466411918

trent@example.com

0

0

0.04029225558

charlie@example.com

0

0

0.0547891818

david@example.com

0

0

0.04028172791

frank@example.com

0

0

0.06022448093

grace@example.com

0

0

0.08016719669

heidi@example.com

0

0

0.09759821743

What to look for in results:

Now lets try to build a final Unified Intelligence view.

8. Unified Intelligence

Now we move from individual technical tasks to Unified Intelligence. Here, we blend transactional data with all four graph algorithms to provide clear, actionable insights.

Report 1: Unified Intelligence

The power of a multi-model database like Spanner is the ability to join relational spend data with graph-derived influence, risk, and similarity scores in a single request. This query categorizes every customer into a specific business persona.

Run the Unified Intelligence query to see complete ecosystem:

SELECT 
    c.customer_id,
    c.customer_email,
    -- Transactional Data (Relational)
    COALESCE(t.total_spend, 0) AS spend,
    -- Graph Intelligence Data (Algorithms)
    c.pagerank_score AS influence,
    c.centrality_score AS bottleneck_risk,
    c.community_id,
    -- Persona Categorization Logic
    CASE 
        WHEN c.community_id = 1 THEN '🔴 HIGH RISK: Isolated Fraud Ring'
        WHEN c.centrality_score > 25 THEN '🔵 CRITICAL: Network Bridge'
        WHEN c.pagerank_score > 0.08 AND t.total_spend > 500 THEN ' VIP: Influential Spender'
        WHEN c.pagerank_score > 0.08 THEN '📱 SOCIAL: High-Reach Influencer'
        WHEN sim.similarity_to_alice = 1.0 AND c.community_id != 0 THEN '⚠️ WARNING: Identity Anomaly'
        ELSE '🟢 STANDARD: Active Customer'
    END AS business_persona
FROM Customer c
LEFT JOIN (
    -- Aggregate total spend per customer
    SELECT customer_id, SUM(amount) AS total_spend 
    FROM Transactions GROUP BY customer_id
) t ON c.customer_id = t.customer_id
LEFT JOIN (
    -- Pull similarity relative to our reference user 'C1'
    SELECT target_id, similarity_score AS similarity_to_alice 
    FROM CustomerSimilarity WHERE customer_id = 'C1'
) sim ON c.customer_id = sim.target_id
ORDER BY c.centrality_score DESC, c.pagerank_score DESC;

customer_id

customer_email

spend

influence

bottleneck_risk

community_id

business_persona

C11

mallory@example.com

750

0.09466411918

44.5

0

🔵 CRITICAL: Network Bridge

C5

eve@example.com

0

0.158392489

35.5

0

🔵 CRITICAL: Network Bridge

C1

alice@example.com

999

0.1000888124

35.5

0

🔵 CRITICAL: Network Bridge

C7

grace@example.com

300

0.08016719669

12

0

📱 SOCIAL: High-Reach Influencer

C8

heidi@example.com

45

0.09759821743

10

0

📱 SOCIAL: High-Reach Influencer

C3

charlie@example.com

0

0.0547891818

6

0

🟢 STANDARD: Active Customer

C4

david@example.com

0

0.04028172791

3.5

0

🟢 STANDARD: Active Customer

C10

judy@example.com

999

0.1093561724

0

1

🔴 HIGH RISK: Isolated Fraud Ring

C9

ivan@example.com

999

0.1093561724

0

1

🔴 HIGH RISK: Isolated Fraud Ring

C6

frank@example.com

0

0.06022448093

0

0

🟢 STANDARD: Active Customer

C2

bob@example.com

999

0.0547891818

0

0

🟢 STANDARD: Active Customer

C12

trent@example.com

0

0.04029225558

0

0

🟢 STANDARD: Active Customer

By blending these mathematical lenses, we move beyond "who spent the most" to "who matters the most." The unified dashboard integrates relational transaction data with multi-modal graph intelligence to categorize your ecosystem into three clear, actionable personas.

The "Critical Network Bridges" (Resilience)

Nodes like Mallory (C11), Eve (C5), and Alice (C1) are flagged because their bottleneck_risk (Betweenness Centrality) is >25.

  • The Structural Anchors: Mallory holds the highest risk score at 44.5, marking her as the primary gateway for the entire network.
  • The Zero-Spend Paradox: Eve (C5) has an order count of zero, yet she is structurally indispensable with a risk score of 35.5. Standard SQL would have ignored her entirely, but Graph Intelligence reveals she is a vital bridge to an entire sub-community.
  • The High-Value Gateway: Alice (C1) tied with Eve at 35.5, proving that high spenders can also be critical structural anchors.

The "Social Superstars" (Reach)

Heidi (C8) and Grace (C7) are identified as high-reach influencers due to their PageRank scores .

The "Isolated Fraud Ring" (Anomalies)

Judy (C10) and Ivan (C9) are flagged because they belong to the isolated community_id 1

Business Insight to Strategic Actions

Persona

Key Metric

Business Insight

Strategic Action

🔵 Network Bridges

High Centrality

Structural Anchors: Eve (C5) and Mallory (C11) hold the network together.

Retention: Protect these gatekeepers to prevent community fragmentation.

📱 Social Superstars

High PageRank

Viral Engines: Users like Heidi (C8) have the highest reach in their circles.

Marketing: Use for high-impact referral and ambassador programs.

🔴 Fraud Risks

Isolated WCC

Ghost Networks: Judy (C10) and Ivan (C9) are high-spenders but live on "islands."

Security: Immediate manual KYC review; these are classic fraud signatures.

🟢 Standard Users

Balanced Scores

Healthy Core: The majority of the network, including "local" bridges like David (C4).

Growth: Apply standard personalized ads and "Behavioral Twin" recommendations.

Report 2: The Identity Anomaly Report

Now you need to know if legitimate accounts are being "mimicked" by fraudsters. We can solve this by finding users who have 100% Behavioral Similarity but Zero Social Connection.

Run this query to flag potential "Identity Anomalies":

SELECT 
    s.target_id AS suspect_id,
    c.customer_email,
    s.similarity_score AS behavioral_overlap,
    c.community_id AS social_group
FROM CustomerSimilarity s
JOIN Customer c ON s.target_id = c.customer_id
WHERE s.customer_id = 'C1' -- Reference Alice (Legitimate)
  AND s.similarity_score > 0.15 
  AND c.community_id != 0 -- Filter for social strangers
ORDER BY s.similarity_score DESC;

The Identify Anomaly Report provides critical information. By isolating users who act like legitimate customers but lack their social ties, we move from guessing to mathematical certainty .

suspect_id

customer_email

behavioral_overlap

social_group

C10

judy@example.com

0.200000003

1

C9

ivan@example.com

0.200000003

1

Analysis of Results

By unifying Similarity (Jaccard) with Community Detection (WCC), we expose hidden risks that traditional transactional data cannot see.

  • The "Behavioral Twins" (Proximity): Nodes like Judy (C10) and Ivan (C9) are flagged because they share a Jaccard Similarity score of 0.20 relative to Alice (C1).
  • Isolation Behavior: Judy (C10) and Ivan (C9) are grouped into the isolated community_id 1, while Alice belongs to the social "Mainland" (Community 0).
  • Fraud Flags: The report identifies users with high behavioral overlap (>0.9) who remain socially disconnected from the primary network.

9. Congratulations and Summary

This lab shows how Cloud Spanner turns a relational database into a multi-model powerhouse. By applying graph intelligence to The Customer, we moved from static data to actionable business strategy.

The Spanner Multi-Model Advantage

  • Unified Architecture: Spanner allows you to maintain a rock-solid relational foundation while instantly "overlaying" a property graph for relationship mining all without the risk and lag of ETL.
  • Off-Box Analytical Isolation: By leveraging Data Boost, you can execute memory-intensive algorithms like PageRank or WCC on independent, serverless compute resources, ensuring zero impact on your production checkout performance.
  • Interleaved Performance: Spanner's unique interleaving ensures that nodes and their relationships are physically co-located, turning complex global traversals into high-speed local lookups.

Surfacing "Hidden Gems" & Anomalies

  • Identifying Structural Value: Graph algorithms like Betweenness Centrality revealed "Hidden Bridges" with zero spend who can be more critical to network's resilience than highest-spending customers.
  • Exposing Behavioral Mimicry: By combining Jaccard Similarity and Weakly Connected Components, we identified "Social Strangers". These accounts look like legitimate customers but are mathematically proven to be isolated fraud rings.
  • Global vs. Local Truth: While manual SQL analysis can surface bridges, global algorithms can surface key Gatekeepers of the network.

Making Data Intelligent and Actionable

  • Persona-Driven Strategy: We successfully transformed our rows to relationship, and by running algorithms we can address four business problems, namely: Network Bridges, Social Superstars, Fraud Risks, and Standard Users.