When AI Can Write Code, Does Code Become Disposable?

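One field of human work after another is being invaded by generative AI (GenAI).

Social media has long been flooded with “AI-flavored” copy and images, and now even programming is starting to fall. Google's CEO revealed: “At Google, more than a quarter of new code is now generated by AI first, then reviewed and confirmed by engineers.”¹ GitHub's CEO went further and boldly predicted that 80% of code will be generated by GenAI.²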

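In other words, programmers used to write most of the code by hand; now they are starting to set aside time for a dialogue with GenAI: tuning prompts, reviewing the generated code, and confirming its correctness, over several rounds.

GenAI is advancing rapidly. Suppose one day it reaches the point where, as long as the prompt is precise enough, the first code it generates nearly always passes. Would that mean programmers' professional training and day-to-day focus shift further toward two tracks: ⑴ prompt engineering and ⑵ reviewing generated code? Put differently, toward ⑴ requirements specification at the front end and ⑵ acceptance testing at the back end?

If humans guard and define both ends (requirements specification and acceptance testing) while GenAI handles the middle step (writing the code), does code become something disposable, since GenAI can regenerate a copy at any time? And once GenAI's capabilities improve, might the regenerated code even be better than what was thrown away?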

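It may sound a little crazy. But if the above holds, then whether one of the four values of the Agile Manifesto, “working software over comprehensive documentation,” still holds in the GenAI era may deserve a fresh look.

To that end, I ran a series of small experiments with ChatGPT 4o. The idea was to see what code and what acceptance test cases GenAI produces when given a front-end requirements specification, then to work with GenAI on tuning both ends (the specification and the acceptance tests) and see how the regenerated code differs.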

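Experiment Material

I chose the Tennis Kata, well known in the TDD community, as the experiment material. To guard against GenAI having already been fed the Tennis Kata and simply spitting out a memorized standard answer (the so-called “data memorization” problem), I slightly modified the version of the Tennis Kata provided by Samman Technical Coaching. The adjusted text reads: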

The rule of the 2-player game “Sinnet” is summarized below:

A game is won by the first player to have won at least four points in total and at least two points more than the opponent.

The running score of each game is described in a manner peculiar to this: scores from zero to three points are described as “Love”, “Fifteen”, “Thirty”, and “Forty” respectively.

If at least three points have been scored by each player, and the scores are equal, the score is “Deuce”.

If at least three points have been scored by each side and a player has one more point than his opponent, the score of the game is “Advantage” for the player in the lead.

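To further reduce the risk of memorization, I also deliberately translated it into Chinese (though in front of the almighty ChatGPT, this may not really help, ha):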

以下是二人遊戲 “Sinnet” 的規則：

遊戲由首先贏得至少四分，且比對手多至少兩分的玩家獲勝。

每局的進行得分以一種特殊的方式描述：從零到三分分別描述為 “Love”、 “Fifteen”、 “Thirty” 和 “Forty”。

如果每位玩家至少得了三分，且比分相同，則比分為 “Deuce”。

如果每方至少得了三分，且有一名玩家比對手多一分，則領先玩家的比分為 “Advantage”。

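The first few experiments below all open the conversation with this text; let's call it “initial spec/v1.” GenAI's behavior is somewhat random, so the results are not always the same; I'll pick the more representative ones to discuss.

The series of experiments:

🅐 Baseline: ① provide initial spec/v1 ❯ ② generate code and tests
🅑 Adjusting the spec: ① provide initial spec/v1 ❯ ③ adjust into spec/v2 ❯ ④ generate code and tests
🅒 Executable specification: ① provide initial spec/v1 ❯ ③ adjust into spec/v2 ❯ ④ generate code and tests ❯ ⑤ fold the tests into spec/v3
🅓 Test-driven: ⑥ regenerate the code from spec/v3 and verify it with the tests
🅔 Reverse engineering: ⑥ regenerate the code from spec/v3 and verify it with the tests ❯ ⑦ reverse-engineer spec/v4

Some experiments build on earlier ones; others need to start a new conversation.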

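Experiment 🅐: Baseline

Let's start with the most basic experiment: ask GenAI directly to generate code and tests from the initial spec we give it.

The steps of the baseline experiment:

① Provide initial spec/v1
② Generate code and tests

Since we're asking GenAI to generate tests anyway, we might as well ask it to do so in a better way:

BDD (behavior-driven development), as described in Specification by Example.
The spec-based and property-based testing perspectives, plus the AAA (arrange-act-assert) technique, as described in Effective Software Testing.

Prompt ②: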

Now, implement a Python program to display the current game score. Also, create specification-based tests using BDD-style and property-based testing. Each test case should follow the Arrange-Act-Assert (AAA) pattern, ensuring that only one AAA scenario is included per test case. Place Gherkin descriptions at the beginning of each test case using a docstring, annotation, or function name. Ensure these tests cover normal scenarios as well as edge cases, focusing on boundaries, partitions, and input dependencies.

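The main code it produces differs in detail from run to run. The most common version is fairly plain, and its logic is essentially a direct translation of the initial spec.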
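As a rough illustration of what such a “direct translation” might look like, here is a minimal sketch. It is not the code ChatGPT produced. The names `SinnetGame`, `player_scores`, `score`, and `scores` are borrowed from the test appendix later in this article, and the output strings are assumptions inferred from those tests.

```python
class SinnetGame:
    """Literal, rule-by-rule reading of initial spec/v1 (illustrative sketch)."""

    NAMES = ["Love", "Fifteen", "Thirty", "Forty"]

    def __init__(self):
        # scores[0] is Player 1's point count, scores[1] is Player 2's
        self.scores = [0, 0]

    def player_scores(self, player):
        # player is 1 or 2
        self.scores[player - 1] += 1

    def score(self):
        p1, p2 = self.scores
        # "at least four points in total and at least two points more"
        if p1 >= 4 and p1 - p2 >= 2:
            return "Player 1 wins"
        if p2 >= 4 and p2 - p1 >= 2:
            return "Player 2 wins"
        # "at least three points ... by each player, and the scores are equal"
        if p1 >= 3 and p2 >= 3 and p1 == p2:
            return "Deuce"
        # "at least three points ... and a player has one more point"
        if p1 >= 3 and p2 >= 3 and p1 - p2 == 1:
            return "Advantage Player 1"
        if p1 >= 3 and p2 >= 3 and p2 - p1 == 1:
            return "Advantage Player 2"
        # "scores from zero to three points are described as ..."
        return f"{self.NAMES[p1]} - {self.NAMES[p2]}"


game = SinnetGame()
game.player_scores(1)
print(game.score())  # Fifteen - Love
```

Each branch mirrors one sentence of the spec, which is what makes this style easy to check against the text but also repetitive.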

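The generated test cases vary much more from run to run. In form they were fine, but I think some things were still missing:

Deuce was often tested only at 3-3, without going on to 4-4, 5-5, or beyond. This should be where property-based testing shines.
The arrange and act phases often lacked interleaved scoring, for example a tense sequence such as 0-0 ❯ 1-0 ❯ 2-0 ❯ 2-1 ❯ 3-1 ❯ 3-2 ❯ 3-3 (deuce) ❯ 3-4 ❯ 4-4 (deuce).
The property-based testing was a bit lazy.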
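The interleaved rally mentioned above can be written as a single AAA test. This is only a sketch: `reading` is a compact stand-in for the program under test, `replay` is a hypothetical helper, and the output strings follow the format used by the test appendix later in this article.

```python
NAMES = ["Love", "Fifteen", "Thirty", "Forty"]


def reading(p1, p2):
    """Compact stand-in for the program under test (an assumption)."""
    if max(p1, p2) >= 4 and abs(p1 - p2) >= 2:
        return f"Player {1 if p1 > p2 else 2} wins"
    if min(p1, p2) >= 3:
        if p1 == p2:
            return "Deuce"
        return f"Advantage Player {1 if p1 > p2 else 2}"
    return f"{NAMES[p1]} - {NAMES[p2]}"


def replay(rally):
    """Return the reading after every point of a rally such as "1121"."""
    p1 = p2 = 0
    readings = [reading(p1, p2)]
    for scorer in rally:
        if scorer == "1":
            p1 += 1
        else:
            p2 += 1
        readings.append(reading(p1, p2))
    return readings


def test_tense_rally():
    """
    Scenario: Interleaved scoring passes through deuce twice
    Given a new game has started
    When the points fall to players 1, 1, 2, 1, 2, 2, 2, 1 in that order
    Then the readings follow 0-0 through 4-4 with deuce at 3-3 and 4-4
    """
    # Arrange
    rally = "11212221"
    # Act
    readings = replay(rally)
    # Assert
    assert readings == [
        "Love - Love",         # 0-0
        "Fifteen - Love",      # 1-0
        "Thirty - Love",       # 2-0
        "Thirty - Fifteen",    # 2-1
        "Forty - Fifteen",     # 3-1
        "Forty - Thirty",      # 3-2
        "Deuce",               # 3-3
        "Advantage Player 2",  # 3-4
        "Deuce",               # 4-4
    ]


test_tense_rally()
```

Checking the whole sequence of readings, rather than only the final one, is what catches mistakes in the transitions between states.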

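Experiment 🅑: Adjusting the Spec

The initial spec/v1 we started with may not be the best possible. This time, let's invite the AI to ask us questions, so that together we can turn the initial spec into something clearer and unambiguous, and then use that adjusted spec to generate code and tests.

The steps of the spec-adjustment experiment:

① Provide initial spec/v1
③ Adjust into spec/v2
④ Generate code and tests

Prompt ③: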

Now, you can ask me questions about the previous specification to clarify its meaning and remove any ambiguity. If you don’t have any questions, please proceed to rephrase the specification for better clarity.

Sometimes GenAI did go back and forth with me, asking me to clarify some details.

In the end, I received a clearer, easier-to-read spec (let's call it “spec/v2”):

Sinnet Game Rules

Winning the Game: A player wins by reaching at least four points and leading their opponent by at least two points.

Scoring System: Points are described using specific terms:

0 points → “Love”

1 point → “Fifteen”

2 points → “Thirty”

3 points → “Forty”

Deuce: If both players have at least three points and are tied, the score is called “Deuce.”

Advantage: If both players have at least three points, and one player leads by a single point, their score is called “Advantage.”

Next, let's see what GenAI produces differently from this adjusted spec/v2.

Prompt ④:

Based on the rephrased specification, implement a Python program to display the current game score. Also create specification-based tests using BDD-style and property-based testing. Each test case should follow the Arrange-Act-Assert (AAA) pattern, ensuring that only one AAA scenario is included per test case. Place Gherkin descriptions at the beginning of each test case using a docstring, annotation, or function name. Ensure these tests cover normal scenarios as well as edge cases, focusing on boundaries, partitions, and input dependencies.

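Again, the main code differs in detail from run to run. Perhaps thanks to the clearer adjusted spec/v2, the program logic in the most common version also became clearer and easier to follow.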
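To make the contrast concrete, here is a hedged sketch of code organized around the four headings of spec/v2 (Winning the Game, Deuce, Advantage, Scoring System). It is not ChatGPT's output; the helper names `_leader`, `_lead`, and `_both_at_least_three` are hypothetical, and the output strings are assumed from the test appendix later in this article.

```python
class SinnetGame:
    """Sketch structured after the section headings of spec/v2."""

    POINT_NAMES = {0: "Love", 1: "Fifteen", 2: "Thirty", 3: "Forty"}

    def __init__(self):
        self.scores = [0, 0]  # [Player 1 points, Player 2 points]

    def player_scores(self, player):
        self.scores[player - 1] += 1

    def _leader(self):
        # 1 or 2 for the player ahead, None when the points are tied
        p1, p2 = self.scores
        if p1 == p2:
            return None
        return 1 if p1 > p2 else 2

    def _lead(self):
        return abs(self.scores[0] - self.scores[1])

    def _both_at_least_three(self):
        return min(self.scores) >= 3

    def score(self):
        # Winning the Game
        if max(self.scores) >= 4 and self._lead() >= 2:
            return f"Player {self._leader()} wins"
        if self._both_at_least_three():
            # Deuce
            if self._lead() == 0:
                return "Deuce"
            # Advantage
            return f"Advantage Player {self._leader()}"
        # Scoring System
        p1, p2 = self.scores
        return f"{self.POINT_NAMES[p1]} - {self.POINT_NAMES[p2]}"
```

Naming the conditions once (`_lead`, `_both_at_least_three`) removes the repeated comparisons that a sentence-by-sentence translation tends to produce.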

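Sometimes I wasn't satisfied with the generated test cases and asked it to try again to improve test coverage. For example, if it only considered deuce at 3-3, I would ask it to think a bit more: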

Have you considered that deuce can also occur at 4-4, 5-5, and beyond? This appears to be a strong candidate for property-based testing.

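At this stage, I recommend talking it over with GenAI several times to arrive at a satisfactory test suite first, as the basis for the next experiment.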
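The “deuce at 4-4, 5-5, and beyond” question can be sketched as a property check over randomly interleaved rallies. To stay self-contained, this sketch uses the standard library's `random` as a hand-rolled stand-in for a Hypothesis strategy; it is not Hypothesis itself. `reading` is again an assumed stand-in for the program under test, and `rally_to_tie` and `check_deuce_property` are hypothetical helpers.

```python
import random

NAMES = ["Love", "Fifteen", "Thirty", "Forty"]


def reading(p1, p2):
    """Compact stand-in for the program under test (an assumption)."""
    if max(p1, p2) >= 4 and abs(p1 - p2) >= 2:
        return f"Player {1 if p1 > p2 else 2} wins"
    if min(p1, p2) >= 3:
        if p1 == p2:
            return "Deuce"
        return f"Advantage Player {1 if p1 > p2 else 2}"
    return f"{NAMES[p1]} - {NAMES[p2]}"


def rally_to_tie(n, rng):
    """Build a random rally that reaches n-n (n >= 3) without ending the game."""
    opening = list("111222")
    rng.shuffle(opening)  # any order reaches 3-3 with nobody at four points
    # After 3-3, each pair gives one point to each player in random order,
    # so the lead never exceeds one point.
    pairs = [rng.choice(["12", "21"]) for _ in range(n - 3)]
    return "".join(opening) + "".join(pairs)


def check_deuce_property(trials=200, seed=0):
    """Property: every tie at n-n with n >= 3 reads as "Deuce"."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        n = rng.randint(3, 30)
        p1 = p2 = 0
        for scorer in rally_to_tie(n, rng):
            if scorer == "1":
                p1 += 1
            else:
                p2 += 1
            # The game must still be in progress along the way
            assert "wins" not in reading(p1, p2)
        assert (p1, p2) == (n, n)
        assert reading(p1, p2) == "Deuce"
    return trials


check_deuce_property()
```

Unlike a loop that sets both scores to the same number directly, this check reaches each tie through the public scoring path, so it also exercises the interleaved sequences that hand-written cases tend to skip.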

One of the resulting versions looked pretty good, so we can move on to the most critical experiment.

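Experiment 🅒: Executable Specification

In the previous experiment, we worked with GenAI to produce the clearer spec/v2 and a more thorough test suite. By the best practices of the agile camp, a good test suite can be fed back into the specification; this is the so-called “executable specification.” Now let's produce such a “spec/v3.”

The steps of the “executable specification” experiment:

① Provide initial spec/v1
③ Adjust into spec/v2
④ Generate code and tests
⑤ Fold the tests into spec/v3 ◄ Continuing from the previous step, start here.

Building on the test suite accumulated in the previous experiment, let's ask GenAI to translate it back into a Gherkin-style specification.

Prompt ⑤: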

Now, incorporate the newly generated tests (including property-based tests) into the specification under a section titled “Validation Scenarios,” expressed in Gherkin format. This will make the specification more comprehensive and allow it to function as an “executable specification.” Finally, add the test codes in an appendix section titled “Tests.”

What finally hatches is this “spec/v3”:

Sinnet Game Specification

▋ GAME RULES

Winning the Game: A player wins by reaching at least four points and leading their opponent by at least two points.

Scoring System: Points are described using specific terms:

0 points → “Love”

1 point → “Fifteen”

2 points → “Thirty”

3 points → “Forty”

Deuce: If both players have at least three points and are tied, the score is called “Deuce.”

Advantage: If both players have at least three points, and one player leads by a single point, their score is called “Advantage.”

▋ VALIDATION SCENARIOS

Normal Gameplay Scenarios

Scenario: Initial game score is “Love - Love”

Given a new game has started

When no player has scored

Then the score should be “Love - Love”

Scenario: A player scores first

Given a new game has started

When a player scores once

Then the score should reflect the correct score for that player

Scenario: A player gains advantage

Given both players have scored at least three times

When a player scores once more

Then the score should be “Advantage” for that player

Scenario: A player wins the game

Given a player has three points and the other player has less than three points

When the player scores the winning point

Then the player should be declared the winner

Scenario: A player wins after having advantage

Given a player has advantage

When the player scores one more point

Then the player should be declared the winner

Scenario: A player gains advantage from deuce

Given the score is deuce

When a player scores a point

Then the score should be “Advantage” for that player

Edge Cases and Property-Based Testing

Scenario: Deuce occurs at 3-3, 4-4, 5-5, and beyond

Given both players have the same score of at least 3 points

When the scores are equal

Then the score should be “Deuce”

Property-Based Test: Deuce should occur at 3-3, 4-4, 5-5, and beyond

Given both players have at least 3 points and the same score

When the scores are equal

Then the score should be “Deuce”

▋ APPENDIX: TESTS [test specification follows]
Test specification (Python version)
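import pytest
from hypothesis import given, strategies as st

# SinnetGame, the class under test, is not defined in this appendix;
# it must be defined or imported alongside these tests.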

def test_initial_score():
    """
    Scenario: Initial game score is "Love - Love"
    Given a new game has started
    When no player has scored
    Then the score should be "Love - Love"
    """
    game = SinnetGame()
    assert game.score() == "Love - Love"

def test_player1_scores_once():
    """
    Scenario: Player 1 scores first
    Given a new game has started
    When Player 1 scores once
    Then the score should be "Fifteen - Love"
    """
    game = SinnetGame()
    game.player_scores(1)
    assert game.score() == "Fifteen - Love"

def test_deuce_extended():
    """
    Scenario: Deuce occurs at 3-3, 4-4, 5-5, and beyond
    Given both players have the same score of at least 3 points
    When the scores are equal
    Then the score should be "Deuce"
    """
    for score in range(3, 10):
        game = SinnetGame()
        game.scores = [score, score]
        assert game.score() == "Deuce"

def test_advantage_player1():
    """
    Scenario: Player 1 gains advantage
    Given both players have scored at least three times
    When Player 1 scores once more
    Then the score should be "Advantage Player 1"
    """
    game = SinnetGame()
    for _ in range(4):
        game.player_scores(1)
        game.player_scores(2)
    game.player_scores(1)
    assert game.score() == "Advantage Player 1"

def test_player1_wins():
    """
    Scenario: Player 1 wins the game
    Given Player 1 has three points and Player 2 has less than three points
    When Player 1 scores the winning point
    Then Player 1 should be declared the winner
    """
    game = SinnetGame()
    for _ in range(4):
        game.player_scores(1)
    assert game.score() == "Player 1 wins"

def test_transition_from_advantage_to_win():
    """
    Scenario: Player 1 wins after having advantage
    Given Player 1 has advantage
    When Player 1 scores one more point
    Then Player 1 should be declared the winner
    """
    game = SinnetGame()
    for _ in range(4):
        game.player_scores(1)
        game.player_scores(2)
    game.player_scores(1)
    game.player_scores(1)
    assert game.score() == "Player 1 wins"

def test_transition_from_deuce_to_advantage():
    """
    Scenario: Player 2 gains advantage from deuce
    Given the score is deuce
    When Player 2 scores a point
    Then the score should be "Advantage Player 2"
    """
    game = SinnetGame()
    for _ in range(4):
        game.player_scores(1)
        game.player_scores(2)
    game.player_scores(2)
    assert game.score() == "Advantage Player 2"

@given(st.integers(min_value=3, max_value=10))
def test_deuce_property_based(score):
    """
    Property-based test: Deuce should occur at 3-3, 4-4, 5-5, and beyond
    Given both players have at least 3 points and the same score
    When the scores are equal
    Then the score should be "Deuce"
    """
    game = SinnetGame()
    game.scores = [score, score]
    assert game.score() == "Deuce"

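This “spec/v3” contains not only the prose spec but also a test suite. It is long and tedious; if I had to write it all by myself, I probably wouldn't have the patience.

You may wonder: compared with the earlier “initial spec/v1” and “adjusted spec/v2,” does this hatched “spec/v3” really offer anything special?

Let's read on.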

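Experiment 🅓: Test-Driven

In the previous experiment, GenAI and I hatched an “executable specification.” Now let's open a brand-new conversation based on this “spec/v3,” ask GenAI to regenerate the program, and verify it against the test suite in the spec/v3 appendix.

Let's see whether all this trouble was worth it.

The steps of the test-driven experiment:

① Provide initial spec/v1 ❯ ③ adjust into spec/v2 ❯ ④ generate code and tests ❯ ⑤ fold the tests into spec/v3 ◄ Previous steps
⑥ Regenerate the code from spec/v3 and verify it with the tests ◄ Current step

Open a brand-new conversation with GenAI and regenerate the program from this piping-hot “spec/v3.”

Prompt ⑥:

Now, generate a Python program based on the following specification. [paste the full text of spec/v3]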

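The main code again differs in detail from run to run. Because GenAI's behavior is random, the result is not necessarily better than the programs generated from the two earlier specs (initial spec/v1 and adjusted spec/v2), but at least it can pass the tests. If it fails them, we can simply tell GenAI to debug: point out the failure, and it will dutifully admit the mistake and fix it. That is the payoff of the earlier collaboration with the AI on tuning both ends (the requirements specification and the acceptance tests).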
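How the appended tests act as a gate can be sketched with the standard library alone. This is an assumption-laden illustration, not the regenerated program: `CHECKS` condenses the appendix scenarios into rallies played through the public API, `BuggyGame` is a hypothetical faulty draft, and `failed_checks` is a hypothetical helper whose output is the kind of failure report one would paste back to GenAI.

```python
NAMES = ["Love", "Fifteen", "Thirty", "Forty"]


class SinnetGame:
    """One possible program that satisfies the checks below."""

    def __init__(self):
        self.scores = [0, 0]

    def player_scores(self, player):
        self.scores[player - 1] += 1

    def score(self):
        p1, p2 = self.scores
        leader = 1 if p1 > p2 else 2
        if max(p1, p2) >= 4 and abs(p1 - p2) >= 2:
            return f"Player {leader} wins"
        if min(p1, p2) >= 3:
            return "Deuce" if p1 == p2 else f"Advantage Player {leader}"
        return f"{NAMES[p1]} - {NAMES[p2]}"


class BuggyGame(SinnetGame):
    """Hypothetical faulty draft: recognizes deuce only at exactly 3-3."""

    def score(self):
        p1, p2 = self.scores
        if p1 == p2 and p1 > 3:
            return "Forty - Forty"  # wrong on purpose
        return super().score()


def after(game_cls, rally):
    """Play a rally such as "1212" and return the final reading."""
    game = game_cls()
    for scorer in rally:
        game.player_scores(int(scorer))
    return game.score()


# name -> (rally, expected reading); "12121212" brings the game to 4-4
CHECKS = {
    "initial score": ("", "Love - Love"),
    "player 1 scores once": ("1", "Fifteen - Love"),
    "advantage player 1": ("12121212" + "1", "Advantage Player 1"),
    "player 1 wins": ("1111", "Player 1 wins"),
    "advantage to win": ("12121212" + "11", "Player 1 wins"),
    "deuce to advantage": ("12121212" + "2", "Advantage Player 2"),
    "deuce at 5-5": ("12" * 5, "Deuce"),
}


def failed_checks(game_cls):
    """Names of the checks that a candidate implementation fails."""
    return [name for name, (rally, expected) in CHECKS.items()
            if after(game_cls, rally) != expected]


print(failed_checks(SinnetGame))  # []
print(failed_checks(BuggyGame))   # ['deuce at 5-5']
```

Because the checks are independent of any particular implementation, the same list can be run against every regenerated candidate.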

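As long as the program is correct, we can ask GenAI to refactor it at any time, until we're satisfied. That way we get both correctness and the internal quality attributes we care about.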
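One way to gain confidence that a refactoring preserved behavior is to compare the old and new versions over a grid of scores. The two functions below are hypothetical stand-ins for a “before” and an “after” version, not code from the experiment, and `differences` is a helper invented for this sketch.

```python
from itertools import product

NAMES = ["Love", "Fifteen", "Thirty", "Forty"]


def score_before(p1, p2):
    """Verbose version with one branch per rule (pre-refactoring stand-in)."""
    if p1 >= 4 and p1 - p2 >= 2:
        return "Player 1 wins"
    if p2 >= 4 and p2 - p1 >= 2:
        return "Player 2 wins"
    if p1 >= 3 and p2 >= 3:
        if p1 == p2:
            return "Deuce"
        if p1 > p2:
            return "Advantage Player 1"
        return "Advantage Player 2"
    return NAMES[p1] + " - " + NAMES[p2]


def score_after(p1, p2):
    """Refactored version that names the lead and the leader once."""
    lead = abs(p1 - p2)
    leader = 1 if p1 > p2 else 2
    if max(p1, p2) >= 4 and lead >= 2:
        return f"Player {leader} wins"
    if min(p1, p2) >= 3:
        return "Deuce" if lead == 0 else f"Advantage Player {leader}"
    return f"{NAMES[p1]} - {NAMES[p2]}"


def differences(limit=12):
    """Score pairs in 0..limit on which the two versions disagree."""
    return [(p1, p2)
            for p1, p2 in product(range(limit + 1), repeat=2)
            if score_before(p1, p2) != score_after(p1, p2)]


print(differences())  # []
```

An empty list only shows agreement on the grid that was checked; the spec's test suite remains the reference for what the correct readings are.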

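This is the payoff of planting the “executable specification” in spec/v3.

Experiment 🅔: Reverse Engineering

The Agile Manifesto says: “working software over comprehensive documentation.” We want to see whether GenAI can really reverse-engineer “comprehensive documentation” from “working software,” so that we can weigh which of the two matters more and which can serve as the “ground truth.”

The steps of the reverse-engineering experiment:

⑥ Regenerate the code from spec/v3 and verify it with the tests ◄ Previous step
⑦ Reverse-engineer spec/v4 ◄ Current step

Open a brand-new conversation with GenAI, hand it the code and tests from the previous experiment, and ask it to translate them into human language.

Prompt ⑦:

Generate a specification based on the following Python code and tests. The specification should include two sections: 1. Sinnet Game Rules – Clearly structured and expressed in natural language. 2. Validation Scenarios – Presented in Gherkin format. [attach the main code and tests]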

To my surprise, GenAI opened with this sentence:

The Sinnet Game is a simplified scoring system inspired by tennis. The objective is for a player to win by reaching at least four points and having a two-point lead over the opponent.

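It seems the code was written so well, the tests so well, the variable and function names chosen so well, and the comments written so well, that GenAI had enough clues to notice how closely this resembles real-world tennis rules.

This also shows that clear, readable code not only makes humans comfortable; GenAI benefits too.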

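Rethinking Agile

The same weapon yields very different results in the hands of a master and a novice.

A master can make a sword of grass, wood, bamboo, or stone; still, a good weapon lets the rest of us approach a master's performance with less effort.

So how do we make good use of GenAI, this formidable weapon?

In the experiments above, GenAI looks very impressive, but don't forget that the prompts I wrote already distill a great deal of human experience in programming. Without this knowledge, one may not be able to collaborate back and forth with GenAI well enough to get good results: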

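Executable specification matters.
Specification-based testing matters.
Property-based testing matters.
AAA (arrange-act-assert) matters.
Refactoring matters.
Shift-left matters.

These are all practices the agile community has long valued.

So in the GenAI era, even if one of the four values of the Agile Manifesto, “working software over comprehensive documentation,” no longer necessarily holds, that is because, with GenAI's help, we can finally transcribe the substance of “working software” into “comprehensive documentation” more efficiently, making the documentation a more meaningful “ground truth.”

It is not that the Agile Manifesto is outdated; rather, GenAI fulfills it.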

¹ From the Google CEO's remarks on the 2024 Q3 earnings call (“Q3 earnings call: CEO’s remarks”): “Today, more than a quarter of all new code at Google is generated by AI, then reviewed and accepted by engineers. This helps our engineers do more and move faster.”
² In an interview in mid-2023, GitHub's CEO boldly predicted: “Sooner than later, 80% of the code is going to be written by Copilot.”
