CFS Sessions Race Condition Test

The CFS sessions race condition test validates the robustness and reliability of the Configuration Framework Service (CFS) API when handling concurrent operations on CFS sessions. This test is designed to detect race conditions, verify proper handling of parallel requests, and ensure data consistency when multiple API operations execute simultaneously.

This page provides information about the CFS sessions race condition test and describes how the test script works.

Test purpose

The CFS sessions race condition test serves several critical purposes:

  • Race condition detection: Identifies potential race conditions in CFS API operations when multiple concurrent requests access or modify session data.
  • Concurrent operation validation: Verifies that CFS correctly handles parallel delete and get requests without data corruption or unexpected failures.
  • API consistency verification: Ensures that CFS API responses are consistent and predictable when handling multiple simultaneous operations.
  • Stress testing: Tests CFS behavior under high-concurrency scenarios to identify potential bottlenecks or failure modes.

This test is particularly important after CFS upgrades or configuration changes, or when troubleshooting intermittent CFS issues that may be related to concurrent operations.

Test prerequisites

  • This test can be run on any master or worker NCN, but not the PIT node.
  • The test script uses the Kubernetes API gateway to access CSM services. The gateway must be properly configured to allow an access token to be generated by the script.
  • The test script is installed as part of the cray-cmstools RPM.
  • Sufficient system resources must be available to create and manage multiple CFS sessions simultaneously.
  • The CFS operator deployment must be running (the test will temporarily scale it to 0 replicas during execution).

Test overview

The script file location is /opt/cray/tests/integration/csm/cfs_sessions_rc_test. Review the Test prerequisites before proceeding.

If no parameters are specified, this script performs the following steps:

  1. Environment setup:

    • Obtains the Kubernetes API gateway access token.
    • Checks the current number of cray-cfs-operator deployment replicas.
    • Scales the cray-cfs-operator deployment to 0 replicas to prevent sessions from being processed (this keeps sessions in pending state for testing).
    • Records the original replica count for restoration after the test.
  2. CFS version configuration:

    • If using CFS v2, retrieves the current global page-size setting and calculates an appropriate value based on the test parameters.
    • Sets the global page-size if needed (for v2 only) and records the original value for restoration.
    • For CFS v3, calculates an appropriate page size for API requests (defaults to 10 × max-sessions).
  3. Pre-existing session cleanup:

    • Checks for any existing CFS sessions with the test name prefix (cfs-race-condition-test by default).
    • If --delete-previous-sessions is specified, deletes any matching sessions found.
    • If --delete-previous-sessions is not specified and matching sessions exist, exits with an error.
  4. Session creation:

    • Creates a CFS configuration (if needed) for the test sessions.
    • Creates the specified number of CFS sessions (default: 20) with the configured name prefix.
    • All created sessions remain in pending state (due to the scaled-down operator).
  5. Subtest execution:

    • Runs the configured subtests (all by default, or as specified by --run-subtests or --skip-subtests).
    • Each subtest executes specific patterns of concurrent CFS API operations.
    • Validates the results of each subtest to ensure proper CFS behavior.
  6. Cleanup and restoration:

    • Deletes all CFS sessions created during the test.
    • Deletes the CFS configuration created for the test (if any).
    • Restores the cray-cfs-operator deployment to its original replica count.
    • Restores the global CFS page-size setting (for v2 only) if it was modified.

The script reports progress as it runs and prints the location of a log file that contains more detailed information. If the test fails, begin the investigation with the service that was in use at the time of the failure.

Each subtest may take several seconds to minutes, depending on the number of parallel requests and sessions configured.
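The environment setup and restoration steps above can be sketched in shell as follows. This is only an illustration of the save, scale down, and restore pattern, not the test script's actual implementation; the function names are hypothetical, while the services namespace and the cray-cfs-operator deployment name come from this page.

```shell
# Sketch only: the function names below are hypothetical helpers.
NAMESPACE="services"
DEPLOYMENT="cray-cfs-operator"

# Print the replica count currently requested for the deployment.
get_operator_replicas() {
    kubectl get deployment -n "$NAMESPACE" "$DEPLOYMENT" -o jsonpath='{.spec.replicas}'
}

# Scale the deployment to the replica count given as $1.
scale_operator() {
    kubectl scale deployment -n "$NAMESPACE" "$DEPLOYMENT" --replicas="$1"
}

# Run the given command with the operator scaled to 0 replicas, then restore
# the original replica count even if the command fails.
with_operator_scaled_down() {
    local original rc
    original="$(get_operator_replicas)" || return 1
    scale_operator 0 || return 1
    rc=0
    "$@" || rc=$?
    scale_operator "$original" || return 1
    return "$rc"
}
```

Recording the replica count before scaling down is what makes the restoration step possible; if the count were read afterwards, it would always be 0.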

Test subtypes

The CFS sessions race condition test includes six distinct subtests, each designed to test different concurrent operation patterns. By default, all subtests are executed. You can control which subtests run using the --run-subtests or --skip-subtests options (see Controlling which subtests run).

single_delete

Purpose: Tests concurrent deletion of the same individual CFS session by multiple parallel requests.

Behavior:

  • Creates a single CFS session.
  • Executes multiple parallel DELETE requests (default: 4) targeting the same session.
  • Each request attempts to delete the session by its specific name.

Expected results:

  • Exactly one DELETE request should succeed with a 2xx status code.
  • All other DELETE requests should receive HTTP 404 (Not Found) status codes.
  • No requests should time out or return unexpected error codes.
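
The expected results for this subtest amount to a simple rule over the returned status codes. A minimal sketch of that rule, assuming the status code of each DELETE request has already been collected (the function name is hypothetical and is not part of the test script):

```shell
# Hypothetical helper: validate the status codes returned by parallel DELETE
# requests that all targeted the same session.
# Arguments: one HTTP status code per DELETE request.
check_single_delete_codes() {
    local code successes=0
    for code in "$@"; do
        if [ "$code" -ge 200 ] && [ "$code" -lt 300 ]; then
            successes=$(( successes + 1 ))
        elif [ "$code" -ne 404 ]; then
            echo "unexpected DELETE status code: $code" >&2
            return 1
        fi
    done
    if [ "$successes" -ne 1 ]; then
        echo "expected exactly one successful DELETE, got $successes" >&2
        return 1
    fi
    return 0
}
```

For example, `check_single_delete_codes 404 204 404 404` succeeds, while two 2xx codes or any code other than 2xx or 404 causes a failure.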

multi_delete

Purpose: Tests concurrent batch deletion operations using the CFS multi-delete API endpoint.

Behavior:

  • Creates multiple CFS sessions (default: 20).
  • Executes multiple parallel DELETE requests (default: 4) to the sessions endpoint with query parameters.
  • Each request attempts to delete all sessions matching the test name prefix and pending status.

Expected results:

  • All DELETE requests should complete successfully with appropriate status codes.
  • Sessions should be properly deleted without orphaned resources.
  • For CFS v3, the response should include the list of deleted sessions.
  • No requests should time out or return unexpected error codes.

single_delete_single_get

Purpose: Tests concurrent operations combining single-session deletion with single-session retrieval.

Behavior:

  • Creates a single CFS session.
  • Executes parallel operations with two types of requests:
    • Single DELETE requests targeting the specific session (default: 4 parallel).
    • Single GET requests retrieving the specific session (default: 4 parallel).
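
One way to picture two request types running in parallel is with shell background jobs. The sketch below is illustrative only and does not reflect how the test script issues its requests; start_requests, send_delete, and send_get are hypothetical names.

```shell
# Hypothetical helper: start COUNT background copies of a command. Each copy
# writes its standard output (for example, an HTTP status code) to its own
# file named OUTDIR/LABEL.N.
# Arguments: LABEL COUNT OUTDIR COMMAND [ARGS...]
start_requests() {
    local label="$1" count="$2" outdir="$3" i=1
    shift 3
    while [ "$i" -le "$count" ]; do
        "$@" > "$outdir/$label.$i" &
        i=$(( i + 1 ))
    done
}

# Usage sketch (send_delete and send_get are placeholders for real requests):
#   outdir="$(mktemp -d)"
#   start_requests delete 4 "$outdir" send_delete "$session_name"
#   start_requests get 4 "$outdir" send_get "$session_name"
#   wait    # block until every background request has finished
```

Both groups are started before the single wait, so the DELETE and GET requests overlap in time rather than running one group after the other.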

Expected results:

  • Exactly one DELETE request should succeed.
  • GET requests may succeed (returning session data) or fail with 404 (if executed after deletion).
  • The combination of GET and DELETE operations should not cause data inconsistencies.
  • No requests should time out or return unexpected error codes.

single_delete_multi_get

Purpose: Tests concurrent operations combining single-session deletion with multi-session retrieval.

Behavior:

  • Creates a single CFS session.
  • Executes parallel operations with two types of requests:
    • Single DELETE requests targeting the specific session (default: 4 parallel).
    • Multi-GET requests retrieving all sessions matching the test prefix (default: 4 parallel).

Expected results:

  • Exactly one DELETE request should succeed.
  • Multi-GET requests should return consistent data (session list may or may not include the target session depending on timing).
  • The session should not appear in GET results after successful deletion.
  • No requests should time out or return unexpected error codes.

multi_delete_single_get

Purpose: Tests concurrent operations combining batch deletion with single-session retrieval.

Behavior:

  • Creates multiple CFS sessions (default: 20).
  • Selects one session as the target for GET operations.
  • Executes parallel operations with two types of requests:
    • Multi-DELETE requests targeting all test sessions (default: 4 parallel).
    • Single GET requests retrieving the specific target session (default: 4 parallel).

Expected results:

  • All DELETE requests should complete successfully.
  • GET requests may succeed (returning session data) or fail with 404 (if executed after deletion).
  • All sessions should be properly deleted.
  • No requests should time out or return unexpected error codes.
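
Because a GET may run either before or after the target session is deleted, both outcomes are acceptable. A minimal sketch of that check, assuming the status code of each GET request has been collected (the function name is hypothetical):

```shell
# Hypothetical helper: validate the status codes returned by parallel GET
# requests for a session that is being deleted at the same time.
# Arguments: one HTTP status code per GET request.
check_get_codes() {
    local code
    for code in "$@"; do
        if [ "$code" -ge 200 ] && [ "$code" -lt 300 ]; then
            continue    # the session still existed when this GET ran
        fi
        if [ "$code" -ne 404 ]; then
            echo "unexpected GET status code: $code" >&2
            return 1
        fi
    done
    return 0
}
```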

multi_delete_multi_get

Purpose: Tests concurrent operations combining batch deletion with multi-session retrieval.

Behavior:

  • Creates multiple CFS sessions (default: 20).
  • Executes parallel operations with two types of requests:
    • Multi-DELETE requests targeting all test sessions (default: 4 parallel).
    • Multi-GET requests retrieving all sessions matching the test prefix (default: 4 parallel).

Expected results:

  • All DELETE requests should complete successfully.
  • Multi-GET requests should return consistent session lists.
  • After all operations complete, no test sessions should remain.
  • No requests should time out or return unexpected error codes.
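
The requirement that no test sessions remain can be expressed as a comparison of two name lists. The sketch below assumes both lists have already been written to files, one session name per line; the function name and file layout are hypothetical.

```shell
# Hypothetical helper.
# $1: file listing the names of the sessions the test created, one per line
# $2: file listing the session names currently returned by CFS, one per line
# Prints every created session that is still listed; succeeds only if none remain.
check_no_sessions_remain() {
    local leftover
    # -F: fixed strings, -x: whole-line match, -f: read patterns from a file
    leftover="$(grep -F -x -f "$1" "$2" || true)"
    if [ -n "$leftover" ]; then
        echo "$leftover"
        return 1
    fi
    return 0
}
```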

Test options

(ncn-mw#) The script usage message can be displayed by running it with the --help argument.

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --help

The following sections cover the most commonly used options.

Controlling which subtests run

By default, the test runs all six subtests. You can control which subtests execute using one of two mutually exclusive options:

Run specific subtests only using the --run-subtests argument with a comma-separated list of subtest names:

(ncn-mw#) Example of running only the single_delete and multi_delete subtests:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --run-subtests single_delete,multi_delete

Skip specific subtests using the --skip-subtests argument with a comma-separated list of subtest names to exclude:

(ncn-mw#) Example of running all subtests except multi_delete_multi_get:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --skip-subtests multi_delete_multi_get

Valid subtest names are:

  • single_delete
  • multi_delete
  • single_delete_single_get
  • single_delete_multi_get
  • multi_delete_single_get
  • multi_delete_multi_get
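
The selection rules described in this section can be sketched as follows. This is an illustration of the logic only, not the script's own argument handling; select_subtests is a hypothetical name, and the sketch does not reject unknown subtest names.

```shell
ALL_SUBTESTS="single_delete multi_delete single_delete_single_get single_delete_multi_get multi_delete_single_get multi_delete_multi_get"

# Hypothetical helper: print the subtests to run, one per line.
# $1: comma-separated --run-subtests value (may be empty)
# $2: comma-separated --skip-subtests value (may be empty)
select_subtests() {
    local run_list="$1" skip_list="$2" name
    if [ -n "$run_list" ] && [ -n "$skip_list" ]; then
        echo "--run-subtests and --skip-subtests are mutually exclusive" >&2
        return 1
    fi
    for name in $ALL_SUBTESTS; do
        if [ -n "$run_list" ]; then
            # Surrounding commas prevent single_delete from matching
            # single_delete_single_get.
            case ",$run_list," in *",$name,"*) echo "$name" ;; esac
        else
            case ",$skip_list," in *",$name,"*) ;; *) echo "$name" ;; esac
        fi
    done
}
```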

Controlling session creation

The --max-sessions argument controls how many CFS sessions are created for subtests that use multiple sessions (default: 20).

(ncn-mw#) Example of creating 50 sessions for testing:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --max-sessions 50

The --name argument specifies the prefix for all session names created by the test (default: cfs-race-condition-test).

(ncn-mw#) Example of using a custom session name prefix:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --name my-test-prefix

Note: The session name prefix must be between 1 and 40 characters in length.
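
A wrapper that calls the test could check this length limit before starting. A minimal sketch, using a hypothetical function name:

```shell
# Hypothetical helper: succeed only if the prefix in $1 is 1 to 40 characters long.
valid_name_prefix() {
    local len
    len=${#1}
    [ "$len" -ge 1 ] && [ "$len" -le 40 ]
}
```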

Controlling CFS version

The --cfs-version argument specifies which CFS API version to use for the test (default: v3).

(ncn-mw#) Example of testing with CFS API v2:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --cfs-version v2

Valid values are:

  • v2 - Uses CFS API version 2
  • v3 - Uses CFS API version 3 (default)

Important considerations for CFS v2:

  • The global CFS page-size option may be temporarily modified during the test and restored afterwards.
  • For v2, the page-size must be at least equal to --max-sessions to prevent GET request failures.
  • If you specify a --page-size value less than --max-sessions when using v2, the test will fail with an error.

Controlling parallel request limits

Several arguments control the maximum number of parallel requests for different operation types:

Multi-delete requests using --max-multi-delete-reqs (default: 4):

(ncn-mw#) Example of allowing 8 parallel multi-delete requests:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --max-multi-delete-reqs 8

Multi-get requests using --max-multi-get-reqs (default: 4):

(ncn-mw#) Example of allowing 10 parallel multi-get requests:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --max-multi-get-reqs 10

Single-delete requests using --max-single-delete-reqs (default: 4):

(ncn-mw#) Example of allowing 6 parallel single-delete requests:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --max-single-delete-reqs 6

Single-get requests using --max-single-get-reqs (default: 4):

(ncn-mw#) Example of allowing 8 parallel single-get requests:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --max-single-get-reqs 8

Note: All parallel request limits must be at least 1.

Controlling page size

The --page-size argument specifies the page size for CFS API multi-get requests.

Default behavior:

  • For CFS v3: Defaults to 10 × --max-sessions (minimum: 1)
  • For CFS v2: Defaults to the greater of (10 × --max-sessions) or the current global page-size (minimum: equal to --max-sessions)
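
The default calculation can be worked through with a small sketch. The function below only restates the rules listed above and is not taken from the test script; its name and argument order are assumptions.

```shell
# Hypothetical helper: print the default page size.
# $1: CFS API version (v2 or v3)
# $2: --max-sessions value
# $3: current global page-size (only consulted for v2)
default_page_size() {
    local version="$1" max_sessions="$2" current_global="${3:-0}" size
    size=$(( 10 * max_sessions ))
    if [ "$version" = "v2" ] && [ "$current_global" -gt "$size" ]; then
        size="$current_global"
    fi
    echo "$size"
}
```

With the default of 20 sessions, `default_page_size v3 20` prints 200. For v2 with a current global page-size of 1000, `default_page_size v2 20 1000` prints 1000, because the existing global value is the greater of the two.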

(ncn-mw#) Example of specifying a custom page size:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --page-size 100

Important considerations:

  • For CFS v2, if you specify --page-size, it must be at least equal to --max-sessions.
  • For CFS v3, page size can be as low as 1, but should generally be higher for efficiency.
  • When using CFS v2, the global CFS page-size option will be temporarily modified if needed and restored after the test completes.
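
The constraints on an explicitly specified page size can be summarized in a short sketch. The function name is hypothetical, and the script's own validation and error messages may differ.

```shell
# Hypothetical helper: succeed only if an explicit --page-size value is allowed.
# $1: CFS API version (v2 or v3)
# $2: --max-sessions value
# $3: --page-size value
check_page_size() {
    local version="$1" max_sessions="$2" page_size="$3"
    if [ "$page_size" -lt 1 ]; then
        echo "page size must be at least 1" >&2
        return 1
    fi
    if [ "$version" = "v2" ] && [ "$page_size" -lt "$max_sessions" ]; then
        echo "for v2, page size must be at least --max-sessions ($max_sessions)" >&2
        return 1
    fi
    return 0
}
```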

Controlling test script output level

Output is directed both to the console from which the script is run and to a log file that holds more detailed information on the run and any potential problems found. The log file is written to /opt/cray/tests/integration/logs/csm/cmstools/cfs_sessions_rc_test/ with a timestamp-based filename (e.g., 20250101_120000.log). Each test run creates a new log file, preserving the history of previous test executions.

The messages output to the console and the log file may be controlled separately through environment variables. To control the information being sent to the console, set the variable CONSOLE_LOG_LEVEL. To control the information being sent to the log file, set the variable FILE_LOG_LEVEL. Valid values in increasing levels of detail are: CRITICAL, ERROR, WARNING, INFO, DEBUG. The default for the console output is INFO and the default for the log file is DEBUG.

(ncn-mw#) Here is an example of running the script with more information displayed on the console during the execution of the test:

CONSOLE_LOG_LEVEL=DEBUG /opt/cray/tests/integration/csm/cfs_sessions_rc_test
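
The ordering of the log levels determines which messages appear at a given setting. The sketch below assumes conventional level filtering, where a message is emitted if its level is no more detailed than the configured level; the function names are hypothetical and the script's internal behavior is not shown on this page.

```shell
# Hypothetical helper: map a level name to its position in the ordering
# CRITICAL < ERROR < WARNING < INFO < DEBUG (increasing detail).
level_rank() {
    case "$1" in
        CRITICAL) echo 1 ;;
        ERROR)    echo 2 ;;
        WARNING)  echo 3 ;;
        INFO)     echo 4 ;;
        DEBUG)    echo 5 ;;
        *)        return 1 ;;
    esac
}

# Succeed if a message at level $1 would be emitted when the configured
# level is $2. Unknown level names cause a failure.
would_emit() {
    local msg cfg
    msg="$(level_rank "$1")" || return 1
    cfg="$(level_rank "$2")" || return 1
    [ "$msg" -le "$cfg" ]
}
```

Under this assumption, with the default console level of INFO, ERROR messages are shown and DEBUG messages are not.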

Managing pre-existing test sessions

By default, if the test finds any existing CFS sessions with the configured name prefix (default: cfs-race-condition-test) in pending state, it will exit with an error.

The --delete-previous-sessions flag instructs the test to automatically delete any such sessions before proceeding:

(ncn-mw#) Example of automatically cleaning up pre-existing test sessions:

/opt/cray/tests/integration/csm/cfs_sessions_rc_test --delete-previous-sessions

Note: Only sessions in pending state with the specified name prefix will be deleted. Sessions in other states or with different names are not affected.
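
The selection rule in the note above can be sketched as a filter. The input format used here, one line per session containing the name and status separated by whitespace, is an assumption made for illustration and is not the output format of any CFS command; the function name is hypothetical.

```shell
# Hypothetical helper.
# stdin: lines of "<session name> <status>"
# $1: session name prefix
# Prints the names of sessions that start with the prefix and are pending.
select_deletable_sessions() {
    awk -v prefix="$1" 'index($1, prefix) == 1 && $2 == "pending" { print $1 }'
}
```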

Troubleshooting

Test fails with “Pre-existing CFS sessions found”

This error occurs when sessions with the configured name prefix already exist in pending state.

Solutions:

  • Run the test with --delete-previous-sessions to automatically clean them up.
  • Manually delete the sessions using the CFS CLI or API.
  • Use a different --name prefix that doesn’t conflict with existing sessions.

Test fails with page size errors (CFS v2)

This error occurs when the specified --page-size is less than --max-sessions when using CFS v2.

Solutions:

  • Increase --page-size so that it is at least equal to --max-sessions.
  • Decrease --max-sessions to match your desired page size.
  • Omit --page-size to let the test calculate an appropriate value automatically.

Test hangs or takes very long to complete

This can occur if the CFS operator is not properly scaled down or if there are connectivity issues.

Solutions:

  • Check that the cray-cfs-operator deployment can be scaled (sufficient permissions).
  • Verify network connectivity to the CFS API endpoints.
  • Reduce the number of parallel requests and sessions for troubleshooting.
  • Check the detailed log file at /opt/cray/tests/integration/logs/csm/cmstools/cfs_sessions_rc_test/ for more information.

Cleanup fails to restore original state

If the test is interrupted or fails unexpectedly, the CFS operator may remain scaled to 0 replicas.

Solutions:

  • Manually check the current replica count:

    kubectl get deployment -n services cray-cfs-operator
    
  • Manually restore the replica count (typically 1):

    kubectl scale deployment -n services cray-cfs-operator --replicas=1
    
  • If the global page-size was modified (CFS v2 only), it may need to be manually restored using the Cray CLI.

    (ncn-mw#) Example of restoring the default page size to 1000:

    cray cfs v3 options update --default-page-size 1000
    

    Example output:

    additional_inventory_source = ""
    additional_inventory_url = ""
    batch_size = 25
    batch_window = 60
    batcher_check_interval = 10
    batcher_disable = false
    batcher_max_backoff = 3600
    batcher_pending_timeout = 300
    debug_wait_time = 3600
    default_ansible_config = "cfs-default-ansible-cfg"
    default_batcher_retry_policy = 3
    default_page_size = 1000
    default_playbook = "site.yml"
    hardware_sync_interval = 10
    include_ara_links = true
    logging_level = "INFO"
    session_ttl = "7d"