Conversation

@haiyuan-eng-google haiyuan-eng-google commented Jan 29, 2026

Refactor error handling in BigQuery write operations and add timeout for perform_write function.

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

  • Closes: #issue_number
  • Related: #issue_number

2. Or, if no issue exists, describe the change:

Problem:
The BigQuery BatchProcessor worker thread could hang indefinitely during write_client.append_rows() calls under certain failure conditions (e.g., silent connection drops or blocked RPCs). Because the worker waits for this call without a timeout, it stops processing new events. The internal queue eventually fills up (defaulting to 1000 items) and subsequent logs are dropped to prevent memory leaks, leading to a silent cessation of logging despite the application continuing to run. Additionally, a ReferenceError was occasionally observed during interpreter shutdown (_atexit_cleanup) when the batch_processor object had already been garbage collected.

Solution:

  1. Enforce Timeout: Wrapped the write_client.append_rows call within asyncio.wait_for with a 30-second timeout in _write_rows_with_retry (see the sketch after this list).
  2. Robust Error Handling: Updated the exception handling to catch asyncio.TimeoutError. This ensures that if a write hangs, it fails fast, triggers the existing retry mechanism (with backoff), and eventually drops the problematic batch if retries are exhausted, allowing the worker to proceed to the next batch.
  3. Graceful Shutdown: Added a try-except ReferenceError block in _atexit_cleanup to prevent noisy errors during script termination.
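
As a rough sketch of items 1 and 2, assuming an async write_client and a simple fixed retry budget (the names _write_rows_with_retry and perform_write come from this PR; the retry count, backoff, and request shape are illustrative assumptions):

    import asyncio
    import logging

    WRITE_TIMEOUT_SECONDS = 30.0  # hardcoded in this PR; see the review discussion below
    MAX_ATTEMPTS = 3              # assumed retry budget, not taken from the PR

    async def _write_rows_with_retry(write_client, requests):
      async def perform_write():
        # Placeholder for the refactored response handling around append_rows.
        await write_client.append_rows(requests)

      for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
          # Fail fast instead of hanging forever on a silent connection drop.
          await asyncio.wait_for(perform_write(), timeout=WRITE_TIMEOUT_SECONDS)
          return
        except asyncio.TimeoutError:
          logging.warning("BigQuery write timed out (attempt %d/%d)", attempt, MAX_ATTEMPTS)
        if attempt == MAX_ATTEMPTS:
          logging.warning("BigQuery Batch Dropped after %d attempts", attempt)
          return
        await asyncio.sleep(2 ** attempt)  # simple exponential backoff

With this shape, a hung RPC becomes a bounded delay per attempt instead of blocking the worker's queue indefinitely.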

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

I verified the changes using a reproduction script that mocks a hanging BigQuery writer; a sketch of such a mock follows the numbered steps below.

  1. Reproduction: Created a script that mocks append_rows to return a hanging iterator and asyncio.wait_for to simulate a timeout.
  2. Observation: Verified that without the fix, the worker hangs and the queue fills up.
  3. Verification: With the fix, confirmed that:
    • The TimeoutError is raised after the simulated timeout.
    • The retry logic is triggered (logs "BigQuery Batch Dropped after X attempts").
    • The queue continues to drain, ensuring subsequent events are processed.
    • The application does not hang.
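
For reference, a self-contained sketch of this kind of reproduction, assuming a mock client whose append_rows never completes (the class name and timeout values are illustrative, not the actual reproduction script):

    import asyncio

    class MockHangingWriteClient:
      async def append_rows(self, requests):
        # Simulate a silent connection drop: the RPC never returns.
        await asyncio.Event().wait()

    async def main():
      client = MockHangingWriteClient()
      try:
        # With a timeout, the hang surfaces as TimeoutError instead of
        # blocking the worker; a short timeout keeps the reproduction fast.
        await asyncio.wait_for(client.append_rows([]), timeout=1.0)
      except asyncio.TimeoutError:
        print("timed out as expected; the worker can move on to the next batch")

    asyncio.run(main())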

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

This change prevents a critical failure mode where observability data is silently lost due to transient network issues or service hangs.

@gemini-code-assist
Contributor

Summary of Changes

Hello @haiyuan-eng-google, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability and robustness of BigQuery write operations within the analytics plugin. It refactors the existing error handling to be more structured, introduces a crucial timeout mechanism for write calls to prevent indefinite hangs, and improves the application's shutdown process by making the atexit cleanup more resilient to object lifecycle issues.

Highlights

  • Improved BigQuery Write Error Handling: The logic for handling BigQuery append_rows responses has been refactored into a dedicated perform_write asynchronous function, centralizing error processing.
  • Introduced Write Operation Timeout: A 30-second timeout has been added to the perform_write function using asyncio.wait_for, ensuring that BigQuery write operations do not hang indefinitely.
  • Enhanced Retry Mechanism: The retry block for BigQuery write operations now explicitly catches asyncio.TimeoutError, allowing timed-out operations to be retried gracefully.
  • Robust atexit Cleanup: The _atexit_cleanup static method now includes a try-except ReferenceError block when checking the batch_processor object, preventing potential crashes if the object has been garbage collected before cleanup (illustrated in the sketch below).
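
As a rough illustration of that last point, assuming the cleanup hook is registered with a weak proxy to the processor (a dead proxy raises ReferenceError when touched); the registration details are assumptions rather than code from this PR:

    import atexit
    import weakref

    class BatchProcessor:
      def __init__(self):
        self._shutdown = False
        # Register cleanup without keeping the processor alive.
        atexit.register(BatchProcessor._atexit_cleanup, weakref.proxy(self))

      @staticmethod
      def _atexit_cleanup(batch_processor):
        try:
          # Touching a dead weak proxy raises ReferenceError.
          if not batch_processor or batch_processor._shutdown:
            return
        except ReferenceError:
          return
        # ... flush any batches still sitting in the queue ...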


@adk-bot added the services label ([Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc.) on Jan 29, 2026
@adk-bot
Collaborator

adk-bot commented Jan 29, 2026

Response from ADK Triaging Agent

Hello @haiyuan-eng-google, thank you for creating this PR!

To help us review your contribution more effectively, could you please fill out the "Testing Plan" section in your PR description? This is required for all PRs that are not small documentation or typo fixes.

Additionally, could you please link to an existing issue or provide a more detailed description of the change in the PR body, following the structure of our issue templates?

This information will help reviewers to better understand and test your changes. Thanks!

@gemini-code-assist gemini-code-assist bot left a comment
Contributor

Code Review

This pull request refactors the BigQuery write operation to include a timeout and improves the robustness of the _atexit_cleanup function. The changes are generally good, but I have a couple of suggestions to improve configurability and code clarity. Specifically, I recommend making the new write timeout configurable instead of hardcoded, and refactoring a confusing if True: block in the atexit handler.

On the line:

    await asyncio.wait_for(perform_write(), timeout=30.0)
Contributor

Severity: medium

The timeout for perform_write is hardcoded to 30.0 seconds. It's better to make this value configurable to allow adjustments for different environments without changing the code. I recommend adding a write_timeout attribute to the BigQueryLoggerConfig class and using it here, similar to other timeout configurations.

Suggested change:

    - await asyncio.wait_for(perform_write(), timeout=30.0)
    + await asyncio.wait_for(perform_write(), timeout=self.config.write_timeout)
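
For illustration, one way the suggested write_timeout could be surfaced, assuming BigQueryLoggerConfig is (or can become) a dataclass; the field name matches the suggestion above, while the default and shape are assumptions:

    import dataclasses

    @dataclasses.dataclass
    class BigQueryLoggerConfig:
      # ... existing fields ...
      write_timeout: float = 30.0  # seconds; overridable per environment

The perform_write call would then read the timeout from self.config.write_timeout, as in the suggested change above.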

Comment on lines +1639 to +1648
    try:
      # Check if the batch_processor object is still alive
      if batch_processor and not batch_processor._shutdown:
        pass
      else:
        return
    except ReferenceError:
      return

    if True:  # Indentation anchor, logic continues below
Contributor

Severity: medium

The try...except ReferenceError block is a good addition for robustness. However, the if/else with pass can be simplified. More importantly, the if True: on line 1648 is an anti-pattern used as an 'indentation anchor' which harms readability. It should be removed, and the subsequent code block (lines 1650-1694) should be unindented.

Here's a suggestion for a cleaner implementation:

    try:
      # Check if the batch_processor object is still alive and not shut down.
      if not batch_processor or batch_processor._shutdown:
        return
    except ReferenceError:
      # The weak reference is no longer valid, so there's nothing to clean up.
      return

    # Emergency Flush: Rescue any logs remaining in the queue
    # ... (rest of the function, unindented)
