Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Error Design in Production vs Development Environments #828

Open
5 tasks
johnwroge opened this issue Nov 13, 2024 · 5 comments
Open
5 tasks

API Error Design in Production vs Development Environments #828

johnwroge opened this issue Nov 13, 2024 · 5 comments
Assignees
Labels
Complexity: Medium Issue is clearly defined and documentation is provided. Solution has room for interpretation. Feature: Infrastructure Changes to site technical Architecture Feature: Refactoring points: 1 Can be done in 4-6 hours Role: Back End

Comments

@johnwroge
Copy link
Member

johnwroge commented Nov 13, 2024

This is a current action item in Porting Flask to FastAPI #789.

Overview

Currently, API handles errors for production and development environments in the same way. This issue is meant to be a point of discussion to track how the team wants to keep a consistent API response for production and development environments while also bringing up logging as a potential solution to centralize development errors.

Research

The production API should be limiting information sent to the client for security purposes. Sending specific server errors and response codes to the client can open the system to vulnerabilities by exposing detailed information to bad actors. At the same time, using specific error codes with detailed messages for exceptions/errors allows for developers to maintain the existing system more efficiently. While the current MVP system is not completely functional, agreeing on how these errors can be propagated to the client will help mitigate some of the complexity as the backend continues to scale and more developers are added to the team.

One approach, is to distinguish different environments using environment variables while maintaining a consistent error response for the production environment.

Example Error Response for Production:

raise HTTPException(
    status_code=500,
    detail={
        "code": "INTERNAL_SERVER_ERROR",
        "message": "An unexpected error occurred. Please try again later."
    }

This is a basic example of how this would look in the controllers that return specific error messages.

except Exception as e:
    if ENVIRONMENT == "production":
        # Optionally log the error with detailed information on the server side
        logger.error("An error occurred: %s", e, exc_info=True)

        # Send a generic error message to the frontend
        raise HTTPException(
            status_code=500,
            detail={
                "code": "INTERNAL_SERVER_ERROR",
                "message": "An unexpected error occurred. Please try again later."
            }
        )
    else:
        # Optionally log detailed error info for debugging in development
        logger.debug("An error occurred: %s", e, exc_info=True)

        # Send the actual error details to the frontend for easier debugging 
        raise HTTPException(
            status_code=500,
            detail={
                "code": type(e).__name__,
                "message": str(e)
            }
        )

Logging (Optional)

Including a basic logging setup will help centralize server side errors for developers as well. These logs can be visible in development and also accessible through EC2 console. Since the MVP of the application is not complete and the number of users will be limited this may not be necessary at the moment, but would be a nice to have in the future.

Logging Resources

Python Logging How To
Python Logging API
Logging levels

Action Items

  • Introduce environment variable for production, and development environments.
  • Update server endpoints to respond with generic error messages in the case of exceptions in production environment.
  • Review controller responses to verify http status codes and messages are relevant for debugging.

Optional Logging Items

  • Create a global logging module that sets up logging and handles exceptions
  • Integrate logging in exception clauses
@johnwroge johnwroge added the Complexity: Medium Issue is clearly defined and documentation is provided. Solution has room for interpretation. label Nov 13, 2024
@johnwroge johnwroge self-assigned this Nov 13, 2024
@github-project-automation github-project-automation bot moved this to New Issue Approval in P: HUU: Project Board Nov 13, 2024
@paulespinosa
Copy link
Member

@johnwroge This is great! Thanks for documenting how we can approach errors.

For clarification, is the research done here and our decisions focused only on server errors ("HTTP Status 5xx") or will it also cover "domain"/client errors.

For example:

  • technical server error: A KeyError exception is thrown and the 500 Internal Server Errors in the examples.
  • "domain"/client error: "Guest was already invited."

Does FastAPI have a global exception handler that can check for server errors and perform the same environment logic in the example code snippet? Or, would we have to include this code in each request handler?

@johnwroge
Copy link
Member Author

johnwroge commented Nov 14, 2024

Thanks @paulespinosa! Yes, those are all really good points and I should have been more explicit in those cases.

We would be able to use FastAPI's built in exception handler. We can register this globally for all the routes in the main.py file after the other routers.

In main.py

from global_error_module import global_exception_handler

app = FastAPI(lifespan=lifespan)
app.include_router(api_router, prefix="/api")
app.include_router(health_router, prefix="/api/health", tags=["health"])

app.add_exception_handler(Exception, global_exception_handler)

Exception Handler

I think integrating a global exception handler that handles both client and server errors would be ideal because it would centralize how errors are handled. We could use guard clauses to check the environment in this section and then depending on the status code respond with limited information in production or the regular response in development. This message information when the exception is raised could be passed to the exception handler in the endpoints so we don't need to write out the logic above. I'll post an example below.

# Function to get the current environment
def get_environment():
    return os.getenv("ENVIRONMENT", "development")

def configure_logging():
    environment = get_environment()
    log_level = logging.DEBUG if environment == "development" else logging.ERROR
    logging.basicConfig(
        level=log_level,
        format="%(asctime)s [%(levelname)s] %(message)s",
        handlers=[
            logging.StreamHandler()
        ]
    )

configure_logging()

@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):

    environment = get_environment()
    status_code = 500
    error_response = {
        "error": "Internal Server Error"
    }
# we should send a more generic message so it doesn't say anything is wrong. 
# We are unable to respond to your request at this time

    # Handle HTTPException differently based on environment
   if isinstance(exc, HTTPException):
        status_code = exc.status_code
        if environment == "production":
            error_response = {
                "error": "An error occurred. Please try again later."
            }
            logging.error(f"HTTPException occurred: {exc.detail}")
        else:
            error_response = {
                "error": exc.detail
            }
            logging.debug(f"HTTPException occurred: {exc.detail}")
    else:
        # Customize the response based on the environment
        if environment == "production":
            if status_code == 401:
                error_response = {
                    "error": "You are not authorized to perform this action."
                }
            elif status_code == 403:
                error_response = {
                    "error": "You are not allowed to access this resource."
                }
            elif status_code == 404:
                error_response = {
                    "error": "The requested resource was not found."
                }
            elif status_code >= 500:
                error_response = {
                    "error": "An internal server error occurred. Please try again later."
                }
            # Return generic error responses for all other cases in production
            elif 400 <= status_code < 500:
                status_code = 400
                error_response = {
                    "error": "Bad request. Please check your input and try again."
                }
             else # do something else here in case nothing else is invoked. 

        else:
            # Return more detailed error responses for development/staging
            if hasattr(exc, "detail"):
                error_response = {
                    "error": str(exc.detail)
                }
            else:
                error_response = {
                    "error": str(exc)
                }
           logging.debug(f"Exception occurred: {exc}")

    return JSONResponse(status_code=status_code, content=error_response)

There are a few cons in this approach, one being the global exception handler may need to be updated for every case which means this function could become very large and more difficult to manage for error specific handling. The other downside is if we were to implement logs in this file, it could result in latency for the application every time an exception is raised.

Do you think it would be better if we only use the global error handler in certain cases and handle the rest inside of the individual functions? All the other projects I've worked on used a global error handler in all cases, but there may be a better approach.

@lola3736
Copy link
Member

@johnwroge please add the applicable points label for this issue, let me know if you have any questions

@johnwroge johnwroge added the points: 1 Can be done in 4-6 hours label Nov 14, 2024
@lola3736
Copy link
Member

lola3736 commented Nov 14, 2024

@johnwroge thanks for adding the points label. Confirming that you are currently working on this issue and if so, can move to "in progress". Do you need to have this issue reviewed and approved by Dev lead?

@johnwroge johnwroge removed the points: 1 Can be done in 4-6 hours label Nov 14, 2024
@johnwroge
Copy link
Member Author

Thanks @lola3736. I moved this to in progress. I wrote this for documentation purposes and I will update the points once we know more detailed action items and how much time it will take to implement.

@johnwroge johnwroge added the points: 1 Can be done in 4-6 hours label Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Complexity: Medium Issue is clearly defined and documentation is provided. Solution has room for interpretation. Feature: Infrastructure Changes to site technical Architecture Feature: Refactoring points: 1 Can be done in 4-6 hours Role: Back End
Projects
Status: In Progress
Development

No branches or pull requests

3 participants