Paper: PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Problem

Large language model (LLM) agents are being deployed on increasingly complex, real-world tasks. These tasks often involve many tools: for example, navigating a retail environment requires various APIs or functions to find products, manage orders, and track shipments. Existing benchmarks have not adequately tested whether these agents can plan over long sequences of tool calls, especially when they have limited visibility into which tools are available and reliable at any given moment.

Paper: SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Problem

Developing effective skills for AI agents – those specific instructions or knowledge bases that guide them in performing tasks – is currently a difficult and inconsistent process. Existing methods involve manually crafting skills, generating them once (“one-shot”), or allowing skills to evolve through unpredictable self-revision. These approaches lack the rigor of deep learning optimization and often fail to produce consistently improved skills over time.

Tech Brief: AI Agent Adoption Accelerates: Marketing, Infrastructure, and Robustness Drive Investment

Image: The latest AI news we announced in May 2026 — Google AI Blog

Overview

This week’s tech news focuses on the intersection of AI and business operations, particularly in marketing and backend development. We’re seeing increased adoption of, and anxiety around, AI detection, alongside significant investment in AI infrastructure and application frameworks. A recurring theme is how organizations adapt to evolving technologies while navigating challenges like security breaches and shifting regulatory landscapes. Finally, distributed systems continue to evolve, as shown by both incident retrospectives and new tools designed for robustness and scalability.

Tech Brief: Agentic AI Emerges: New Architectures Demand Rethinking Evaluation and Risk Mitigation

Image: EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments — Apple ML Research

Overview

This week’s headlines showcase a complex and evolving landscape for data scientists and ML engineers. We’re seeing continued debates around autonomous systems (Tesla’s Autopilot), growing scrutiny over corporate responsibility in the face of public safety concerns (Uber lawsuits), and increasingly sophisticated AI architectures pushing the boundaries of agentic AI (“loopy” agents). Alongside these developments are tangible impacts on infrastructure costs, hardware limitations, and emerging security threats. OpenAI continues its flurry of product releases aimed at bolstering enterprise cybersecurity while also aiding broader innovation through initiatives like Patch the Planet.

Tech Brief: AI Regulation Tightens as Apple Embeds Generative Models Within iOS

Image: NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance — NVIDIA Developer Blog

Overview

This week’s tech news highlights the accelerating integration of AI across various sectors, alongside continuing concerns about ethical practices and security vulnerabilities. Apple’s iOS 27 features are generating significant buzz with on-device generative AI capabilities. We’re seeing increasing adoption of LLMs internally within companies like Anthropic and Atlassian to streamline operations. The landscape is also shaped by external pressures: government oversight of AI development, legal battles over emerging transportation technologies, and ongoing debates about responsible data usage in areas like advertising and healthcare.

Paper: Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Problem

Large Language Models (LLMs) are known to harbor biases, but these biases are tricky to pin down due to the random nature of how they generate text. Traditional methods for checking LLM fairness often just look at a single output or use automated metrics that don’t reveal the full picture—they miss biases lurking in less common generation pathways.

Method

The paper introduces “TreeTracer,” a visual analytics tool designed to tackle this issue. Here’s how it works:

Tech Brief: Data Governance Tensions Rise as Anthropic’s Reversal Highlights AI Control Challenges

Image: Temporary Cloudflare Accounts for AI agents — Cloudflare Blog

Overview

This week’s tech news is layered with cautious reflections on AI, coupled with intriguing developments in hardware innovation and platform updates. There’s a growing tension around data sharing for AI training, particularly highlighted by Anthropic’s recent requirements for Claude Fable 5 users on Bedrock, while OpenAI continues to improve its models with an eye toward practical enterprise use cases and addressing critical needs within healthcare. Finally, we see continued discussions about efficiency and developer experience—from monorepo migrations at Block to architectural improvements in Atlassian’s Forge platform—a clear signal that even with AI dominating headlines, core engineering challenges remain paramount.

Tech Brief: AI Regulation Tightens as Robotics, Agents Drive Data & Infrastructure Shifts

Image: How A2A is Building a World of Collaborative Agents — Google Developers Blog

Overview

This week’s headlines highlight the ongoing intersection of robotics, cybersecurity regulations, and the evolving landscape of applied AI. The rise of hardware control via software infrastructure (like Kyber), combined with complex regulatory pressures surrounding AI development and deployment, creates a tricky environment for practitioners. Meanwhile, we’re seeing significant investment in physical-world applications, from robotaxis leveraging Japan’s IPO boom to advancements in fusion energy, and a continued refinement of user experience, as demonstrated by e-ink displays and specialized audio players. Finally, the rapid progress in AI agent development showcased through OpenAI’s work is driving shifts in tooling, data analysis, and potentially code generation workflows.

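Big Data Analysis Engineer (빅데이터 분석기사) Practical Exam (Python)

Course Overview

A hands-on, Python-based course that prepares you thoroughly for the Big Data Analysis Engineer practical exam. You work through the full workflow, from data analysis to modeling and evaluation, under the same conditions as the actual exam environment.

Course Information

Learning Objectives

  • Master all three task types of the Big Data Analysis Engineer practical exam
  • Analyze data with Python libraries (Pandas, NumPy, Scikit-learn)

Curriculum

Stage 1: Task Type 1 - Data Preprocessing

  • Reading and exploring data
  • Handling missing values
  • Detecting and handling outliers
  • Data transformation and encoding
  • Grouping and aggregation

Stage 2: Task Type 2 - Machine Learning Modeling

  • Classification models (logistic regression, decision tree, random forest, etc.)
  • Regression models (linear regression, Ridge, Lasso, etc.)
  • Cross-validation and hyperparameter tuning
  • Model evaluation metrics (accuracy, F1-score, ROC-AUC, RMSE, etc.)
  • Submission format for prediction results

Stage 3: Task Type 3 - Statistical Analysis

  • Descriptive statistics
  • Hypothesis testing
  • Correlation and regression analysis
  • Interpreting statistical significance

Lab Environment

  • Language: Python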

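ADsP Regression Analysis: Interaction Example

Loading the libraries

  • reshape2 → reshaping data (wide↔long); includes the tips dataset
  • ggplot2 → visualization (scatter plots, regression lines, interaction plots)
  • lmtest → tests of regression assumptions (homoscedasticity, independence, etc.)
  • car → multicollinearity checks (VIF), regression diagnostics
  • broom → tidies regression results into clean data frames
  • emmeans → statistical tests of interaction effects and simple slopes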
library(reshape2)
library(ggplot2)
library(lmtest)
library(car)
library(broom)
library(emmeans)

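Loading the Tips data

  • Description: sample data on tipping collected at a US restaurant
  • Number of observations: 244

| Variable   | Type       | Description                         |
|------------|------------|-------------------------------------|
| total_bill | numeric    | Total bill amount (USD)             |
| tip        | numeric    | Tip amount (USD)                    |
| sex        | factor (2) | Sex: Female / Male                  |
| smoker     | factor (2) | Smoker: No / Yes                    |
| day        | factor (4) | Day of week: Fri / Sat / Sun / Thur |
| time       | factor (2) | Meal time: Dinner / Lunch           |
| size       | integer    | Party size                          |
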
data("tips")                 
str(tips)
## 'data.frame':    244 obs. of  7 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...

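Building the interaction model

  • First, fit a model that includes the interaction. In R's formula syntax, total_bill * sex expands to total_bill + sex + total_bill:sex, so the model has both main effects and the interaction term.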
m1 <- lm(tip ~ total_bill * sex, data = tips)
summary(m1)
## 
## Call:
## lm(formula = tip ~ total_bill * sex, data = tips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2232 -0.5660 -0.0977  0.4796  3.6675 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.048020   0.272498   3.846 0.000154 ***
## total_bill          0.098878   0.013808   7.161 9.75e-12 ***
## sexMale            -0.195872   0.338954  -0.578 0.563892    
## total_bill:sexMale  0.008983   0.016417   0.547 0.584778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 240 degrees of freedom
## Multiple R-squared:  0.4574, Adjusted R-squared:  0.4506 
## F-statistic: 67.43 on 3 and 240 DF,  p-value: < 2.2e-16
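
Written as an equation, the model fitted by the call above can be sketched as follows. This assumes R's default treatment coding for the factor sex with Female as the reference level, which is what the coefficient name sexMale indicates. The symbols \(\beta_0, \dots, \beta_3\), \(D_i\), and \(\varepsilon_i\) are notation introduced here for illustration and do not appear in the original.

```latex
\[
\text{tip}_i = \beta_0 + \beta_1\,\text{total\_bill}_i + \beta_2\,D_i
             + \beta_3\,(\text{total\_bill}_i \times D_i) + \varepsilon_i,
\qquad
D_i =
\begin{cases}
1 & \text{if } \text{sex}_i = \text{Male} \\
0 & \text{if } \text{sex}_i = \text{Female}
\end{cases}
\]
% Slope of tip on total_bill within each group:
\[
\text{Female: } \beta_1, \qquad \text{Male: } \beta_1 + \beta_3
\]
```

Under this coding, the four rows of the coefficient table map to \(\beta_0\) ((Intercept)), \(\beta_1\) (total_bill), \(\beta_2\) (sexMale), and \(\beta_3\) (total_bill:sexMale), so a test of \(\beta_3 = 0\) is a test of whether the two groups share the same slope.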

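Interpreting the coefficients

  • The coefficients are interpreted as follows.

| Coefficient        | Estimate | Std. Error | p-value | Interpretation |
|--------------------|----------|------------|---------|----------------|
| total_bill         | 0.0989   | 0.0138     | <0.001  | In the female group, each additional $1 of total bill is associated with about $0.099 more tip |
| sexMale            | -0.1959  | 0.3390     | 0.564   | At total_bill = 0, males tip $0.196 less than females on average; not statistically significant |
| total_bill:sexMale | 0.0090   | 0.0164     | 0.585   | The male slope is 0.009 larger than the female slope; not statistically significant |

  • Guidance for reading the table above:
    • (Intercept) : the mean tip for the reference group (Female) when total_bill = 0. It serves as a baseline rather than a quantity with a practical interpretation.
    • total_bill : for the Female group, the average increase in tip per $1 increase in total bill. Here the increase is $0.099, which is significant (p<0.001).
    • sexMale : the difference between male and female tips when the total bill is 0. Here males are $0.196 lower than females, but the difference is not significant.
    • total_bill:sexMale : the difference by sex in the slope of tip on total bill (the interaction). The male slope is slightly (0.009) higher than the female slope, but the difference is not statistically significant.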
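
To make the interpretation concrete, the estimates from the summary output can be plugged into one fitted line per group. The following is a worked sketch using coefficients rounded to four decimals; the $20 bill is a hypothetical value chosen only for illustration.

```latex
\begin{align*}
\text{Female:}\quad \widehat{\text{tip}} &= 1.0480 + 0.0989 \cdot \text{total\_bill} \\
\text{Male:}\quad   \widehat{\text{tip}} &= (1.0480 - 0.1959) + (0.0989 + 0.0090) \cdot \text{total\_bill} \\
                                         &= 0.8521 + 0.1079 \cdot \text{total\_bill}
\end{align*}
% Worked example at a hypothetical total_bill of 20 dollars:
\begin{align*}
\text{Female:}\quad & 1.0480 + 0.0989 \times 20 \approx 3.03 \\
\text{Male:}\quad   & 0.8521 + 0.1079 \times 20 \approx 3.01
\end{align*}
```

At this bill size the two predicted tips differ by only about $0.02, which is consistent with neither the sexMale term nor the interaction term being statistically significant.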

Visualizing the model

Observed points + per-group loess/linear fit (simple)

  • The plotting code is as follows.
ggplot(tips, aes(x = total_bill, y = tip, color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Interaction: total_bill × sex",
       x = "Total bill", y = "Tip", color = "Sex") +
  theme_minimal(base_size = 13)
## `geom_smooth()` using formula = 'y ~ x'
