A selection of recent work across data engineering, AI-powered document automation, and cloud-based ETL development.
1. Inspection Report Automation (OCR + Structured Data Extraction)
Industry: Property / Facility Management
Documents: Building inspection reports, routine inspections, defect reports
Technologies: Azure Document Intelligence, OCR, Layout analysis, Python, Azure OpenAI
Overview
Developed an end-to-end workflow to convert inspection report PDFs into structured JSON/Excel outputs.
The workflow performs OCR, layout parsing, table extraction, field normalisation, defect identification, and summary generation.
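A minimal sketch of the layout/table-extraction step, shown here with the azure-ai-formrecognizer Python client for Azure Document Intelligence; the endpoint, key, and the assumption that the first table row is a header are illustrative placeholders rather than the production configuration.

```python
"""Sketch: inspection PDF -> layout analysis -> table rows -> normalised JSON."""
import json
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

ENDPOINT = "https://<resource>.cognitiveservices.azure.com/"  # placeholder
KEY = "<api-key>"                                             # placeholder

client = DocumentAnalysisClient(ENDPOINT, AzureKeyCredential(KEY))

with open("inspection_report.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

records = []
for table in result.tables:
    # Rebuild each detected table as {row_index: {column_index: text}}.
    rows = {}
    for cell in table.cells:
        rows.setdefault(cell.row_index, {})[cell.column_index] = cell.content.strip()
    # Simplifying assumption: the first row of each table is its header.
    header = rows.pop(0, {})
    for row in rows.values():
        records.append({header.get(i, f"col_{i}"): v for i, v in row.items()})

print(json.dumps(records, indent=2, ensure_ascii=False))
```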
Outcome / Impact
- Reduced manual review time from 20–40 minutes to 1–2 minutes per report
- Achieved 83–92% extraction accuracy after tuning against client formats
- Enabled automatic defect summaries and action lists
- Integrated 10,000+ historical reports into a searchable knowledge base
2. Contract & Policy Document Extraction / Classification
Industry: Corporate / Legal / Real Estate
Documents: Tenancy agreements, supplier contracts, internal policies, compliance documents
Technologies: OCR, Azure OpenAI embeddings, Vector Search, Python
Overview
Built a document ingestion pipeline that extracts key clauses, dates, obligations, risks, and renewal conditions from contract PDFs.
Classification and metadata tagging were added for large document archives.
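A simplified sketch of the semantic-retrieval step, assuming an Azure OpenAI embedding deployment named text-embedding-3-small; the endpoint, key, and clause texts are illustrative placeholders, and the full pipeline also handled extraction, tagging, and schema output.

```python
"""Sketch: embed extracted clauses and retrieve the most similar one for a query."""
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com/",  # placeholder
    api_key="<api-key>",                                    # placeholder
    api_version="2024-02-01",
)

clauses = [
    "The agreement renews automatically for 12 months unless terminated in writing.",
    "Either party may terminate with 90 days' written notice.",
    "The supplier indemnifies the client against third-party IP claims.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

clause_vecs = embed(clauses)
query_vec = embed(["What are the renewal conditions?"])[0]

# Cosine similarity against every stored clause; the highest score wins.
scores = clause_vecs @ query_vec / (
    np.linalg.norm(clause_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(clauses[int(scores.argmax())])
```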
Outcome / Impact
- Automated extraction of 20–40 key fields per contract
- Implemented standard JSON schema for downstream systems
- Achieved consistent clause identification across multiple contract formats
- Enabled instant retrieval of similar clauses using semantic search
3. Routine Property Inspection Summary Generator
Industry: Real Estate Property Management
Documents: Routine inspection reports (property condition, photos, notes)
Technologies: OCR, Python, Image–text validation, Summarisation models
Overview
Developed a system that combines text from inspectors’ notes with room-by-room photos.
The workflow detects inconsistencies, extracts condition ratings, and generates a structured summary.
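A rough illustration of the rating-normalisation and inconsistency checks; the rating scale, defect keywords, and field names below are hypothetical stand-ins for the client-specific rules.

```python
"""Sketch: normalise room condition ratings and flag note/rating mismatches."""
from dataclasses import dataclass

RATING_SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "damaged": 1}
DEFECT_WORDS = {"crack", "leak", "mould", "damage", "broken", "stain"}

@dataclass
class RoomEntry:
    room: str
    rating_text: str   # inspector's free-text rating
    notes: str         # OCR'd inspector notes
    photo_count: int

def summarise(entries):
    summary = []
    for e in entries:
        rating = RATING_SCALE.get(e.rating_text.lower().strip())
        has_defect_note = any(w in e.notes.lower() for w in DEFECT_WORDS)
        flags = []
        if has_defect_note and (rating or 0) >= 4:
            flags.append("rating/notes mismatch")   # notes mention a defect but rating is high
        if e.photo_count == 0:
            flags.append("no photos attached")
        summary.append({"room": e.room, "rating": rating, "flags": flags})
    return summary

print(summarise([RoomEntry("Kitchen", "Good", "small crack above the sink", 2)]))
```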
Outcome / Impact
- Standardised reporting quality across inspectors
- Automated creation of summary sheets for owners/landlords
- Reduced manual summarisation time by 70–90%
- Improved consistency in property condition evaluations
4. ETL Pipeline Development for a National Telecom Operator (OPTUS)
Industry: Telecommunications
Technologies: Databricks, PySpark, Airflow, SQL, Delta Lake, AWS
Overview
Participated in the development of large-scale ETL pipelines for nationwide operational data.
Responsibilities included pipeline design, job optimisation, data quality monitoring, and Airflow orchestration.
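A condensed, Databricks-style sketch of one such job; source paths, column names, and the 5% null threshold are placeholders, and Delta Lake support is assumed on the cluster.

```python
"""Sketch: PySpark job with a simple data-quality gate before a Delta write."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage_etl").getOrCreate()

raw = spark.read.json("s3://<bucket>/landing/usage/")   # placeholder path

cleaned = (
    raw.dropDuplicates(["record_id"])
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("customer_id").isNotNull())
)

# Data-quality gate: abort before writing if too many source rows lack the key.
null_ratio = raw.filter(F.col("customer_id").isNull()).count() / max(raw.count(), 1)
if null_ratio > 0.05:
    raise ValueError(f"customer_id null ratio too high: {null_ratio:.2%}")

(cleaned.write.format("delta")
        .mode("overwrite")
        .partitionBy("event_date")
        .save("s3://<bucket>/curated/usage/"))          # placeholder path
```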
Outcome / Impact
- Improved job reliability and reduced processing time
- Implemented robust data quality checks
- Handled multi-terabyte distributed processing
- Ensured production-grade deployment and documentation
5. Supplier Data Warehouse for a Trading & Manufacturing Group (Marubeni)
Industry: Trading / Manufacturing
Technologies: AWS (Lambda, DynamoDB, S3, Glue), Python, SQL
Overview
Designed and implemented an ETL pipeline migrating supplier data from DynamoDB into a structured warehouse schema.
Developed Python modules for data transformation and schema alignment.
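A stripped-down sketch of the DynamoDB extraction and flattening step; table and column names are placeholders, the parquet write assumes pandas with pyarrow installed, and the production pipeline staged the output to S3 via Glue.

```python
"""Sketch: paginate a DynamoDB table and flatten items for warehouse loading."""
import boto3
import pandas as pd

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("suppliers")          # placeholder table name

items, kwargs = [], {}
while True:
    page = table.scan(**kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Align nested items to the dimension-table columns of the warehouse schema.
df = pd.json_normalize(items)[["supplier_id", "name", "country", "updated_at"]]
df.to_parquet("dim_supplier.parquet", index=False)
```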
Outcome / Impact
- Automated daily ingestion and transformation
- Designed star schema for analytics
- Enabled downstream Power BI dashboards
- Improved data consistency across business units
6. BI Integration for ServiceNow and External SaaS
Industry: IT Service / CRM
Technologies: ServiceNow, API Integration, Azure SQL, Data Factory, Python
Overview
Built real-time integrations between ServiceNow and external SaaS platforms.
Performed data cleansing, transformation, and modelling for BI reporting.
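A minimal sketch of the ServiceNow side of the integration, paginating the standard Table API with the requests library; the instance URL, credentials, and field list are placeholders, and loading into Azure SQL happens downstream.

```python
"""Sketch: pull incident records from the ServiceNow Table API in pages."""
import requests

BASE = "https://<instance>.service-now.com/api/now/table/incident"  # placeholder
AUTH = ("<user>", "<password>")                                      # placeholder

def fetch_incidents(batch=500):
    offset, rows = 0, []
    while True:
        resp = requests.get(
            BASE,
            auth=AUTH,
            headers={"Accept": "application/json"},
            params={
                "sysparm_limit": batch,
                "sysparm_offset": offset,
                "sysparm_fields": "number,short_description,state,sys_updated_on",
            },
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()["result"]
        rows.extend(page)
        if len(page) < batch:
            return rows
        offset += batch

incidents = fetch_incidents()   # downstream: cleanse, model, load into Azure SQL
```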
Outcome / Impact
- Reduced manual reconciliation work
- Delivered real-time operational dashboards
- Improved reliability of cross-system data
7. Large-Scale Web Scraping & Distributed Processing (DataSection)
Industry: E-commerce / Market Intelligence
Technologies: Python, Java, AWS EC2, Redis, S3, SQS, Distributed crawling
Overview
Designed and built a distributed web scraping architecture that collects data on millions of products.
Implemented proxy pools, retry logic, failover mechanisms, and automated pipelines.
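A pared-down sketch of the retry and proxy-rotation logic; proxy addresses, timeouts, and backoff parameters are illustrative only.

```python
"""Sketch: fetch with proxy rotation and exponential-backoff retries."""
import random
import time
import requests

PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080"]   # placeholder pool

def fetch(url, max_retries=4):
    for attempt in range(max_retries):
        proxy = random.choice(PROXIES)
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass  # fall through to the backoff and retry below
        # Exponential backoff with jitter before rotating to another proxy.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```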
Outcome / Impact
- Scaled to millions of records per day
- Reduced scraping failure rate with smart retry logic
- Achieved stable and cost-efficient distributed crawling
8. ML-Based Document Search (Azure + LangChain)
Industry: Professional Services
Technologies: Azure OpenAI, Azure Search, Python, LangChain
Overview
Developed a hybrid search system that combines embeddings-based retrieval, keyword fallback, and metadata filtering.
Used for document discovery across multiple departments.
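A simplified illustration of one way to merge keyword and embedding result lists (reciprocal rank fusion); the deployed system ran on Azure Search with metadata filtering and keyword fallback, so this sketch shows the blending idea rather than the production code.

```python
"""Sketch: merge keyword and vector result lists with reciprocal rank fusion."""
def reciprocal_rank_fusion(result_lists, k=60):
    """Each result list is document ids in ranked order; higher fused score = better."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_12", "doc_07", "doc_31"]   # from the keyword index
vector_hits  = ["doc_07", "doc_44", "doc_12"]   # from the embedding index
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```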
Outcome / Impact
- Improved search precision compared to keyword-only systems
- Reduced time spent locating relevant documents
- Supported bilingual content (English/Japanese)
Core Technical Capabilities
- OCR / layout analysis
- Document classification and extraction
- LLM-based summarisation and validation
- Databricks / PySpark ETL
- Airflow / orchestration
- AWS / Azure cloud pipelines
- API integrations
- Data modelling and warehouse design
Contact
For detailed case studies or sample outputs, please reach out:
📧 nueki@manoriworks.com
🌐 AI-Powered Document Automation: https://manoriworks.com/en