Converting HTML to plain text in PHP for e-mail

IT박스

itboxs 2020. 10. 14. 07:38

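I use TinyMCE to allow minimal text formatting within my site. I would like to convert the HTML it produces to plain text for e-mail. I have been using a class called html2text, but it is really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain-text formatting, such as putting underscores around text that previously had <i> tags in the HTML.

Does anyone use a similar approach to converting HTML to plain text in PHP? If so, do you recommend any third-party classes that I can use? Or how do you best tackle this issue?

Use html2text (example HTML to text), which is licensed under the Eclipse Public License. It uses PHP's DOM methods to load the HTML and then iterates over the resulting DOM to extract plain text. Usage: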

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

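It is incomplete, but it is open source and contributions are welcome.

Issues with other conversion scripts:

html2text (GPL) is not EPL-compatible.
lkessler's link (attribution) is incompatible with most open source licenses.

Here is another solution: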

$cleaner_input = strip_tags($text);

For other variations of sanitization functions, see:

https://github.com/tazotodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

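Converting from HTML to text using DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5.

Regarding UTF-8, the write-up on the "howto" page states:

PHP's own support for Unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses Unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand Unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are submitting a valid character to html2text, you may well not be.

The author provides several approaches to solving this and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.

Note the restrictions on commercial use.

There is the trusty strip_tags function. It is not pretty, though: it only sanitizes. You could combine it with a string replace to get your fancy underscores.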


<?php
// Strip all tags and wrap italics with underscores
$plain = strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// Preserve anchors: hide "<a" from strip_tags, then restore it afterwards
// (closing </a> tags are still stripped)
$withAnchors = str_replace("|a", "<a", strip_tags(str_replace("<a", "|a", $text)));

?>
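
The plain str_replace above only matches bare <i> tags, so markup that TinyMCE emits with attributes, or as <em>, would slip through. A minimal sketch of the same idea with regular expressions follows; the function name markupToPlain and the choice of markers are assumptions for illustration, not part of any library.

```php
<?php
// Hypothetical helper: map a few inline tags to plain-text markers,
// then strip whatever markup is left.
function markupToPlain(string $html): string
{
    $map = [
        '~</?(?:i|em)\b[^>]*>~i'     => '_',    // italics -> _text_
        '~</?(?:b|strong)\b[^>]*>~i' => '*',    // bold    -> *text*
        '~<br\b[^>]*>~i'             => "\n",   // line break
        '~</p\s*>~i'                 => "\n\n", // paragraph end
    ];
    $text = preg_replace(array_keys($map), array_values($map), $html);
    $text = strip_tags($text);
    // Decode entities only after stripping, so escaped text cannot turn into tags
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($text);
}

echo markupToPlain('<p>Hello <em class="x">world</em>,<br/>this is <b>bold</b> &amp; plain.</p>');
```

The word boundary (\b) after each tag name keeps <b> from matching <br> or <body>.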

You can use lynx with the -stdin and -dump options to achieve this:

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}

You can test this function:

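function html2text($Document) {
    $Rules = array ('@<script[^>]*?>.*?</script>@si', // strip scripts
                    '@<[\/\!]*?[^<>]*?>@si',          // strip tags
                    '@([\r\n])[\s]+@',                // collapse whitespace after a line break
                    '@&(quot|#34);@i',
                    '@&(amp|#38);@i',
                    '@&(lt|#60);@i',
                    '@&(gt|#62);@i',
                    '@&(nbsp|#160);@i',
                    '@&(iexcl|#161);@i',
                    '@&(cent|#162);@i',
                    '@&(pound|#163);@i',
                    '@&(copy|#169);@i',
                    '@&(reg|#174);@i'
             );
    $Replace = array ('',
                      '',
                      '\1',
                      '"',
                      '&',
                      '<',
                      '>',
                      ' ',
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      chr(174)
                );
    $Document = preg_replace($Rules, $Replace, $Document);
    // The /e modifier was removed in PHP 7, so numeric entities use a callback.
    // Note: chr() yields single bytes (ISO-8859-1), not UTF-8 sequences.
    return preg_replace_callback('@&#(\d+);@', function ($m) {
        return chr((int) $m[1]);
    }, $Document);
}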

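I could not find an existing solution suited to converting simple HTML e-mails into simple plain text files.

So I opened up this repository; I hope it helps someone. It is MIT licensed, by the way :)

https://github.com/RobQuistNL/SimpleHtmlToText

Example: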

$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);

Returns:

**This is HTML**
### Header ###


Newlines

If you want to convert the HTML special characters rather than just remove them, as well as strip things down and prepare plain text, this was the solution that worked for me...

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';

echo htmlToPlainText($string);
// this is ( ) a test
// Yes this is! & does it get "processed"?

html_entity_decode w/ ENT_QUOTES | ENT_XML1 converts things like &#39;, htmlspecialchars_decode converts things like &amp;, html_entity_decode converts things like &lt;, and strip_tags removes any HTML tags left over.
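
One detail worth checking with this approach is the order of operations. The sketch below (sample string and variable names are made up for illustration) suggests that decoding entities before calling strip_tags can turn deliberately escaped text into real tags, which are then removed, whereas stripping first and decoding once keeps it.

```php
<?php
// Literal text "<b>" that the author escaped on purpose
$html = '<p>Use the &lt;b&gt; tag for bold &amp; strong text.</p>';

// Decode first, then strip: the escaped "<b>" becomes a real tag and is removed
$decodeFirst = strip_tags(html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8'));

// Strip first, then decode once: the literal "<b>" survives
$stripFirst = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decodeFirst, "\n";
echo $stripFirst, "\n";
```

Which order is right depends on the input: for editor-generated HTML where escaped angle brackets are meant as text, stripping first is probably the safer default.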

Markdownify converts HTML to Markdown, a plain-text formatting system used on this very site.

Markdownify worked wonderfully for me! One thing worth mentioning: it supports UTF-8 perfectly, which was the main reason I was looking for a solution other than html2text (mentioned earlier in this thread).

I ran into the same problem as the OP, and the solutions from the top answers above did not work for my scenarios. See why at the end.

Instead, I found this helpful script, available under the GPL. To avoid confusion, let's call it html2text_roundcube:

https://github.com/mtibben/html2text

It's actually an updated version of an already mentioned script - http://www.chuggnutt.com/html2text.php - updated by RoundCube mail.

Usage:

$h2t = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
echo $h2t->getText(); // prints Hello, "WORLD"

Why html2text_roundcube proved better than the others:

The script at http://www.chuggnutt.com/html2text.php did not work out of the box for cases with special HTML codes/names (e.g. &auml;) or unpaired quotes (e.g. 25" Monitor).
The script at https://github.com/soundasleep/html2text had no option to hide or group the links at the end of the text, which makes a typical HTML page look bloated with links in plain-text form. Customizing how the transformation is done is also not as straightforward as simply editing an array, as in html2text_roundcube.
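
To illustrate what "grouping the links at the end" can look like, here is a minimal sketch using only DOMDocument. It is not how html2text_roundcube implements it; the function name linksAsFootnotes and the [n] footnote format are assumptions for illustration.

```php
<?php
// Hypothetical helper: replace each <a> with "text [n]" and list the URLs at the end.
function linksAsFootnotes(string $html): string
{
    $dom = new DOMDocument();
    // The XML declaration is a common trick to make loadHTML treat the input as UTF-8;
    // "@" suppresses warnings about imperfect markup
    @$dom->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');

    $urls = [];
    // Copy the live node list first, because nodes are replaced while iterating
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') {
            continue; // named anchors without a target stay as plain text
        }
        $urls[] = $href;
        $label = $a->textContent . ' [' . count($urls) . ']';
        $a->parentNode->replaceChild($dom->createTextNode($label), $a);
    }

    $text = trim($dom->getElementsByTagName('body')->item(0)->textContent);
    foreach ($urls as $i => $url) {
        $text .= "\n[" . ($i + 1) . '] ' . $url;
    }
    return $text;
}

echo linksAsFootnotes('See <a href="https://example.com">docs</a> and <a href="https://example.org/x">more</a>.');
```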

function plainText($text)
{
    $text = strip_tags($text, '<br><p><li>');
    $text = preg_replace ('/<[^>]*>/', PHP_EOL, $text);

    return $text;
}

$text = "string 1<br>string 2<br><ul><li>string 3</li><li>string 4</li></ul>string 5";

echo plainText($text);

Output (blank lines omitted):
string 1
string 2
string 3
string 4
string 5

I just found the PHP function strip_tags(), and it works in my case.

I tried to convert the following HTML :

<p><span style="font-family: 'Verdana','sans-serif'; color: black; font-size: 7.5pt;">&nbsp;</span>Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry's lackluster performance during this time,  revenue has grown at an average annual rate&nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&nbsp; So despite the downturn, how were we  able to manage growth as an industry?</p>

After applying strip_tags() function, I have got the following output :

&amp;nbsp;Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&amp;nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry&#039;s lackluster performance during this time,  revenue has grown at an average annual rate&amp;nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&amp;nbsp; So despite the downturn, how were we  able to manage growth as an industry?

If you want to remove the tags but keep the content inside them, you can use DOMDocument and extract the textContent of the root node like this:

function html2text($html) {
    $dom = new DOMDocument();
    $dom->loadHTML("<body>" . strip_tags($html, '<b><a><i><div><span><p>') . "</body>");
    $xpath = new DOMXPath($dom);
    $node = $xpath->query('body')->item(0);
    return $node->textContent; // text
}

$p = 'this is <b>test</b>. <p>how are <i>you?</i>. <a href="#">I\'m fine!</a></p>';
print html2text($p);
// this is test. how are you?. I'm fine!

One advantage of this approach is that it does not require any external packages.

For UTF-8 text, mb_convert_encoding worked for me. To process everything regardless of errors, prefix the call with the "@" error-suppression operator.

The basic code I use is:

$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent;

If you want something more advanced, you can walk the nodes one by one, but you will run into many problems with whitespace.
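
As a rough sketch of such a node-by-node walk (not the code of any of the libraries mentioned here), the example below recurses through the DOM, collapses whitespace inside text nodes, and adds line breaks after a small, assumed set of block elements. The function names and the tag lists are illustrative choices. It uses the XML-declaration trick for UTF-8 instead of mb_convert_encoding with 'HTML-ENTITIES', which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical sketch: recursive DOM walk with basic whitespace handling.
function nodeToText(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse runs of whitespace (including source newlines) to one space
        return preg_replace('/\s+/u', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // skip comments, processing instructions, etc.
    }
    $name = strtolower($node->nodeName);
    if ($name === 'script' || $name === 'style') {
        return '';
    }
    if ($name === 'br') {
        return "\n";
    }
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= nodeToText($child);
    }
    if ($name === 'i' || $name === 'em') {
        $out = '_' . $out . '_'; // italics -> _text_
    }
    // Block-level elements end with a blank line
    if (in_array($name, ['p', 'div', 'h1', 'h2', 'h3', 'li'], true)) {
        $out = trim($out) . "\n\n";
    }
    return $out;
}

function domHtmlToText(string $html): string
{
    $dom = new DOMDocument();
    // The XML declaration hints UTF-8 to libxml; "@" hides markup warnings
    @$dom->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    $text = nodeToText($dom->getElementsByTagName('body')->item(0));
    // Tidy up: drop spaces around newlines and cap consecutive blank lines
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}

echo domHtmlToText("<p>Héllo   <em>world</em></p>\n<p>Second<br>line</p>");
```

Most of the whitespace trouble comes from text nodes that contain only source formatting, which is why the cleanup happens both per text node and once more on the final string.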

I have implemented a converter based on what I describe here. If you are interested, you can download it from https://github.com/kranemora/html2text

It may serve as a reference for writing your own.

You can use it like this:

$html = <<<EOF
<p>Welcome to <strong>html2text</strong></p>
<p>It's <em>works</em> for you?</p>
EOF;

$html2Text = new \kranemora\Html2Text\Html2Text;
$text = $html2Text->convert($html);

Reference URL: https://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail
